Decorators
Ensure that pandas DataFrames fit the expectations of a function.
The two main entry points are the @argument() and the
@result() decorators to specify pandas DataFrame or Series inputs
and outputs, respectively.
They can validate directly against a
pandera.DataFrameSchema
or apply cross-argument checks, e.g., that the output extends an input via
pc.checks.extends. See
pc.checks for the available checks.
The helper function pc.from_arg can be used to
get the argument value from a function call. This is useful for checks that depend
on another function argument, e.g., to check that a DataFrame contains a column, which
is set as an argument to the function.
- pandas_contract.argument(arg: str, /, *checks_: Check | BaseSchema | None, key: KeyT = <object object>, validate_kwargs: ValidateDictT | None = None) -> Callable[[_WrappedT], _WrappedT]
Check the input DataFrame.
- Parameters:
arg – The name of the argument to check. This must be the name of one of the arguments of the function to be decorated.
checks – Additional checks or the pandera schema verification to perform on the DataFrame. For checks, see module pandas_contract.checks. For pandera, see the pandera documentation for DataFrameSchema and SeriesSchema.
key – The key of the input to check. See KeyT.
validate_kwargs – Additional keywords to provide to pandera validate. Valid keys are
head: The number of rows to validate from the head. If None, all rows are used for validation. Used for pandera schema validation.
tail: The number of rows to validate from the tail. If None, all rows are used for validation. Used for pandera schema validation.
sample: The number of rows to validate randomly. If None, all rows are used for validation. Used for pandera schema validation.
random_state: The random state for the random sampling. Used for pandera schema validation.
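The effect of head, tail, sample and random_state can be pictured with a small sketch. It is plain Python rather than pandera's code, and it assumes that the selected rows are combined as a union and that all rows are used when every option is None. The helper name subsample_rows is hypothetical.

```python
import random
from typing import Optional, Sequence


def subsample_rows(
    rows: Sequence[object],
    head: Optional[int] = None,
    tail: Optional[int] = None,
    sample: Optional[int] = None,
    random_state: Optional[int] = None,
) -> list[int]:
    """Hypothetical helper: return the row positions a validator would inspect."""
    n = len(rows)
    if head is None and tail is None and sample is None:
        # No restriction given: every row is validated.
        return list(range(n))
    picked: set[int] = set()
    if head is not None:
        picked.update(range(min(head, n)))
    if tail is not None:
        picked.update(range(max(n - tail, 0), n))
    if sample is not None:
        # A fixed random_state makes the random selection reproducible.
        rng = random.Random(random_state)
        picked.update(rng.sample(range(n), min(sample, n)))
    # Assumption: overlapping selections are merged, not validated twice.
    return sorted(picked)


rows = list("abcdefghij")
first_and_last = subsample_rows(rows, head=2, tail=2)
```

Restricting validation this way trades completeness for speed: rows outside the selection are never looked at, so a violation there goes unnoticed.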
Examples
Note that all examples use the following preamble:
>>> import pandas as pd
>>> import pandera.pandas as pa
>>> import pandas_contract as pc
Ensure columns exist in DataFrame
Ensure that the input DataFrame has a column “a” of type int and a column “b” of type float.
>>> @pc.argument(
...     "df",
...     pa.DataFrameSchema(
...         {"a": pa.Column(pa.Int), "b": pa.Column(pa.Float)}
...     ),
... )
... def func(df: pd.DataFrame) -> None:
...     ...
Ensure same index
Ensure that the DataFrame arguments df1 and df2 have the same index by checking argument df1 against argument df2.
>>> @pc.argument("df1", pc.checks.same_index_as("df2"))
... def func(df1: pd.DataFrame, df2: pd.DataFrame) -> None:
...     ...
Ensure same size
Ensure that the DataFrame arguments df1 and df2 have the same size.
>>> @pc.argument("df1", pc.checks.same_length_as("df2"))
... def func(df1: pd.DataFrame, df2: pd.DataFrame) -> None:
...     ...
All-together
Ensure that the input DataFrame dfs has a column “a” of type int, the same index as df2, and the same size as df3.
>>> @pc.argument(
...     "dfs",
...     pa.DataFrameSchema({"a": pa.Column(pa.Int)}),
...     pc.checks.same_index_as("df2"),
...     pc.checks.same_length_as("df3"),
... )
... def func(dfs: pd.DataFrame, df2: pd.DataFrame, df3: pd.DataFrame) -> None:
...     ...
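The cross-argument checks used in these examples compare one argument of a call against another. The sketch below shows one way such a check could be wired up with the standard library only. It is an illustration, not the pandas_contract implementation: the decorator name same_length_as_sketch is hypothetical, and plain lists stand in for DataFrames.

```python
import functools
import inspect
from typing import Any, Callable


def same_length_as_sketch(arg: str, other: str) -> Callable:
    """Hypothetical decorator: require len(arg) == len(other) before the call."""

    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        sig = inspect.signature(fn)

        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            # Map positional and keyword arguments onto parameter names.
            bound = sig.bind(*args, **kwargs)
            bound.apply_defaults()
            left, right = bound.arguments[arg], bound.arguments[other]
            if len(left) != len(right):
                raise ValueError(
                    f"{arg!r} has length {len(left)}, "
                    f"but {other!r} has length {len(right)}"
                )
            return fn(*args, **kwargs)

        return wrapper

    return decorator


@same_length_as_sketch("df1", "df2")
def combine(df1: list, df2: list) -> list:
    return list(zip(df1, df2))
```

Binding the call to the signature first means the check works regardless of whether the caller passes df1 and df2 positionally or by keyword.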
Data Series
Instead of a DataFrame, one can also validate a Series. Then the schema must be of type pa.SeriesSchema.
For example, to ensure that the input series is of type int, one can use:
>>> @pc.argument("ds", pa.SeriesSchema(pa.Int))
... def func(ds: pd.Series) -> None:
...     ...
- pandas_contract.result(*checks_: Check | BaseSchema | None, key: Any = <object object>, validate_kwargs: ValidateDictT | None = None) -> Callable[[_WrappedT], _WrappedT]
Validate a DataFrame result using pandera.
- Parameters:
checks – Additional checks and the pandera schema verification to perform on the DataFrame. For checks, see module pandas_contract.checks. If a pandera schema is provided, it is used to validate the output. For pandera, see the pandera documentation for DataFrameSchema and SeriesSchema.
key – The key of the result to check. See KeyT.
validate_kwargs – Additional keywords to provide to pandera validate. Valid keys are
head: The number of rows to validate from the head. If None, all rows are used for validation. Used for pandera schema validation.
tail: The number of rows to validate from the tail. If None, all rows are used for validation. Used for pandera schema validation.
sample: The number of rows to validate randomly. If None, all rows are used for validation. Used for pandera schema validation.
random_state: The random state for the random sampling. Used for pandera schema validation.
Examples
Note that all examples use the following preamble:
>>> import pandas as pd
>>> import pandera.pandas as pa
>>> import pandas_contract as pc
Output column exists
Ensure that the output dataframe has a column “a” of type int.
>>> @pc.result(pa.DataFrameSchema({"a": pa.Column(pa.Int)}))
... def func() -> pd.DataFrame:
...     return pd.DataFrame({"a": [1, 2]})
Ensure that the output DataFrame has a column “a” of type int and a column “b” of type float.
>>> @pc.result(
...     pa.DataFrameSchema(
...         {"a": pa.Column(pa.Int), "b": pa.Column(pa.Float)}
...     )
... )
... def func() -> pd.DataFrame:
...     return pd.DataFrame({"a": [1, 2], "b": [1.0, 2.0]})
Ensure that the output dataframe has the same index as df.
>>> @pc.result(
...     pa.DataFrameSchema({"a": pa.Column(pa.Int)}),
...     pc.checks.same_index_as("df"),
... )
... def func(df: pd.DataFrame) -> pd.DataFrame:
...     return df
Ensure that the output dataframe has the same size as df.
>>> @pc.result(
...     pa.DataFrameSchema({"a": pa.Column(pa.Int)}),
...     pc.checks.same_length_as("df"),
... )
... def func(df: pd.DataFrame) -> pd.DataFrame:
...     return df
Ensure that the output DataFrame has the same index as df1 and the same size as df2.
>>> @pc.result(
...     pa.DataFrameSchema({"a": pa.Column(pa.Int)}),
...     pc.checks.same_index_as("df1"),
...     pc.checks.same_length_as("df2"),
... )
... def func(df1: pd.DataFrame, df2: pd.DataFrame) -> pd.DataFrame:
...     return df1
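Result checks that refer to an argument need both the return value and the arguments of the same call. A minimal sketch of that idea follows, using only the standard library and lists in place of DataFrames. The names result_sketch and same_length_as are hypothetical stand-ins, not the pandas_contract API, and raising ValueError on failure is an assumption made for the example.

```python
import functools
import inspect
from typing import Any, Callable, Mapping, Optional

# A check returns an error message, or None if the result is acceptable.
CheckFn = Callable[[Any, Mapping[str, Any]], Optional[str]]


def result_sketch(check: CheckFn) -> Callable:
    """Hypothetical decorator: run check(result, arguments) after the call."""

    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        sig = inspect.signature(fn)

        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            bound = sig.bind(*args, **kwargs)
            bound.apply_defaults()
            result = fn(*args, **kwargs)
            message = check(result, bound.arguments)
            if message is not None:
                raise ValueError(f"{fn.__name__}: {message}")
            return result

        return wrapper

    return decorator


def same_length_as(arg: str) -> CheckFn:
    """Hypothetical check: the result must be as long as the named argument."""

    def check(result: Any, arguments: Mapping[str, Any]) -> Optional[str]:
        if len(result) != len(arguments[arg]):
            return f"result length {len(result)} differs from {arg!r}"
        return None

    return check


@result_sketch(same_length_as("df"))
def keep_rows(df: list) -> list:
    return [row * 2 for row in df]


@result_sketch(same_length_as("df"))
def drop_first_row(df: list) -> list:
    return df[1:]
```

Here keep_rows passes its check, while drop_first_row raises because the output is shorter than the input.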
Ensure that the output extends the input schema.
>>> @pc.result(
...     pc.checks.extends("df", modified=pa.DataFrameSchema({"b": pa.Column(int)}))
... )
... def func(df: pd.DataFrame) -> pd.DataFrame:
...     return df.assign(a=1)
Note that any column listed in the result schema can be modified. To specify a column that is returned but cannot be modified, use the schema argument of the input argument.
Ensure that the output extends the input schema and has a column “a” of type int. The following fails in any of these three cases:
- df does not have a column “in” of type int.
- The result does not have a column “out” of type int.
- The data of column “a” was changed.
>>> @pc.argument("df", pa.DataFrameSchema({"in": pa.Column(pa.Int)}))
... @pc.result(
...     pa.DataFrameSchema({"out": pa.Column(pa.Int)}),
...     pc.checks.extends("df", modified=pa.DataFrameSchema({"a": pa.Column(int)})),
... )
... def func(df: pd.DataFrame) -> pd.DataFrame:
...     return df.assign(out=1)
- pandas_contract.from_arg(arg: str) -> Callable[[Callable[[...], Any], tuple[Any], dict[str, Any]], Any]
Get the named argument from the function call via a call-back.
Returns a call-back function that can be used to get the named argument from the function call. In combination with the pandas_contract integration of pandera, it can be used to specify required columns that come from a function argument.
It will inspect all arguments provided to the function as well as the default values.
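Looking up an argument by name, including arguments left at their default, can be done with inspect.signature. The following is a minimal sketch of such a call-back with the same call shape as the signature above (function, positional arguments, keyword arguments). The name from_arg_sketch is hypothetical, and this is not claimed to be how pc.from_arg is implemented.

```python
import inspect
from typing import Any, Callable


def from_arg_sketch(arg: str) -> Callable[[Callable[..., Any], tuple, dict], Any]:
    """Hypothetical stand-in for pc.from_arg: defer the lookup until call time."""

    def resolve(fn: Callable[..., Any], args: tuple, kwargs: dict) -> Any:
        bound = inspect.signature(fn).bind(*args, **kwargs)
        # Fill in parameters the caller did not pass explicitly.
        bound.apply_defaults()
        return bound.arguments[arg]

    return resolve


def to_string(df: dict, col: str = "name") -> dict:
    return df


get_col = from_arg_sketch("col")

default_col = get_col(to_string, ({"x": 1},), {})
positional_col = get_col(to_string, ({"x": 1}, "x"), {})
keyword_col = get_col(to_string, ({"x": 1},), {"col": "y"})
```

The three calls resolve col from the default value, from a positional argument, and from a keyword argument, respectively.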
- Parameters:
arg – Name of the function argument. The value of the argument must be either a valid column (i.e. a Hashable) or a list of hashables. If it is a list, multiple column checks are created, one for each item.
- Returns:
A function for pandas_contract._private_checks.SchemaCheck that extracts the values from the argument at runtime. It takes the decorated function, its positional arguments, and its keyword arguments, matching the callable type in the signature above.
Example
>>> import pandas as pd
>>> import pandas_contract as pc
>>> import pandera.pandas as pa
>>> @pc.argument("df", pa.DataFrameSchema({pc.from_arg("col"): pa.Column()}))
... @pc.result(pa.DataFrameSchema({pc.from_arg("col"): pa.Column(str)}))
... def col_to_string(df: pd.DataFrame, col: str) -> pd.DataFrame:
...     return df.assign(**{col: df[col].astype(str)})
Multiple columns in function argument
The decorator also supports multiple columns from the function argument.
>>> @pc.argument("df", pa.DataFrameSchema({pc.from_arg("cols"): pa.Column()}))
... @pc.result(pa.DataFrameSchema({pc.from_arg("cols"): pa.Column(str)}))
... def cols_to_string(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
...     return df.assign(**{col: df[col].astype(str) for col in cols})
- class pandas_contract._decorator.KeyT(*args, **kwargs)
Key type protocol: defines a lookup key for an argument or the result.
A key can be used to get a DataFrame or Series from within a more complex argument or return value.
Its value is either any hashable or a function that takes a single argument as an input and returns a DataFrame/Series.
Note that None is a valid key in a dictionary and hence is not the default value. By default, the value is used as-is.
>>> import pandas as pd
>>> import pandas_contract as pc
>>> import pandera.pandas as pa
>>> @pc.result(pa.SeriesSchema(int), key=1)
... def f1():
...     return "res", pd.Series([1, 2, 3])
The key can also be an arbitrary function that takes the input arg and has to return the DataFrame/Series to check.
This can be used to create a Series, which is then checkable:
>>> @pc.result(pa.SeriesSchema(int), key=pd.Series)
... def f1():
...     return [1, 2, 3]
Note that if the DataFrame/Series is wrapped in a mapping whose keys are callables, the key must be wrapped in another function:
>>> def fn_as_key():
...     ...
>>> # Get the DataFrame from the output item `fn_as_key`.
>>> # Passing key=fn_as_key directly would fail, because a callable key is
>>> # called on the result instead of being used as a mapping key.
>>> @pc.result(
...     pa.DataFrameSchema({"name": pa.Column(str)}),
...     key=lambda res: res[fn_as_key],
... )
... def return_function_to_df():
...     # fn_as_key is a key to a dictionary holding the DataFrame to be tested.
...     return {
...         fn_as_key: pd.DataFrame([{"name": "f1"}])
...     }
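The lookup rules described for KeyT can be summarised in a few lines. The sketch below is an illustration under stated assumptions, not the library's code: resolve_key and _UNSET are hypothetical names, and tuples, lists and strings stand in for DataFrames and Series.

```python
from typing import Any

# Sentinel default, so that None remains usable as a real mapping key.
_UNSET = object()


def resolve_key(value: Any, key: Any = _UNSET) -> Any:
    """Hypothetical helper: pick the object to validate out of `value`."""
    if key is _UNSET:
        # No key given: the value is used as-is.
        return value
    if callable(key):
        # Callable keys are applied to the value.
        return key(value)
    # Any other hashable is used for item lookup.
    return value[key]


def fn_as_key() -> None:
    return None


by_position = resolve_key(("res", [1, 2, 3]), key=1)
by_function = resolve_key([1, 2, 3], key=tuple)
by_none = resolve_key({None: "frame"}, key=None)
wrapped = resolve_key({fn_as_key: "frame"}, key=lambda res: res[fn_as_key])
```

With these rules, passing fn_as_key itself as the key would call fn_as_key(value) rather than look it up, which is why the wrapping lambda is needed when mapping keys are callables.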