Decorators

Ensure that pandas DataFrames fit the expectations of a function.

The two main entry points are the @argument() and @result() decorators, which specify the expected pandas DataFrame or Series inputs and outputs, respectively.

They can check directly against a pandera.DataFrameSchema or use cross-argument checks, for example pc.checks.extends to verify that the output extends an input. See pc.checks for the available checks.
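As a rough illustration of what a cross-argument check involves, the following standard-library sketch binds the call arguments by name and compares one argument against another before the function runs. The decorator name check_same_length is hypothetical, plain lists stand in for DataFrames, and this is not how pandas_contract itself is implemented.

```python
import functools
import inspect
from typing import Any, Callable


def check_same_length(arg: str, other: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Hypothetical decorator: require len(arg) == len(other) at call time."""

    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        sig = inspect.signature(fn)

        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            # Map positional and keyword arguments to parameter names.
            bound = sig.bind(*args, **kwargs)
            bound.apply_defaults()
            left, right = bound.arguments[arg], bound.arguments[other]
            if len(left) != len(right):
                raise ValueError(
                    f"{arg!r} has length {len(left)}, but {other!r} has length {len(right)}"
                )
            return fn(*args, **kwargs)

        return wrapper

    return decorator


@check_same_length("xs", "ys")
def pair_up(xs: list, ys: list) -> list:
    return list(zip(xs, ys))
```

Calling pair_up with two lists of equal length returns the pairs; calling it with lists of different lengths raises ValueError before the function body runs.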

The helper function pc.from_arg can be used to get the value of an argument from a function call. This is useful for checks that depend on another function argument, e.g., to check that a DataFrame contains a column whose name is passed as an argument to the function.

pandas_contract.argument(arg: str, /, *checks_: Check | BaseSchema | None, key: KeyT = <object object>, validate_kwargs: ValidateDictT | None = None) → Callable[[_WrappedT], _WrappedT]

Check the input DataFrame.

Parameters:

arg – The name of the argument to check. This must be the name of one of the arguments of the function to be decorated.
checks – Additional checks or the pandera schema verification to perform on the DataFrame. For checks, see module pandas_contract.checks. For pandera, see the pandera documentation for DataFrameSchema and SeriesSchema.
key – The key of the input to check. See KeyT.
validate_kwargs –
Additional keyword arguments to pass to pandera's validate. Valid keys are:
- head: The number of rows to validate from the head. If None, all rows are used for validation. Used for pandera schema validation.
- tail: The number of rows to validate from the tail. If None, all rows are used for validation. Used for pandera schema validation.
- sample: The number of rows to validate randomly. If None, all rows are used for validation. Used for pandera schema validation.
- random_state: The random state for the random sampling. Used for pandera schema validation.
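To make the effect of these keys concrete, here is a small standard-library sketch of which row positions a subsampled validation would look at. It assumes that the rows selected by head, tail, and sample are combined and de-duplicated; the helper name rows_to_validate is hypothetical, and the exact rows that pandera samples for a given random_state will differ from this stand-in.

```python
import random
from typing import List, Optional


def rows_to_validate(
    n_rows: int,
    head: Optional[int] = None,
    tail: Optional[int] = None,
    sample: Optional[int] = None,
    random_state: Optional[int] = None,
) -> List[int]:
    """Hypothetical helper: row positions covered by a subsampled validation."""
    if head is None and tail is None and sample is None:
        # No subsetting requested: every row is validated.
        return list(range(n_rows))
    picked = set()
    if head is not None:
        picked.update(range(min(head, n_rows)))
    if tail is not None:
        picked.update(range(max(n_rows - tail, 0), n_rows))
    if sample is not None:
        # A fixed random_state makes the random selection reproducible.
        rng = random.Random(random_state)
        picked.update(rng.sample(range(n_rows), min(sample, n_rows)))
    return sorted(picked)
```

With ten rows, head=2 and tail=3 select positions 0, 1, 7, 8, and 9, while passing no keys selects all ten rows.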

Examples

Note that all examples use the following preamble:

>>> import pandas as pd
>>> import pandera.pandas as pa
>>> import pandas_contract as pc

Ensure columns exist in DataFrame

Ensure that the input DataFrame has a column “a” of type int and a column “b” of type float.

>>> @pc.argument(
...     "df",
...     pa.DataFrameSchema(
...         {"a": pa.Column(pa.Int), "b": pa.Column(pa.Float)}
...     ),
... )
... def func(df: pd.DataFrame) -> None:
...     ...

Ensure same index

Ensure that the DataFrame arguments df1 and df2 have the same index by checking argument df1 against argument df2.

>>> @pc.argument("df1", pc.checks.same_index_as("df2"))
... def func(df1: pd.DataFrame, df2: pd.DataFrame) -> None:
...     ...

Ensure same size

Ensure that the DataFrame arguments df1 and df2 have the same size.

>>> @pc.argument("df1", pc.checks.same_length_as("df2"))
... def func(df1: pd.DataFrame, df2: pd.DataFrame) -> None:
...     ...

All together

Ensure that the input DataFrame dfs has a column “a” of type int, the same index as df2, and the same size as df3.

>>> @pc.argument(
...     "dfs",
...     pa.DataFrameSchema({"a": pa.Column(pa.Int)}),
...     pc.checks.same_index_as("df2"),
...     pc.checks.same_length_as("df3"),
... )
... def func(dfs: pd.DataFrame, df2: pd.DataFrame, df3: pd.DataFrame) -> None:
...     ...

Data Series

Instead of a DataFrame, one can also validate a Series. Then the schema must be of type pa.SeriesSchema.

For example, to ensure that the input series is of type int, one can use:

>>> @pc.argument("ds", pa.SeriesSchema(pa.Int))
... def func(ds: pd.Series) -> None:
...     ...

pandas_contract.result(*checks_: Check | BaseSchema | None, key: Any = <object object>, validate_kwargs: ValidateDictT | None = None) → Callable[[_WrappedT], _WrappedT]

Validate a DataFrame result using pandera.

Parameters:

checks –
Additional checks and the pandera schema verification to perform on the DataFrame. For checks, see module pandas_contract.checks.

If a pandera schema is provided, it is used to validate the output. For pandera, see the pandera documentation for DataFrameSchema and SeriesSchema.
key – The key of the output to check. See KeyT.
validate_kwargs –
Additional keyword arguments to pass to pandera's validate. Valid keys are:
- head: The number of rows to validate from the head. If None, all rows are used for validation. Used for pandera schema validation.
- tail: The number of rows to validate from the tail. If None, all rows are used for validation. Used for pandera schema validation.
- sample: The number of rows to validate randomly. If None, all rows are used for validation. Used for pandera schema validation.
- random_state: The random state for the random sampling. Used for pandera schema validation.

Examples

Note that all examples use the following preamble:

>>> import pandas as pd
>>> import pandera.pandas as pa
>>> import pandas_contract as pc

Output column exists

Ensure that the output DataFrame has a column “a” of type int.

>>> @pc.result(pa.DataFrameSchema({"a": pa.Column(pa.Int)}))
... def func() -> pd.DataFrame:
...     return pd.DataFrame({"a": [1, 2]})

Ensure that the output DataFrame has a column “a” of type int and a column “b” of type float.

>>> @pc.result(
...     pa.DataFrameSchema(
...         {"a": pa.Column(pa.Int), "b": pa.Column(pa.Float)}
...     )
... )
... def func() -> pd.DataFrame:
...     return pd.DataFrame({"a": [1, 2], "b": [1.0, 2.0]})

Ensure that the output DataFrame has the same index as df.

>>> @pc.result(
...     pa.DataFrameSchema({"a": pa.Column(pa.Int)}),
...     pc.checks.same_index_as("df"),
... )
... def func(df: pd.DataFrame) -> pd.DataFrame:
...     return df

Ensure that the output DataFrame has the same size as df.

>>> @pc.result(
...     pa.DataFrameSchema({"a": pa.Column(pa.Int)}),
...     pc.checks.same_length_as("df"),
... )
... def func(df: pd.DataFrame) -> pd.DataFrame:
...     return df

Ensure that the output DataFrame has the same index as df1 and the same size as df2.

>>> @pc.result(
...     pa.DataFrameSchema({"a": pa.Column(pa.Int)}),
...     pc.checks.same_index_as("df1"),
...     pc.checks.same_length_as("df2"),
... )
... def func(df1: pd.DataFrame, df2: pd.DataFrame) -> pd.DataFrame:
...     return df1

Ensure that the output extends the input schema.

>>> @pc.result(
...     pc.checks.extends("df", modified=pa.DataFrameSchema({"a": pa.Column(int)}))
... )
... def func(df: pd.DataFrame) -> pd.DataFrame:
...     return df.assign(a=1)

Note that any columns listed in the result schema can be modified. To specify a column that is returned but cannot be modified, use the schema argument of the input argument.
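The rule in this note can be pictured with a small stand-in, assuming the check only requires that columns not listed as modified come back unchanged. Plain dicts of lists stand in for DataFrames, unchanged_outside is a hypothetical name, and the real pc.checks.extends may verify more than this sketch does.

```python
from typing import Mapping, Sequence


def unchanged_outside(
    before: Mapping[str, Sequence],
    after: Mapping[str, Sequence],
    modified: set,
) -> bool:
    """Hypothetical helper: True if every column of `before` that is not
    listed in `modified` appears in `after` with identical values."""
    return all(
        col in after and list(after[col]) == list(values)
        for col, values in before.items()
        if col not in modified
    )


before = {"in": [1, 2], "a": [0, 0]}
# Only "a" changed and "out" was added: acceptable when "a" is listed.
extended = {"in": [1, 2], "a": [1, 1], "out": [1, 1]}
# "in" changed but is not listed as modifiable.
tampered = {"in": [9, 9], "a": [0, 0]}
```

Here unchanged_outside(before, extended, {"a"}) holds, while the same call with tampered, or with an empty set of modifiable columns, does not.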

Ensure that the output extends the input schema and has a column “out” of type int. The call fails in any of the following three cases:

- df does not have a column “in” of type int.
- The result does not have a column “out” of type int.
- The data in column “a” was changed.

>>> @pc.argument("df", pa.DataFrameSchema({"in": pa.Column(pa.Int)}))
... @pc.result(
...     pa.DataFrameSchema({"out": pa.Column(pa.Int)}),
...     pc.checks.extends("df", modified=pa.DataFrameSchema({"a": pa.Column(int)})),
... )
... def func(df: pd.DataFrame) -> pd.DataFrame:
...     return df.assign(out=1)

pandas_contract.from_arg(arg: str) → Callable[[Callable[[...], Any], tuple[Any], dict[str, Any]], Any]

Get the named argument from the function call via a call-back.

Returns a call-back function that can be used to get the named argument from the function call. In combination with pandas_contract's integration of pandera, it can be used to specify required columns whose names come from a function argument.

It will inspect all arguments provided to the function as well as the default values.
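One way to picture such a call-back is the following standard-library sketch, which resolves an argument by name from the positional arguments, the keyword arguments, and the defaults. The name get_from_arg is a hypothetical stand-in, and the sketch assumes the call-back receives the function, the tuple of positional arguments, and the dict of keyword arguments; it is not the library's actual implementation.

```python
import inspect
from typing import Any, Callable, Dict, Tuple


def get_from_arg(arg: str) -> Callable[[Callable[..., Any], Tuple[Any, ...], Dict[str, Any]], Any]:
    """Hypothetical stand-in: build a call-back that looks up `arg` in a call."""

    def callback(fn: Callable[..., Any], args: Tuple[Any, ...], kwargs: Dict[str, Any]) -> Any:
        bound = inspect.signature(fn).bind(*args, **kwargs)
        # Also fill in parameters the caller left at their default value.
        bound.apply_defaults()
        return bound.arguments[arg]

    return callback


def select(df: dict, col: str = "a") -> Any:
    return df[col]


get_col = get_from_arg("col")
```

The call-back returns "a" when col is left at its default, and the supplied value when col is passed positionally or by keyword.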

Parameters:

arg – Name of the function argument. The value of the argument must be either a valid column name (i.e. a Hashable) or a list of hashables. If it is a list, multiple column checks are created, one for each item.

Returns:

A function for pandas_contract._private_checks.SchemaCheck that extracts the values from the argument at runtime. Its interface is the callable type shown in the signature above.

Example

>>> import pandas as pd
>>> import pandas_contract as pc
>>> import pandera.pandas as pa

>>> @pc.argument("df", pa.DataFrameSchema({pc.from_arg("col"): pa.Column()}))
... @pc.result(pa.DataFrameSchema({pc.from_arg("col"): pa.Column(str)}))
... def col_to_string(df: pd.DataFrame, col: str) -> pd.DataFrame:
...     return df.assign(**{col: df[col].astype(str)})

Multiple columns in function argument

The decorator also supports multiple columns from the function argument.

>>> @pc.argument("df", pa.DataFrameSchema({pc.from_arg("cols"): pa.Column()}))
... @pc.result(pa.DataFrameSchema({pc.from_arg("cols"): pa.Column(str)}))
... def cols_to_string(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
...     return df.assign(**{col: df[col].astype(str) for col in cols})

class pandas_contract._decorator.KeyT(*args, **kwargs)

KeyType protocol that defines a lookup key for an argument or the result.

A key can be used to get a DataFrame or Series from within a more complex argument or return value.

Its value is either any hashable or a function that takes a single argument as an input and returns a DataFrame/Series.

Note that None is a valid key in a dictionary and hence is not the default value. By default, the value is used as-is.

>>> import pandas as pd
>>> import pandas_contract as pc
>>> import pandera.pandas as pa
>>> @pc.result(pa.SeriesSchema(int), key=1)
... def f1():
...     return "res", pd.Series([1, 2, 3])

The key can also be an arbitrary function that takes the input arg and has to return the DataFrame/Series to check.

This can be used to create a Series, which is then checkable:

>>> @pc.result(pa.SeriesSchema(int), key=pd.Series)
... def f1():
...    return [1, 2, 3]
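The lookup rules described for KeyT can be sketched with the standard library alone. The names resolve_key and _UNSET are hypothetical, and the sketch assumes a sentinel object marks "no key given" so that None remains usable as a real key; it only illustrates the described behaviour and is not the library's code.

```python
from typing import Any

_UNSET = object()  # sentinel: distinguishes "no key given" from key=None


def resolve_key(value: Any, key: Any = _UNSET) -> Any:
    """Hypothetical helper: pick the object to validate out of `value`."""
    if key is _UNSET:
        return value  # default: use the value as-is
    if callable(key):
        return key(value)  # a callable key is applied to the value
    return value[key]  # any other key is used for item lookup
```

For example, resolve_key(("res", [1, 2, 3]), 1) returns the list, resolve_key({None: "x"}, None) returns "x", and resolve_key([1, 2], tuple) returns (1, 2). Because callables are always applied rather than looked up, a mapping whose keys are functions needs a wrapper such as lambda res: res[fn].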

Note that if the DataFrame/Series is wrapped in a mapping whose keys are callables, the key must be wrapped in another function, because a callable key is itself called with the value instead of being used for the lookup:

>>> def fn_as_key():
...    ...

>>> # Get the DataFrame from the output item `fn_as_key`.
>>> # @pc.result(pa.DataFrameSchema({"name": pa.Column(str)}), key=fn_as_key)  - fails
>>> @pc.result(
...     pa.DataFrameSchema({"name": pa.Column(str)}),
...     key=lambda res: res[fn_as_key],
... )
... def return_function_to_df():
...     # fn_as_key is the dictionary key that holds the DataFrame to be tested.
...     return {
...         fn_as_key: pd.DataFrame([{"name": "f1"}])
...     }