Typedframe

Typed wrappers over pandas DataFrames with schema validation.

TypedDataFrame is a lightweight wrapper over pandas DataFrame that provides runtime schema validation and can be used to establish strong data contracts between interfaces in your Python code.

The goal of the library is to reveal and make explicit all unclear or forgotten assumptions about your DataFrame.

Quickstart

Install typedframe library:

pip install typedframe

Assume an overly simplified preprocessing code like this:

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    c1_min, c1_max = df['col1'].min(), df['col1'].max()
    df['col1'] = 0 if c1_min == c1_max else (df['col1'] - c1_min) / (c1_max - c1_min)
    df['month'] = df['date'].dt.month
    df['comment'] = df['comment'].str.lower()
    return df

To add typedframe schema support for this transformation we will define two schema classes - for the input and for the output:

import numpy as np
from typedframe import TypedDataFrame, DATE_TIME_DTYPE

class MyRawData(TypedDataFrame):
    schema = {
        'col1': np.float64,
        'date': DATE_TIME_DTYPE,
        'comment': str,
    }


class PreprocessedData(MyRawData):
    schema = {
        'month': np.int8
    }

Then let’s modify the preprocess function to take a typed wrapper MyRawData as input and return PreprocessedData:

def preprocess(data: MyRawData) -> PreprocessedData:
    df = data.df.copy()
    c1_min, c1_max = df['col1'].min(), df['col1'].max()
    df['col1'] = 0 if c1_min == c1_max else (df['col1'] - c1_min) / (c1_max - c1_min)
    df['month'] = df['date'].dt.month
    df['comment'] = df['comment'].str.lower()
    return PreprocessedData.convert(df)

As you can see the actual DataFrame can be accessed via the .df attribute of the Typed DataFrame.

Now clients of the preprocess function can easily check what are the inputs and outputs without the need to look at its internals. And if there are some unforseen changes in the data an exception will be thrown before the actual function will be invoked.

Let’s check:

import pandas as pd

df = pd.DataFrame({
  'col1': [0.1, 0.2],
  'date': ['2021-01-01', '2022-01-01'],
  'comment': ['foo', 'bar']
})
df.date = pd.to_datetime(df.date)

bad_df = pd.DataFrame({
  'col1': [1, 2],
  'comment': ['foo', 'bar']
})

df2 = preprocess(MyRawData(df))
df3 = preprocess(MyRawData(bad_df))

The first call was successful. But when we’ve tried to pass a wrong dataframe as input we’ve got the following error:

AssertionError: Dataframe doesn't match schema
Actual: {'col1': dtype('int64'), 'comment': dtype('O')}
Expected: {'col1': <class 'numpy.float64'>, 'date': dtype('<M8[ns]'), 'comment': <class 'object'>}
Difference: {('col1', <class 'numpy.float64'>), ('date', dtype('<M8[ns]'))}

Problems with pandas DataFrame

Let’s return the initial code example above. What’s the problem here?

def preprocess(df: pd.DataFrame) -> pd.DataFrame:

Even when we have added type hints to our function, the user doesn’t really know how he can use it. He must dig inside the code of the function to find out things like expected columns and their types. This violates on of the core software development principles - the encapsulation.

Pandas DataFrame is an open data type. It introduces a lot of implicit assumptions about the data. Let’s explore some examples where one can easily overlook these implicit assumptions:

Required columns and data types:

df.grouby('state')['income'].mean()

The dataframe is expected to have state and income columns. income column must have a numeric type.

Index name and type

df.reset_index(inplace=True)
x = df['my_index']

It is expected that a dataframe has a named index with a name my_index.

Categorical columns

df3 = pd.merge(df1, df2, on='categorical_col')

The result above will differ based on whether a categorical_col in df1 and df2 has exactly the same set of categories or not.

All these scenarios above can lead to a variety of subtle bugs in our pipeline.

The concept of Typed DataFrame

A Typed DataFrame is a minimalistic wrapper on top of your pandas DataFrame. You create it by creating a subclass of a TypedDataFrame and defining schema static variable. Then you can wrap your DataFrame in it by passing it to your Typed DataFrame constructor. The constructor will do a runtime schema validation and the original dataframe can be accessed through df attribute of a wrapper.

This wrapper serves 2 purposes:

Formal explicit documentation about dataframe assumptions. You can use your Typed DataFrame schema definition as a form of documentation to communicate your data interfaces to others. This works very well especially in combination with Python type hints.
Runtime schema validation. In case of any data contracts violation you’ll get an exception explaining the exact reason. If you guard your pipeline with such Typed DataFrames you’ll be able to catch errors early - closer to the root causes.

Features

Required Schema

You can define the required schema by passing a dictionary to a static variable schema of a TypeFrame subclass. The dictionary defines the mapping from a column name to a dtype:

class MyTable(TypedDataFrame):
   schema = {
      "col1": str,
      "col2": np.int32,
      "col3": ('foo', 'bar')
   }

Schema Inheritance

You can inherit one Typed DataFrame from another one.

The semantics of the inheritance relation is the same as with class methods and attributes in classic OOP. I.e. if Typed DataFrame A is a subclass of a Typed DataFrame B, all the schema requirements for B must also be held for A. In case of any conflicts, the schema defined in A takes a precedence.

class MyDataFrame(TypedDataFrame):
    schema = {
        'int_field': np.int16,
        'float_field': np.float64,
        'bool_field': bool,
        'str_field': str,
        'obj_field': object
    }


class InheritedDataFrame(MyDataFrame):
    schema = {
        'new_field': np.int64
    }

Multiple Inheritance

Multiple Inheritance is allowed. It has a “union” semantics.

class Root(TypedDataFrame):

    schema = {
        'root': bool
    }


class Left(Root):
    schema = {
        'left': bool
    }


class Right(Root):
    schema = {
        'root': object,
        'right': bool
    }


class Down(Left, Right):
    pass

Index Schema

You can specify schema for the index of the DataFrame. It’s defined as a tuple of a dtype and a name which you assign to an index_schema static variable:

class IndexDataFrame(TypedDataFrame):
    schema = {
        'foo': bool
    }

    index_schema = ('bar', np.int32)

Optional Schema

You can specify optional columns in a schema definition. Optional column types will be checked only if present in a DataFrame. In case some optional column (or all of them) is missing no validation error will be raised. Besides that all columns from optional schema that are missing in a dataframe will be added with NaN values.

class DataFrameWithOptional(TypedDataFrame):
    schema = {
        'required': bool
    }
    optional = {
        'optional': bool
    }

Convert Method

TypedDataFrame provides a convenient convert classmethod that tries to convert a given DataFrame to be compliant with a schema.

class IndexDataFrame(TypedDataFrame):
    schema = {
        'foo': bool
    }

    index_schema = ('bar', DATE_TIME_DTYPE)

df = pd.DataFrame({'foo': [True, False]},
                  index=pd.Series(['2021-06-03', '2021-05-31']))
data = IndexDataFrame.convert(df)

Supported types

Integers

np.int16, np.int32, np.int64, etc.

Floats

np.float16, np.float32, np.float64, etc.

Boolean

bool

Python objects

str, dict, list, object

WARNING: no actual check is performed for Python objects. They are all considered to be of the same type object.

Categorical

Categorical dtype is specified as a tuple of categories. To avoid common categorical pitfalls categorical types are required to have an exact schema with all categories enumerated in the exact order.

class MyTable(TypedDataFrame):
   schema = {
      "col": ('foo', 'bar')
   }

df = pd.DataFrame({"col": ['foo', 'foo', 'bar']})
df.col = pd.Categorical(df.col, categories=('foo', 'bar'), ordered=True)
data = MyTable(df)

DateTime

np.dtype('datetime64[ns]')

typedframe library provides an alias for that also: DATE_TIME_DTYPE

UTC DateTime

pd.DatetimeTZDtype('ns', pytz.UTC)

typedframe library provides an alias for that also: UTC_DATE_TIME_DTYPE

Best practices to use Typed DataFrame

What are the best places to use Typed DataFrame wrappers in your codebase?

Our experience with typedframe library in a number of projects has shown the following scenarios where it’s use was justified the most:

Team Borders

Typed DataFrame helps to establish data contracts between teams. It also helps to spot the errors caused by miscommunication or inconsistent system evolution early. Whenever some dataset is being passed between teams it makes sense to define a Typed DataFrame class with its specification.

Public Functions and Methods

Typed DataFrame work especially well in combination with Python type hints. So a good place to use it is when you have a public function or method that takes as an argument / returns some pandas DataFrame.

Sources and Sinks of Data Pipelines

It is a good practice to provide schema definitions and runtime validation at the beginning and at the end of data pipelines. I.e. right after you read from the external storage and before you write to it. This is where Typed DataFrames can also be used.

Similar Projects

Great Expectations. It’s a much more feature-rich library which allows data teams to do a lot of assertions about the data. typedframe is a more light-weight library which can be considered as a thin extension layer on top of pandas DataFrame.
Marshmallow. A library for Python objects serialization and deserialization with schema validation. It’s not integrated with pandas or numpy and focuses only on Python classes and builtin objects.