PyVeritas: The Declarative Data Contract Engine for Python.

Timothy McCallum
8 min read · Feb 12, 2025

Ad-hoc data validation clutters your code and leaves you vulnerable to errors. Traditional validation, often implemented as a series of verbose if/else statements, quickly becomes unmaintainable and difficult to test. As a result, even well-intentioned developers working on data-driven applications end up wrestling with denormalized “raw” data that breaks systems and leads to costly outages.

Photo by Francisco De Legarreta C. on Unsplash

PyVeritas offers a revolutionary approach: declarative data contracts that empower you to treat data validation as a first-class citizen in your development process.

What are data contracts?

Data contracts are clear, reusable, and testable formal agreements about the structure and content of your data.
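
To make the idea concrete, here is a minimal sketch in plain Python. It is not PyVeritas's API: the names `USER_CONTRACT` and `check` are hypothetical and chosen only for illustration. The point is that the contract is written once, as data, and can then be reused and tested anywhere.

```python
import typing as t

# Hypothetical contract: each field name maps to a predicate it must satisfy.
USER_CONTRACT: t.Dict[str, t.Callable[[t.Any], bool]] = {
    "name": lambda v: isinstance(v, str) and len(v) > 0,
    "email": lambda v: isinstance(v, str) and "@" in v,
    "age": lambda v: isinstance(v, int) and 0 <= v <= 150,
}


def check(data: t.Dict[str, t.Any]) -> t.List[str]:
    """Return the names of fields that are missing or break the contract."""
    return [
        field
        for field, predicate in USER_CONTRACT.items()
        if field not in data or not predicate(data[field])
    ]


print(check({"name": "John", "email": "test@example.com", "age": 30}))  # []
print(check({"name": "", "email": "no-at-sign"}))  # ['name', 'email', 'age']
```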

What is PyVeritas?

Think of PyVeritas as a library of already-defined data contracts to improve interactions between:

  • different parts of your application,
  • your application and external services,
  • different teams using your application, and
  • different tests within your application.

PyVeritas can help you catch errors early, improve data quality, and build more robust applications.

Why Python-specific?

Photo by Rubaitul Azad on Unsplash

Python is a versatile, high-level programming language favoured by developers for its simplicity and extensive libraries. It is ideal for tasks ranging from scripting and automation to building full-scale applications, and it integrates with enterprise business intelligence tools like Microsoft Power BI. Python is also widely regarded as the primary language for artificial intelligence (AI) due to its ease of use, its extensive ecosystem of libraries such as TensorFlow, PyTorch, NumPy, and scikit-learn, and its strong community support. Python is big enough, and used widely enough, to justify a reusable and testable data validation library like PyVeritas.

An example

Let’s create a new directory, so we can learn how to use PyVeritas:

cd ~
mkdir my_pyveritas_project
cd my_pyveritas_project

Create a virtual environment:

python3 -m venv venv

Activate the virtual environment:

source venv/bin/activate 

(I am on macOS/Linux. If you are on Windows, use venv\Scripts\activate.bat to activate your virtual environment instead.)

Install PyVeritas:

pip3 install PyVeritas

Ok, so let’s imagine that we are writing a section of our application that deals with a user. Create a new file called validate_user.py and paste the following code into that file:


from pyveritas.contracts import UserContract
from pyveritas.validator import Validator

user_contract = UserContract()
validator = Validator(user_contract)

user_data = {"name": "John", "email": "test@example.com", "age": 30}

if validator.is_valid(user_data):
    print("User data is valid!")
else:
    print("User data is invalid:")
    errors = validator.validate(user_data)
    for error in errors:
        print(f"- {error}")

We can run this file, using the following command:

python3 validate_user.py

The above command will produce the following successful result:

User data is valid!

How PyVeritas works

With that basic example out of the way, let’s take a good look under the hood to see what is actually going on, and how you can contribute/use PyVeritas in the real world.

PyVeritas consists of several key modules:

  • rules.py: Defines the base Rule class and concrete rule implementations (e.g., StringLengthRule, NumberRangeRule). These rules encapsulate the actual validation logic.
  • contracts.py: Defines the DataContract class, which is responsible for holding a collection of Rule objects and applying them to a given dataset.
  • validator.py: Defines the Validator class, which takes a DataContract and provides a simple interface for validating data.
  • runner.py: Defines the TestRunner class, which loads the contracts named in a test suite, runs them against test data, and automatically asserts the results.

The rules.py module is the heart of PyVeritas. It defines the Rule class, which is the base class for all validation rules:

from abc import ABC, abstractmethod
import typing as t

# RuleContext is also defined in rules.py (not shown here).


class Rule(ABC):
    """
    Base class for all validation rules.
    """

    @abstractmethod
    def is_valid(self, data: t.Dict, context: RuleContext = None) -> bool:
        """
        Checks if the rule is valid for the given data.
        """
        pass

    @abstractmethod
    def error_message(self, data: t.Dict, context: RuleContext = None) -> str:
        """
        Returns an error message if the rule is not valid.
        """
        pass

The Rule class is an abstract base class (ABC), which means that it cannot be instantiated directly. Instead, it must be subclassed. Subclasses must implement the is_valid and error_message methods.

  • is_valid(data): This method takes a dictionary of data and returns True if the data is valid according to the rule, False otherwise.
  • error_message(data): This method returns a string containing an error message if the data is invalid.
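
As a rough sketch of what a subclass might look like, the block below defines a hypothetical `MinimumAgeRule`. To keep it runnable on its own, it declares a stand-in `Rule` base class that mirrors the listing above, with `RuleContext` replaced by `typing.Any`. It does not claim to match the constructor signatures of PyVeritas's built-in rules.

```python
import typing as t
from abc import ABC, abstractmethod


class Rule(ABC):
    """Stand-in for pyveritas.rules.Rule, mirroring the listing above."""

    @abstractmethod
    def is_valid(self, data: t.Dict, context: t.Any = None) -> bool:
        ...

    @abstractmethod
    def error_message(self, data: t.Dict, context: t.Any = None) -> str:
        ...


class MinimumAgeRule(Rule):
    """Hypothetical rule: the 'age' field must be an int >= minimum."""

    def __init__(self, minimum: int):
        self.minimum = minimum

    def is_valid(self, data: t.Dict, context: t.Any = None) -> bool:
        age = data.get("age")
        return isinstance(age, int) and age >= self.minimum

    def error_message(self, data: t.Dict, context: t.Any = None) -> str:
        return f"age must be an integer >= {self.minimum}, got {data.get('age')!r}"


rule = MinimumAgeRule(18)
print(rule.is_valid({"age": 30}))       # True
print(rule.is_valid({"age": 12}))       # False
print(rule.error_message({"age": 12}))  # age must be an integer >= 18, got 12
```

Keeping the check and its message in one small class is what makes a rule reusable: the same object can be dropped into any contract that needs it.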

The rules.py file also defines several concrete rule implementations, such as StringLengthRule, NumberRangeRule, and StringRegexRule. These rules provide ready-to-use validation logic for common data types and formats. And this is where we need your help!

If you would like to contribute to PyVeritas, please visit the GitHub repository at https://github.com/tpmccallum/PyVeritas. There are so many data contracts that we can create and reuse in the Python community.

Back to the theory …

The contracts.py module defines the DataContract class, which is responsible for holding a collection of Rule objects and applying them to a given dataset:

from abc import ABC, abstractmethod
import typing as t
from pyveritas.rules import Rule, RuleContext, StringRegexRule, NumberRangeRule, StringLengthRule, RequiredRule


class DataContract(ABC):
    """
    Base class for all data contracts.
    """

    def __init__(self, rules: t.List[Rule] = None):
        self.rules = rules or []

    @abstractmethod
    def validate(self, data: t.Dict, context: RuleContext = None) -> t.List[str]:
        """
        Validates the given data against the contract's rules.
        Returns a list of error messages. If the list is empty, the data is valid.
        """
        pass

    def add_rule(self, rule: Rule):
        """
        Adds a rule to the contract.
        """
        self.rules.append(rule)

    def __call__(self, data: t.Dict, context: RuleContext = None) -> t.List[str]:
        """
        Allows the contract to be called like a function.
        """
        return self.validate(data, context)

The DataContract class has the following key features:

  • __init__(rules): The constructor takes an optional list of Rule objects. These rules will be used to validate data against the contract.
  • validate(data): This method takes a dictionary of data and returns a list of error messages. If the data is valid according to all of the rules in the contract, the list will be empty.
  • add_rule(rule): This method allows you to add a new Rule to the contract after it has been created. We’ll discuss why this exists later.
  • __call__(data): This method allows you to call the DataContract object like a function. It simply calls the validate method and returns the result.
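
Because `validate` is abstract, each concrete contract has to supply it. One plausible implementation, shown below as a sketch and not as PyVeritas's actual code, loops over `self.rules` and collects the message from every rule that fails. `Rule`, `DataContract`, `RequiredFieldRule` and `SimpleContract` are all stand-ins defined here so the block runs without PyVeritas installed, and the `context` parameter is omitted for brevity.

```python
import typing as t
from abc import ABC, abstractmethod


class Rule(ABC):
    """Stand-in rule interface (context parameter omitted for brevity)."""

    @abstractmethod
    def is_valid(self, data: t.Dict) -> bool: ...

    @abstractmethod
    def error_message(self, data: t.Dict) -> str: ...


class RequiredFieldRule(Rule):
    """Hypothetical rule: the named field must be present."""

    def __init__(self, field: str):
        self.field = field

    def is_valid(self, data: t.Dict) -> bool:
        return self.field in data

    def error_message(self, data: t.Dict) -> str:
        return f"'{self.field}' is required"


class DataContract(ABC):
    """Stand-in mirroring the DataContract listing above."""

    def __init__(self, rules: t.Optional[t.List[Rule]] = None):
        self.rules = rules or []

    @abstractmethod
    def validate(self, data: t.Dict) -> t.List[str]: ...

    def __call__(self, data: t.Dict) -> t.List[str]:
        return self.validate(data)


class SimpleContract(DataContract):
    """Hypothetical concrete contract: collect one message per failing rule."""

    def validate(self, data: t.Dict) -> t.List[str]:
        return [r.error_message(data) for r in self.rules if not r.is_valid(data)]


contract = SimpleContract([RequiredFieldRule("name"), RequiredFieldRule("email")])
print(contract({"name": "John"}))  # ["'email' is required"]
print(contract({"name": "John", "email": "test@example.com"}))  # []
```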

The validator.py module defines the Validator class, which provides a simple interface for validating data against a DataContract:

from pyveritas.contracts import DataContract
from pyveritas.rules import RuleContext
import typing as t

class Validator:
    """
    A simple validator class that validates data against a DataContract.
    """

    def __init__(self, contract: DataContract):
        self.contract = contract

    def validate(self, data: t.Dict, context: RuleContext = None) -> t.List[str]:
        """
        Validates the given data against the contract.
        """
        return self.contract.validate(data, context)

    def is_valid(self, data: t.Dict, context: RuleContext = None) -> bool:
        """
        Returns True if the data is valid, False otherwise.
        """
        return not bool(self.validate(data, context))

    def __call__(self, data: t.Dict, context: RuleContext = None) -> bool:
        """
        Allows the validator to be called like a function.
        """
        return self.is_valid(data, context)

The Validator class has the following key features:

  • __init__(contract): The constructor takes a DataContract object.
  • validate(data): This method takes a dictionary of data and returns a list of error messages. It simply calls the validate method on the DataContract object.
  • is_valid(data): This method takes a dictionary of data and returns True if the data is valid according to the DataContract, False otherwise. It simply calls the validate method and checks if the list of error messages is empty.
  • __call__(data): This method allows you to call the Validator object like a function. It simply calls the is_valid method and returns the result.
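
Since `__call__` returns a plain boolean, a validator can be passed anywhere Python expects a predicate. The sketch below reproduces the `Validator` methods from the listing above (without the `context` parameter), pairs them with a hypothetical `HasEmailContract` stand-in, and then hands the validator to the built-in `filter`.

```python
import typing as t


class HasEmailContract:
    """Hypothetical stand-in contract: requires an 'email' key."""

    def validate(self, data: t.Dict) -> t.List[str]:
        return [] if "email" in data else ["'email' is required"]


class Validator:
    """Same logic as the Validator listing above, minus the context argument."""

    def __init__(self, contract):
        self.contract = contract

    def validate(self, data: t.Dict) -> t.List[str]:
        return self.contract.validate(data)

    def is_valid(self, data: t.Dict) -> bool:
        # An empty error list is falsy, so "no errors" becomes True.
        return not bool(self.validate(data))

    def __call__(self, data: t.Dict) -> bool:
        return self.is_valid(data)


validator = Validator(HasEmailContract())
records = [{"email": "a@example.com"}, {"name": "no email"}, {"email": "b@example.com"}]

# Keep only the records that satisfy the contract.
valid_records = list(filter(validator, records))
print(valid_records)  # [{'email': 'a@example.com'}, {'email': 'b@example.com'}]
```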

The runner.py module defines the TestRunner class, which is responsible for running the tests and making sure the data contracts are working correctly:

import importlib

from pyveritas.validator import Validator


class TestRunner():
    """
    A test runner for DataContracts.
    """

    # Excerpt: the constructor and the _evaluate_test() and summary() helpers are not shown.

    def run(self):
        """
        Runs all test cases in the suite.
        """
        print(f"Running test suite: {self.name}...")

        for test_case in self.test_cases:
            description = test_case["description"]
            contract = test_case["contract"]  # This is a string of the contract classname
            data = test_case["data"]
            expected_errors = test_case["expected_errors"]

            # Dynamically load the contract class
            module_name = "pyveritas.contracts"  # Assuming contracts are in pyveritas/contracts.py
            module = importlib.import_module(module_name)
            contract_class = getattr(module, contract)
            contract_instance = contract_class()

            validator = Validator(contract_instance)
            errors = validator.validate(data)

            self._evaluate_test(description, data, errors, expected_errors)

        print(f"Test suite {self.name} complete.")
        self.summary()
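
The dynamic-loading step is ordinary standard-library Python: `importlib.import_module` returns a module object, and `getattr` looks a class up on it by name. The sketch below demonstrates the same pattern against the standard `collections` module so that it runs without PyVeritas installed; the `load_class` helper is hypothetical. The test-case dictionary uses the four keys that `run()` reads, but the values are made-up examples and are not taken from PyVeritas's own test suite.

```python
import importlib
import typing as t


def load_class(module_name: str, class_name: str) -> type:
    """Hypothetical helper: import a module and fetch a class from it by name."""
    module = importlib.import_module(module_name)
    return getattr(module, class_name)  # raises AttributeError if missing


# The same lookup the runner performs, but against a stdlib module.
counter_class = load_class("collections", "Counter")
counter = counter_class("hello")
print(counter["l"])  # 2

# Shape of one test case, using the keys read in TestRunner.run().
# The values are illustrative only.
test_case: t.Dict[str, t.Any] = {
    "description": "Valid user passes",
    "contract": "UserContract",  # class name as a string
    "data": {"name": "John", "email": "test@example.com", "age": 30},
    "expected_errors": [],
}
```

One consequence of looking contracts up by string is that a misspelled class name fails only at run time, with an `AttributeError` from `getattr`.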

Another example:

Let’s go over the same example from above, but this time with a twist: we trace its execution to see how the different components interact.

Suppose we have the following data, and we want to validate it against the UserContract:

data = {"name": "John Doe", "email": "test@example.com", "age": 30}

Here’s how the process works:

  1. We create an instance of the UserContract.
  2. We create an instance of the Validator, passing in the UserContract object.
  3. We call the validator.is_valid(data) method.
  4. The validator.is_valid(data) method calls the user_contract.validate(data) method.
  5. The user_contract.validate(data) method iterates over the rules in the UserContract.
  6. For each rule, it calls the rule.is_valid(data) method.
  7. For each rule that returns False, the user_contract.validate(data) method adds that rule’s error message to the list of error messages.
  8. The user_contract.validate(data) method returns the list of error messages.
  9. The validator.is_valid(data) method checks if the list of error messages is empty. If it is, it returns True. Otherwise, it returns False.
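
The nine steps can be reproduced with a small trace. The classes below are stand-ins written for this sketch (PyVeritas's `UserContract` and its rules are not reproduced here); each method appends its name to a `trace` list so that the call order becomes visible.

```python
import typing as t

trace: t.List[str] = []  # records the order in which methods are called


class NameRule:
    """Stand-in rule: 'name' must be a non-empty string."""

    def is_valid(self, data: t.Dict) -> bool:
        trace.append("rule.is_valid")
        return isinstance(data.get("name"), str) and bool(data["name"])

    def error_message(self, data: t.Dict) -> str:
        trace.append("rule.error_message")
        return "name must be a non-empty string"


class UserContractSketch:
    """Stand-in contract holding a single rule (steps 5 to 8)."""

    def __init__(self):
        self.rules = [NameRule()]

    def validate(self, data: t.Dict) -> t.List[str]:
        trace.append("contract.validate")
        return [r.error_message(data) for r in self.rules if not r.is_valid(data)]


class ValidatorSketch:
    """Stand-in validator (steps 3, 4 and 9)."""

    def __init__(self, contract):
        self.contract = contract

    def is_valid(self, data: t.Dict) -> bool:
        trace.append("validator.is_valid")
        return not self.contract.validate(data)


validator = ValidatorSketch(UserContractSketch())  # steps 1 and 2
result = validator.is_valid({"name": "John Doe", "email": "test@example.com", "age": 30})
print(result)  # True
print(trace)   # ['validator.is_valid', 'contract.validate', 'rule.is_valid']
```

For valid data, `error_message` never appears in the trace, because in this sketch a message is requested only from a rule that has just failed.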

You might have noticed that the DataContract class has an add_rule method, but we haven’t used it in our examples:

    def add_rule(self, rule: Rule):
        """
        Adds a rule to the contract.
        """
        self.rules.append(rule)

This add_rule method allows you to add a new Rule to the contract after it has been created. This can be useful in situations where you need to dynamically add rules to a contract based on some runtime conditions.

For example, you might want to add a rule that checks if a user is an administrator only if the user’s account is active. You could do something like this:

user_contract = UserContract()

if user_data["is_active"]:
    user_contract.add_rule(IsAdminRule())
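
A self-contained version of this idea is sketched below. `IsAdminRule` does not appear in the module listings above, so it is treated here as a hypothetical rule, and `SketchContract` is a stand-in for `UserContract`. The sketch uses `.get()` so that a missing `is_active` key does not raise a `KeyError`.

```python
import typing as t


class IsAdminRule:
    """Hypothetical rule: the 'role' field must equal 'admin'."""

    def is_valid(self, data: t.Dict) -> bool:
        return data.get("role") == "admin"

    def error_message(self, data: t.Dict) -> str:
        return "user is not an administrator"


class SketchContract:
    """Stand-in contract with the same add_rule behaviour as DataContract."""

    def __init__(self):
        self.rules: t.List[t.Any] = []

    def add_rule(self, rule) -> None:
        self.rules.append(rule)

    def validate(self, data: t.Dict) -> t.List[str]:
        return [r.error_message(data) for r in self.rules if not r.is_valid(data)]


def build_contract(user_data: t.Dict) -> SketchContract:
    """Attach the admin check only when the account is active."""
    contract = SketchContract()
    if user_data.get("is_active"):
        contract.add_rule(IsAdminRule())
    return contract


active_user = {"is_active": True, "role": "viewer"}
inactive_user = {"is_active": False, "role": "viewer"}

print(build_contract(active_user).validate(active_user))      # ['user is not an administrator']
print(build_contract(inactive_user).validate(inactive_user))  # []
```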

You might have also noticed that the Rule class defines abstract methods (i.e. is_valid and error_message) with just the pass statement:

    @abstractmethod
    def is_valid(self, data: t.Dict, context: RuleContext = None) -> bool:
        """
        Checks if the rule is valid for the given data.
        """
        pass

The pass statement is a no-op in Python: it does nothing and serves here only as a placeholder body. The requirement that subclasses implement the method comes from the @abstractmethod decorator, not from pass.
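
This behaviour of abstract methods can be checked with the standard `abc` module. In the sketch below, `Rule` is a cut-down stand-in for the PyVeritas class and `IncompleteRule` is a hypothetical subclass that forgets to implement `error_message`; Python refuses to instantiate it.

```python
from abc import ABC, abstractmethod


class Rule(ABC):
    """Cut-down stand-in for the Rule base class."""

    @abstractmethod
    def is_valid(self, data: dict) -> bool:
        pass

    @abstractmethod
    def error_message(self, data: dict) -> str:
        pass


class IncompleteRule(Rule):
    """Hypothetical subclass that implements only one abstract method."""

    def is_valid(self, data: dict) -> bool:
        return True


try:
    IncompleteRule()
    instantiated = True
except TypeError as exc:
    # abc blocks instantiation while any abstract method is unimplemented.
    instantiated = False
    print(f"TypeError: {exc}")

print(instantiated)  # False
```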

We raise these points because ultimately we want to add more pre-defined rules for common data validation tasks. Are you interested in helping to build a powerful and flexible library for defining and enforcing data contracts in Python? If you want to contribute, or if you can think of ways to leverage its features to build more robust and reliable applications, please visit the GitHub repository (comment, raise issues, create pull requests).

Future development

The following is a list of future developments needed:

  • A comprehensive library of pre-built rules for common data validation tasks (which will prevent developers from having to write the same validation logic over and over again).
  • Collaboration between developers by providing a common language for defining data contracts.
  • Demonstrations about using PyVeritas to validate data coming from external APIs.
  • Demonstrations about using PyVeritas to validate data as it flows through a data pipeline.
  • Instructions and examples of how to use PyVeritas rules within pytest tests.
  • Instructions on creating your own custom validation rules.
  • Pydantic Integration.
  • More detailed error messages that clearly indicate which rule failed and why.
  • The ability to customize error messages.
  • Lots of real-world examples showing how to use PyVeritas to solve different data validation problems.

Thanks for reading.

Written by Timothy McCallum

Exploring WebAssembly’s potential as a researcher, writer, and software engineer. https://tpmccallum.github.io/timothy.mccallum.com.au/
