Introducing Constraint-Augmented Generation (CAG)

Many years ago, we at DataCebo developed and introduced the concept of constraints: the ability to input logic into generative AI models to create better synthetic data. Adding a constraint to a generative AI model ensures that the resulting synthetic data follows specific rules, which statistical and neural network-based algorithms may fail to understand. For many of our users, these rules pertain to business logic that is core to their application, system, or business. User feedback has shown us that constraints are vital for the success of synthetic data in downstream use cases.

Today, we're doubling down on constraints by announcing a new and more powerful system called Constraint-Augmented Generation (CAG). This system takes constraints to a new level, allowing users to pick from several pre-defined data patterns and apply them across rows, columns, or even multiple tables. Once these are specified via our low-code interface, SDV Enterprise takes care of all the internal data processing and algorithmic complexities, and you are ready to sample 100% valid synthetic data.

Enhance Generative AI with CAG

Imagine that you have a complex, multi-table database schema that contains many different data tables and connections. You can use SDV Enterprise to learn overall patterns in your data and generate synthetic tables. SDV Enterprise guarantees that your synthetic data will match the table structure and database format of the original data. This means that you'll always see valid data in your table columns, and valid connections between tables.

Much like other generative AI algorithms, SDV Enterprise's synthesizers learn statistical data patterns. They aim to generate synthetic data that has the same overall statistical properties as the original, while also creating new, never-before-seen combinations of data. Usually, this type of synthetic data diversity is a good thing, but in some cases it may inadvertently fail to meet your expectations.

Failures like this usually happen when your data has deterministic relationships between columns, rows, and tables that generative AI cannot automatically capture and emulate. We call such relationships database context. Database context describes hard and fast rules under which data is created and stored. Usually, this context is not explicitly stored within the database schema itself – but your team still knows that it exists. Downstream applications process this data based on the context using logic within the application software. Your expectation is that the synthetic data will also follow the database context.

For example, consider a multi-table setup from a ticket sales company. The data contains user accounts, and the perks each account-holder has access to. On the surface, this may seem like a simple relationship: A user account can have 0 or more associated perks.

But now imagine that there are two different types of user accounts – premium and basic – and only the premium accounts can have perks. This is database context: when you create entries for basic accounts, you never include perks, and any software that processes the data downstream expects the data to conform to that rule.

An example of database context: Only premium accounts, and not basic accounts, can have perks.

On its own, generative AI can understand the overall connection between accounts and perks. However, it's harder for generative AI to learn this extra database context and create synthetic data that always conforms to it. This is where CAG comes in: By adding CAG to your synthesizer, you ensure that the synthetic data follows your database context 100% of the time.

Type of data pattern	Example	What does generative AI learn?
Data schema & format	The accounts table has a primary key. The perks table has a foreign key that refers to it.	SDV's generative AI is designed to fully enforce database concepts.
Data shape & correlations	Most accounts that were created in early 2021 are premium accounts (~70%).	Generative AI learns this pattern and matches it probabilistically.
Database context	Only premium accounts can have associated perks.	Generative AI does not enforce this pattern 100% of the time – you need CAG!
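
To make the last row concrete, here is a minimal sketch of how you might check a set of tables for violations of the premium-only rule. It uses plain Python records in place of DataFrames. The toy rows, the column names (`account_id`, `type`, `perk`), and the helper `find_context_violations` are all hypothetical illustrations, not part of SDV.

```python
# Hypothetical toy data; table layout and column names are illustrative only.
accounts = [
    {"account_id": 1, "type": "premium"},
    {"account_id": 2, "type": "basic"},
    {"account_id": 3, "type": "premium"},
]
perks = [
    {"perk_id": 10, "account_id": 1, "perk": "early access"},
    {"perk_id": 11, "account_id": 2, "perk": "free parking"},  # breaks the rule
    {"perk_id": 12, "account_id": 3, "perk": "lounge entry"},
]


def find_context_violations(accounts, perks):
    """Return the perks rows whose parent account is missing or not premium."""
    account_type = {row["account_id"]: row["type"] for row in accounts}
    return [
        row for row in perks
        if account_type.get(row["account_id"]) != "premium"
    ]


violations = find_context_violations(accounts, perks)
print([row["perk_id"] for row in violations])
```

A check like this only detects the problem after sampling. It says nothing about how to prevent it, which is the gap CAG is meant to fill.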

We have curated 20+ patterns that are supported by CAG

Our team has worked with hundreds of users and customers who actively use synthetic data for their projects. Over the years, users have identified specific instances in which generative AI did not automatically enforce database context. We've collected a large number of these examples, and identified the common patterns behind them, as well as the business needs users are seeking to address. We're now incorporating these patterns into our CAG bundle.

So far, our team has identified 20+ different types of database context patterns that you may be maintaining in your schema. CAG patterns for some of these are available right now, and we will continue to release more as we develop them. To use the bundle, all you have to do is pick one of the pre-defined patterns and tell SDV where to apply it (columns, rows, and/or tables). Browse through some recent examples of CAG patterns in our docs.

CAG Usage 101: Augment your synthesizer with the pattern

To use CAG, all you need to do is pick the pre-defined pattern that corresponds to your database context, and tell SDV where to apply it. It will then augment your synthesizer directly with this information.

The example below shows how you can apply the CarryOverColumns pattern to two tables called Accounts and Transactions.

from sdv.cag import CarryOverColumns
from sdv.multi_table import HSASynthesizer

# pick a pre-defined pattern and specify where to apply it
my_pattern = CarryOverColumns(
    parent_table_name='Accounts',
    child_table_name='Transactions',
    child_foreign_key='Account ID',
    column_map={'Type': 'Type'}
)

# augment your synthesizer with the pattern
synthesizer = HSASynthesizer(metadata)
synthesizer.add_cag(patterns=[my_pattern])

And you're done! After fitting your synthesizer, the synthetic data it produces will match your database context, 100% of the time.

# data is your real data: a dictionary mapping each table name to a pandas DataFrame
synthesizer.fit(data)
valid_synthetic_data = synthesizer.sample()
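
If you want to confirm the result yourself, a small independent check can help. The sketch below assumes, based on the pattern's name and parameters, that CarryOverColumns means each Transactions row carries the same Type value as its parent Accounts row. It also assumes that the Accounts primary key is named 'Account ID' and that column_map keys refer to parent columns. The helper `carried_over_mismatches` is hypothetical and not part of SDV. It works on plain lists of records, which you could obtain from each sampled pandas DataFrame with `to_dict('records')`.

```python
def carried_over_mismatches(parent_rows, child_rows, parent_key,
                            child_foreign_key, column_map):
    """Return child rows whose mapped columns differ from their parent's."""
    parents = {row[parent_key]: row for row in parent_rows}
    mismatches = []
    for child in child_rows:
        parent = parents.get(child[child_foreign_key])
        if parent is None:
            continue  # orphaned rows are a separate, referential problem
        # Assumption: column_map keys name parent columns, values name child columns.
        if any(parent[p_col] != child[c_col]
               for p_col, c_col in column_map.items()):
            mismatches.append(child)
    return mismatches


# Hypothetical toy rows standing in for the sampled tables.
accounts = [
    {"Account ID": "A1", "Type": "premium"},
    {"Account ID": "A2", "Type": "basic"},
]
transactions = [
    {"Transaction ID": 1, "Account ID": "A1", "Type": "premium"},
    {"Transaction ID": 2, "Account ID": "A2", "Type": "basic"},
]

mismatches = carried_over_mismatches(
    accounts, transactions,
    parent_key="Account ID",
    child_foreign_key="Account ID",
    column_map={"Type": "Type"},
)
print(len(mismatches))
```

An empty result means every child row agrees with its parent on the mapped columns.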

For more information about using CAG and our up-to-date Python API, check out our docs.

How it works: Algorithmic injection and data transformations

Augmenting your synthesizer with a CAG pattern is equivalent to injecting a smaller algorithm or data transformation into your workflow. Each CAG pattern comes with its own engine for satisfying its particular rule. The engine may modify the data in some way – for example, adding a column, creating a table, or merging existing information – and it may include a separate AI to ensure 100% data validity. Each CAG pattern is designed to be compatible with any single- or multi-table SDV synthesizer.

Picking a pre-defined CAG pattern is equivalent to injecting a special algorithmic engine into your synthesizer. It ensures that your synthesizer will learn to follow the pattern, 100% of the time.
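
As a rough illustration of the "data transformation" idea, the sketch below shows one way a carried-over column could be handled: remove it from the child table before modeling, then rebuild it from the parent table after sampling. This is an assumption made for illustration only and is not a description of SDV's actual engine. The functions `transform` and `reverse_transform` and the toy rows are hypothetical.

```python
def transform(child_rows, carried_column):
    """Before training: drop the redundant column from the child rows."""
    return [
        {key: value for key, value in row.items() if key != carried_column}
        for row in child_rows
    ]


def reverse_transform(parent_rows, child_rows, parent_key, foreign_key,
                      carried_column):
    """After sampling: rebuild the column by looking up each row's parent."""
    lookup = {row[parent_key]: row[carried_column] for row in parent_rows}
    return [
        {**row, carried_column: lookup[row[foreign_key]]}
        for row in child_rows
    ]


# Hypothetical toy rows; names mirror the Accounts/Transactions example.
accounts = [
    {"Account ID": "A1", "Type": "premium"},
    {"Account ID": "A2", "Type": "basic"},
]
transactions = [
    {"Transaction ID": 1, "Account ID": "A1", "Type": "premium"},
    {"Transaction ID": 2, "Account ID": "A2", "Type": "basic"},
]

reduced = transform(transactions, "Type")
restored = reverse_transform(accounts, reduced, "Account ID", "Account ID", "Type")
print(restored)
```

Because the column is derived rather than generated, the rule holds by construction in this toy version: the model never gets a chance to produce a conflicting value.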

Benefits of using CAG

In the past, users have tried to enforce database context themselves by preprocessing their data or correcting their synthetic data after the fact. Using CAG has several advantages over this manual approach:

Increased trust in your synthetic data
CAG patterns are developed and maintained by DataCebo, the same team behind the popular SDV Enterprise synthesizers. Using CAG guarantees that your synthetic data will match the patterns you want 100% of the time, with optimized performance and minimal side effects.

Algorithmic transparency and control
Adding CAG patterns allows you to control the exact set of rules a synthesizer must learn, and the order in which it learns them. At any point in your synthetic data workflow, you can inspect the CAG patterns in your synthesizer for added reassurance.

Simpler synthetic data workflows
Without CAG, ensuring proper database context requires maintaining extra pre- and post-processing logic. CAG integrates right into your workflow, allowing you to maintain just one synthesizer file for your entire end-to-end synthetic data workflow.

CAG is now in Limited Availability for SDV Enterprise customers

The CAG Bundle is an optional add-on to SDV Enterprise. Purchasing the CAG Bundle will immediately give you access to 6 CAG patterns, plus 2 additional constraints. More are coming soon. As part of your purchase, you'll have access to all the new CAG patterns as we roll them out over time.

Currently, select SDV Enterprise customers can purchase CAG to beta-test the patterns. If you are interested, please reach out to us for more information.