With the Synthetic Data Vault, users can train generative models on their real data and then use them to sample any amount of synthetic data, which cannot be linked back to the real data. Two critical inputs are required to get started: metadata that richly describes the real data by annotating it with statistical types and semantic meaning, and a high quality training dataset extracted from real data.
Within the Synthetic Data Vault, we invented a metadata standard that uses the principled ways the database world describes data, while also augmenting it with annotations that provide the deeper meaning generative AI model training algorithms need. This metadata standard also includes what we call semantic or statistical types (also known as sdtypes).
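For concreteness, here is an abridged example of SDV multi-table metadata expressed as a Python dictionary; the table and column names are illustrative:

```python
# Abridged, illustrative SDV multi-table metadata. Each column is
# annotated with an sdtype; relationships capture foreign keys.
metadata_dict = {
    'tables': {
        'transactions': {
            'primary_key': 'transaction_id',
            'columns': {
                'transaction_id': {'sdtype': 'id'},
                'product_id': {'sdtype': 'id'},
                'amount': {'sdtype': 'numerical'},
                'created_at': {'sdtype': 'datetime', 'datetime_format': '%Y-%m-%d'},
            },
        },
        # 'products' table omitted for brevity
    },
    'relationships': [
        {
            'parent_table_name': 'products',
            'parent_primary_key': 'product_id',
            'child_table_name': 'transactions',
            'child_foreign_key': 'product_id',
        },
    ],
}
```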
Over the past few years, as synthetic data initiatives have expanded beyond the experimental phase, our users have faced two challenges: how to create highly accurate metadata with minimal manual work, and how to extract high-quality training data from a complex database with numerous tables.
Today, we are announcing a new feature for our SDV Enterprise users: AI Connectors. AI Connectors allow users to create a robust, referentially sound training dataset by connecting to an existing database and automatically creating highly accurate metadata, regardless of the underlying database technology.
Below, we describe the highlights of this new feature and how you can access it. In an accompanying blog, Gaurav Sheni describes the key technological underpinnings of AI Connectors, how we assessed its benefits for our users, and more.
Automate Synthetic Data Creation Using AI Connectors
Highly accurate metadata and a robust training dataset can significantly increase the quality of your synthetic data. Before AI Connectors, users would export their data into CSVs and manually create metadata and a referentially sound training dataset. With AI Connectors, they can instead connect to the database, create more accurate metadata, and subsample a robust, referentially sound training dataset, all automatically.



| Input required by SDV | Without AI Connectors | With AI Connectors |
|---|---|---|
| Metadata | Users export data and use SDV's auto detection feature to detect metadata. This requires SDV to learn from the data only, which is challenging. | AI Connectors integrates with the database and leverages the schema information available there, alongside an inference engine that creates highly accurate metadata. |
| Training data | Users export data from multiple tables independently, and must drop some data to achieve referential integrity. This reduces the amount of data that can be used. | AI Connectors contains inbuilt subsampling algorithms designed for multi-table datasets that can create referentially sound training data. |
How it works: Creating metadata and robust training data
AI Connectors starts by leveraging the information already available within a database: data types, relationship information, and constraints. Within the AI Connectors bundle, a metadata inference engine maps this information onto the deeper information SDV requires to train generative models, regardless of the underlying database technology. We built this so that SDV users don't have to build a custom workflow each time they encounter a new database. A second feature within this bundle creates a robust, referentially sound subsample from a multi-table dataset. Users can specify the size of the subsample the algorithm should create; for example, the number of customers in a customer database, or the number of transactions in a credit card transactions database.
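As an illustration, requesting a size-bounded subsample might look like the sketch below; the `main_table` and `num_rows` arguments are hypothetical parameter names used only to convey the idea, not confirmed SDV Enterprise API:

```python
# Hypothetical sketch: exact parameter names may differ in SDV Enterprise.
from sdv.io.database import BigQueryConnector

connector = BigQueryConnector()
connector.set_import_config(dataset_id='sales_dataset')
metadata = connector.create_metadata(tables=['customers', 'transactions'])

# Assumption: a size hint anchored on a main table, e.g. 10,000 customers,
# with related transaction rows pulled in to preserve referential integrity.
real_data = connector.import_random_subset(
    metadata,
    main_table='customers',  # hypothetical parameter
    num_rows=10_000,         # hypothetical parameter
)
```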
With AI Connectors, metadata quality improved by 35% and synthetic data quality increased by an average of 18%
In addition to automating work and reducing time and effort, our benchmarking revealed that the quality of the metadata created by AI Connectors improved by 35% when compared to metadata created with the current inference methods available via SDV Community.
According to our benchmarks, using AI Connectors to create a referentially sound, robust, multi-table dataset for training also improved the quality of synthetic data by an average of 18%.
Using AI Connectors
```python
from sdv.io.database import BigQueryConnector
from sdv.multi_table import HSASynthesizer

connector = BigQueryConnector()
connector.set_import_config(dataset_id='sales_dataset')

# Use the connector to create metadata and extract a real data training set
metadata = connector.create_metadata(tables=['products', 'transactions'])
real_data = connector.import_random_subset(metadata)

# Create synthetic data with SDV
synthesizer = HSASynthesizer(metadata)
synthesizer.fit(real_data)
synthetic_data = synthesizer.sample()

# Use the same connector to export synthetic data to a database
connector.set_export_config(dataset_id='synthetic_sales_dataset')
connector.export_data(synthetic_data)
```

Benefits of using AI Connectors
In the past, users have tried to create metadata and training data manually. AI Connectors automates these two steps while leveraging the information already available within a database.
Fully automated metadata creation
Fully automated metadata creation allows users to speed up the metadata creation process. Since this step no longer requires real data, it can also be integrated with our DayZSynthesizer, which can create synthetic data using metadata only. In our internal tests, metadata detection accuracy is 35% higher than when inferred from the data.
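As a sketch, pairing connector-created metadata with DayZSynthesizer might look like the following; the DayZSynthesizer import path and its exact API are assumptions patterned on the synthesizer conventions shown above:

```python
# Sketch under assumptions: DayZSynthesizer's import path and API may
# differ in SDV Enterprise; shown here only to illustrate the workflow.
from sdv.io.database import BigQueryConnector
from sdv.multi_table import DayZSynthesizer  # assumed import path

connector = BigQueryConnector()
connector.set_import_config(dataset_id='sales_dataset')
metadata = connector.create_metadata(tables=['products', 'transactions'])

# No real data is imported: the synthesizer works from metadata alone.
synthesizer = DayZSynthesizer(metadata)
synthetic_data = synthesizer.sample()
```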
Improved synthetic data quality
In our internal tests, we found that using AI Connectors increases synthetic data quality by an average of 18%, thanks to the combined improvements of greater metadata accuracy and a referentially sound, robust subsample of training data.
Incremental training data provision for generative AI models
Your database may have terabytes of data, raising the common question of how much of it should be used to train the generative AI model. The AI Connectors inbuilt subsampling algorithm allows you to provide training data incrementally: bring in a referentially sound subsample, train a synthesizer, generate synthetic data, measure its quality, and then bring in more training data if necessary.
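A minimal sketch of that loop, reusing the connector API from the example above; the growing subsample sizes, the hypothetical `num_rows` parameter, and the stopping threshold are illustrative assumptions, while `evaluate_quality` is SDV's multi-table quality report:

```python
# Illustrative sketch: grow the training subsample until quality plateaus.
from sdv.evaluation.multi_table import evaluate_quality
from sdv.io.database import BigQueryConnector
from sdv.multi_table import HSASynthesizer

connector = BigQueryConnector()
connector.set_import_config(dataset_id='sales_dataset')
metadata = connector.create_metadata(tables=['products', 'transactions'])

best_score = 0.0
for size in (10_000, 50_000, 250_000):  # illustrative subsample sizes
    # num_rows is a hypothetical parameter for bounding the subsample size
    real_data = connector.import_random_subset(metadata, num_rows=size)

    synthesizer = HSASynthesizer(metadata)
    synthesizer.fit(real_data)
    synthetic_data = synthesizer.sample()

    # Score similarity between real and synthetic data (0 to 1)
    report = evaluate_quality(real_data, synthetic_data, metadata)
    score = report.get_score()
    if score - best_score < 0.01:  # quality has plateaued; stop adding data
        break
    best_score = score
```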
Scaling for synthetic data initiatives
Enterprise users have their data stored in numerous databases. AI Connectors allows generative AI models to be developed for any of these databases, letting users scale their synthetic data initiatives without building custom workflows for each new database. Currently, AI Connectors supports AlloyDB, Oracle, Microsoft SQL Server, Google Cloud Spanner, and Google BigQuery. We are actively expanding support to additional databases. If you have a database type that you would like us to prioritize, please fill out this form.
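As a sketch, switching databases should only change the connector setup; the OracleConnector class name and its configuration keys below are assumptions patterned on the BigQuery example above:

```python
# Sketch: the same workflow against a different database. OracleConnector
# and its configuration are assumed names patterned on BigQueryConnector.
from sdv.io.database import OracleConnector  # assumed connector class

connector = OracleConnector()
connector.set_import_config(dataset_id='sales_dataset')  # config keys may differ per database

# From here, the workflow is identical to the BigQuery example above.
metadata = connector.create_metadata(tables=['products', 'transactions'])
real_data = connector.import_random_subset(metadata)
```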


