DataCebo

Leaderboard

The Quality-Speed Tradeoffs

The Optimal Frontier of Synthetic Data—Where Performance Becomes Clear, and Comparisons Go Beyond Quality Alone.

Common Q&A

Below are answers to common questions about SDGym.

Most benchmarks are point-in-time and often designed to showcase a vendor’s product. Continuous benchmarking keeps datasets up-to-date and problems challenging, while enabling ongoing evaluation of synthetic data generators across multiple dimensions—quality, speed, coverage, stability, and reliability. Many benchmarks focus only on quality, overlooking factors like maintainability. In practice, synthetic data systems can quickly fall out of maintenance, and a well-maintained software library is critical for enterprise production use.

A synthesizer is considered a “winner” for a dataset if it achieves a strong balance of quality and speed. Specifically, it must lie on the Pareto frontier—meaning no other synthesizer is both higher quality and faster—and it must match or exceed the quality of a reliable baseline (the Gaussian Copula synthesizer). This ensures that winners are not just high-quality models, but those that deliver the best practical trade-offs for real-world use.
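The selection rule above can be sketched in a few lines of Python. This is an illustrative sketch only: the synthesizer names, quality scores, and runtimes below are hypothetical, not actual SDGym leaderboard data.

```python
# Hypothetical benchmark results: (name, quality score in [0, 1], runtime in seconds).
# These values are illustrative only, not real SDGym leaderboard numbers.
results = [
    ("SynthA", 0.92, 300.0),
    ("SynthB", 0.88, 45.0),
    ("GaussianCopula", 0.80, 10.0),  # the reliable baseline
    ("SynthC", 0.75, 60.0),          # slower AND lower quality than the baseline
]

BASELINE_QUALITY = 0.80  # quality of the Gaussian Copula baseline above

def pareto_frontier(entries):
    """Keep entries not dominated by any other entry.

    An entry is dominated if some other entry is at least as good on both
    axes (higher-or-equal quality, faster-or-equal runtime) and strictly
    better on at least one of them.
    """
    frontier = []
    for name, quality, runtime in entries:
        dominated = any(
            (q2 > quality and t2 <= runtime) or (q2 >= quality and t2 < runtime)
            for _, q2, t2 in entries
        )
        if not dominated:
            frontier.append((name, quality, runtime))
    return frontier

# A "winner" must be on the Pareto frontier AND match or exceed the
# baseline's quality.
winners = [e for e in pareto_frontier(results) if e[1] >= BASELINE_QUALITY]
print([name for name, _, _ in winners])
```

Here SynthC is excluded twice over: the baseline dominates it (higher quality and faster), and it also falls below the baseline quality bar. The three remaining entries each offer a quality level no faster model can match, which is exactly the trade-off the frontier captures.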

Copulas, introduced in the 1960s, are powerful and computationally efficient generative models. As a result, we believe it is reasonable for any synthetic data generator to perform at least as well as—if not better than—this baseline.

While quality is critical, it is not the only constraint in real-world deployments. Generating high-quality synthetic data often comes with increased computational cost and time. Many use cases—such as testing, rapid iteration, or on-demand data generation—require results within strict time limits. Evaluating both quality and speed ensures that teams can choose a synthesizer that fits their operational needs. The Pareto frontier highlights the optimal trade-offs, helping identify models that deliver the best possible quality for a given level of compute.

Synthetic data generators trained using differential privacy require empirical evaluation, as outlined in our trust-but-verify framework. We are expanding these evaluations and currently run them on a limited set of datasets and synthesizers due to computational constraints. For enterprises, we offer customized evaluations and comparisons for a small fee—please contact us to learn more.

We welcome contributions of new synthesizers. To get started, follow the instructions in the SDGym documentation for adding a custom synthesizer. Once implemented, you can submit a pull request, and our team will review it to ensure it meets the required standards. If approved, your synthesizer will be merged into SDGym and included in a future benchmark run, and we will publish the updated leaderboard results and share them with you.

Enterprises can use what we call the exclusion principle: synthetic data generators that are not on the Pareto frontier are not ready for enterprise adoption. In addition, all the code we use to run these synthetic data generators is publicly available, so enterprises are welcome to run these benchmarks privately.

We have built a stable, continuously running benchmark for synthetic data generators. As synthetic data becomes integral across enterprises—addressing challenges of access, availability, and quality—maintaining this system transparently requires substantial computational resources. We welcome contributions in any form—funding, compute credits, or otherwise. Please reach out to get involved, and we’re happy to support custom deployments on your cloud as well.

If you would like to contribute a dataset, you can create an issue on our GitHub repository. While we cannot guarantee inclusion of every submission due to computational constraints, we prioritize dataset suggestions from enterprise users.

Quick start

Try it out now

Quickly discover SDV with just a few lines of code!

Install SDV (pip install sdv), then run:
from sdv.datasets.demo import download_demo
from sdv.single_table import GaussianCopulaSynthesizer

real_data, metadata = download_demo(
  'single_table', 'fake_hotel_guests')

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)

synthetic_data = synthesizer.sample(num_rows=10)

Follow us

Join our Community

Chat with developers across the world. Stay up-to-date with the latest features, blogs, and news.

DataCebo Forum

Discuss SDV features, ask questions, and receive help.

Join the DataCebo Forum

GitHub

SDV is publicly available.

Follow us on GitHub

LinkedIn

Connect with DataCebo on LinkedIn.

Follow us on LinkedIn

Make synthetic data a reality

© 2026 DataCebo, Inc.