DataCebo

Leaderboard

The Quality-Speed Tradeoffs

The Optimal Frontier of Synthetic Data—Where Performance Becomes Clear, and Comparisons Go Beyond Quality Alone.

Common Q&A

Below are answers to common questions about SDGym.

Most benchmarks are point-in-time and often designed to showcase a vendor’s product. Continuous benchmarking keeps datasets up-to-date and problems challenging, while enabling ongoing evaluation of synthetic data generators across multiple dimensions—quality, speed, coverage, stability, and reliability. Many benchmarks focus only on quality, overlooking factors like maintainability. In practice, synthetic data systems can quickly fall out of maintenance, and a well-maintained software library is critical for enterprise production use.

A synthesizer is considered a “winner” for a dataset if it achieves a strong balance of quality and speed. Specifically, it must lie on the Pareto frontier—meaning no other synthesizer is both higher quality and faster—and it must match or exceed the quality of a reliable baseline (the Gaussian Copula synthesizer). This ensures that winners are not just high-quality models, but those that deliver the best practical trade-offs for real-world use.
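The selection rule above can be sketched in a few lines of Python. This is an illustrative sketch only: the synthesizer names, quality scores, and runtimes below are hypothetical, not actual SDGym leaderboard data.

```python
# Hypothetical benchmark results: (name, quality score in [0, 1], runtime in seconds).
# These values are illustrative only, not real SDGym leaderboard numbers.
results = [
    ("SynthA", 0.92, 300.0),
    ("SynthB", 0.88, 45.0),
    ("GaussianCopula", 0.80, 10.0),  # the reliable baseline
    ("SynthC", 0.75, 60.0),          # slower AND lower quality than the baseline
]

BASELINE_QUALITY = 0.80  # quality of the Gaussian Copula baseline above

def pareto_frontier(entries):
    """Keep entries not dominated by any other entry.

    An entry is dominated if some other entry is at least as good on both
    axes (higher-or-equal quality, faster-or-equal runtime) and strictly
    better on at least one of them.
    """
    frontier = []
    for name, quality, runtime in entries:
        dominated = any(
            (q2 > quality and t2 <= runtime) or (q2 >= quality and t2 < runtime)
            for _, q2, t2 in entries
        )
        if not dominated:
            frontier.append((name, quality, runtime))
    return frontier

# A "winner" must be on the Pareto frontier AND match or exceed the
# baseline's quality.
winners = [e for e in pareto_frontier(results) if e[1] >= BASELINE_QUALITY]
print([name for name, _, _ in winners])
```

Here SynthC is excluded twice over: the baseline dominates it (higher quality and faster), and it also falls below the baseline quality bar. The three remaining entries each offer a quality level no faster model can match, which is exactly the trade-off the frontier captures.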

Copulas, introduced in the 1960s, are powerful and computationally efficient generative models. As a result, we believe it is reasonable for any synthetic data generator to perform at least as well as—if not better than—this baseline.

While quality is critical, it is not the only constraint in real-world deployments. Generating high-quality synthetic data often comes with increased computational cost and time. Many use cases—such as testing, rapid iteration, or on-demand data generation—require results within strict time limits. Evaluating both quality and speed ensures that teams can choose a synthesizer that fits their operational needs. The Pareto frontier highlights the optimal trade-offs, helping identify models that deliver the best possible quality for a given level of compute.

Synthetic data generators trained using differential privacy require empirical evaluation, as outlined in our trust-but-verify framework. We are expanding these evaluations and currently run them on a limited set of datasets and synthesizers due to computational constraints. For enterprises, we offer customized evaluations and comparisons for a small fee—please contact us to learn more.

We welcome contributions of new synthesizers. To get started, follow the instructions in the SDGym documentation for adding a custom synthesizer. Once implemented, you can submit a pull request, and our team will review it to ensure it meets the required standards. If approved, your synthesizer will be merged into SDGym and included in a future benchmark run, and we will publish the updated leaderboard results and share them with you.

Enterprises can use what we call the exclusion principle: synthetic data generators that are not on the Pareto frontier are not ready for enterprise adoption. In addition, all the code we use to run these synthetic data generators is publicly available, so enterprises are welcome to run these benchmarks privately.

We have built a stable, continuously running benchmark for synthetic data generators. As synthetic data becomes integral across enterprises—addressing challenges of access, availability, and quality—maintaining this system transparently requires substantial computational resources. We welcome contributions in any form—funding, compute credits, or otherwise. Please reach out to get involved, and we’re happy to support custom deployments on your cloud as well.

If you would like to contribute a dataset, you can create an issue on our GitHub repository. While we cannot guarantee inclusion of every submission due to computational constraints, we prioritize dataset suggestions from enterprise users.

Quick start

Try it out now

Quickly discover SDV with just a few lines of code!

Install SDV (pip install sdv), then run:
from sdv.datasets.demo import download_demo
from sdv.single_table import GaussianCopulaSynthesizer

real_data, metadata = download_demo(
  'single_table', 'fake_hotel_guests')

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)

synthetic_data = synthesizer.sample(num_rows=10)

Follow us

Join our Community

Chat with developers across the world. Stay up-to-date with the latest features, blogs, and news.

DataCebo Forum

Discuss SDV features, ask questions, and receive help.

Join the DataCebo Forum

GitHub

SDV is publicly available.

Follow us on GitHub

LinkedIn

Connect with DataCebo on LinkedIn.

Follow us on LinkedIn

Make synthetic data a reality

© 2026 DataCebo, Inc.