We've reached 10 million downloads!

Kalyan Veeramachaneni, Neha Patki
June 11, 2025

We at DataCebo are excited to share that the Synthetic Data Vault (SDV) Community recently hit an important milestone: 10 million downloads!

SDV Community comprises an entire suite of open core libraries. Users can install and use whichever ones they want during their synthetic data journeys:

  • Users who simply want to transform data and prepare it for generative modeling can use our transformers. Users can then try out their own techniques to build a generative model with the transformed data.
  • Users can choose from a multitude of models[1], ranging from CTGAN (our most popular GAN-based model) to GaussianCopula (our fastest model), TVAE (the model that reliably produces the highest-quality synthetic data), and CopulaGAN (our model that combines copulas and GANs). Some users even choose all of these options; for example, the J. P. Morgan AI Research team used all of them[2].
  • Users who want to evaluate their synthetic data can use our SDMetrics library, which allows for assessment along a variety of metrics.
  • Our benchmarking library, SDGym, enables users to test their own models against our open core models on a fixed set of datasets.

We want to take this moment to share a few vital takeaways with our community: how we got here and, most importantly, our value system and how we will continue to support our users as we march toward the next 100 million downloads.

Synthetic data is filling a very important need for enterprises — and you can see it in our download trajectories.

Data is the lifeblood of AI, business analytics, and digital transformation. Synthetic data is unlocking the potential of these technologies and initiatives by providing data where it would otherwise be unavailable or restricted. We see this day in and day out as users apply SDV to an ever-widening variety of use cases, ranging from performance testing Extract-Transform-Load (ETL) pipelines (which create aggregate data to feed into reports and dashboards for critical, timely decision-making) to addressing training data shortages for AI models[3].

This growing need for synthetic data is reflected in the growth of SDV Community downloads over the past few years, including 5x growth in 2024. The growth is especially stark when we compare Q1 of 2021 with Q1 of 2025.

Figure: SDV Downloads

Building a full, enterprise-grade ecosystem for generative modeling of tabular data is a tall order.

When we started building the SDV, we quickly realized that modeling alone couldn't fulfill our users' needs. To be a full-service product, we needed to be able to:

  • Create synthetic data for various data modalities within tabular data, including time series, single table, and multi-table data.
  • Process various data types for generative modeling. To do this, we created and doubled down on the reversible data transforms library. This library transforms any data type into numeric data, which a generative model can then work with.
  • Provide a number of modeling options. With models, the no-free-lunch theorem applies[4] — that is, no modeling technique can win across all the vital axes of quality, performance, privacy and efficacy. It became essential that we innovate and provide a variety of modeling options.
  • Provide a benchmarking suite. Often, big enterprises have teams that want to develop their own modeling techniques—and why shouldn’t they? We want to enable enterprises to build new modeling approaches, or enhance current ones that exist within the SDV. To enable this, we created and released our benchmarking framework, which we use internally to test our models and compare them with others.
  • Provide a library to measure the efficacy of synthetic data across various axes, including quality, utility, and diagnostics (for format, structure and so much more).
  • Expand into SDV Enterprise, to develop features that enterprise-grade datasets need.
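
The reversible-transform idea from the list above can be sketched in a few lines. This is an illustration of the concept only, not the API of our transforms library: the class name `ReversibleLabelEncoder` and its methods are hypothetical, and it assumes a single categorical column whose values are all seen during fitting.

```python
from typing import Dict, Hashable, List


class ReversibleLabelEncoder:
    """Hypothetical example: maps categories to numbers and back again."""

    def fit(self, values: List[Hashable]) -> "ReversibleLabelEncoder":
        # Give each distinct category an integer code, in first-seen order.
        self._to_code: Dict[Hashable, int] = {}
        for value in values:
            self._to_code.setdefault(value, len(self._to_code))
        self._to_value = {code: value for value, code in self._to_code.items()}
        return self

    def transform(self, values: List[Hashable]) -> List[float]:
        # Numeric representation that a generative model can work with.
        return [float(self._to_code[value]) for value in values]

    def reverse_transform(self, codes: List[float]) -> List[Hashable]:
        # Model output is rarely an exact code, so round to the nearest
        # integer and clip to the valid range before mapping back.
        top = len(self._to_value) - 1
        return [self._to_value[min(max(int(round(c)), 0), top)] for c in codes]


encoder = ReversibleLabelEncoder().fit(["a", "b", "a", "c"])
numeric = encoder.transform(["a", "b", "a", "c"])
restored = encoder.reverse_transform(numeric)
```

The point of the round trip is that whatever a model generates in numeric space can always be mapped back into the original data type.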

We’re particularly proud of the outsized contributions our team has made in this domain. They have defined this field, from creating the metadata standard for generative modeling, to identifying data modalities, to developing a robust evaluation library. At DataCebo, every library comes with its own documentation, and we continuously innovate, build new releases — 29 in the first 5 months of 2025 — and update documentation while maintaining full backwards compatibility. To deliver on this promise continuously for 4 years is a tall order, but our team does it.

The result: users are voting with their downloads.
The need for synthetic data is growing, and SDV's download numbers are growing alongside it. Many users use our main library, SDV, but they use other libraries as well. When we add up the downloads of all tabular synthetic data generation libraries, both open source non-commercial and open source/open core commercial, SDV accounts for an 80% share of the total downloads in this space. [5][6]

Figure: SDV Core downloads

Getting here took a lot of discipline

On our way to this milestone, we made a lot of decisions, some of which required the discipline to turn down short-sighted options. In making these choices, we ended up forming our core value system. Some of these values are reflected in how we build and maintain the ecosystem, and some in how we support our users. But all reflect our guiding principle: we want you to succeed in your synthetic data journey, and we are here to help. Here are a few critical decisions we made.

Changing to a business source license was a risk, but most enterprise users understood why we did it (we were playing the long game) and continued to use SDV Community.
In 2023, we changed to a business source license (BSL). This moved the library from a fully permissive open source MIT license to a more restrictive license[7]. We made it clear that anyone who is not competing directly with our products is welcome to use our community/open core libraries within their software. This was not a decision we made lightly. (We wrote about this here[8].)

So what led to this decision? As we worked to make SDV an enterprise-grade system, we started noticing other teams using our open source code, claiming to outperform our models, or wrapping our software to provide ad hoc synthetic data solutions. Despite this noise, our enterprise users were giving us feedback on a daily basis. The truth was that for enterprises to adopt synthetic data, many, many more modules and features needed to be built. Because synthetic data generation was still maturing as a field, and SDV was still maturing as a product, we switched to a BSL to prevent SDV from becoming a shortcut vehicle for other people who weren't playing the long game and thinking holistically.

While we wondered whether this change would keep our enterprise users from using SDV, now, almost 3 years later, we are happy to see so many small and large enterprises still on board. Downloads of the BSL versions of the library dwarf those of the pre-BSL versions. And our product has only gotten better, with 120 new releases of SDV Community.

Figure: Downloads of pre-BSL versions vs. BSL versions
We appreciate that the industry largely understood where we were coming from and that usage has continued to grow even with our new license. The plot above shows the downloads of the pre-BSL versions vs. the BSL-licensed versions. If you are using SDV Community and want to get official permission, you can use this form[9].

Not advertising on our support channels was also a risk, but it matches our guiding principles.
AI-generated synthetic data is a new technology. Many users are still at the stage of learning and proof-of-concept. When people using SDV Community come to our support channels, we want to make sure that they succeed with their projects. For this reason, we deliberately decided not to promote any of the coverage we are getting from media analysts in our support channels. Simply put, when you come to SDV's support channels, the whole focus is on making you successful.

Now that many other vendors have joined our channels, we enforce this standard across the board[10]. While we can monitor public channels, we cannot control private direct messages, and we are saddened to see that some vendors are promoting their products. Below, you can see an example of a vendor promoting a third-party evaluation that is favorable to their solution. One user with a tight project deadline had to handle this distraction: they had to read the report, decipher what it was trying to convey, and ultimately decide it was not relevant, all of which took time away from their project. While we cannot control this behavior, we have not wavered in our own commitment: no promotion in our support channels, no distractions, and a focus on you and your project. If you encounter any such advertising, please know that it does not have our endorsement.

Figure: Example of a vendor promoting a third-party evaluation favorable to their solution

We launched SDV Enterprise, but are committed to the SDV Community.
To address enterprise needs, we launched an enterprise-grade product, SDV Enterprise, in December 2023. But our commitment to keeping our open core version, SDV Community, and to innovating on it and improving it did not waver. Since the release of SDV Enterprise, we have made 77 releases of SDV Community.

With both products live, another important question arose: how do we choose which features go into SDV Community, and which go into SDV Enterprise? Although it might seem intuitive to funnel all improvements into SDV Enterprise, we again thought of our guiding principle: we want you to be successful in your journey towards using synthetic data. So when we find improvements (for instance, an encoder that drastically improves the quality of synthetic data for categorical variables), we release them in SDV Community[11]. Similarly, since December 2023, we have released 14 metrics in our open source SDMetrics library, sticking to our value that synthetic data evaluation should be open source, and that metrics and their calculations should be transparent.

Ten million thank yous to all of our users out there!
We're so proud when we see all the things our users have done with our product — winning competitions with SDV, building platforms for internal use around SDV, kickstarting their careers and internships using SDV, and making SDV the cornerstone of their online courses on practical applications of generative AI, to name just a few. We thank you all for putting your trust in us and sticking by us as we navigate our way to building the largest and most powerful system out there for synthetic data generation.


How we calculate our download metrics

Installing the main SDV library also installs all the other libraries. To calculate our total downloads, we:

  • Calculate the exclusive downloads for each library by subtracting the number of overall SDV downloads from that library's downloads. For example, SDMetrics has 3.07M downloads and SDV has 2.28M downloads, meaning the exclusive downloads for SDMetrics add up to 3.07M - 2.28M = ~790K.
  • Add up all the exclusive downloads.
  • Add in the downloads of the main SDV library.
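
The three steps above can be sketched as a short calculation. This is a minimal illustration, not the script we actually run: the function names are hypothetical, only the SDMetrics and SDV figures come from this post, and it assumes every library passed in is one that SDV itself depends on (so that subtracting SDV's downloads is meaningful).

```python
from typing import Dict


def exclusive_downloads(library_total: int, sdv_total: int) -> int:
    """Downloads of a library that did not come from installing SDV."""
    return library_total - sdv_total


def total_downloads(sdv_total: int, library_totals: Dict[str, int]) -> int:
    """Main SDV downloads plus the exclusive downloads of each other library."""
    exclusive = sum(
        exclusive_downloads(count, sdv_total) for count in library_totals.values()
    )
    return sdv_total + exclusive


# Figures quoted in this post; counts for the other libraries are not listed
# here, so they are left out rather than guessed.
SDV_TOTAL = 2_280_000
SDMETRICS_TOTAL = 3_070_000

sdmetrics_exclusive = exclusive_downloads(SDMETRICS_TOTAL, SDV_TOTAL)
print(f"SDMetrics exclusive downloads: {sdmetrics_exclusive:,}")  # 790,000
```

Passing the full per-library totals to `total_downloads` would give the overall figure; counting exclusive downloads avoids double-counting installs that were pulled in as SDV dependencies.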

We reached 10 million total downloads on January 12, 2025.

Figure: Installing SDV also installs all the other libraries, since SDV depends on them. Installing SDGym also installs SDV, since SDGym depends on it. We calculate independent downloads for each library.