SDV (Synthetic Data Vault) is a Python library that uses machine learning to generate synthetic data that maintains the statistical properties of the original data while ensuring privacy.
Install with pip:

```shell
pip install sdv
```

Install with conda:

```shell
conda install -c pytorch -c conda-forge sdv
```
Maintain Data Relationships Automatically with GaussianCopulaSynthesizer
Motivation
When generating synthetic data, maintaining the real-world relationships between columns is essential for creating useful datasets for analysis, modeling, and testing. Without preserving these relationships, synthetic data may lead to incorrect insights or non-functional test systems.
Imagine trying to generate synthetic hotel guest data where room types should correlate with room rates. If these relationships aren’t preserved, you might end up with luxury suites priced cheaper than standard rooms, creating unrealistic patterns.
```python
import pandas as pd
import numpy as np

# Create synthetic hotel data with random values
np.random.seed(42)
n_samples = 100

# Create room types and assign random rates without preserving relationships
room_types = np.random.choice(["BASIC", "DELUXE", "SUITE"], size=n_samples)

# Random rates that don't correlate with room types
room_rates = np.random.uniform(100, 500, size=n_samples)

# Create a DataFrame
hotel_data = pd.DataFrame({"room_type": room_types, "room_rate": room_rates})

# Check average price by room type
hotel_data.groupby("room_type")["room_rate"].mean().sort_values()
```
As we can see, with random generation, there’s no meaningful relationship between room types and room rates. The SUITE room might cost less than a BASIC room, which doesn’t reflect reality. For accurate analysis and testing, you’d need to manually implement complex rules to enforce these relationships.
Preserving Column Relationships with GaussianCopulaSynthesizer
The GaussianCopulaSynthesizer in SDV automatically learns and preserves the statistical relationships between columns, allowing you to generate realistic synthetic data without manually coding complex rules.
Let’s use the GaussianCopulaSynthesizer to maintain these relationships. First, we’ll load the demo data that will serve as the real data for training:
```python
from sdv.datasets.demo import download_demo

real_data, metadata = download_demo(
    modality="single_table", dataset_name="fake_hotel_guests"
)
real_data.info()
```
Now let’s create and train a GaussianCopulaSynthesizer to learn these relationships:
```python
from sdv.single_table import GaussianCopulaSynthesizer

# Create and fit the synthesizer
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)

# Generate synthetic data
synthetic_data = synthesizer.sample(100)

# Check if the relationships are preserved
print("Synthetic data average prices by room type:")
synthetic_data.groupby("room_type")["room_rate"].mean().sort_values()
```
The generated synthetic data maintains expected price patterns, with DELUXE and SUITE room types showing higher average rates compared to BASIC rooms.
Validate Synthetic Data Integrity with SDV Diagnostic
Motivation
Data validation is a critical step in the synthetic data generation process. It ensures that the synthetic data maintains the same structure, constraints, and characteristics as the real data before deploying models trained on it.
When working with synthetic datasets, detecting issues like incorrect data types, out-of-range values, or broken constraints can be challenging without proper validation tools.
To demonstrate this, let’s load the hotel guests demo data:
```python
from sdv.datasets.demo import download_demo
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.evaluation.single_table import run_diagnostic

# Load the hotel guests demo data
real_data, metadata = download_demo(
    modality="single_table", dataset_name="fake_hotel_guests"
)
```
Now we’ll create synthetic data using the GaussianCopulaSynthesizer:
```python
# Create and fit the synthesizer
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)

# Generate synthetic data
synthetic_data = synthesizer.sample(num_rows=100)

# Examine the first few rows of synthetic data
synthetic_data.head()
```
|   | guest_email | has_rewards | room_type | amenities_fee | checkin_date | checkout_date | room_rate | billing_address | credit_card_number |
|---|---|---|---|---|---|---|---|---|---|
| 0 | dsullivan@example.net | False | BASIC | 0.29 | 27 Mar 2020 | 09 Mar 2020 | 135.15 | 90469 Karla Knolls Apt. 781\nSusanberg, CA 70033 | 5161033759518983 |
| 1 | steven59@example.org | False | DELUXE | 8.15 | 07 Sep 2020 | 25 Jun 2020 | 183.24 | 6108 Carla Ports Apt. 116\nPort Evan, MI 71694 | 4133047413145475690 |
| 2 | brandon15@example.net | False | BASIC | 11.65 | 22 Mar 2020 | 01 Apr 2020 | 163.57 | 86709 Jeremy Manors Apt. 786\nPort Garychester... | 4977328103788 |
| 3 | humphreyjennifer@example.net | False | BASIC | 48.12 | 04 Jun 2020 | 14 May 2020 | 127.75 | 8906 Bobby Trail\nEast Sandra, NY 43986 | 3524946844839485 |
| 4 | joshuabrown@example.net | False | DELUXE | 11.07 | 08 Jan 2020 | 13 Jan 2020 | 180.12 | 732 Dennis Lane\nPort Nicholasstad, DE 49786 | 4446905799576890978 |
Create a copy of the data with intentional problems:
```python
problematic_data = real_data.copy()

# Introduce duplicate primary keys (should be unique)
problematic_data.loc[5, "guest_email"] = problematic_data.loc[0, "guest_email"]

# Add an out-of-range value for a numeric column
problematic_data.loc[10, "room_rate"] = problematic_data["room_rate"].max() * 2

# Add an invalid category for a categorical column
problematic_data.loc[15, "room_type"] = "NonExistentRoomType"

# Check for these issues manually
print(
    f"Number of unique guest emails: {problematic_data['guest_email'].nunique()} "
    f"(should equal {len(problematic_data)})"
)
print(
    f"Max room rate: {problematic_data['room_rate'].max()} "
    f"(should be at most {real_data['room_rate'].max()})"
)
print(f"Unique room types: {problematic_data['room_type'].unique()}")
```
Number of unique guest emails: 499 (should equal 500)
Max room rate: 849.68 (should be at most 424.84)
Unique room types: ['BASIC' 'DELUXE' 'NonExistentRoomType' 'SUITE']
This manual validation is tedious and error-prone. We need to write custom checks for each potential issue, and it’s easy to miss subtle problems that could impact downstream applications of the synthetic data.
Diagnostic
The diagnostic functionality in SDV provides an automated way to validate synthetic data against the original data, ensuring that basic structural and content requirements are met before using the synthetic data.
Let’s run the diagnostic to check if our problematic synthetic data meets all the basic requirements:
# Run the diagnosticdiagnostic_report = run_diagnostic( real_data=real_data, synthetic_data=problematic_data, metadata=metadata)
Generating report ...

(1/2) Evaluating Data Validity
Data Validity Score: 99.91%

(2/2) Evaluating Data Structure
Data Structure Score: 100.0%

Overall Score (Average): 99.96%
The diagnostic report provides a comprehensive assessment of the synthetic data’s validity. The report checks two main categories:
Data Validity: Ensures primary keys are unique and non-null, continuous values stay within the original range, and categorical values match the original categories
Data Structure: Verifies that column names are consistent between real and synthetic data
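Under the hood, these are checks you could express yourself in plain pandas. A minimal sketch with toy stand-in frames (column names borrowed from the hotel dataset, values invented for illustration) shows the three validity checks the diagnostic automates:

```python
import pandas as pd

# Toy "real" and "synthetic" frames standing in for the hotel data
real = pd.DataFrame({
    "guest_email": ["a@x.com", "b@x.com", "c@x.com"],
    "room_type": ["BASIC", "DELUXE", "SUITE"],
    "room_rate": [120.0, 200.0, 350.0],
})
synthetic = pd.DataFrame({
    "guest_email": ["d@x.com", "d@x.com", "e@x.com"],  # duplicate key
    "room_type": ["BASIC", "PENTHOUSE", "SUITE"],      # invalid category
    "room_rate": [130.0, 700.0, 300.0],                # out of range
})

# Key uniqueness: primary keys must not repeat
key_ok = synthetic["guest_email"].is_unique

# Category adherence: every synthetic category must exist in the real data
cat_ok = synthetic["room_type"].isin(real["room_type"].unique()).all()

# Boundary adherence: continuous values must stay within the real min/max
low, high = real["room_rate"].min(), real["room_rate"].max()
bound_ok = synthetic["room_rate"].between(low, high).all()

print(key_ok, cat_ok, bound_ok)  # False False False
```

Writing and maintaining such checks for every column is exactly the work `run_diagnostic` takes off your hands.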
We can examine the details of the diagnostic report to get insights about specific columns:
```python
# Print detailed results for data validity
validity_details = diagnostic_report.get_details(property_name="Data Validity")
validity_details
```
|   | Column | Metric | Score |
|---|---|---|---|
| 0 | guest_email | KeyUniqueness | 0.998 |
| 1 | has_rewards | CategoryAdherence | 1.000 |
| 2 | room_type | CategoryAdherence | 0.998 |
| 3 | amenities_fee | BoundaryAdherence | 1.000 |
| 4 | checkin_date | BoundaryAdherence | 1.000 |
| 5 | checkout_date | BoundaryAdherence | 1.000 |
| 6 | room_rate | BoundaryAdherence | 0.998 |
```python
# Print detailed results for data structure
structure_details = diagnostic_report.get_details(property_name="Data Structure")
print("\nStructure details:")
structure_details
```
Structure details:
|   | Metric | Score |
|---|---|---|
| 0 | TableStructure | 1.0 |
The output clearly identifies the specific issues in our problematic synthetic data:
KeyUniqueness score below 1.0 for guest_email indicates duplicate primary keys
CategoryAdherence score below 1.0 for room_type shows invalid categories
BoundaryAdherence score below 1.0 for room_rate reveals out-of-range values
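You can also act on these scores programmatically. A small sketch, using a hand-built frame that mimics the `get_details` output shown above (the `Column`/`Metric`/`Score` layout is taken from that output), filters for the metrics that flag a problem:

```python
import pandas as pd

# A frame mimicking diagnostic_report.get_details(property_name="Data Validity");
# column names and values copied from the report output shown above
validity_details = pd.DataFrame({
    "Column": ["guest_email", "has_rewards", "room_type", "room_rate"],
    "Metric": ["KeyUniqueness", "CategoryAdherence", "CategoryAdherence", "BoundaryAdherence"],
    "Score": [0.998, 1.000, 0.998, 0.998],
})

# Any metric scoring below 1.0 points at a concrete integrity problem
failing = validity_details[validity_details["Score"] < 1.0]
print(failing[["Column", "Metric"]].to_string(index=False))
```

This kind of filter makes it easy to fail a data pipeline automatically whenever any validity metric drops below a chosen threshold.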
Using the diagnostic report before deploying synthetic data helps prevent downstream issues in applications, models, or analyses that would use this data, saving time and preventing potentially costly errors.
Preserve Data Integrity with Powerful Constraints
Motivation
Constraints are essential for ensuring your synthetic data follows the same business logic and rules as your real data. Without proper constraint implementations, synthetic data may generate technically valid but logically impossible values - such as employees whose current age is less than their age when they joined the company or negative years of experience.
When generating synthetic data, it’s often challenging to maintain logical relationships between columns without explicit rules.
```python
import pandas as pd
import numpy as np

# Generate synthetic data with no constraints - could create logically impossible data
np.random.seed(1)
bad_synthetic_data = pd.DataFrame(
    {
        "age": np.random.randint(25, 60, size=5),
        "age_when_joined": np.random.randint(22, 50, size=5),
    }
)
print("Example of synthetic data without constraints:")
print(bad_synthetic_data)
print(
    "\nNumber of logically invalid records (age < age_when_joined):",
    sum(bad_synthetic_data["age"] < bad_synthetic_data["age_when_joined"]),
)
```
Example of synthetic data without constraints:
age age_when_joined
0 37 37
1 33 22
2 34 38
3 36 23
4 30 34
Number of logically invalid records (age < age_when_joined): 2
This example demonstrates a common problem when generating synthetic data: without constraints, we’ve created employee records where the current age is less than the age when the employee joined the company, which is logically impossible in real employee data.
Constraints
The Constraints feature in SDV enables you to enforce logical rules on your synthetic data, ensuring it follows the same business logic as your real data. This powerful feature ensures your synthetic data is not just statistically similar to real data but also logically valid.
Let’s see how we can use constraints to enforce valid age relationships in our synthetic data:
First, we’ll create our synthesizer and add an inequality constraint to ensure current age is always greater than or equal to age when joined:
```python
from sdv.datasets.demo import download_demo
from sdv.single_table import GaussianCopulaSynthesizer

# Load the fake companies demo data
real_data, metadata = download_demo(
    modality="single_table", dataset_name="fake_companies"
)

# Create synthesizer
synthesizer = GaussianCopulaSynthesizer(metadata)

# Define an inequality constraint
age_constraint = {
    "constraint_class": "Inequality",
    "constraint_parameters": {
        "low_column_name": "age_when_joined",
        "high_column_name": "age",
    },
}

# Add constraint to synthesizer
synthesizer.add_constraints([age_constraint])

# Train the synthesizer
synthesizer.fit(real_data)

# Generate synthetic data
synthetic_data = synthesizer.sample(num_rows=10)
print("Generated synthetic data with constraints:")
synthetic_data[["age", "age_when_joined"]]
```
```python
print(
    "Number of logically invalid records (age < age_when_joined):",
    sum(synthetic_data["age"] < synthetic_data["age_when_joined"]),
)
```
Number of logically invalid records (age < age_when_joined): 0
The output highlights how the SDV constraints feature automatically enforces the rule during the data generation process.
Using constraints allows you to define complex business rules - from simple inequalities like age relationships to more complex logic like conditional values or fixed combinations - ensuring your synthetic data is not only statistically similar but also logically valid according to your domain-specific rules.
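As a sketch of that flexibility, here is what a FixedCombinations constraint could look like, written in the same dictionary format as the Inequality example above. The column names here are illustrative, not taken from the fake_companies dataset:

```python
# A FixedCombinations constraint keeps the observed pairings of columns intact:
# only combinations of values seen together in the real data may appear in the
# synthetic data. The dictionary shape mirrors the Inequality constraint above.
combo_constraint = {
    "constraint_class": "FixedCombinations",
    "constraint_parameters": {
        "column_names": ["room_type", "has_rewards"],  # illustrative column names
    },
}

# It would be registered the same way as the inequality constraint:
# synthesizer.add_constraints([combo_constraint])
print(combo_constraint["constraint_class"])
```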
Anonymize Sensitive Data Securely with Preprocessing
Motivation
Preprocessing in SDV allows users to anonymize or pseudo-anonymize sensitive data.
This feature is crucial for creating synthetic data that can be shared or analyzed without exposing sensitive details.
Handling sensitive data directly poses risks of privacy breaches or non-compliance with data protection laws.
```python
import pandas as pd
from sdv.datasets.demo import download_demo

# Load demo data
real_data, metadata = download_demo(
    modality="single_table", dataset_name="fake_hotel_guests"
)

# Display a sample of the real data
print("Real data sample:")
real_data[["guest_email", "credit_card_number", "billing_address"]].head()
```
Real data sample:
|   | guest_email | credit_card_number | billing_address |
|---|---|---|---|
| 0 | michaelsanders@shaw.net | 4075084747483975747 | 49380 Rivers Street\nSpencerville, AK 68265 |
| 1 | randy49@brown.biz | 180072822063468 | 88394 Boyle Meadows\nConleyberg, TN 22063 |
| 2 | webermelissa@neal.com | 38983476971380 | 0323 Lisa Station Apt. 208\nPort Thomas, LA 82585 |
| 3 | gsims@terry.com | 4969551998845740 | 77 Massachusetts Ave\nCambridge, MA 02139 |
| 4 | misty33@smith.biz | 3558512986488983 | 1234 Corporate Drive\nBoston, MA 02116 |
The example shows a dataset containing sensitive columns such as guest email, credit card numbers, and billing addresses. Without anonymization, sharing or analyzing such data directly could lead to data breaches or non-compliance with privacy regulations.
Preprocessing
The Preprocessing feature in SDV provides comprehensive tools to anonymize sensitive data while maintaining realistic synthetic data outputs. It uses transformers to replace sensitive values with anonymized or pseudo-anonymized equivalents.
To anonymize data, you can update transformers for specific columns to use the AnonymizedFaker or PseudoAnonymizedFaker classes, which generate fake but realistic substitutes for sensitive data.
Pseudo-anonymization maintains a connection between original sensitive data and synthetic replacements, allowing for reverse mapping when needed.
Anonymization, by contrast, is permanent and irreversible: synthetic values cannot be traced back to the originals.
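The distinction can be sketched in plain Python, independent of SDV: pseudo-anonymization keeps a lookup table that can be inverted, while full anonymization draws fresh values and discards any link. The email values below are invented for illustration:

```python
import uuid

emails = ["alice@real.com", "bob@real.com", "alice@real.com"]

# Pseudo-anonymization: the same original value always maps to the same
# replacement, and the mapping can be inverted when needed
mapping = {}
pseudo = []
for e in emails:
    if e not in mapping:
        mapping[e] = f"user{len(mapping)}@example.net"
    pseudo.append(mapping[e])

reverse = {fake: real for real, fake in mapping.items()}
print(pseudo)              # both "alice" rows share one fake address
print(reverse[pseudo[0]])  # recovers "alice@real.com"

# Full anonymization: fresh random values, no mapping retained
anonymous = [f"{uuid.uuid4().hex[:8]}@example.net" for _ in emails]
print(anonymous)
```

RDT's PseudoAnonymizedFaker and AnonymizedFaker implement these two behaviors for realistic fake data.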
Here’s an example of how to anonymize sensitive data:
First, the synthesizer auto-assigns transformers based on the data; we then update specific columns for anonymization.
```python
from sdv.single_table import GaussianCopulaSynthesizer
from rdt.transformers import AnonymizedFaker

# Create a synthesizer
synthesizer = GaussianCopulaSynthesizer(metadata)

# Automatically assign transformers
synthesizer.auto_assign_transformers(real_data)

# Update transformers for anonymization
synthesizer.update_transformers(
    column_name_to_transformer={
        "guest_email": AnonymizedFaker(
            provider_name="internet", function_name="email", cardinality_rule="unique"
        ),
        "credit_card_number": AnonymizedFaker(
            provider_name="credit_card", function_name="credit_card_number"
        ),
        "billing_address": AnonymizedFaker(
            provider_name="address", function_name="address"
        ),
    }
)

# Fit the synthesizer to the real data
synthesizer.fit(real_data)

# Generate synthetic data
synthetic_data = synthesizer.sample(num_rows=5)
print("Synthetic data with anonymization:")
synthetic_data[["guest_email", "credit_card_number", "billing_address"]]
```
Synthetic data with anonymization:
|   | guest_email | credit_card_number | billing_address |
|---|---|---|---|
| 0 | dsullivan@example.net | 5161033759518983 | 90469 Karla Knolls Apt. 781\nSusanberg, CA 70033 |
| 1 | steven59@example.org | 4133047413145475690 | 6108 Carla Ports Apt. 116\nPort Evan, MI 71694 |
| 2 | brandon15@example.net | 4977328103788 | 86709 Jeremy Manors Apt. 786\nPort Garychester... |
| 3 | humphreyjennifer@example.net | 3524946844839485 | 8906 Bobby Trail\nEast Sandra, NY 43986 |
| 4 | joshuabrown@example.net | 4446905799576890978 | 732 Dennis Lane\nPort Nicholasstad, DE 49786 |
In this code:
The synthesizer.auto_assign_transformers(real_data) step automatically assigns appropriate transformers to all columns based on their data type, streamlining the preprocessing process.
The update_transformers step customizes the transformers for the guest_email, credit_card_number, and billing_address columns to use AnonymizedFaker.
The cardinality_rule='unique' parameter ensures that the generated fake email addresses are unique, maintaining the uniqueness constraint of the original data while anonymizing it.
This use of preprocessing ensures sensitive data is anonymized effectively, enabling safe data sharing and analysis.
Transform Data with RDT’s HyperTransformer
Motivation
Data scientists often grapple with inconsistent data formats, missing values, and non-numeric fields, which complicate preprocessing and hinder the application of machine learning models.
```python
# Import the demo dataset utility from SDV
from sdv.datasets.demo import download_demo
from sdv.metadata import Metadata

# Load demo hotel guests dataset and its metadata
hotel, metadata = download_demo(
    modality="single_table", dataset_name="fake_hotel_guests"
)

# Extract state abbreviation from billing address using regex
hotel["state"] = hotel["billing_address"].str.extract(r",\s*(\w{2})\s+\d+")

# Select only the state, amenities_fee, and has_rewards columns to focus the analysis
selected_columns = ["state", "amenities_fee", "has_rewards"]
hotel = hotel[selected_columns]

# Re-detect the metadata schema from the filtered dataframe
metadata = Metadata.detect_from_dataframe(hotel)

# Display first 5 rows of the processed dataset
hotel.head()
```
|   | state | amenities_fee | has_rewards |
|---|---|---|---|
| 0 | AK | 37.89 | False |
| 1 | TN | 24.37 | False |
| 2 | LA | 0.00 | True |
| 3 | MA | NaN | False |
| 4 | MA | 16.45 | False |
Check for missing values:
```python
hotel.isna().sum()
```
state 34
amenities_fee 45
has_rewards 0
dtype: int64
This dataset includes various data types: dates, booleans, categories, and numerics, some with missing values. Manually handling each column’s transformation can be tedious and error-prone.
HyperTransformer
The HyperTransformer automates preprocessing in just a few lines by detecting data types and applying the right transformations.
```python
from rdt import HyperTransformer
from rdt.transformers import LabelEncoder, BinaryEncoder

# Initialize HyperTransformer
ht = HyperTransformer()

# Automatically detect configuration based on data
ht.detect_initial_config(data=hotel)

# Fit the transformer to the data
ht.fit(hotel)

# Transform the data
transformed_data = ht.transform(hotel)
transformed_data.head()
```
|   | state | amenities_fee | has_rewards |
|---|---|---|---|
| 0 | 0.003770 | 37.890000 | 0.016369 |
| 1 | 0.010727 | 24.370000 | 0.285175 |
| 2 | 0.031839 | 0.000000 | 0.956822 |
| 3 | 0.199210 | 18.176066 | 0.793918 |
| 4 | 0.064229 | 16.450000 | 0.529275 |
Check for missing values:
```python
transformed_data.isna().sum()
```
state 0
amenities_fee 0
has_rewards 0
dtype: int64
The HyperTransformer converts all columns into numerical formats suitable for modeling, handling missing values and encoding categorical variables appropriately.
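For intuition, the missing-value handling can be sketched in plain pandas. This toy example mimics the imputation step with a simple mean fill; RDT's actual default fill strategy may differ, and the fee values are invented for illustration:

```python
import pandas as pd
import numpy as np

# A small amenities_fee-like column with one missing value
fees = pd.Series([37.89, 24.37, 0.00, np.nan, 16.45])

# Replace the missing entry so downstream models see a complete numeric column
filled = fees.fillna(fees.mean())
print(filled.isna().sum())  # 0
```

The transformed output above shows the same effect: the NaN in row 3 of amenities_fee has been replaced by a plausible numeric value.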
You can also customize the encoding method for specific columns by updating the transformer configuration manually:
```python
# Apply custom transformers to the state and has_rewards columns
ht.set_config(
    {
        "sdtypes": {"state": "categorical", "has_rewards": "boolean"},
        "transformers": {"state": LabelEncoder(), "has_rewards": BinaryEncoder()},
    }
)

# Fit the transformer to the data
ht.fit(hotel)

# Transform the data
transformed_data = ht.transform(hotel)
transformed_data.head()
```
|   | state | amenities_fee | has_rewards |
|---|---|---|---|
| 0 | 0 | 37.890000 | 0.0 |
| 1 | 1 | 24.370000 | 0.0 |
| 2 | 2 | 0.000000 | 1.0 |
| 3 | 3 | 18.176066 | 0.0 |
| 4 | 3 | 16.450000 | 0.0 |
To revert the transformed data back to its original format:
```python
# Reverse transform to original format
original_data = ht.reverse_transform(transformed_data)
original_data.head()
```
|   | state | amenities_fee | has_rewards |
|---|---|---|---|
| 0 | AK | 37.890000 | False |
| 1 | TN | 24.370000 | False |
| 2 | LA | 0.000000 | True |
| 3 | MA | 18.176066 | False |
| 4 | MA | 16.450000 | False |
This ensures that any synthetic or processed data can be interpreted in its original context, maintaining data integrity throughout the machine learning pipeline.
Transform Categorical Data with UniformEncoder
Motivation
Handling categorical data is a common challenge in data preprocessing. Many machine learning models and data synthesis tools require numerical inputs, but categorical columns often contain non-numeric values. Converting these columns into a numerical format while addressing imbalances is critical for accurate modeling and synthesis.
Traditional encoding methods, such as one-hot encoding or label encoding, can lead to high-dimensional data or fail to capture the underlying distribution of categories. This can result in synthetic data that disproportionately represents frequent categories while under-representing rare ones.
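A quick pandas sketch makes the trade-off concrete: one-hot encoding adds one column per category, while label encoding stays compact but imposes an arbitrary order on the categories. The degree values are borrowed from the dataset explored below:

```python
import pandas as pd

degrees = pd.Series(["Sci&Tech", "Comm&Mgmt", "Sci&Tech", "Others"], name="degree_type")

# One-hot encoding: one new column per category, so dimensionality grows
# with the number of distinct categories
one_hot = pd.get_dummies(degrees)
print(one_hot.shape)  # (4, 3)

# Label encoding: compact, but the integer codes imply an ordering
# (Comm&Mgmt=0, Others=1, Sci&Tech=2) that the categories don't actually have
labels = degrees.astype("category").cat.codes
print(labels.tolist())  # [2, 0, 2, 1]
```

Neither representation carries any information about how frequent each category is, which is the gap the UniformEncoder addresses.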
Let’s explore this issue using the degree_type column from the student_placements dataset.
```python
from sdv.datasets.demo import download_demo
from sdv.metadata import Metadata

# Load demo data
real_data, metadata = download_demo(
    modality="single_table", dataset_name="student_placements"
)
real_data = real_data[["degree_type"]]
metadata = Metadata.detect_from_dataframe(real_data)

# Display the first few rows of the dataset
real_data.head()
```
|   | degree_type |
|---|---|
| 0 | Sci&Tech |
| 1 | Sci&Tech |
| 2 | Comm&Mgmt |
| 3 | Sci&Tech |
| 4 | Comm&Mgmt |
```python
import matplotlib.pyplot as plt

# Plot the frequency of the original dataset
real_data["degree_type"].value_counts().plot(kind="bar", color="#03AFF1", alpha=0.7)
plt.title("Frequency of Degree Types")
plt.xlabel("Degree Type")
plt.ylabel("Frequency")
plt.xticks(rotation=45)
plt.show()
```
The bar chart shows the imbalanced distribution of the degree_type column, with some categories being significantly more frequent than others. This imbalance can lead to biased synthetic data generation if not addressed.
UniformEncoder
The UniformEncoder from the RDT library solves this problem by transforming categorical columns into a uniform numerical distribution. This ensures that the encoded values are evenly distributed, preserving the original data’s characteristics.
```python
from rdt import HyperTransformer
from rdt.transformers import UniformEncoder

# Use HyperTransformer to detect and apply transformations
transformer = HyperTransformer()
transformer.set_config(
    {
        "sdtypes": {"degree_type": "categorical"},
        "transformers": {"degree_type": UniformEncoder()},
    }
)

# Transform the data
transformed_data = transformer.fit_transform(real_data)
transformed_data.head()
```
|   | degree_type |
|---|---|
| 0 | 0.222982 |
| 1 | 0.228482 |
| 2 | 0.625732 |
| 3 | 0.073374 |
| 4 | 0.327050 |
The UniformEncoder assigns each category a sub-interval of [0, 1] sized by its frequency and replaces each value with a uniform draw from its category’s interval, so the encoded column is uniformly distributed while remaining exactly decodable.
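Conceptually, each category claims a slice of [0, 1] proportional to its frequency, and decoding just asks which slice a number falls into. A small self-contained sketch of this idea (not the RDT implementation itself, and with toy category counts):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
col = pd.Series(["A"] * 6 + ["B"] * 3 + ["C"] * 1)

# Each category gets a slice of [0, 1] proportional to its frequency:
# A -> [0.0, 0.6), B -> [0.6, 0.9), C -> [0.9, 1.0]
freqs = col.value_counts(normalize=True)
edges = np.concatenate([[0.0], freqs.cumsum().to_numpy()])
intervals = {cat: (edges[i], edges[i + 1]) for i, cat in enumerate(freqs.index)}

# Each value becomes a uniform draw from its category's slice
encoded = col.map(lambda c: rng.uniform(*intervals[c]))

# Decoding checks which slice a number falls into
decoded = pd.cut(encoded, bins=edges, labels=freqs.index, include_lowest=True)
print((decoded.astype(str) == col).all())
```

Because frequent categories own wider slices, the encoded values cover [0, 1] roughly uniformly while the category frequencies remain recoverable.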
Let’s compare the encoded values with the original categorical data:
```python
# Create a figure with two subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Plot the original distribution on the first subplot
real_data.value_counts(normalize=True).plot(
    kind="bar", color="#03AFF1", alpha=0.7, label="Original Data", ax=ax1
)
ax1.set_title("Original Distribution of degree_type")
ax1.set_ylabel("Proportion")

# Plot the encoded distribution on the second subplot
transformed_data.plot(
    kind="hist", bins=10, color="#01E0C9", alpha=0.7, label="Encoded Data", ax=ax2
)
ax2.set_title("Encoded Distribution of degree_type")
ax2.set_ylabel("Frequency")
ax2.set_xlabel("Encoded Values")
plt.show()
```
The first plot shows the imbalanced distribution of the original categorical data, while the second plot demonstrates how the UniformEncoder transforms the data into a uniform numerical distribution.
Next, we can use the transformed data for synthetic data generation.
```python
from sdv.single_table import GaussianCopulaSynthesizer

# Initialize and fit the synthesizer
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(transformed_data)

# Generate synthetic data
synthetic_data = synthesizer.sample(num_rows=100)

# Reverse transform the synthetic data
reversed_data = transformer.reverse_transform(synthetic_data)

# Display the reversed column
reversed_data.head()
```
|   | degree_type |
|---|---|
| 0 | Others |
| 1 | Sci&Tech |
| 2 | Comm&Mgmt |
| 3 | Sci&Tech |
| 4 | Comm&Mgmt |
The reverse_transform method converts the numerical values back to their original categorical form, making the synthetic data interpretable.
Finally, we can evaluate the quality of the synthetic data.
```python
from sdv.evaluation.single_table import get_column_plot

# Visualize the distribution of the real and synthetic data
fig = get_column_plot(
    real_data=real_data,
    synthetic_data=reversed_data,
    metadata=metadata,
    column_name="degree_type",
)
fig.show()
```
This plot compares the distribution of the degree_type column in the real and synthetic datasets, demonstrating how well the UniformEncoder preserves the original data’s characteristics.