
MAPFRE: improving detection of homeowner insurance fraud by 31 percent with synthetic data

July 10, 2025
Applications

+31% more fraud cases caught

$310K in estimated annual savings per 100 fraud reports


Every year, insurance fraud costs the U.S. more than $300 billion, a figure that may actually be an underestimate, since fraudsters’ methods often evade detection by insurance companies.

To reduce potential losses, the insurance company MAPFRE created an AI-driven detection system that analyzes historical data to flag suspicious claims, which its agents can then investigate in more detail.

After first using the system to flag fraudulent auto claims, MAPFRE turned its attention to homeowner (HO) insurance fraud, in which false, inflated, or misrepresented claims are submitted on a homeowner’s or renter’s policy for financial gain. (For example, a fraudster may forge a homeowner's name on a deed to a home.)

The team discovered an unfortunate but inevitable paradox about property fraud: compared to other types of fraud, it is generally both costlier and rarer, which meant that MAPFRE did not initially have enough examples of it to use as training data for its AI detection system.

Because the researchers couldn’t use conventional data-imbalance solutions such as under-sampling, they opted to supplement their sizable datasets with additional AI-generated synthetic data that retains the qualities of the original data.
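To make the contrast concrete, here is a minimal sketch of the two remedies on class labels alone. The counts, the 20% target share, and the function names are hypothetical illustrations, not MAPFRE's data or pipeline, and the source does not say why under-sampling was ruled out. The sketch only shows that under-sampling balances classes by discarding real rows, while augmentation keeps every real row and adds synthetic ones.

```python
def fraud_share(labels):
    """Fraction of rows labeled as fraud (1 = fraud, 0 = legitimate)."""
    return sum(labels) / len(labels)


def undersample(labels, target_share):
    """Keep every fraud row and drop legitimate rows until fraud
    makes up target_share of the data."""
    n_fraud = sum(labels)
    n_legit_kept = round(n_fraud * (1 - target_share) / target_share)
    return [1] * n_fraud + [0] * n_legit_kept


def augment(labels, n_synthetic_fraud):
    """Keep every real row and append synthetic fraud rows."""
    return list(labels) + [1] * n_synthetic_fraud


# Hypothetical training set: 980 legitimate claims, 20 fraudulent ones.
real = [0] * 980 + [1] * 20

under = undersample(real, target_share=0.2)
augmented = augment(real, n_synthetic_fraud=225)

for name, labels in [("real", real), ("under-sampled", under), ("augmented", augmented)]:
    print(f"{name:>13}: {len(labels):>5} rows, fraud share {fraud_share(labels):.3f}")
```

In this toy setup both remedies reach the same fraud share, but under-sampling leaves only a small fraction of the original rows to train on.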

Using synthetic data improved the fraud detection rate by 31 percent

By augmenting the real data with synthetic data, the team increased recall by 31%, improving detection rates, while also increasing precision by 0.85%, which helps reduce the investigation time spent on false positives. Improvements in recall usually come at the cost of precision, so gaining on both is unusual. The team ran a comprehensive set of experiments that varied how much synthetic data was added and what type of synthetic data was added.
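For readers less familiar with these metrics, the sketch below computes recall and precision from confusion-matrix counts and compares a baseline model with an augmented one. All counts are hypothetical and chosen only for illustration; they are not MAPFRE's results. The sketch reports the change in relative terms, which is an assumption, since the source does not say whether its percentages are relative changes or percentage points.

```python
def recall(tp, fn):
    """Share of actual fraud cases that the model flagged."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Share of flagged claims that were actually fraud."""
    return tp / (tp + fp)


def relative_change(before, after):
    """Change relative to the baseline value, e.g. 0.10 means +10%."""
    return (after - before) / before


# Hypothetical confusion-matrix counts on the same test set of 100 fraud cases.
baseline = {"tp": 40, "fp": 60, "fn": 60}
augmented = {"tp": 52, "fp": 76, "fn": 48}

recall_before = recall(baseline["tp"], baseline["fn"])
recall_after = recall(augmented["tp"], augmented["fn"])
precision_before = precision(baseline["tp"], baseline["fp"])
precision_after = precision(augmented["tp"], augmented["fp"])

print(f"recall:    {recall_before:.3f} -> {recall_after:.3f} "
      f"({relative_change(recall_before, recall_after):+.1%})")
print(f"precision: {precision_before:.3f} -> {precision_after:.3f} "
      f"({relative_change(precision_before, precision_after):+.1%})")
```

Recall rises when more true fraud cases are flagged; precision only holds or improves if the extra flags do not bring a disproportionate number of false positives with them.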

By reducing false positives and more accurately identifying fraudulent claims, the team estimated that they could save roughly $310,000 per 100 fraudulent claims, a finding that spurred them to put the solution into production.

The findings were presented in a Medium blog post written by Mireia Rojo Arribas, Head of Advanced Analytics, alongside data scientists Faxi Yuan and Hamed Farahmand. To develop the detection model, the researchers incorporated claim and policy data, graph-based data on the interconnections between claims, geocode data, and even weather data.

“MAPFRE demonstrates how synthetic data can be used to boost AI model performance by providing additional training data,” the authors wrote in the paper. “This is especially important when the real data is limited in size or imbalanced/not diverse enough, like it was for MAPFRE.”

An AI-based approach to generating synthetic data helps deliver higher-quality synthetic data

The practice of synthetic data generation has grown in recent years thanks to AI approaches such as generative adversarial networks (GANs), which improve through an alternating pattern of generating data and discriminating between real and fake data. However, standard GANs are limited in their ability to generate tabular data, because (A) continuous values are non-Gaussian in their distribution (versus, say, image data), and (B) categorical variables have highly imbalanced distributions that make the minority class more likely to be ignored during training.

To overcome these issues, the MAPFRE team used the Synthetic Data Vault’s CTGAN model (from DataCebo), which introduces a conditional vector based on one of the categorical variables and feeds it to the generator alongside the randomly generated noise vector, so that the generator learns the data's patterns and distribution. (For example, if there are two categorical variables, loss city and claimant type, each is encoded with the one-hot encoding method.)
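The conditional vector can be illustrated with a small, dependency-free sketch. It follows the general idea described for CTGAN (select one categorical column, select one of its categories with rare categories boosted, mark that category in a one-hot style vector, and append it to the noise vector), but it is not the SDV implementation. The column names come from the example above; the category values, frequencies, noise dimension, and function names are hypothetical.

```python
import math
import random


def build_conditional_vector(frequencies, column, value):
    """Concatenate one-hot slots for every category of every column,
    with a single 1 at the selected (column, value) pair."""
    return [
        1.0 if (col == column and val == value) else 0.0
        for col, values in frequencies.items()
        for val in values
    ]


def sample_condition(frequencies, rng):
    """Pick a column uniformly, then a category weighted by log-frequency,
    so rare categories are sampled more often than their raw share."""
    column = rng.choice(list(frequencies))
    values = list(frequencies[column])
    # Assumption: log(count + 1) weighting, used here for illustration.
    weights = [math.log(frequencies[column][v] + 1) for v in values]
    value = rng.choices(values, weights=weights, k=1)[0]
    return column, value


# Hypothetical category counts for the two columns named in the text.
frequencies = {
    "loss_city": {"Boston": 900, "Worcester": 90, "Springfield": 10},
    "claimant_type": {"owner": 950, "renter": 50},
}

rng = random.Random(0)
noise_dim = 8  # hypothetical size of the random noise vector

column, value = sample_condition(frequencies, rng)
cond = build_conditional_vector(frequencies, column, value)
noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]

# The generator would receive the noise and the condition together.
generator_input = noise + cond
print(column, value, cond)
```

Because categories are chosen by log-frequency rather than raw frequency, a category such as a rarely seen loss city is requested from the generator far more often than it appears in the data, which is how this approach keeps minority categories from being ignored during training.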

Arribas and her colleagues used both CTGAN and tools from other third-party vendors to improve the accuracy of the fraud detection model, ultimately finding that CTGAN yielded slightly better performance.

“This project can be considered a showcase of the capability of synthetic data for improving performance of AI models and encourage the expansion of applying synthetic data in other predictive models’ use cases,” the authors concluded. “While the improvement is not guaranteed, the use of synthetic data can be recommended when facing data imbalance and small data size.”

Read the full Medium post from MAPFRE


© 2025, DataCebo, Inc.