Data continues to be an integral part of today’s world, especially in the daily interactions between humans and machines, and the demand for more of it keeps growing.

Data scientists across organizations require access to large volumes of data to train and deploy cutting-edge Machine Learning and Deep Learning models that solve challenging problems.

State-of-the-art Machine Learning models need large volumes of high-quality, labeled data to train accurately. However, it is time-consuming and expensive to manually annotate large amounts of data with millions of attributes per data point.

In this article, we will cover:

  1. Synthetic Data, Defined
  2. The Importance of Synthetic Data
  3. How to Generate Synthetic Data
  4. The Challenges in Synthetic Data
  5. Real-world Applications of Synthetic Data

Synthetic Data, Defined

Synthetic data is data that is generated artificially, often by Machine Learning algorithms, to mimic real-world patterns. Despite being artificial, it reflects real-world data mathematically and statistically.

Synthetic data is often used as a substitute when suitable real-world data is unavailable – for instance, augmenting a limited dataset with additional examples. In other cases where real-world data cannot be used due to privacy concerns or compliance risks, synthetic data helps address the issues with data collection, annotation, and quality assurance.

Synthetic Data Generation Flow

The Importance of Synthetic Data

To data scientists, whether data is real or synthetic matters less than the characteristics and patterns inside it – its quality, balance, and bias. Synthetic data allows you to optimize and enrich your data, unlocking several key benefits.

Data quality and diversity

Real-world data is hard and expensive to source. Synthetic data generation increases confidence in the data’s quality, variety, and balance. From auto-completing missing values to automated labeling, it dramatically increases the reliability and accuracy of your data and, in turn, the accuracy of your predictions.

Synthetic data can also take the form of video, images, or audio: media rendered artificially with properties close enough to real-world data.

None of these individuals are real. These synthetic images were generated by StyleGAN2 (Dec 2019), a Generative Adversarial Network from the work of Karras et al. at Nvidia.

Similarly, there are well-known models that generate high-quality, authentic text using transformer-based language modeling architectures. GPT-3 falls into a category of Deep Learning models called Large Language Models: neural networks trained on a colossal amount of text.

The GPT-3 model generated this Shakespeare-like text after training on original texts. 

Source: GPT-3 Creative Fiction.
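To illustrate the same idea in code, here is a minimal sketch of transformer-based text generation. GPT-3 itself is only accessible through OpenAI’s API, so this sketch substitutes the openly available GPT-2 model via the Hugging Face transformers library; the prompt is an invented example.

```python
from transformers import pipeline

# Load a small, pretrained transformer language model (GPT-2) for text generation.
generator = pipeline("text-generation", model="gpt2")

# Prompt the model and let it continue the text; the output is synthetic text
# statistically similar to the data the model was trained on.
prompt = "Shall I compare thee to a summer's day?"
outputs = generator(prompt, max_length=60, num_return_sequences=1)
print(outputs[0]["generated_text"])
```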

Scalability

Many data scientists supplement their real-world records with synthetic data, rapidly scaling up existing data – or just the relevant subsets of this data – to create more meaningful observations and trends.

Quality Control

Once the data requirements are clear and the generation algorithm is well tested, producing data according to the constraints is quite easy. Synthetic data generation lets you control how the resulting data is structured, formatted, and labeled. That means a ready-to-use source of high-quality, dependable data is just a few clicks away.

For certain ML applications, it’s easier to create synthetic data than to collect and annotate real data.

  • Generate as much synthetic data as you need;
  • Generate data that may be hard to collect;
  • Automate annotation.

How to Generate Synthetic Data

To generate synthetic data, scientists must create a robust model that mimics a real dataset. They can generate realistic synthetic data points based on the probability that certain data points occur in the real dataset.
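Before turning to neural networks, a simple non-neural sketch makes the idea concrete: fit a probability model (here a multivariate normal distribution) to a real tabular dataset and sample new rows from it. The toy “real” data below is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Stand-in for a real tabular dataset, e.g., (height in cm, weight in kg).
real_data = rng.normal(loc=[170.0, 70.0], scale=[8.0, 12.0], size=(1000, 2))

# Estimate the distribution of the real data.
mean = real_data.mean(axis=0)
cov = np.cov(real_data, rowvar=False)

# Sample synthetic rows with the same statistical structure as the real data.
synthetic_data = rng.multivariate_normal(mean, cov, size=1000)
```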

Neural networks are especially adept at learning and generalizing an underlying data distribution. This enables a neural network architecture to create data points that are similar (but not identical) to samples from the original distribution. Here are a few state-of-the-art neural techniques used to generate synthetic data.

Once such a latent space has been developed, you can sample points from it, either deliberately or at random, and, by mapping them to image space, generate images that have never been seen before.

Using Generative Adversarial Networks (GANs)

Generative modeling is an unsupervised learning task in Machine Learning where the goal is to find the hidden patterns in the input data and produce plausible new samples with characteristics similar to the input data.

The GAN model architecture involves two sub-models: a generator model for generating new examples and a discriminator model for classifying whether generated examples are real (from the domain) or fake (generated by the generator model).

  • Discriminator: Model that learns how to classify input as real (from the domain) or fake (generated).
  • Generator: Model that generates new similar images from the problem domain.

How GANs pit two networks against each other to create high-quality synthetic data

Algorithm using GANs to generate synthetic data:

  1. Input: A random noise to the generator module.
  2. The generator produces a fake data sample and passes it to the discriminator for evaluation against real-world data.
  3. The discriminator evaluates the generated data sample and assigns it a real or fake label. This information is used to train the generator.
  4. The model training continues until the discriminator can no longer distinguish between real and fake data samples, as in the minimal sketch below.
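To make those steps concrete, here is a minimal PyTorch sketch of the GAN training loop on a toy two-dimensional dataset. The network sizes, learning rates, and toy data are illustrative assumptions, not the configuration of any particular production system.

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 8, 2

# Generator: maps random noise to candidate (fake) data samples.
generator = nn.Sequential(
    nn.Linear(latent_dim, 32), nn.ReLU(),
    nn.Linear(32, data_dim),
)
# Discriminator: outputs the probability that a sample is real.
discriminator = nn.Sequential(
    nn.Linear(data_dim, 32), nn.ReLU(),
    nn.Linear(32, 1), nn.Sigmoid(),
)

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCELoss()

def real_batch(n=64):
    # Stand-in for real-world samples: points from a shifted Gaussian.
    return torch.randn(n, data_dim) * 0.5 + 2.0

for step in range(2000):
    # Train the discriminator to label real samples 1 and generated samples 0.
    real = real_batch()
    noise = torch.randn(real.size(0), latent_dim)
    fake = generator(noise).detach()
    d_loss = bce(discriminator(real), torch.ones(real.size(0), 1)) + \
             bce(discriminator(fake), torch.zeros(real.size(0), 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Train the generator to fool the discriminator into predicting "real".
    noise = torch.randn(real.size(0), latent_dim)
    g_loss = bce(discriminator(generator(noise)), torch.ones(real.size(0), 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

# After training, synthetic data comes directly from the generator.
synthetic = generator(torch.randn(1000, latent_dim)).detach()
```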

Using Variational Autoencoders (VAEs)

Before we jump into the specifics of the encoder and decoder, let me explain low- and high-dimensional data and their relevance to autoencoders.

Consider a case where we have a 4K video stream with a resolution of 3840 x 2160 pixels. If we want to process this video in real time, loading and processing every frame at full resolution would be very expensive. We call such a 4K frame high-dimensional data. In contrast, if we compress each frame to a full-HD image that retains most features of the original input, we get a low-dimensional representation of the original 4K image.
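For a sense of scale, here is the back-of-the-envelope arithmetic, assuming three color channels per pixel:

```python
# Values per frame at 4K (UHD) versus full-HD resolution, 3 color channels each.
uhd_values = 3840 * 2160 * 3    # 24,883,200 values per 4K frame
fhd_values = 1920 * 1080 * 3    # 6,220,800 values per full-HD frame
print(uhd_values / fhd_values)  # 4.0, a fourfold reduction in dimensionality
```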

Autoencoders are used to process high-dimensional inputs and compress them into a low-dimensional representation that effectively captures the most salient features.

A classical autoencoder learns efficient embeddings of unlabeled data for a given network configuration.

The autoencoder consists of two parts, an encoder and a decoder.

The encoder compresses the data from a higher-dimensional space to a lower-dimensional space (also called the latent space). The lower-dimensional space retains the most important information from the original, higher-dimensional data distribution.

The decoder does the opposite (i.e., converts the latent space back to the higher-dimensional space). It ensures that the latent space captures most of the information in the dataset by forcing the network to reconstruct what was fed as input to the encoder.

Source: Classical Autoencoder Architecture

An autoencoder can compress valid inputs into fewer bits, eliminating redundancy (the encoder’s job). Still, because the latent space of a plain autoencoder is not regularized, the decoder cannot be used to generate valid data from latent vectors sampled at random from that space.
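For reference, here is a minimal PyTorch sketch of a classical autoencoder trained purely with a reconstruction loss. The layer sizes and the random toy batch are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Encoder: compresses a flattened 28x28 input (784 values) into a 32-d latent vector.
encoder = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 32),
)
# Decoder: maps the latent vector back to the original 784-dimensional space.
decoder = nn.Sequential(
    nn.Linear(32, 128), nn.ReLU(),
    nn.Linear(128, 784), nn.Sigmoid(),
)

autoencoder = nn.Sequential(encoder, decoder)
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
criterion = nn.MSELoss()

x = torch.rand(64, 784)  # stand-in for a batch of real inputs scaled to [0, 1]
for _ in range(100):
    reconstruction = autoencoder(x)
    loss = criterion(reconstruction, x)  # reconstruct the encoder's own input
    optimizer.zero_grad(); loss.backward(); optimizer.step()

# Note: decoding a random latent vector, e.g. decoder(torch.randn(1, 32)), usually
# does not yield a valid sample, because this latent space is not regularized.
```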

Variational Autoencoders

A Variational Autoencoder (VAE) addresses the non-regularized latent space in autoencoders and provides the generative capability to the entire space.

A non-regularized latent space has only a few regions that generalize well. This means that if we sample a random point from the latent space, the decoder will most likely not produce a good variation of the original data, which may result in not-so-authentic samples.

To address this issue, instead of outputting latent vectors directly, the encoder of a VAE outputs the parameters of a pre-defined distribution (e.g., a normal distribution) for every input. Why does a VAE output distribution parameters instead of generating a vector directly?

Outputting distribution parameters (e.g., the mean and variance of a normal distribution) models each input as a specific normal distribution rather than a single point. This provides more flexibility to generate good data variations and enhances the quality of the generated data.

This is what allows a VAE to learn a data distribution that mimics the input data and to produce realistic samples resembling it.

Source: Variational Autoencoder Architecture

Synthetic data generation using Variational Autoencoders (VAEs)

Algorithm to generate synthetic data using VAE:

  1. An encoder module turns the input samples input_img into two parameters in a latent space of representations, mean and log_variance.
  2. You randomly sample a point z from the latent normal distribution assumed to generate the input image via z = mean + exp(0.5 * log_variance) * epsilon, where exp(0.5 * log_variance) is the standard deviation and epsilon is a random tensor of small values.
  3. A decoder module maps this point in the latent space back to the original input image.

The parameters of a VAE are trained via two loss functions (a minimal code sketch combining them follows this list):

  1. Reconstruction Loss forces the decoded samples to match the initial inputs.
  2. Regularization Loss helps learn well-formed latent spaces and reduces overfitting to the training data.
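Here is that sketch: a minimal PyTorch implementation combining the encoder outputs (mean and log variance), the sampling step, and the two losses. The layer sizes and the 784-dimensional, MNIST-like input are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

input_dim, hidden_dim, latent_dim = 784, 256, 16

class VAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.fc_mean = nn.Linear(hidden_dim, latent_dim)      # step 1: mean
        self.fc_log_var = nn.Linear(hidden_dim, latent_dim)   # step 1: log variance
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.encoder(x)
        mean, log_var = self.fc_mean(h), self.fc_log_var(h)
        # Step 2: reparameterization trick, z = mean + std * epsilon,
        # where std = exp(0.5 * log_var) because the encoder outputs log variance.
        epsilon = torch.randn_like(mean)
        z = mean + torch.exp(0.5 * log_var) * epsilon
        return self.decoder(z), mean, log_var  # step 3: decode back to input space

def vae_loss(x, x_hat, mean, log_var):
    # 1. Reconstruction loss: decoded samples should match the inputs.
    reconstruction = F.binary_cross_entropy(x_hat, x, reduction="sum")
    # 2. Regularization loss (KL divergence): keeps the latent space well-formed.
    kl = -0.5 * torch.sum(1 + log_var - mean.pow(2) - log_var.exp())
    return reconstruction + kl

model = VAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, input_dim)  # stand-in for a batch of real inputs in [0, 1]
x_hat, mean, log_var = model(x)
loss = vae_loss(x, x_hat, mean, log_var)
optimizer.zero_grad(); loss.backward(); optimizer.step()

# Once trained, synthetic samples come from decoding random latent points:
# samples = model.decoder(torch.randn(64, latent_dim))
```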

The Challenges in Synthetic Data

As synthetic data’s popularity rises, so does its usage. Nevertheless, synthetic data generators are usually trained on real data. They produce samples from “within” the learned distribution and do not generalize far beyond it. A model can even combine distributions (as DALL·E 2 does), but it cannot generate something it has never seen, so real data collection remains crucial.

There are some significant challenges to synthetic data adoption:

  • Lack of outliers: It can be hard to program rare events in the data distribution.
  • Variable data quality: Depends on the input data and requires tight quality control to avoid faulty samples.
  • Statistical replication only: Synthetic data replicates specific statistical properties of the source data, so it can miss the random behavior of real-world data.

Real-world Applications of Synthetic Data

Synthetic data can be generated from different sources, such as images, texts, audio, and videos. Many companies have adopted synthetic data generation techniques to augment their in-house datasets and improve training performance.

  • Amazon uses synthetic data to train Alexa’s language system.
  • Google’s Waymo uses synthetic data to train its self-driving cars.
  • Roche utilizes synthetic medical data for clinical research.
  • Microsoft’s research paper, Fake it till you make it: face analysis in the wild using synthetic data alone, generated diverse, labeled 3D human faces as training material for machine learning models in computer vision tasks such as landmark localization and face parsing.
  • American Express trained AI fraud prevention models on synthetic data. The company used GANs to synthesize fraudulent cases with insufficient data. The goal was to augment the real data set with synthesized data to balance the availability of different fraud variations.

The PREDICTioN2020 project by the Charité Lab for Artificial Intelligence in Medicine aimed to create a comprehensive platform for stroke outcome prediction.