Training neural networks typically requires thousands of examples per class. That’s fine because data is everywhere… innit? Well… data may be the new gold, but even if you dig deep, it’s not always there.

Ground truth training data can be expensive and complicated to gather, scarce because of privacy concerns, or simply nonexistent. Synthetic data is data created artificially to mimic real-world situations. Researchers and companies are now using simulated environments to collect this “fake data” in cases where they feel some accuracy can be traded away. While this practice seems to be here to stay, it’s not without challenges.

Synthetic data, a very real trend

Video games to train self-driving cars

Autonomous vehicles are maybe the best example of a critical AI application. If you put cars without drivers on the roads, you REALLY want to get it right! No doubt that requires a gigantic amount of training data. Sure, there is plenty available, but not nearly enough. Waymo made the front page by having its cars drive 8 million miles on the road, but in the meantime they covered 5 billion miles in simulation.
GTA hasn’t helped your studies? Well, it sure helps self-driving car technology. Automakers are using video games like GTA Vice City to get training data and build the corresponding neural network models. Game environments for training neural networks are also being provided by projects like TextWorld.


Medical diagnostics without patient data

While artificial intelligence applied to medical challenges has tremendous potential, patient confidentiality makes data hard to come by. Dr Shahrokh Valaee, an electrical engineer, produced synthetic training data by using machine learning to generate computer-made X-rays.

“We are creating simulated X-rays that reflect certain rare conditions so that we can combine them with real X-rays to have a sufficiently large database to train the neural networks to identify these conditions in other X-rays,” says Valaee.

Video datasets without privacy issues and the companies that build them

Building large ground truth datasets for video recognition means overcoming heavy privacy and ethical concerns. TwentyBN provides labelled synthetic data for this purpose. They developed an in-house data factory for producing high-definition videos that closely resemble real-world situations. Other companies, Neuromation among them, are also surfing this wave.

“When I first saw the synthetic dataset I thought ‘This is terrible. How is it possible the computer can be learning from this?’ But what matters is what the computer understands from an image,” says Schuster.

Synthetic data to patch incomplete 3D mapping

3D mapping technology is increasingly used for construction, maintenance and general building information management. It often relies on data gathered from the outside of buildings. Even with aerial drone surveys or Lidar (Light Detection and Ranging), the data is patchy, leaving the models incomplete and sketchy. Intel data scientists used synthetic data to fill in those blanks.

Reality gap: synthetic data still comes with a price

Well, as promising as it may seem, synthetic data doesn’t come without a price: the reality gap is still there! Not every situation can be mimicked accurately. For example, because autonomous vehicles require a high level of accuracy, they need real data in addition to synthetic data. Synthetic data on its own is only good for situations that can compromise on accuracy. You need to be an expert in your field to make sure that synthetic data provides acceptable results; otherwise you might end up with trained models that generalize poorly. Further, in many new fields data scientists don’t yet have enough experience to judge whether the synthetic data is close to the real thing.

Synthetic data is surely helpful, but AI models based on real training data are still more reliable and accurate. Even in situations where synthetic data is necessary, it should be mixed with real data.
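What could "mixing with reality" look like in practice? Here is a minimal Python sketch, not a recipe from any of the teams mentioned above. It keeps every real sample and tops up with synthetic ones until they reach a chosen share of the training set. The function name `mix_training_set`, the `synthetic_fraction` parameter and the sample lists are all hypothetical, made up for illustration.

```python
import random


def mix_training_set(real, synthetic, synthetic_fraction=0.5, seed=0):
    """Keep all real samples and add synthetic ones so that the synthetic
    share of the result is at most `synthetic_fraction` (hypothetical helper)."""
    if not 0.0 <= synthetic_fraction < 1.0:
        raise ValueError("synthetic_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_syn / (n_real + n_syn) = fraction for n_syn, rounding down.
    wanted = int(len(real) * synthetic_fraction / (1.0 - synthetic_fraction))
    n_syn = min(wanted, len(synthetic))
    # Tag each sample with its origin so it can be audited later.
    mixed = [(sample, "real") for sample in real]
    mixed += [(sample, "synthetic") for sample in rng.sample(synthetic, n_syn)]
    rng.shuffle(mixed)
    return mixed


# Hypothetical stand-ins for real and simulated samples.
real_samples = [f"real_{i}" for i in range(100)]
synthetic_samples = [f"sim_{i}" for i in range(1000)]

train = mix_training_set(real_samples, synthetic_samples, synthetic_fraction=0.75)
n_synthetic = sum(1 for _, source in train if source == "synthetic")
print(len(train), n_synthetic)
```

With 100 real samples and a 75% synthetic share, this gives 400 training samples, 300 of them synthetic. The right ratio depends on the task, and whatever you pick, it is worth measuring the result on a held-out set made of real data only, since that is where the reality gap shows up.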

Originally published on August 10, 2018

Topics: Data Science, Machine Learning, Deep Learning, Computer Vision, Datasets

