directions for addressing these challenges.
The Gaussian Copula approach addresses a scenario where a company may be unwilling to share its real dataset but is willing to release a synthetic copy that preserves many of the real dataset's properties for researchers to use. This statistical method generates realistic synthetic data with desired properties, such as preserving the correlations among variables by modelling their dependence with a normal (Gaussian) distribution while keeping each variable's original distribution.
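To make this concrete, below is a minimal sketch of Gaussian-copula synthesis in Python, assuming only NumPy and SciPy; the two toy columns (spend and a satisfaction score) and all figures are hypothetical, not taken from any Ipsos study. Each column of the real data is mapped to normal scores, the correlation of those scores is estimated, fresh correlated normal scores are sampled, and they are mapped back to the original marginal distributions, so the synthetic copy preserves both the marginals and the correlations.

    # Minimal sketch of Gaussian-copula synthetic data (illustrative; toy data only).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Toy "real" dataset: two correlated, non-normal columns (hypothetical).
    spend = rng.gamma(shape=2.0, scale=10.0, size=1000)
    satisfaction = rng.beta(a=2.0, b=5.0, size=1000) * 100 + 0.3 * spend
    real = np.column_stack([spend, satisfaction])

    def to_normal_scores(x):
        # Map a column to standard-normal scores via its empirical ranks.
        ranks = stats.rankdata(x) / (len(x) + 1)
        return stats.norm.ppf(ranks)

    # 1. Transform each marginal to normal scores and estimate their correlation.
    z = np.column_stack([to_normal_scores(real[:, j]) for j in range(real.shape[1])])
    corr = np.corrcoef(z, rowvar=False)

    # 2. Sample new normal scores with the same correlation structure.
    z_new = rng.multivariate_normal(mean=np.zeros(2), cov=corr, size=1000)

    # 3. Map back to the original marginals via empirical quantiles.
    u_new = stats.norm.cdf(z_new)
    synthetic = np.column_stack([np.quantile(real[:, j], u_new[:, j])
                                 for j in range(real.shape[1])])

    print("real corr:     ", round(np.corrcoef(real, rowvar=False)[0, 1], 3))
    print("synthetic corr:", round(np.corrcoef(synthetic, rowvar=False)[0, 1], 3))

In practice, open-source libraries such as SDV provide Gaussian-copula synthesizers that wrap these steps and add handling for categorical columns and constraints.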
And lastly, transformer-based models such as OpenAI's GPT excel at capturing intricate patterns and dependencies within the data. By training on large datasets, they learn the underlying structure and generate synthetic data that closely resembles the original distribution. They are extensively applied in natural language processing tasks, but they have also found applications in computer vision and speech processing.
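As a simple, hedged illustration of the idea (using the publicly available GPT-2 checkpoint through the Hugging Face transformers library, not any specific commercial model discussed here), a pre-trained generative model can be prompted to produce multiple synthetic text responses whose style follows what it learned from its training data:

    # Illustrative sketch only: generate synthetic open-ended responses with GPT-2.
    from transformers import pipeline

    generator = pipeline("text-generation", model="gpt2")
    samples = generator(
        "Customer feedback on the new yogurt flavour:",  # hypothetical prompt
        max_new_tokens=40,
        num_return_sequences=3,
        do_sample=True,
    )
    for s in samples:
        print(s["generated_text"])

The same pattern, a model trained on large real corpora and then sampled, underlies richer synthetic-data generators, which are typically fine-tuned on domain-specific human data.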
That synthetic data is revolutionizing industries from healthcare to financial services to automotive by enabling simulation and data augmentation hardly needs belaboring, judging by the excitement it has generated.
AI presents an opportunity to improve the speed, and potentially the success rate, of new innovations, and how we go about doing this will determine whether we succeed. The promise of being able to generate synthetic data at will and at scale is therefore extremely attractive.
Public sentiment on synthetic data, however, is quite polarized. Companies offering synthetic data services might say, “this is all you need, no humans required!” Researchers who are more cautious may adopt a wait-and-see approach and are hesitant to use synthetic data for the time being. Underlying these opinions is a tendency to categorize the world in a binary way: Good or bad? Synthetic or real-world data?
At Ipsos, through our pilots, we have demonstrated that AI can generate synthetic data that mimics real-world data, but it first requires quality human data for training. Therefore, the answer is not synthetic or real-world data. We need both.
The accuracy of synthetic data is not good or bad; rather, “it depends”. If product differences among humans are small, we need to investigate subgroups. If the human data used to train the AI is not representative of the target group or relevant to the business, then the accuracy of the synthetic data will be compromised.
If we want to use synthetic data, we must accept that it may not work under some conditions. As researchers, our responsibility is to ensure we use synthetic data only when appropriate: under conditions that will maximize success. Augmenting real samples with synthetic data offers several advantages over simply working with smaller sample sizes, including the ability to conduct subgroup analyses, retain statistical power, and perform more complex analyses.
At Ipsos, we believe synthetic data opens brand new possibilities for market research, particularly in product testing. However, many businesses remain uncertain about the quality of synthetic data or how to evaluate it.
To generate synthetic data that effectively mimics real-world data, an artificial intelligence model must first be trained on relevant, real-world data. AIs are simply algorithms; they have no intelligence of their own until they are trained. It is through learning from training data that AIs acquire the intelligence we associate with them.
The development of AI requires data, and the quality of the data determines the quality of the AI model. There are two main forms of learning for AI models: supervised learning, where a human provides labelled examples that teach the AI what to learn, and self-supervised learning, where the AI is fed a large amount of text and learns to make predictions from the data itself.
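A small, hypothetical Python contrast may help: in supervised learning a human supplies the labels (here, hand-coded purchase intent), while in self-supervised learning the training signal is derived from the raw data itself, for example by predicting the next word.

    # Toy contrast of supervised vs. self-supervised learning (hypothetical data).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    # Supervised: a human provides the labels the model must learn to predict.
    texts = ["loved the taste", "too expensive", "great value", "would not buy again"]
    labels = [1, 0, 1, 0]  # human-coded purchase intent
    X = CountVectorizer().fit_transform(texts)
    clf = LogisticRegression().fit(X, labels)

    # Self-supervised: the "labels" come from the data itself, e.g. each word's
    # successor in raw text, so no human annotation is needed.
    corpus = "consumers want products that taste great and cost less".split()
    pairs = [(corpus[i], corpus[i + 1]) for i in range(len(corpus) - 1)]
    print(pairs[:3])  # (context word, next word) training pairs derived from text

Either way, the quality of the resulting model is bounded by the quality and relevance of the data it learns from.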
New product ideas are more likely to succeed if the ideation and evaluation phases are grounded in data reflecting consumers’ intrinsically human needs and desires. This data needs to be timeless or, at a minimum, up to date. Because data is so central to AI, Humanizing AI starts by explaining how training data determines the accuracy of an AI model.
For this reason, off-the-shelf AI models have their limitations: generating and predicting better innovations requires real consumer data. Ipsos, for example, uses human reactions to new product concepts to train AI models for concept evaluation.
Humanizing AI calls for the use of real human data to better understand and predict real human behavior. By incorporating relevant, representative, and timeless data, AI models become more valuable.
When using synthetic data in product testing, it is worth noting three issues. Firstly, AI will never be human, and its data can never echo our product experiences, which combine the five senses, emotions, expectations, and context. Therefore, the goal is to augment human input with synthetic data, not to replace it.
Secondly, the value of synthetic data is not binary (good or bad); its accuracy depends on many factors, including the size of the differences in the data we are trying to replicate and the representativeness of the real-world data we are training an AI to learn from. The use of synthetic data should be strategic, weighing the associated risks and benefits.
And lastly, synthetic data can boost market research agility, making it ideal for resource-intensive areas like product testing: reducing costs and saving time, with additional benefits for detailed subgroup analyses.
While the potential of synthetic data has everyone in the market research industry excited about the prospects of accelerating research processes and expanding datasets, we need to tread cautiously. Validating generated data and addressing bias are essential to realizing the full potential of synthetic data. As advancements continue, the market research industry will need to remain adaptive, embracing new methodologies while upholding standards of quality and validity.
Chris Githaiga is the Ipsos in Kenya Managing Director and the Country Manager, East Africa Cluster. You can reach him on this or related matters via email at Chris.Githaiga@ipsos.com.