Real-World Data for AI Training Has Been Exhausted, Shifting Focus to Synthetic Data
During a recent livestream with Stagwell chairman Mark Penn, Elon Musk shared his view of the AI industry, including AI training data, saying that AI companies have run out of new real-world data to train their models.
Musk, CEO of xAI, said that “we’ve now exhausted basically the cumulative sum of human knowledge” available for AI training.
Musk’s view echoes that of former OpenAI chief scientist Ilya Sutskever, who declared at the NeurIPS conference in December that AI training had reached “peak data.”
Synthetic Data for AI Training
Major tech players like Microsoft, Meta, OpenAI, and Anthropic have already used synthetic data to supplement the training of their flagship models.
Gartner estimated that synthetic data would make up 60% of all data used in AI and analytics by 2024. With dependence on synthetic data increasing this quickly, the face of AI training is bound to change.
Synthetic data solutions have already been used to develop several major AI models. Microsoft’s Phi-4 models, which were open-sourced in early January, were trained on both synthetic and real-world data. Synthetic data is also used in Google’s Gemma models and Anthropic’s Claude 3.5 Sonnet. Meta’s Llama series was fine-tuned with AI-generated data to improve performance.
One significant advantage of synthetic data is cost efficiency. AI startup Writer developed its Palmyra X 004 model almost entirely from synthetic sources, which cut the development cost to $700,000, compared with $4.6 million for a similar model based on real-world data.
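To put those two figures in perspective, the short calculation below works out the implied saving. It is only a sketch that uses the $700,000 and $4.6 million numbers quoted above; the function name and the returned fields are illustrative, not taken from Writer or any other source.

```python
def compare_costs(synthetic_cost, real_world_cost):
    """Return the absolute saving, percentage reduction, and cost ratio."""
    savings = real_world_cost - synthetic_cost
    reduction_pct = savings / real_world_cost * 100
    cost_ratio = real_world_cost / synthetic_cost
    return savings, reduction_pct, cost_ratio


# Figures as quoted in the article (US dollars).
savings, reduction_pct, cost_ratio = compare_costs(700_000, 4_600_000)

print(f"Saving: ${savings:,}")                # Saving: $3,900,000
print(f"Reduction: {reduction_pct:.1f}%")     # Reduction: 84.8%
print(f"Cost ratio: {cost_ratio:.2f}x")       # Cost ratio: 6.57x
```

On these numbers, the synthetic-data route costs roughly one-seventh as much as the real-world comparison.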
While the benefits are apparent, synthetic data also comes with challenges. Research indicates that reliance on synthetic data could lead to model collapse, or at least to AI models that are less “creative” and more biased.
Because synthetic data is generated by AI models themselves, any biases or limitations in the original data are simply reproduced and amplified in the output.
This in turn could reduce the diversity of what models produce and increase the risk of reinforcing existing bias.
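The amplification effect can be illustrated with a toy simulation. The sketch below is not the method used in the research mentioned above. It assumes, purely for illustration, that each model generation slightly over-produces whatever was already common in its training data (the hypothetical `exponent` parameter), and that the next generation is trained only on that output. The names `sharpen`, `entropy`, and `simulate` are made up for this example.

```python
import math


def sharpen(probs, exponent=1.2):
    """One generation: the model over-produces already-common outputs.

    Raising each probability to a power above 1 and renormalizing is an
    assumed stand-in for a model that favors frequent patterns.
    """
    powered = [p ** exponent for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


def entropy(probs):
    """Shannon entropy in bits; lower values mean less diverse output."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def simulate(initial, generations=10, exponent=1.2):
    """Train each generation only on the previous generation's output."""
    history = [list(initial)]
    for _ in range(generations):
        history.append(sharpen(history[-1], exponent))
    return history


if __name__ == "__main__":
    # Hypothetical starting mix of four kinds of content in real-world data.
    start = [0.4, 0.3, 0.2, 0.1]
    for gen, dist in enumerate(simulate(start)):
        shares = [round(p, 3) for p in dist]
        print(f"generation {gen}: shares={shares} entropy={entropy(dist):.3f}")
```

In this toy setup the most common category takes a larger share with every generation while the rarest one fades, and the entropy (a rough measure of diversity) keeps falling. Real training pipelines are far more complex, but the sketch shows why mixing real-world data back in is often seen as important.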
The AI Data Trainer Process
This trend represents a significant inflection point in the development of AI. As technology companies continue to experiment with synthetic data, the AI industry must weigh its benefits against the risks of bias and model collapse.
Musk and other experts say the future of AI development hinges on how effective synthetic data can be in improving AI models with a minimum of drawbacks.
The exhaustion of real-world data for AI training has pushed the industry toward synthetic data solutions, which bring both opportunities and risks.
As AI giants adopt synthetic data to reduce costs and improve model performance, the industry must address the challenges of bias and creativity loss to ensure future AI systems remain effective and unbiased.