AI is eating the internet so fast it's running out of data - luckily, publishers have the solution

Future LLMs could require more data than exists on the entire internet to train them. So surely that's good news for our industry?

by Rob Corbidge
Published: 15:05, 04 April 2024

Last updated: 17:47, 04 April 2024
You can do more with your precious data than give it to robots

China's most famous contemporary author, Liu Cixin, produces words at the rate of around 2,000 per day, or 3,000 a day at most when the flow comes to him.

In an interview with that fabulous populariser of science, Jim Al-Khalili, Liu explained that to produce more than this quantity was something he considered "impossible" for writers of fiction, although he did refer to the Chinese culture of "Internet Stories" - stories that appear on popular platforms and are updated daily - as having writers who could produce "10,000 words a day".

Little of contemporary Chinese culture reaches beyond its immediate neighbours, unlike that of, say, Japan, so Liu's powerful contribution to the global canon of science fiction - the Remembrance of Earth's Past trilogy and its best-known book, The Three-Body Problem - is something to be welcomed to the fullest extent.

(Incidentally, the Chinese-language serialisation of The Three-Body Problem is available on YouTube and is superior to the recent Netflix adaptation. Don't be a subtitle coward.)

A writer with as powerful an imagination as Liu's can still only produce 2,000 words a day. Even given that, as he explained, he must have the whole story in his mind before he can commit it to words, he can still only pull 2,000 of those words a day from that same mind.

I was reminded of this when reading of the possibility that the next generation of AI models might require more data than actually exists on the entire internet to train them. So even an army of Liu's 10,000-words-a-day human writing machines working 24/7 would only be feeding an LLM Tyrannosaurus the equivalent of salad.

Pablo Villalobos, a researcher for the Epoch institute, told the Wall Street Journal that "based on a computer-science principle called the Chinchilla scaling laws, an AI system like GPT-5 would need 60 trillion to 100 trillion tokens of data if researchers continued to follow the current growth trajectory". GPT-5 is the anticipated next generation of LLM.

"Harnessing all the high-quality language and image data available could still leave a shortfall of 10 trillion to 20 trillion tokens or more, Villalobos said. And it isn’t clear how to bridge that gap," the WSJ continued.

Now, as Villalobos and other researchers concede, there's an element of "peak oil" to this estimation game. As a reminder, we're supposed to be out of the black stuff by now and fighting over rat carcasses with spears, according to some woefully over-confident predictions in the 1970s and 80s.

Yet the exact figure doesn't really matter to publishers. The realisation has at least dawned that this is a gold rush, and publishers produce a lot of gold. If you wish to be absolutely mercenary - and we're up against OpenAI, so it is advisable - then you can see your content in terms of the glittering high-quality tokens that the LLM squad crave to feed their voracious systems.

There must be a good price for this material if its value is so high. If it works for the lithium mining companies feeding the power revolution, then surely it must work for the publishers feeding the AI revolution. Particularly as they are one of the major sources of the 10% of high-quality data that the training models value so much.

Well, maybe not at the moment, and certainly not if you're outside the circle of top-tier publishers by company value.

The value of and need for such data is so great that it has been reported some AI researchers are trying synthetic data generation. Quite what that means isn't clear yet, but you can be certain it won't be 2,000 words a day.

The WSJ report also mentions a different way of ascribing value to content, under consideration by OpenAI, that is interesting to us. The company "has discussed creating a data market where it could build a way to attribute how much value each individual data point contributes to the final trained model and pay the provider of that content".
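OpenAI has not said how such per-contributor attribution would be calculated, so any illustration is speculative. One approach from the data-valuation research literature is leave-one-out attribution: measure model quality with and without a contributor's material and treat the difference as that contributor's value. The sketch below is a toy version of that idea; the corpora, the train_model and evaluate callables, and the notion that anything like this runs directly at LLM scale are all assumptions.

```python
# Toy leave-one-out data valuation: a contributor's "value" is the drop in
# model quality when their documents are removed from the training set.
# Purely illustrative - full retraining per contributor is infeasible for
# real LLMs, where approximations (influence functions, Shapley-style
# estimates) would be used instead.
from typing import Callable, Dict, List

def leave_one_out_values(
    corpora: Dict[str, List[str]],               # contributor -> their documents
    train_model: Callable[[List[str]], object],  # trains a model on a document list
    evaluate: Callable[[object], float],         # higher score = better model
) -> Dict[str, float]:
    all_docs = [doc for docs in corpora.values() for doc in docs]
    baseline = evaluate(train_model(all_docs))
    values = {}
    for contributor in corpora:
        remaining = [doc for other, docs in corpora.items()
                     if other != contributor for doc in docs]
        values[contributor] = baseline - evaluate(train_model(remaining))
    return values
```

Even in this toy form the publisher's question stands: whoever chooses the evaluation metric and the attribution method effectively sets the price.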

"Who is doing the value calculation and how?" is the first question for any publisher here, but it is at least an indication of the changing attitudes around data collection for AI training and what that data is worth within the industry itself.

As is the experience of Ricky Sutton, who, as he writes here, has recently found ChatGPT unwilling to provide copyrighted news content, in contrast to its previous "what's yours is ours" mode of operation.

The Three-Body Problem takes its name from the difficulty in physics of modelling the motion of three bodies orbiting one another - a problem prone to dynamic chaos.

Publishing is in a spot of dynamic chaos right now, but don't lose sight of the value it holds.