AI and the sea of content: are you the fish, the sea, or the trawler?

The plundering of data for use in LLMs could even have spread to what most people would regard as "their" content

by Rob Corbidge
Published: 16:48, 11 April 2024

“I met a man at a party. He told me, 'I'm writing a novel.' I said, 'Oh really? Neither am I!'”

The words of English satirist Peter Cook will ring true with any of us who have attempted, or are attempting, to put our thoughts down in the form of a book. It seems, though, that something still regarded as one of the highest forms of intellectual endeavour could now be exposed to prying eyes even before the author is satisfied enough to make it public.

The New York Times have published what is essentially a follow-up to the WSJ article we discussed last week about the voracious appetite for data that the next generation of LLMs will have. The conclusion was that they will need more than exists on the entire internet.

The NYT follow-up is a thorough piece of work and, headlined "How Tech Giants Cut Corners to Harvest Data for AI", reveals some new and intimate information about the effort to collect data and how the headlong rush to claim the AI crown is leading the big players to do some very questionable things.

To return to the intro, think on this quote from the NYT: "Last year, Google also broadened its terms of service. One motivation for the change, according to members of the company’s privacy team and an internal message viewed by The Times, was to allow Google to be able to tap publicly available Google Docs, restaurant reviews on Google Maps and other online material for more of its AI products."

Google Docs. Where people write things. Where, to be absurd, an aspirant writer constructing a story about a heroic intellectual property lawyer with a limp and a troubled past who takes on the big boys might write things, for example. The question is, is that content being harvested for tokenisation? Being harvested while being written? 

Is anything actually safe?

The NYT quotes Google spokesman Matt Bryant as saying that Google did not use information from Google Docs or related apps to train language models "without explicit permission" from users.

Then it turns out that Google changed the T&Cs around the use of such data, and released the new terms and conditions on the Fourth of July weekend last year, a time when - trust me on this one - most Stateside residents aren't overly concerned with reading small print.

It doesn't look good, does it?

We know already that Google and OpenAI have both filleted YouTube for material, and thus this telling quote: "Some Google employees were aware that OpenAI had harvested YouTube videos for data ... but they didn’t stop OpenAI because Google had also used transcripts of YouTube videos to train its AI models. That practice may have violated the copyrights of YouTube creators."

And so it gets worse. That is a case of not acting on something wrong on the basis that you're doing it yourself. Business is business, I know, but you don't have to be a big fan of the moral high ground to think that maybe that's not very good practice. 

While AI experts describe to us the concept of Model Collapse as AI content is fed into AI models - think of it like a dog doing something unmentionable - here we have AI ethics collapse: they do it, so we do it, so we must both be right.

We are in a world where Sundar Pichai could walk on stage at Google's I/O next month dragging a trawler net full of unconscious YouTubers and informing the audience "this is just a small part of our harvest".

It's all rather reminiscent of 2013, when it was revealed that Russia's Federal Guard Service, the FSO, had purchased £10,000 worth of typewriters so that certain messages would exist only as hard copy and could not be hacked.

Is that where we're heading, defending ourselves not against spy-versus-spy subterfuge but against publicly listed companies?

They're all up in our stuff, it seems, copying it and learning from it, and will continue to do so while they are locked in their own arms race to create ever more powerful LLMs.

This game needs to end, or it has to be pay-to-play.