Meta's AI, built on ill-gotten content, can probably build a digital you

14,04,25

Llama 4 Scout is just the right size to ingest a lifetime of Facebook and Insta posts

In the last twelve months generative AI has transformed from a helpful and cheeky tool into something more worrying.

That cycle began for me in February last year, when Australian science magazine COSMOS fired all its freelancers – myself included – and replaced us with AI-generated content.

The decision went down badly, and the magazine later “paused” its use of AI-generated articles.

So far, so bad.

While COSMOS squirmed, we learned how it became possible to consider 'replacing' people with AI when allegations emerged that Meta used the LibGen dataset to train its Llama family of AI models.

LibGen has an exotic history, with roots in the USSR's samizdat culture that saw citizens create copies of texts because the government controlled printing presses and distribution.

LibGen eventually blossomed into a 'shadow library' of scientific papers and journals that became a friend to academics who could not afford the ever-higher fees demanded by publishers who put research behind paywalls.

Repeated efforts to take LibGen down saw it become increasingly resistant to attack. Today, it hosts and links to nearly a million published works, but sadly many were included without consulting or compensating authors.

LibGen’s problems are well known. Yet Meta, a publicly listed company with over $150 billion of annual revenue, allegedly sought it out to train its Llama models.

It’s possible to search LibGen at this wonderful form. I used it to discover that nine books I wrote are in there, including my latest: Getting Started with ChatGPT and AI Chatbots.

The irony of this is not lost on me.

Many of my friends are writers. All but one have found their work in LibGen. One told me that searching LibGen revealed a German translation of one of his earlier works that he didn't know about!

Meta paid nothing for LibGen, a fabulous deal which looks even more formidable following this past weekend's launch of Llama 4 Scout, the latest-and-greatest of Meta's open source-ish models.

Although received with only tepid applause, Llama 4 Scout managed to get one thing astonishingly correct: a massive 10-million-token "context window" that gives it the capacity to digest five million words (10,000 pages), and – because of its training as a 'multimodal' model – around 10,000 images.

Why would Meta choose or need that capacity? Consider the amount of data generated by someone using Facebook over the last 20 years. I think it would all fit quite comfortably within Llama 4 Scout, giving Meta the ability to quickly find and analyze anything any of its users has ever posted.
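For a rough sense of scale, here is a back-of-the-envelope sketch in Python. The conversion ratios are simply those implied by the figures above (10 million tokens, five million words, 10,000 pages). The posting habits are entirely hypothetical, chosen for illustration only and not drawn from any Meta data.

```python
# Ratios implied by the article's figures:
# 10 million tokens ~ 5 million words ~ 10,000 pages.
CONTEXT_TOKENS = 10_000_000
WORDS_PER_TOKEN = 0.5   # 5,000,000 words / 10,000,000 tokens
WORDS_PER_PAGE = 500    # 5,000,000 words / 10,000 pages


def lifetime_words(posts_per_day: float, words_per_post: float, years: float) -> float:
    """Estimate the total words a user has posted (all inputs are hypothetical)."""
    return posts_per_day * words_per_post * 365 * years


def fits_in_context(words: float) -> bool:
    """True if the word count converts to no more tokens than the context window."""
    return words / WORDS_PER_TOKEN <= CONTEXT_TOKENS


# Hypothetical heavy user: 5 posts a day, 50 words each, for 20 years.
words = lifetime_words(posts_per_day=5, words_per_post=50, years=20)
print(f"{words:,.0f} words, about {words / WORDS_PER_PAGE:,.0f} pages")
print("fits in context window:", fits_in_context(words))
```

Under those assumed habits the text of two decades of posts comes in well under the limit. Images are not counted here, and a real user's output could be far larger or smaller.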

Meta can use that capability to generate a simulacrum of its users with near-perfect fidelity, because it is digesting everything those users have ever posted to Facebook/Instagram/Messenger (and all of the other websites Meta invisibly monitors).

In an unappealing twist on the Turing Test, that means it will become increasingly difficult to discriminate between responses generated by Meta's simulacra of its users and the actual humans who use its services.

I don’t know if Facebook will create bots based on all its users – but it sounds like a great way to test their susceptibility to advertising. If it does, we'll need to rely on 'shibboleths' and other forms of out-of-band signaling to signify our human presence amidst all the bots. Or maybe we'll just let Facebook's fake users jabber on at each other forever, while we go off in search of something more interesting. I like that idea – it's harder to steal. ®

Source: https://www.theregister.com/2025/04/10/meta_copyright_digital_you/
