When machines learn from machines

The internet used to be like humanity’s collective memory. It was chaotic, full of errors, and biased, but it was an authentic record of what people knew, thought, created, and argued about. When we searched for information, we consulted resources based on human experience, even if that experience was imperfect. Today, however, an increasing proportion of this content has never been written or thought through by a human.
A social media platform has recently emerged where AI agents post, comment, vote, debate, and create communities, reducing humans to silent observers. Social media for machines. This sounds like a curiosity, an experiment on the margins of the technological mainstream. However, the implications are much deeper than you might initially realise.
Machine learning researchers are increasingly discussing a phenomenon they call ‘model collapse’ or ‘degeneration’ (Shumailov et al., 2024). The mechanism is simple: when an AI model is trained on data generated by earlier AI models, certain features are amplified with each iteration while others fade away, so the model drifts in a direction no longer tied to the reality from which it all began. Training future AI models on such material is akin to teaching children a language only from texts written by other children who have just learned to speak, and who in turn learned from even younger children.
The problem surfaces whenever someone asks a language model about a medical, historical, or legal fact. The model responds with confidence because that’s how it was trained. But this certainty rests on statistical patterns extracted from texts, an increasing share of which was itself generated by models based on statistical patterns extracted from texts, of which an increasing proportion of the…
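The drift described above can be made concrete with a toy simulation. The sketch below is an illustration only, not the experiment from the cited paper: the "model" is just a normal distribution fitted to data, and each generation is trained solely on samples drawn from the previous generation's fit. The function name `simulate_collapse` and all of its parameters are assumptions chosen for this example.

```python
import random
import statistics


def simulate_collapse(generations=300, sample_size=10, seed=0):
    """Toy model of recursive training (hypothetical helper, for illustration).

    Generation 0 is the "real" distribution: mean 0, standard deviation 1.
    Every later generation fits a normal distribution to a small sample
    drawn from the previous generation's fit, never from the real data.
    Returns the fitted standard deviation after each generation.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    history = [sigma]
    for _ in range(generations):
        # "Generate content" with the current model.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # "Train the next model" on that synthetic content only.
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)  # maximum-likelihood estimate
        history.append(sigma)
    return history


if __name__ == "__main__":
    history = simulate_collapse()
    for generation in (0, 50, 100, 200, 300):
        print(f"generation {generation:3d}: std = {history[generation]:.3e}")
```

In this simplified setting, each refit tends to lose a little of the spread of the original data, so the fitted standard deviation typically shrinks towards zero over many generations: the rare, unusual cases disappear first. Real language models are vastly more complex, so this should be read as an analogy for the mechanism rather than a prediction about any particular system.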
What can we do about it? One part of the answer lies in architecture: designing systems that can distinguish between AI-generated and human-created content. Another lies in regulation: requiring synthetic content to be marked and creating protected repositories of authentic training data. A third lies in raising awareness that a convincing-sounding answer is not necessarily true. There are other options as well.
References
Shumailov, I., Shumaylov, Z., Zhao, Y., Gal, Y., Papernot, N., & Anderson, R. (2024). AI models collapse when trained on recursively generated data. Nature, 631, 755–759. https://doi.org/10.1038/s41586-024-07566-y
This post is part of the project “People and Algorithms in Organisations: Competences to Work in the Digital Environment” (DIGIT_People and algorithms), funded by the NAWA – Narodowa Agencja Wymiany Akademickiej (Polish National Agency for Academic Exchange). #DIGIT_NAWA #competencies #marketing #AI