The Rise of the Machine: New Study Charts the Explosive Growth of AI-Generated Web Content
A major new study from researchers at Imperial College London, the Internet Archive, and Stanford University has quantified the rapid takeover of the English-language web by artificial intelligence. By analyzing millions of web pages, the team has painted a stark picture of a digital landscape transformed in just a few years.
By mid-2025, approximately 35% of newly published websites were fully or partially AI-generated—a figure that was essentially zero before ChatGPT's launch in late 2022.
The analysis covered English-language websites archived in the Internet Archive's Wayback Machine, sampling at 33 monthly intervals from August 2022 to May 2025. The team used the Pangram v3 detector to identify AI-generated text.
What the Data Actually Shows: Two Clear Trends
The study tested six common hypotheses about AI's impact on the web. Only two were statistically supported by the data, revealing a dual effect of "semantic contraction" and a "positivity shift."
- Semantic Contraction: AI-generated texts were 33% more semantically similar to each other than human-written texts were. This indicates a narrowing of idea diversity, with AI models tending to converge on safe, predictable phrasing and concepts.
- Positivity Shift: AI texts scored 107% higher on positive sentiment than human-written content. Researchers attribute this to language models' tendency toward sycophancy and overoptimism, a bias toward pleasing the reader and avoiding negativity.
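To make the "more semantically similar" figure concrete, here is a minimal Python sketch of how an average pairwise similarity score can be computed and compared between two groups of texts. It is an illustration only: the article does not say which embedding model or similarity measure the researchers used, so plain word counts and cosine similarity stand in for them, and the sample sentences are invented.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_pairwise_similarity(texts: list[str]) -> float:
    """Average cosine similarity over every unordered pair of texts."""
    if len(texts) < 2:
        raise ValueError("need at least two texts to form a pair")
    # Assumption: bag-of-words counts stand in for real semantic embeddings.
    vectors = [Counter(t.lower().split()) for t in texts]
    pairs = list(combinations(vectors, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def relative_increase(new: float, base: float) -> float:
    """Percentage by which `new` exceeds `base`."""
    return (new - base) / base * 100.0


# Invented sample sentences, not data from the study.
ai_like = [
    "our solution delivers great value",
    "our solution delivers great results",
    "our product delivers great value",
]
human_like = [
    "the ferry was late again",
    "i burned the rice twice",
    "nobody reads the manual",
]

gap = relative_increase(
    mean_pairwise_similarity(ai_like),
    mean_pairwise_similarity(human_like),
)
print(f"AI-like texts are {gap:.0f}% more similar to each other")
```

Read this way, "33% more similar" means the AI group's average similarity is 1.33 times the human group's, and "107% higher" on sentiment means a score a little more than double.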
Interestingly, several widely feared outcomes were not supported by the evidence. The study found:
- No disappearance of individual writing styles.
- No decline in external links.
- No drop in information density.
- No statistically significant increase in verifiable factual errors.
To test the "truth decay" hypothesis, researchers used GPT-4o-mini to extract verifiable claims from websites. Fifty human annotators then checked these claims against external sources.
While the study found no statistically significant correlation between the share of AI content and the proportion of refuted statements, researchers caution against complacency. This analysis was based on a small subsample of approximately 250 websites, compared to the full study's 10,000 URLs per month over 33 months. Furthermore, the method only captured clearly refutable claims.
"It's worth noting that we were specifically looking for an increase in verifiably untrue statements, which we didn't find. But it could still be the case that AI is quietly increasing the volume of unverifiable claims, ones that can't be checked against existing fact-checking tools and infrastructure," said Jonas Dolezal, an AI researcher at Stanford University and co-author of the study.
A Gap Between Public Perception and Data
A parallel survey of 853 U.S. adults reveals a disconnect between perception and evidence. Most respondents believed all of the negative hypotheses, including those not backed by the data. For instance, 83% agreed that individual writing styles are disappearing, despite the study finding no evidence for this.
The data also reveals a stark divide in perception:
- People who rarely use AI were more likely to believe in negative effects than regular users (88.3% vs. 76.2%).
- Among self-identified AI skeptics, the gap widened even further (91.3% vs. 71.1%).
The team warns that the current trajectory could lead to "model collapse," where AI models poison their own training data by learning from a web increasingly filled with AI-generated content.
Their primary recommendations include:
- Adopting cryptographic provenance standards, like C2PA, to tag and track AI-generated content.
- Redesigning search and recommendation algorithms to reward semantic diversity rather than simple engagement metrics.
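As a rough illustration of the second recommendation, the Python sketch below reranks a list of items so that each pick is penalized for resembling what has already been chosen. It follows the general idea of maximal marginal relevance, which the article does not mention; the technique, the `trade_off` parameter, the word-overlap similarity, and the sample items are all assumptions made for the example, not the researchers' proposal.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Word-overlap similarity between two token sets (0 = disjoint, 1 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rerank_with_diversity(items, k, trade_off=0.5):
    """Greedily pick k items, balancing engagement against redundancy.

    items: list of (text, engagement_score) pairs, scores assumed in [0, 1].
    trade_off: 1.0 ranks purely by engagement; lower values reward diversity.
    """
    tokens = [set(text.lower().split()) for text, _ in items]
    selected: list[int] = []
    remaining = list(range(len(items)))

    def score(i: int) -> float:
        # Redundancy is the similarity to the closest already-selected item.
        redundancy = max(
            (jaccard(tokens[i], tokens[j]) for j in selected), default=0.0
        )
        return trade_off * items[i][1] - (1 - trade_off) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [items[i][0] for i in selected]


# Invented items and engagement scores, for illustration only.
feed = [
    ("ai tools boost productivity", 0.90),
    ("ai tools boost productivity today", 0.85),
    ("local bakery wins award", 0.50),
]

print(rerank_with_diversity(feed, k=2, trade_off=1.0))  # engagement only
print(rerank_with_diversity(feed, k=2, trade_off=0.5))  # diversity rewarded
```

With engagement alone, the two near-duplicate items take both slots. Once redundancy is penalized, the second slot goes to the less popular but different item.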
Co-author Maty Bohacek of Stanford confirmed that the team is working with the Internet Archive to build a continuous monitoring tool that tracks AI content on the web in real time.
A Final Thought from the Researchers
"Rather than forcing models to be perfectly compliant and agreeable, allowing them to have a more distinct personality or 'friction' might help them act as a creative partner rather than a replacement for human voice," said Jonas Dolezal of Stanford University.
Study Limitations
It is important to note the study's boundaries:
- Only English-language texts were analyzed.
- The analysis depends on the Pangram v3 detector's reliability, which may shift as language models evolve.
- Data came solely from the Internet Archive, which may not fully represent the entire web.
- Non-text formats (images, video) were excluded from the analysis.