AI crawlers cause Wikimedia Commons bandwidth demands to surge 50%

The Wikimedia Foundation, the umbrella organization of Wikipedia and a dozen or so other crowdsourced knowledge projects, said Wednesday that bandwidth consumption for multimedia downloads from Wikimedia Commons has surged by 50% since January 2024.
The reason, the outfit wrote in a blog post Tuesday, isn't growing demand from knowledge-thirsty humans, but automated, data-hungry scrapers looking to train AI models.
"Our infrastructure is built to sustain sudden traffic spikes from humans during high-interest events, but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs," the post reads.
Wikimedia Commons is a freely accessible repository of images, videos and audio files that are available under open licenses or are otherwise in the public domain.
Wikimedia says that almost two-thirds (65%) of the most "expensive" traffic, meaning the most resource-intensive in terms of the kind of content consumed, came from bots, even though those bots account for only 35% of overall pageviews. The reason for this disparity, according to Wikimedia, is that frequently viewed content is cached closer to the user, while less frequently accessed content is stored farther away in the "core data center," which is more expensive to serve content from. That is the kind of content bots typically go looking for.
"While human readers tend to focus on specific – often similar – topics, crawler bots tend to 'bulk read' larger numbers of pages and also visit the less popular pages," Wikimedia writes. "This means these types of requests are more likely to get forwarded to the core data center, which makes it much more expensive in terms of consumption of our resources."
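To make that disparity concrete, here is a toy sketch of the dynamic, with made-up numbers and a deliberately simplified cache model rather than anything resembling Wikimedia's actual infrastructure: an edge cache holds only the most popular pages, human readers mostly request popular pages, and a bulk crawler samples the whole catalog.

```python
import random

# Toy model (hypothetical numbers, not Wikimedia's real setup): an edge cache
# holds only the most popular pages; anything else must be served from the
# more expensive "core data center" path described above.
random.seed(42)

NUM_PAGES = 10_000
CACHE_SIZE = 500  # only the 500 most popular pages sit in the edge cache
cached_pages = set(range(CACHE_SIZE))

def human_request():
    # Human readers cluster on popular topics: a rough, heavy-tailed skew.
    return min(int(random.paretovariate(1.2)), NUM_PAGES - 1)

def crawler_request():
    # A bulk crawler "reads" everything, popular or not, roughly uniformly.
    return random.randrange(NUM_PAGES)

def cache_hit_rate(request_fn, n=100_000):
    hits = sum(1 for _ in range(n) if request_fn() in cached_pages)
    return hits / n

print(f"human cache hit rate:   {cache_hit_rate(human_request):.1%}")
print(f"crawler cache hit rate: {cache_hit_rate(crawler_request):.1%}")
```

Run as-is, the skewed "human" traffic hits the cache the vast majority of the time, while the uniform "crawler" traffic hits it only about as often as the cache's share of the catalog, which is the gap Wikimedia is describing.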
The long and short of all this is that the Wikimedia Foundation's site reliability team has to spend a lot of time and resources blocking crawlers to avert disruption for regular users. And all this before considering the cloud costs the foundation is facing.
In truth, this represents part of a fast-growing trend that threatens the very existence of the open internet. Last month, software engineer and open source advocate Drew DeVault bemoaned the fact that AI crawlers ignore "robots.txt" files designed to ward off automated traffic. And "pragmatic engineer" Gergely Orosz complained last week that AI scrapers from companies such as Meta have driven up bandwidth demands for his own projects.
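For context, honoring robots.txt is a single check before fetching anything. The snippet below is a minimal sketch of what a well-behaved crawler does, using Python's standard library; the user agent string is hypothetical, and the point of DeVault's complaint is that many AI crawlers skip this step entirely.

```python
from urllib import robotparser

# Minimal sketch of a polite crawler's first step: fetch and obey robots.txt.
# "ExampleAIBot" is a made-up user agent used purely for illustration.
rp = robotparser.RobotFileParser()
rp.set_url("https://commons.wikimedia.org/robots.txt")
rp.read()  # downloads and parses the site's robots.txt

target = "https://commons.wikimedia.org/wiki/Special:Random"
if rp.can_fetch("ExampleAIBot", target):
    print("robots.txt allows fetching", target)
else:
    print("robots.txt disallows fetching", target, "- a polite crawler stops here")
```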
While open source infrastructure, in particular, is in the firing line, developers are fighting back with "cleverness and vengeance," as WAN wrote last week. Some tech companies are doing their bit to tackle the problem, too. Cloudflare, for example, recently launched AI Labyrinth, which uses AI-generated content to slow crawlers down.
However, it is very much a cat-and-mouse game, one that could ultimately force many publishers to duck for cover behind logins and paywalls, to the detriment of everyone who uses the web today.