News sites are locking out the Internet Archive to stop AI crawling. Is the ‘open web’ closing?

When the World Wide Web went live in the early 1990s, its founders hoped it would be a space for everyone to share information and collaborate. But today the free and open web is shrinking.

The Internet Archive has been recording the history of the Internet since 1996 and making it available to the public through the Wayback Machine. Now some of the world’s biggest news outlets are blocking the archive’s access to their pages.

Major publishers – including The Guardian, The New York Times, the Financial Times and USA Today – have confirmed they are ending the Internet Archive’s access to their content.

While publishers say they support the archive’s preservation mission, they argue that unrestricted access has unintended consequences, leaving journalism open to AI crawlers and members of the public trying to circumvent their paywalls.

Still, publishers don’t want to simply shut out AI crawlers. Instead, they want to sell their content to data-hungry tech companies. Their back catalogs of news, books and other media have become a hot commodity as data to train AI systems.

Robot readers

Generative AI systems such as ChatGPT, Copilot and Gemini require access to large archives of content (such as media content, books, art and academic research) for training and to answer user questions.

Publishers claim that tech companies accessed much of this content for free and without permission from copyright owners. Some began taking tech companies to court, claiming they had stolen their intellectual property. High-profile examples include The New York Times’ case against ChatGPT’s developer OpenAI and News Corp’s lawsuit against Perplexity AI.

Old news, new money

In response, some tech companies have entered into agreements to pay for access to publishers’ content. News Corp’s deal with OpenAI is reportedly worth more than $250 million over five years.

Similar deals have been struck between academic publishers and technology companies. Publishers like Taylor & Francis and Elsevier have come under scrutiny in the past for locking government-funded research behind commercial paywalls.

Now Taylor & Francis has signed a $10 million non-exclusive deal with Microsoft, giving the company access to more than 3,000 journals.

Publishers are also using technology to prevent unwanted AI bots from accessing their content, including the crawlers used by the Internet Archive to record Internet history. News publishers call the Internet Archive a “backdoor” to their catalogs, allowing unscrupulous tech companies to continue scraping their content.

The cost of making news free

The Wayback Machine has also been used by the public to bypass newspaper paywalls. It is understandable that media outlets want readers to pay for news.

News is a business, and the advertising revenue model is increasingly under pressure from the same tech companies that use news content for AI training and retrieval. But this comes at the expense of public access to credible information.

When newspapers first began posting their content online and making it available to the public for free in the late 1990s, they contributed to the sharing and collaboration ethos of the early Internet.

In retrospect, however, one commentator called free access the “original sin” of online news. The public became accustomed to reading news online for free, and as online business models changed, many small and medium-sized news companies struggled to finance their operations.

The opposite approach – putting all commercial news behind paywalls – has its own problems. As news publishers move to subscription-only models, people must either juggle multiple expensive subscriptions or limit how much news they consume. Otherwise, they are left with whatever news remains free online or is served up by social media algorithms. The result is a more closed, commercial Internet.

This isn’t the first time the Internet Archive has been in the crosshairs of publishers. The organization was previously sued over its Open Library project and found to have infringed copyright.

The past and future of the Internet

The Wayback Machine has served as the Internet’s public record for more than thirty years, used by researchers, educators, journalists, and amateur Internet historians.

Blocking access to major international newspapers will leave significant holes in the Internet’s public record.

Today, you can use the Wayback Machine to view the front page of The New York Times from June 1997, the first time the Internet Archive crawled the newspaper’s website. In another thirty years, Internet researchers and curious citizens will not have access to today’s front page, even if the Internet Archive still exists.

Today’s websites will become tomorrow’s historical documents. Without the preservation efforts of nonprofits like the Internet Archive, we risk losing vital records.

Despite the actions of commercial publishers and the emerging challenges of AI, nonprofits like the Internet Archive and Wikipedia strive to keep the dream of an open, collaborative, and transparent Internet alive.
