
New project makes Wikipedia data more accessible to AI

On Wednesday, Wikimedia Deutschland announced a new database that will make Wikipedia’s wealth of knowledge more accessible to AI models.

Called the Wikidata Embedding Project, the system applies vector-based semantic search, a technique that helps computers understand the meaning and relationships between words, to the existing data on Wikipedia and its sister platforms, which consists of nearly 120 million entries.
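To make the idea of vector-based semantic search concrete, here is a minimal sketch, not the project's actual pipeline: the embedding model and the toy entries below are assumptions chosen purely for illustration.

```python
# Minimal sketch of vector-based semantic search (illustrative only).
# The model name and the toy "entries" are assumptions, not the
# Wikidata Embedding Project's real index or data.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

# Toy stand-ins for Wikidata entries.
entries = [
    "Marie Curie, physicist and chemist, pioneer of radioactivity research",
    "Bell Labs, American industrial research laboratory",
    "Eiffel Tower, wrought-iron lattice tower in Paris",
]

# Embed the entries and a natural-language query into the same vector space.
entry_vectors = model.encode(entries, convert_to_tensor=True)
query_vector = model.encode("famous scientist", convert_to_tensor=True)

# Rank entries by cosine similarity: semantically related entries score
# highest even when they share no exact keywords with the query.
scores = util.cos_sim(query_vector, entry_vectors)[0]
for entry, score in sorted(zip(entries, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {entry}")
```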

Combined with new support for the Model Context Protocol (MCP), a standard that helps AI systems communicate with data sources, the project makes the data more accessible to natural-language queries from LLMs.
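For readers unfamiliar with MCP, the sketch below shows what a minimal MCP server exposing a search tool can look like using the official MCP Python SDK. The server name, tool name, and lookup logic are hypothetical; this is not the project's actual integration.

```python
# Hedged sketch of an MCP server exposing a search tool via the official
# MCP Python SDK. The server name, tool, and placeholder lookup logic are
# assumptions for illustration; the real Wikidata MCP service is not shown.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("wikidata-demo")  # assumed server name


@mcp.tool()
def semantic_search(query: str, limit: int = 5) -> list[str]:
    """Return entries semantically related to a natural-language query."""
    # Placeholder: a real server would query the vector index here.
    return [f"result {i} for '{query}'" for i in range(limit)]


if __name__ == "__main__":
    mcp.run()  # serves the Model Context Protocol (stdio transport by default)
```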

The project was undertaken by Wikimedia's German branch in collaboration with the neural search company Jina.AI and DataStax, a real-time training-data company owned by IBM.

Wikidata has offered machine-readable data from Wikimedia properties for years, but the preexisting tools only allowed for keyword searches and SPARQL queries, a specialized query language. The new system will work better with retrieval-augmented generation (RAG) systems that let AI models pull in external information, giving developers the chance to ground their models in knowledge verified by Wikipedia editors.
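For contrast with the new semantic search, here is what the older SPARQL path looks like: a small query against the public Wikidata Query Service that lists people whose occupation is "scientist." The endpoint and the Q/P identifiers are standard Wikidata conventions; the surrounding Python is just one way to send the request.

```python
# Example of the preexisting access path: a SPARQL query against the
# public Wikidata Query Service (https://query.wikidata.org/sparql).
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

# Find ten humans (P31 = instance of, Q5 = human) whose occupation
# (P106) is "scientist" (Q901), with English labels.
QUERY = """
SELECT ?person ?personLabel WHERE {
  ?person wdt:P31 wd:Q5 ;
          wdt:P106 wd:Q901 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""

response = requests.get(
    SPARQL_ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "wikidata-sparql-example/0.1"},
)
response.raise_for_status()

for row in response.json()["results"]["bindings"]:
    print(row["personLabel"]["value"], row["person"]["value"])
```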

The data is also structured to provide crucial semantic context. Querying the database for the word "scientist," for instance, will produce lists of prominent nuclear scientists as well as scientists who worked at Bell Labs. The results also include translations of the word "scientist" into different languages, a Wikimedia-cleared image of scientists at work, and extrapolations to related concepts such as "researcher" and "scholar."

The database is publicly accessible on Toolforge. Wikidata is also hosting a webinar for interested developers on October 9.

The new project comes as AI developers scramble for high-quality data sources that can be used to fine-tune models. Training systems themselves have become more sophisticated, often assembled as complex training environments rather than simple datasets, but they still require carefully curated data to function well. For deployments that require high accuracy, the need for reliable data is particularly urgent, and Wikidata's editor-verified entries offer an alternative to catch-all datasets like the Common Crawl, a huge collection of web pages scraped from across the internet.

In some cases, the push for high-quality data can have expensive consequences for AI labs. In August, Anthropic offered to settle a lawsuit with a group of authors whose works had been used as training material, agreeing to pay $1.5 billion to put an end to claims of wrongdoing.

In a press statement, Wikidata AI project manager Philippe Saadé emphasized his project's independence from major AI labs and big tech companies. "This Embedding Project launch shows that powerful AI doesn't have to be controlled by a handful of companies," Saadé told reporters. "It can be open, collaborative, and built to serve everyone."
