
Leaked data exposes a Chinese AI censorship machine

A complaint about poverty in rural China. A news report about a corrupt Communist Party member. A cry for help about corrupt police officers shaking down entrepreneurs.

These are just a few of the 133,000 examples fed into an advanced large language model that is designed to automatically flag any piece of content considered sensitive by the Chinese government.

A leaked database seen by WAN reveals that China has developed an AI system that supercharges its already formidable censorship machine, extending far beyond traditional taboos like the Tiananmen Square massacre.

The system appears primarily geared toward censoring Chinese citizens online, but it could be used for other purposes, such as improving Chinese AI models' already extensive censorship.

This photo, taken on June 4, 2019, shows the Chinese flag behind razor wire at a housing compound in Yengisar, south of Kashgar, in China's western Xinjiang region. Image Credits: Greg Baker / AFP / Getty Images

Xiao Qiang, a researcher at UC Berkeley who studies Chinese censorship and who also examined the dataset, told WAN that it was "clear evidence" that the Chinese government or its affiliates want to use LLMs to improve repression.

"Unlike traditional censorship mechanisms, which rely on human labor for keyword-based filtering and manual review, an LLM trained on such instructions would significantly improve the efficiency and granularity of state-led information control," Qiang told WAN.

This adds to growing evidence that authoritarian regimes are quickly adopting the latest AI technology. In February, for example, OpenAI said it caught several Chinese entities using LLMs to track anti-government posts and smear Chinese dissidents.


The Chinese embassy in Washington, D.C., told WAN in a statement that it opposes "groundless attacks and slanders against China" and that China attaches great importance to developing ethical AI.

Data found in plain sight

The dataset was discovered by security researcher NetAskari, who shared a sample with TechCrunch after finding it stored in an unsecured Elasticsearch database hosted on a Baidu server.

This doesn't indicate any involvement from either company; all kinds of organizations store their data with such providers.

There is no indication of who, exactly, built the dataset, but records show that the data is recent, with its latest entries dating from December 2024.

An LLM for detecting dissent

In language eerily reminiscent of how people prompt ChatGPT, the system's creator tasks an unnamed LLM with figuring out whether a piece of content has anything to do with sensitive topics related to politics, social life, and the military. Such content is deemed "highest priority" and must be flagged immediately.

Top-priority topics include pollution and food safety scandals, financial fraud, and labor disputes, which are hot-button issues in China that sometimes lead to public protests, such as the 2012 Shifang anti-pollution protests.

Every form of "political satire" is explicitly targeted. For example, if someone uses a historical analogy to make a point about "current political figures," it must be flagged instantly, as must anything related to "Taiwan politics." Military matters are extensively targeted, including reports of military movements, exercises, and weaponry.

A snippet of the dataset can be seen below. The code inside it references prompt tokens and LLMs, confirming that the system uses an AI model to do its bidding:

An excerpt of JSON code from the dataset referencing prompt tokens and LLMs. Much of the content is in Chinese. Image Credits: Charles Rollet
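The article does not reproduce the full tasking, but a request of this kind can be sketched roughly as below. The prompt wording, field names, and sample content are assumptions for illustration, not the actual leaked code:

```python
import json

# Hypothetical: the prompt text and message structure are illustrative guesses,
# loosely modeled on the ChatGPT-style tasking the article describes.
SYSTEM_PROMPT = (
    "Decide whether the following content touches on sensitive topics "
    "related to politics, social life, or the military. "
    'Reply with a JSON object: {"priority": "highest"} or {"priority": "normal"}.'
)

def build_request(content: str) -> str:
    """Assemble one classification request for a single piece of content."""
    return json.dumps(
        {
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": content},
            ]
        },
        ensure_ascii=False,
    )

# Example: a news report about local-official corruption (in Chinese).
request = build_request("一段关于地方官员腐败的新闻报道")
print(json.loads(request)["messages"][1]["content"])
```

Each of the 133,000 entries would then be a content string paired with the label the model is expected to return.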

Inside the training data

From this enormous collection of 133,000 examples that the LLM must evaluate for censorship, TechCrunch gathered 10 representative pieces of content.

Topics likely to stir up social unrest are a recurring theme. One excerpt, for example, is a post by a business owner complaining about corrupt local police officers shaking down entrepreneurs, a rising problem in China as its economy struggles.

Another piece of content laments rural poverty in China, describing run-down towns with only elderly people and children left in them. There is also a news report about the Chinese Communist Party (CCP) expelling a local official for severe corruption and for believing in "superstitions" instead of Marxism.

There is extensive material related to Taiwan and military matters, such as commentary about Taiwan's military capabilities and details about a new Chinese jet fighter. The Chinese word for Taiwan (台湾) alone is mentioned more than 15,000 times in the data, according to a search by WAN.
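A count like that can be reproduced with a simple scan over the dump. A minimal sketch, assuming the data is exported as JSON lines with a `content` field (an assumed schema, not the leaked one):

```python
import json

def count_term(lines, term="台湾"):
    """Count occurrences of a term across the content field of JSON-lines records."""
    total = 0
    for line in lines:
        record = json.loads(line)
        total += record.get("content", "").count(term)
    return total

# Tiny illustrative input; the real dump reportedly contains over 15,000 matches.
sample = [
    json.dumps({"content": "关于台湾军事能力的评论"}, ensure_ascii=False),
    json.dumps({"content": "与台湾无关的其他内容"}, ensure_ascii=False),
]
print(count_term(sample))  # 2
```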

Subtle dissent also appears to be targeted. One snippet included in the database is an anecdote about the fleeting nature of power that uses the popular Chinese idiom "when the tree falls, the monkeys scatter."

Power transitions are an especially touchy topic in China thanks to its authoritarian political system.

Built for "public opinion work"

The dataset doesn't include any information about its creators. But it does say that it's intended for "public opinion work," which offers a strong clue that it's meant to serve Chinese government goals, one expert told WAN.

Michael Caster, the Asia program manager of rights organization Article 19, explained that "public opinion work" is overseen by a powerful Chinese government regulator, the Cyberspace Administration of China (CAC), and typically refers to censorship and propaganda efforts.


The end goal is to ensure that Chinese government narratives are protected online, while any alternative views are purged. Chinese President Xi Jinping has himself described the internet as the "frontline" of the CCP's "public opinion work."

Repression is getting smarter

The dataset examined by WAN is the latest evidence that authoritarian governments are seeking to leverage AI for repressive purposes.

OpenAI released a report last month revealing that an unidentified actor, likely operating from China, used generative AI to monitor social media conversations, in particular those advocating for human rights protests against China, and forward them to the Chinese government.

Contact us

If you know more about how AI is used in state oppression, you can contact Charles Rollet securely on Signal at charlesrollet.12. You can also contact WAN via SecureDrop.

OpenAI also found that the technology was being used to generate comments highly critical of a prominent Chinese dissident, Cai Xia.

Traditionally, China's censorship methods rely on more basic algorithms that automatically block content mentioning blacklisted terms, like "Tiananmen massacre" or "Xi Jinping," as many users experienced when first trying DeepSeek.

But newer AI technology, like LLMs, can make censorship more efficient by finding even subtle criticism at a vast scale. Some AI systems can also keep improving as they ingest more and more data.
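The gap between the two approaches can be shown with a toy example: a blacklist filter matches only exact strings, so an allusive idiom passes through untouched. The blacklist entries and sentences here are illustrative, not drawn from any real blocklist:

```python
# Illustrative blacklist entries only; not from any real censorship system.
BLACKLIST = {"Tiananmen massacre", "Xi Jinping"}

def keyword_filter(text: str) -> bool:
    """Classic approach: block only if an exact blacklisted string appears."""
    return any(term in text for term in BLACKLIST)

direct = "An essay about the Tiananmen massacre."
allusive = "When the tree falls, the monkeys scatter."  # idiom alluding to a fall from power

print(keyword_filter(direct))    # True: exact match is caught
print(keyword_filter(allusive))  # False: the allusion slips past keywords
```

An LLM classifier, by contrast, could be tasked with judging the allusive sentence on meaning rather than on surface strings, which is precisely the step change the leaked dataset points to.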

"I think it's crucial to highlight how AI-driven censorship is evolving, making state control over public discourse even more sophisticated, especially at a time when Chinese AI models such as DeepSeek are making headwaves," Xiao, the Berkeley researcher, told WAN.

