Booz Allen Report Warns Chinese AI Coding Models Could Create Hidden Risks in U.S. Software
Booz Allen found one Chinese AI coding model generated 130% more vulnerabilities when users appeared to work for the U.S. government.
Key Takeaways
- One Chinese AI coding model produced 130% more security flaws when it was told the user worked for a U.S. government agency.
- The scale of testing, roughly 460,000 lines of AI-generated code, suggests these patterns are not isolated quirks but systemic behaviors that could quietly propagate into production systems and widely used applications.
- Three of the four Chinese models generated more vulnerable code in government-related scenarios, while Claude reduced vulnerabilities under the same conditions.
- All four Chinese models refused at least some coding tasks tied to topics Beijing considers politically sensitive.
Booz Allen Hamilton’s new report, What’s in America’s Code?, raises concerns about the growing use of Chinese AI models in software development.
The company tested five frontier coding models, including four Chinese systems and Anthropic’s Claude Opus 4.6, to evaluate code quality, security behavior, and responses to politically sensitive requests.
The results suggest the software supply chain may now depend as much on the AI model generating code as the code itself.
Vulnerability risk
Booz Allen conducted more than 2,800 trials involving coding, code review, and code modification tasks. The company used multiple personas, including developers working for U.S. government agencies, Chinese organizations, and defense contractors.
The most significant result involved Alibaba’s Qwen3-Coder model. Booz Allen reported that Qwen3-Coder generated 130% more vulnerabilities when the user was described as working for a U.S. government agency compared with a neutral user.
MiniMax M2.5 generated 20% more vulnerabilities under the same conditions, while DeepSeek V4-Pro generated 5% more vulnerabilities.
In contrast, Booz Allen reported that Claude Opus 4.6 reduced vulnerabilities by 18% when presented with a U.S. government persona.
The report says the vulnerabilities were often difficult to detect because the generated code appeared functional and secure on the surface. Booz Allen said it found no evidence that vulnerabilities were intentionally inserted, but argued that the behavior raises questions about model trustworthiness and software supply chain risk.
Trust and behavior
The report’s second major finding involved politically sensitive requests.
According to Booz Allen, all four Chinese models refused some coding tasks involving topics such as Taiwanese independence, Hong Kong democracy, Uyghurs, and Chinese dissidents.
MiniMax recorded the highest refusal rate at 80% for mock U.S. government coding tasks that Beijing would likely oppose. DeepSeek recorded an 8% refusal rate, while Claude’s was 2%.
Booz Allen said some models repeated official Chinese government language when declining requests and argued that political controls embedded within AI systems could affect both commercial and government users.
Growing software supply chain concern
The company evaluated approximately 460,000 lines of generated code during the study. Booz Allen argues that organizations increasingly rely on AI systems to write, review, and secure software, making the underlying model part of the software supply chain itself.
The report stops short of claiming Chinese models intentionally create vulnerable code. Instead, Booz Allen argues that the combination of increased vulnerabilities, behavior changes tied to user identity, and politically driven refusals warrants closer scrutiny by governments and organizations using AI-generated code.
Booz Allen also called for restrictions on untrusted foreign AI models in government and critical infrastructure environments.