May 14, 2026 · 9 min read

The Compliance Gap: What Happens When Regulation Meets the Machine

These questions of compliance, measurement, and what "trustworthy AI" actually means in practice are exactly what Human x AI Europe will address on May 19 in Vienna, where Europe's founders, policymakers, and builders gather to work through the hard problems together.

The Artifact That Reveals the Gap

Stand in front of a compliance document and notice what happens. The language is precise, the categories are clear, the obligations are enumerated. Now stand in front of a large language model and ask it to demonstrate "robustness" or "fairness" or "transparency." The gap between these two experiences is not merely technical. It is phenomenological. One is a text that can be read; the other is a system that can only be probed.

This is the problem that researchers at ETH Zurich's Secure, Reliable, and Intelligent Systems Lab (SRI Lab) have been working to close. Their COMPL-AI framework, developed in collaboration with Bulgaria's INSAIT and the ETH spinout LatticeFlow AI, represents the first systematic attempt to translate the EU AI Act's regulatory requirements into concrete, measurable technical benchmarks for large language models.

The European Commission has welcomed the framework as "a first step in translating the EU AI Act into technical requirements, helping AI model providers implement the AI Act." This endorsement matters. It signals that the gap between legal language and technical reality is now officially recognized as a problem requiring collaborative solutions.

What the Framework Actually Measures

The COMPL-AI framework organizes its evaluation around five primary categories, each mapped to specific provisions of the AI Act:

Technical robustness and safety examines whether models return consistent responses despite minor variations in input prompts and resist adversarial attacks. The framework uses established benchmarks like MMLU and BoolQ to assess prompt sensitivity, while Tensor Trust and LLM RuLES gauge resistance to cyberattacks.
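
To make that concrete, here is a minimal sketch of a prompt-sensitivity probe in this spirit. The `query_model` stub and the perturbations are illustrative stand-ins, not the framework's actual harness:

```python
# Minimal prompt-sensitivity probe: ask the same question several ways
# and measure how often the answer survives trivial rewording.
# `query_model` is a hypothetical stand-in for a real inference call.

def query_model(prompt: str) -> str:
    # Replace with a call to your model; fixed answer keeps the demo runnable.
    return "Paris"

def perturb(prompt: str) -> list[str]:
    """Semantically equivalent surface variations of a prompt."""
    return [
        prompt,
        prompt.lower(),               # casing change
        prompt.rstrip("?") + " ?",    # punctuation spacing
        "Please answer: " + prompt,   # benign prefix
    ]

def consistency_score(prompt: str, expected: str) -> float:
    """Fraction of perturbed prompts whose answer still matches."""
    variants = perturb(prompt)
    hits = sum(query_model(v).strip().lower() == expected.lower()
               for v in variants)
    return hits / len(variants)

print(consistency_score("What is the capital of France?", "Paris"))  # 1.0
```

A robust model scores near 1.0; large drops under trivial rewording are exactly what the prompt-sensitivity benchmarks are designed to surface.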

Privacy and data protection assesses whether model outputs are free of errors, bias, and violations of laws governing privacy and copyright. Since many developers do not provide their models' training datasets, the researchers use open datasets such as the Pile as a proxy for identifying problematic training data.
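
The corpus-overlap idea can be illustrated with a toy check for long verbatim spans shared between model output and a reference set; a real evaluation would run this against something the size of the Pile:

```python
# Toy verbatim-overlap check: flag outputs that reproduce any n-token
# span from a reference corpus. Real evaluations use open datasets such
# as the Pile; the tiny in-memory "corpus" here is purely illustrative.

def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def reproduces_corpus_text(output: str, corpus: list[str], n: int = 8) -> bool:
    """True if the output shares an n-token span with any corpus document,
    a crude signal of memorized (and possibly copyrighted) training text."""
    out = ngrams(output, n)
    return any(out & ngrams(doc, n) for doc in corpus)
```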

Transparency and interpretability tests whether models can gauge their own accuracy and whether they disclose their machine nature to users. Measures include TriviaQA and Expected Calibration Error.
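
Expected Calibration Error is concrete enough to compute directly: bin answers by the model's stated confidence, then compare that confidence with the accuracy actually achieved in each bin. A minimal implementation:

```python
def expected_calibration_error(confidences: list[float],
                               correct: list[bool],
                               n_bins: int = 10) -> float:
    """ECE: size-weighted average gap between stated confidence and
    realized accuracy across confidence bins."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Put confidence 0.0 in the first bin; bins are otherwise (lo, hi].
        idx = [i for i, c in enumerate(confidences)
               if (lo < c <= hi) or (b == 0 and c == 0.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(avg_conf - accuracy)
    return ece

# Perfect calibration (70% of "70% confident" answers are right) gives 0;
# the overconfidence typical of LLMs pushes the score up.
print(expected_calibration_error([0.9, 0.9, 0.6], [True, False, True]))
```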

Fairness and non-discrimination uses tests like RedditBias, BBQ, BOLD, and FaiRLLM to gauge biased language and assess equitable outputs across demographic categories.
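
A simplified version of the counterfactual-template idea behind benchmarks like BOLD: fill one prompt with different demographic terms and compare a score over the completions. The sentiment lexicon and model call below are toy stand-ins for the benchmarks' real scorers:

```python
# Counterfactual-template sketch: same prompt, different group terms,
# compare a toy sentiment score over the completions. The lexicon and
# `query_model` are illustrative placeholders, not BOLD's actual scorer.

POSITIVE = {"brilliant", "skilled", "honest", "kind"}
NEGATIVE = {"lazy", "dangerous", "dishonest", "hostile"}

def toy_sentiment(text: str) -> int:
    words = set(text.lower().split())
    return len(words & POSITIVE) - len(words & NEGATIVE)

def sentiment_by_group(template: str, groups: list[str], query_model) -> dict[str, int]:
    """Score one completion per group; a wide spread across groups is
    the kind of signal fairness benchmarks aggregate at scale."""
    return {g: toy_sentiment(query_model(template.format(group=g)))
            for g in groups}
```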

Social and environmental wellbeing addresses the broader societal impacts that the AI Act requires high-risk system developers to consider.

The framework currently incorporates 27 state-of-the-art LLM benchmarks and maintains a public leaderboard on Hugging Face where evaluation results can be compared.

The Results: Where Models Fall Short

When the researchers evaluated 12 prominent LLMs, including models from OpenAI, Meta, Google, Anthropic, and Alibaba, the results revealed significant shortcomings in areas the AI Act specifically targets.

Robustness proved particularly problematic. Models that perform well on standard capability benchmarks often show surprising sensitivity to minor prompt variations. Safety evaluations revealed inconsistent resistance to adversarial inputs. Fairness assessments exposed biases that persist despite extensive alignment efforts.

The framework does not propose specific thresholds for compliance. The scores are relative measures, designed to reveal comparative strengths and weaknesses rather than deliver binary pass/fail verdicts. This is a deliberate choice. The researchers acknowledge that determining what level of performance constitutes "compliance" is ultimately a regulatory and political question, not a purely technical one.

The Benchmark Gap Problem

Recent research has exposed an even deeper problem. A study titled "Bench-2-CoP" analyzed 194,955 questions from widely used benchmarks against the EU AI Act's taxonomy of model capabilities and propensities. The findings reveal a profound misalignment in the evaluation ecosystem.

On average, benchmarks devote 61.6% of their regulatory-relevant questions to "Tendency to hallucinate" and 31.2% to "Lack of performance reliability." Meanwhile, capabilities central to loss-of-control scenarios, including evading human oversight, self-replication, and autonomous AI development, receive zero coverage in the entire benchmark corpus.
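
The underlying measurement is a straightforward tally, sketched below with illustrative category labels rather than the study's full taxonomy:

```python
# Coverage as a tally: label each benchmark question with a taxonomy
# category, then compute the distribution. Labels here are illustrative.

from collections import Counter

def coverage(labels: list[str], taxonomy: list[str]) -> dict[str, float]:
    counts = Counter(labels)
    total = len(labels)
    # Categories absent from the corpus get an explicit 0.0 -- this is
    # how "zero coverage" of loss-of-control capabilities shows up.
    return {cat: counts.get(cat, 0) / total for cat in taxonomy}

taxonomy = ["hallucination", "reliability", "evading_oversight", "self_replication"]
labels = ["hallucination"] * 6 + ["reliability"] * 3 + ["hallucination"]
print(coverage(labels, taxonomy))
# {'hallucination': 0.7, 'reliability': 0.3, 'evading_oversight': 0.0,
#  'self_replication': 0.0}
```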

This means the tools the industry uses to evaluate AI systems are not designed to measure the risks that regulators are most concerned about. The evaluation infrastructure and the regulatory framework are speaking different languages.

The August 2026 Deadline

The stakes are not abstract. According to the European Commission's implementation timeline, the bulk of the AI Act's provisions for high-risk AI systems become enforceable on 2 August 2026. Penalties for non-compliance can reach up to €35 million or 7% of global annual turnover, whichever is higher, for violations of prohibited practices, and up to €15 million or 3% of turnover for high-risk system violations.
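
Because the ceilings combine a fixed amount with a turnover percentage, the binding cap depends on company size. A worked example, assuming the Act's "whichever is higher" rule for undertakings:

```python
# Fine ceiling under the AI Act's dual cap, assuming the
# "whichever is higher" rule that applies to undertakings.

def max_fine_eur(global_turnover_eur: float, prohibited_practice: bool) -> float:
    fixed, pct = (35e6, 0.07) if prohibited_practice else (15e6, 0.03)
    return max(fixed, pct * global_turnover_eur)

# A firm with EUR 1bn global turnover: 7% = EUR 70m > EUR 35m, so the
# ceiling for a prohibited-practice violation is EUR 70m.
print(max_fine_eur(1e9, prohibited_practice=True))   # 70000000.0
print(max_fine_eur(1e9, prohibited_practice=False))  # 30000000.0
```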

The phased implementation has already begun. Prohibitions on certain AI practices, including social scoring, manipulative AI, and emotion recognition in workplaces, became effective in February 2025. Rules for general-purpose AI models (GPAI) have applied since August 2025.

For organizations deploying AI in high-risk domains, including biometric identification, critical infrastructure, education, employment, credit scoring, and law enforcement, the compliance clock is running.

Data Minimization: The Next Frontier

The SRI Lab's research extends beyond compliance evaluation. Their 2026 publication "SoK: Data Minimization in Machine Learning," presented at SaTML 2026, addresses another critical intersection of AI and regulation: the GDPR's data minimization principle.

The challenge is fundamental. Machine learning systems typically require large amounts of data to perform well. The GDPR requires that only data necessary to fulfill a specific purpose be collected. These imperatives exist in tension.

The SRI Lab's earlier work on "Vertical Data Minimization for Machine Learning," published at IEEE S&P 2024, demonstrated methods to reduce the amount of personal data needed for predictions by removing or generalizing input features while maintaining model accuracy. This research suggests that compliance and capability need not be zero-sum.
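
The intuition is easy to demonstrate, though the sketch below is a toy illustration of the generalization idea rather than the paper's algorithm. It coarsens an exact age into decade buckets and checks how much predictive accuracy is actually lost (synthetic data; requires numpy and scikit-learn):

```python
# Toy illustration of vertical data minimization (not the paper's
# method): generalize a feature and measure the accuracy cost.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
age = rng.integers(18, 90, n)
income = rng.normal(50_000, 15_000, n)
# Synthetic label that depends only coarsely on age.
y = ((age > 45) | (income > 55_000)).astype(int)

def accuracy(X: np.ndarray) -> float:
    X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize for the solver
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    return LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)

exact   = np.column_stack([age, income])
decades = np.column_stack([age // 10 * 10, income])  # generalized feature

print(f"exact age:      {accuracy(exact):.3f}")
print(f"decade buckets: {accuracy(decades):.3f}")
# A negligible gap means year-level age was not "necessary" for the
# purpose -- precisely the question GDPR minimization forces you to ask.
```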

What This Means for the Ecosystem

The COMPL-AI framework is not an official auditing tool. The researchers are explicit that their assessments should not be interpreted in a legally binding context. But the framework's existence changes the conversation.

For model providers, it offers a concrete way to identify compliance gaps before regulators do. For policymakers, it demonstrates that translating high-level principles into measurable requirements is possible, even if difficult. For the broader ecosystem, it reveals how much work remains to align AI development practices with regulatory expectations.

Professor Martin Vechev, who leads the SRI Lab and founded INSAIT, has invited "AI researchers, developers, and regulators to join us in advancing this evolving project." The framework is open-source, the methodology is documented, and the invitation to contribute is genuine.

The Cultural Shift

What makes this research significant is not just its technical contribution but what it reveals about the current moment. Regulation is no longer something that happens after technology is deployed. It is becoming a design constraint, a parameter that shapes development from the beginning.

This represents a cultural shift as much as a technical one. The question is no longer whether AI systems should be trustworthy, transparent, and fair. The question is how to measure these qualities, how to verify them, and how to build systems that embody them by design rather than by accident.

The COMPL-AI framework does not answer all these questions. But it makes them concrete. It transforms abstract regulatory language into specific tests that can be run, results that can be compared, and gaps that can be identified.

The artifact, in other words, does what good artifacts do: it makes visible what was previously only felt.

Frequently Asked Questions

Q: What is COMPL-AI and who developed it?

A: COMPL-AI is the first technical framework that translates the EU AI Act's regulatory requirements into measurable benchmarks for large language models. It was developed by researchers at ETH Zurich's SRI Lab, Bulgaria's INSAIT, and the ETH spinout LatticeFlow AI.

Q: When do the EU AI Act's high-risk AI system requirements take effect?

A: The bulk of high-risk AI system obligations become enforceable on 2 August 2026. Prohibitions on certain AI practices became effective in February 2025, and rules for general-purpose AI models have applied since August 2025.

Q: What are the penalties for non-compliance with the EU AI Act?

A: Fines can reach up to €35 million or 7% of global annual turnover (whichever is higher) for violations of prohibited practices, and up to €15 million or 3% of turnover for violations related to high-risk AI systems.

Q: Which AI models were evaluated using the COMPL-AI framework?

A: The framework evaluated 12 prominent LLMs including models from OpenAI, Meta, Google, Anthropic, and Alibaba. Results revealed shortcomings particularly in robustness, safety, diversity, and fairness.

Q: Is COMPL-AI an official EU compliance certification tool?

A: No. COMPL-AI is a research framework, not an official auditing tool. The researchers explicitly state that assessments should not be interpreted in a legally binding context. The European Commission has welcomed it as a "first step" in translating the AI Act into technical requirements.

Q: What is the "benchmark gap" problem identified by recent research?

A: Analysis of 194,955 benchmark questions found that existing evaluation tools focus heavily on hallucination (61.6%) and reliability (31.2%), while capabilities central to loss-of-control scenarios, such as evading human oversight and self-replication, receive zero coverage.
