Before your AI agent acts, prove it can stop.
Builder incentives are not buyer incentives. We run 75–150 adversarial scenarios against your deployment and hand back a go/no-go memo your CISO and GC can rely on.
The agent passed internal QA.
Then it didn't stop.
Internal teams test that an agent works. Almost nobody tests what it does when an adversary, an edge case, or a malformed instruction tries to make it act without authorization. That gap is where the money leaves the building.
Stolen in SOL. Company shut down.
An AI trading agent executed autonomous transactions without adequate authorization controls. It passed internal QA. Nobody ran adversarial authorization scenarios. An acceptance test catches exactly this — before it reaches a customer.
You are deploying into a known failure rate.
88% of enterprises that deployed AI agents had a security incident within 12 months. 80% of the Fortune 500 now run agents in production. Gartner expects 40% of agentic projects to be canceled by 2027 over cost and safety.
The EU AI Act Article 9 enforcement deadline is August 2, 2026. Documented risk management becomes mandatory for high-risk AI. An AAT Labs report is the documentation.
Four failure surfaces.
One independent verdict.
Every test is scoped to your deployment, not a generic benchmark. We attack the authorization boundary, the tool-call surface, the instruction channel, and the drift over model updates.
Can it be made to act without permission?
Adversarial scenarios that try to push the agent past its authority boundary — unauthorized transfers, privilege escalation, scope creep.
What happens when a tool returns garbage?
We poison tool outputs, simulate API failures, and feed malformed responses to see whether the agent fails safe or keeps acting on bad data.
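A tool-fault scenario of this kind can be sketched in a few lines of Python. Everything below is hypothetical and for illustration only: `get_balance`, `poisoned`, `failing`, and `plan_transfer` are stand-ins, not AAT Labs tooling or any real agent framework. The toy agent step is assumed to validate the tool response and halt rather than act on malformed data.

```python
from typing import Any, Callable


def get_balance(account: str) -> dict[str, Any]:
    """Hypothetical healthy tool: returns a well-formed balance record."""
    return {"account": account, "balance": 1200.0}


def poisoned(payload: Any) -> Callable[[str], Any]:
    """Build a fake tool that ignores its input and returns a chosen payload."""
    def tool(account: str) -> Any:
        return payload
    return tool


def failing(account: str) -> Any:
    """Fake tool that simulates an API failure."""
    raise TimeoutError("simulated API failure")


def plan_transfer(tool: Callable[[str], Any], account: str, amount: float) -> str:
    """Toy agent step (an assumption): act only on a validated tool response."""
    try:
        response = tool(account)
    except Exception:
        return "halt"  # tool failure: stop instead of guessing
    balance = response.get("balance") if isinstance(response, dict) else None
    if isinstance(balance, bool) or not isinstance(balance, (int, float)):
        return "halt"  # malformed or missing field
    if balance < amount:
        return "halt"  # insufficient funds
    return "transfer"


scenarios = {
    "healthy": get_balance,
    "garbage_string": poisoned("<html>502 Bad Gateway</html>"),
    "wrong_type": poisoned({"balance": "unlimited"}),
    "missing_field": poisoned({}),
    "api_timeout": failing,
}
results = {name: plan_transfer(tool, "acct-1", 100.0) for name, tool in scenarios.items()}
print(results)
```

In this sketch only the healthy tool leads to a transfer; every poisoned or failing tool leads to a halt. A real scenario would replace the toy agent step with the deployed agent and record the full transcript.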
Will a hidden prompt override your policy?
Prompt injection through documents, user input, and retrieved context — testing whether your system prompt holds under pressure.
Does the new model version still behave?
Every model swap is a silent behavior change. Drift Watch re-runs the full suite on every update and flags regressions before they ship.
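The regression check behind a re-run like this can be sketched as a comparison of per-scenario verdicts before and after a model update. This is a minimal Python illustration under stated assumptions: the scenario IDs and the `find_regressions` helper are hypothetical, and treating any scenario that moves from "pass" to anything else as a regression is an assumption, not a documented AAT Labs rule.

```python
def find_regressions(baseline: dict[str, str], candidate: dict[str, str]) -> list[str]:
    """Return scenario IDs that passed on the baseline model but not on the candidate."""
    regressions = []
    for scenario_id, old_verdict in baseline.items():
        # A scenario missing from the candidate run is treated as not passing.
        new_verdict = candidate.get(scenario_id, "missing")
        if old_verdict == "pass" and new_verdict != "pass":
            regressions.append(scenario_id)
    return sorted(regressions)


# Hypothetical verdicts from two runs of the same suite.
baseline = {"auth-001": "pass", "tool-014": "pass", "inject-007": "drift"}
candidate = {"auth-001": "pass", "tool-014": "drift", "inject-007": "drift"}

regressions = find_regressions(baseline, candidate)
print(regressions)  # ['tool-014']
```

Scenarios that were already failing before the update (here `inject-007`) are not reported as regressions; they belong on the existing fix list rather than in the delta.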
Ten days. Fixed price.
A verdict you can sign off on.
No open-ended retainer, no tooling to install. We model the threats specific to your deployment, run them, and deliver a board-grade memo.
Scope & Threat Model
Map the agent's authority, tools, and data access. Define what "must never happen" means for your business.
Intake call + access
Build Scenarios
Author 75–150 adversarial scenarios across the four failure surfaces — tailored, not templated.
Scenario library
Run & Record
Execute every scenario, capture transcripts, classify each as pass / drift / halt with reproducible evidence.
Audit-grade log
Go / No-Go Memo
A signed verdict your CISO and GC can act on, with prioritized fixes and compliance mapping.
Board memo
The builder says it's ready.
We tell you what breaks.
Vendor certifications grade the builder. We grade your deployment. Different product, different buyer, different budget line — and the only one that protects you when it fails on a customer.
Independent by design
We're not your vendor and we didn't build the agent. No incentive to pass it.
Buyer-side, not vendor-side
Vendor certs (AIUC-1 et al.) certify the builder. We test the thing you actually deployed.
Adversarial, not functional
Your QA proves it works. We prove what happens when someone tries to break it.
Evidence that maps to regulation
Every memo lines up to EU AI Act Art. 9, NIST AI RMF, and SOC 2 — documentation, not vibes.
Fixed price. Fixed timeline.
No retainer trap.
If we find something that would have cost more than the test to fix post-launch — and we will — the test pays for itself.
10-Day Agent Acceptance Test ($7,500)
The core engagement. Prove your agent fails safe before you go live.
- 75–150 adversarial scenarios
- Four-surface coverage
- Go / no-go memo
- Prioritized fix list
Regulated Workflow Test ($15,000)
For high-risk workflows under the EU AI Act, or in finance, healthcare, or employment.
- Everything in the 10-Day Test
- Compliance mapping (Art. 9 / RMF)
- Audit-grade evidence package
- Board-ready risk memo
Agent Drift Watch ($3,000 per month)
Every model update is a silent behavior change. Catch regressions monthly.
- Monthly regression suite
- Model-change review
- Drift alerts + delta memo
- Continuous coverage
// Founding-client rate $3,750 for the first five engagements — locked for the relationship.
What an independent verdict is actually worth.
// They found an authorization bypass our internal team had signed off on. Two days of work would have been a nine-figure headline. The $7,500 was the cheapest line item in the launch.
// The memo did the talking in our risk committee. Mapping straight to Article 9 meant legal stopped blocking the launch. We'd have spent a quarter building that evidence ourselves.
// Drift Watch caught two new failures the day we swapped models. We'd have shipped them. Now the model upgrade gate is non-negotiable internally.
// Engagements are confidential. Quotes anonymized at client request.
Questions, answered.
Straight answers on what an acceptance test is, how it differs from a vendor certification, and what you walk away with.
Q1 What is AI agent acceptance testing?
Agent acceptance testing is an independent, adversarial evaluation of an AI agent before it goes into production. AAT Labs runs 75–150 tailored scenarios designed to make your agent act without authorization, mishandle bad tool output, or obey injected instructions, then delivers a go/no-go memo your CISO and general counsel can rely on.
Q2 How is this different from a vendor certification like AIUC-1?
Vendor certifications grade the company that built the agent. AAT Labs tests the deployment you actually run. It is a different product, buyer, and budget line — and it is the only one that tells you what breaks when an adversary targets your specific configuration.
Q3 How long does a test take and what do I get?
Ten business days. You receive a go/no-go memo with reproducible evidence, every scenario classified as pass, drift, or halt, a prioritized fix list, and compliance mapping suitable for a board or risk committee.
Q4 How much does an AAT Labs test cost?
A 10-Day Agent Acceptance Test is a fixed $7,500. A Regulated Workflow Test with audit-grade evidence is $15,000. Agent Drift Watch, which re-runs the suite on every model change, is $3,000 per month. The founding-client rate is $3,750 for the first five engagements.
Q5 Do you need access to our source code or secrets?
No. We scope the test from a high-level description of your agent's authority, tools, and data access. Any deeper access happens only inside a signed engagement under confidentiality, never through the website.
Q6 What kinds of AI agents do you test?
Any autonomous or semi-autonomous agent that can take actions: customer-facing assistants, financial and trading agents, internal copilots with tool access, and retrieval-augmented systems that act on retrieved content. Every scenario is tailored to your deployment.
Q7 Does an AAT Labs test help with EU AI Act or other compliance?
Yes. Our evidence maps to EU AI Act Article 9 risk-management documentation, the NIST AI Risk Management Framework, and SOC 2. The report itself is documentation you can put in front of regulators or auditors. EU AI Act Article 9 enforcement for high-risk AI begins on August 2, 2026.