User beware: AI application vulnerability scans may not meet expectations for enterprise security teams. While a few hundred dollars for a scan API fee may sound cheap, a recent Contrast Labs analysis found a basic AI scan against a 1.8-million-line Java codebase initially returned 3,560 findings, including 1,000 rated “high-severity.”
The firm estimates that, at a conservative 30 minutes of security engineer triage per finding (reading the alert, opening the source file, tracing data flow, and making a judgment call), working down that list would cost roughly $128,000 in labor before a single vulnerability is mitigated.
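The arithmetic behind that estimate can be sketched in a few lines. The finding count, the 30-minute triage time, and the $128,000 total come from the figures above. The hourly rate is not stated in the analysis, so the value computed here is only what those three numbers imply, and the function names are illustrative.

```python
def triage_hours(findings: int, minutes_per_finding: float) -> float:
    """Total engineer hours needed to triage every finding once."""
    return findings * minutes_per_finding / 60


def implied_hourly_rate(total_cost: float, findings: int,
                        minutes_per_finding: float) -> float:
    """Hourly labor rate implied by a total triage cost (a derived value, not a reported one)."""
    return total_cost / triage_hours(findings, minutes_per_finding)


hours = triage_hours(3560, 30)                 # 1780.0 hours
rate = implied_hourly_rate(128_000, 3560, 30)  # about 71.91 dollars per hour
```

At 1,780 hours, the triage alone is most of a working year for one engineer, which is where the labor bill comes from.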
Contrast Labs tested three AI scanning approaches against enterprise Java codebases and found that the economics of AI-powered AppSec don't improve with higher scan spend. They shift.
The broader implications are significant. According to Cycode's "State of Product Security for the AI Era 2026", 97% of enterprises now ship AI-generated code to production, and every AppSec vendor in the market has attached the "AI" label to its scanner. The pitch is straightforward: point the model at a repository, wait for the report, and reduce the burden on security teams. Contrast Labs' testing found the opposite.
David Lindner, chief information security officer (CISO) at Contrast Security, suspected that AI-based scans would not be ideal for many organizations. “With all of the noise around finding vulnerabilities and active exploits, and the thinking that such scans won’t be feasible for numerous reasons, is what drove this testing,” says Lindner.
AI code assessments run up the bill
The testing ran in three segments: a simple scan that issued basic OWASP Top 10 prompts to Claude Sonnet 4.6; a multi-agent system with specialized sub-agents per vulnerability class, also on Sonnet 4.6; and Claude Security, Anthropic's dedicated AppSec scanner built on Claude Opus 4.7. Contrast Labs ran each against a well-established 1.8-million-line Java application that had already been vetted through years of SAST, DAST, penetration testing, and bug bounty programs.
The simple scan's findings were largely noise. Two findings flagged as critical were SQL injections, the type of result that typically gets immediate attention. Both proved to be false positives. One supposed SQL injection involved a query built with correctly parameterized PreparedStatement binds. The other used a hardcoded constant with no user input anywhere near the vulnerable-looking code path. The model pattern-matched correctly on form and incorrectly on context, but determining that required human time and expertise.
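To illustrate why a bound parameter is not injectable even when the surrounding code resembles a vulnerable query, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for Java's PreparedStatement. The table, column, and function names are hypothetical and are not taken from the tested application.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create a throwaway in-memory database with two hypothetical users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)",
                     [("alice", "admin"), ("bob", "user")])
    return conn


def find_unsafe(conn: sqlite3.Connection, name: str) -> list:
    # String concatenation: the input becomes part of the SQL text itself.
    query = "SELECT name FROM users WHERE name = '" + name + "'"
    return [row[0] for row in conn.execute(query)]


def find_safe(conn: sqlite3.Connection, name: str) -> list:
    # Bound parameter: the input is passed as data and never parsed as SQL.
    query = "SELECT name FROM users WHERE name = ?"
    return [row[0] for row in conn.execute(query, (name,))]


payload = "nobody' OR '1'='1"
```

Called with the payload, find_unsafe returns every user, while find_safe returns nothing because no user has that literal name. Both functions build a query around a user-supplied value, which is why a scanner matching on form alone can flag the safe one.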
The multi-agent architecture produced better findings but didn't budge the costs. The multi-agent scan uncovered a genuine mass-assignment issue on a password change flow, tracing the data flow end-to-end and confirming a real privilege escalation path. However, projected API costs for a full scan at that quality level ran $43,000 to $107,000, and that is for an AI scan of just one application. With triage labor added, the all-in estimate ranged from $65,000 to $150,000, putting it in line with the cheap scan, or above it, once the full bill arrived.
The third assessment tested all three methods simultaneously against a 50,000-line Java codebase. Here, 59 unique findings were identified in total. Forty-two were flagged by a single scanner and never corroborated. Only three of the 59, or five percent, were identified by all three tools. The two strongest confirmed true positives in the entire dataset, an authorization bypass and an insecure direct object reference (IDOR) flaw, appeared only in the multi-agent and Claude Security results. The simple scan missed both entirely.
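A rough sketch of how corroboration across scanners can be tallied follows. The toy finding IDs and scanner keys are hypothetical. Only the totals at the bottom (59 findings, 42 reported by a single scanner, three reported by all three) come from the figures above, and the count of findings reported by exactly two scanners is derived from them rather than reported.

```python
from collections import Counter


def corroboration_counts(findings_by_scanner: dict) -> Counter:
    """Map 'number of scanners that reported a finding' to 'how many findings'."""
    hits = Counter()
    for findings in findings_by_scanner.values():
        hits.update(set(findings))  # count each scanner at most once per finding
    return Counter(hits.values())


# Hypothetical toy data: F1 is seen by all three scanners, F4 by two, the rest by one.
toy = {
    "simple": {"F1", "F2", "F3"},
    "multi_agent": {"F1", "F4", "F5"},
    "claude_security": {"F1", "F4", "F6"},
}
toy_counts = corroboration_counts(toy)  # {3: 1, 2: 1, 1: 4}

# Totals from the article.
total, single, all_three = 59, 42, 3
exactly_two = total - single - all_three          # 14 (derived)
pct_all_three = round(100 * all_three / total, 1)  # 5.1
pct_single = round(100 * single / total, 1)        # 71.2
```

Read this way, roughly seven in ten findings had no second scanner backing them up.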
Consistency proved to be an issue as well. Contrast Labs ran each scanner three times against the same 50,000-line codebase. The Sonnet-based simple scan reproduced only 17% of its own findings across three runs. Upgrading to Opus slightly improved reproducibility to 25%, with a 28.6% swing in the number of findings between its best and worst runs. One Opus run flagged three critical findings; another run against the identical code found two.
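The article does not say how Contrast Labs computed these percentages, so the following is only one plausible way to measure them. It assumes reproducibility means the share of all distinct findings that appeared in every run, and that swing means the gap between the largest and smallest run relative to the largest. The finding IDs and the per-run counts are hypothetical.

```python
def reproducibility(runs: list) -> float:
    """Share of all distinct findings that appeared in every run (assumed definition)."""
    union = set().union(*runs)
    if not union:
        return 1.0  # no findings at all: trivially consistent
    common = set.intersection(*map(set, runs))
    return len(common) / len(union)


def swing(counts: list) -> float:
    """Gap between the best and worst run, relative to the best (assumed definition)."""
    return (max(counts) - min(counts)) / max(counts)


# Hypothetical runs: only finding "A" shows up all three times.
runs = [{"A", "B", "C", "D"}, {"A", "B", "E"}, {"A", "F"}]
score = reproducibility(runs)   # 1 of 6 distinct findings, about 0.17

# Hypothetical per-run counts, chosen only because they give a 28.6% swing.
example_swing = swing([7, 5, 6])  # about 0.286
```

Under these definitions, a scanner that reported the same findings on every run would score 1.0 on reproducibility and 0.0 on swing.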
"These scans are not going to provide you any in any way, shape, or form results that are repeatable," Lindner says.
Such challenges compound for larger organizations. A mid-size software company might have 50 meaningful repositories, each averaging 500,000 lines. Running even the economy scan across all of them generates tens of thousands of findings. Running a quality agent scan generates a smaller mountain, but still at a cost that few organizations can afford.
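A back-of-the-envelope projection shows where "tens of thousands" comes from. It assumes, purely for illustration, that findings scale linearly with lines of code at the same density as the simple scan described earlier (3,560 findings in 1.8 million lines). Real codebases will vary, and the function name is hypothetical.

```python
def projected_findings(repos: int, lines_per_repo: int,
                       ref_findings: int, ref_lines: int) -> float:
    """Scale a reference finding density to a whole portfolio (linear-scaling assumption)."""
    density = ref_findings / ref_lines  # findings per line of code
    return repos * lines_per_repo * density


estimate = projected_findings(50, 500_000, 3560, 1_800_000)  # about 49,444 findings
hours = estimate * 30 / 60  # about 24,722 triage hours at 30 minutes per finding
```

Under that assumption, the 50-repository portfolio lands near 49,000 findings before anyone has triaged one.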
AI assessments have an unexpected niche
According to Lindner, AI scanning performed well in one area where traditional tools have typically struggled: authorization logic. Things such as broken access control, IDOR, and missing ownership checks in complex multi-tenant code depend on understanding what a function is supposed to do, not just what the syntax says it should do. Pattern-matching SAST has never excelled here.
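A minimal sketch of the kind of flaw described here, using a hypothetical invoice lookup, follows. Nothing in it comes from the tested applications. The point is that the vulnerable version contains no suspicious syntax, only a missing ownership check.

```python
from dataclasses import dataclass


@dataclass
class Invoice:
    id: int
    owner_id: int
    amount: float


# Hypothetical data store: invoice 1 belongs to user 10, invoice 2 to user 20.
INVOICES = {
    1: Invoice(id=1, owner_id=10, amount=99.0),
    2: Invoice(id=2, owner_id=20, amount=250.0),
}


def get_invoice_vulnerable(current_user_id: int, invoice_id: int) -> Invoice:
    # IDOR: any caller who guesses an ID gets the record, whoever owns it.
    return INVOICES[invoice_id]


def get_invoice_checked(current_user_id: int, invoice_id: int) -> Invoice:
    invoice = INVOICES[invoice_id]
    # Ownership check: the record must belong to the caller.
    if invoice.owner_id != current_user_id:
        raise PermissionError("invoice belongs to another user")
    return invoice
```

Telling the two apart requires knowing that an invoice is supposed to be visible only to its owner, which is a statement about intent rather than syntax.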
"If you can get AI focused on access control, it can reason its way through the code," Lindner said. "That’s not something we ever would have found with a commercial product."
This isn’t to say AI shouldn’t be in the application security mix. AI assessments belong in the development cycle, specifically against authorization logic, where they catch what traditional tooling misses. However, this data implies that AI doesn't belong as the foundation of a production AppSec program, and the economics seem to argue against running it broadly across large portfolios.
Lindner also flagged a shift in how he approaches the key metrics that govern AppSec programs. Mean time to remediate, long the golden KPI, assumes remediation can keep pace with discovery. Lindner argues it can't, at least not with AI-assisted offensive research accelerating CVE discovery. The metric he's moving toward is mean time to contain. The goal? Stop the bleeding as soon as possible, but don’t attempt to patch or fix everything immediately.
"We can't keep up with the vulnerabilities we have in our backlog, let alone any new ones we might see," Lindner said. "That’s why I'm switching to focus more on mean time to contain."
