
Are CTFs Now Dead Because of LLMs?

Written by Reflare Research Team | May 29, 2026 8:37:43 AM

For years, CTFs gave cybersecurity a rare public arena for proving talent, but as artificial intelligence begins reshaping the test itself, practitioners are being forced to reassess skill, competition, and the fragile mythology of human advantage.

Something still moves beneath the scoreboard

For nearly three decades, Capture the Flag competitions have been the proving ground of the cybersecurity world. The first CTF contest was held at DEFCON 4 in 1996, three years after the conference itself was founded. From those origins it grew into a strange and beloved subculture. Teams spend weekend-long sessions trying to extract hidden "flag" strings from deliberately vulnerable systems. Over time these events became training grounds, hiring filters, and the closest thing the industry has to a meritocratic scoreboard.

Then large language models showed up. By 2023, players were quietly pasting challenges into ChatGPT. By 2025, full agentic systems were autonomously solving entire categories of problems. By the 2026 BSides San Francisco CTF, a researcher at Include Security reported that he could no longer realistically compete without bringing his own AI agent to the contest. The obvious question follows: is the CTF, as we knew it, finished?

The honest answer is more interesting than yes or no. The classical, purely human CTF is dying. What is replacing it is something altogether different, and arguably more consequential.


The Case That CTFs Are Dead

The empirical picture has shifted dramatically in the last eighteen months, though the headline numbers deserve more scepticism than they usually get.

The most cited result comes from Palisade Research, which in December 2024 published a paper claiming a 95% solve rate on InterCode-CTF using a relatively simple LLM agent. That figure beat prior state-of-the-art results of 29% and 72% set earlier that same year, and the paper concluded that LLM hacking capabilities were "underelicited" rather than fundamentally limited. It became the touchstone citation for the "AI has solved CTFs" narrative.

But the methodology has come under sustained criticism, including from the authors themselves. The 95% figure is 81 of 85 tasks rather than 81 of 100, because some tasks were excluded as unsolvable. More worrying, the InterCode-CTF benchmark draws from picoCTF, a public training platform whose challenges and writeups have been on the internet for years. Subsequent contamination testing found that roughly 14% of Claude 3.5 Sonnet runs on the benchmark involved memorized flags rather than genuine problem solving. Palisade itself ran a controlled experiment asking the agent to submit flags without solving anything, and found that nine tasks were "solved" anyway. The authors conceded that "evidence suggests partial inclusion" of the dataset in training data and offered this as a likely explanation for why GPT models outperformed Gemini models on the benchmark. None of this invalidates the broader trend, but it should put an asterisk next to the headline number.
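The denominator effect is easy to check by hand. The following is a minimal Python sketch using only the counts quoted above. The 100-task total is the figure implied by the comparison in the text, and the "worst case" line is a hypothetical bound that assumes all nine blind-submission tasks are among the 81 solves, which the source does not state.

```python
def solve_rate(solved: int, attempted: int) -> float:
    """Return the solve rate as a percentage."""
    return 100.0 * solved / attempted


SOLVED = 81        # tasks solved, as quoted above
ATTEMPTED = 85     # tasks remaining after exclusions
FULL_SET = 100     # full benchmark size implied by the text
BLIND_SOLVED = 9   # tasks "solved" when the agent was told not to solve anything

headline = solve_rate(SOLVED, ATTEMPTED)
full_set = solve_rate(SOLVED, FULL_SET)
# Hypothetical worst case: assumes every blind-solved task is among the 81.
worst_case = solve_rate(SOLVED - BLIND_SOLVED, ATTEMPTED)

print(f"headline (81/85):        {headline:.1f}%")
print(f"against full set:        {full_set:.1f}%")
print(f"hypothetical worst case: {worst_case:.1f}%")
```

Under these assumptions the headline rate sits at about 95%, the full-set rate at 81%, and the worst case a little under 85%. The result remains strong either way, but the choice of denominator moves it by more than ten points.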

The academic literature beyond InterCode tells a more measured story. A team from HKUST presenting at ACM CCS 2025 introduced CTFAgent, a system using two-stage retrieval-augmented generation. According to the paper, at the 2024 picoCTF competition the agent finished in the top 23.6% of nearly 7,000 participating teams. That is a respectable showing, not a dominant one, and it required substantial engineering rather than a one-line prompt.

The data that has held up best comes from Hack The Box, the commercial training platform that releases new challenge "machines" weekly. In a March 2026 paper titled "The Death of the CTF", Suzu Labs researcher Jacob Krell examined first-blood times for 423 machines released between March 2017 and October 2025. He found that root first-blood times have been compressing by roughly 16% per year on a logarithmic scale, with statistical significance at p < 1e-10. The drops were sharpest after large language models and agentic frameworks emerged, and they scaled with difficulty: post-LLM compression measured 27% at the "Hard" tier and 67% at the "Insane" tier. Because Hack The Box machines are not public before release, contamination is far less plausible as an explanation. Something real is happening to solve times, even if the InterCode numbers are inflated.

The community has even coined a term for it. Challenges that an LLM agent can autonomously solve are now called "sloppable". At hxp CTF in December 2025, long considered one of the most punishing competitions in the world, the cryptography category was reportedly sloppable. At DEFCON 2025 in August, two challenges in the qualifier rounds fell to heavy LLM assistance.


Mythos and the New Generation of Cyber-tuned Models

The other development worth flagging is that frontier model releases have started explicitly targeting cybersecurity work as a core capability rather than a side effect.

Anthropic's Claude Mythos, released in preview in spring 2026, is the clearest example. Anthropic positions Mythos as "a new class of intelligence built for ambitious projects focusing on cybersecurity, autonomous coding, and long-running agents." The UK AI Security Institute's evaluation found that Mythos Preview succeeded on expert-level CTF tasks 73% of the time. More notably, on a multi-step attack range that takes a skilled human operator around 20 hours to complete end-to-end, Mythos achieved full success in three of ten independent runs, becoming the first model to chain that exercise from start to finish without human intervention.

The AISI report carries an important caveat. The evaluations were run against undefended targets, with no penalties for triggering security alerts. Mythos has not been shown to defeat well-defended systems with active monitoring and incident response. That distinction matters for real-world threat modelling, and it matters for CTFs too, because most CTF environments are likewise undefended sandboxes. The conditions under which Mythos excels are precisely the conditions of a typical jeopardy-style competition.

Other developments point in the same direction. A research group at Alias Robotics released CAI (Cybersecurity AI), an open-source cybersecurity agent framework. According to their paper, CAI won the $50,000 top prize at the 155-team Neurogrid CTF in 2025 by capturing 41 of 45 flags, ranked first at the Dragos OT competition (out of more than 1,200 teams), and reached 10,000 points 37% faster than human teams in the AI vs Humans CTF. Their accompanying paper opens with the question: "if autonomous agents now dominate competitions designed to identify top security talent at negligible cost, what are CTFs actually measuring?"

A contest that an autonomous agent can win outright is no longer the same competition.


The Case That CTFs Are Not Dead

Step back from the leaderboards, and a more nuanced picture emerges.

First, the bleeding edge of the CTF world is holding the line. Despite the headlines, top-tier events like DEFCON CTF, hxp, and PlaidCTF still feature challenges that current frontier models cannot solve autonomously. Players from elite teams report that the genuinely hard problems remain resistant. Because most major CTFs use dynamic scoring, where the hardest, least-solved challenges are worth several times the easiest, the winners are still decided by the unsloppable problems. AI raises the floor far more than it raises the ceiling.
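A small sketch shows why dynamic scoring keeps the hardest problems decisive. The decay rule below is a hypothetical quadratic formula of the kind CTF platforms use; the exact formula, the parameter names, and the point values are all assumptions for illustration, not any specific event's configuration.

```python
import math


def dynamic_value(initial: int, minimum: int, decay: int, solves: int) -> int:
    """Hypothetical quadratic decay: the value falls from `initial` toward
    `minimum` as solves accumulate, reaching the floor at `decay` solves."""
    raw = (minimum - initial) / (decay ** 2) * (solves ** 2) + initial
    return max(minimum, math.ceil(raw))


# Illustrative parameters: 500 starting points, 100-point floor, 50-solve decay.
easy = dynamic_value(500, 100, 50, solves=200)  # widely solved challenge
hard = dynamic_value(500, 100, 50, solves=3)    # barely solved challenge

print(f"easy challenge: {easy} points")
print(f"hard challenge: {hard} points")
```

With these made-up parameters, a challenge solved by 200 teams bottoms out at the 100-point floor while one solved by three teams keeps nearly all of its 500 points. If agents mainly lift solve counts on the easier problems, they push those values to the floor for everyone and leave the ranking to the challenges they cannot solve.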

Second, challenge designers are adapting. There is now a meta-discipline within CTF authorship: anticipating what the next frontier model will be able to do, and engineering problems that route around its strengths. "Guessy" challenges, with little training data and unconventional logic, tend to be more LLM-resistant. Highly novel cryptosystems, idiosyncratic binaries with custom architectures, and tasks requiring genuine insight into newly-published research all remain difficult to slop. The constraint is real, but designing for AI resistance has become its own craft.

Third, and this is the chess parallel that keeps coming up, competitive activities tend to survive their automation. When chess engines began routinely beating grandmasters in the late 1990s, predictions of the game's death proliferated. What actually happened was bifurcation: over-the-board human play carried on, while online play came to depend on engine-detection systems and proctoring software like Chess.com's Proctor. The Suzu Labs analysis raises a difficult complication, though. Chess has move-by-move telemetry that allows statistical detection of cheating after the fact. CTFs have nothing comparable: platforms see only a flag and a timestamp. Enforcement of human-only competition will therefore likely require in-person, proctored events.

Fourth, an entirely new genre has emerged: CTFs about LLMs themselves. Projects like AI Goat ship deliberately vulnerable language model deployments for players to attack via prompt injection, model exfiltration, and tool-use abuse. DEFCON's AI Village has been running large-scale red-teaming exercises against frontier models since 2023.

Fifth, the most ambitious CTF in history is itself a contest of AI systems. DARPA's AI Cyber Challenge, AIxCC, concluded its two-year competition at DEFCON 33 in August 2025. Team Atlanta took home the $4 million grand prize, with Trail of Bits and Theori winning $3 million and $1.5 million respectively, out of a total $8.5 million prize pool. The competition tasked autonomous systems with finding and patching real vulnerabilities in critical infrastructure software. Far from killing the CTF format, agentic AI has given rise to its highest-stakes incarnation.


What Is Actually Happening

The most accurate framing is not death but transition. The CTF as a contest of unaided human cognition is ending only in the sense that long-distance running "ended" when bicycles were invented: it did not end, it became one option among many. What is emerging in its place is a layered ecosystem.

At the bottom, beginner and educational CTFs like picoCTF and OverTheWire's Bandit have largely been solved by general-purpose models. Their value now lies in pedagogy, not competition. A learner using ChatGPT to walk through Bandit is doing something closer to a guided tutorial than a contest, and that is fine.

In the middle tier, contests like BSidesSF have entered the agentic era. The competition has shifted from "who is the best hacker" to "who has built the best pipeline of hackers." The relevant skills now include prompt engineering, multi-agent orchestration, tooling integration, and the financial willingness to spend on compute and API tokens. This is uncomfortable for purists, but it is also a real skill set that maps onto how professional offensive security increasingly works.

At the top, the elite CTFs are running an arms race between challenge designers and agent builders. Models like Mythos are pushing that race forward faster than designers had planned for. The situation resembles the present state of competitive programming, where the existence of strong AI solvers has not killed the ICPC but has changed how it is run, judged, and proctored.

So, are CTFs dead because of LLMs? No, but the version many practitioners loved is fading, and we should be honest that some of the benchmarks announcing its demise have been sloppier than the headlines suggest. The new game is harder to define, harder to police, and harder to win. It is also undeniably more reflective of what cybersecurity itself is becoming. The flags are still out there. The hands holding the keyboards have just grown a little less exclusively human.