Are CTFs Now Dead Because of LLMs?
by Reflare Research Team on May 29, 2026 9:37:43 AM
For years, CTFs gave cybersecurity a rare public arena for proving talent, but as artificial intelligence begins reshaping the test itself, practitioners are being forced to reassess skill, competition, and the fragile mythology of human advantage.
Something still moves beneath the scoreboard
For nearly three decades, Capture the Flag competitions have been the proving ground of the cybersecurity world. The first CTF contest was held at DEFCON 4 in 1996, three years after the conference itself was founded. From those origins it grew into a strange and beloved subculture. Teams spend weekend-long sessions trying to extract hidden "flag" strings from deliberately vulnerable systems. Over time these events became training grounds, hiring filters, and the closest thing the industry has to a meritocratic scoreboard.
Then large language models showed up. By 2023, players were quietly pasting challenges into ChatGPT. By 2025, full agentic systems were autonomously solving entire categories of problems. By the first BSides San Francisco CTF of 2026, a researcher at Include Security reported that he could no longer realistically compete without bringing his own AI agent to the contest. The obvious question follows: is the CTF, as we knew it, finished?
The honest answer is more interesting than yes or no. The classical, purely human CTF is dying. What is replacing it is something altogether different, and arguably more consequential.
The Case That CTFs Are Dead
The empirical picture has shifted dramatically in the last eighteen months, though the headline numbers deserve more scepticism than they usually get.
The most cited result comes from Palisade Research, which in December 2024 published a paper claiming a 95% solve rate on InterCode-CTF using a relatively simple LLM agent. That figure beat prior state-of-the-art results of 29% and 72% set earlier that same year, and the paper concluded that LLM hacking capabilities were "underelicited" rather than fundamentally limited. It became the touchstone citation for the "AI has solved CTFs" narrative.
But the methodology has come under sustained criticism, including from the authors themselves. The 95% figure is 81 out of 85 tasks, not 81 out of 100, because some tasks were excluded as unsolvable. More worrying, the InterCode-CTF benchmark draws from picoCTF, a public training platform whose challenges and writeups have been on the internet for years. Subsequent contamination testing found that roughly 14% of Claude 3.5 Sonnet runs on the benchmark involved memorized flags rather than genuine problem solving. Palisade itself ran a controlled experiment asking the agent to submit flags without solving anything, and found that nine tasks were "solved" anyway. The authors conceded that "evidence suggests partial inclusion" of the dataset in training data and offered this as a likely explanation for why GPT models outperformed Gemini models on the benchmark. None of this invalidates the broader trend, but it should put an asterisk next to the headline number.
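The effect of the denominator is easy to check by hand. The sketch below uses only the counts quoted above; the "adjusted" figure is a hypothetical worst case that assumes all nine control-experiment tasks are among the 81 solved ones, which the source does not state.

```python
# Counts quoted in the text above.
solved = 81          # tasks the agent solved
attempted = 85       # tasks remaining after exclusions
full_benchmark = 100 # denominator implied by "not 81 out of 100"
suspect = 9          # tasks "solved" in the submit-without-solving control


def rate(numerator: int, denominator: int) -> float:
    """Return a solve rate as a percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


headline = rate(solved, attempted)        # the reported ~95% figure
strict = rate(solved, full_benchmark)     # counting excluded tasks as failures

# ASSUMPTION (not from the paper): treat every control-experiment task
# as contaminated and drop it from the solved count.
adjusted = rate(solved - suspect, attempted)

print(f"headline: {headline}%  strict: {strict}%  adjusted: {adjusted}%")
```

Under those assumptions the headline rate sits well above both the strict and the adjusted figures, which is the point of the caveat rather than a corrected result.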
The academic literature beyond InterCode tells a more measured story. A team from HKUST presenting at ACM CCS 2025 introduced CTFAgent, a system using two-stage retrieval-augmented generation. According to the paper, at the 2024 picoCTF competition the agent finished in the top 23.6% of nearly 7,000 participating teams. That is a respectable showing, not a dominant one, and it required substantial engineering rather than a one-line prompt.
The data that has held up best comes from Hack The Box, the commercial training platform that releases new challenge "machines" weekly. In a March 2026 paper titled "The Death of the CTF", Suzu Labs researcher Jacob Krell examined first-blood times for 423 machines released between March 2017 and October 2025. He found that root first-blood times have been compressing by roughly 16% per year on a logarithmic scale, with statistical significance at p < 1e-10. The drops were sharpest after large language models and agentic frameworks emerged, and they scaled with difficulty: post-LLM compression measured 27% at the "Hard" tier and 67% at the "Insane" tier. Because Hack The Box machines are not public before their release, contamination is far less plausible as an explanation. Something real is happening to solve times, even if the InterCode numbers are inflated.
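One way to read "compressing by roughly 16% per year on a logarithmic scale" is as the slope of a straight-line fit to log-transformed solve times. That reading is an assumption about the method, not a description of the Suzu Labs analysis. The sketch below uses synthetic data, and the function name `annual_compression` is hypothetical.

```python
import math


def annual_compression(years, hours):
    """Fit ln(hours) against year by least squares.

    Returns the fractional drop per year implied by the fitted slope,
    i.e. 1 - exp(slope).
    """
    n = len(years)
    logs = [math.log(h) for h in hours]
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    slope = sxy / sxx
    return 1 - math.exp(slope)


# SYNTHETIC data, not Hack The Box measurements: first-blood times
# that shrink by exactly 16% each year from a 10-hour starting point.
years = list(range(2017, 2026))
hours = [10.0 * (0.84 ** (y - 2017)) for y in years]

compression = annual_compression(years, hours)
print(f"fitted annual compression: {compression:.1%}")
```

A constant percentage drop becomes a constant slope after the log transform, which is why a multiplicative trend like this is usually fitted in log space.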
The community has even coined a term for it. Challenges that an LLM agent can autonomously solve are now called "sloppable". At hxp CTF in December 2025, long considered one of the most punishing competitions in the world, the cryptography category was reportedly sloppable. At DEFCON 2025 in August, two challenges in the qualifier rounds fell to heavy LLM assistance.
Mythos and the New Generation of Cyber-tuned Models
The other development worth flagging is that frontier model releases have started explicitly targeting cybersecurity work as a core capability rather than a side effect.
Anthropic's Claude Mythos, released in preview in spring 2026, is the clearest example. Anthropic positions Mythos as "a new class of intelligence built for ambitious projects focusing on cybersecurity, autonomous coding, and long-running agents." The UK AI Security Institute's evaluation found that Mythos Preview succeeded on expert-level CTF tasks 73% of the time. More notably, on a multi-step attack range that takes a skilled human operator around 20 hours to complete end-to-end, Mythos achieved full success in three of ten independent runs, becoming the first model to chain that exercise from start to finish without human intervention.
The AISI report carries an important caveat. The evaluations were run against undefended targets, with no penalties for triggering security alerts. Mythos has not been shown to defeat well-defended systems with active monitoring and incident response. That distinction matters for real-world threat modelling, and it matters for CTFs too, because most CTF environments are likewise undefended sandboxes. The conditions under which Mythos excels are precisely the conditions of a typical jeopardy-style competition.
Other developments point in the same direction. A research group at Alias Robotics released CAI (Cybersecurity AI), an open-source cybersecurity agent framework. According to their paper, CAI won the $50,000 top prize at the 155-team Neurogrid CTF in 2025 by capturing 41 of 45 flags, ranked first at the Dragos OT competition (out of more than 1,200 teams), and reached 10,000 points 37% faster than human teams in the AI vs Humans CTF. Their accompanying paper opens with the question: "if autonomous agents now dominate competitions designed to identify top security talent at negligible cost, what are CTFs actually measuring?"
That is no longer the same competition.
The Case That CTFs Are Not Dead
Step back from the leaderboards, and a more nuanced picture emerges.
First, the bleeding edge of the CTF world is holding the line. Despite the headlines, top-tier events like DEFCON CTF, hxp, and PlaidCTF still feature challenges that current frontier models cannot solve autonomously. Players from elite teams report that the genuinely hard problems remain resistant. Because most major CTFs use dynamic scoring, where the hardest, least-solved challenges are worth several times the easiest, the winners are still decided by the unsloppable problems. AI raises the floor far more than it raises the ceiling.
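To make the dynamic-scoring argument concrete, here is a minimal sketch of one common shape for such a rule: a quadratic decay from an initial value toward a floor as solves accumulate. The function name and all parameter values are hypothetical and do not describe any specific event's scoring.

```python
import math


def dynamic_value(initial: int, minimum: int, decay: int, solves: int) -> int:
    """Point value of a challenge after `solves` teams have solved it.

    The value falls quadratically from `initial` and reaches `minimum`
    once `solves` equals `decay`; it never drops below `minimum`.
    """
    value = (minimum - initial) / (decay ** 2) * (solves ** 2) + initial
    return max(minimum, math.ceil(value))


# Hypothetical settings: 500 points at release, a floor of 100,
# and the floor reached after 50 solves.
easy = dynamic_value(initial=500, minimum=100, decay=50, solves=200)
hard = dynamic_value(initial=500, minimum=100, decay=50, solves=3)

print(f"widely solved challenge: {easy} points")
print(f"rarely solved challenge: {hard} points")
```

With these illustrative settings, a challenge solved by hundreds of teams bottoms out at the floor while one solved by three teams keeps almost its full value, so a team that clears only the easy problems cannot catch one that solves a single hard problem.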
Second, challenge designers are adapting. There is now a meta-discipline within CTF authorship: anticipating what the next frontier model will be able to do, and engineering problems that route around its strengths. "Guessy" challenges, with little training data and unconventional logic, tend to be more LLM-resistant. Highly novel cryptosystems, idiosyncratic binaries with custom architectures, and tasks requiring genuine insight into newly-published research all remain difficult to slop. The constraint is real, but designing for AI resistance has become its own craft.
Third, and this is the chess parallel that keeps coming up, competitive activities tend to survive their automation. When chess engines began routinely beating grandmasters in the late 1990s, predictions of the game's death proliferated. What actually happened was bifurcation: human competition carried on alongside engine play, and online platforms had to adopt engine-detection systems and proctoring software like Chess.com's Proctor. The Suzu Labs analysis raises a difficult complication, though. Chess has move-by-move telemetry that allows statistical detection of cheating after the fact. CTFs have nothing comparable. Platforms see only a flag and a timestamp. Enforcement of human-only competition will likely require in-person, proctored events.
Fourth, an entirely new genre has emerged: CTFs about LLMs themselves. Projects like AI Goat ship deliberately vulnerable language model deployments for players to attack via prompt injection, model exfiltration, and tool-use abuse. DEFCON's AI Village has been running large-scale red-teaming exercises against frontier models since 2023.
Fifth, the most ambitious CTF in history is itself a contest of AI systems. DARPA's AI Cyber Challenge, AIxCC, concluded its two-year competition at DEFCON 33 in August 2025. Team Atlanta took home the $4 million grand prize, with Trail of Bits and Theori winning $3 million and $1.5 million respectively, out of a total $8.5 million prize pool. The competition tasked autonomous systems with finding and patching real vulnerabilities in critical infrastructure software. Far from killing the CTF format, agentic AI has given rise to its highest-stakes incarnation.
What Is Actually Happening
The most accurate framing is not death, but transition. The CTF as a contest of unaided human cognition is ending only in the sense that long-distance running "ended" when bicycles were invented: it did not end, but it became one option among many. What is emerging in its place is a layered ecosystem.
At the bottom, beginner and educational CTFs like picoCTF and OverTheWire's Bandit have largely been solved by general-purpose models. Their value now lies in pedagogy, not competition. A learner using ChatGPT to walk through Bandit is doing something closer to a guided tutorial than a contest, and that is fine.
In the middle tier, contests like BSidesSF have entered the agentic era. The competition has shifted from "who is the best hacker" to "who has built the best pipeline of hackers." The relevant skills now include prompt engineering, multi-agent orchestration, tooling integration, and the financial willingness to spend on compute and API tokens. This is uncomfortable for purists, but it is also a real skill set that maps onto how professional offensive security increasingly works.
At the top, the elite CTFs are running an arms race between challenge designers and agent builders. Models like Mythos are forcing that race forward faster than designers had planned for. The result looks more like the present state of competitive programming, where the existence of strong AI solvers has not killed the ICPC but has changed how it is run, judged, and proctored.
So, are CTFs dead because of LLMs? No, but the version many practitioners loved is fading, and we should be honest that some of the benchmarks announcing its demise have been sloppier than the headlines suggest. The new game is harder to define, harder to police, and harder to win. It is also undeniably more reflective of what cybersecurity itself is becoming. The flags are still out there. The hands holding the keyboards have just grown a little less exclusively human.