Security has been losing for a long time. Most people haven't noticed.
Not because they're not paying attention — but because the tooling has trained everyone to accept a certain kind of failure as normal. Signatures miss things. Alerts get ignored. The security team triages an endless stream of noise, and somewhere in that noise, the real threat sits quietly and waits.
We built something different. And what we learned in the process changed how we think about detection entirely.
What We Found When We Started Looking
The first thing we discovered when we onboarded customers: malware was everywhere.
Browser extensions, npm packages, IDE plugins — sitting quietly on employee endpoints across multiple companies, completely undetected. The signature-based detectors hadn't missed these threats by a little. They hadn't caught them at all. These were things the entire paradigm was blind to, by design.
That moment changes you. The gap between what security tooling promises and what it actually delivers isn't small — it's enormous. Existing tools aren't almost good enough. They're operating in a fundamentally different paradigm from the threat.
And the threat has been winning.
The Problem with Tools Built for Clean Code
This is worth understanding deeply, because it explains why the tooling has failed.
Tools like Claude Code, Cursor, and Copilot were built for clean, readable, well-intentioned code. They're trained on GitHub repositories where developers write functions with meaningful names, add comments, follow conventions. The entire prior is that the author is trying to communicate something.
Malware authors don't write that.
Malware is intentionally opaque. Variables are single characters or hex strings. Logic is fragmented across dozens of dynamic calls. Behavior is hidden behind layers of obfuscation, split across install scripts and seemingly unrelated utilities. A model optimized for clean code breaks down completely when the code is adversarial by design — not because the model is bad, but because its entire assumption about what code looks like is wrong.
Building an AI agent to hunt threats at scale forced us to rethink almost every assumption AI coding tools make.
The False Positive Rate Is the Whole Game
We started by hunting false positives, not malware.
This is the counterintuitive part. Most detection systems optimize for recall — catch everything, tolerate the noise. We went the opposite direction. It was a deliberate choice with real consequences.
If you flag legitimate software as malicious, you tell an organization that something they depend on is a threat. They act on it. Things break. And the next time your system fires an alert — on something real — they've already learned to doubt you. Trust is the whole product. Once you've burned it, you don't get it back.
You can't block legitimate software. The blast radius of a wrong call is too large.
So before we expanded coverage, we made the false positive rate near zero. We built the system to understand that poor security practices aren't malicious intent. That a developer tool spawning shell commands is doing what developer tools do. That an extension modifying CORS headers globally is probably negligent, not attacking anyone. That "this code looks scary" and "this code is malware" are completely different statements — and the model has to know the difference.
Only once we had near-zero false positives did we start expanding coverage. In that order. Never reversed.
The false positive rate isn't a secondary metric. It is the system.
Context Management at Adversarial Scale
Malware is often 2 lines in a 10,000-line file.
At millions of packages per day, reading every line isn't just expensive; at that volume it's economically unsustainable. But you can't miss those 2 lines either, because they're why the other 9,998 exist.
We ran the numbers: read everything and pay 100x cost, or use targeted context selection and catch 90% of cases at 1x cost. We chose the second option. That means accepting that we'll miss some things. We're honest about that tradeoff.
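The shape of that tradeoff can be sketched with a toy expected-cost calculation. The 100x/1x cost ratio and 90% catch rate come from the numbers above; the base rate of malicious packages is an illustrative assumption, not a real figure:

```python
# Toy model of the scanning-cost tradeoff. Cost units are relative;
# the base rate of malicious packages (1 in 10,000) is an assumption
# made up for illustration.

FULL_READ_COST = 100.0   # relative cost per package, read everything
TARGETED_COST = 1.0      # relative cost per package, targeted context
TARGETED_RECALL = 0.90   # fraction of malicious cases the cheap path catches

def cost_per_detection(cost_per_package: float, recall: float,
                       base_rate: float = 1e-4) -> float:
    """Expected scanning cost per malicious package actually caught."""
    packages_per_detection = 1.0 / (base_rate * recall)
    return cost_per_package * packages_per_detection

full = cost_per_detection(FULL_READ_COST, recall=1.0)
targeted = cost_per_detection(TARGETED_COST, recall=TARGETED_RECALL)
print(f"full read: {full:,.0f} units per catch")
print(f"targeted:  {targeted:,.0f} units per catch")
```

At these numbers the targeted path is about 90x cheaper per detection, which is why the 10% miss rate is a tradeoff rather than a failure.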
The targeted approach requires the model to know where to look — and that's harder than it sounds when the author has specifically designed the code to misdirect attention. Entrypoints are buried. Malicious logic is scattered across files that look unrelated. Install scripts masquerade as build tooling.
We had to build heuristics for where threat actors hide things. Those heuristics had to be robust enough to work on code the model has never seen before, written by authors actively trying to evade detection.
Context selection is, in a lot of ways, the core engineering challenge of the whole system.
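A minimal sketch of that kind of targeted selection: score each file in a package by signals that correlate with where threat actors hide logic, and only read the top of the ranking. The signal list and weights below are illustrative stand-ins, not Koi's actual heuristics:

```python
import re

# Illustrative signals for ranking files by "where malware tends to hide".
# Real heuristics would be far richer; these patterns and weights are
# assumptions made up for this sketch.
SIGNALS = [
    (re.compile(r"\b(?:preinstall|postinstall|install)\b"), 3.0),  # lifecycle hooks
    (re.compile(r"\beval\s*\(|new Function\s*\("), 4.0),           # dynamic execution
    (re.compile(r"(?:\\x[0-9a-fA-F]{2}){4,}"), 2.5),               # hex-escaped blobs
    (re.compile(r"[A-Za-z0-9+/]{120,}={0,2}"), 2.0),               # long base64-ish runs
    (re.compile(r"child_process|\bspawn\s*\(|\bexec\s*\("), 2.0),  # shell execution
]

def suspicion_score(text: str) -> float:
    """Weighted count of signal hits; higher means read this first."""
    return sum(weight * len(pattern.findall(text)) for pattern, weight in SIGNALS)

def pick_context(files: dict[str, str], budget: int = 3) -> list[str]:
    """Return the `budget` most suspicious file names; paths are scored too."""
    ranked = sorted(files,
                    key=lambda name: suspicion_score(f"{name}\n{files[name]}"),
                    reverse=True)
    return ranked[:budget]
```

Static scoring like this is exactly what adversaries try to defeat, which is why it can only decide where the model looks first, not what the verdict is.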
Deobfuscation Is a Denial of Service Against Yourself
Obfuscated code isn't a bad-code problem. It's a compute problem.
Malware authors obfuscate for a reason: it makes static analysis hard, and it makes dynamic analysis expensive. CPU-intensive by design. When we first tried to run deobfuscation inline — unpacking and executing obfuscated JavaScript to understand what it actually does — it destroyed our infrastructure. A single package could spike CPU for minutes. At scale, that's not a bottleneck. It's a denial of service against yourself.
The solution was isolation. We pulled deobfuscation entirely out of the main pipeline and ran it in sandboxes using E2B — ephemeral, containerized environments that can absorb the compute spike and die cleanly when they're done. The rest of the system never touches it. No shared resources. No blast radius.
It sounds obvious in retrospect. It wasn't obvious when we were watching our servers melt.
The Alignment Problem (The Practical One)
The hardest part wasn't engineering. It was alignment.
Not alignment in the AGI-safety sense. In the very practical sense: getting the model to produce the same classification a senior security researcher would produce, on the same case, every time, with a reasoning chain that would survive peer review.
That required going deep with Koi's research team — the people who have spent years developing intuitions about what malware actually is, where the gray zone lives, how to reason about edge cases. We didn't just ask them for labels. We extracted their reasoning on ambiguous cases, mapped the terminology they use to discuss threat categories, and built a dataset where every single example carries a full reasoning chain written in shared, Koi-defined language.
The model doesn't just output a verdict. It thinks through the case the way a researcher would: what is this thing supposed to do, what does it actually do, does the behavior align with the stated purpose, is there evidence of intent to harm, how sensitive is the data involved, who bears the cost if this goes wrong.
Each of those questions has a structure we defined, refined, and validated against real cases.
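The questions above imply a structure. A minimal sketch of what such a reasoning chain might look like as data, with field names mirroring the questions in the text; the schema and the toy decision logic are illustrative, not Koi's actual format:

```python
from dataclasses import dataclass, field
from enum import Enum

class Verdict(Enum):
    BENIGN = "benign"
    NEGLIGENT = "negligent"   # bad practice, no intent to harm
    MALICIOUS = "malicious"

@dataclass
class ReasoningChain:
    stated_purpose: str                # what is this thing supposed to do?
    observed_behavior: str             # what does it actually do?
    purpose_behavior_aligned: bool     # do the two line up?
    evidence_of_intent: list[str] = field(default_factory=list)
    data_sensitivity: str = "low"      # e.g. low / pii / credentials
    blast_radius: str = ""             # who bears the cost if this goes wrong?

    def verdict(self) -> Verdict:
        # Toy decision logic: evidence of intent dominates; misalignment
        # without intent is negligence, not malware.
        if self.evidence_of_intent:
            return Verdict.MALICIOUS
        if not self.purpose_behavior_aligned:
            return Verdict.NEGLIGENT
        return Verdict.BENIGN
```

Under this sketch, the CORS extension from earlier lands on `NEGLIGENT` (misaligned behavior, no intent evidence), while a "color picker" that exfiltrates cookies lands on `MALICIOUS`: the same distinction between scary and malicious the model is trained to make.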
That process — building a shared vocabulary for malware reasoning and encoding it into the model's behavior — is what made the system defensible. Not just accurate, but explainable. An alert our researchers can stand behind. One a customer can read and understand. One that holds up when someone pushes back.
The engineering made it possible. The alignment made it real.