The One-Metric Trap

The one-metric rule

In A/B testing, you pick one metric. You have to — otherwise you’re p-hacking, cherry-picking, telling yourself a story instead of reading data. The discipline is real. The rule is correct.

But every good analyst I’ve ever worked with does something the rule doesn’t describe: they look at everything else too. Not to override the metric — to hear the harmonics around it.

The metric is the fundamental frequency. Everything else is the overtones. The rule says: “listen to the fundamental.” The skill is hearing the whole chord.

What makes a model good

This was the hardest thing to explain to students. Not the math — the math is teachable. Not the tools — the tools are googlable. The hard part was: what makes a model good?

A flat model captures the obvious signal. A good model captures structure you didn’t ask for — patterns the spec didn’t mention, relationships that weren’t in the brief. Depth. “Volumetricness.” Something that goes beyond what was requested and touches what’s actually there.

And here’s the meta-problem: you can’t teach this to someone who doesn’t have flat/volumetric as a concept yet. It’s like explaining harmonics to someone who’s only ever heard square waves. They’re not dumb — they just don’t have the perceptual axis. Some students get it immediately. Others need the training set of flat models first — enough “technically correct but somehow empty” results to start hearing what’s missing.

“I’m trusting my gut”

There’s a movie called Mercy (2026, dir. Bekmambetov) — a cop vs. an AI judge. The AI operates on data: surveillance footage, probability scores, crime patterns. It calculates guilt at 97.5%. Its metrics are flawless.

The cop says: “I’m trusting my gut.”

That’s not anti-intellectual. That’s a different instrument. The gut is a full-spectrum processor that can’t articulate what it’s detecting — but it’s detecting something real. Something the data model compressed out.

The AI in Mercy is a perfect dashboard: every metric green, every prediction justified, every decision auditable. A 97.5% guilt probability comfortably clears the 95% confidence level we conventionally accept to call an A/B test significant, and the remaining 2.5% contained the entire truth. The cop is the analyst who says “something feels off” and can’t point to a row in the table. One passes every review. The other catches what the review missed.

The framework already knows

The problem of “one metric isn’t enough” has a timeline. Each step got closer. None of them arrived.

But first: the one-metric rule isn’t just a management philosophy. It’s a mathematical constraint. A single hypothesis test answers a question about one metric. Test many metrics at once without correction, and you get the multiple comparisons problem: inflated false positive rates, p-hacking by selection. The math requires compression. That’s what makes the tension so fundamental. It’s not that someone chose to simplify. The tool itself can only see one thing at a time.
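To put a number on that inflation, here is a minimal sketch. It assumes the metrics are independent, which real product metrics rarely are, so treat the output as an illustration rather than a forecast. The function names are mine, not from any library.

```python
def family_wise_error_rate(alpha: float, n_metrics: int) -> float:
    """Chance of at least one false positive across n independent tests,
    each run at significance level alpha, when no metric truly moved."""
    return 1.0 - (1.0 - alpha) ** n_metrics


def bonferroni_alpha(alpha: float, n_metrics: int) -> float:
    """Per-metric threshold that keeps the overall rate at or below alpha
    (Bonferroni correction: conservative, but simple)."""
    return alpha / n_metrics


for k in (1, 5, 20):
    fwer = family_wise_error_rate(0.05, k)
    print(f"{k:>2} metrics: FWER = {fwer:.3f}, "
          f"Bonferroni threshold = {bonferroni_alpha(0.05, k):.4f}")
```

With 20 uncorrected metrics at the usual 0.05 level, the chance that at least one lights up by accident is about 64%. Correcting for that means a stricter threshold per metric, which is the price of refusing to compress.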

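Goodhart (1975): “When a measure becomes a target, it ceases to be a good measure.” (That familiar wording is Marilyn Strathern’s 1997 paraphrase. Goodhart’s own formulation was that any observed statistical regularity tends to collapse once pressure is placed on it for control purposes.) The metric gets gamed. People optimize the number, not the thing the number was supposed to represent. Fix: better incentives.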

Campbell (1979): “The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures.” Same insight, broader scope. Fix: awareness of corruption.

McNamara Fallacy (Yankelovich, 1971): Body counts said America was winning Vietnam while America was losing. “What can’t be easily measured really doesn’t exist. This is suicide.” Fix: include qualitative data.

Kohavi et al. (2009, 2020): The OEC — one metric to rule them all. But immediately surrounded by guardrail metrics, debug metrics, segment breakdowns. The framework builds the one-metric rule and then patches it with escape hatches. Fix: more metrics around the main one.

STEDII (Microsoft ExP, 2022): “Almost every metric has a blind spot, because it is aggregating a large number of measurements into a single number.” Fix: add breakdown metrics, segment metrics, diagnostic tools. More dashboards.

Muller (2018): The Tyranny of Metrics. Institutional damage from metric obsession in education, healthcare, policing. Fix: stop worshipping metrics in policy.

Each of these notices the gap. Each proposes a patch: fix the incentives, add the qualitative data, add more metrics, add guardrails, add segments. All engineering solutions.

But guardrail metrics are the model crying “one metric is not enough.” The rule says: “one metric.” The practice says: “one metric, plus all these other things you have to watch or you’ll break something important.” Nobody calls that a contradiction. But it is one.
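Here is roughly what “one metric, plus all these other things” looks like once it becomes a decision rule. This is a hypothetical sketch, not the logic of any particular experimentation platform: the MetricResult type, the ship_decision function, the max_harm tolerance, and all the numbers are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    name: str
    relative_change: float        # +0.03 means +3% versus control
    p_value: float
    higher_is_better: bool = True


def improvement(m: MetricResult) -> float:
    """Signed change where positive always means 'better'."""
    return m.relative_change if m.higher_is_better else -m.relative_change


def ship_decision(primary, guardrails, alpha=0.05, max_harm=0.01):
    """Return (decision, breached_guardrail_names).

    The primary metric decides whether there is a win at all.
    Guardrails can only veto: a significant move in the bad
    direction larger than max_harm blocks an automatic ship.
    """
    primary_wins = primary.p_value < alpha and improvement(primary) > 0
    breached = [g.name for g in guardrails
                if g.p_value < alpha and improvement(g) < -max_harm]
    if not primary_wins:
        return "no ship", breached
    if breached:
        return "investigate", breached
    return "ship", breached


primary = MetricResult("conversion", 0.03, 0.01)
guardrails = [
    MetricResult("page_load_time", 0.08, 0.002, higher_is_better=False),
    MetricResult("unsubscribe_rate", 0.002, 0.60, higher_is_better=False),
]
decision, breached = ship_decision(primary, guardrails)
print(decision, breached)
```

In this made-up example the primary metric wins, yet the rule still refuses to ship on its own, because page load time got significantly worse. The code has one metric in the signature and several in the veto list.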

I think the compression is inherent. Not fixable by adding more compressed signals. Not a bug — a property of measurement itself. The act of measuring is lossy. The skill is hearing what you lost.
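A toy example of that loss, with invented numbers: two segments react in opposite directions, and the aggregate reports that nothing happened. The segment names and per-user deltas below are assumptions made up for the sketch.

```python
from statistics import mean

# Hypothetical per-user revenue change (treatment minus control), by segment.
delta_by_segment = {
    "new_users": [2.0, 3.0, 4.0],
    "returning_users": [-2.0, -3.0, -4.0],
}

# The single compressed number a dashboard would show.
overall = mean(d for deltas in delta_by_segment.values() for d in deltas)

# What the compression threw away.
per_segment = {name: mean(deltas) for name, deltas in delta_by_segment.items()}

print("overall:", overall)
for name, value in per_segment.items():
    print(f"{name}: {value:+.1f}")
```

The overall mean is zero, so the test reads as flat, while new users gained three units each and returning users lost three. Adding a segment breakdown recovers this particular loss, but only because someone already suspected which axis to cut along.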

The compression isn’t the problem

Here’s what I’m not saying: “throw away your metrics.” That’s as flat as worshipping them.

A/B testing with one metric works. NPS works. DAU works. They’re useful — genuinely useful, good enough for most decisions.

The problem isn’t compression. It’s forgetting you compressed.

There’s no such thing as a purely positive or negative outcome — that’s 1-bit encoding of an analog reality. There are no clean winners and losers — that’s a scoreboard applied to something that isn’t a game. Even “good model vs. bad model” is too flat — the interesting question is what does this model see that the other one doesn’t?

Every time you choose a metric, you choose what to throw away. The question isn’t whether your measurement is accurate. It’s whether you remember what you lost when you chose it.

The real skill

The analysts and PMs who are genuinely good — not just competent, but good — hold both at once. They run the A/B test with one metric AND hear the harmonics. They check the dashboard AND read the room. They trust the model AND trust the gut.

Not because either one alone is wrong. Because either one alone is flat.

Update: After publishing, I realized the article needed an illustration. I found one immediately in the inscription on the One Ring: “One Ring to rule them all, One Ring to find them, One Ring to bring them all and in the darkness bind them.”

The parallel isn’t that one metric is evil. It’s that the Ring’s appeal is the same: one thing to control everything. Clean, powerful, efficient. But the Ring doesn’t just simplify — it blinds. Whoever wears it stops seeing what they used to see. “In the darkness bind them” — the darkness isn’t a technique. It’s what happens to everything the metric doesn’t illuminate. You stop noticing what you lost.

Sauron didn’t set out to make the world dark. He set out to control it efficiently. One ring. One metric. The darkness is a side effect of the compression.

And while we’re at it — Tomorrowland (2015, dir. Brad Bird). A machine built to predict the future detects that Earth is heading toward destruction. It broadcasts the probability to humanity. Instead of motivating change, the number makes people give up. The prediction becomes self-fulfilling. One metric — probability of doom — compressed a complex situation into a single number, and the number replaced the reality it was supposed to measure. The dashboard said “you’re losing,” and everyone stopped trying to win.

#productmanagement #analytics #systemsthinking