The Proof Problem: What Counts as Evidence You Did the Thing?

Most accountability systems fail at exactly the same point: the evidence standard is undefined.

You commit to a daily workout. Your friend asks if you did it. You say yes. Did you?

There is no answer to that question. The problem is not dishonesty; it is that the commitment was never paired with a definition of what would count as proof. Without an evidence standard, “I did it” and “I claim I did it” become the same statement. And the pattern described in the behavioral-economics literature on commitment devices is what you would expect when claim and proof collapse into each other: behavior decays toward whatever requires the least effort to claim.

This piece is about how to fix that, by importing a framework from a place that has spent centuries thinking about evidence: forensic science.

What forensic science gets right about evidence

In forensic investigation, evidence is graded by what is called probative value — how strongly it supports the underlying claim, and how resistant it is to alternative explanations. The grading is hierarchical: physical evidence outranks photographic evidence; photographic evidence outranks eyewitness testimony; eyewitness testimony outranks self-report.

The hierarchy is structured around one question: how much effort would it take to fake this evidence relative to the underlying behavior?

If the answer is “more effort than just doing the thing,” the evidence is strong. If the answer is “less effort than doing the thing,” the evidence is weak.

Habits work the same way.
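
As a rough illustration, the test above reduces to a single comparison. The sketch below is hypothetical: the function name evidence_strength, the effort unit (minutes), and the choice to treat a tie as weak are assumptions made for the example, not part of any forensic standard.

```python
def evidence_strength(fake_effort: float, do_effort: float) -> str:
    """Grade evidence by the effort to fake it relative to the effort to do the thing.

    Both arguments use the same hypothetical unit (minutes of effort here).
    Assumption: a tie counts as weak, since faking is then no harder than doing.
    """
    return "strong" if fake_effort > do_effort else "weak"


# Ticking a checkbox costs nothing to fake; a 45-minute workout does not.
print(evidence_strength(fake_effort=0, do_effort=45))    # weak
# Staging a convincing real-time clip is assumed to cost more than the workout.
print(evidence_strength(fake_effort=90, do_effort=45))   # strong
```

The numbers are placeholders. What matters is the direction of the inequality, not the precision of the estimates.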

The four-tier proof hierarchy

For accountability commitments, here is the working hierarchy. Each tier produces more behavior than the one before it, because each tier raises the cost of a false claim.

Tier 1 — Claim

“I did it.” A check-in, a verbal yes, a checkbox in an app. No verification mechanism. The lowest tier on the hierarchy and the most common in practice.

Falsification cost: Zero. You can claim anything.

When it works: For low-stakes commitments you genuinely want to complete and where ambient social pressure is doing most of the work. For everything else, it does not work.

Tier 2 — Photo evidence

A still image showing you and the activity, ideally with the setting visible and timestamp metadata attached. Gym selfie, page of completed work, finished plate, finished run on the watch.

Falsification cost: Moderate. Old photos exist. Setups exist. But the effort to maintain a falsification system over time usually exceeds the effort to just do the thing.

When it works: Daily-habit commitments at moderate stakes — workouts, study sessions, reading, language practice. The metadata-richness of phone photography makes Tier 2 surprisingly hard to game over a multi-week streak.

Tier 3 — Video evidence

A short clip showing the activity happening in real time, with implicit timestamp and setting. Pre-workout face cam, real-time language practice clip, video of you in the chair at 6am.

Falsification cost: High. Faking continuous motion and ambient context requires a level of premeditation that almost no one will actually undertake. (See why video proof beats self-report for the full explanation.)

When it works: High-stakes habits where streak integrity matters — sober streaks, fitness goals tied to events, study commitments tied to outcomes. Also: one-off events where missing once nullifies the entire commitment.

Tier 4 — Witnessed video

Video evidence captured in front of, or live-streamed to, at least one named witness. The recording exists and a specific human can attest to its real-time authenticity.

Falsification cost: Approaching prohibitive. Faking witnessed video requires conspiracy with the witness, which is structurally different from solo falsification.

When it works: The hardest commitments — quit dates, recovery streaks, one-shot life events. Also: any time you have caught yourself rationalizing your way out of lower tiers.
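
For readers who want to build this into a tracker, the four tiers map naturally onto an ordered type. The following is a minimal sketch, not a reference implementation: the names Tier, TierInfo, HIERARCHY, and satisfies are hypothetical, the cost labels are the qualitative ones used in the tier descriptions, and the rule that higher-tier proof also satisfies a lower-tier requirement is an assumption.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """The four proof tiers, ordered so that a higher value means stronger proof."""
    CLAIM = 1
    PHOTO = 2
    VIDEO = 3
    WITNESSED_VIDEO = 4


@dataclass(frozen=True)
class TierInfo:
    evidence: str
    falsification_cost: str  # qualitative label taken from the tier description


HIERARCHY = {
    Tier.CLAIM: TierInfo("check-in, verbal yes, or checkbox", "zero"),
    Tier.PHOTO: TierInfo("still image with setting and timestamp", "moderate"),
    Tier.VIDEO: TierInfo("short real-time clip", "high"),
    Tier.WITNESSED_VIDEO: TierInfo(
        "video in front of a named witness", "approaching prohibitive"
    ),
}


def satisfies(submitted: Tier, required: Tier) -> bool:
    """Assumption: proof from a higher tier also satisfies a lower-tier requirement."""
    return submitted >= required


print(satisfies(Tier.VIDEO, Tier.PHOTO))   # True
print(satisfies(Tier.CLAIM, Tier.PHOTO))   # False
```

Using an integer-backed enum keeps the ordering explicit, so "is this proof good enough?" becomes a plain comparison.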

How to assign a tier to a specific commitment

The simplest rule: pick the lowest tier whose falsification cost is higher than the cost of just doing the behavior.
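
That rule can be sketched as a short lookup. Everything numeric here is an assumption: the hierarchy only gives the ordering of the tiers, so the "effort units" and the specific values in FALSIFICATION_COST are made up for illustration, and lowest_sufficient_tier is a hypothetical helper name.

```python
from typing import Dict, Optional

# Hypothetical falsification costs per tier, in arbitrary "effort units".
# Only the ordering (1 < 2 < 3 < 4) comes from the hierarchy; the numbers are made up.
FALSIFICATION_COST: Dict[int, float] = {1: 0.0, 2: 3.0, 3: 8.0, 4: 20.0}


def lowest_sufficient_tier(
    behavior_cost: float,
    costs: Dict[int, float] = FALSIFICATION_COST,
) -> Optional[int]:
    """Return the lowest tier whose falsification cost exceeds behavior_cost.

    Returns None when even the top tier is cheaper to fake than the behavior
    is to perform, under the supplied cost estimates.
    """
    for tier in sorted(costs):
        if costs[tier] > behavior_cost:
            return tier
    return None


print(lowest_sufficient_tier(2.0))   # 2
print(lowest_sufficient_tier(5.0))   # 3
print(lowest_sufficient_tier(25.0))  # None
```

In practice the behavior cost should include temptation and past failure, not just time, which is why a hard commitment lands on a higher tier than its duration alone would suggest.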

For a 6am workout where you are mildly motivated and the social cost of skipping is real, Tier 2 is usually enough. The photo is hard enough to fake that you will just do the workout.

For a sober streak where the cost of failure is severe and the temptation to “claim a clean day” is high, Tier 1 will not survive a single bad week. You need Tier 3 minimum, Tier 4 ideally.

For training toward a public-event commitment (a marathon, a stage performance, a deadline), Tier 3 with a named witness who reviews the clips afterward is the sustainable configuration. It stops short of the real-time witnessing that Tier 4 requires.

Why most habit apps stop at Tier 1

Three reasons, none of them about whether the higher tiers would work:

Friction. Tier 1 is the lowest-friction interface (tap a box). User retention metrics favor low friction.
Trust in self-report. App designers assume their users want to be honest with themselves. Sometimes this is true. For habits that fail, it almost never is.
The “we are a tool, not an enforcer” framing. Many tracking apps explicitly position themselves as awareness tools rather than commitment devices. This is honest about their scope but obscures their limits.

The proof hierarchy is not an argument against Tier 1 tracking. It is an argument that Tier 1 tracking is necessary but insufficient for any commitment that the user has previously failed to keep with willpower alone.

What this changes about habit design

If you have been trying to build a habit for months and failing, the question is rarely “do I want this enough?” It is almost always “what tier of proof am I requiring of myself, and is that tier high enough given how badly I have previously failed at this?”

If the answer is Tier 1, raise it. If the answer is Tier 2 and the habit is high-stakes, raise it again. The hierarchy is the lever. Use it.
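
One way to read that escalation rule, for a habit that has already been failing, is sketched below. The function name escalate and the boolean high_stakes flag are hypothetical, and the sketch assumes the two steps chain, so a high-stakes habit tracked at Tier 1 moves up twice.

```python
def escalate(current_tier: int, high_stakes: bool) -> int:
    """Suggest a proof tier for a habit that keeps failing at current_tier."""
    tier = current_tier
    if tier == 1:
        tier = 2  # Tier 1 on a failing habit: raise it
    if tier == 2 and high_stakes:
        tier = 3  # Tier 2 on a high-stakes habit: raise it again
    return tier


print(escalate(1, high_stakes=False))  # 2
print(escalate(1, high_stakes=True))   # 3
print(escalate(3, high_stakes=True))   # 3
```

The sketch leaves Tiers 3 and 4 untouched, because the rule as stated only addresses the two lowest tiers.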

Frequently asked

What is the difference between a claim and proof in accountability? A claim is what you say happened. Proof is what an outside observer can verify happened. Most accountability systems collapse the two — they treat your claim as proof. This is the single largest reason habits fail to stick: the evidence standard is missing.

What is the minimum acceptable evidence for a habit? For daily habits at moderate stakes, photographic evidence with timestamp metadata is the floor. For high-stakes commitments (financial, relational, or health), short video clips are the minimum, and video with a witness present in real time is the appropriate standard for the hardest ones. Self-report alone is adequate only for low-stakes commitments you genuinely want to complete; it is below the floor for anything that matters.

Can AI-generated images defeat photo or video proof? Not yet, at the level of ordinary use. Producing a convincing fake photo of you at the gym takes significantly more effort than just going to the gym. The economics of evidence still favor the honest path, which is the property a good commitment device requires.