FMEA and risks

The stack has a formal FMEA (Failure Mode and Effects Analysis): 20 identified failure modes, prioritized by RPN = Severity × Probability × Detection, each with its mitigation.
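
As a rough illustration of the RPN formula above, the sketch below scores a few failure modes and sorts them by priority. The `FailureMode` class, the 1 to 10 scales, and the scores themselves are assumptions made for this example (they follow common FMEA convention, where a higher detection score means the failure is harder to detect); they are not taken from docs/shared/FMEA.md.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """Hypothetical record for one FMEA row (not the stack's actual schema)."""
    fid: str
    severity: int     # assumed scale: 1 (negligible) .. 10 (catastrophic)
    probability: int  # assumed scale: 1 (rare) .. 10 (near certain)
    detection: int    # assumed scale: 1 (always caught) .. 10 (effectively invisible)

    @property
    def rpn(self) -> int:
        # RPN = Severity x Probability x Detection
        return self.severity * self.probability * self.detection


def prioritize(modes):
    """Return failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Made-up scores, for illustration only.
modes = [
    FailureMode("FA", severity=5, probability=4, detection=3),
    FailureMode("FB", severity=9, probability=6, detection=7),
    FailureMode("FC", severity=8, probability=5, detection=8),
]
ranked = [m.fid for m in prioritize(modes)]
```

Because the three factors multiply, a failure that is severe and hard to detect can outrank one that is more likely but easy to catch.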

Why it matters

The FMEA is a large part of what makes the stack production-grade. Rather than waiting for things to fail in production, we anticipate failure modes and design mitigations beforehand.

Top failures by RPN

| ID | Failure mode | RPN | Status | Mitigation |
|---|---|---|---|---|
| F12 | PEER NGA-West2 manual download blocks pipeline | 378 | CLOSED | Fallback to synthetic GM (Kanai-Tajimi) with disclaimer |
| F10 | OpenSeesPy Python 3.12 vs stack 3.13 | 324 | NO-FIX | Boot health check + README Step 0 |
| F20 | Numeric claim without source passes Gate 0.5 | 320 | CLOSED | claim_detector.py regex + claim-verb requirement |
| F09 | VERIFY Gate 2 (Q1/Q2) blocks late | 280 | CLOSED | preflight_statistics.py in COMPUTE C5.5 pre-IMPLEMENT |
| F16 | Retraction Watch not wired into verify_citations | 243 | CLOSED | Fast-path RETRACTED verdict (CrossRef Labs) |
| F06 | Claude API quota exhausted mid-batch | 240 | CLOSED | claude_budget_tracker.py circuit breaker + checkpoints |
| F05 | OpenAlex rate limit without shared backoff | 210 | CLOSED | rate_limiter.py token bucket + http_cache.py |
| F04 | Session compaction mid-IMPLEMENT | 192 | CLOSED | canonical topic_key + mandatory per-phase saves |
| F19 | params.yaml with null fields post-edit | 180 | CLOSED | validate_params_ssot.py pre-commit |
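
To make one of these mitigations concrete, here is a minimal token bucket of the kind the F05 row names. It is a generic sketch of the technique, not the contents of rate_limiter.py: the `TokenBucket` class, its parameters, and the injectable `clock` argument are assumptions made for this example.

```python
import time


class TokenBucket:
    """Generic token bucket (illustrative; not the stack's rate_limiter.py)."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self._clock = clock       # injectable so the bucket can be tested
        self._tokens = capacity   # start full
        self._last = clock()

    def _refill(self) -> None:
        # Add tokens for the time elapsed since the last call, capped at capacity.
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Spend `cost` tokens if available; return False instead of blocking."""
        self._refill()
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

A caller would check `try_acquire()` before each outgoing request and back off when it returns False. Sharing one bucket instance across all callers is what keeps the combined request rate under the limit.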

The scary one: F02 — stale data

F02 (COMPUTE C5 passes with stale data) has severity 10 because it's not a crash: it's a silent success on the wrong data.

The horror flow: the user edits params.yaml but forgets to delete data/processed/ and regenerate it. COMPUTE then runs against same-named files produced with the old params, all gates go green, and the paper ships with an invisible lie. If someone outside discovers it, the result is a retraction.

Fix: generate_compute_manifest.py embeds a SHA-256 of params.yaml, plus a metadata token, in each CSV. Gate C5 fails if the embedded hash doesn't match the current params.yaml. verify_inputs_integrity() is wired into validate_submission.py.
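
The idea behind this fix can be sketched in a few lines. The code below is an assumption-laden illustration, not the actual generate_compute_manifest.py or verify_inputs_integrity(): the `# params_sha256=` comment line, `stamp_csv`, and `is_fresh` are hypothetical names and formats chosen for the example.

```python
import hashlib

# Hypothetical token format; the real metadata token may look different.
TOKEN_PREFIX = "# params_sha256="


def sha256_of(data: bytes) -> str:
    """Hex SHA-256 digest of the raw bytes of a params file."""
    return hashlib.sha256(data).hexdigest()


def stamp_csv(csv_text: str, params_bytes: bytes) -> str:
    """Prepend a comment line recording which params produced this CSV."""
    return f"{TOKEN_PREFIX}{sha256_of(params_bytes)}\n{csv_text}"


def is_fresh(stamped_csv: str, params_bytes: bytes) -> bool:
    """True only if the CSV was stamped with the hash of the current params."""
    first_line = stamped_csv.split("\n", 1)[0]
    if not first_line.startswith(TOKEN_PREFIX):
        return False  # an unstamped file is treated as stale
    recorded = first_line[len(TOKEN_PREFIX):].strip()
    return recorded == sha256_of(params_bytes)


old_params = b"damping: 0.05\n"
new_params = b"damping: 0.02\n"
stamped = stamp_csv("t,accel\n0.0,0.1\n", old_params)

fresh_before_edit = is_fresh(stamped, old_params)  # params unchanged
fresh_after_edit = is_fresh(stamped, new_params)   # params edited, CSV not regenerated
```

Treating an unstamped file as stale is a deliberate choice in this sketch: a gate that passes when the token is missing would reintroduce the silent-success failure it exists to prevent.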

How mitigations are designed

Each mitigation follows the same pattern:

  1. Identify the failure mode and its chain of effects.
  2. Quantify RPN (severity, probability, detection).
  3. Design a code mitigation + a regression test.
  4. Close the failure (CLOSED) or document why it's not fixed (NO-FIX).
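
The outcome of step 4 can itself be checked mechanically. The sketch below assumes a hypothetical `FmeaEntry` record with `regression_test` and `rationale` fields; neither the record nor the `check_entry` helper comes from the stack. It flags any CLOSED entry that lacks a regression test and any NO-FIX entry that lacks a documented reason.

```python
from dataclasses import dataclass
from typing import List, Optional

VALID_STATUSES = {"CLOSED", "NO-FIX"}


@dataclass
class FmeaEntry:
    """Hypothetical FMEA record; field names are assumptions for this sketch."""
    fid: str
    status: str
    regression_test: Optional[str] = None  # e.g. a test name, if CLOSED
    rationale: Optional[str] = None        # why it is not fixed, if NO-FIX


def check_entry(entry: FmeaEntry) -> List[str]:
    """Return a list of problems; an empty list means the entry is consistent."""
    problems = []
    if entry.status not in VALID_STATUSES:
        problems.append(f"{entry.fid}: unknown status {entry.status!r}")
    elif entry.status == "CLOSED" and not entry.regression_test:
        problems.append(f"{entry.fid}: CLOSED without a regression test")
    elif entry.status == "NO-FIX" and not entry.rationale:
        problems.append(f"{entry.fid}: NO-FIX without a documented rationale")
    return problems


# Made-up entries, for illustration only.
ok_closed = FmeaEntry("FX", "CLOSED", regression_test="test_fx_mitigation")
bad_closed = FmeaEntry("FY", "CLOSED")
bad_nofix = FmeaEntry("FZ", "NO-FIX")
```

A check like this could run in the regression suite, so an entry cannot be marked closed on paper without the test that backs it.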

Failures discovered in real production use (e.g. in the laicsee-2026 child) feed back into the FMEA: 5 such bugs were closed in a single batch, all with regression tests.

Current state

Of the 20 failures, most are CLOSED and a few are NO-FIX (resolved another way or covered transitively). The regression suite is tests/test_fmea_mitigations.py.

Canonical source

Derives from docs/shared/FMEA.md, which keeps the full table of 20 failures.