FMEA and risks
The stack has a formal FMEA (Failure Mode and Effects Analysis): 20 identified failure modes, prioritized by Risk Priority Number (RPN = Severity × Probability × Detection), each with its mitigation.
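
As a rough illustration of the prioritization, the sketch below computes RPN for a few entries and sorts them from highest to lowest. The 1-10 rating scale, the `FailureMode` class, the `prioritize` helper, and the example ratings are all assumptions made for illustration; none of them come from the actual FMEA table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One FMEA row. Ratings are assumed to be integers on a 1-10 scale."""
    fid: str
    severity: int      # 10 = worst outcome
    probability: int   # 10 = almost certain to occur
    detection: int     # 10 = almost impossible to detect before impact

    @property
    def rpn(self) -> int:
        # RPN = Severity x Probability x Detection
        return self.severity * self.probability * self.detection


def prioritize(modes):
    """Return failure modes sorted by descending RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Hypothetical ratings, chosen only to show the arithmetic.
modes = [
    FailureMode("FA", severity=5, probability=4, detection=3),    # 60
    FailureMode("FB", severity=10, probability=2, detection=9),   # 180
    FailureMode("FC", severity=7, probability=7, detection=2),    # 98
]
ranked = prioritize(modes)
print([(m.fid, m.rpn) for m in ranked])
```

Note how the hypothetical `FB` ranks first despite its low probability: a severe failure that is hard to detect can outrank a frequent but visible one.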
Why it matters
The FMEA is a large part of why the stack is production-grade: instead of waiting for things to fail in production, we anticipate failure modes and design mitigations beforehand.
Top failures by RPN
| ID | Failure mode | RPN | Status | Mitigation |
|---|---|---|---|---|
| F12 | PEER NGA-West2 manual download blocks pipeline | 378 | CLOSED | Fallback to synthetic GM (Kanai-Tajimi) with disclaimer |
| F10 | OpenSeesPy Python 3.12 vs stack 3.13 | 324 | NO-FIX | Boot health check + README Step 0 |
| F20 | Numeric claim without source passes Gate 0.5 | 320 | CLOSED | claim_detector.py regex + claim-verb requirement |
| F09 | VERIFY Gate 2 (Q1/Q2) blocks late | 280 | CLOSED | preflight_statistics.py in COMPUTE C5.5 pre-IMPLEMENT |
| F16 | Retraction Watch not wired into verify_citations | 243 | CLOSED | Fast-path RETRACTED verdict (CrossRef Labs) |
| F06 | Claude API quota exhausted mid-batch | 240 | CLOSED | claude_budget_tracker.py circuit breaker + checkpoints |
| F05 | OpenAlex rate limit without shared backoff | 210 | CLOSED | rate_limiter.py token bucket + http_cache.py |
| F04 | Session compaction mid-IMPLEMENT | 192 | CLOSED | canonical topic_key + mandatory per-phase saves |
| F19 | params.yaml with null fields post-edit | 180 | CLOSED | validate_params_ssot.py pre-commit |
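
The F05 row names a token bucket as its mitigation. The following is a minimal sketch of that technique, not the contents of rate_limiter.py: the `TokenBucket` class, its parameters, and the injectable clock are assumptions made for illustration.

```python
import time


class TokenBucket:
    """Minimal token bucket: `rate` tokens refill per second, up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.clock = clock          # injectable so tests need no real sleeping
        self.tokens = float(capacity)
        self.last = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last call, capped at capacity.
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now

    def try_acquire(self, n=1):
        """Take n tokens if available; return False instead of blocking."""
        self._refill()
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False


# Usage with a fake clock: 2 requests/second, bursts of up to 2.
fake_now = [0.0]
bucket = TokenBucket(rate=2, capacity=2, clock=lambda: fake_now[0])
first = bucket.try_acquire()    # True: bucket starts full
second = bucket.try_acquire()   # True: second token of the burst
third = bucket.try_acquire()    # False: bucket is empty
fake_now[0] += 0.5              # half a second later, one token has refilled
fourth = bucket.try_acquire()   # True
```

Sharing a single bucket instance across all callers of the same API is what would make the limit collective rather than per-caller.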
The scary one: F02 (stale data)
F02 (COMPUTE C5 passes with stale data) has severity 10 because it's not a crash: it's a silent success on the wrong data.
The horror flow: the user edits params.yaml but forgets to delete and regenerate data/processed/, COMPUTE runs with the new params over old, same-named files, all gates go green, and the paper ships with an invisible lie. External discovery = retraction.
Fix: generate_compute_manifest.py embeds a SHA-256 of params.yaml + a metadata token in each CSV. Gate C5 fails if the hash doesn't match. verify_inputs_integrity() is wired into validate_submission.py.
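
The hash check can be sketched as follows. This is a minimal illustration, assuming the hash is stored as a `# params_sha256=` comment on the first line of each CSV; that header format, the helper names, and the `damping` parameter are hypothetical, and the real generate_compute_manifest.py and verify_inputs_integrity() may work differently.

```python
import hashlib
import tempfile
from pathlib import Path

HEADER_PREFIX = "# params_sha256="  # assumed header format, not the real one


def sha256_of(path):
    """Hex SHA-256 of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def write_csv_with_hash(csv_path, params_path, body):
    """Write CSV text preceded by a comment line carrying the params hash."""
    Path(csv_path).write_text(f"{HEADER_PREFIX}{sha256_of(params_path)}\n{body}")


def is_fresh(csv_path, params_path):
    """True only if the CSV was generated from the current params file."""
    with open(csv_path) as fh:
        first_line = fh.readline().strip()
    if not first_line.startswith(HEADER_PREFIX):
        return False  # no token at all: treat as stale
    return first_line[len(HEADER_PREFIX):] == sha256_of(params_path)


# Reproduce the stale-data scenario in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    params = Path(tmp) / "params.yaml"
    csv_file = Path(tmp) / "results.csv"

    params.write_text("damping: 0.05\n")
    write_csv_with_hash(csv_file, params, "t,acc\n0.0,0.1\n")
    fresh_before_edit = is_fresh(csv_file, params)   # True

    params.write_text("damping: 0.02\n")             # edit without regenerating
    fresh_after_edit = is_fresh(csv_file, params)    # False: the gate should fail
```

Treating a missing token as stale is a deliberate choice in this sketch: files produced before the check existed fail closed instead of passing silently.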
How mitigations are designed
Each mitigation follows the same pattern:
- Identify the failure mode and its chain of effects.
- Quantify RPN (severity, probability, detection).
- Design a code mitigation + a regression test.
- Close the failure (CLOSED) or document why it's not fixed (NO-FIX).
Failures discovered in real production use (e.g. in the laicsee-2026 child) feed back into the FMEA: 5 bugs were closed in a single batch, each with a regression test.
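
To make the "code mitigation + regression test" pairing concrete, here is a minimal sketch built around F19 (null fields in params.yaml). The `find_null_fields` helper, the test names, and the sample parameters are hypothetical; this is not the code in validate_params_ssot.py or tests/test_fmea_mitigations.py.

```python
def find_null_fields(params, prefix=""):
    """Return dotted paths of every None value in a nested dict/list."""
    nulls = []
    if isinstance(params, dict):
        items = params.items()
    elif isinstance(params, list):
        items = enumerate(params)
    else:
        # Leaf value: report its path only if it is null.
        return [prefix] if params is None else []
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        nulls.extend(find_null_fields(value, path))
    return nulls


def test_f19_null_fields_are_reported():
    """Regression test: a null left behind by an edit must be flagged."""
    params = {"model": {"damping": 0.05, "height_m": None}, "records": [1, None]}
    assert find_null_fields(params) == ["model.height_m", "records.1"]


def test_f19_complete_params_pass():
    """A fully populated params structure produces no findings."""
    params = {"model": {"damping": 0.05, "height_m": 12.0}}
    assert find_null_fields(params) == []


# pytest would collect the test_ functions; calling them directly also works.
test_f19_null_fields_are_reported()
test_f19_complete_params_pass()
```

Naming the test after the failure ID keeps the link between the FMEA row and its regression test easy to trace.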
Current state
Of the 20 failure modes, most are CLOSED and a few are NO-FIX (resolved another way or covered transitively). Regression suite: tests/test_fmea_mitigations.py.
See also
- Cache and rate limiter — F05 mitigation.
- Troubleshooting — symptoms of each failure.
- The pipeline — F09 preflight in COMPUTE C5.5.
Canonical source
Derives from docs/shared/FMEA.md, which keeps the full table of 20 failures.