FMEA and risks

The stack has a formal FMEA (Failure Mode and Effects Analysis): 20 identified failure modes, prioritized by RPN = Severity × Probability × Detection, each with its mitigation.
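As a minimal sketch of how the RPN ranking works, the snippet below computes the product and sorts by it. The 1 to 10 scales follow the usual FMEA convention (a higher detection score means the failure is harder to detect) and are an assumption here; the `FailureMode` class, the `X1`..`X3` IDs, and every score are hypothetical and do not correspond to entries in the stack's FMEA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    ident: str
    severity: int     # assumed scale: 1 (negligible) .. 10 (catastrophic)
    probability: int  # assumed scale: 1 (rare) .. 10 (almost certain)
    detection: int    # assumed scale: 1 (always caught) .. 10 (hard to detect)

    @property
    def rpn(self) -> int:
        # RPN = Severity x Probability x Detection
        return self.severity * self.probability * self.detection


def rank_by_rpn(modes):
    """Return failure modes sorted from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Hypothetical scores, for illustration only.
modes = [
    FailureMode("X3", severity=4, probability=5, detection=3),
    FailureMode("X1", severity=6, probability=4, detection=7),
    FailureMode("X2", severity=10, probability=2, detection=8),
]
ranked = rank_by_rpn(modes)
for mode in ranked:
    print(mode.ident, mode.rpn)
```

Because the three factors multiply, a failure with maximum severity can still rank below a milder one that is more likely and harder to detect, which is why severity is worth reading alongside the RPN.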

Why it matters

The FMEA is what makes the stack production-grade: instead of waiting for things to fail in production, we anticipate failure modes and design mitigations beforehand.

Top failures by RPN

ID	Failure mode	RPN	Status	Mitigation
F12	PEER NGA-West2 manual download blocks pipeline	378	CLOSED	Fallback to synthetic GM (Kanai-Tajimi) with disclaimer
F10	OpenSeesPy Python 3.12 vs stack 3.13	324	NO-FIX	Boot health check + README Step 0
F20	Numeric claim without source passes Gate 0.5	320	CLOSED	`claim_detector.py` regex + claim-verb requirement
F09	VERIFY Gate 2 (Q1/Q2) blocks late	280	CLOSED	`preflight_statistics.py` in COMPUTE C5.5 pre-IMPLEMENT
F16	Retraction Watch not wired into verify_citations	243	CLOSED	Fast-path RETRACTED verdict (CrossRef Labs)
F06	Claude API quota exhausted mid-batch	240	CLOSED	`claude_budget_tracker.py` circuit breaker + checkpoints
F05	OpenAlex rate limit without shared backoff	210	CLOSED	`rate_limiter.py` token bucket + `http_cache.py`
F04	Session compaction mid-IMPLEMENT	192	CLOSED	canonical topic_key + mandatory per-phase saves
F19	params.yaml with null fields post-edit	180	CLOSED	`validate_params_ssot.py` pre-commit
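To illustrate one of the techniques named in the table, here is a minimal token bucket sketch of the kind the F05 mitigation refers to. It is not the code in `rate_limiter.py`: the `TokenBucket` class, its parameters, and the injectable clock are assumptions made for this example.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so tests do not have to sleep
        self.tokens = float(capacity)
        self.updated = clock()

    def _refill(self):
        # Add tokens for the elapsed time, never exceeding capacity.
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now

    def try_acquire(self, cost=1.0):
        """Spend `cost` tokens if available; return False when the caller must back off."""
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A single bucket instance shared by every caller of the same API is what makes the backoff "shared": each request calls `try_acquire()` first and waits or retries later when it returns `False`.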

The scary one: F02 — stale data

F02 (COMPUTE C5 passes with stale data) has severity 10 because it's not a crash: it's a silent success on the wrong data.

The horror flow: the user edits params.yaml but forgets to delete data/processed/ and regenerate it. COMPUTE then runs on same-named files that were produced with the old params, all gates go green, and the paper ships with an invisible lie. If someone outside discovers it, the result is a retraction.

Fix: generate_compute_manifest.py embeds a SHA-256 hash of params.yaml, plus a metadata token, in each CSV. Gate C5 fails if the embedded hash doesn't match the current params.yaml. verify_inputs_integrity() is wired into validate_submission.py.
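The idea behind the hash check can be sketched as follows. This is an illustration under stated assumptions, not the contents of generate_compute_manifest.py: the `# params_sha256=` first-line format, the helper names, and the sample parameter values are all hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path

HASH_PREFIX = "# params_sha256="  # hypothetical metadata-line format


def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def write_stamped_csv(csv_path: Path, params_path: Path, body: str) -> None:
    """Write a CSV whose first line records the hash of the params used."""
    stamp = f"{HASH_PREFIX}{sha256_of(params_path)}\n"
    csv_path.write_text(stamp + body, encoding="utf-8")


def is_fresh(csv_path: Path, params_path: Path) -> bool:
    """True only if the CSV was stamped with the current params hash."""
    with csv_path.open(encoding="utf-8") as fh:
        first_line = fh.readline().rstrip("\n")
    if not first_line.startswith(HASH_PREFIX):
        return False  # unstamped output is treated as stale
    return first_line[len(HASH_PREFIX):] == sha256_of(params_path)


# Demo: editing the params after generating the CSV makes the check fail.
with tempfile.TemporaryDirectory() as tmp:
    params = Path(tmp) / "params.yaml"
    out = Path(tmp) / "results.csv"
    params.write_text("damping: 0.05\n", encoding="utf-8")
    write_stamped_csv(out, params, "t,a\n0.0,0.1\n")
    fresh_before = is_fresh(out, params)
    params.write_text("damping: 0.02\n", encoding="utf-8")  # edit, no regeneration
    fresh_after = is_fresh(out, params)
```

The point of the design is that the check depends on file contents rather than on file names or timestamps, so a same-named stale file cannot pass silently.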

How mitigations are designed

Each mitigation follows the same pattern:

1. Identify the failure mode and its chain of effects.
2. Quantify RPN (severity, probability, detection).
3. Design a code mitigation + a regression test.
4. Close the failure (CLOSED) or document why it's not fixed (NO-FIX).
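The third step, a code mitigation paired with a regression test, might look like the sketch below, loosely modeled on F19 (null fields in params.yaml). The `find_null_fields` helper, the parameter names, and the test names are hypothetical; the real check lives in `validate_params_ssot.py` and may work differently.

```python
def find_null_fields(params, prefix=""):
    """Return the dotted path of every None value in a nested dict."""
    missing = []
    for key, value in params.items():
        path = f"{prefix}{key}"
        if value is None:
            missing.append(path)
        elif isinstance(value, dict):
            missing.extend(find_null_fields(value, prefix=f"{path}."))
    return missing


# Regression tests (pytest style): one reproduces the failure, one guards the happy path.
def test_null_field_is_reported():
    params = {"model": {"damping": None, "height_m": 12.0}, "seed": 42}
    assert find_null_fields(params) == ["model.damping"]


def test_complete_params_pass():
    params = {"model": {"damping": 0.05, "height_m": 12.0}, "seed": 42}
    assert find_null_fields(params) == []
```

Keeping the reproducing test in the suite is what turns a closed failure into a permanent guard: if the mitigation is later removed or weakened, the test fails.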

Failures discovered in real production use (e.g. the laicsee-2026 child) feed back into the FMEA: 5 bugs were closed in a single batch, all with regression tests.

Current state

Of the 20 failure modes, most are CLOSED and a few are NO-FIX (resolved another way or covered transitively). The regression suite lives in tests/test_fmea_mitigations.py.