We have 8 observations — too few to be sure yet.
Selection bias: the log captures notable days (giant swells, storms), not quiet competent ones. Inflates apparent miss rate. Automated daily snapshots would correct this.
The accuracy score.
Brier score: average squared gap between predicted and actual. Perfect = 0. Always-50/50 = 0.25. Lower is better.
Quality-to-probability mapping: 1–3 → P=0.10, 4–5 → 0.40, 6 → 0.60, 7–8 → 0.80, 9–10 → 0.95. Fixed reference points, not calibrated. Recalibrate once each bucket has ≥5 observations.
Accuracy by quality rating.
One observation per bucket. Numbers shown for transparency only.
Engine called 2/10 on 5/31 — user surfed shoulder-only
Engine called 4/10 on 6/1 — actually unsurfable (directionally correct call, n=1)
Engine called 9/10 on 6/2 — actually unrideable (the large miss)
Gold bars = predicted probability of surfable. Red bars = actual surfable rate from observations. Each bucket has exactly one data point — no statistical meaning. The gap in the 7–10 bucket reflects the June 2 miss, not a confirmed pattern of overconfidence. With one observation per bucket, the two are indistinguishable.
| Quality Bin | n (observations) | Predicted P(surfable) | Actual P(surfable) | Reliability |
|---|---|---|---|---|
| 1–3 (poor forecast) | 1 | 0.10 | 1.0 | n=1 — no statistical value |
| 4–6 (marginal forecast) | 1 | 0.40 | 0.0 | n=1 — no statistical value |
| 7–10 (confident forecast) | 1 | 0.875 | 0.0 | n=1 — dominated by 6/2 miss |
Breaking down each miss.
Three misses traceable through the engine:
| Date | Engine Q | Actual condition | Grade error | Attributed rule | Status |
|---|---|---|---|---|---|
| 2026-05-31 | 2 | shoulder-only (surfable) | +5 vs ideal |
Engine under-scored: mixed-trains penalty (std=1.17s) fired correctly for the chaotic day-level water, but the shoulder of the wave was still rideable. The engine has no "shoulder-rideable" sub-call — it scores the whole day, not the shoulder of the break.
No snapshot — heuristic attribution |
|
| 2026-06-02 | 9 | unrideable | −6 vs ideal |
Wave-power penalty set too leniently. Wave power was around 30 kW/m (roughly equivalent to an 8–9 ft swell at 14+ seconds — beyond what Saladita can handle cleanly). The engine's penalty only triggered at ≥25 kW/m (−2 points), which was not enough to cancel the bonuses awarded for a favorable swell direction (+3 for 200°) and long period (+1.5 for 14s+). The threshold needed to be lower (≤20 kW/m to trigger a −3 penalty) to flag that the size-plus-period combination produces walling closeouts at Saladita, not peeling rides.
Partially fixed — threshold tightened |
|
| 2026-06-08 | — | unrideable (blown, stormy) | — |
No storm proximity feature existed. Tropical Storm Boris (~270 km southeast of Saladita) created onshore chaotic conditions. Standard weather apps flagged the storm; the engine had no National Hurricane Center advisory feed at all. This is a feature gap — not a miscalibrated threshold, but a completely missing input. No engine snapshot exists for this date; the miss was reported by the user and confirmed by the storm track record.
Fixed 2026-06-08 |
What the accuracy log needs to work.
Three data gaps: (1) ordinary good days go unlogged; (2) only one matched engine snapshot exists (2026-05-31); (3) all 8 observations fall in a single 9-day swell window. Fix: scheduled daily snapshot saving date, quality, wave height, period variability, direction, wave power. At one snapshot/day: 20 observations in 3 weeks; 365/year sustained.
| Analysis | Current status | Unlock condition |
|---|---|---|
| Brier score with tight CI | CI spans 0.16–0.90 — uninformative | n≥20 scoreable pairs |
| Calibration plot (per-bin P) | n=1 per bin — no statistical value | n≥5 per bin (≈n=15 total with diverse quality) |
| Season-of-year decomposition | 9-day window — no seasonal signal | n≥50, spanning ≥3 calendar months |
| Per-rule ablation study | No counterfactual snapshots available | Automated daily snapshots + n≥30 |
| Calibrated P(surfable) per bucket | Fixed midpoints only | n≥5 per quality bin |
What this tells us so far.
Brier 0.624 at n=3 reads as: one large miss, cause identified (wave-power threshold), fix applied. June 8 miss had a different cause (missing NHC proximity feature), also fixed. Both fixes would produce different calls on the same conditions today. Whether they're sufficient requires more observations. Until daily snapshots run automatically, this audit only describes days someone found memorable enough to report.
scripts/analyze_engine_skill.py is rerun against the latest data/ground_truth/observations.jsonl. The underlying data artifact is at functions/api/_findings_engine_skill.js. Numbers will change as observations accumulate.