Forecast engine skill audit.

8 observations. 3 scoreable. Brier 0.624 (95% CI: 0.16–0.90). One large miss, two known causes, both patched.

We have 8 observations — too few to be sure yet.

Load-bearing caveat

8 observations total. 7 human-reported. 3 fully scoreable. Every statistic on this page has a confidence interval wide enough to include both "useful engine" and "no better than guessing" as plausible outcomes. This is not a finished skill assessment — it is a working framework with early numbers. Treat the accuracy score as a pointer toward where to look, not as a verdict. Meaningful conclusions require at least 20 observations with varied conditions (small days, blown days, ordinary good days — not just memorable events).

Selection bias: the log captures notable days (giant swells, storms), not quiet competent ones, which inflates the apparent miss rate. Automated daily snapshots would correct this.

The accuracy score.

Brier score: the average squared gap between the predicted probability and the actual outcome (1 = surfable, 0 = not). Perfect = 0. Always-50/50 = 0.25. Lower is better.
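
The definition above can be sketched in a few lines. This is a minimal illustration, not the code in scripts/analyze_engine_skill.py; the function name `brier_score` and the toy inputs are assumptions made for the example.

```python
from typing import Sequence


def brier_score(predicted: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared gap between predicted probabilities and 0/1 outcomes."""
    if len(predicted) != len(outcomes) or len(predicted) == 0:
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - o) ** 2 for p, o in zip(predicted, outcomes)) / len(predicted)


# Toy inputs (not observations from the log): a constant 50/50 forecast
# scores 0.25 no matter how the outcomes are distributed.
always_half_all_surfable = brier_score([0.5, 0.5, 0.5], [1, 1, 1])
always_half_mixed = brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 0, 1])

# A forecast that is certain and right every time scores 0.
perfect = brier_score([1.0, 0.0], [1, 0])
```

Because each day's error under a 50/50 forecast is 0.5² = 0.25 whether the day turns out surfable or not, the baseline does not depend on how many surfable days the log contains.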

Brier score: 0.624 (95% CI: 0.160 to 0.903). Only 3 scoreable observations. The June 2 miss (a 9/10 call on an unrideable day) is the largest single contributor to this number.
Random-guess baseline: 0.25 (reference: always predicting 50/50). A model that guesses 50% surfable every day scores 0.25 whatever the mix of outcomes, because each day's squared error is 0.5² = 0.25.
Scoreable observations: 3 of 7 human reports. The other 4 reports lack a matching engine snapshot and cannot be scored without retroactive data.
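A Brier score of 0.624 looks poor, but the CI (0.160–0.903) runs from better than guessing to far worse than the 0.25 coin-flip baseline. Under the quality-to-probability mapping below, the June 2 miss alone has a squared error of 0.9025 (0.95²), about 48% of the summed squared error. At n=3, one outlier defines the metric. Draw no conclusions before n≥20.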
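
The width of that interval can be reproduced with a percentile bootstrap. The page does not say which interval method the script uses, so the sketch below is an assumption: it enumerates every resample exactly instead of sampling at random, and takes the three squared errors from the page's fixed mapping (2/10 → 0.10, 4/10 → 0.40, 9/10 → 0.95).

```python
import math
from itertools import product

# Squared errors for the three scoreable observations (outcomes 1, 0, 0),
# assuming the fixed quality-to-probability mapping given on this page.
squared_errors = [(0.10 - 1) ** 2, (0.40 - 0) ** 2, (0.95 - 0) ** 2]

# All equally likely size-3 resamples drawn with replacement: 3**3 = 27.
resample_means = sorted(
    sum(draw) / len(draw)
    for draw in product(squared_errors, repeat=len(squared_errors))
)


def percentile(sorted_values, q):
    """Nearest-rank percentile: smallest value whose cumulative share is >= q."""
    rank = max(1, math.ceil(q * len(sorted_values)))
    return sorted_values[rank - 1]


ci_low = percentile(resample_means, 0.025)   # about 0.16
ci_high = percentile(resample_means, 0.975)  # about 0.9025
```

With n=3, a resample made of the same observation three times has probability 1/27 (about 3.7%), which is more than the 2.5% tail on each side. Under this method the interval endpoints are therefore just the smallest and largest single-observation errors, which is why the CI is so wide.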

Quality-to-probability mapping: 1–3 → P=0.10, 4–5 → 0.40, 6 → 0.60, 7–8 → 0.80, 9–10 → 0.95. Fixed reference points, not calibrated. Recalibrate once each bucket has ≥5 observations.
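
As a cross-check, the mapping above can be applied to the three scoreable observations reported on this page. The function name and data layout below are illustrative assumptions, not the engine's actual code.

```python
def quality_to_probability(quality: int) -> float:
    """Fixed (uncalibrated) reference points from the mapping above."""
    if not 1 <= quality <= 10:
        raise ValueError("quality must be between 1 and 10")
    if quality <= 3:
        return 0.10
    if quality <= 5:
        return 0.40
    if quality == 6:
        return 0.60
    if quality <= 8:
        return 0.80
    return 0.95


# (date, engine quality, outcome) with outcome 1 = surfable, 0 = not surfable,
# as reported on this page for the three scoreable observations.
scoreable = [("5/31", 2, 1), ("6/1", 4, 0), ("6/2", 9, 0)]

squared_errors = {
    day: (quality_to_probability(quality) - outcome) ** 2
    for day, quality, outcome in scoreable
}
brier = sum(squared_errors.values()) / len(squared_errors)
print(f"Brier = {brier:.3f}")  # prints "Brier = 0.624"
```

The per-day squared errors come out at roughly 0.81 (5/31), 0.16 (6/1), and 0.9025 (6/2), which average to the reported 0.624.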

Accuracy by quality rating.

Each rating bucket contains exactly one observation, so the numbers below are shown for transparency only. A single data point cannot estimate a probability: these are starting points, not measured rates. Each bucket needs at least 5 observations to be informative.

1–3
Predicted P=0.10 · Actual P=1.0  (n=1)
Engine called 2/10 on 5/31 — user surfed shoulder-only
4–6
Predicted P=0.40 · Actual P=0.0  (n=1)
Engine called 4/10 on 6/1 — actually unsurfable (directionally correct call, n=1)
7–10
Predicted P=0.875 · Actual P=0.0  (n=1)
Engine called 9/10 on 6/2 — actually unrideable (the large miss)

Gold bars = predicted probability of surfable. Red bars = actual surfable rate from observations. Each bucket has exactly one data point, so the chart carries no statistical meaning. The gap in the 7–10 bucket comes from the June 2 miss alone; with one observation per bucket, a single miss and a real pattern of overconfidence are indistinguishable.

Quality bin | n (observations) | Predicted P(surfable) | Actual P(surfable) | Reliability
1–3 (poor forecast) | 1 | 0.10 | 1.0 | n=1, no statistical value
4–6 (marginal forecast) | 1 | 0.40 | 0.0 | n=1, no statistical value
7–10 (confident forecast) | 1 | 0.875 | 0.0 | n=1, dominated by 6/2 miss
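
A per-bin table like this one could be produced by grouping observations and flagging bins that fall short of the 5-observation minimum. This is a sketch under assumptions: the function names are hypothetical, and the predicted probabilities are simply the values listed in the table.

```python
from collections import defaultdict

MIN_PER_BIN = 5  # minimum observations per bin stated on this page


def bin_label(quality: int) -> str:
    """Assign an engine quality rating to one of the three display bins."""
    if quality <= 3:
        return "1–3"
    if quality <= 6:
        return "4–6"
    return "7–10"


def reliability_table(observations):
    """observations: iterable of (quality, predicted_p, outcome) tuples."""
    grouped = defaultdict(list)
    for quality, predicted_p, outcome in observations:
        grouped[bin_label(quality)].append((predicted_p, outcome))
    table = {}
    for label, pairs in grouped.items():
        n = len(pairs)
        table[label] = {
            "n": n,
            "predicted": sum(p for p, _ in pairs) / n,
            "actual": sum(o for _, o in pairs) / n,
            "informative": n >= MIN_PER_BIN,
        }
    return table


# The three rows as listed in the table above.
rows = reliability_table([(2, 0.10, 1), (4, 0.40, 0), (9, 0.875, 0)])
```

Every bin comes back with `informative` set to False, matching the "no statistical value" notes in the Reliability column.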

Breaking down each miss.

Three misses traceable through the engine:

Date | Engine Q | Actual condition | Grade error | Attributed rule | Status
2026-05-31 | 2 | shoulder-only (surfable) | +5 vs ideal | Engine under-scored: the mixed-trains penalty (std=1.17s) fired correctly for the chaotic day-level water, but the shoulder of the wave was still rideable. The engine has no "shoulder-rideable" sub-call; it scores the whole day, not the shoulder of the break. | No snapshot; heuristic attribution
2026-06-02 | 9 | unrideable | −6 vs ideal | Wave-power penalty set too leniently. Wave power was around 30 kW/m (roughly equivalent to an 8–9 ft swell at 14+ seconds, beyond what Saladita can handle cleanly). The engine's penalty triggered only at ≥25 kW/m and subtracted just 2 points, which was not enough to cancel the bonuses awarded for a favorable swell direction (+3 for 200°) and long period (+1.5 for 14s+). The threshold needed to be lower (20 kW/m or less) and the penalty larger (−3) to flag that the size-plus-period combination produces walling closeouts at Saladita, not peeling rides. | Partially fixed; threshold tightened
2026-06-08 | n/a | unrideable (blown, stormy) | n/a | No storm proximity feature existed. Tropical Storm Boris (~270 km southeast of Saladita) created chaotic onshore conditions. Standard weather apps flagged the storm; the engine had no National Hurricane Center advisory feed at all. This is a feature gap: not a miscalibrated threshold, but a completely missing input. No engine snapshot exists for this date; the miss was reported by the user and confirmed by the storm track record. | Fixed 2026-06-08
The two large misses have distinct failure modes. The 6/2 miss is a calibration failure: a rule existed but the threshold was wrong. The 6/8 miss is a feature gap: no rule existed at all. Both are more tractable than "the model is fundamentally wrong" — they have specific, addressable causes, and both have been patched.
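
The 6/2 calibration failure comes down to simple arithmetic on the score adjustments quoted above. The sketch below uses only those three adjustments; the function name is hypothetical and the engine's real scoring has other terms that are not shown here.

```python
def wave_power_penalty(power_kw_per_m: float, threshold: float, penalty: float) -> float:
    """Return the (negative) penalty once wave power reaches the threshold."""
    return penalty if power_kw_per_m >= threshold else 0.0


DIRECTION_BONUS = 3.0    # +3 for 200°, as quoted in the table
PERIOD_BONUS = 1.5       # +1.5 for 14s+, as quoted in the table
POWER_ON_JUNE_2 = 30.0   # kW/m, approximate figure from the table

# Old rule: penalty of -2 at >= 25 kW/m.
net_before = DIRECTION_BONUS + PERIOD_BONUS + wave_power_penalty(POWER_ON_JUNE_2, 25.0, -2.0)

# Tightened rule: penalty of -3 at >= 20 kW/m.
net_after = DIRECTION_BONUS + PERIOD_BONUS + wave_power_penalty(POWER_ON_JUNE_2, 20.0, -3.0)

# A hypothetical 22 kW/m day escaped the old rule but is caught by the new one.
net_22_before = DIRECTION_BONUS + PERIOD_BONUS + wave_power_penalty(22.0, 25.0, -2.0)
net_22_after = DIRECTION_BONUS + PERIOD_BONUS + wave_power_penalty(22.0, 20.0, -3.0)
```

On these three terms alone, the net adjustment for June 2 drops from +2.5 to +1.5, and a 22 kW/m day drops from +4.5 to +1.5. The net stays positive, which is consistent with the "partially fixed" status in the table.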

What the accuracy log needs to work.

Three data gaps: (1) ordinary good days go unlogged; (2) only one matched engine snapshot exists (2026-05-31); (3) all 8 observations fall in a single 9-day swell window. Fix: a scheduled daily snapshot that saves date, quality, wave height, period variability, direction, and wave power. At one snapshot per day, that yields 20 observations in about 3 weeks and 365 per year if sustained.
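
One possible shape for that daily snapshot is a JSON Lines record appended once per day. Everything in this sketch is an assumption for illustration: the class name, field names, units, and the idea of a separate snapshot file are not taken from the engine's code.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class EngineSnapshot:
    # Hypothetical field names and units; the page only lists what to save.
    date: str                   # ISO date string
    quality: int                # engine rating, 1-10
    wave_height_m: float        # wave height
    period_std_s: float         # period variability
    direction_deg: float        # swell direction
    wave_power_kw_per_m: float  # wave power


def snapshot_to_line(snapshot: EngineSnapshot) -> str:
    """Serialize one snapshot as a single JSON Lines record."""
    return json.dumps(asdict(snapshot), sort_keys=True)


def append_snapshot(snapshot: EngineSnapshot, path: Path) -> None:
    """Append the day's record so ordinary days are logged alongside notable ones."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(snapshot_to_line(snapshot) + "\n")
```

Appending one line per day keeps the file easy to join against human reports by date, so any later report can be scored against the engine's call for that day.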

Analysis | Current status | Unlock condition
Brier score with tight CI | CI spans 0.16–0.90, uninformative | n≥20 scoreable pairs
Calibration plot (per-bin P) | n=1 per bin, no statistical value | n≥5 per bin (≈n=15 total with diverse quality)
Season-of-year decomposition | 9-day window, no seasonal signal | n≥50, spanning ≥3 calendar months
Per-rule ablation study | No counterfactual snapshots available | Automated daily snapshots + n≥30
Calibrated P(surfable) per bucket | Fixed midpoints only | n≥5 per quality bin
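
The unlock conditions in the table can be written as a simple check against the current counts. The function and argument names are hypothetical; the thresholds are the ones listed above.

```python
def unlocked_analyses(
    n_scoreable: int,
    min_per_bin: int,
    months_spanned: int,
    daily_snapshots_running: bool,
) -> dict:
    """Map each analysis in the table to whether its unlock condition is met."""
    return {
        "Brier score with tight CI": n_scoreable >= 20,
        "Calibration plot (per-bin P)": min_per_bin >= 5,
        "Season-of-year decomposition": n_scoreable >= 50 and months_spanned >= 3,
        "Per-rule ablation study": daily_snapshots_running and n_scoreable >= 30,
        "Calibrated P(surfable) per bucket": min_per_bin >= 5,
    }


# Current state per this page: 3 scoreable pairs, 1 observation per bin,
# a 9-day window touching 2 calendar months, no automated snapshots yet.
current = unlocked_analyses(
    n_scoreable=3,
    min_per_bin=1,
    months_spanned=2,
    daily_snapshots_running=False,
)
```

With the current counts every entry is False, so none of the listed analyses is available yet.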

What this tells us so far.

Brier 0.624 at n=3 reads as: one large miss, cause identified (wave-power threshold), fix applied. The June 8 miss had a different cause (missing NHC proximity feature), also fixed. Both fixes would produce different calls on the same conditions today. Judging whether they are sufficient requires more observations. Until daily snapshots run automatically, this audit only describes days someone found memorable enough to report.

This page is updated each time scripts/analyze_engine_skill.py is rerun against the latest data/ground_truth/observations.jsonl. The underlying data artifact is at functions/api/_findings_engine_skill.js. Numbers will change as observations accumulate.