WALK-FORWARD BACKTEST · HISTORICAL SCORECARD

Prediction Accuracy Report

For every act, the model is retrained from scratch using only data available at that point, then asked to predict that act. No future information leaks into past evaluation — only predictions the model could realistically have made in real time are scored.
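The retraining scheme described above can be sketched as a fold generator. This is a minimal illustration, not the report's actual pipeline: the function name `walk_forward_folds` and the act `E6A1` (standing in for whatever history precedes the first scored act) are assumptions.

```python
from typing import Iterator


def walk_forward_folds(acts: list[str]) -> Iterator[tuple[list[str], str]]:
    """Yield (training_acts, test_act) pairs in chronological order."""
    # Start at index 1 so every fold has at least one earlier act to train on.
    for t in range(1, len(acts)):
        # Train strictly on acts before index t, then predict act t.
        yield acts[:t], acts[t]


# Assumed ordering for illustration; "E6A1" stands in for the history
# that must exist before the first scored act, E6A2.
acts = ["E6A1", "E6A2", "E6A3", "E7A1"]
folds = list(walk_forward_folds(acts))
for train_acts, test_act in folds:
    print(f"train on {train_acts} -> predict {test_act}")
```

Because the training slice always ends before the test act, a later patch outcome can never reach an earlier fold.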

TL;DR

Direction hit rate

61%

Share of 453 predictions where nerf/buff/stable direction was correct.

Random baseline 33% · Always-stable baseline 55%. Lift: +28pp / +6pp.
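The two baselines and the lift figures follow from the class counts reported further down (stable 249, buff 101, nerf 103). A short sketch of the arithmetic, assuming the random baseline is a uniform guess over the three directions:

```python
# Class counts as reported in the per-class section of this report.
class_counts = {"stable": 249, "buff": 101, "nerf": 103}
total = sum(class_counts.values())  # 453 predictions

hit_rate = 0.61                                  # headline direction hit rate
random_baseline = 1 / len(class_counts)          # uniform guess over 3 classes
always_stable = class_counts["stable"] / total   # always predict the majority class

# Lift in percentage points over each baseline.
lift_random_pp = round((hit_rate - random_baseline) * 100)
lift_stable_pp = round((hit_rate - always_stable) * 100)
print(f"{random_baseline:.0%} {always_stable:.0%} +{lift_random_pp}pp +{lift_stable_pp}pp")
```

The always-stable baseline is the harder one to beat, which is why the lift over it is much smaller.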

High-conf nerf precision

51%

Of predictions made at p_nerf ≥ 0.60 (n=43), the share that turned into actual nerfs.

Evaluation coverage

18 acts

E6A2 → V26A1 · 453 predictions total.

453 prediction samples · Range: E6A2 → V26A1 · 18 act folds · Method: walk-forward

OPERATOR NOTE · Read this before staring at the numbers.

The model is strongest on Stable calls: F1 0.70, precision 67%, recall 74%.
Nerf calls are conservative: only 35% of real nerfs are caught in advance and the other 65% slip through. When the model does call a nerf, it is right 46% of the time (precision).
Confidence does carry signal: nerf precision rises with the threshold, reaching 60% for predictions made at p_nerf ≥ 0.70, although that bucket is small (n=15).

Glossary: Precision = "of predictions the model called nerf, share that were actual nerfs". Recall = "of actual nerfs, share that the model caught in advance". In practice you can't push both to 100% at once; it's a balancing act.
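The glossary definitions can be checked against the nerf column and row of the confusion matrix in this report: 36 correct nerf calls, 37 + 6 false alarms, and 49 + 18 missed nerfs. A minimal sketch (the helper name `precision_recall_f1` is illustrative):

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall and F1 for one class from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Nerf counts taken from the confusion matrix in this report.
p, r, f1 = precision_recall_f1(tp=36, fp=37 + 6, fn=49 + 18)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Calling nerf more often would raise recall by adding true positives, but it would also add false alarms, which is the trade-off the glossary describes.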

Overall metrics · Per-class performance

Direction hit rate

61%

Across 453 predictions.

Balanced accuracy

0.541

Class-imbalance corrected.
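Balanced accuracy is conventionally the unweighted mean of per-class recall, so the large stable class cannot dominate the score. Assuming that is the definition used here, the figure can be reproduced from the confusion matrix in this report:

```python
# Rows are actual classes, columns are predicted classes,
# copied from the confusion matrix in this report.
confusion = {
    "stable": {"stable": 184, "buff": 28, "nerf": 37},
    "buff":   {"stable": 41,  "buff": 54, "nerf": 6},
    "nerf":   {"stable": 49,  "buff": 18, "nerf": 36},
}

# Recall per class = correct calls / all actual cases of that class.
recalls = {actual: row[actual] / sum(row.values()) for actual, row in confusion.items()}

# Each class contributes equally, regardless of how many samples it has.
balanced_accuracy = sum(recalls.values()) / len(recalls)
print(round(balanced_accuracy, 3))
```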

5-class hit rate

45%

Mild/strong intensity also correct.

Top-3 nerf / act

50%

Actual nerfs among top-3 nerf picks per act.
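For a single act, the top-3 metric can be sketched as: rank agents by p_nerf, keep the three highest, and count how many were actually nerfed. The rows below are the six highest-p_nerf E6A2 rows from the raw prediction table in this report; how the per-act results are aggregated into the 50% figure is not stated, so this sketch covers one act only.

```python
def top_k_nerf_hits(rows: list[tuple[str, str, float]], k: int = 3) -> tuple[int, int]:
    """Return (actual nerfs among the k highest p_nerf picks, picks considered)."""
    ranked = sorted(rows, key=lambda row: row[2], reverse=True)[:k]
    hits = sum(1 for _agent, actual, _p in ranked if actual.endswith("nerf"))
    return hits, len(ranked)


# (agent, actual outcome, p_nerf in %) for E6A2, from the raw rows table.
e6a2 = [
    ("Killjoy", "mild nerf", 66.2),
    ("Neon", "stable", 65.8),
    ("Raze", "stable", 50.0),
    ("KAYO", "mild nerf", 44.0),
    ("Omen", "stable", 41.3),
    ("Brimstone", "mild nerf", 36.5),
]
hits, picks = top_k_nerf_hits(e6a2)
print(f"{hits}/{picks} top picks were actual nerfs")
```

In E6A2 only Killjoy among the top three picks was nerfed, so this act scores below the overall figure.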

Class	n	Precision	Recall	F1
stable	249	0.67	0.74	0.70
buff	101	0.54	0.54	0.54
nerf	103	0.46	0.35	0.40

Per-agent scoreboard · Best hits and biggest misses

Cumulative hit rate per agent across all evaluated acts. Only agents with at least 3 predictions are listed.

Top hits

Agent	Hits	Hit rate
Harbor	18/18	100%
Phoenix	17/18	94%
Jett	15/18	83%
Reyna	14/18	78%
Breach	13/18	72%

Top misses

Agent	Hits	Hit rate
Omen	5/18	28%
Astra	6/18	33%
Miks	1/3	33%
Veto	6/14	43%
Neon	8/18	44%

Confusion matrix · Predicted vs actual

Actual \ Predicted	stable	buff	nerf
stable	184	28	37
buff	41	54	6
nerf	49	18	36

Diagonal cells are correct direction calls; off-diagonal cells are misses.
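Beyond the diagonal, the matrix shows where the misses concentrate. A small sketch that sums the diagonal and ranks the off-diagonal cells by size (variable names are illustrative):

```python
LABELS = ["stable", "buff", "nerf"]
# Rows = actual, columns = predicted, as in the matrix above.
MATRIX = [
    [184, 28, 37],
    [41, 54, 6],
    [49, 18, 36],
]

total = sum(sum(row) for row in MATRIX)
correct = sum(MATRIX[i][i] for i in range(len(LABELS)))

# Collect every off-diagonal cell as (count, actual, predicted), largest first.
errors = sorted(
    (
        (MATRIX[i][j], LABELS[i], LABELS[j])
        for i in range(len(LABELS))
        for j in range(len(LABELS))
        if i != j
    ),
    reverse=True,
)
print(f"{correct}/{total} correct")
for count, actual, predicted in errors[:3]:
    print(f"{count:>3} actual {actual} predicted as {predicted}")
```

The two largest error cells are real nerfs and real buffs that the model called stable, which matches the operator note about conservative nerf calls.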

Confidence calibration · Does higher probability mean higher hit rate?

When the model assigns a higher probability, the real-world hit rate should rise too. Each row shows the share of predictions at or above that threshold that matched reality.

Nerf predictions

Threshold	n	Precision
≥ 0.30	134	42%
≥ 0.40	95	44%
≥ 0.50	65	48%
≥ 0.60	43	51%
≥ 0.70	15	60%
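Each row of the nerf table is a cumulative cut: it keeps every prediction at or above the threshold, so a row at 0.30 also contains the rows counted at 0.70. A sketch of that computation on the ten highest-p_nerf E6A2 rows from the raw prediction table (a small illustrative subset, not the full 453 rows):

```python
from typing import Optional


def precision_at(rows: list[tuple[float, bool]], threshold_pct: float) -> tuple[int, Optional[float]]:
    """Return (n, precision) for predictions with p_nerf >= threshold_pct."""
    selected = [is_nerf for p_nerf, is_nerf in rows if p_nerf >= threshold_pct]
    if not selected:
        return 0, None  # no predictions reach this threshold
    return len(selected), sum(selected) / len(selected)


# (p_nerf in %, actual outcome was a nerf) for E6A2, from the raw rows table.
e6a2 = [
    (66.2, True), (65.8, False), (50.0, False), (44.0, True), (41.3, False),
    (36.5, True), (33.5, False), (31.9, False), (28.3, True), (28.2, True),
]
table = {t: precision_at(e6a2, t) for t in (30, 40, 50, 60, 70)}
for threshold, (n, precision) in table.items():
    shown = "n/a" if precision is None else f"{precision:.0%}"
    print(f">= {threshold}%  n={n}  precision={shown}")
```

With so few rows the precision is noisy, which is also why the n=15 bucket in the full table deserves caution.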

Buff predictions

Threshold	n	Precision
≥ 0.15	260	34%
≥ 0.20	226	37%
≥ 0.25	193	39%
≥ 0.35	144	46%
≥ 0.50	86	56%

Lead predictions · Caught one act ahead of the patch

The model raised a nerf signal before any nerf had landed, and the next act confirmed it.

Fade · p_nerf 75.0%

V25A6 stable → V26A1 mild nerf

At V25A6 the agent was still untouched, but the model already saw the nerf coming — confirmed one act later at V26A1.

Fade · p_nerf 72.6%

V25A4 stable → V25A5 mild nerf

At V25A4 the agent was still untouched, but the model already saw the nerf coming — confirmed one act later at V25A5.

Sova · p_nerf 68.3%

E7A2 stable → E7A3 strong nerf

At E7A2 the agent was still untouched, but the model already saw the nerf coming — confirmed one act later at E7A3.

Omen · p_nerf 68.0%

V25A6 mild buff → V26A1 mild nerf

At V25A6 the agent had just received a mild buff and no nerf had landed, but the model already saw the nerf coming; it was confirmed one act later at V26A1.

Sova · p_nerf 67.5%

V25A4 stable → V25A5 strong nerf

At V25A4 the agent was still untouched, but the model already saw the nerf coming — confirmed one act later at V25A5.
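One way to select entries like the ones above is to require three things: the earlier act was not a nerf, the following act was a nerf, and p_nerf cleared some cutoff. The report does not state its selection rule, so the 0.5 cutoff, the `Transition` record and the `Example` row are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Transition:
    agent: str
    prev_actual: str  # outcome in the act the signal was raised from
    next_actual: str  # outcome in the following act
    p_nerf: float     # model's nerf probability for the following act


def is_lead_prediction(t: Transition, threshold: float = 0.5) -> bool:
    """True when a nerf signal preceded the first nerf and was then confirmed."""
    not_yet_nerfed = not t.prev_actual.endswith("nerf")
    nerf_confirmed = t.next_actual.endswith("nerf")
    return not_yet_nerfed and nerf_confirmed and t.p_nerf >= threshold


transitions = [
    Transition("Fade", "stable", "mild nerf", 0.750),     # V25A6 -> V26A1, from this report
    Transition("Omen", "mild buff", "mild nerf", 0.680),  # V25A6 -> V26A1, from this report
    Transition("Example", "stable", "stable", 0.700),     # hypothetical false alarm
]
leads = [t.agent for t in transitions if is_lead_prediction(t)]
print(leads)
```

Under this rule a high p_nerf followed by a stable act is excluded; those cases belong with the notable misses instead.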

Notable hits · High-confidence predictions that landed

Cases where the model fired a strong probability and reality went the same direction.

Agent	Act	Predicted	Actual	Probability
Viper	V25A4	strong nerf	mild nerf	p_nerf 84%
Omen	V26A1	strong nerf	mild nerf	p_nerf 80%
Viper	E8A1	strong nerf	mild nerf	p_nerf 80%
Tejo	V25A4	strong buff	mild buff	p_buff 79%
Sova	V26A1	strong nerf	mild nerf	p_nerf 78%
Yoru	E6A3	strong buff	mild buff	p_buff 78%

Notable misses · High-confidence predictions that didn't

Cases where the model fired a strong probability and reality went a different way, either staying stable or moving in the opposite direction.

Agent	Act	Predicted	Actual	Probability
Fade	E8A1	strong buff	stable	p_nerf (not shown)
Astra	E7A3	strong buff	strong nerf	p_nerf (not shown)
Astra	V26A1	strong nerf	stable	p_nerf 83%
Viper	V26A1	strong nerf	stable	p_nerf 82%
Astra	V25A4	strong buff	stable	p_nerf (not shown)
Tejo	V25A3	strong buff	mild nerf	p_nerf (not shown)

Per-act trend · Hit rate over time

As more acts accumulate, training data grows. The chart below checks whether hit rate stabilizes over time — a sanity check against early overfitting.

Chart legend: Direction hit rate · 5-class hit rate · Avg 60%

Act	Direction hit rate	5-class hit rate (5c)
E6A2	52%	29%
E6A3	48%	33%
E7A1	64%	41%
E7A2	58%	38%
E7A3	58%	38%
E8A1	67%	38%
E8A2	60%	40%
E8A3	56%	40%
E9A1	56%	40%
E9A2	64%	52%
E9A3	73%	54%
V25A1	50%	46%
V25A2	74%	67%
V25A3	70%	59%
V25A4	63%	52%
V25A5	39%	29%
V25A6	61%	54%
V26A1	71%	46%

Dashed line = overall average (60%) · 5c = hit rate including mild/strong intensity.
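The 60% average can be reproduced as the unweighted mean of the 18 per-act direction hit rates listed above. Acts contain different numbers of predictions, so this mean does not have to equal the pooled rate over all 453 predictions. A short sketch:

```python
from statistics import mean

# Direction hit rate (%) per act, as listed in this report.
per_act = {
    "E6A2": 52, "E6A3": 48, "E7A1": 64, "E7A2": 58, "E7A3": 58, "E8A1": 67,
    "E8A2": 60, "E8A3": 56, "E9A1": 56, "E9A2": 64, "E9A3": 73, "V25A1": 50,
    "V25A2": 74, "V25A3": 70, "V25A4": 63, "V25A5": 39, "V25A6": 61, "V26A1": 71,
}

average = mean(per_act.values())          # each act weighted equally
worst = min(per_act, key=per_act.get)     # act with the lowest hit rate
best = max(per_act, key=per_act.get)      # act with the highest hit rate
print(f"avg {average:.1f}% · worst {worst} {per_act[worst]}% · best {best} {per_act[best]}%")
```

V25A5 stands out as the one act far below the average, so any claim that the hit rate has stabilized should be read with that outlier in mind.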

All predictions · 453 raw rows


Act	Agent	Actual	Predicted	p_stable	p_buff	p_nerf	Hit
E6A2	Killjoy	mild nerf	strong nerf	30.9	2.9	66.2	✓
E6A2	Neon	stable	strong nerf	24.9	9.3	65.8	✗
E6A2	Raze	stable	strong nerf	45.9	4.1	50.0	✗
E6A2	KAYO	mild nerf	strong nerf	40.9	15.1	44.0	✓
E6A2	Omen	stable	strong nerf	30.8	27.9	41.3	✗
E6A2	Brimstone	mild nerf	stable	56.0	7.4	36.5	✗
E6A2	Breach	stable	stable	55.2	11.3	33.5	✓
E6A2	Reyna	stable	stable	62.7	5.4	31.9	✓
E6A2	Fade	mild nerf	stable	43.9	27.8	28.3	✗
E6A2	Gekko	mild nerf	stable	71.2	0.6	28.2	✗
E6A2	Harbor	stable	stable	73.9	0.5	25.7	✓
E6A2	Astra	mild buff	strong buff	22.1	56.7	21.3	✓
E6A2	Phoenix	stable	stable	64.5	14.5	20.9	✓
E6A2	Jett	mild buff	strong buff	25.1	55.4	19.5	✓
E6A2	Viper	mild nerf	strong buff	32.4	49.1	18.5	✗
E6A2	Sage	stable	stable	64.5	18.7	16.8	✓
E6A2	Skye	mild buff	stable	61.8	21.4	16.8	✗
E6A2	Cypher	mild buff	stable	57.9	26.0	16.1	✗
E6A2	Sova	stable	stable	55.0	34.0	10.9	✓
E6A2	Yoru	mild buff	strong buff	16.6	77.2	6.1	✓
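The Hit column scores direction only: "strong nerf" against "mild nerf" counts as a hit, while the 5-class hit rate also requires the intensity to match. A sketch of both checks on the 20 E6A2 rows shown above, assuming direction is simply the last word of the label:

```python
def direction(label: str) -> str:
    """Collapse a 5-class label to its direction: 'mild nerf' -> 'nerf'."""
    return label.split()[-1]


# (actual, predicted) for the 20 E6A2 rows shown above, in table order.
rows = [
    ("mild nerf", "strong nerf"), ("stable", "strong nerf"), ("stable", "strong nerf"),
    ("mild nerf", "strong nerf"), ("stable", "strong nerf"), ("mild nerf", "stable"),
    ("stable", "stable"), ("stable", "stable"), ("mild nerf", "stable"),
    ("mild nerf", "stable"), ("stable", "stable"), ("mild buff", "strong buff"),
    ("stable", "stable"), ("mild buff", "strong buff"), ("mild nerf", "strong buff"),
    ("stable", "stable"), ("mild buff", "stable"), ("mild buff", "stable"),
    ("stable", "stable"), ("mild buff", "strong buff"),
]

direction_hits = sum(direction(a) == direction(p) for a, p in rows)
exact_hits = sum(a == p for a, p in rows)
print(f"direction {direction_hits}/{len(rows)} · 5-class {exact_hits}/{len(rows)}")
```

On these rows every exact match is a stable call; the model got the direction of several nerfs and buffs right but consistently predicted the strong variant where the actual change was mild.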

Methodology

▸ Walk-forward: Each fold trains on act_idx < T and predicts act_idx == T. Future information never leaks into past evaluation.

▸ Two-stage model: Stage A (XGBoost) classifies "touched next patch" vs. "stable". Stage B (Logistic Regression) splits touched into nerf vs. buff. Final output: 5 classes (strong/mild nerf · stable · mild/strong buff).

▸ Ground truth: Labels come from actual post-patch nerf/buff history, including mid-patch hotfixes and reworks.

▸ Evaluation scope: Only acts with confirmed patch outcomes are evaluated (current in-flight act V26A2 excluded).

▸ Generated at: 2026-04-25 05:09:17 (UTC)
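The report does not say how the two stages are combined into p_stable, p_buff and p_nerf. One natural reading, shown here purely as an assumption, is to multiply the Stage A "touched" probability by the Stage B split; the function name and the example inputs are hypothetical, and the mapping to mild/strong intensity is left out because it is not described.

```python
def combine_stages(p_touched: float, p_nerf_given_touched: float) -> dict[str, float]:
    """Combine two stage outputs into three direction probabilities.

    p_touched: Stage A probability that the agent is changed next patch.
    p_nerf_given_touched: Stage B probability of nerf, given a change.
    """
    return {
        "stable": 1.0 - p_touched,
        "buff": p_touched * (1.0 - p_nerf_given_touched),
        "nerf": p_touched * p_nerf_given_touched,
    }


# Hypothetical stage outputs, not taken from the report.
probs = combine_stages(p_touched=0.7, p_nerf_given_touched=0.9)
predicted = max(probs, key=probs.get)
print({name: round(p, 2) for name, p in probs.items()}, predicted)
```

Under this reading the three probabilities always sum to 1, which is consistent with the p_stable, p_buff and p_nerf columns in the raw rows adding up to roughly 100.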