EuroJackpot PyLab

Coding the lottery. Keeping it human.

Contrarian & Forbidden Spacing Strategies for High-Cardinality Features

2025-12-27

Low-cardinality features are comfy. Odd count, small count, maybe a few grid buckets — neat little state spaces.

Then you look at something like sum or OSLCS and the state-space explodes. Lots of values, lots of room, lots of “uhh… where do I even start?”

This post is the first update pack for handling that kind of feature, using one clean idea:

Spacing + “who follows whom” + survivor sets.

And yes, we backtest it so we can see what it did on real history instead of vibes.


Responsible play note

If a method makes you feel sure the next draw “must” behave a certain way, that’s your cue to slow down. Models are fun. The lottery is still the lottery. Set limits, keep it light, and don’t chase.

Summary (the fast version)

We build three things on a single column:

1) Spacing stats (global)
For each value x in a column, measure how often it hits and how “late” it is compared to its own past gaps.

2) Following stats (conditional on the last value)
Take the last observed value in history, find every time it happened before, and record what came right after it. Those are the followers.

3) Survivors (non-followers)
Values that exist in history but have never followed the current last value.

Then we split survivors into two sets:

  • Strategy 2 — Contrarian corridor: take the best-looking survivors (high score).
    Think: “not the obvious followers… give me the strongest alternatives.”

  • Strategy 3 — Forbidden list: take the worst-looking survivors (low score).
    Think: “if I’m filtering, these are the zones I’d rather skip.”

It’s not a guarantee of anything. It’s a way to cut a huge space into “more interesting” vs “meh, probably not today.”


The main principles (what you should remember)

Principle A — “Late” must be measured against itself

A value can be “late” only relative to its own gap history.
So we compute a percentile score for the current waiting time inside its own past gaps.
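A minimal sketch of that percentile idea with scipy.stats.percentileofscore, which the engine below also uses (the gap numbers here are invented for illustration):

```python
from scipy import stats

# Hypothetical gap history for one value (toy numbers, not real draw data).
gaps = [3, 5, 8, 2, 7, 30]   # last entry = current waiting time (tail gap)
kath = gaps[-1]              # 30 draws since the last hit

# Where does the current wait sit inside this value's own gap history?
pct_score = stats.percentileofscore(gaps, kath)
print(pct_score)             # 100.0 -> "very late" for this particular value
```

A wait of 30 would be unremarkable for a rare value whose gaps are usually 25-40; it only reads as "late" because this value's own gaps are mostly single digits.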

Principle B — Frequency still matters

A value that hits once in a blue moon can look “late” all the time.
So we blend “lateness” with “how often it appears” so tiny samples don’t fool us too easily.

Principle C — Condition on the present (the last seen value)

If the column has memory (even weak), it tends to show up in “follower” patterns.
We measure that by tracking what historically followed the current last value.

Principle D — The survivor pool is the playground

We exclude followers, then work only with the remaining values.
Both strategies share that exact survivor pool — only the direction changes (top vs bottom).


1) Spacing engine on a single column

Pick one column, say:

  • col = "sum" (the total of the 5 main numbers)
  • or col = "OSLCS" (a packed state encoding Odd/Small/Lines/Cols)

For each distinct value x in that column:

  • collect the indices where x appears: i1, i2, ..., ik
  • build the gap list:
      • head gap: i1
      • internal gaps: i2-i1, i3-i2, ...
      • tail gap: T - ik (waiting time since last hit, where T is the history length)
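If it helps to see it concretely, here is the same construction on a toy sequence (letters instead of real feature values):

```python
def build_gaps(values, x):
    """Head gap + internal gaps + tail gap for one value x in a sequence."""
    idxs = [i for i, v in enumerate(values) if v == x]
    T = len(values)
    return [idxs[0]] + [idxs[i] - idxs[i - 1] for i in range(1, len(idxs))] + [T - idxs[-1]]

seq = ["A", "B", "A", "C", "A", "B"]   # T = 6; "A" appears at indices 0, 2, 4
print(build_gaps(seq, "A"))            # [0, 2, 2, 2] -> head, internal, internal, tail
print(build_gaps(seq, "B"))            # [1, 4, 1]    -> "B" is currently 1 draw "late"
```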

From gaps we compute:

  • emf: hit count (how many times it appeared)
  • kath: tail gap (current waiting time)
  • percentiles of gaps (median, P75, P90, P95, P99, max)
  • Pct_score: percentile rank of kath inside that value’s gap list
  • Norm: normalized frequency share
  • Prod: Norm * Pct_score (a simple blended score)
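The blend itself is just multiplication; a toy comparison with invented Norm and Pct_score values shows why it tames rare-value noise:

```python
# Two hypothetical values (Norm = frequency share in %, Pct_score = lateness 0..100):
prod_a = 2.5 * 70    # appears often, fairly late   -> 175.0
prod_b = 0.25 * 99   # rare, but "extremely late"   -> 24.75

# The frequent-and-late value wins: rarity alone can't buy a high Prod.
print(prod_a > prod_b)   # True
```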

2) Following stats: “who tends to come after whom?”

Now take the last value in your history window:

  • last_val = df[col].iloc[-1]

Find every historical index where df[col] == last_val, and record the next value:

  • followers are: df[col].iloc[k+1] for each such k (when k+1 exists)

Then compute spacing stats again, but only on these follower values.

That gives a “conditional” view:
given today’s last state, which states have historically tended to come next?
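A minimal sketch of that follower extraction, mirroring the df[col].iloc[k+1] idea above (the numbers are invented):

```python
import pandas as pd

# Toy history: the column's last value is 140.
df = pd.DataFrame({"sum": [140, 155, 140, 162, 171, 140]})
col = "sum"

values = df[col].tolist()
last_val = values[-1]    # 140

# Every time 140 appeared before, what came right after it?
followers = [values[k + 1] for k in range(len(values) - 1) if values[k] == last_val]
print(followers)         # [155, 162]
```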


3) Survivor pool: values that never followed the last value

Now the key move:

  • stats_all = spacing stats for the whole column
  • follow_stats = spacing stats for followers of the current last value
  • survivors = all values minus follower values

These survivors are “real” values (they happened in history), yet they have never appeared as immediate followers of the current last value.

That survivor pool is the shared input for Strategy 2 and Strategy 3.
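In code terms the split is a plain set difference (toy values, not real stats):

```python
# Everything seen in history vs. the values that have followed last_val before.
all_values = {120, 135, 140, 148, 155, 162, 171}
follower_values = {135, 155, 162}

# Survivors: real historical values that never followed the current last value.
survivors = sorted(all_values - follower_values)
print(survivors)   # [120, 140, 148, 171]
```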


Strategy 2 — Contrarian top survivors (the “aim here” corridor)

Goal: choose a corridor of survivor values that score highest by Prod.

You can read it like this:

“I don’t want the typical followers of the last value.
Among the non-followers, which values look active and late?”

Backtest hit definition: Strategy 2 is a hit at time t if the true value at t is inside the candidate corridor.


Strategy 3 — Forbidden tail survivors (the “avoid these” list)

Same survivor pool, flipped:

Goal: pick the lowest scoring survivor values as a “do-not-like” list.

Backtest success definition: Strategy 3 is a success at time t if the true value at t is not inside the forbidden list.

This is basically a filter mindset:

“If I’m going to avoid anything, let it be the weakest survivor zones.”
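Concretely, both backtest definitions reduce to a membership test; a sketch with invented values:

```python
# One backtest step in isolation:
target = 148                   # the true value at time t
cand2 = [120, 148, 171]        # hypothetical Strategy 2 corridor
forb3 = [95, 96, 229]          # hypothetical Strategy 3 forbidden list

hit2 = int(target in cand2)          # 1 -> the corridor caught the true value
success3 = int(target not in forb3)  # 1 -> the true value stayed out of the forbidden zone
print(hit2, success3)                # 1 1
```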


Code (core engine)

Notes:

  • Works on a single column.
  • Uses the same scoring style for both global stats and follower stats.
  • Keep it simple first; fancy tweaks can come later.

import numpy as np
import pandas as pd
from scipy import stats


def calculate_statistics_for_column(df: pd.DataFrame, col: str) -> pd.DataFrame:
    values = df[col].tolist()
    unique_vals = sorted(set(values))

    rows = []
    T = len(values)

    # build index lists
    idx_map = {v: [] for v in unique_vals}
    for i, v in enumerate(values):
        idx_map[v].append(i)

    for v, idxs in idx_map.items():
        # gaps: head + internal + tail
        gaps = [idxs[0]] + [idxs[i] - idxs[i - 1] for i in range(1, len(idxs))] + [T - idxs[-1]]

        emf = len(gaps) - 1                                   # hit count (appearances of v)
        kath = gaps[-1]                                       # tail gap: current waiting time
        pct_score = int(stats.percentileofscore(gaps, kath))  # lateness vs. own gap history

        rows.append({
            "value": v,
            "emf": emf,
            "kath": kath,
            "median": int(np.percentile(gaps, 50)),
            "P75": int(np.percentile(gaps, 75)),
            "P90": int(np.percentile(gaps, 90)),
            "P95": int(np.percentile(gaps, 95)),
            "P99": int(np.percentile(gaps, 99)),
            "max": int(np.max(gaps)),
            "Pct_score": pct_score,
        })

    out = pd.DataFrame(rows)
    out["Norm"] = (100 * out["emf"] / out["emf"].sum()).round(2)
    out["Prod"] = (out["Norm"] * out["Pct_score"]).round(2)

    return out.sort_values("Prod", ascending=False).reset_index(drop=True)


def calculate_following_statistics_for_column(df: pd.DataFrame, col: str) -> pd.DataFrame:
    values = df[col].tolist()
    T = len(values)
    if T < 2:
        return pd.DataFrame(columns=["value"])

    last_val = values[-1]
    follower_vals = []

    for i in range(T - 1):
        if values[i] == last_val:
            follower_vals.append(values[i + 1])

    if not follower_vals:
        return pd.DataFrame(columns=["value"])

    # Build a mini-df for follower sequence so we can reuse the same engine style
    tmp = pd.DataFrame({col: follower_vals})
    return calculate_statistics_for_column(tmp, col)


def survivor_stats_excluding_followers(df: pd.DataFrame, col: str) -> pd.DataFrame:
    stats_all = calculate_statistics_for_column(df, col)
    follow_stats = calculate_following_statistics_for_column(df, col)

    if follow_stats.empty:
        return stats_all

    followers = set(follow_stats["value"].tolist())
    survivors = stats_all[~stats_all["value"].isin(followers)].reset_index(drop=True)
    return survivors


def candidates_strategy2_top_survivors(df: pd.DataFrame, col: str, max_candidates: int) -> list:
    survivors = survivor_stats_excluding_followers(df, col)
    return survivors.sort_values("Prod", ascending=False)["value"].head(max_candidates).tolist()


def candidates_strategy3_forbidden_tail(df: pd.DataFrame, col: str, max_candidates: int) -> list:
    survivors = survivor_stats_excluding_followers(df, col)
    return survivors.sort_values("Prod", ascending=True)["value"].head(max_candidates).tolist()


def backtest_strategies_2_and_3(
    df: pd.DataFrame,
    col: str,
    warmup: int = 300,
    max_candidates: int = 50,
    verbose_every: int = 200,
) -> pd.DataFrame:
    n = len(df)
    rows = []

    for t in range(warmup, n):
        hist = df.iloc[:t].copy()
        target = df[col].iloc[t]

        cand2 = candidates_strategy2_top_survivors(hist, col, max_candidates)
        forb3 = candidates_strategy3_forbidden_tail(hist, col, max_candidates)

        hit2 = int(target in cand2)
        in_forbidden = int(target in forb3)
        success3 = int(target not in forb3)

        rows.append({
            "t": t,
            "target": target,
            "hit_strategy2": hit2,
            "success_strategy3": success3,
            "in_forbidden_strategy3": in_forbidden,
        })

        if verbose_every and (t - warmup) % verbose_every == 0:
            print(f"[t={t}] {col}={target} | hit2={hit2}, success3={success3} | cand2[:5]={cand2[:5]} forb3[:5]={forb3[:5]}")

    res = pd.DataFrame(rows)
    if len(res) > 0:
        print("\n=== Backtest summary ===")
        print(f"Rows used               : {len(res)} (t={warmup}..{n-1})")
        print(f"Strategy 2 hit rate     : {res['hit_strategy2'].mean():.4f}")
        print(f"Strategy 3 success rate : {res['success_strategy3'].mean():.4f}")
        print(f"In-forbidden rate       : {res['in_forbidden_strategy3'].mean():.4f}")

    return res

Backtest results (as recorded in the chapter)

OSLCS (warmup=300, max_candidates=50)

  • Strategy 2 hit rate: 0.3875 (~38.75%)
  • Strategy 3 success rate: 0.9343 (~93.43%)
  • In-forbidden rate: 0.0657 (~6.57%)

Interpretation:

  • Strategy 2 gives a 50-state corridor that caught the next true OSLCS almost 4 times out of 10 in history.
  • Strategy 3 gives a 50-state “avoid zone” that the next true OSLCS landed in only about 1 time out of 15.

sum (warmup=300, max_candidates=30)

  • Strategy 2 hit rate: 0.3087 (~30.87%)
  • Strategy 3 success rate: 0.8900 (~89.00%)
  • In-forbidden rate: 0.1100 (~11.00%)

Interpretation:

  • Even though sum spans a wide range, a 30-value corridor still caught the true sum about 3 times out of 10, so it behaves like a real filter.
  • The forbidden list behaves like a “danger zone” you can choose to skip.

Keep your feet on the ground: these are historical backtests, not a contract with the future.


How to use this in your wider playbook

Two simple patterns:

1) Aim filter (corridor)

Keep combinations whose sum (or OSLCS) lies inside Strategy 2 candidates.

2) Avoid filter (forbidden)

Drop combinations whose sum (or OSLCS) lies inside Strategy 3 forbidden values.

That’s it. One tool says “focus here”, the other says “skip these zones”.
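Both patterns are one-line list filters; a sketch with invented ticket sums and filter sets:

```python
# Hypothetical candidate ticket sums plus the two filter sets (all values invented).
ticket_sums = [101, 148, 166, 203]
corridor = {140, 148, 155, 162, 166}   # Strategy 2: "aim here"
forbidden = {95, 101, 228}             # Strategy 3: "skip these"

aimed = [s for s in ticket_sums if s in corridor]      # aim filter keeps corridor sums
kept = [s for s in ticket_sums if s not in forbidden]  # avoid filter drops forbidden sums
print(aimed)  # [148, 166]
print(kept)   # [148, 166, 203]
```

The same pattern works for OSLCS: compute the feature per candidate combination, then test membership against the Strategy 2 or Strategy 3 value set.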


Download the full chapter (DOCX)

Download: update_chapter_contrarian_forbidden_strategies_v2.docx