Blog · June 12, 2026

In practice: selecting strategies on synthetic histories

Three ways to pick the best trading strategy out of 34: a backtest, a bootstrap, and a generative model. Two of them picked a strategy that memorized its own training data. Only one didn't.

Can a generative model pick trading strategies better than a backtest or a bootstrap? A use-case test with sablier-flow · June 2026.

We have been building Sablier-Flow, a generative model for multivariate financial time series. It learns the joint dynamics of a panel of assets and produces synthetic price histories that never happened but keep the statistical character of the real data: fat tails, volatility clustering, heavy-tail dependence (13 of 14 stylized facts on our public FinBench benchmark).

This post highlights one of the use cases we built it for: telling real trading edges from overfit ones, before committing capital.

Every quant runs into the same problem. You design a strategy, the backtest looks great, you ship it, and it loses money. You build a more careful one, run it through a bootstrap to check that it is robust, and it still loses in production. Separating the two kinds of edge before going live is the hardest job in this field.

So we built 34 trading strategies and asked three selectors to pick the best of them.

The strategies. 17 are honest, fixed-in-advance rules (trend filters, untuned VIX filters, momentum at different lookbacks, vol-targeting, inverse-vol, a rate-trend bond filter). The other 17 are deliberately fabricated: we grid-searched thousands of parameter combinations (calendar rules, VIX thresholds, momentum lookbacks, pairs z-scores) and kept the in-sample winners, which is exactly how overfitting happens for real. One of the fabricated 17 is an outright cheater: a k-NN model that memorized 12 years of price patterns instead of learning anything. At each new bar it finds the closest historical match in its memory and copies whatever paid off in that match. We call it the plagiarist.
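To make the plagiarist concrete, here is a minimal sketch of a 1-nearest-neighbour memorizer. It is an illustration under our own assumptions (a 5-bar window, Euclidean distance, a long/short signal, and the hypothetical name `plagiarist_signal`), not the model used in the test.

```python
import numpy as np

def plagiarist_signal(train_returns, recent_window, lookback=5):
    """1-NN 'strategy': find the closest past window, copy what paid off next.

    lookback, the distance metric and the +1/-1 signal are illustrative
    assumptions, not details taken from the post.
    """
    train_returns = np.asarray(train_returns, dtype=float)
    recent_window = np.asarray(recent_window, dtype=float)
    # Memory: every lookback-length window that is followed by one more bar.
    n = len(train_returns) - lookback
    windows = np.stack([train_returns[i:i + lookback] for i in range(n)])
    next_returns = train_returns[lookback:lookback + n]
    # Closest historical match by Euclidean distance.
    best = int(np.argmin(np.linalg.norm(windows - recent_window, axis=1)))
    # Go long if the match was followed by a gain, short otherwise.
    return 1.0 if next_returns[best] > 0 else -1.0

# Replay the memorizer on the very data it memorized (pure noise here).
rng = np.random.default_rng(0)
train = rng.normal(0.0, 0.01, 300)
hits = [plagiarist_signal(train, train[i:i + 5]) == np.sign(train[i + 5])
        for i in range(len(train) - 5)]
in_sample_hit_rate = float(np.mean(hits))
print(in_sample_hit_rate)
```

Even though the toy series is noise with no edge in it, the in-sample hit rate is perfect, because every query window finds itself in memory at distance zero. That is the failure a selector has to catch.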

The selectors. Each ranks all 34 strategies and shortlists a top 5, using only pre-2022 information, with the same metric (excess Sharpe over each strategy's own benchmark).

selector | how it ranks
A · the backtest | by realised 2019–2021 Sharpe. What most desks actually do.
B · the block bootstrap | by median Sharpe across 1,000 stationary-bootstrap paths (21-day blocks) of the real data.
C · Sablier-Flow | by median Sharpe across 1,000 synthetic alternative histories from the model.

Each strategy is graded against its own buy-and-hold benchmark, the obvious passive comparison for the asset class it trades.
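The ranking step shared by selectors B and C can be sketched as follows. This is a hedged illustration: we assume "excess Sharpe" means the strategy's annualised Sharpe minus its benchmark's, computed on daily returns, and the function names and the toy strategies are hypothetical. The paths can come from any source (bootstrap resamples or generated histories).

```python
import numpy as np

def sharpe(returns, periods_per_year=252):
    """Annualised Sharpe ratio of a return series (risk-free rate taken as 0)."""
    returns = np.asarray(returns, dtype=float)
    sd = returns.std(ddof=1)
    return 0.0 if sd == 0 else float(np.sqrt(periods_per_year) * returns.mean() / sd)

def excess_sharpe(strategy_returns, benchmark_returns):
    """Assumed definition: Sharpe of the strategy minus Sharpe of its benchmark."""
    return sharpe(strategy_returns) - sharpe(benchmark_returns)

def rank_by_median_excess_sharpe(strategies, paths):
    """strategies: name -> (strategy_fn, benchmark_fn), each mapping a path to returns.
    Returns names sorted best-first, plus the median score per strategy."""
    scores = {}
    for name, (strategy_fn, benchmark_fn) in strategies.items():
        per_path = [excess_sharpe(strategy_fn(p), benchmark_fn(p)) for p in paths]
        scores[name] = float(np.median(per_path))
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores

# Toy usage: 50 upward-drifting paths, buy-and-hold as the benchmark for both entrants.
rng = np.random.default_rng(0)
paths = [rng.normal(0.002, 0.01, 250) for _ in range(50)]
strategies = {
    "flip": (lambda p: -p, lambda p: p),  # always short: should rank last
    "hold": (lambda p: p, lambda p: p),   # identical to its benchmark: score 0
}
ranking, scores = rank_by_median_excess_sharpe(strategies, paths)
print(ranking, scores)
```

Taking the median across paths rather than the mean keeps a handful of extreme paths from dominating a strategy's score.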

[Figure: timeline]

We then graded the shortlists against locked 2022–2023 live data nobody had touched during selection.

The backtest ranked the plagiarist #1 of 34. The bootstrap also ranked it #1. Sablier-Flow ranked it #16.

[Figure: rankings]

The top-5 shortlists tell the same story by composition alone:

selector | fabricated picks in its top 5
backtest | 4 of 5
bootstrap | 5 of 5
Sablier-Flow | 0 of 5

[Figure: shortlists]

With 17 fabricated entrants in 34, a randomly drawn top 5 contains no fakes only ~2.2% of the time. Flow's came out clean.
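The ~2.2% figure is a plain counting argument: the number of ways to draw 5 strategies from the 17 honest ones, divided by the number of ways to draw 5 from all 34. A short check (the function name is ours):

```python
from math import comb

def prob_clean_shortlist(n_total=34, n_fabricated=17, shortlist_size=5):
    """Chance that a uniformly random shortlist contains no fabricated strategy."""
    n_honest = n_total - n_fabricated
    return comb(n_honest, shortlist_size) / comb(n_total, shortlist_size)

p_clean = prob_clean_shortlist()
print(f"{p_clean:.4f}")  # 0.0222, i.e. 6188 / 278256
```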

Detecting junk is necessary but not sufficient. A useful selector also has to order the honest candidates better than chance. We graded all three rankings against realized excess Sharpe in locked live windows spanning the 2022 bear market, the chop, and the 2023 rally. That gives many quasi-independent exams instead of one noisy two-year one (3-month horizon → 8 exams, 6-month → 4, 12-month → 2).

[Figure: per-window ρ]

At the 3-month horizon, Sablier-Flow's honest-strategy ranking is positive in 5 of 8 windows, with a mean ρ of +0.11 to +0.18 depending on window convention. The backtest and the bootstrap have strong individual windows too: both swing well above zero in some quarters and well below in others. But the swings cancel: their mean ρ stays between −0.05 and +0.08 at every horizon tested. Flow is the only judge whose windows average out positive. Two windows defeat all three judges: the Q2-2022 crash (the regime transition no pre-2022 information set contained) and the late-2023 window.
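For readers who want to reproduce this kind of grading on their own strategies, here is a sketch of the per-window bookkeeping. It assumes non-overlapping windows (one of several possible window conventions) and a Spearman rank correlation with no tied scores; the helper names are hypothetical.

```python
import numpy as np

def n_windows(total_months=24, horizon_months=3):
    """Number of non-overlapping exams in the live period."""
    return total_months // horizon_months

def spearman_rho(x, y):
    """Spearman rank correlation. Simplification: assumes no ties in x or y."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])

def window_report(predicted_scores, realised_by_window):
    """Compare one pre-2022 ranking with realised excess Sharpe in each live window."""
    rhos = [spearman_rho(predicted_scores, realised) for realised in realised_by_window]
    return {
        "rhos": rhos,
        "mean_rho": float(np.mean(rhos)),
        "n_positive": int(sum(r > 0 for r in rhos)),
    }

# Toy usage: three strategies, two live windows that disagree with each other.
report = window_report(
    predicted_scores=[0.1, 0.5, 0.3],
    realised_by_window=[[1.0, 3.0, 2.0], [3.0, 1.0, 2.0]],
)
print(n_windows(24, 3), n_windows(24, 6), n_windows(24, 12))  # 8 4 2
print(report)
```

The toy report shows the cancellation effect described above: one window at ρ = +1 and one at ρ = −1 average out to a mean ρ of zero.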

The horizon sweep. Does the ranking signal hold up at longer evaluation horizons? We pre-registered the question, and our prediction (weaker, because generator fidelity decays with path length), before running it. We re-scored everything at 3, 6, and 12 months.

[Figure: horizon sweep]

The data went the other way. Flow's ranking accuracy rises to +0.23 at 6 months (positive in 3 of 4 windows) and holds at +0.21 at 12 months (n=2 windows: +0.26 and +0.16). The mechanism we think fits: model fidelity decays with path length, Sharpe measurement noise dominates short windows, and the combination of the two effects peaks at the quarterly-to-annual horizon, which happens to be where allocators evaluate strategies.

[Figure: scatter of model-predicted vs realised Sharpe]

The scatter above is the same ranking job, this time as model-predicted vs realised Sharpe across the 17 honest strategies. The dashed line is the fitted trend, and it slopes upward. The wide spread around it reflects how noisy a single live history is, which is the condition a real desk works under.

The mechanism is simple, and it determines every result above.

A backtest sees one realised history. A strategy that was selected against that history (the dredged rules) or that memorized it (the plagiarist) cannot fail it. The strategy was built or trained against exactly those bars.

The bootstrap helps less than people think. Its thousand "alternative" paths are rearranged blocks of the same real data (mean block length 21 days). The local price patterns inside each block stay intact. Anything that learned those patterns is still seeing what it was trained on, just glued together in a different order. We measured this directly: 84% of the 5-day patterns inside the bootstrap paths exist verbatim in the training set. A bootstrap "validation" of a memorizing model is, in the part that matters, the training set reshuffled.
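A rough way to see why the verbatim share is so high: in a stationary bootstrap with mean block length 21, each step continues the current block with probability 1 − 1/21, so a 5-day window (4 steps) survives uncut with probability (1 − 1/21)^4 ≈ 0.82. The sketch below checks that on resampled indices. It is a simplified stand-in for the measurement above, not the procedure behind the 84% figure: it works on index positions rather than price patterns, and the function names are ours.

```python
import numpy as np

def stationary_bootstrap_indices(n, mean_block=21, rng=None):
    """Resampled time indices: restart at a random bar with prob. 1/mean_block,
    otherwise continue to the next bar (wrapping around at the end)."""
    rng = np.random.default_rng() if rng is None else rng
    idx = np.empty(n, dtype=int)
    idx[0] = rng.integers(n)
    for t in range(1, n):
        if rng.random() < 1.0 / mean_block:
            idx[t] = rng.integers(n)        # start a new block
        else:
            idx[t] = (idx[t - 1] + 1) % n   # continue the current block
    return idx

def verbatim_fraction(idx, pattern_len=5):
    """Share of pattern_len-day windows that are an unbroken run of the original series."""
    consecutive = np.diff(idx) == 1  # False at block seams and at wrap-arounds
    runs = np.lib.stride_tricks.sliding_window_view(consecutive, pattern_len - 1)
    return float(runs.all(axis=1).mean())

rng = np.random.default_rng(0)
fractions = [verbatim_fraction(stationary_bootstrap_indices(3000, 21, rng))
             for _ in range(20)]
mean_verbatim = float(np.mean(fractions))
print(round(mean_verbatim, 2), round((1 - 1 / 21) ** 4, 2))
```

Anything that keys on short local patterns therefore sees mostly familiar material in every bootstrap path, whatever order the blocks arrive in.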

Flow's paths are different. They are generated, not resampled. The model has learned the statistical structure of the panel (fat tails, volatility clustering, heavy-tail dependence, 13 of 14 stylized facts on our public FinBench benchmark) and uses that structure to produce histories that never appeared in the training data and never appeared in any rearrangement of it. The plagiarist has nothing to look up. The dredged rules have no window to ace. The honest strategies, the ones whose edge comes from a structural property of markets rather than from a coincidence in one history, are the only ones left standing.

A backtest is one exam. A bootstrap is the same exam shuffled. Flow writes new exams on the same subject.

One objection could in principle explain §1 without crediting detection: style preference. If Flow's synthetic-Sharpe ranking just demotes its disfavoured styles, the headline collapses. The strictest test of that objection holds style constant. Inside the momentum family (the style Flow would over-reward if it had a preference), where does the dredged variant land?

[Figure: families]

Dead last. All five untuned momentum variants rank above the dredged one. Style preference alone would have predicted the opposite: a tuned momentum strategy in a "momentum-loving" model should have sat near the top of its own family. It sits at the bottom. The signal Flow is acting on goes deeper than style, exactly what §3 predicts.

  • The plagiarist result is structural. It reproduces at both scoring horizons, follows from the verbatim-replay mechanism the bootstrap inherits (measured above), and does not depend on any single live draw.
  • The bootstrap is used here in the only way you can use it for ranking. A careful desk would use a bootstrap for confidence-interval width, not for selection, and we agree. But ranking is the job in this test. There is no confession-free way to make a bootstrap rank: as a ranker it preserves dredged structure (resampled blocks of the selection window keep it intact) and it leaks to anything with memory. The bootstrap is not our straw man; the absence of a third option is.
  • The honest-ranking result lines up with our published benchmark. Flow's accuracy peaks at +0.23 at 6 months and holds at +0.21 at 12 months, the horizons allocators care about. The controlled-protocol version of this claim, with many windows and multiple seeds, is FinBench's TSTR ρ of +0.85 at the 60-bar protocol. The numbers here are the desk-realistic version of that result.
  • Live P&L over two years grades luck, not skill. Three of the single-draw live "winners" of the tournament are fabricated strategies that simply got lucky again. That is exactly why we score by composition of the shortlist (job one) and rank transfer across many windows (job two), not by the dollar P&L of any single shortlist.

You can try Sablier-Flow yourself: sign up at sablier.ai for an API key. The open-source calibration benchmark behind the ρ = +0.85 figure is FinBench.