The problem
A backtest is a story your strategy tells about one specific past. Sharpe 3.2 in-sample, max drawdown 8%, beautiful equity curve. Then you ship it, and the curve flattens. Sometimes it inverts. The strategy did not fail. Markets just dealt a different hand, and your strategy was tuned to the only hand it had ever seen.
The industry has been calling this overfitting for decades, and the response has been to hold out a test set. But the test set is still one history. Splitting 2010–2018 / 2019–2023 does not protect against overfitting to the joint behaviour of US equities during a 13-year bull run with one COVID-shaped shock in the middle. Both halves come from the same draw. The question we have been asking is too weak.
The honest question is not "does this work out-of-sample?" It is "does this work out-of-history?" Would the strategy that delivered Sharpe 3 in your backtest also deliver Sharpe 3 in the 999 alternative versions of 2010–2023 that almost happened? If you cannot answer that, you do not know whether your edge is real or just an artifact of the specific path you got.
We built Sablier-Flow because that is a measurable problem now.
What Sablier-Flow does
Sablier-Flow learns the joint dynamics of your data. Not just the marginal distributions, but how assets move together, how volatility clusters and then breaks, how correlations tighten in a crisis and unwind afterwards. From that learned model, it generates synthetic alternative histories: same statistical fingerprint as your input, completely different specific paths.
Run your existing backtest on each path. You stop getting one Sharpe and start getting a distribution of Sharpes. You stop getting one max drawdown and start getting a tail. The strategies that consistently rank near the top across 1,000 alternative versions of history are the ones with a real edge. The ones that look great in one history and median in the other 999 are the ones you would have shipped and regretted.
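To make "a distribution of Sharpes" concrete, here is a minimal sketch using only the Python standard library. It does not call Sablier-Flow: the paths are i.i.d. Gaussian stand-ins, and the helper names (sharpe, sharpe_distribution) are hypothetical. The summary step has the same shape whatever generator produced the paths.

```python
import math
import random
import statistics

def sharpe(returns, periods_per_year=252):
    """Annualized Sharpe ratio of per-period returns (risk-free rate taken as 0)."""
    return statistics.mean(returns) / statistics.stdev(returns) * math.sqrt(periods_per_year)

def sharpe_distribution(paths):
    """Score every path and summarize the resulting Sharpe distribution."""
    scores = sorted(sharpe(p) for p in paths)
    deciles = statistics.quantiles(scores, n=10)  # 9 cut points: 10%, 20%, ..., 90%
    return {
        "p10": deciles[0],
        "median": statistics.median(scores),
        "p90": deciles[8],
        "share_positive": sum(s > 0 for s in scores) / len(scores),
    }

# Stand-in paths (assumption): i.i.d. Gaussian daily returns with a small drift.
# A real run would score paths produced by a generator instead.
rng = random.Random(0)
paths = [[rng.gauss(0.0004, 0.01) for _ in range(252)] for _ in range(1000)]
summary = sharpe_distribution(paths)
print(summary)
```

The gap between p10 and p90 is the part a single backtest never shows you: even with a fixed positive drift, one year of daily data leaves the realized Sharpe widely dispersed.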
The synth is not a remix. We measure it. On our memorization audit, the nearest-neighbor distance ratio between generated paths and the training tail is 0.93, vs. a 0.02 floor for a replay-style generator that just copies bars back at you. That is a 57.7× separation. The paths are genuinely new, not recoloured rearrangements of the data you already had.
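The exact definition behind our audit lives in the notebook, so treat the following as one plausible reading rather than the audit itself. The sketch assumes the ratio compares two mean nearest-neighbor distances to the training windows: the one for generated windows, and the one for held-out real windows. Under that assumption a ratio near 1 means the generated windows sit as far from the training data as fresh real data does, and a ratio near 0 means they are copies. All names here are hypothetical.

```python
import math
import random

def nearest_distance(window, reference):
    """Euclidean distance from one window to its closest window in `reference`."""
    return min(math.dist(window, ref) for ref in reference)

def nn_distance_ratio(candidates, train, holdout):
    """Mean nearest-neighbor distance of candidates to train, relative to holdout's."""
    cand = sum(nearest_distance(w, train) for w in candidates) / len(candidates)
    base = sum(nearest_distance(w, train) for w in holdout) / len(holdout)
    return cand / base

rng = random.Random(1)

def draw(n, length=20):
    """Toy windows of i.i.d. Gaussian returns (stand-in data, not market data)."""
    return [[rng.gauss(0, 0.01) for _ in range(length)] for _ in range(n)]

train, holdout = draw(200), draw(50)
fresh = draw(50)                         # new draws from the same process
replay = [list(w) for w in train[:50]]   # verbatim copies of training windows

fresh_ratio = nn_distance_ratio(fresh, train, holdout)
replay_ratio = nn_distance_ratio(replay, train, holdout)
print(fresh_ratio, replay_ratio)
```

In this toy setup the verbatim replay scores exactly zero because every copy has a training window at distance zero, while the fresh draws land close to 1.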
How it differs from what is already there
We are not the first attempt at this. Three families exist and each has a known failure mode.
Block bootstrap takes your history and reshuffles blocks of it. It preserves vol clustering within a block, but it can never produce a sequence of joint events you did not already see. The 2020 COVID shock is in your history exactly once. Each of the 1,000 bootstrap paths either replays that exact shock or leaves it out. None contains a similar-but-different shock. It is an honest tool for some questions, and the wrong tool for the overfitting question.
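A minimal sketch of the moving-block variant shows the limitation directly. The function name is hypothetical, and the "history" is just the integers 0 to 99 so that each value doubles as its own time index.

```python
import random

def block_bootstrap(series, block_len, rng):
    """Resample `series` by concatenating randomly chosen contiguous blocks."""
    n = len(series)
    out = []
    while len(out) < n:
        start = rng.randrange(n - block_len + 1)
        out.extend(series[start:start + block_len])
    return out[:n]

rng = random.Random(42)
history = list(range(100))   # stand-in for returns; values double as time indices
path = block_bootstrap(history, block_len=10, rng=rng)

# Each length-10 chunk of the output is a run of consecutive original observations.
chunks = [path[i:i + 10] for i in range(0, len(path), 10)]
all_contiguous = all(c == list(range(c[0], c[0] + 10)) for c in chunks)
print(all_contiguous)  # True
```

Every chunk of the resampled path is a verbatim slice of the original. The order of blocks is new, but nothing inside a block ever is.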
GARCH + copula requires you to hand-specify the dynamics. Pick a GARCH variant, fit a copula, hope your specification covers the regimes your strategy will face. The specification IS the hard part. If you knew the right specification, you would not need to backtest.
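To see how much is specified by hand, here is a sketch of the simplest case: a single-asset Gaussian GARCH(1,1) simulator. It covers only the GARCH half (no copula, no cross-asset dependence), and the parameter values are illustrative picks, not estimates from any dataset.

```python
import math
import random

def simulate_garch11(n, omega, alpha, beta, rng):
    """Simulate n returns with var_t = omega + alpha * r_{t-1}^2 + beta * var_{t-1}."""
    if alpha + beta >= 1:
        raise ValueError("alpha + beta must be < 1 for a finite unconditional variance")
    var = omega / (1 - alpha - beta)   # start at the unconditional variance
    returns = []
    for _ in range(n):
        r = math.sqrt(var) * rng.gauss(0, 1)
        returns.append(r)
        var = omega + alpha * r * r + beta * var
    return returns

# Every number below is a hand-picked specification choice (assumption).
rets = simulate_garch11(n=2000, omega=1e-6, alpha=0.08, beta=0.90, rng=random.Random(7))
print(len(rets))
```

The innovation distribution, the lag order, the three parameters, and the stationarity constraint are all decisions made before the data says anything. A multivariate version adds a copula family and its parameters on top.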
TimeGAN, Diffusion-TS, and KoVAE are the general-purpose ML time-series families. The optimization objective is some version of "can a classifier distinguish synth from real?" That proxy rewards mode collapse: a model that produces low-variance smooth synth aces the classifier check while violating the stylized facts that destroy backtests (fat tails, vol clustering, leverage effect). On FinBench, our public benchmark for multivariate financial generation, TimeGAN passes 9 of 14 stylized facts, Diffusion-TS and TimeVAE pass 10 of 14. Sablier-Flow passes 13 of 14. Add the downstream-utility check (does a strategy trained on synth rank correctly on real?) and only one comparator besides us comes through clean.
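Two of those stylized facts are easy to check by hand. The sketch below is not FinBench or finval code. It uses textbook estimators (excess kurtosis for fat tails, lag-1 autocorrelation of squared returns for vol clustering) on toy data, with hypothetical function names, and it leaves the leverage effect out.

```python
import random
import statistics

def excess_kurtosis(x):
    """Sample excess kurtosis; about 0 for Gaussian data, positive for fat tails."""
    m = statistics.fmean(x)
    var = statistics.fmean([(v - m) ** 2 for v in x])
    fourth = statistics.fmean([(v - m) ** 4 for v in x])
    return fourth / var ** 2 - 3.0

def autocorr(x, lag=1):
    """Sample autocorrelation of x at the given lag."""
    m = statistics.fmean(x)
    num = sum((x[t] - m) * (x[t - lag] - m) for t in range(lag, len(x)))
    den = sum((v - m) ** 2 for v in x)
    return num / den

def vol_clustering(returns, lag=1):
    """Autocorrelation of squared returns; positive when volatility clusters."""
    return autocorr([r * r for r in returns], lag)

rng = random.Random(3)
# "Smooth" synth: i.i.d. Gaussian, so no fat tails and no clustering.
smooth = [rng.gauss(0, 0.01) for _ in range(5000)]

# Toy two-regime series: calm and turbulent volatility alternate in long stretches.
regime = []
for t in range(5000):
    sigma = 0.03 if (t // 250) % 2 else 0.005
    regime.append(rng.gauss(0, sigma))

print(excess_kurtosis(smooth), vol_clustering(smooth))
print(excess_kurtosis(regime), vol_clustering(regime))
```

The smooth series scores near zero on both measures, which is exactly the profile that can fool a discriminator while stripping out the behaviour a drawdown estimate depends on.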
The space was not empty. It just had not been measured against the right metrics.
What it looks like in practice
Five steps in Python after pip install sablier-flow:
import sablier_flow as sf
# 1. Auth (one-time device flow)
sf.login()
# 2. Load data: bundled demo or your own DataFrame
df = sf.demo_data()
backtest_window = df.iloc[-252:] # the year you'd backtest on
# 3. Train + generate 200 alternative versions of that window
fit = sf.fit(df,
             features=list(df.columns),
             data_types=df.attrs['data_types'],
             horizon=252)
paths = sf.generate(fit.model_id, n_paths=200, like=backtest_window)
# 4. Your existing backtest
def my_backtest(prices):
    rets = prices['SPY'].pct_change().dropna()
    return {'sharpe': float(rets.mean() / rets.std() * 252**0.5)}
# 5. Score robustness across the 200 synthetic versions
real = my_backtest(backtest_window)
synth = [my_backtest(p) for p in paths.as_dataframes()]
report = sf.robustness(real, synth, primary_metric='sharpe')
print(report.summary())
Five steps: log in, load data, fit a model, generate paths, score your strategy across them. The robustness call returns a Sharpe distribution, a p-value, an overfit score, and the specific paths your strategy bombed on. The whole workflow runs on free starter credits.
We have packaged three example notebooks with executed outputs:
- Backtest Robustness. At the 0.7 overfit-score threshold, flags 29 of 30 selection-biased lucky strategies (top 30 from a 500-strategy pure-noise pool) while flagging 0 of 12 designed-honest strategies. Concrete separation, not vibes.
- TSTR Predictive Rank. Spearman ρ of +0.78 between synth-fitted strategy ranks and real out-of-sample ranks, 95% bootstrap CI [+0.55, +0.89], p = 7.8e-06 on n = 24. Fitting on synth predicts which strategies win on real.
- Memorization Audit. The 57.7× separation cited above. The synth is new.
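For readers who want to see the statistic behind the TSTR result, here is a minimal sketch of a Spearman rank correlation with a percentile-bootstrap confidence interval. It is not the notebook's code. The 24 score pairs are simulated from a shared latent "skill" plus noise, and every name is hypothetical.

```python
import math
import random
import statistics

def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def bootstrap_ci(x, y, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for Spearman rho, resampling (x, y) pairs."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(spearman([x[i] for i in idx], [y[i] for i in idx]))
    stats.sort()
    return stats[int((1 - level) / 2 * n_boot)], stats[int((1 + level) / 2 * n_boot) - 1]

# Simulated scores (assumption): a shared latent skill plus independent noise.
rng = random.Random(11)
latent = [rng.gauss(0, 1) for _ in range(24)]
synth_score = [s + rng.gauss(0, 0.5) for s in latent]
real_score = [s + rng.gauss(0, 0.5) for s in latent]

rho = spearman(synth_score, real_score)
lo, hi = bootstrap_ci(synth_score, real_score)
print(rho, lo, hi)
```

Resampling pairs creates duplicates, which is why the rank helper averages ties rather than assuming distinct values. With only 24 strategies the interval is wide, so the lower bound is the number to read first.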
Each is a notebook, not a screenshot. Clone it, run it, change the parameters, see if you can break it.
The honest framing
We do not claim our synth is real. Calling generated data real would be the same epistemic move that makes one-history backtests dangerous in the first place. What we claim is that we measure how close it is, on the metrics that actually matter for backtesting, and we report the failures too.
That is why FinBench exists. We needed a benchmark that scored generators on what quants care about, not on what classifiers care about. It scores 14 stylized facts grounded in the empirical-finance literature (Cont 2001, Black 1976, Joe 1997, Bailey & López de Prado 2014). The leaderboard is public. We are at the top of it for now, and the protocol is frozen, so any future model has to clear the same bar. If a competitor beats us, the leaderboard will say so.
The accompanying validation suite, finval, is the scoring library underneath FinBench. It is the same code that powers sf.validate(...) in the SDK, so anything we measure on a leaderboard run is also something you can measure on your own model. The audit is the same on our side and yours.
Try it
pip install sablier-flow
Sign up at sablier.ai for an API key. Free starter credits cover the entire getting-started notebook plus the three demo notebooks above.
If you ship strategies and you are tired of guessing which ones overfit, this is exactly what we built Sablier-Flow for. Tell us where it breaks. We would rather hear it from you than from the live P&L.