Backtesting vs. Live Performance: What the Gap Really Means

Every algorithmic trading platform shows you a backtest. The returns look compelling. Then the strategy goes live and, more often than not, underperforms.

This isn't fraud. It's physics. The gap between backtesting and live performance is real, well-documented, and almost always in one direction: live is worse. Understanding why — and by how much — is the most important due diligence question you can ask before trusting any trading system with your capital.

The backtesting vs. live trading performance gap isn't a failure of the technology. It's a predictable consequence of how historical simulations work. Here's what creates it, and how Lukra approaches each issue.

Overfitting: When the Model Learns the Past Too Well

Overfitting is the most pervasive cause of the gap. When you build a model on historical data, you're optimizing parameters to maximize performance on that specific history. If you test enough parameters — enough combinations of indicator weights, regime thresholds, and timing rules — you will eventually find a configuration that fits the past beautifully.

The problem: the past is not the future. An overfitted model has learned the specific idiosyncrasies of its training data, not the underlying market dynamics that will persist. In backtesting, it looks exceptional. In live trading, its edge decays toward zero, because much of what it memorized was noise.

The solution is out-of-sample testing: reserving a portion of historical data that the model never sees during optimization, then evaluating performance on that unseen data. Walk-forward analysis goes further — repeatedly re-training on rolling windows and evaluating on the subsequent period, simulating how the model would have performed if deployed progressively through history.
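The rolling scheme can be sketched in a few lines of Python. This is a minimal illustration, not any platform's actual validation code; the function name `walk_forward_splits` and the window sizes are assumptions chosen for the example.

```python
from typing import Iterator, List, Tuple


def walk_forward_splits(
    n_obs: int, train_size: int, test_size: int
) -> Iterator[Tuple[range, range]]:
    """Yield (train_indices, test_indices) for rolling walk-forward windows.

    Each test window starts exactly where its training window ends, so the
    model is never evaluated on data it was fitted on.
    """
    start = 0
    while start + train_size + test_size <= n_obs:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size  # roll forward by one test period


# Hypothetical example: 10 observations, train on 4, test on the next 2.
splits: List[Tuple[range, range]] = list(walk_forward_splits(10, 4, 2))
for train, test in splits:
    print(f"train {train.start}-{train.stop - 1}  test {test.start}-{test.stop - 1}")
```

Stepping forward by the test window length means every out-of-sample observation is scored exactly once, which makes the stitched-together test results read like one continuous deployment.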

Lukra's models are developed with strict train/test splits and walk-forward validation. We treat any result that only shows in-sample performance as meaningless until it replicates out-of-sample.

Slippage: The Silent Tax on Every Trade

Backtests assume you trade at the exact price shown in historical data. Real execution doesn't work that way. The moment your order hits the market, you're competing with every other participant. Large orders move price. Speed matters.

Slippage is the difference between the price you assumed in the backtest and the price you actually got. For a small retail account trading liquid large-cap ETFs, slippage is modest: a few basis points per trade. But it accumulates. A strategy executing 100 trades per year with 5bps of average slippage gives up roughly 5% annually if each trade turns over the full account (100 × 0.05%), or roughly 0.5% if each trade moves only a tenth of it. Either way, it is a cost that a naive backtest never captures.
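The drag scales with how much of the account each trade turns over. A minimal sketch of the arithmetic, assuming a hypothetical helper `annual_slippage_drag` and a simple additive approximation that ignores compounding:

```python
def annual_slippage_drag(
    trades_per_year: int,
    slippage_bps: float,
    fraction_of_account: float = 1.0,
) -> float:
    """Approximate annual return lost to slippage, as a fraction of the account.

    Additive approximation: ignores compounding and assumes every trade pays
    the same average slippage on the notional it moves.
    """
    return trades_per_year * (slippage_bps / 10_000) * fraction_of_account


full = annual_slippage_drag(100, 5.0)          # each trade moves the whole account
tenth = annual_slippage_drag(100, 5.0, 0.10)   # each trade moves 10% of it
print(f"full turnover: {full:.2%}   one-tenth turnover: {tenth:.2%}")
```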

Lukra models slippage explicitly in backtesting using realistic assumptions based on the instrument's average daily volume and the order size being executed. We don't assume midpoint execution; we assume market impact. This produces more conservative backtest figures that are closer to what live trading will actually deliver.
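One common way to tie slippage to order size and average daily volume is a square-root impact rule of thumb. The sketch below illustrates that general idea and is not Lukra's actual model; the function names, the impact coefficient, and the half-spread default are all assumptions.

```python
import math


def estimated_slippage_bps(
    order_shares: float,
    adv_shares: float,
    daily_vol_bps: float,
    half_spread_bps: float = 0.5,   # assumed cost of crossing half the spread
    impact_coeff: float = 1.0,      # assumed scaling constant
) -> float:
    """Half-spread plus square-root market impact, in basis points."""
    if order_shares < 0 or adv_shares <= 0:
        raise ValueError("order_shares must be >= 0 and adv_shares > 0")
    participation = order_shares / adv_shares  # share of a day's volume
    return half_spread_bps + impact_coeff * daily_vol_bps * math.sqrt(participation)


def fill_price(mid_price: float, side: str, slippage_bps: float) -> float:
    """Move the assumed fill away from the midpoint, against the trader."""
    sign = 1 if side == "buy" else -1
    return mid_price * (1 + sign * slippage_bps / 10_000)


# Hypothetical order: 10,000 shares against 1,000,000 ADV, 100bps daily volatility.
cost_bps = estimated_slippage_bps(10_000, 1_000_000, 100.0)
print(round(cost_bps, 2), round(fill_price(100.0, "buy", cost_bps), 4))
```

Because impact grows with the square root of participation, doubling the order size raises the impact term by about 41%, not 100%, but the penalty never goes to zero the way a midpoint-fill assumption implies.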

Survivorship Bias: The Graveyard Is Hidden

Historical market data has a quiet problem: it mostly contains winners. Indices are rebalanced. Failing companies get delisted. ETFs that close their doors don't appear in the data your model trained on.

If you backtest a strategy on the current S&P 500 constituents going back 20 years, you're training on 500 companies that survived and grew large enough to remain in the index. The companies that went bankrupt, got acquired for pennies, or simply stagnated — they're not in your dataset. Your model is learning from a curated sample of success.
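A toy calculation makes the distortion concrete. The tickers and returns below are invented purely for illustration; the point is only that averaging over survivors overstates what an investor holding the full original universe would have earned.

```python
from statistics import mean

# Hypothetical 20-year total returns for a made-up universe (illustrative numbers).
universe = {
    "AAA": {"total_return": 3.50, "still_listed": True},
    "BBB": {"total_return": 1.20, "still_listed": True},
    "CCC": {"total_return": 0.40, "still_listed": True},
    "DDD": {"total_return": -1.00, "still_listed": False},  # went bankrupt
    "EEE": {"total_return": -0.85, "still_listed": False},  # acquired for pennies
}

# What a backtest sees if it only loads today's constituents.
survivors_only = mean(
    r["total_return"] for r in universe.values() if r["still_listed"]
)
# What actually happened to everything that was tradable at the start.
full_universe = mean(r["total_return"] for r in universe.values())

bias = survivors_only - full_universe
print(f"survivors: {survivors_only:.2f}  full: {full_universe:.2f}  bias: {bias:.2f}")
```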

For strategies trading major liquid instruments like SPY (the ETF itself rather than individual constituents), survivorship bias is less severe. SPY has been trading since 1993, and its price history already reflects every index addition and deletion as it happened, so there is no hidden graveyard to reconstruct. This is why Lukra's primary focus on broad market instruments rather than individual equities reduces this distortion.

Look-Ahead Bias: Using Information You Wouldn't Have Had

Look-ahead bias is a coding error that introduces future information into historical decisions. It's surprisingly easy to introduce accidentally.

A common example: using today's closing price to make a trade decision that the model executes at today's open. In the real world, you don't know the close at open time. In a poorly written backtest, the data is already available in the array.
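The open/close mistake is easy to reproduce. In this sketch the bars, the momentum rule, and the `backtest` function are all hypothetical; the only difference between the two runs is whether the signal reads the same day's close or the prior day's.

```python
from typing import List, Tuple

# Hypothetical daily bars as (open, close).
bars: List[Tuple[float, float]] = [
    (100.0, 102.0),
    (102.0, 101.0),
    (101.0, 104.0),
    (104.0, 103.0),
    (103.0, 106.0),
]


def backtest(bars: List[Tuple[float, float]], lag: int) -> float:
    """Sum of open-to-close returns on days the signal says 'long'.

    lag=0 decides with the same day's close (look-ahead bias).
    lag=1 decides with the prior day's close (known at the open).
    """
    total = 0.0
    for t in range(2, len(bars)):
        ref = t - lag
        # Signal: the reference close rose versus the close before it.
        signal_long = bars[ref][1] > bars[ref - 1][1]
        if signal_long:
            day_open, day_close = bars[t]
            total += day_close / day_open - 1
    return total


biased = backtest(bars, lag=0)
honest = backtest(bars, lag=1)
print(f"with look-ahead: {biased:+.4f}   without: {honest:+.4f}")
```

On this toy data the leaking version only trades on days that end up rising, so it looks profitable, while the version restricted to information available at the open loses money.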

Less obvious examples include applying index constituent lists to dates before the membership changes were actually announced, or using fundamental data without modeling the lag between the end of the reporting period and the date the figures were published.

Lukra's backtesting infrastructure enforces strict event-time logic — each decision only sees data that would have been available at the moment the decision was made. Every data source has a timestamp and a publication lag applied before the model sees it.
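A point-in-time lookup of this kind might look like the following. This is a sketch of the general pattern under stated assumptions, not Lukra's infrastructure: the `Observation` class, the `latest_available` helper, the 30-day lag, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class Observation:
    period_end: datetime         # the moment the value describes
    publication_lag: timedelta   # delay before it became public
    value: float

    @property
    def available_at(self) -> datetime:
        return self.period_end + self.publication_lag


def latest_available(
    history: List[Observation], decision_time: datetime
) -> Optional[Observation]:
    """Return the newest observation that was public at decision_time."""
    visible = [o for o in history if o.available_at <= decision_time]
    return max(visible, key=lambda o: o.available_at, default=None)


# Hypothetical quarterly figures, each published 30 days after quarter end.
history = [
    Observation(datetime(2023, 12, 31), timedelta(days=30), 1.10),
    Observation(datetime(2024, 3, 31), timedelta(days=30), 1.25),
]

mid_april = latest_available(history, datetime(2024, 4, 15))
print(mid_april.value if mid_april else None)
```

A decision made in mid-April sees the December figure, not the March one, because the March figure had not yet been published even though its reporting period had ended.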

Execution Timing and Real-World Friction

Naive backtests assume instantaneous, costless order execution. Reality involves:

Latency: time passes between signal generation and order placement, and the price can move in that interval
Queue position: a limit order may go unfilled if the price only touches your level and moves away before your place in the queue is reached
Partial fills: large orders in less liquid instruments may only partially execute at the desired price
Commission costs: small for ETF trading, but real

Modeling these frictions realistically is the difference between a backtest that guides useful decisions and one that sets unrealistic expectations.
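Several of these frictions can be approximated with conservative fill rules. The sketch below is one possible set of assumptions, not a description of any production simulator: `simulate_limit_buy`, the 5% participation cap, and the per-share commission are hypothetical. Latency is not modeled here; one simple approximation is to evaluate the order against the bar after the one that produced the signal.

```python
from dataclasses import dataclass


@dataclass
class Fill:
    shares: int
    price: float
    commission: float


def simulate_limit_buy(
    limit_price: float,
    order_shares: int,
    bar_low: float,
    bar_volume: int,
    max_participation_pct: int = 5,       # assumed cap: 5% of the bar's volume
    commission_per_share: float = 0.005,  # assumed commission schedule
) -> Fill:
    """Conservative fill model for a resting limit buy order over one bar."""
    # Queue position: require the price to trade through the limit.
    # A bar that merely touches the limit is treated as no fill.
    if bar_low >= limit_price:
        return Fill(0, limit_price, 0.0)
    # Partial fills: cap the fill at a share of the bar's traded volume.
    fillable = bar_volume * max_participation_pct // 100
    shares = min(order_shares, fillable)
    return Fill(shares, limit_price, shares * commission_per_share)


touched = simulate_limit_buy(100.0, 1_000, bar_low=100.0, bar_volume=50_000)
partial = simulate_limit_buy(100.0, 1_000, bar_low=99.9, bar_volume=10_000)
print(touched.shares, partial.shares, partial.commission)
```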

How Lukra Reports the Gap

We don't hide the gap. We measure it and publish both figures.

For every strategy, you can compare the historical backtest performance to the live trading track record for the period when the strategy has been active. The live record is auditable against actual brokerage statements. The comparison tells you what the realistic gap has been in practice.

This is the only honest way to evaluate an algorithmic trading system: don't just look at the backtest — look at how closely live performance has tracked the backtest, and in which direction it diverges.

A strategy whose live performance closely tracks its backtest is a strategy with minimal overfitting and realistic simulation assumptions. A strategy that dramatically outperforms its backtest in live trading has likely been lucky. A strategy that dramatically underperforms has likely been overfit, simulated with unrealistic execution assumptions, or both.
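One way to put numbers on "closely tracks" is to compare cumulative returns over the same live period and measure how much the per-period differences vary. The metric names and the sample monthly returns below are hypothetical, chosen only to show the calculation.

```python
from math import prod
from statistics import stdev
from typing import Dict, List


def cumulative_return(returns: List[float]) -> float:
    """Compound a series of simple periodic returns."""
    return prod(1 + r for r in returns) - 1


def gap_report(backtest: List[float], live: List[float]) -> Dict[str, float]:
    """Compare backtest and live returns over the same periods."""
    if len(backtest) != len(live):
        raise ValueError("series must cover the same periods")
    diffs = [l - b for b, l in zip(backtest, live)]
    bt_cum = cumulative_return(backtest)
    live_cum = cumulative_return(live)
    return {
        "backtest_cum": bt_cum,
        "live_cum": live_cum,
        "gap": live_cum - bt_cum,         # negative means live underperformed
        "tracking_error": stdev(diffs),   # dispersion of per-period differences
    }


# Hypothetical monthly returns for the same four months.
backtest_returns = [0.020, 0.010, -0.010, 0.030]
live_returns = [0.018, 0.008, -0.012, 0.027]

report = gap_report(backtest_returns, live_returns)
print({k: round(v, 4) for k, v in report.items()})
```

A small, steady negative gap with low tracking error is the signature of ordinary trading costs. A large or erratic gap points to overfitting or to simulation assumptions that did not hold.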

The goal isn't a backtest that looks impressive. It's a backtest that accurately predicts live performance.

To understand the architecture behind SPY v4's backtest and how the model was validated before going live, see Inside the SPY v4 Model: Architecture of an AI Trading Strategy.

See how Lukra's live performance compares to the backtest — the gap is part of what we publish. View live performance →

Past performance is not indicative of future results. Algorithmic trading involves risk of loss. Backtested results are hypothetical and do not represent actual trading.