Statistical Significance
Shoplift continuously evaluates the validity of your test results using a combination of frequentist and Bayesian statistical methods. Throughout the life of a test, we provide clear indicators to help you understand how your test is performing, how close it is to reaching statistical significance, and when it’s safe to make decisions based on the results.
How Shoplift Measures Test Validity
To determine whether a test result is statistically significant, Shoplift considers multiple factors:
Time elapsed (minimum duration)
Sample size (minimum number of conversion events)
Statistical power (target: 80%)
Probability to win (estimated certainty from our Bayesian model)
Once these conditions are met, your test is certified as Significant, and the winning variant can be confidently implemented.
Key Progress Indicators
At the top of each test report, Shoplift displays progress indicators that help you track how close your test is to significance. These indicators include:
Status
A semantic label indicating how reliable your test results are. It accounts for time, sample size, and the statistical quality of your data to reflect whether results are likely to be real or the result of random noise. For a list of statuses, see the tables below.
Lift
Lift is the difference in performance between your original and variant experiences. The upper and lower bounds represent where your true lift likely falls, based on current test data. It's calculated using a statistical measure called standard error, which accounts for variation in both the original and variant performance. A tighter range means more confidence in the result; a wider range means more uncertainty.
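As a rough illustration of how lift bounds can be derived from standard error, here is a simplified sketch using hypothetical numbers and a normal approximation. This is not Shoplift's exact formula.

```python
import math

# Hypothetical numbers for illustration only -- not real Shoplift data.
orig_visitors, orig_conversions = 1000, 50   # original: 5.0% conversion
var_visitors, var_conversions = 1000, 60     # variant:  6.0% conversion

p_orig = orig_conversions / orig_visitors
p_var = var_conversions / var_visitors

# Relative lift: how much better (or worse) the variant performs.
lift = (p_var - p_orig) / p_orig

# Standard error of the difference in conversion rates, combining the
# variation observed in both the original and the variant.
se = math.sqrt(p_orig * (1 - p_orig) / orig_visitors
               + p_var * (1 - p_var) / var_visitors)

# 95% bounds on the relative lift (normal approximation).
lower = ((p_var - p_orig) - 1.96 * se) / p_orig
upper = ((p_var - p_orig) + 1.96 * se) / p_orig

print(f"lift: {lift:+.1%}  (95% bounds: {lower:+.1%} to {upper:+.1%})")
```

With more traffic, `se` shrinks and the bounds tighten around the observed lift, which is why the range narrows as a test collects data.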
Win Chance
An estimated likelihood that a variant outperforms the original, calculated using Bayesian modeling. As the test progresses and more data is collected, this probability becomes more stable and reliable.
The upper and lower bounds represent the 95% credible interval: the range within which the true win chance is likely to fall, with 95% probability. A tighter range means more confidence in the result; a wider range means more uncertainty.
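To make the idea concrete, here is a minimal Monte Carlo sketch of a Bayesian win chance using a conjugate Beta-Binomial model with a uniform prior. The data, prior, and model are hypothetical; Shoplift's actual engine is more sophisticated.

```python
import random

random.seed(42)  # reproducible illustration

# Hypothetical conversion data -- not Shoplift's actual model or priors.
orig_visitors, orig_conversions = 1000, 50
var_visitors, var_conversions = 1000, 80

def posterior_sample(conversions, visitors):
    """Draw one plausible true conversion rate from a Beta(1+c, 1+n-c)
    posterior (uniform prior updated with the observed data)."""
    return random.betavariate(1 + conversions, 1 + visitors - conversions)

# Win Chance: the fraction of posterior draws where the variant's
# plausible true rate beats the original's.
draws = 20_000
wins = sum(
    posterior_sample(var_conversions, var_visitors)
    > posterior_sample(orig_conversions, orig_visitors)
    for _ in range(draws)
)
win_chance = wins / draws
print(f"Win Chance: {win_chance:.1%}")
```

As more visitors accumulate, the two posteriors narrow, so the win chance computed this way fluctuates less from day to day, which is the stabilizing behavior described above.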
Time to Significance
A dynamic estimate of how much longer your test needs to run to reach significance, based on the current velocity and volume of data being collected. Because it is an estimate, it becomes more accurate as more data comes in.
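The core idea can be sketched as a simple extrapolation: project the remaining data needed forward at the current collection rate. All figures below are hypothetical, and Shoplift's actual estimator is not documented here.

```python
# A simplified extrapolation sketch -- not Shoplift's actual estimator.
target_conversions = 200      # hypothetical conversions needed per variant
collected_conversions = 120   # conversions gathered so far
days_elapsed = 6

daily_rate = collected_conversions / days_elapsed   # current velocity
remaining = max(0, target_conversions - collected_conversions)
days_to_significance = remaining / daily_rate

print(f"Estimated days to significance: {days_to_significance:.1f}")
```

This also shows why the estimate sharpens over time: with more days of data, `daily_rate` reflects true traffic patterns rather than early fluctuations.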
Active Test Statuses
Gathering Data
The test has launched but hasn’t yet met the minimum time (3 days) or sample size (30 conversions per variant). Results are not yet reliable.
Trending
Early trends suggest the variant may be outperforming (or underperforming) the original. Results are directional but not yet validated.
Significant
The test has met the criteria for statistical significance: it has run for at least 7 days and has reached 95% confidence and 80% statistical power. Results are considered reliable and actionable, and the reported Win Chance and Lift values are validated.
Significance Unlikely
The test has run for 14+ days, and significance is projected to take more than 60 additional days. Consider ending the test and trying a new approach; an effect this small is unlikely to be worth the wait.
Ended Test Statuses
Trend Identified
The test was ended after the initial sample and time thresholds were met, but before significance was achieved. These outcomes can be interpreted as directional, but not statistically validated.
Significant
The test was ended after reaching statistical significance. The results are valid and can be acted upon.
Inconclusive – Not Enough Data
The test was ended before enough data was collected to draw any meaningful inference.
Inconclusive – Significance Unlikely
The test was ended after it became clear that statistical significance was unlikely to be achieved. This helps distinguish between promising ideas that lacked time and ideas that were unlikely to yield a measurable impact.
How Statistical Significance is Determined
Frequentist Criteria
Time Requirement: Tests must run for at least 3 full days, even if early data shows promising trends. This helps account for the "newness effect", where visitors could respond strongly to novel changes that may not persist over time.
Sample Size Requirement: Each variant must accumulate at least 30 conversion events. While this is the minimum threshold, more data generally improves reliability.
Power Requirement: Shoplift targets 80% statistical power, meaning the test has an 80% chance of detecting a real effect of the observed size if one exists (reducing the risk of a false negative). Power is calculated based on the observed sample size and effect size.
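The power calculation for comparing two conversion rates can be sketched with a standard normal approximation. The data below is hypothetical, and this post-hoc formula is an illustration rather than Shoplift's exact computation.

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Hypothetical observed data -- illustration only.
p_orig, p_var, n = 0.05, 0.06, 1000   # per-variant sample size

# Standard error of the difference under the observed rates.
se = math.sqrt(p_orig * (1 - p_orig) / n + p_var * (1 - p_var) / n)

# Power at 95% confidence (z = 1.96): the chance a test of this size
# would detect an effect as large as the one observed.
z_beta = abs(p_var - p_orig) / se - 1.96
power = normal_cdf(z_beta)
print(f"Estimated power: {power:.0%}")
```

With these numbers the power comes out well under the 80% target, which is exactly the situation where a test needs more traffic (or a larger effect) before its result can be certified.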
Bayesian Methodology
As data is collected, Shoplift’s Bayesian engine models a posterior distribution of possible outcomes for each variant using a Markov Chain Monte Carlo algorithm. This distribution narrows over time as confidence grows.
The model continuously computes the Win Chance for each variant.
As trends become more consistent, probabilities stabilize and standard deviations narrow.
Once power and confidence thresholds are met, this probability is validated and your test reaches Significance.
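The narrowing of the posterior described above can be illustrated without MCMC, using a conjugate Beta posterior whose spread has a closed form. The sample sizes are hypothetical; the point is only that the same observed rate yields a much tighter posterior as data accumulates.

```python
import math

def beta_posterior_sd(conversions, visitors):
    """Standard deviation of a Beta(1+c, 1+n-c) posterior
    (uniform prior updated with the observed data)."""
    a, b = 1 + conversions, 1 + visitors - conversions
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

# The same 5% conversion rate observed at two different sample sizes
# (hypothetical numbers): the posterior narrows as data accumulates.
early = beta_posterior_sd(5, 100)
late = beta_posterior_sd(500, 10_000)

print(f"posterior sd early: {early:.4f}, late: {late:.4f}")
```

A narrower posterior means the model's Win Chance estimate moves less with each new day of data, which is what allows the result to be validated once the power and confidence thresholds are met.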
Outlier Handling
Shoplift’s model is robust to outliers. Temporary reversals or data spikes are down-weighted, with consistent trends receiving greater emphasis than short-lived anomalies.