Statistical Significance

For each test, Shoplift uses a hybrid of frequentist and Bayesian methodologies to determine whether your test results are valid.

Understanding the status of your test

Test reports display three key indicators, each describing a different aspect of the validity of your test results before significance is reached:

Progress

Progress indicates how reliably your test results can be certified as real. Any test is only a small sample of data in the overall lifetime of your store, so progress tells you whether your results look like a representative sample or an anomaly.

Probability to win

Probability to win is the estimated chance that one test experience outperforms another; the percentage reflects how certain the model currently is about that outcome. As data is collected and your test approaches statistical significance, the probability to win for each variant becomes more credible.

Estimated time to significance

Estimated time to significance reflects the approximate remaining time necessary to reach a statistically significant result. This estimate varies based on the volume and velocity of data collected, and is updated dynamically as your test progresses.
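As a rough illustration (not Shoplift's internal estimator), a remaining-time estimate of this kind can be sketched by comparing the per-variant sample size a standard two-proportion test would need at a 95% confidence level and 80% power against the rate at which visitors are arriving. The function and all of the numbers below are hypothetical.

```python
from math import ceil
from scipy.stats import norm

def estimated_days_to_significance(p_original, p_variant,
                                   visitors_per_variant_so_far,
                                   visitors_per_variant_per_day,
                                   alpha=0.05, power=0.80):
    """Rough heuristic: the per-variant sample size a two-proportion z-test
    would need at the given alpha and power, converted into remaining days
    at the current traffic velocity. Illustrative only."""
    z_alpha = norm.ppf(1 - alpha / 2)   # ~1.96 for a 95% confidence level
    z_beta = norm.ppf(power)            # ~0.84 for 80% power
    effect = abs(p_variant - p_original)
    if effect == 0:
        return float("inf")             # no detectable difference yet
    variance = p_original * (1 - p_original) + p_variant * (1 - p_variant)
    n_required = (z_alpha + z_beta) ** 2 * variance / effect ** 2
    remaining = max(0.0, n_required - visitors_per_variant_so_far)
    return ceil(remaining / visitors_per_variant_per_day)

# Example: 2.0% vs 2.4% conversion, 3,000 visitors per variant so far, ~1,000/day
print(estimated_days_to_significance(0.020, 0.024, 3_000, 1_000))
```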

How does Shoplift determine if a test is statistically significant?

For a test to be determined statistically significant, it must meet specific criteria for time elapsed, sample size, and power, as well as a minimum median probability to win derived from our Bayesian statistical model.
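These criteria are described in detail in the sections below. Purely as an illustration of how the checks combine, a simplified gate might look like the sketch below; the probability-to-win threshold shown is an assumed placeholder, since Shoplift does not publish that value.

```python
def test_is_significant(days_elapsed, orders_original, orders_variant,
                        statistical_power, prob_to_win,
                        min_days=3, min_orders=30,
                        min_power=0.80, min_prob_to_win=0.95):
    """Illustrative combination of the checks described in this article.
    min_prob_to_win is a placeholder value, not a documented threshold."""
    return (days_elapsed >= min_days
            and orders_original >= min_orders
            and orders_variant >= min_orders
            and statistical_power >= min_power
            and prob_to_win >= min_prob_to_win)
```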

FAQ: Why isn't my test significant but my probability to win is high?

Shoplift requires a test to run for at least three days and obtain at least 30 orders for each variant experience before the test can reach significance. We strongly recommend waiting for these requirements to be met before ending your test. Please read below for more information on these requirements.

Frequentist methodologies

Time

Because visitor behavior can vary over time, Shoplift requires a test to run for a minimum of 3 days before it can be determined statistically significant, regardless of the sample size observed or the power calculated.

We impose this restriction so that data collected early on is not acted upon, even if the behavior is consistent, because of the "newness effect" for variants: shoppers often engage with a new experience more readily simply because it is new. This effect generally tapers off after a few days.

Sample Size

For all tests, Shoplift requires a minimum of 30 observations (conversion events) on both the original and the variant. This is a hard minimum; it is unlikely that a test will be determined statistically significant immediately upon reaching this threshold, but if it is, we recommend continuing to run the test for additional time.

Power

All tests are evaluated at a confidence level of 95% and a statistical power of 80%. As data is observed, we perform a standard power analysis based on the observed sample size and effect size. Together, the 95% confidence level and the 80% power requirement help control false positives (Type I error) and false negatives (Type II error) in your test results, even when our Bayesian probabilities might strongly indicate a winning variant.

When power reaches 80%, Shoplift determines that there is a significant difference between your original and variant. Even if the probability to win on a variant reaches 95% or better, we strongly advise waiting until the test is determined statistically significant before acting on the results.
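As an illustration, this kind of observed-power calculation can be reproduced with standard tools. The sketch below runs a generic two-proportion power analysis via statsmodels with made-up numbers; it is not Shoplift's exact implementation.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Hypothetical observed conversion rates and per-variant sample sizes (visitors)
p_original, p_variant = 0.020, 0.026
n_original, n_variant = 9_000, 9_000

# Cohen's h effect size for two proportions
effect_size = proportion_effectsize(p_variant, p_original)

# Observed power at a 95% confidence level (alpha = 0.05)
power = NormalIndPower().power(effect_size=effect_size,
                               nobs1=n_original,
                               alpha=0.05,
                               ratio=n_variant / n_original,
                               alternative="two-sided")
print(f"Observed power: {power:.2f}")  # the test would be called once this reaches 0.80
```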

Bayesian methodologies

As data is collected during a test, its volume, velocity, and trend are evaluated by our Bayesian regression and inference model, which calculates outcome values simulated from the posterior predictive distribution using a Markov chain Monte Carlo algorithm.

The posterior distribution is just the technical term for the range of outcomes that could result from a given test. As a test collects more data, the posterior distribution grows narrower, meaning that the possible range of outcomes shrinks and Shoplift's certainty in the results grows.

At any time during a test, this model provides a "probability to win" for each variant. If the trend in the data is consistent, the probabilities to win for the variants will diverge from one another over time, and the standard deviation of each estimate will narrow.
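Shoplift's model is a regression fitted by MCMC, but the idea behind "probability to win" can be illustrated with a much simpler conjugate Beta-Binomial sketch: draw samples from each variant's posterior conversion rate and count how often one beats the other. Everything below (the prior, the counts, the seed) is assumed for illustration only.

```python
import numpy as np

rng = np.random.default_rng(7)

def probability_to_win(conversions_a, visitors_a, conversions_b, visitors_b,
                       draws=100_000):
    """Monte Carlo probability that B's conversion rate exceeds A's,
    using a Beta(1, 1) prior for each variant. A simplified stand-in for
    an MCMC-based regression model."""
    post_a = rng.beta(1 + conversions_a, 1 + visitors_a - conversions_a, draws)
    post_b = rng.beta(1 + conversions_b, 1 + visitors_b - conversions_b, draws)
    return (post_b > post_a).mean()

# Early in a test the posteriors overlap heavily...
print(probability_to_win(20, 1_000, 26, 1_000))       # roughly 0.8
# ...and with ten times the data the same observed rates separate further.
print(probability_to_win(200, 10_000, 260, 10_000))   # close to 1.0
```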

When a test reaches significance, this probability to win value is validated and your test has reached its conclusion. However, you can always choose to run your test for longer to build a greater degree of confidence in the winning experience.

How does your model adjust for outliers in the data?

Our model is highly adaptive to outliers in the data. If a consistent trend in performance is interrupted for a period by an outlier, the outlying data is weighted less heavily than the preexisting trend when the model makes its determinations.
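As a toy illustration of down-weighting (not Shoplift's actual model), the sketch below applies Huber-style robust weights so that a single anomalous day contributes far less to the estimated conversion rate than days that fit the prevailing trend.

```python
import numpy as np

def robust_weighted_mean(daily_rates, k=1.345):
    """Toy illustration of outlier down-weighting using Huber weights.
    Days far from the prevailing trend receive smaller weights."""
    rates = np.asarray(daily_rates, dtype=float)
    center = np.median(rates)
    mad = np.median(np.abs(rates - center))        # robust scale estimate
    scale = 1.4826 * mad if mad > 0 else 1e-12
    z = np.abs(rates - center) / scale
    weights = np.where(z <= k, 1.0, k / z)         # down-weight large deviations
    return np.average(rates, weights=weights)

# Nine ordinary days plus one outlier day
rates = [0.021, 0.020, 0.022, 0.019, 0.021, 0.020, 0.022, 0.021, 0.020, 0.045]
print(np.mean(rates))                 # pulled upward by the outlier
print(robust_weighted_mean(rates))    # stays close to the prevailing trend
```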
