Kevin Hillstrom: MineThatData: Statistical Significance of A/B Tests

March 02, 2023

Statistical Significance of A/B Tests

There is a ton of misinformation out there about A/B testing. Those lacking rigorous statistical training tell you that you need "x" responses for a valid A/B test.

That's not how this stuff works.

More than thirty years ago (at Lands' End), we developed an equation to estimate the variance associated with our A/B mailing tests. It turned out that the variance of our estimates was non-constant. In other words, the variance might be "x" when the dollar-per-book was $3.00 ... it might be "1.5x" at a dollar-per-book of $5.00.

We developed an equation ... as long as dollars-per-book was >= $2.00 variance could be estimated as -188 + 192*x, where "x" was the dollars-per-book in a test group (aside ... there are going to be statistical experts who balk at creating this equation ... one that accounts for non-constant variance ... and will say that everything that follows is garbage ... just want you to know that view is out there, I need to be forthright here).

Let's pretend that our control dollar-per-book was $3.00, and we expected the test dollar-per-book to be $3.25. Let's pretend that we wanted 10,000 customers in the test group and 10,000 customers in the control group. Would our results be statistically significant?

The t-test equation looked like this:

Test Group Dollar-Per-Book = $3.25.
Control Group Dollar-Per-Book = $3.00.
Test Group Sample Size = 10,000.
Control Group Sample Size = 10,000.
Variance of Test Group = -188 + 192*3.25 = 436.
Variance of Control Group = -188 + 192*3.00 = 388.
T-Score = (3.25 - 3.00) / SQRT(436/10000 + 388/10000) = (0.25) / (0.287) = 0.87.

The T-Score is nowhere close to 2.00 ... so the results are not statistically significant.
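
For readers who want to check the arithmetic, here is a minimal Python sketch of the calculation above. The function names are my own for illustration, and the variance equation is the Lands' End one quoted in this post, so it should not be assumed to fit any other business. Carrying full precision gives a T-Score of about 0.87, which leads to the same conclusion.

```python
import math

def variance_estimate(dollars_per_book: float) -> float:
    """Variance model quoted in the post: -188 + 192 * x (stated for x >= $2.00)."""
    if dollars_per_book < 2.00:
        raise ValueError("the model is only stated for dollars-per-book >= $2.00")
    return -188 + 192 * dollars_per_book

def t_score(test_dpb: float, control_dpb: float, n_test: int, n_control: int) -> float:
    """Difference in dollars-per-book divided by its estimated standard error."""
    standard_error = math.sqrt(
        variance_estimate(test_dpb) / n_test
        + variance_estimate(control_dpb) / n_control
    )
    return (test_dpb - control_dpb) / standard_error

# The scenario from the post: $3.25 test vs. $3.00 control, 10,000 per group.
score = t_score(3.25, 3.00, 10_000, 10_000)
print(f"T-Score = {score:.2f}")
```

Changing the two sample-size arguments shows how the same $0.25 difference fares with larger groups.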

Now, does that mean that the results aren't meaningful? Maybe. What would happen if we had 100,000 customers in each of the test and control groups?

T-Score = (3.25 - 3.00) / SQRT(436/100000 + 388/100000) = (0.25) / (0.0908) = 2.75.

Now the results are statistically significant.

What was the difference?

Well, the original sample size was too small.

We used the equation above to determine the appropriate sample size for all tests based on the amount of variance associated with our expectations for test group performance and holdout group performance.
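
One way to turn the T-Score equation into a sample size calculation is to solve it for the group size. This is a sketch under my own assumptions, not a record of the exact procedure we used: it assumes equal test and control group sizes, uses the 2.00 T-Score threshold mentioned above, and uses hypothetical function names. With n customers per group, setting T = difference / SQRT((test variance + control variance) / n) and solving gives n = T^2 * (test variance + control variance) / difference^2.

```python
import math

def variance_estimate(dollars_per_book: float) -> float:
    """Variance model quoted in the post: -188 + 192 * x (stated for x >= $2.00)."""
    if dollars_per_book < 2.00:
        raise ValueError("the model is only stated for dollars-per-book >= $2.00")
    return -188 + 192 * dollars_per_book

def required_sample_size(test_dpb: float, control_dpb: float,
                         target_t: float = 2.00) -> int:
    """Customers needed in EACH group (equal sizes assumed) to reach target_t."""
    difference = test_dpb - control_dpb
    if difference == 0:
        raise ValueError("no sample size can detect a difference of zero")
    total_variance = variance_estimate(test_dpb) + variance_estimate(control_dpb)
    return math.ceil(target_t ** 2 * total_variance / difference ** 2)

# Expected $3.25 in the test group vs. $3.00 in the control group.
per_group = required_sample_size(3.25, 3.00)
print(f"Customers needed per group: {per_group:,}")
```

For the $3.25 vs. $3.00 example this lands between the 10,000 that failed and the 100,000 that passed. A smaller expected difference pushes the requirement up quickly, because the difference is squared in the denominator.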

When I worked at Nordstrom, we used a comparable equation - one specific to Nordstrom. We learned that we needed 100,000 or 200,000 customers to measure what we wanted to measure ... not 10,000 or 20,000.

Comparable issues impact website conversion. You don't want to measure conversions ... you want to measure sales per visitor. Your test group might spend more/less than the control group once the decision to purchase is made ... so you have to measure sales per visitor.
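
Here is a small Python sketch of why the two metrics can disagree. The visitor counts and order values are invented purely for illustration and are not real test results, and the function name is hypothetical.

```python
def summarize(visitors: int, order_values: list[float]) -> tuple[float, float]:
    """Return (conversion rate, sales per visitor) for one group."""
    conversion_rate = len(order_values) / visitors
    sales_per_visitor = sum(order_values) / visitors
    return conversion_rate, sales_per_visitor

# Made-up data: both groups convert 30 of 1,000 visitors,
# but buyers in the test group spend $110 instead of $100.
control_conv, control_spv = summarize(1_000, [100.0] * 30)
test_conv, test_spv = summarize(1_000, [110.0] * 30)

print(f"Control: conversion {control_conv:.1%}, sales per visitor ${control_spv:.2f}")
print(f"Test:    conversion {test_conv:.1%}, sales per visitor ${test_spv:.2f}")
```

In this made-up case a conversion-only readout calls the test a tie, while sales per visitor shows the test group ahead.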

Follow the math.

Kevin Hillstrom, President, MineThatData

Kevin is President of MineThatData, a consultancy that helps CEOs understand the complex relationship between Customers, Advertising, Products, Brands, and Channels. Kevin supports a diverse set of clients, including internet startups, thirty million dollar catalog merchants, international brands, and billion dollar multichannel retailers. Kevin is frequently quoted in the mainstream media, including the New York Times, Boston Globe, and Forbes Magazine.

Prior to founding MineThatData, Kevin held various roles at leading multichannel brands, including Vice President of Database Marketing at Nordstrom, Director of Circulation at Eddie Bauer, and Manager of Analytical Services at Lands' End.

You may contact Kevin at kevinh@minethatdata.com.
