Kevin Hillstrom: MineThatData: A/B Testing Gone Bad

July 20, 2010

A/B Testing Gone Bad

For many of us, testing website strategies is a matter of running simple "A/B" tests.

This process seems easy enough. There are automated algorithms that compute how many conversions are required before one strategy can be labeled as the "winner" against another strategy.

Most of the time, this process is done on the basis of "conversions" ... did a customer buy something, did a customer subscribe to something ... you get the picture.

And most of the time, this strategy yields an outcome that is wrong.

Yup, you heard it here first. Your conversion-based A/B tests, while yielding a statistically significant outcome from a conversion rate standpoint, are yielding an "incorrect" outcome when evaluating what matters ... spend per visitor.

Here's why ... for those of us in e-commerce, we are measuring "conversion", when we should be measuring "dollars per visitor". And "dollars per visitor" is the product of two metrics.

Conversion Rate.
Dollars per Conversion.

When you multiply variability (conversion rate) by variability (dollars per conversion), you get ... VARIABILITY! Lots of variability!

Take a look at this example. This is a test, eleven customers in each group (for illustrative purposes only).

The test group clearly outperforms the control group (7/11 conversions vs. 2/11 conversions).

Also notice that the average amount spent per conversion is $100, the same in each group.

However, the variability associated with spending amounts (they vary between $30 and $170 per order in both the test group and the control group) causes the t-test to yield a result that is not nearly as statistically significant as when measuring conversion rate alone.
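The comparison above can be sketched in a few lines of Python. The order amounts below are hypothetical stand-ins, not the figures from the original table; they were chosen only to match the description (7 of 11 vs. 2 of 11 conversions, $100 average order in each group, orders between $30 and $170). The helper `welch_t` is an illustrative name, and the sketch assumes a Welch (unequal variance) t statistic, which may differ from the exact test used for the table.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for the difference in means of two samples."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


# Hypothetical order amounts, NOT the original table: each group averages
# $100 per order, with orders ranging from $30 to $170.
test_orders = [30, 50, 80, 100, 120, 150, 170]  # 7 of 11 visitors bought
control_orders = [30, 170]                      # 2 of 11 visitors bought
visitors = 11

# Dollars per visitor: visitors who did not buy count as $0.
test_dollars = test_orders + [0] * (visitors - len(test_orders))
control_dollars = control_orders + [0] * (visitors - len(control_orders))

# Conversion per visitor: 1 if the visitor bought, else 0.
test_converted = [1 if d > 0 else 0 for d in test_dollars]
control_converted = [1 if d > 0 else 0 for d in control_dollars]

t_conversion = welch_t(test_converted, control_converted)
t_dollars = welch_t(test_dollars, control_dollars)

print(f"t (conversion rate):     {t_conversion:.2f}")
print(f"t (dollars per visitor): {t_dollars:.2f}")
```

With these made-up numbers the t statistic is about 2.33 for conversion rate and about 1.84 for dollars per visitor. With roughly 19 degrees of freedom the two-sided 5% cutoff is about 2.09, so the same test clears the bar on conversion rate and misses it on dollars per visitor.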

Well guess what? This happens ... all of the time! Every time you run one of those software-based automated conversion rate A/B tests that measures responses until one outcome is statistically significant, you guarantee that measurement based on dollars per visitor is not going to be significant ... in other words, every test you are running gives you a statistically insignificant outcome!!

So do me a favor ... please, I beseech thee ... please run your A/B conversion rate tests against the statistical significance of dollars per visitor, not on the basis of conversion rate.

You will learn that you need 2x - 4x as many customers in your test, but your results will be valid. As currently executed, many of the tests out there are yielding garbage for results ... and this partially explains why a decade of conversion rate optimization has yielded websites that, by and large, convert fewer customers than in the past.
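One rough way to see where a multiplier like this can come from: for a fixed relative lift, the required sample size scales with the squared coefficient of variation of the metric. The sketch below assumes that order value is independent of whether a visitor converts and that the lift comes entirely from conversion rate. The function name, the 3% conversion rate, and the order-value spreads are all hypothetical, so treat this as an illustration rather than the calculation behind the 2x - 4x figure.

```python
def sample_size_multiplier(conversion_rate, order_cv):
    """Approximate ratio of visitors needed for a dollars-per-visitor test
    versus a conversion-rate test, for the same relative lift.

    Assumes order value is independent of conversion and the lift comes
    entirely from conversion rate. order_cv is the standard deviation of
    order value divided by its mean.
    """
    p = conversion_rate
    # Squared coefficient of variation of the 0/1 conversion indicator.
    cv2_conversion = (1 - p) / p
    # Squared coefficient of variation of dollars per visitor, from
    # Var = p*s^2 + p*(1-p)*m^2 and mean = p*m.
    cv2_dollars = (order_cv ** 2 + (1 - p)) / p
    return cv2_dollars / cv2_conversion


# Hypothetical 3% conversion rate with increasingly variable order sizes.
for cv in (0.5, 1.0, 1.5):
    print(cv, round(sample_size_multiplier(0.03, cv), 2))
```

Under these assumptions the ratio simplifies to 1 + order_cv² / (1 - p), so an order-value spread with a coefficient of variation between roughly 1.0 and 1.7 lands in the 2x - 4x range.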

14 comments:

Anonymous 7:14 AM
I think you're missing a column of data in your table (control group spend)?

MineThatData 7:26 AM
It is there now ... depending upon the browser you are using, it either appeared or did not appear, sorry.

Chris Goward 9:53 AM
Your point that tracking revenue is important is a good one, Kevin. It's especially important for catalog eCommerce sites (and less so for single product eCommerce landing pages where, in many cases, average order value may have low variability).

However, this statement doesn't seem accurate: "Every time you run one of those software-based automated conversion rate A/B tests that measures responses until one outcome is statistically significant, you guarantee that measurement based on dollars per visitor is not going to be significant".

Conversion rate and dollars per visitor are not inversely correlated. When we track tests using conversions and revenue per visitor, we often see the same variation producing the best results for *both* metrics. They can both be calculated for the same test.

Your conclusion is also a little over dramatic when you say, "this partially explains why a decade of conversion rate optimization has yielded websites that, by and large, convert fewer customers than in the past." More likely, it's because companies are testing inconsequential tips & tricks (think button color), rather than analyzing the persuasive experience of the visitor.

We also find that any testing is better than no testing. Once a company gets addicted to making data-based design and content decisions they are taking the first step on the path to ever-increasing sophistication in their metrics.

I would encourage a company to get started with testing and improve from there. Avoid the "paralysis by analysis" and insecurity from thinking that you have to boil the ocean to get any value from testing.

MineThatData 10:09 AM
Provide the audience with a dataset of 20,000 customers involved in a test, 10,000 seeing one landing page, 10,000 seeing the other landing page.

Show the statistical test for conversion rate, and for $ per visitor. Show the audience the variance and standard errors calculated for each metric.

You have the data to do this!!

Justin Palmer 11:45 PM
Kevin, your analysis is spot on with most of the test results I myself have witnessed.

Google Website Optimizer's lack of tracking for $ per conversion is only further evidence that adding this variable makes A/B test results significantly more variable, and therefore makes their "product" harder to use. However, this is leading to tons of false conclusions for retailers.

Nearly every time GWO tells me I have a significant winner, when I factor in $ per conversion the winner is far from significant.

MineThatData 6:28 AM
Hi Justin, thanks for the comment. As a practitioner, you get to see what happens simply by hand-calculating the metrics, not by trusting a software product.

And, as a result, you get to learn more, too!

dan 7:40 AM
If you have the ability to calculate profit or margin you should do that as well, especially if you are testing discounts. Be careful with a t-test as well, as it assumes that the value you are measuring is normally distributed, which may not always be true.

MineThatData 10:32 AM
Dan --- true enough. If you pull random samples from your original population and measure the variance, then do a simple histogram of the variance, you'll know!
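A resampling check along these lines might look like the sketch below. Note the swap: it draws the histogram of the resampled means rather than of the variance, since the sampling distribution of the mean is what the t-test relies on being roughly normal. The per-visitor revenue data, the 2% conversion rate, and the function names are hypothetical.

```python
import random
from statistics import mean


def bootstrap_means(values, resamples=2000, seed=0):
    """Resample the data with replacement and return the mean of each resample."""
    rng = random.Random(seed)
    n = len(values)
    return [mean(rng.choices(values, k=n)) for _ in range(resamples)]


def text_histogram(samples, bins=10, width=40):
    """Return one text line per bin, showing the bin's lower edge and a bar."""
    lo, hi = min(samples), max(samples)
    step = (hi - lo) / bins or 1.0  # guard against all-identical samples
    counts = [0] * bins
    for s in samples:
        idx = min(int((s - lo) / step), bins - 1)
        counts[idx] += 1
    peak = max(counts)
    return [
        f"{lo + i * step:8.2f} | " + "#" * round(width * c / peak)
        for i, c in enumerate(counts)
    ]


# Hypothetical per-visitor revenue: about 2% of 1,000 visitors buy,
# everyone else spends $0.
data_rng = random.Random(1)
revenue = [
    data_rng.choice([30, 60, 100, 170]) if data_rng.random() < 0.02 else 0
    for _ in range(1000)
]

means = bootstrap_means(revenue)
print("\n".join(text_histogram(means)))
```

If the bars look lopsided or lumpy rather than bell-shaped, that is a hint the sample is too small for a plain t-test on dollars per visitor.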

Anonymous 3:43 PM
Hi Kevin,

Can you explain what software or statistical models you use to run the A/B conversion rate tests against the statistical significance of dollars per visitor?

I recognize the need for that, but I am not aware of software that can do that.

MineThatData 3:53 PM
I use SPSS to do almost all of my statistical work. SAS works well, too.

Anonymous 11:16 AM
Hi Kevin, which statistical test do you use to evaluate the difference in RPV? Most of the t-tests used by A/B significance calculators do not work when RPV exceeds $1. Is there a particular test that should be used?

MineThatData 11:28 AM
I'm fine with a standard t-test ... I use SPSS software to analyze test results. I can get revenue per visitor of $10 or more, doesn't matter. Online calculators are suspect.


Kevin Hillstrom, President, MineThatData

Kevin is President of MineThatData, a consultancy that helps CEOs understand the complex relationship between Customers, Advertising, Products, Brands, and Channels. Kevin supports a diverse set of clients, including internet startups, thirty million dollar catalog merchants, international brands, and billion dollar multichannel retailers. Kevin is frequently quoted in the mainstream media, including the New York Times, Boston Globe, and Forbes Magazine.

Prior to founding MineThatData, Kevin held various roles at leading multichannel brands, including Vice President of Database Marketing at Nordstrom, Director of Circulation at Eddie Bauer, and Manager of Analytical Services at Lands' End.

You may contact Kevin at kevinh@minethatdata.com.

