July 20, 2010

A/B Testing Gone Bad

For many of us, testing website strategies is a matter of running simple "A/B" tests.

This process seems easy enough. There are automated algorithms that compute how many conversions are required before one strategy can be labeled as the "winner" against another strategy.

Most of the time, this process is done on the basis of "conversions" ... did a customer buy something, did a customer subscribe to something ... you get the picture.

And most of the time, this strategy yields an outcome that is wrong.

Yup, you heard it here first. Your conversion-based A/B tests, while yielding a statistically significant outcome from a conversion rate standpoint, are yielding an "incorrect" outcome when evaluating what matters ... spend per visitor.

Here's why ... those of us in e-commerce are measuring "conversion" when we should be measuring "dollars per visitor". And "dollars per visitor" is the product of two metrics:
  • Conversion Rate.
  • Dollars per Conversion.
When you multiply variability (conversion rate) by variability (dollars per conversion), you get ... VARIABILITY! Lots of variability!
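
As a quick numeric illustration (a minimal sketch with made-up numbers, not data from this post), the snippet below checks the product identity and decomposes the variance of dollars per visitor into a piece driven by order-value spread and a piece driven by conversion spread. The 3% conversion rate and $30-$170 order values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 100_000                                 # hypothetical visitors in one arm
converted = rng.random(n) < 0.03            # assumed ~3% conversion rate
order_value = rng.uniform(30, 170, size=n)  # assumed $30-$170 order values
revenue = np.where(converted, order_value, 0.0)  # dollars per visitor; $0 for non-buyers

p = converted.mean()                        # conversion rate
aov = revenue[converted].mean()             # dollars per conversion
print("dollars per visitor:", revenue.mean(), "=", p * aov)

# The variance of dollars per visitor compounds BOTH sources of variability:
# order-value spread (p * Var(V)) and conversion spread (p * (1 - p) * E[V]^2).
var_from_orders = p * order_value.var()
var_from_conversion = p * (1 - p) * aov ** 2
print("empirical variance :", revenue.var())
print("decomposed variance:", var_from_orders + var_from_conversion)
```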

Take a look at this example: a test with eleven customers in each group (for illustrative purposes only).

[Table: eleven test customers and eleven control customers, one row per customer, showing whether the customer converted and the amount spent.]

The test group clearly outperforms the control group (7/11 conversions vs. 2/11 conversions).

Also notice that the average amount spent per conversion is $100, the same in each group.

However, the variability associated with spending amounts (they vary between $30 and $170 per order in both the test group and the control group) causes the t-test to yield a result that is not nearly as statistically significant as when measuring conversion rate alone.
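
For readers who want to reproduce this kind of comparison, here is a minimal Python sketch. The individual order amounts are assumptions chosen only to match the description above (7/11 vs. 2/11 conversions, orders between $30 and $170 averaging $100 per conversion); they are not the actual table values.

```python
import numpy as np
from scipy import stats

# Assumed order amounts consistent with the example, not the post's actual table.
test_orders    = [30, 60, 80, 100, 120, 140, 170]  # 7 converters, mean $100
control_orders = [30, 170]                         # 2 converters, mean $100

# Per-visitor revenue: the non-converters in each group of eleven count as $0.
test_rev    = np.array(test_orders + [0.0] * (11 - len(test_orders)))
control_rev = np.array(control_orders + [0.0] * (11 - len(control_orders)))

# Conversion-rate comparison (Fisher's exact test suits these small counts).
_, p_conversion = stats.fisher_exact([[7, 11 - 7], [2, 11 - 2]])

# Dollars-per-visitor comparison (two-sample t-test, unequal variances).
_, p_revenue = stats.ttest_ind(test_rev, control_rev, equal_var=False)

print("p-value, conversion rate    :", p_conversion)
print("p-value, dollars per visitor:", p_revenue)
```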

Well guess what? This happens ... all of the time! Every time you run one of those software-based automated conversion rate A/B tests that measures responses until one outcome is statistically significant, you guarantee that measurement based on dollars per visitor is not going to be significant ... in other words, every test you are running gives you a statistically insignificant outcome!!

So do me a favor ... please, I beseech thee ... please evaluate your A/B tests against the statistical significance of dollars per visitor, not against conversion rate alone.

You will learn that you need 2x - 4x as many customers in your test, but your results will be valid. As they are being executed now, many of the tests out there are yielding garbage for results ... and this partially explains why a decade of conversion rate optimization has yielded websites that, by and large, convert fewer customers than in the past.
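
If you want a feel for that multiplier on your own site, a rough simulation like the sketch below can estimate it. Every input here (2.0% vs. 2.5% conversion rates, $30-$170 order values, 500 simulated tests) is an assumption to be replaced with your own numbers; the gap between the two tests depends on how variable your order values are.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def power(n_per_arm, trials=500):
    """Share of simulated tests reaching p < 0.05 for each metric."""
    wins_conv, wins_rev = 0, 0
    for _ in range(trials):
        conv_a = rng.random(n_per_arm) < 0.020  # assumed control conversion rate
        conv_b = rng.random(n_per_arm) < 0.025  # assumed test conversion rate
        rev_a = np.where(conv_a, rng.uniform(30, 170, n_per_arm), 0.0)
        rev_b = np.where(conv_b, rng.uniform(30, 170, n_per_arm), 0.0)

        # Conversion-rate comparison: chi-square on the 2x2 counts.
        counts = [[conv_a.sum(), n_per_arm - conv_a.sum()],
                  [conv_b.sum(), n_per_arm - conv_b.sum()]]
        _, p_conv, _, _ = stats.chi2_contingency(counts)

        # Dollars-per-visitor comparison: two-sample t-test, unequal variances.
        _, p_rev = stats.ttest_ind(rev_a, rev_b, equal_var=False)

        wins_conv += p_conv < 0.05
        wins_rev += p_rev < 0.05
    return wins_conv / trials, wins_rev / trials

for n in (5_000, 10_000, 20_000):
    print(n, "visitors per side:", power(n))
```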

14 comments:

  1. Anonymous 7:14 AM

    I think you're missing a column of data in your table (control group spend)?

  2. It is there now ... depending upon the browser you are using, it either appeared or did not appear, sorry.

  3. Your point that tracking revenue is important is a good one, Kevin. It's especially important for catalog eCommerce sites (and less so for single product eCommerce landing pages where, in many cases, average order value may have low variability).

    However, this statement doesn't seem accurate: "Every time you run one of those software-based automated conversion rate A/B tests that measures responses until one outcome is statistically significant, you guarantee that measurement based on dollars per visitor is not going to be significant".

    Conversion rate and dollars per visitor are not inversely correlated. When we track tests using conversions and revenue per visitor, we often see the same variation producing the best results for *both* metrics. They can both be calculated for the same test.

    Your conclusion is also a little over dramatic when you say, "this partially explains why a decade of conversion rate optimization has yielded websites that, by and large, convert fewer customers than in the past." More likely, it's because companies are testing inconsequential tips & tricks (think button color), rather than analyzing the persuasive experience of the visitor.

    We also find that any testing is better than no testing. Once a company gets addicted to making data-based design and content decisions they are taking the first step on the path to ever-increasing sophistication in their metrics.

    I would encourage a company to get started with testing and improve from there. Avoid the "paralysis by analysis" and insecurity from thinking that you have to boil the ocean to get any value from testing.

  4. Provide the audience with a dataset of 20,000 customers involved in a test, 10,000 seeing one landing page, 10,000 seeing the other landing page.

    Show the statistical test for conversion rate, and for $ per visitor. Show the audience the variance and standard errors calculated for each metric.

    You have the data to do this!!

  5. Kevin, your analysis is spot on with most of the test results I myself have witnessed.

    Google Website Optimizer's lack of $ per conversion tracking is further evidence that adding this variable makes A/B tests significantly more variable, and therefore makes their "product" harder to use. But it is also leading to tons of false conclusions for retailers.

    Nearly every time GWO tells me I have a significant winner, once I factor in $ per conversion the winner is far from significant.

  6. Hi Justin, thanks for the comment. As a practitioner, you get to see what happens simply by hand-calculating the metrics, not by trusting a software product.

    And, as a result, you get to learn more, too!

  7. If you have the ability to calculate profit or margin, you should do that as well, especially if you are testing discounts. Be careful with a t-test as well, as it assumes a normal distribution around the value you are measuring, which may not always be true.

  8. Dan --- true enough. If you pull random samples from your original population and measure the variance, then do a simple histogram of the variance, you'll know!

  9. Anonymous 3:43 PM

    Hi Kevin,

    Can you explain what software or statistical models you use to run the A/B conversion rate tests against the statistical significance of dollars per visitor?

    I recognize the need for that, but I am not aware of software that can do that.

  10. I use SPSS to do almost all of my statistical work. SAS works well, too.

  11. Anonymous 11:16 AM

    Hi Kevin, which statistical test do you use to evaluate the difference in RPV? Most of the t-tests used by A/B significance calculators do not work when RPV exceeds $1. Is there a particular test that should be used?

  12. I'm fine with a standard t-test ... I use SPSS software to analyze test results. I can get revenue per visitor of $10 or more, doesn't matter. Online calculators are suspect.

    Replies
    1. Anonymous 8:40 AM

      Thanks Kevin. I guess you need the raw data though right? I was wondering if there was a test that could measure significance based on rolled up # visitors participating in each cell and the revenue per cell.

    2. Once you roll up the data, you lose the ability to measure variance, which makes it impossible to calculate a statistical test.

      There are ways to cheat, to estimate variance, but that gets kind of advanced for most people.

