November 08, 2010

A/B Testing: Persecution

One thing is certain to happen when you advocate A/B tests over traditional metrics to measure the performance of marketing campaigns.

You are going to be persecuted.

You're going to identify customer behavior that is not identified by traditional metrics.  In fact, what you identify will directly contradict what is "known" when measured by traditional metrics.

In 1994, at Lands' End, we used a year-long 2^7 factorial design to measure the impact of all of our catalog titles on overall profitability.  Traditional metrics (response) showed that customers were leaking out of our main title.  Holdout tests showed that newer catalog titles were not generating a lot of incremental demand, and were largely unprofitable.  Those on the "wrong side" of the test results weren't pleased.  Eventually, those on the "wrong side" of the test results won the battle for Management, the least profitable strategy was pursued, and for good reason ... traditional metrics looked better.  Holdout tests showed that the most profitable strategy was very different, and not very popular.
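For readers who haven't run one, a 2^7 factorial design crosses seven on/off factors (here, presumably seven catalog titles, each either mailed or held out), which gives 128 customer cells.  The sketch below is a minimal Python illustration and not the Lands' End analysis: the title names and profit figures are made-up placeholders, and `main_effect` is a hypothetical helper that compares average profit with and without one title.

```python
from itertools import product

# Placeholder names: the post does not list the seven catalog titles.
TITLES = [f"title_{i}" for i in range(1, 8)]


def full_factorial(factors):
    """Return one dict per test cell; 1 = title mailed, 0 = title held out."""
    return [dict(zip(factors, levels))
            for levels in product((0, 1), repeat=len(factors))]


def main_effect(cells, profit_by_cell, factor):
    """Mean profit of cells that mailed `factor` minus mean profit of cells that did not."""
    mailed = [p for c, p in zip(cells, profit_by_cell) if c[factor] == 1]
    held = [p for c, p in zip(cells, profit_by_cell) if c[factor] == 0]
    return sum(mailed) / len(mailed) - sum(held) / len(held)


cells = full_factorial(TITLES)  # 2**7 = 128 cells

# Made-up profit per customer for each cell, purely for illustration:
# title_1 adds 5, title_7 loses 2, the other titles add nothing.
profit = [10 + 5 * c["title_1"] - 2 * c["title_7"] for c in cells]

for title in TITLES:
    print(title, main_effect(cells, profit, title))
```

Because every combination of titles is present, each title's effect is averaged over all settings of the other six, which is what lets a design like this show whether a title adds demand on its own.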

In 1998 at Eddie Bauer, we executed a six-month-long promotional holdout test.  Customers in the holdout panel were not allowed to receive free shipping or 20% off promotions.  Traditional metrics showed that discounts and promotions worked tremendously well.  A/B holdout tests showed that customers spent the exact same amount, promos or no promos.  Guess what?  The test panel not offered promotions was far more profitable, though the test panel that received promotions looked much better via traditional response metrics.  Though nobody believed the test results (look, our response metrics show that promotions work, your test is wrong!), I was in a position of authority, so I could execute a strategy without promotions.  We had the most profitable year in the history of the catalog/online division in 1999.

In 2001 at Nordstrom, we executed mail/holdout tests to measure the impact of catalog marketing on the online channel.  It was a well-known "best practice" to mail catalogs, because catalogs caused customers to buy online, as proven by a methodology called "matchbacks", the early version of "attribution models" that we hear about today.  Our test results showed something interesting ... when a customer stopped receiving catalogs, the customer started buying online on her own, so the catalog was being given too much credit for online orders that were "matched back" to the catalog.  In essence, we could mail fewer catalogs and generate a lot more profit.  Needless to say, leadership aligned with catalog marketing hated what the tests illustrated, and fought the results tooth-and-nail.  A few years later, the catalog strategy was discontinued, and profit significantly increased, a result contrary to what traditional response metrics and attribution algorithms suggested would happen.
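The gap between matchback credit and what a holdout measures comes down to simple arithmetic.  Here is a rough Python sketch with hypothetical per-customer figures (not Nordstrom data).  It assumes a simplified matchback that credits the catalog with every online order from a mailed customer; real matchback rules typically use a time window, so treat `compare_attribution` as an illustration only.

```python
def compare_attribution(mailed_online, holdout_online):
    """Compare matchback credit with holdout-measured lift.

    Both inputs are average online demand per customer:
    mailed_online  -> customers who received the catalog
    holdout_online -> comparable customers who did not
    """
    matchback_credit = mailed_online              # simplified: all matched orders credited
    incremental = mailed_online - holdout_online  # demand the catalog actually added
    return {
        "matchback_credit": matchback_credit,
        "incremental": incremental,
        "over_credited": matchback_credit - incremental,
        "organic_share": holdout_online / mailed_online,
    }


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
result = compare_attribution(mailed_online=10.0, holdout_online=8.0)
print(result)
```

In this made-up case the holdout group spends most of what the mailed group spends online without ever seeing a catalog, so most of the matchback credit is demand that would have happened anyway.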

In my consulting practice, I continually see cases where traditional metrics (response, opens, clicks, conversion rate) yield outcomes that are contrary to what you see when you execute holdout tests.  In every case in my career that I can remember, when leadership chose to listen to what holdout tests were indicating, profit increased.

In most cases, however, folks choose to align with response/open/click/conversion-rate metrics instead.  They persecute those who advocate A/B and factorial testing with what a former boss called "lizard logic".  Lizard logic, of course, is an older version of 'truthiness'.
  • "Your sample sizes are too small and probably aren't statistically significant".
  • "Your samples probably have better customers in one group than in another, invalidating your test.  How can you prove your samples are 100% statistically random?  You can't.  Therefore, your results are invalid."
  • "Measurement via open/click/convert is a best practice, are you telling me that best practices are wrong?  If you are telling me that best practices are wrong, then you are telling me that I am wrong, and I take offense with your arrogant claim.  Maybe YOU are wrong!"
  • "I've been an analyst/manager/executive for ten years, are you telling me that I don't know what I'm doing?"
  • "You only tested for one month, your results won't hold for a year, try testing for a year."
  • "You tested for a year, your results are biased because we'd never execute that strategy, try testing for one month."
  • "Digital marketing is different, you can't apply A/B testing to digital marketing, attribution is so much more complicated than what can be measured via an A/B test."
The persecution will come from anybody who benefits from flawed metrics.  You really can't blame people for persecuting you.  Their livelihood is being threatened, their metrics tell them that they are right, and often, they simply don't understand your testing strategy or methodology.
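The first two objections in the list are the ones you can answer with a little arithmetic.  Below is a minimal Python sketch, under stated assumptions: `assign_groups` and `two_sample_z` are hypothetical helper names, the spend figures are made up, and the significance check is a large-sample normal approximation rather than the only valid test.  Random assignment with a recorded seed addresses the "better customers in one group" worry, and the z-test addresses the significance worry.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance


def assign_groups(customer_ids, holdout_share, seed):
    """Shuffle customers with a fixed seed and split them into (mailed, holdout)."""
    ids = list(customer_ids)
    random.Random(seed).shuffle(ids)
    cut = round(len(ids) * holdout_share)
    return ids[cut:], ids[:cut]


def two_sample_z(a, b):
    """Large-sample z-test for a difference in mean spend.

    Returns (z, two-sided p-value). Assumes both samples have nonzero variance.
    """
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = (mean(a) - mean(b)) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Random assignment: nobody hand-picks who lands in the holdout.
mailed, holdout = assign_groups(range(1000), holdout_share=0.10, seed=2010)

# Made-up spend per customer, only to show the call; a real test needs far more rows.
mailed_spend = [0, 0, 0, 120, 0, 45, 0, 0, 80, 0]
holdout_spend = [0, 0, 95, 0, 0, 0, 60, 0, 0, 70]
z, p = two_sample_z(mailed_spend, holdout_spend)
print(len(mailed), len(holdout), round(z, 2), round(p, 2))
```

Keeping the seed on file means anyone who doubts the split can regenerate it and compare the groups on prior-year spend.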

People execute A/B tests incorrectly all of the time --- they optimize for conversion rate and not for profit, their sample sizes are too small because they are sampling for conversions instead of profit dollars ... but that being said, their results are usually more reliable than those obtained via classic response/open/click/conversion metrics.
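To see why sampling for conversions can leave you short when the goal is profit, here is a small Python sketch using the standard two-group sample size approximation for comparing means.  All inputs are hypothetical assumptions (a 2% conversion rate, a $15 standard deviation in profit per customer, and the minimum differences worth detecting); plug in your own figures.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_group(sd, min_detectable_diff, alpha=0.05, power=0.80):
    """Approximate customers needed per group to detect a difference in means.

    Uses n = 2 * ((z_alpha/2 + z_power) * sd / diff)^2, a normal approximation
    for a two-sided test with equal group sizes.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sd / min_detectable_diff) ** 2)


# Hypothetical: 2% conversion rate, want to detect a 0.2 point change.
conv_rate = 0.02
n_conversion = n_per_group(sd=sqrt(conv_rate * (1 - conv_rate)),
                           min_detectable_diff=0.002)

# Hypothetical: profit per customer has a $15 standard deviation,
# and a $0.10 per-customer change is worth detecting.
n_profit = n_per_group(sd=15.0, min_detectable_diff=0.10)

print(n_conversion, n_profit)
```

Profit per customer is mostly zeros with a few large orders, so its variance is large relative to the difference you care about, and under these assumptions the profit-based sample comes out several times larger than the conversion-based one.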

I'm not saying you shouldn't measure via response/open/click/conversion.

I am saying that traditional response/open/click/conversion metrics give misleading results.

I've analyzed more than 500 tests in my career, after earning a degree in statistics.  I know a little something about this.  A/B and factorial tests do not lead you astray.  Instead, they illustrate secrets that your customers are telling you about how they like to respond, secrets that cannot be illustrated via traditional metrics.

A/B tests allow you to listen to your customers.  Isn't that the most important thing?

When you go down the testing route, you will be persecuted by others.

If you love traditional metrics and are not a fan of testing, give Anne Holland's "Which Test Won" website a try.  See if you can correctly guess which strategy performs better ... each strategy was A/B tested.  If you can guess 85% or more of the tests correctly, then maybe you don't need to believe in the results of A/B tests, because your gut instinct and business knowledge are just that good.

But if you aren't that good, then why not give A/B tests a try?  Why not believe what the customer tells you?  You're going to generate more profit by measuring via A/B tests.  Do you love profit and knowledge, or do you love traditional metrics?