Kevin Hillstrom: MineThatData: Test Design and Analysis

From time to time, I'm asked to talk about designing tests.

I can honestly say that I wouldn't have my own business, had I (or the teams I led or worked with) not executed somewhere between five hundred and a thousand tests during my career at Lands' End, Eddie Bauer, and Nordstrom. I learned more from 500 tests than I learned from 7,000 days of work at the companies I worked for.

All of the secrets of your business are embedded in tests, not in KPIs or metrics or dashboards or analytical work. Few people actually talk about this fact ... you learn the most when you change things, when you vary your strategy. It turns out that customers react differently, depending upon the marketing plan you implement.

Type of Test: The most common test is the A/B test. You test a control strategy, the strategy you are currently executing, against a new strategy. There are more complex designs. Many folks use Fractional Factorial Designs. My favorite is a Factorial Design, something I learned in college and used at the Garst Seed Company prior to taking a job in 1990 at Lands' End. My favorite test of all time is a 2^7 design we executed at Lands' End in 1993-1994. We tested every combination of seven different business unit mailings, 128 combinations in all (main catalogs yes/no by home catalogs yes/no by men's catalogs yes/no by women's catalogs yes/no by clearance catalogs yes/no by sale inserts in catalogs yes/no by targeted inserts into main catalogs yes/no). We executed this test for a full year ... yes, a full year. The results were mind-blowing. I still have the results of the analysis sitting on my desk; the results were that important, that revolutionary, that spectacular. If you want to divide a room of Executives against each other, provide them the results of a 2^7 factorial design that does not yield test results congruent with existing "best practices".
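
To make the size of a 2^7 design concrete, here is a minimal Python sketch that enumerates every yes/no combination of seven mailing types. The factor names are shorthand made up for this illustration, and the function name `full_factorial` is hypothetical; nothing here reflects how the original Lands' End test was actually built.

```python
from itertools import product

# Shorthand names for the seven mailing types described above (illustrative only).
FACTORS = [
    "main_catalog",
    "home_catalog",
    "mens_catalog",
    "womens_catalog",
    "clearance_catalog",
    "sale_insert",
    "targeted_insert",
]


def full_factorial(factors):
    """Return every yes/no combination of the factors as a list of dicts."""
    return [
        dict(zip(factors, levels))
        for levels in product([False, True], repeat=len(factors))
    ]


cells = full_factorial(FACTORS)
print(len(cells))  # 128 test cells, one per combination
```

Each dict is one test cell. Customers would then be randomly assigned to cells, which is why a full factorial of this size needs a large file and a long testing window.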

Sample Size: I'll focus on A/B tests here, as the concepts apply to more complex designs, but become harder to explain. The most commonly asked question is "how big should my sample size be?" The Google Analytics Generation likes to focus on "conversion rate". Most web analytics tools are calibrated to measure anything, but strongly guide the user toward measuring conversion rate. This is a problem. You see, "response" or "conversion" are "1/0" variables, binary in nature. Binary variables don't have a lot of variability associated with them, as the value can only be 1 or 0. As a result, you don't need a lot of customers/outcomes in your sample, in order to achieve statistical significance. However, and especially for those of us in catalog/retail/e-commerce, we're not interested in measuring conversion or response ... no, we want to measure something more important, like $/customer. This is more problematic. Spending varies much more than response (1/0). A customer could spend $50, a customer could spend $100, a customer could spend $200. Now, there are tons of sample size calculators available on the internet, so do a Google search and find one. What is important is that you pre-calculate the variance of customer spend. Here's what you can do. Take any 12-month buyer through, say, October 31. Then, for this audience, calculate the average amount spent, per customer, in November. Most customers will spend $0. Some will spend $50, some $100. Calculate the variance of this $/customer metric. The variance will be used in your sample size calculations. It has been my experience that you need about 4x as many customers in a test sample to detect spending differences as you need to detect response/conversion differences.
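
The variance step can be sketched in Python. This is a rough illustration, not a replacement for a proper sample size calculator: it uses the standard normal-approximation formula for comparing two means, n per group = 2 * (z_alpha + z_beta)^2 * variance / difference^2. The spend data, the 5% significance level, the 80% power, and the $1.00 minimum detectable difference are all assumptions made up for the example.

```python
import math
from statistics import NormalDist, pvariance


def sample_size_per_group(variance, min_detectable_diff, alpha=0.05, power=0.80):
    """Approximate customers needed per group to detect a difference in means.

    Standard two-sided, two-sample normal approximation; alpha and power
    defaults are common conventions, not values taken from the article.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n = 2 * (z_alpha + z_beta) ** 2 * variance / min_detectable_diff ** 2
    return math.ceil(n)


# Hypothetical November spend for 100 twelve-month buyers:
# most spend $0, a few spend $50, $100, or $200.
november_spend = [0] * 90 + [50] * 5 + [100] * 3 + [200] * 2

spend_variance = pvariance(november_spend)
n_needed = sample_size_per_group(spend_variance, min_detectable_diff=1.00)
print(f"Variance of $/customer: {spend_variance:,.2f}")
print(f"Customers needed per group: {n_needed:,}")
```

Because the required sample grows with the variance and shrinks with the square of the difference you want to detect, halving the detectable difference roughly quadruples the sample.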

Unequal Samples: It is acceptable to have 10,000 in one test group and 25,000 in another test group. Surprisingly, many Executives have a hard time understanding this concept; they cannot conceive of comparing test groups unless the sample sizes are equal. I tend to use equal sample sizes for just this reason ... no need to get into an argument about why one group has 10,000 and another 25,000, even if you were trying to minimize the opportunity cost of a test and generate more profit for your company.

Opportunity Cost: I once had a boss who gave me a "testing budget" ... I could execute any test I wanted, as long as I didn't hurt annual sales by more than 3%. This is a very reasonable request from a forward-thinking CEO. Your best bet is to calculate the amount of demand/sales you'll lose by not executing the normal strategy. If you are going to hold out 50,000 e-mail addresses for one campaign, you will likely lose 50,000 * $0.20 = $10,000 demand, and maybe $3,500 profit, by not sending the e-mail campaign to 50,000 customers. Be honest and up-front about this. At the end of the year, show your Executive team how much sales and profit were lost by your various testing strategies, then show your Executive team how much you learned, quantifying the increase in sales and profit on an annual basis, given what your tests taught you. Make sure that what you learned offsets what you spent to execute tests!
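
The holdout arithmetic above can be written as a small Python sketch. The 35% profit rate is inferred from the $3,500 and $10,000 figures in the example, the $50 million annual sales figure is invented, and both function names are hypothetical.

```python
def holdout_opportunity_cost(holdout_size, demand_per_customer, profit_rate):
    """Return (lost_demand, lost_profit) for customers held out of a campaign."""
    lost_demand = holdout_size * demand_per_customer
    lost_profit = lost_demand * profit_rate
    return lost_demand, lost_profit


def within_testing_budget(total_lost_demand, annual_sales, max_share=0.03):
    """Check lost demand against a budget expressed as a share of annual sales."""
    return total_lost_demand <= max_share * annual_sales


# 50,000 held-out e-mail addresses at $0.20 demand each, 35% profit rate (inferred).
lost_demand, lost_profit = holdout_opportunity_cost(50_000, 0.20, 0.35)
print(f"Lost demand: ${lost_demand:,.0f}")  # Lost demand: $10,000
print(f"Lost profit: ${lost_profit:,.0f}")  # Lost profit: $3,500

# Hypothetical annual sales of $50 million against the 3% testing budget.
print(within_testing_budget(lost_demand, annual_sales=50_000_000))  # True
```

Summing `lost_demand` across every test in a year gives the number to set against the 3% budget, and against what the tests taught you.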

Statistical Significance: I hate this concept, and I'm a trained statistician! Everything in marketing is a probability, not a certainty. Executives will teach you this. I once watched a statistician tell a Chief Marketing Officer not to implement a strategy because it failed a significance test at a 95% level. What was interesting is that the test passed at a 94% level. Focus on the probabilities associated with your test. If there is a 94% chance that one sample outperformed another sample, be willing to take a stand ... who wouldn't want to be "right" 94 times out of 100? Executives are used to making decisions that have a 52% chance of success, so when you present them with something that will be better than an existing strategy 88 times out of 100, you're going to find a receptive audience.
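
One informal way to put a number like "94 times out of 100" on a test is sketched below in Python. It uses a normal approximation to the difference in $/customer between two groups and reads the one-sided confidence level as the chance that the test group is truly better. That reading is a simplification, and every figure in the example is invented. The group sizes are deliberately unequal (10,000 and 25,000) to show that the calculation does not require equal samples.

```python
import math
from statistics import NormalDist


def prob_test_beats_control(mean_c, var_c, n_c, mean_t, var_t, n_t):
    """Approximate chance that the test strategy truly outperforms control.

    Normal approximation to the difference in means; unequal group
    sizes are handled through the standard error.
    """
    standard_error = math.sqrt(var_c / n_c + var_t / n_t)
    z = (mean_t - mean_c) / standard_error
    return NormalDist().cdf(z)


# Hypothetical results: control averages $9.50/customer on 10,000 customers,
# the test strategy averages $10.40/customer on 25,000 customers.
p = prob_test_beats_control(
    mean_c=9.50, var_c=1134.75, n_c=10_000,
    mean_t=10.40, var_t=1200.00, n_t=25_000,
)
print(f"Chance the test strategy is better: {p:.0%}")
```

Presenting the probability itself, rather than a pass/fail verdict at 95%, lets the decision maker weigh an 88% or 94% result on its merits.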

Testing and Timing, Part 1: The Google Analytics Generation have been trained to stop after sampling 2,200 customers (or 3,800 customers or however many) because at that stage, a statistically significant outcome occurs ... this is a problem, because you want to be able to measure more things than one simple result. In other words, if you let the test go through 10,000 customers instead of the 2,200 needed to detect statistical significance, you can learn "more". For instance, I like to measure results of tests across various customer segments within a test. When I have 50,000 customers in my sample, I can slice and dice the results across great customers, good customers, lousy customers, new customers! When I have 2,200 customers in my sample, I can only measure results at a macro-level, and that's unsatisfying. So often, test results differ across different customer segments. Set up your tests to measure many customers, and you can measure results across many customer segments!

Testing and Timing, Part 2: The Google Analytics Generation have been taught to iterate rapidly, meaning that they test until they achieve statistical significance, then they move on to another test, rapidly iterating toward an optimized environment. That's a good thing, don't get me wrong. I prefer to complement this strategy by extending the testing window. For instance, if you executed an e-mail holdout test for one of your one-hundred campaigns, you only truly learn what happened in that one test over a three-day window. The results of the test may not hold up in different timeframes, the results of the test may be highly influenced by variability. The longer your test is conducted, the less variability you have in the results, and therefore, the more confident you are in the outcome of your test.

Controlling For Other Factors: I can't tell you how many people get this concept wrong. As long as each test group is randomly selected from the same population, and your random number calculator hasn't gone bonkers, you don't have to control for other factors. I'll give you an example. I met a marketer who wanted to do an A/B test in e-mail, and wanted to exclude customers who recently received a 20% off promotion because they would "skew the results of the test". Random selection puts roughly the same share of these customers in each test group, so their effect cancels out when you compare the groups! This e-mail marketing expert, however, disagreed 100%, hypothesizing that the interaction between the previous discount and the new strategy would yield an unexpected outcome that would bias the test. Don't get trapped by this form of lizard logic. As long as your groups are randomly selected from the same population, you're fine.
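
A small simulation illustrates why random selection takes care of this. The Python sketch below builds a made-up population of 100,000 customers, flags roughly 20% of them as having received the earlier promotion, splits the file at random, and compares the share of promotion recipients in each group. The population size, the 20% rate, the seeds, and the function name `random_split` are all assumptions for illustration.

```python
import random


def random_split(customer_ids, seed=0):
    """Shuffle the customers and split them into two equal-sized groups."""
    rng = random.Random(seed)
    shuffled = list(customer_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical file: 100,000 customers, about 20% received the 20% off promotion.
population_rng = random.Random(42)
got_promo = {cid: population_rng.random() < 0.20 for cid in range(100_000)}

group_a, group_b = random_split(got_promo.keys(), seed=7)

share_a = sum(got_promo[cid] for cid in group_a) / len(group_a)
share_b = sum(got_promo[cid] for cid in group_b) / len(group_b)
print(f"Promo recipients in group A: {share_a:.1%}")
print(f"Promo recipients in group B: {share_b:.1%}")
```

With groups this large the two shares land within a fraction of a percentage point of each other, so the earlier promotion affects both groups about equally and drops out of the comparison.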

Sharing Results: Back in 1993 at Lands' End, we "typed-up" all test results, saving them in a black binder. All test results were circulated to all Director/VP/SVP/CXO level individuals, using the same template, so that the results could be evaluated on a common platform. Today, folks use wikis and internal blogs/websites and files on network drives to store results. Storing and sharing results is important. And avoid a mistake I commonly made 20 years ago: keep your political thoughts out of the test writeup! Don't tell the Creative Director that his idea was "stupid", as verified by the results of the test. Let the data speak for itself!

Retest: Just because you learned in 2001 that catalog marketing to marginal customers yielded poor results doesn't mean that it will yield poor results today. Retest your findings. If you do, you're going to learn a lot about sampling error!

Geek Speak: Don't use it. SAS Programmers from 1990, Business Intelligence Analysts from 2000, and Web Analysts in 2010 all make/made the same mistake. Stay away from the geeky, mathematical details of your test, and focus on the sales, profit, staffing, and workflow outcomes of your test. Your Executive Team cares that you use best practices in your test design; they don't care that a paired t-test yielded a one-tailed outcome that is significant at a p<0.02 level. Your Executive Team does care that the findings of your test yield $129,000 of incremental, annual profit. Your Executive Team does care that the results of the test suggest that 8% of the staff be downsized. So focus on what is important to your Executive Team, using language that your Executive Team understands.

Get Out Of Analytics/Marketing: Study what folks in academia and agriculture and clinical trials do. This is a good way to see the impact of Geek Speak. You'll find their writeups to be incomprehensibly obtuse; then imagine how an outsider would feel reading your writeups!

Conflict: Not everybody is going to embrace your analysis. You have employees who get to keep their job, as is, by not testing any new strategies. Anytime you prove that the current "best practice" doesn't work as well as a "new practice", you are going to be disliked, doubted, disrespected, dishonored, demeaned, you name it. You will be pelted with every conceivable and illogical argument on the planet. You'll be told you executed the test incorrectly, you'll be told your analysis is wrong, you'll be told that the test was tested at the wrong time of the year to the wrong audience, you'll be told that you are right but you've been wrong in the past so therefore maybe you really should stop testing altogether. You'll be banned from meetings. You'll be censured (it happened to me) by an Executive. You may even be fired. When people get political and angry with you, don't get defensive (I've gotten defensive, no good folks, no good), focus instead on being the "voice of the customer". Just remind everybody that this is how the customer responded, it's as plain and simple as that.

Test Things With A Long Half-Life: This is an important concept that few understand. People will test a black Arial font on a yellow background vs. a blue Times New Roman font on a white background. I'm not saying you shouldn't test this. I'm saying that if you have a limited testing budget, focus your efforts on things that have significant strategic value; test things that yield outcomes that last for years, not weeks.

Future, Not Past: Instead of talking about the results of the test, talk about what the results mean to the future of your business. The Google Analytics Generation have not been given the tools to forecast five years into the future. Use your test results to illustrate what your business looks like in 2014 because you implemented the findings of a test you executed last week.

Ok, your turn. Use the comments section to publish your tips for test design and analysis!

4 comments:

Linda Bustos 9:11 AM
Hi Kevin,

I have a question about "Controlling for Other Factors."

For example, we often will exclude a geographical segment because they are not the target of the test/promotion. It might not "skew" to have them in due to randomization, but it would not give us the insights we want since we will reach a statistical significance faster than if we tested with our "pure" population. We would actually have a statistically insignificant result among the segment that matters. I would argue the same for the exclusion of returning visitors or those who have seen a coupon before (depending on what variables you are actually testing for, whether that would affect anything or not) - the idea being, better to segment out a population that will dilute your true testing population. Is this lizard logic?

Appreciate your comments,

Linda Bustos
MineThatData 9:22 AM
The answer is "it depends".

If an audience is not going to be part of the "roll out" of your test results, then it is ok in my opinion to exclude the audience. For instance, if you were not going to implement the findings of a test in Quebec, then it is ok to exclude folks from Quebec in your test.

If you exclude a population because you believe that the population biases the results of the test, but that population is in the roll-out of the results of the test, then you as a tester have introduced a bias while at the same time you think you are eliminating a bias.

Ask a lot of experts, you'll get a lot of different answers. I am not in favor of eliminating audiences that will be included in the roll-out of the test results. I am fine with eliminating audiences that will not be included in the roll-out of test results. That's my opinion, not necessarily right or wrong.
Dennis 12:44 PM
I love this. It is so important to test, for every type of business. Most businesses that have been around for many years run on autopilot and need testing to shake up the business with client info they did not even know.

New businesses like ours need testing to figure out what users want, how they see a product, what they are using, and how they respond.

I always get excited when testing affects a funnel in such a way that it makes my day :-)
MineThatData 5:05 PM
Thank you, Dennis!

December 22, 2010
