Showing posts with label A/B Test.

## November 09, 2010

### A/B Testing: Here's An Example

Here's an example of what I see, over and over and over again, when evaluating A/B tests within the e-mail marketing genre.

Say you have a list of 500,000 e-mail addresses.  You send your standard campaign on a Monday.  Later in the week, you tabulate your results:
• 500,000 recipients.
• 20% open rate = 100,000.
• Of the opens, 20% click through to the website = 20,000 visit website.
• Of the clicks, 5% convert and buy something = 1,000 orders.
• Average Order Value = \$100.
• Total Demand = 1,000 * \$100 = \$100,000.
• Demand per Recipient = \$100,000 / 500,000 = \$0.20.
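The funnel arithmetic above can be sketched in a few lines of Python; the rates and counts are the illustrative figures from this example, not benchmarks:

```python
# Open/click/convert funnel from the example above (illustrative figures).
recipients = 500_000

opens = recipients * 0.20        # 20% open rate -> 100,000 opens
clicks = opens * 0.20            # 20% of opens click -> 20,000 visits
orders = clicks * 0.05           # 5% of clicks convert -> 1,000 orders
demand = orders * 100.0          # $100 average order value -> $100,000
demand_per_recipient = demand / recipients  # -> $0.20
```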
Here's one of the usual outcomes, when measuring e-mail marketing campaigns via A/B tests.  You hold out a big quantity, so that you can accurately measure the spend with confidence.
• Mailed Group = 400,000 Recipients, \$300,000 spent = \$0.75 per customer.
• Holdout Group = 100,000 Held Out, \$45,000 spent = \$0.45 per customer.
• Incremental Lift = \$0.75 - \$0.45 = \$0.30 per customer.
By the way, yes, I realize many of you want to apply significance tests and confidence intervals and all that stuff; go ahead and do so.
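The mail/holdout arithmetic works the same way; a minimal sketch, again using the illustrative figures above:

```python
# Mail/holdout comparison from the example above (illustrative figures).
mailed_recipients, mailed_demand = 400_000, 300_000.0
holdout_recipients, holdout_demand = 100_000, 45_000.0

mailed_rate = mailed_demand / mailed_recipients     # $0.75 per recipient
holdout_rate = holdout_demand / holdout_recipients  # $0.45 per recipient
incremental_lift = mailed_rate - holdout_rate       # $0.30 per recipient
```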

This is why I'm not a fan of open/click/conversion.  A mail/holdout test proves the actual value of an e-mail marketing campaign.  In this case, we observe \$0.30 lift, whereas open/click/conversion yields \$0.20 lift.

E-mail marketers, why would you not want to know that your campaigns are working 50% better than when measured via opens/click/conversion?

Just as often, the results aren't optimistic.
• Mailed Group = 400,000 Recipients, \$300,000 spent = \$0.75 per customer.
• Holdout Group = 100,000 Held Out, \$75,000 spent = \$0.75 per customer.
• Incremental Lift = \$0.75 - \$0.75 = \$0.00 per customer.
So often, opens/clicks/conversion takes credit for orders that would have happened anyway.  This is a very difficult concept for the non-testing audience to grasp.  You see, customers will order regardless of whether you market to them or not.  In some companies, more than 80% of orders will happen without marketing.  In other companies, less than 20% of orders will happen without marketing.  I've measured both instances; strategically, you end up taking very different marketing approaches depending on which end of the spectrum you're on.

Here's another tidbit.  You usually see the \$0.30 outcome, or you see the \$0.00 outcome ... you seldom see the numbers tie out with opens/clicks/converts.  Furthermore, there isn't a ton of variability ... so if you start to see the \$0.30 outcome, you're likely to see a result that is consistently better than opens/clicks/converts, or vice versa.  Consistency of results will happen if you pick a control group that is large enough to be stable.  You don't want a control group of 5,000 customers, you need big numbers in order to get "big reliability"!

This is why you have to execute A/B or multivariate or factorial tests.  You need to measure how much of your business will happen without marketing.  Classic open/click/conversion metrics really struggle with this topic.

## November 08, 2010

### A/B Testing: Persecution

One thing is certain to happen when you advocate A/B tests over traditional metrics to measure the performance of marketing campaigns.

You are going to be persecuted.

You're going to identify customer behavior that is not identified by traditional metrics, in fact, what you identify will directly contradict what is "known" when measured by traditional metrics.

In 1994, at Lands' End, we used a year-long 2^7 factorial design to measure the impact of all of our catalog titles on overall profitability.  Traditional metrics (response) showed that customers were leaking out of our main title.  Holdout tests showed that newer catalog titles were not generating a lot of incremental demand, and were largely unprofitable.  Those on the "wrong side" of the test results weren't pleased.  Eventually, those on the "wrong side" of the test results won the battle for Management, the least profitable strategy was pursued, and for good reason ... traditional metrics looked better.  Holdout tests showed that the most profitable strategy was very different, and not very popular.

In 1998 at Eddie Bauer, we executed a six-month long promotional holdout test.  No customers were allowed to receive free shipping or 20% off promotions.  Traditional metrics showed that discounts and promotions worked tremendously well.  A/B holdout tests showed that customers spent the exact same amount, promos or no promos.  Guess what?  The test panel not offered promotions was far more profitable, though the test panel that received promotions looked much better via traditional response metrics.  Though nobody believed the test results (look, our response metrics show that promotions work, your test is wrong!), I was in a position of authority, so I could execute a strategy without promotions.  We had the most profitable year in the history of the catalog/online division in 1999.

In 2001 at Nordstrom, we executed mail/holdout tests to measure the impact of catalog marketing on the online channel.  It was a well-known "best practice" to mail catalogs, because catalog caused customers to buy online, as proven by a methodology called "matchbacks", the early version of "attribution models" that we hear about today.  Our test results showed something interesting ... when a customer stopped receiving catalogs, the customer started buying online on her own, so the catalog was being given too much credit for online orders that were "matched back" to the catalog.  In essence, we could mail fewer catalogs and generate a lot more profit.  Needless to say, leadership aligned with catalog marketing hated what the tests illustrated, and fought the results tooth-and-nail.  A few years later, the catalog strategy was discontinued, and profit significantly increased, a result contrary to what traditional response metrics and attribution algorithms suggested would happen.

In my consulting practice, I continually see cases where traditional metrics (response, opens, clicks, conversion rate) yield outcomes that are contrary to what you see when you execute holdout tests.  In every case in my career that I can remember, when leadership chose to listen to what holdout tests were indicating, profit increased.

In most cases, however, folks choose to align with response/open/click/conversion-rate metrics instead.  They persecute those who advocate A/B and factorial testing with what a former boss called "lizard logic".  Lizard logic, of course, is an older version of 'truthiness'.
• "Your sample sizes are too small and probably aren't statistically significant".
• "Your samples probably have better customers in one group than in another, invalidating your test.  How can you prove your samples are 100% statistically random?  You can't.  Therefore, your results are invalid."
• "Measurement via open/click/convert is a best practice, are you telling me that best practices are wrong?  If you are telling me that best practices are wrong, then you are telling me that I am wrong, and I take offense with your arrogant claim.  Maybe YOU are wrong!"
• "I've been an analyst/manager/executive for ten years, are you telling me that I don't know what I'm doing?'
• "You only tested for one month, your results won't hold for a year, try testing for a year."
• "You tested for a year, your results are biased because we'd never execute that strategy, try testing for one month."
• "Digital marketing is different, you can't apply A/B testing to digital marketing, attribution is so much more complicated than what can be measured via an A/B test."
The persecution will come from anybody who benefits from flawed metrics.  You really can't blame people for persecuting you.  Their livelihood is being threatened, their metrics tell them that they are right, and often, they simply don't understand your testing strategy or methodology.

People execute A/B tests incorrectly all of the time --- they optimize for conversion rate and not for profit, their sample sizes are too small because they are sampling for conversions instead of profit dollars ... but that being said, their results are usually more reliable than those obtained via classic response/open/click/conversion metrics.

I'm not saying you don't measure via response/open/click/conversion.

I am saying that traditional response/open/click/conversion metrics give misleading results.

I've analyzed more than 500 tests in my career, after earning a degree in statistics.  I know a little something about this.  A/B and factorial tests do not lead you astray.  Instead, they illustrate secrets that your customers are telling you about how they like to respond, secrets that cannot be illustrated via traditional metrics.

A/B tests allow you to listen to your customers.  Isn't that the most important thing?

When you go down the testing route, you will be persecuted by others.

If you love traditional metrics and are not a fan of testing, give Anne Holland's "Which Test Won" website a try.  See if you can correctly guess which strategy performs better ... each strategy was A/B tested.  If you can guess 85% or more of the tests correctly, then maybe you don't need to believe in the results of A/B tests, because your gut instinct and business knowledge is just that good.

But if you aren't that good, then why not give A/B tests a try?  Why not believe what the customer tells you?  You're going to generate more profit by measuring via A/B tests.  Do you love profit and knowledge, or do you love traditional metrics?

## October 08, 2008

### Catalog And Retailer Differences In Matchback Strategy And Contact Strategy Optimization

There's this huge shift in multichannel marketing strategy in recent years, with catalog matchback algorithms playing a significant role in the shift.

Fashion retailers (Neiman Marcus, Saks, Bloomingdales, Nordstrom) either eliminated traditional catalog marketing programs, or are in the process of significantly reducing circulation. Folks at Williams Sonoma are significantly trimming circulation.

When I talk to some of you, you tell me that these folks can cut circulation because they are retailers --- the retail channel somehow generates brand awareness that fuels a brand in a way that minimizes the need for advertising. You might be right; we simply cannot test your hypothesis.

Mechanically, retail brands are better at developing a testing discipline.

Here's an example. We randomly sample twenty customers: ten receive a catalog, ten do not. We then measure performance across channels during the three weeks that the catalog is active. Here's what we observe:

| Mailed | Holdout |
| --- | --- |
| Cust 1 (Buy Store) | Cust 11 |
| Cust 2 | Cust 12 |
| Cust 3 | Cust 13 (Buy Online) |
| Cust 4 | Cust 14 |
| Cust 5 (Buy Phone) | Cust 15 |
| Cust 6 (Buy Online) | Cust 16 |
| Cust 7 | Cust 17 |
| Cust 8 | Cust 18 |
| Cust 9 (Buy Online) | Cust 19 |
| Cust 10 | Cust 20 (Buy Online) |

Here's the fundamental difference between the retailer and the catalog brand.

The retailer will compare the mailed group and the holdout group. In the mailed group, four out of ten customers responded --- in the holdout group, two out of ten customers responded. The retailer calculates response as (4 - 2) / 10 = 20%.

The cataloger does not execute the test. Instead, the cataloger takes the mailed group, identifies the four responses, matches the responses back to the mail file, and calculates response as 4 / 10 = 40%.

Again, notice the significant difference in response, using the two methodologies.
• Retailer = 20% Response Rate.
• Cataloger = 40% Response Rate.
In this comparison, the organic percentage is 20% / 40% = 50%. Half of the demand would happen without any advertising.
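The two methodologies can be sketched side by side; a minimal example using the twenty-customer sample above:

```python
# Retailer vs. cataloger measurement of the same twenty-customer test.
mailed_buyers, mailed_size = 4, 10     # 4 of 10 mailed customers bought
holdout_buyers, holdout_size = 2, 10   # 2 of 10 holdout customers bought

# Retailer: net the holdout group out of the mailed group (incremental).
retailer_response = (mailed_buyers - holdout_buyers) / mailed_size  # 20%

# Cataloger: matchback credits every order in the mailed group.
cataloger_response = mailed_buyers / mailed_size                    # 40%

# Share of demand that happens organically, without the catalog.
organic_percentage = retailer_response / cataloger_response         # 50%
```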

This fundamental difference in approach causes a shift in strategy.
• Retailer = Cut Circulation, Re-Allocate Marketing Dollars Elsewhere, Learn!!
• Cataloger = Maintain Circulation, Ask For Additional Funding For Online Marketing, And Significantly Over-Spend In The Catalog Marketing Channel, Driving Down Profit.
This problem is systemic across the catalog industry. Matchback vendors aren't trying to rip you off, they simply aren't. But there isn't an incentive to create a "best practice" that accounts for the difference between what retailers observe when executing contact strategy testing and what catalogers measure via matchback analytics.

A simple solution for catalogers is to execute a test similar to the one designed above. Do not tell the matchback vendor about the holdout group. Have the matchback vendor run the control group through the matchback algorithm, and see how many orders are allocated to the holdout group. Subtract the results of the holdout group from the results of the mailed group, and you have true incremental demand as illustrated in the retail example at the beginning of this post.

Hillstrom's Contact Strategy Optimization: A New E-Book.

## June 03, 2008

### Great Moments In Database Marketing #1: Incremental Value

Our top rated Database Marketing moment takes us back to 1993 - 1994. Yeah, way back then, people were doing sophisticated work. Honestly!

Way back in the early 1990s at Lands' End, we had seven different business units that marketed to customers, either through standalone catalogs, or through pages added to catalogs.

As growth became more and more difficult (pay close attention online marketers ... your world is heading in this direction), management elected to mail targeted catalogs to targeted customer segments.

In other words, a Mens Tailored catalog concept was developed, with a half-dozen or more incremental catalogs mailed to customers who preferred Mens Tailored merchandise. A Home catalog concept was developed, with nine or more incremental catalogs mailed to customers who preferred Home merchandise.

Seven concepts were developed. Each concept was growing.

But the core catalog, the monthly catalog mailed for three decades, was not really growing anymore. And total company profit (as a percentage of net sales) was generally decreasing over time.

Something was amiss.

We studied the housefile, and learned that the "best" customers were being "bombed" by catalogs ... upwards of forty a year. Every business unit, making independent mailing decisions, mailed essentially the same customers. And all of our metrics, when viewed at a corporate level, indicated that customers were not spending fundamentally more than they spent several years ago when the new business concepts didn't exist.

So we developed a test. We selected ten percent of our housefile, and created seven columns in a spreadsheet. We randomly populated each column with the words "YES" or "NO", at a 50% / 50% proportion. Each business unit was assigned to a column. When it came time to make mailing decisions for that business unit, we referred to the column assigned to the business unit. If the word "NO" appeared, we did not mail the customer, even if the customer qualified for the mailing based on RFM or model score criteria.

In statistics, this is called a 2^7 Factorial Design.
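A sketch of the assignment procedure, assuming hypothetical customer IDs and a made-up seed (the original work was done in a spreadsheet, not code):

```python
import random

# Randomly assign each customer in the test panel an independent 50/50
# YES/NO flag for each of the seven business units (2^7 = 128 cells).
random.seed(1994)  # arbitrary seed, for reproducibility only
business_units = ["Unit %d" % i for i in range(1, 8)]

def build_mail_plan(customer_ids):
    return {
        cust: {unit: random.choice(["YES", "NO"]) for unit in business_units}
        for cust in customer_ids
    }

plan = build_mail_plan(["cust_001", "cust_002", "cust_003"])
# A "NO" for a unit means: do not mail that unit's catalogs to the
# customer, even if RFM or model scores qualify the customer.
```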

There are two reasons for designing a test of this nature.
1. Quantify the incremental value (sales and profit) that each business unit contributes to the total brand.
2. Identify, across customers segments, the number of catalogs a customer should receive to optimize profitability.
What did we learn?
1. Each catalog mailed to a customer drove smaller and smaller incremental increases in sales. If a dozen catalogs caused a customer to spend \$100, then two dozen catalogs caused customers to spend \$141, and three dozen catalogs caused customers to spend \$173. The relationship roughly approximated the Square Root Rule you've read so much about on this blog.
2. Each business unit, on average, was contributing only 70% of the volume that company reporting suggested the business unit was contributing. In other words, if you didn't mail the catalogs, you'd lose 70% of the sales, with customers spending 30% elsewhere.
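The first finding, the square-root relationship, can be sketched as a small function; the \$100-per-dozen baseline is the illustrative figure from the example:

```python
import math

# Square Root Rule: customer spend scales with the square root of the
# number of catalogs mailed, anchored at $100 for a dozen catalogs.
def expected_spend(catalogs, base_catalogs=12, base_spend=100.0):
    return base_spend * math.sqrt(catalogs / base_catalogs)

# expected_spend(24) -> about $141; expected_spend(36) -> about $173
```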
The latter point is critical.

Take a look at the table below, one that illustrates the profit and loss statement reported by finance, and one that applies the results of the test.

| Test Results Analysis | Rate | Finance Reported | From Test Results |
| --- | --- | --- | --- |
| Demand | | \$50,000,000 | \$35,000,000 |
| Net Sales | 82.0% | \$41,000,000 | \$28,700,000 |
| Gross Margin | 55.0% | \$22,550,000 | \$15,785,000 |
| Less Marketing Cost | | \$9,000,000 | \$9,000,000 |
| Less Pick/Pack/Ship | 11.0% | \$4,510,000 | \$3,157,000 |
| Variable Profit | | \$9,040,000 | \$3,628,000 |
| Less Fixed Costs | | \$6,000,000 | \$6,000,000 |
| Earnings Before Taxes | | \$3,040,000 | (\$2,372,000) |
| % Of Net Sales | | 7.4% | -8.3% |

The test indicated that what appeared to be highly profitable business units were actually marginally profitable, or in some cases, unprofitable. In this example, the business unit is "70% incremental", meaning that if the business unit did not exist, 70% of the sales volume would disappear, while 30% would be spent anyway by the customer, spent on other merchandise.
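The restatement can be sketched as a small function, using the cost ratios from the illustrative P&L (82% net sales rate, 55% gross margin, 11% pick/pack/ship, \$9,000,000 marketing, \$6,000,000 fixed costs), not real Lands' End figures:

```python
# Restate the P&L at 70% incrementality (illustrative ratios).
def earnings(demand, marketing=9_000_000.0, fixed=6_000_000.0):
    net_sales = demand * 0.82
    gross_margin = net_sales * 0.55
    pick_pack_ship = net_sales * 0.11
    variable_profit = gross_margin - marketing - pick_pack_ship
    ebt = variable_profit - fixed
    return ebt, ebt / net_sales

finance_reported = earnings(50_000_000.0)   # EBT about  $3,040,000 ( 7.4%)
from_test = earnings(50_000_000.0 * 0.70)   # EBT about -$2,372,000 (-8.3%)
```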

Imagine if you were the EVP responsible for a business unit that appeared to generate 7.4% pre-tax profit, only to have some rube in the database marketing department tell you that your efforts are actually draining the company of profit?

Why Does This Matter?

This style of old-school testing (which is more than a hundred years old, with elements of the testing strategy now employed aggressively in online marketing) tells you how valuable your marketing and merchandising initiatives truly are.

Catalogers fail to do this style of testing, not realizing that a portion of catalog driven sales would still be generated online (or in other catalogs). In 2008, most catalog marketers are grossly over-mailing existing buyers. Catalog Choice, in part, exists due to catalogers mis-reading this phenomenon.

E-mail marketers seldom execute these tests, not realizing that in many cases almost all of the sales would still be generated online. E-mail marketers, ask your e-mail marketing vendor to partner with you on test designs like the ones mentioned in this article. You may be surprised by what you learn!

Online marketers are more likely than most marketers to execute A/B splits at minimum, with some executing factorial designs. Many online brands evolve in a Darwinian style, fueled by the results of factorial designs. Online marketers know that you make mistakes quickly, and you correct those mistakes quickly.

Web Analytics folks have the responsibility to tell management when sku proliferation no longer contributes to increased sales. It is important for Web Analytics folks to lead the online marketing community, shutting off portions of the website in various tests to understand the incremental value of each additional sku.

What are your thoughts on this style of testing? What have you learned by executing tests of this nature?

## May 29, 2008

### Great Moments In Database Marketing #6: Long-Term Impact of Promotions at Eddie Bauer

We go back to 1998 for this Great Moment in Database Marketing.

At the time, I was Director of Circulation at Eddie Bauer, a brand that was punch-drunk on promotions. Anytime a customer failed to purchase in six months, the "CRM/Circulation" process offered the customer a "20% off \$100" promotion: twenty percent off your next order of one hundred dollars or more.

We tested these promotions until we were blue in the face. Continually, they showed that the customer spent about twenty percent more if offered this promotion.

So, the promotions became part of "what we did". And then my team decided to execute a long-term test. For the next six months, we would not offer a segment of lapsed customers a single promotion.

What do you think happened?

Take a look at the following table, a table that approximates the actual results of the test.

| Eddie Bauer Six Month Promotion Test: 1998 | Receive Promos | No Promos | Increment |
| --- | --- | --- | --- |
| Month 1 | \$10.80 | \$9.00 | \$1.80 |
| Month 2 | \$9.00 | \$9.30 | (\$0.30) |
| Month 3 | \$10.80 | \$9.60 | \$1.20 |
| Month 4 | \$9.00 | \$9.90 | (\$0.90) |
| Month 5 | \$10.80 | \$10.20 | \$0.60 |
| Month 6 | \$9.00 | \$10.50 | (\$1.50) |
| Demand | \$59.40 | \$58.50 | \$0.90 |
| Net Sales | \$41.58 | \$40.95 | \$0.63 |
| Gross Margin | \$22.87 | \$22.52 | \$0.35 |
| Marketing | \$9.00 | \$9.00 | \$0.00 |
| Promos | \$4.07 | \$0.00 | \$4.07 |
| Pick/Pack/Ship | \$4.99 | \$4.91 | \$0.08 |
| Profit | \$4.80 | \$8.61 | (\$3.80) |
| % of Sales | 11.6% | 21.0% | -9.5% |

Uh oh.

Here's the 411 folks. When customers are continually promoted to, they delay purchases until the promotion is offered to them.

In our test, if customers were not offered promotions, they slowly began to "build momentum". Accustomed to an every-other-month cadence of promotions (the actual test had a different rhythm than illustrated above), these customers waited for promotions, did not receive them, then started spending more on their own.

After six months, we noticed that customer spend in the two groups was nearly identical!

Now look at profit. Sure, the group that received promotions appeared profitable --- they appeared profitable via every system we had in the company, via every A/B test we executed.

But when viewed via a long-term A/B test, the results were significantly different. We were losing a boatload of money promoting to customers who would ultimately spend the same amount of money if we didn't execute the promotion.
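The profit arithmetic behind the six-month comparison can be sketched as follows; the cost ratios (net sales at 70% of demand, 55% gross margin, 12% pick/pack/ship, \$9.00 of marketing cost per customer) are approximated from the illustrative figures, not actual Eddie Bauer economics:

```python
# Six-month per-customer profit, promos vs. no promos (approximated ratios).
def six_month_profit(demand, promo_cost):
    net_sales = demand * 0.70
    gross_margin = net_sales * 0.55
    pick_pack_ship = net_sales * 0.12
    return gross_margin - 9.00 - promo_cost - pick_pack_ship

with_promos = six_month_profit(59.40, promo_cost=4.07)  # about $4.81
no_promos = six_month_profit(58.50, promo_cost=0.00)    # about $8.61
# Nearly identical demand, but promotions give away most of the profit.
```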

In 1999, we dramatically pulled back on promotions. Total Net Sales decreased by maybe five or six percent. Total profit hit an all-time record high.

The core fundamentals of direct marketing are often violated in the world of "instant metrics" we've created. Our e-mail marketing friends read open rates and conversion rates from a "Free Shipping" e-mail within an hour of blasting the campaign. The adrenaline rush felt from obtaining instant access to customer behavior fuels strategy.

My challenge to the e-mail marketing and web analytics community, two communities that live and die by a steady diet of exhilarating and instantaneous metrics, is this ... do your metrics allow you to understand if what we observed at Eddie Bauer in 1998 is happening in your business? And if your visit-specific metrics don't allow you to observe a trend like this, what kind of systems/software/human investment is needed to allow for this style of measurement?

Hillstrom's Multichannel Secrets: Fifty-Nine Facts For CEOs!

## April 26, 2008

### Retail Catalog Marketing

Retail catalog marketing is an inexact, imprecise science.

Let's assume that a major American retail brand sends you a catalog on April 1. Let's also assume that your small business purchases from this major American retail brand on the 15th of every month, regardless of marketing activity.

Did the catalog cause you to purchase merchandise?

The answer is probably "no".

The catalog may have influenced the merchandise you purchased. The catalog may have caused you to spend more than you normally would have. The catalog may have caused you to spend less than you normally would have.

But you would have purchased merchandise anyway, no matter what. You always buy something from this brand on the 15th of the month.

Now let's pretend you are the Database Marketing Executive at this major American retail brand. Your job is to measure the effectiveness of this retail catalog marketing effort. Using the tools and techniques available to the database marketers, let's see if you would decide to mail this sample customer future catalogs.

Methodology = Mail And Holdout Groups: Do Not Mail This Customer A Catalog

This is a classic direct marketing strategy, practiced for more than a century (and maybe for centuries). When measuring effectiveness by mail and holdout groups, we'd learn that this customer would purchase regardless of catalog marketing. Therefore, the segment this customer belongs to is not considered a "responder".

Methodology = Pattern Detection: Do Not Mail This Customer A Catalog

Pattern detection suggests that this customer buys on the 15th of every month. The database marketing executive learns that marketing doesn't influence this customer. Therefore, this individual customer would not be considered a responder.

Methodology = Matchback Analytics: Mail This Customer A Catalog

Matchback analytics, the kind offered by major list processing corporations, co-ops, and data compilers, match purchases within a window of time to a marketing activity. Let's say that the matchback window is three weeks (oftentimes, the matchback window is something silly, like ninety days or six months). Any retail purchase within three weeks of the catalog mailing is attributed to the catalog mailing. Therefore, this individual customer would be considered a responder. Here's a little secret. Matchback analytics grossly over-state the effectiveness of most retail activities. You've been warned!!

Methodology = Brand Marketing: Mail This Customer A Catalog

All too often, retail catalog marketing falls into the brand marketing arena. In other words, a budget is set, say \$1,000,000. The database marketing team is asked to mail a million customers, to use up the entire budget. The database marketing team executes the strategy. In this case, if our sample customer buys every month, the customer is a "good" customer, and will receive this catalog. This is the most common scenario in retail catalog marketing --- the CMO determines a budget, the CMO determines the marketing tactics that will be employed, and the database marketing executive picks the best customers for any given strategy. In some instances, rogue database marketers set up tests to determine if the strategy actually worked or not. I've executed this rogue strategy myself --- I wanted to understand how much money my company was losing. For the most part, however, the effectiveness of the mailing isn't even measured.

Retail catalog marketing is an inexact, imprecise science. The corporate culture, the quality of information captured in the customer database, and the measurement technique used by the database marketing team determine whether you will receive a retail catalog from your favorite American retail brand.

How does your company execute measurement of retail catalog marketing activities?

## January 03, 2008

### Testing Issues

Recall that my focus in 2008 is on multichannel profitability.

Experimental design (aka 'tests') is one of the most useful tools available to help us understand multichannel profitability.

We run into a ton of problems when designing and analyzing 'tests'. Let's review some of the problems.

Problem #1: Statistical Significance

Anytime we want to execute a test, a statistician will want to analyze the test (remember, I have a statistics degree --- I want to analyze tests!).

In order to make sense of the conclusions, the statistician will introduce the concept of "statistical significance". In other words, the statistician will tell you if the difference between a 3.0% and 2.9% click-through rate is "meaningful". If, according to statistical equations, the difference is not deemed to be "meaningful", the statistician will tell you to ignore the difference, because the difference is not "statistically significant".

Statisticians want you to be right 90% of the time, or 95% of the time, or 99% of the time.

We all agree that this is critical when measuring the effectiveness of a cure for AIDS. We should all agree that this isn't so important when measuring the effectiveness of putting the shopping cart in the upper right-hand corner of an e-mail campaign.

Business leaders are seldom given opportunities to capitalize on something that will work 95% of the time. Every day, business leaders make decisions based on instinct, on gut feel, not having any data to make a decision. Knowing that something will work 72% of the time is a blessing!

Even worse, statistical significance only holds if the conditions that existed at the time of the test are identical to the conditions that exist today. Business leaders know that this assumption can never be met.

Test often, and don't limit yourself to making decisions only when you're likely to be right 99% of the time. You'll find yourself never making meaningful decisions if you have to be right all the time.
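As a sketch of the arithmetic the statistician runs on the 3.0% vs. 2.9% click-through example, here's a standard two-proportion z-test; the 100,000-per-arm sample size is an assumption for illustration:

```python
import math

# Two-sided two-proportion z-test: is 3.0% vs. 2.9% CTR "meaningful"?
def two_proportion_p_value(p1, n1, p2, n2):
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = abs(p1 - p2) / std_err
    return math.erfc(z / math.sqrt(2))  # two-sided p-value

p_value = two_proportion_p_value(0.030, 100_000, 0.029, 100_000)
# Even with 100,000 recipients per arm, p is roughly 0.19 --- not
# "statistically significant" at the 95% level, yet the difference may
# still be worth acting on.
```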

Problem #2: Small Businesses

Large brands have testing advantages. A billion dollar business can afford to hold out 100,000 customers from a marketing activity. The billion dollar business gets to slice and dice this audience fifty different ways, feeling comfortable that the results will be consistent and reliable.

Small businesses are disadvantaged. If you have a housefile of 50,000 twelve-month customers, you cannot afford to hold out 10,000 from a catalog or e-mail campaign.

However, a small business can afford to hold out 1,500 twelve-month customers out of 50,000. The small business will not be able to slice and dice the data the way a large brand can. The small business will have to make compromises.

For instance, look at the variability associated with ten customers, four of which spend money:
• \$0, \$0, \$0, \$0, \$0, \$0, \$50, \$75, \$150, \$300.
• Mean = \$57.50.
• Standard Deviation = \$98.63.
• Coefficient of Variation = \$98.63 / \$57.50 = 1.72.
Now look at the variability associated with measuring response (purchase = 1, no purchase = 0).
• 0, 0, 0, 0, 0, 0, 1, 1, 1, 1
• Mean = 0.40.
• Standard Deviation = 0.516.
• Coefficient of Variation = 0.516 / 0.40 = 1.29.
The small company can look at response, realizing that response is about twenty five percent "less variable" than the amount of money a customer spent.
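The comparison above can be reproduced with the standard library, using the sample standard deviation (n - 1 denominator) as in the example:

```python
import statistics

# Coefficient of variation: spend vs. binary response for the same
# ten customers from the example above.
spend = [0, 0, 0, 0, 0, 0, 50, 75, 150, 300]
response = [1 if amount > 0 else 0 for amount in spend]

def coefficient_of_variation(values):
    return statistics.stdev(values) / statistics.mean(values)

cv_spend = coefficient_of_variation(spend)        # about 1.72
cv_response = coefficient_of_variation(response)  # about 1.29
# Response is roughly 25% less variable: 1 - (1.29 / 1.72) is about 0.25.
```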

Small companies need to analyze tests, sampling 2-4% of the housefile in a holdout group, focusing on response instead of spend. The small company realizes that statistical significance may not be achievable. The small company looks for "consistent" results across tests. The small company replicates the rapid test analysis document, using response instead of spend.

Problem #3: Timeliness

The internet changed our expectations for test results. Online, folks are testing strategies in real-time, adjusting landing page designs on Tuesday morning based on results from a test designed Monday morning, executed Monday afternoon.

In 1994, I executed a year-long test at Lands' End. I didn't share results with anybody for at least nine months. What a mistake. We had spirited discussions from month ten to month twelve that could have been avoided if communication started sooner.

Start analyzing the test right away. Share results with everybody who matters. Adjust your results as you obtain more information. It is ok that the results change from month two to month three to month twelve, as long as you tell leadership that results may change. Given the fact that the online marketers are making changes in real-time, you have to be more flexible.

Problem #4: Belief

You're going to obtain results that run contrary to popular belief.

You might find that your catalog drives less online business than matchback results suggest. You might find that advertising womens merchandise in an e-mail campaign causes customers to purchase cosmetics.

You might find that your leadership team dismisses your test results, because the results do not hold up to what leadership "knows" to be true.

Just remember that people once thought the world was flat, that the universe orbited Earth, and that subprime mortgages could be packaged with more stable financial instruments for the benefit of all. If unusual results can be replicated in subsequent tests, the results are not unusual.

Leadership folks aren't a bunch of rubes. They have been trained to think a certain way, based on the experiences they've accumulated over a lifetime. It will take time for those willing to learn to change their point of view. It does no good to beat them over the head with "facts".

## January 01, 2008

### Rapid Test Results

In 2008, I'm going to focus energy discussing how test results and Multichannel Forensics increase profitability, and hopefully decrease customer dissatisfaction. Today, we begin the discussion by exploring the concept behind a project I call "Rapid Test Results".

One of the easiest ways for multichannel catalogers, retailers and e-mail marketers to understand customer behavior is through the use of "A/B" tests.

In an "A/B" test, one representative group of customers receives a marketing activity, while the other representative group of customers does not.

The catalog industry uses matchback algorithms to understand multichannel behavior. As most of us understand, matchback algorithms overstate the effectiveness of marketing activities.

Conversely, e-mail marketers understate the effectiveness of e-mail marketing activities when using open rates, click-through rates, and conversion rates.

Therefore, we need to improve the understanding of our marketing activities. One way to do this is to create and analyze more "A/B" tests, often called "mail/holdout" tests.

It can be very easy to execute these tests.
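The mechanics of the split can be sketched in a few lines of Python. The function name and the 10% holdout fraction are my own illustrative choices, not anything specified in the post:

```python
import random

def split_mail_holdout(customer_ids, holdout_fraction=0.10, seed=42):
    """Randomly split customers into a mailed group and a holdout group
    that receives no marketing; the fixed seed makes the split repeatable."""
    ids = list(customer_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * holdout_fraction)
    holdout, mailed = ids[:cut], ids[cut:]
    return mailed, holdout

# A 500,000-name list with a 10% holdout yields 450,000 mailed, 50,000 held out.
mailed, holdout = split_mail_holdout(range(500_000))
```

Because the assignment is random, any difference in spend between the two groups can be attributed to the marketing activity itself.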

However, we don't always have the resources necessary to analyze and understand the test results.

If you are an executive who falls into the latter category, I have something for you. It is called "Rapid Test Results".

For my loyal blog readers, executives, and current customers, I have an inexpensive proposal just for you. The Rapid Test Results Analysis Document outlines a project that gets you results for the tests you executed within just a few days of sending your information for analysis.

If there's one thing I learned in 2007, it is that e-mail and catalog teams are minimally staffed! And yet, the information that can be gleaned from tests executed by e-mail and catalog marketing teams can shape the future direction of your organization.

So if any of the following criteria are met by your organization, please consider a Rapid Test Results Project:
• You are an e-mail marketer who believes your e-mail campaigns drive more sales and profit than you can measure via standard metrics like open rate, click-through rate, and conversion rate.
• You are a catalog marketer who wants to truly understand whether multichannel customers respond to catalog marketing, and wants to learn the impact of catalog marketing on the online channel.
• You are a catalog marketer who wants to reduce catalog marketing expense (and benefit the environment) by limiting contacts to internet customers.
• You do not have the analytical resources to analyze test results quickly.
• You do not have the systems support to measure test results by different customer segments, across different channels, or across different merchandise classifications.
• Your executive team does not understand the constraints and limitations that prevent your team from analyzing all of your tests in a timely manner.

## December 07, 2006

### A/B Test Design And Incremental Multichannel Campaign Performance

Never before has the traditional "A/B" test been as important as it is in our multichannel ecosystem. Such a simple concept, the "A/B" test is uniquely designed to measure the incremental performance of marketing activities.

As an example, assume a multichannel organization mails a catalog to a housefile list of 1,000,000 names. The database marketer chooses the best 1,100,000 households and randomly splits them into two groups. The "A" portion of the test is the 1,000,000 households who receive the catalog. The "B" portion is the 100,000 households who do not receive the catalog.

Maybe a month after the in-home date, the database marketing analyst is prompted to analyze the results. Within each group, the 1,000,000 who received the catalog, and the 100,000 who didn't receive it, the analyst calculates the average net sales within the catalog/telephone channel, the online channel, and the retail channel.

Here are sample results:

| | Quantity | Telephone | Online | Retail | Totals |
|---|---|---|---|---|---|
| Received Catalog | 1,000,000 | \$6.00 | \$8.00 | \$21.00 | \$35.00 |
| Did Not Receive Catalog | 100,000 | \$2.50 | \$7.00 | \$19.50 | \$29.00 |
| Incremental Lift | | \$3.50 | \$1.00 | \$1.50 | \$6.00 |

In this example, the catalog drove an incremental \$3.50 per customer to the catalog/telephone channel, \$1.00 per customer to the online channel, and \$1.50 per customer to the retail channel, for a total of \$6.00 incremental sales per customer.

Because we mailed 1,000,000 households, the total net sales attributed to this mailing is 1,000,000 * \$6.00 = \$6,000,000.
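The arithmetic above can be checked with a short Python sketch; the dictionary names are my own, and the per-customer figures are taken straight from the table:

```python
# Per-customer net sales by channel, from the sample results above.
mailed_sales = {"telephone": 6.00, "online": 8.00, "retail": 21.00}
holdout_sales = {"telephone": 2.50, "online": 7.00, "retail": 19.50}

# Incremental lift per channel = mailed average minus holdout average.
lift = {ch: mailed_sales[ch] - holdout_sales[ch] for ch in mailed_sales}
total_lift = sum(lift.values())        # $6.00 incremental sales per customer
total_demand = 1_000_000 * total_lift  # $6,000,000 attributed to the mailing
```

The holdout group supplies the "what would have happened anyway" baseline, which is exactly what open/click/conversion metrics cannot provide.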

Some vendors advocate a different methodology --- they advocate allocating any online and retail order generated during the time the catalog was active to the mailing of the catalog. This results in a gross over-estimation of the importance of the catalog. Please don't go down this path.

A similar methodology can be used to test multiple marketing activities at the same time. Assume an e-mail campaign was mailed to the opt-in portion of this audience. Within this audience, you randomly assign customers to one of four test segments. Here are some sample results.

| | Quantity | Telephone | Online | Retail | Totals |
|---|---|---|---|---|---|
| Catalog + E-Mail | 400,000 | \$5.50 | \$8.50 | \$21.25 | \$35.25 |
| Catalog Only | 50,000 | \$6.00 | \$8.00 | \$21.00 | \$35.00 |
| E-Mail Only | 50,000 | \$3.00 | \$8.10 | \$18.65 | \$29.75 |
| No Catalog, No E-Mail | 50,000 | \$3.50 | \$7.00 | \$18.50 | \$29.00 |

Tests like these yield interesting and intriguing results. Notice that the best strategy for the catalog/telephone channel was to mail only a catalog. The best strategy for the online channel was to mail a catalog and an e-mail. The best strategy for the retail channel was to mail both a catalog and an e-mail.
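Reading a four-cell test like this reduces to finding, for each channel, the combination of contacts with the highest per-customer sales. A minimal Python sketch, with the data structure being my own illustration of the table:

```python
# Per-customer results from the four-cell test, keyed by contacts received.
cells = {
    ("catalog", "email"): {"telephone": 5.50, "online": 8.50, "retail": 21.25},
    ("catalog",):         {"telephone": 6.00, "online": 8.00, "retail": 21.00},
    ("email",):           {"telephone": 3.00, "online": 8.10, "retail": 18.65},
    ():                   {"telephone": 3.50, "online": 7.00, "retail": 18.50},
}

# For each channel, identify the best-performing combination of contacts.
best = {
    channel: max(cells, key=lambda cell: cells[cell][channel])
    for channel in ("telephone", "online", "retail")
}
```

Running this confirms the reading above: catalog-only wins the telephone channel, while catalog-plus-e-mail wins the online and retail channels.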

Statisticians can assist with significance tests, if you feel that is appropriate. It is more important to simply execute tests of this nature, and learn how all of your marketing activities interact with each other. What you learn about how marketing activities and channels interact with each other within our multichannel ecosystem may surprise you.

## November 17, 2006

### Williams Sonoma: Incremental Online Sales and Matchback Analysis

Williams Sonoma always does a nice job of sharing fun facts with the public. In their third quarter earnings release, they state that "55% of online revenues are generated by customers who recently received a catalog."

This is always an interesting topic of debate in the database marketing world. Williams Sonoma does not specifically state which of two popular analytical methods they use to measure this metric.

Most popular, and most vigorously argued against by the analytically adept, is the method of attributing every online order to the catalog channel if a customer recently received a catalog. The theory behind this technique (often called a "matchback analysis") is that the catalog inspired the order. Many vendors promote this methodology, and for good reason: the technique can overstate orders attributed to mailed catalogs, and vendors have a vested interest in promoting paper as a viable means of profitable marketing. Critics will argue that if you mail your entire housefile, this methodology will cause you to attribute every single online order to the mailing of the catalog. Critics will also argue that if you mail every housefile name a catalog, and send every housefile name an e-mail, the methodology completely breaks down, rendering the analysis useless.

Less popular is the method of an "A/B" split. The marketer randomly splits her mail list into two halves: 50,000 customers receive the catalog, while a like group of 50,000 customers do not. Several weeks after the in-home date, the marketer measures total sales in the mail group and control group, in both catalog and online (and, where applicable, retail) channels. This method tends to provide much less optimistic answers than the "matchback analysis". Critics will argue that this methodology cannot produce reliable results due to sampling error issues.

Which methodology do you believe is more appropriate for allocating online orders to the marketing channel that drove the order?