July 10, 2013

1988: Deer Romping

Top Selling Song of 1988 = Faith, by George Michael.

Ya gotta have faith!

Sometimes, however, it was hard to have faith in the data.

In the autumn of 1988, at the Garst Seed Company, we were analyzing the annual "harvest". Each research plot was harvested, yields were recorded, and data was sent to a twenty-three-year-old analyst named Kevin Hillstrom.

Research plots were laid out in grids.  I analyzed each cell in the grid. A cell represented a corn or sorghum hybrid, and the number in the cell represented the yield of that hybrid.

Look in the top half of the table.  See the red numbers?  Those "yields" don't look right, do they?

Turns out that a deer "romped" through those plots, ruining the experiment.  Yields are 70% too low!

These experiments were expensive to conduct, and took six months from planting to harvest to complete.  You couldn't just say that the experiment was wasted. You had to do something with the data.

We used a method called "GLM", or "Generalized Linear Models" ... to correct for cases where deer romped through and destroyed our experiment.  By adjusting for the rows and columns in the field (and by planting a second or third "rep" ... an identical but re-randomized test), we could predict (values in blue above) what "should" have happened.
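For the curious, here is a rough sketch of the row-and-column idea in Python. It is not the procedure we ran at Garst. It is a plain least-squares fit of "yield = overall average + row effect + column effect" on the undamaged cells, which is then used to fill in the damaged ones. The grid, the effects, and the function name `predict_damaged` are all made up for illustration, and the sketch ignores hybrid effects and the extra reps.

```python
import numpy as np

def predict_damaged(yields, damaged):
    """Fit yield = mu + row effect + column effect on undamaged cells,
    then replace the damaged cells with the model's predictions."""
    n_rows, n_cols = yields.shape
    rows, cols = np.indices(yields.shape)
    rows, cols = rows.ravel(), cols.ravel()

    # Design matrix: intercept, row dummies, column dummies
    # (first row and first column dropped as the baseline).
    X = np.column_stack([
        np.ones(rows.size),
        np.eye(n_rows)[rows][:, 1:],
        np.eye(n_cols)[cols][:, 1:],
    ])

    # Estimate the effects using only the cells the deer left alone.
    good = ~damaged.ravel()
    beta, *_ = np.linalg.lstsq(X[good], yields.ravel()[good], rcond=None)

    # Predict every cell, but only overwrite the damaged ones.
    fitted = (X @ beta).reshape(yields.shape)
    adjusted = yields.copy()
    adjusted[damaged] = fitted[damaged]
    return adjusted

# Hypothetical 4x4 field with made-up row and column effects.
row_eff = np.array([0.0, 5.0, -3.0, 2.0])
col_eff = np.array([0.0, -4.0, 6.0, 1.0])
true_yields = 150.0 + row_eff[:, None] + col_eff[None, :]

# The deer romps through two plots; their yields come in 70% too low.
damaged = np.zeros((4, 4), dtype=bool)
damaged[0, 1] = damaged[1, 1] = True
observed = true_yields.copy()
observed[damaged] *= 0.3

adjusted = predict_damaged(observed, damaged)
```

Because the made-up data has no noise, the filled-in cells land right on the "true" yields. Real field data is noisy, so the predictions would only be estimates, and the fit needs enough good cells left in every row and column to work at all.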

This allowed us to save the company a fortune ... we used good data to predict what should have happened.

These days, you go out on Twitter, and you'd swear that folks just invented A/B tests.  Well, the good folks at Kansas State University and Iowa State University were executing and analyzing randomized plots, using GLM to adjust for outliers ... as far back as the 1940s ... which means these methods were being used long before that!

We have the same problems today in e-commerce ... how do you measure conversion rate when the email marketing team adds/removes campaigns from the schedule?  Or how do you measure conversion rate when you toss a bunch of sloppy brand advertising on the home page, forgoing revenue generated by top-selling items that were previously sold on the home page?

You use this procedure, this "GLM" procedure, to predict "what should have happened".  Heck, you don't even have to get that fancy ... just eyeball the results. It's better than using sloppy, bad data, isn't it?