By Florent Buisson

 

‘Nothing has such power to broaden the mind as the ability to investigate systematically and truly all that comes under thy observation in life.’ — Marcus Aurelius

 

Experiments (aka A/B tests) are the bread and butter of behavioral scientists. Many of us have been eagerly awaiting the opportunity to run our own experiments since the first time we learned how to do a t-test in an introductory statistics class. However, running experiments requires much more than knowing statistics, especially when you’re running “field” experiments and not purely digital ones. In-store promotions, HR interventions to reduce attrition or biases in hiring, talk paths to appease irate passengers: all of these interventions offer great opportunities to improve business outcomes, but testing them can be challenging, with many pitfalls awaiting the innocent behavioral scientist. Fortunately, careful planning before the fact can go a long way. In this post, I’ll give you four recommendations to get your experiment right:

  1. Develop a strong business case
  2. Measure twice (or twenty times), cut once
  3. Get randomization right
  4. Build for scale

Develop a Strong Business Case

When running experiments, it’s easy to develop tunnel vision and lose track of the broader business context. But a field experiment requires the involvement of other human beings, which means you have to convince them, or at least their manager, that it’s worth running. To build a strong business case, you need three things: a sponsor, a goal, and a hypothesis.

First, you need to identify the senior leader who would own your experiment and figure out if they have the bandwidth and goodwill to do it. Are they brand new to their role, or two months away from rotating into a different one? Are they in the middle of a politically sensitive merger, reorganization, or overhaul of their IT system? Hint: if you can’t even get 30 minutes in front of that senior leader to show them your experimentation plan and a tentative roadmap to full-scale launch, how much time and resources do you think they will allocate to implementing it?

Once you have a sponsor, you need a goal. What metrics does the sponsor care about? Of course, money is the endgame, so it’s easier to sell increases in revenue or decreases in costs. But they may not care about revenue if their strategic priority for the year is cost-cutting, and vice-versa. They may not even care about money if their goal for the year is to reduce employee turnover in the call center or to improve some other operational metric.

Last but not least, you need to connect your proposed intervention to the goal through a clear and measurable hypothesis. For example, “by setting overlapping break times for all the employees of a call center, we’ll increase job satisfaction and reduce monthly turnover”. Quite often, simply writing something down and sharing it with stakeholders will reveal hidden assumptions, inconsistencies, or unclear turns of phrase. By breaking down your hypothesis into components and guesstimating some plausible values for each component (e.g., based on historical values), you can see if the math checks out. What is the current turnover in the call center? What is the cost of replacing an employee? Don’t worry about getting the numbers precisely right; you can avoid many embarrassing situations by simply verifying that the math is not vastly off (think: costs being two or three times the size of the best-case benefits).
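To make this concrete, here is a minimal back-of-the-envelope sketch in Python; every number in it is a hypothetical placeholder that you would replace with your own historical values:

```python
# Back-of-the-envelope check that the math is not vastly off.
# All numbers are hypothetical placeholders, not benchmarks.
n_employees = 200             # call-center headcount
monthly_turnover = 0.05       # current monthly turnover rate
expected_reduction = 0.20     # hoped-for relative reduction in turnover
cost_per_replacement = 8_000  # estimated cost of replacing one employee, in dollars
annual_cost = 50_000          # rough cost of the intervention (scheduling, training, etc.)

departures_avoided = n_employees * monthly_turnover * expected_reduction * 12
annual_benefit = departures_avoided * cost_per_replacement

print(f"Departures avoided per year: {departures_avoided:.0f}")  # 24
print(f"Best-case annual benefit:    ${annual_benefit:,.0f}")    # $192,000
print(f"Estimated annual cost:       ${annual_cost:,.0f}")       # $50,000
# If the cost were two or three times the best-case benefit, you'd want to rethink.
```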

Measure Twice (or Twenty Times), Cut Once

Digital experiments are flexible: if you realize that an email to customers was poorly worded, you can tweak it and that’s that; there’s no need to get everything right from the get-go. But undoing an intervention in a field experiment is about as easy as it is for a tailor to uncut a piece of fabric. Imagine having to go back to the director of a sales team and telling them that their 200+ representatives will need to be retrained because you made a mistake in the training material. Oops.

Therefore, you need to ensure that as little as possible goes wrong (because trust me, something will always go wrong!). Fortunately, you can usually call upon the expertise of your colleagues when preparing your experiment. UX and market researchers, copywriters, trainers, and company lifers all have a wealth of knowledge you can tap if you ask nicely (good thing behavioral scientists are a friendly bunch!). They will often be able to point out flaws or possible improvements in your intervention just by looking at it. Researchers can also help you rapidly prototype it and iterate on it through small-sample testing.

Behavioral scientists can be prone to overemphasize the importance of large sample sizes and statistical significance, but the truth is that you can measure (pretty much) anything–or at least get a broad sense of it–from a very small sample. For instance, even with as few as 5 subjects, there is a better than 90% chance that the population median of a numeric variable lies between the lowest and the highest values in your sample.
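If you want to convince yourself, the reasoning is short: the population median falls outside the sample range only if all five observations land on the same side of it, and each observation has a 50% chance of doing so. A quick sketch of that calculation:

```python
# Probability that the population median lies between the sample minimum and
# maximum, for n independent draws of a continuous variable.
def prob_median_within_range(n: int) -> float:
    # The median is outside the range only if all n observations fall on the
    # same side of it, which happens with probability 2 * (1/2) ** n.
    return 1 - 2 * 0.5 ** n

print(prob_median_within_range(5))  # 0.9375, i.e., a better than 90% chance
```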

Get Randomization Right

The random assignment of units to either the control or the treatment group is the crucial ingredient that ensures the absence of bias in your comparative analysis. It is also surprisingly tricky. I have seen people who were unsatisfied with a random draw (e.g., maybe the treatment group has a higher average income than the control group) “correct” it by swapping subjects between groups; this breaks the randomization, so don’t do it! A better solution is simply to re-run your assignment algorithm. Or even better, use matching to ensure balanced experimental groups, as I describe in my book.
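For illustration, here is a minimal sketch of what “re-run the whole draw” can look like in Python with pandas; the data frame, the income column, and the 5% balance threshold are all assumptions made up for the example, not a prescription:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2024)

def assign_groups(df: pd.DataFrame) -> pd.Series:
    """Randomly split units roughly 50/50 into control and treatment."""
    labels = np.array(["control", "treatment"])[rng.permutation(len(df)) % 2]
    return pd.Series(labels, index=df.index)

def draw_balanced_assignment(df: pd.DataFrame, max_gap: float = 0.05) -> pd.Series:
    """If a draw looks unbalanced on income, redo the *whole* draw;
    never swap individual subjects between groups."""
    while True:
        groups = assign_groups(df)
        means = df["income"].groupby(groups).mean()
        gap = abs(means["treatment"] - means["control"]) / means["control"]
        if gap < max_gap:
            return groups

# Hypothetical data: 200 subjects with simulated incomes
subjects = pd.DataFrame({"income": rng.normal(50_000, 15_000, size=200)})
subjects["group"] = draw_balanced_assignment(subjects)
print(subjects.groupby("group")["income"].mean())
```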

A good way to check that your assignment mechanism works well is to first do a dry run, aka an A/A test: you create an experimental protocol for a small sample, where both groups receive the control copy. You can then check that outcomes are reasonably similar across the two groups, as they should be. This also allows you to validate your data pipeline and collection processes, to avoid running a full-blown experiment and realizing only after the fact that data collection was defective. Really, the only downside of A/A tests is the grumbles from your impatient business partners!
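In practice, the check itself can be as simple as comparing the outcome metric across the two groups, for instance with a two-sample t-test. A minimal sketch with simulated placeholder data (in a real A/A test you would of course use the outcomes you collected):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Simulated placeholder outcomes: both groups received the control experience,
# so any difference should be pure noise.
outcomes_a = rng.normal(loc=100, scale=20, size=150)
outcomes_b = rng.normal(loc=100, scale=20, size=150)

t_stat, p_value = stats.ttest_ind(outcomes_a, outcomes_b)
print(f"A/A test p-value: {p_value:.3f}")
# A very small p-value, or an implausibly large gap in means, points to a problem
# with the assignment mechanism or the data pipeline, not to a real effect.
```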

What about non-randomized designs? Sometimes, you or your business partners will want to apply a treatment to a “pilot” unit with conditions more conducive to success (e.g., a team of top-performing employees). There are indeed situations where it makes sense to look for a best-case environment—for instance if you anticipate you won’t be able to implement the treatment across the board, or if you’re just running a preliminary proof-of-concept to demonstrate that the treatment can work and possibly improve its implementation. This can be especially relevant with field experiments, where you often have to rely on other people correctly understanding and implementing your treatment. 

Alternatively, there are simply cases where you can’t randomize, e.g., because that would be unfair or illegal. That’s why econometricians have developed complex statistical methods to design and analyze quasi-experiments such as phased rollouts (for a lighter introduction, you can also check out my book).

However, keep in mind that the effort and risks you avoid by not randomizing come at a cost: increased uncertainty, and more work to manage expectations appropriately. As the saying goes, it’s really hard to get someone to understand something when their salary depends on their not understanding it, and the pressure to implement a “successful” intervention can become overwhelming even when it’s unlikely to scale. That’s why it’s important to consider the scalability of your intervention from the get-go.

Build for Scale

Indeed, the goal of most, if not all, experiments in business is to implement the treatment at scale if it yields better outcomes. Unfortunately, it is pretty common for the implementation at scale to show disappointing results compared to the experiment. Beyond issues with randomization, there are several possible reasons for this “voltage drop”.

The most common one is that scaling up the intervention changes its execution compared to the original experiment. Maybe the training during the experiment was done in person by a member of the behavioral science team—someone intelligent and charismatic, like all behavioral scientists are—whereas at scale employees only watch a training video produced by a vendor.

A second one has to do with what economists call “spillover effects”, when the impact of a treatment on an individual is affected by the number of people treated. This is most visible when there is a limited pool of opportunities available to all. For example, providing coaching and support to an employee may increase their probability of being promoted; but as you increase the number of employees treated, they compete for the same positions and the average impact decreases. 

Finally, spillovers affect the benefit side, but they have a counterpart on the cost side: costs that increase as you scale. To continue with the coaching example, hiring one coach won’t affect the average salary of a coach; but if you attempt to hire several thousand coaches nationwide, you’ll probably have to increase your salary offers to compete with deep-pocketed companies and lure people away from other occupations.

Altogether, this means that if you’re designing a treatment in the hope of later applying it on a grand scale, you need to plan ahead of time how it will be expanded, and how much voltage drop you should anticipate. 

Conclusion: Keep Learning!

Running field experiments is, as the phrase goes, an art and a science. A strong business case, preliminary measurement and validation, proper randomization, and a scaling plan will go a long way toward ensuring a successful experiment; but I still learn something new about experimentation every week, and that’s half the fun of it. If you want to dig deeper, there are some great books out there by development economists, political scientists, and econometricians. I also like to look at the abstracts of past presentations at one of the main conferences on the topic. So, as Green Day would say, “make the best of this test and don’t ask why. It’s not a question, but a lesson learned in time”.


This article was edited by Victoria Valle Lara

Florent Buisson


Florent Buisson is a behavioral economist with 10 years of experience in business, analytics, and behavioral science. After starting the behavioral science team of one of the largest insurance companies in the US, he is now leading experimentation efforts for an eCommerce company. Florent is the author of the first book on behavioral data analysis for business (O'Reilly Media). He holds a Ph.D. in behavioral economics from the Sorbonne University in Paris.