20+ years of A|B testing

What shade of blue should your hyperlink be?

People have been using and writing about A|B testing since 1996. Yikes!

Source: http://dilbert.com/strip/2014-03-08

A|B Testing

A|B Testing Overview

A|B testing is the common name for a controlled experiment that finds the better of two variants, measured against an objective or goal. Variant A (the “control” variant) is compared with variant B (the “variation” or “treatment”) to see which one improves an outcome of interest. An older term, popular decades ago, is bucket testing: you put the visitors of a site in different buckets and show each bucket a different variation.

The statistical name for A|B testing is two-sample hypothesis testing. “Two-sample” comes from the two variations, and the hypothesis you are testing is that one variation performs better than the other.

Let’s look at some real-world examples.

  • Which one of the two variations of the on-boarding process do users complete?
  • Which advertisement layout or copy gets more click-throughs?
  • Does a hamburger menu or an explicitly labeled “Menu” get more use on a mobile app?
  • Which testimonials get more click-throughs?
  • Which landing page gets more click-throughs?
  • Which email subject line gets more opens?

Experimentation Traffic or Segment

You take the experimentation traffic (the portion of the total traffic set aside for experimentation) and direct one sample of it to the control variant (the control sample) and another sample to the variation (the variation sample). We will revisit the experimentation segment when we talk about the importance of segmentation in A|B testing.

Over enough time and traffic, you determine the better of the two variants by looking at which sample produces better results. The conversion rate for each sample in an A|B test is the ratio: count of successful conversions / size of sample.

For example, variations of testimonials were shown to the experimentation traffic of 100K users, which was 10% of all users. 52K were shown the control variant and 48K were shown the treatment variant. Further, 3,120 users from the 52K control sample clicked on the testimonial and 2,640 users from the 48K treatment sample clicked on it. The control variant converted at 3,120 / 52,000 = 6% and the treatment variant at 2,640 / 48,000 = 5.5%.
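
The arithmetic in the testimonial example can be checked with a few lines of Python. This is only a sketch; the function name conversion_rate is an illustrative choice, not something taken from a particular tool.

```python
def conversion_rate(conversions: int, sample_size: int) -> float:
    """Ratio of successful conversions to the size of the sample."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return conversions / sample_size


# Numbers from the testimonial example above.
control_rate = conversion_rate(3120, 52_000)
treatment_rate = conversion_rate(2640, 48_000)

print(f"control:   {control_rate:.1%}")
print(f"treatment: {treatment_rate:.1%}")
```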

Split-run testing, or split testing, is a 50/50 A|B test, i.e., both samples are the same size.

Objective or Goal

The criteria for testing the variations should be selected with a clear understanding of the desired outcomes. “How many people click through a link” and “average revenue per user” are two distinct objectives. The criteria most commonly used by marketers are Conversion Rate (CR), Click-through Rate (CTR), Mean Usage (time on an app, website, webpage, app screen or game scene), Average Revenue Per User (ARPU), Average Retention, Renewal Rate, Average Transactions Per User, Net Promoter Score (NPS), Customer Satisfaction Rate, Bounce Rate, Average Order Size, Net Profit, etc.


Further, always account for skewness and outliers in the data. For instance, instead of using mean time on screen, consider using a trimmed mean, trimean or Winsorized mean. In a 10% trimmed mean, the largest 10% and the smallest 10% of the values are removed and the mean is taken over the remaining 80%. A Winsorized mean is similar, except that rather than deleting the extreme values, it sets them equal to the nearest remaining value (the next largest or smallest), which preserves the count in the divisor.
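
As a rough illustration of the two definitions above, here is a minimal Python sketch. The function names and the sample screen times are made up for this example, and the cut count is simply rounded down, which is one of several reasonable conventions.

```python
from statistics import mean


def trimmed_mean(values, fraction=0.10):
    """Drop the smallest and largest `fraction` of values, then average the rest."""
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    data = sorted(values)
    k = int(len(data) * fraction)  # number of values cut from each end
    return mean(data[k:len(data) - k])


def winsorized_mean(values, fraction=0.10):
    """Clamp the extreme values to the nearest kept value, then average all of them."""
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    data = sorted(values)
    n = len(data)
    k = int(n * fraction)
    low, high = data[k], data[n - k - 1]  # nearest values that survive the cut
    return mean(min(max(v, low), high) for v in data)


# Hypothetical time-on-screen values in seconds, with one outlier.
screen_times = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

plain = mean(screen_times)
trimmed = trimmed_mean(screen_times)
winsorized = winsorized_mean(screen_times)
print(plain, trimmed, winsorized)
```

With a single outlier of 100 in the sample, the plain mean is pulled far above the typical value, while the trimmed and Winsorized means stay close to the middle of the data.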

[Figure: A|B test performed on a segment of the total population, showing 2-sample variations. Outliers are rejected when calculating the mean.]


Statistical Significance

Is the difference between the results of two variants significant enough to make a decision? You want to make a change only if it produces a meaningful change in the desired outcome, so ask whether the difference is statistically significant. Statistics can answer that question, but which statistical significance test to use depends on your objective and the sample data. For instance, Fisher’s exact test works best for Click-through Rate (CTR), and Pearson’s chi-squared test works better for “Purchase by Genre.”
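
To make the idea concrete, here is a sketch of one simple significance check applied to the testimonial numbers from earlier. Note that this is a two-proportion z-test, a large-sample approximation chosen here because it needs only the standard library; it is not Fisher’s exact test or Pearson’s chi-squared test, and the function name is an illustrative choice.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; rates are all 0% or all 100%")
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Control: 3,120 of 52,000 clicked. Treatment: 2,640 of 48,000 clicked.
z, p_value = two_proportion_z_test(3120, 52_000, 2640, 48_000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value (commonly below 0.05, a threshold you choose before the test) suggests the difference between the two samples is unlikely to be due to chance alone.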

Conversion Rate Optimization via Continuous A|B Testing

A|B testing can be used every time you want to introduce a new variant, and used continuously to test various imagery and titles against a specific objective.

Rather than waiting for the entire experiment to conclude, you can end it early to avoid showing users a poorly performing variation. Alternatively, you can reroute traffic from poorly performing variations to better performing variations. In that case you are less focused on a specific A|B test and more on optimizing for conversions.

Rerouting traffic to better performing variations can be implemented with a multi-armed bandit technique. Understand the pros and cons of this technique and figure out when it should and should not be used.
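
One of the simplest multi-armed bandit strategies is epsilon-greedy: most of the time you show the variation with the best observed conversion rate, and a small fraction of the time (epsilon) you show a random one so the others keep getting measured. The sketch below is a toy simulation under assumed conversion rates (5% and 20%, chosen only to make the effect visible); the function names and parameters are illustrative.

```python
import random


def choose_variant(shown, converted, epsilon, rng):
    """Pick a variant index: explore at random with probability epsilon, else exploit."""
    # Show every variant at least once before comparing observed rates.
    for i, n in enumerate(shown):
        if n == 0:
            return i
    if rng.random() < epsilon:
        return rng.randrange(len(shown))
    rates = [c / n for c, n in zip(converted, shown)]
    return rates.index(max(rates))


def simulate(true_rates, rounds=10_000, epsilon=0.1, seed=0):
    """Route simulated visitors one at a time and record impressions and conversions."""
    rng = random.Random(seed)
    shown = [0] * len(true_rates)
    converted = [0] * len(true_rates)
    for _ in range(rounds):
        i = choose_variant(shown, converted, epsilon, rng)
        shown[i] += 1
        if rng.random() < true_rates[i]:
            converted[i] += 1
    return shown, converted


# Assumed true conversion rates for two variations (hypothetical values).
shown, converted = simulate([0.05, 0.20])
print("impressions per variation:", shown)
```

The trade-off is that the poorly performing variation receives less traffic, so its conversion rate is estimated from a smaller sample than it would be in a fixed 50/50 split.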

Multi-sample Hypothesis Testing or A|B|N Testing

When you use one control variant and multiple variations in the same test, A|B testing may be referred to as A|B|N testing, or what I call Multi-Sample Hypothesis Testing (as opposed to A|B testing, which is two-sample hypothesis testing of one variable). For example, you could have five varying layouts in total and send 20% of the experimentation traffic to each of the five variations.
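
An even split across several variations can be sketched by hashing a user identifier into a bucket, so that the same user always sees the same layout. This is one common approach, not the only one; the experiment name, user IDs and layout names below are hypothetical.

```python
import hashlib
from collections import Counter
from typing import Sequence


def assign_variant(user_id: str, experiment: str, variants: Sequence[str]) -> str:
    """Deterministically map a user to one of the variants with roughly equal odds."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return variants[int(digest, 16) % len(variants)]


# Five layouts: one control and four variations, about 20% of traffic each.
layouts = ["control", "layout-b", "layout-c", "layout-d", "layout-e"]

counts = Counter(
    assign_variant(f"user-{n}", "layout-test", layouts) for n in range(10_000)
)
print(counts)
```

Including the experiment name in the hashed key means a user who lands in the control bucket for one experiment is not automatically in the control bucket for every other experiment.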

Multi-variate Testing

Experimenting with multiple variables simultaneously, where each variable can have two or more variations, is multi-variate testing (as opposed to A|B testing, which is two-sample hypothesis testing of one variable). The total number of variations is the product of the number of variations of each variable.

For example, you could check three button colors (red, blue, green), two button labels (“Get Pyze” and “Get Pyze Now”) and four button positions (left edge of top menu, right edge of top menu, center of top menu, middle of visible area of page). The total number of variations for the three variables in this example would be 3 x 2 x 4 = 24.
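
The 24 combinations in this example can be enumerated with the Cartesian product from Python’s standard library. The variable names are illustrative; the values come from the example above.

```python
from itertools import product

colors = ["red", "blue", "green"]
labels = ["Get Pyze", "Get Pyze Now"]
positions = [
    "left edge of top menu",
    "right edge of top menu",
    "center of top menu",
    "middle of visible area of page",
]

# Every combination of one color, one label and one position.
variations = list(product(colors, labels, positions))

print(len(variations))
for color, label, position in variations[:3]:
    print(color, "|", label, "|", position)
```

Because the count multiplies with every added variable, each variation receives a smaller share of the experimentation traffic, and the test needs more traffic or more time to reach a conclusion.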


One-size-fits-all approach vs. Segmentation

Not all customers are the same.  Why would you treat them the same way?

You cannot treat existing customers the same way as potential customers.

You cannot treat recently acquired customers the same way as customers who have been around for years.

You cannot treat mobile customers the same way as web or kiosk customers.

Who you A|B test should be predetermined. The experimentation traffic should not be drawn at random from the entire population. Rather, it should be a carefully selected segment of users.

Your best A|B test variation may work for 62% of the users. This also means it does not work for 38% of the users. When you know something about your users, personalizing content, interfaces and messaging provides much better experiences. And you always know something about visitors, even first-time visitors.

Choosing the right segment as your experimentation traffic makes A|B testing much more reliable and actionable. Automated segmentation tools make selecting the right experimentation traffic simpler.

You can use a number of segmentation techniques to pick the experimentation segment. Demographic segmentation is popular, but behavioral segmentation provides better results.

Post A|B test Analysis

Selecting the experimentation traffic or segment for A|B Testing improves experiences.

You can also repeat the analysis after the A|B test for individual segments, not just for the entire experimentation segment. Maybe weekend traffic behaves differently from weekday traffic.

As another example, when measuring for the objective “average attention span,” you may see the following results in a three-sample, single-variable test.

    • Mean Attention span for control variation:  12%
    • Mean Attention span for treatment variation 1:  10%
    • Mean Attention span for treatment variation 2:  15%

That is, treatment variation 2 is the winning variation. However, analyzing mobile visitors only may yield different results, indicating that treatment variation 1 is far better than the control variation and slightly better than treatment variation 2.

    • Mean Attention span for control variation:  6%
    • Mean Attention span for treatment variation 1:  14%
    • Mean Attention span for treatment variation 2:  13%

The marketer may decide to use treatment variation 1 for mobile visitors and treatment variation 2 for web visitors.
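
The per-segment comparison above amounts to picking the best variation within each segment rather than once overall. A small sketch, using the figures from the two lists above (the dictionary layout and the helper name winner are illustrative choices):

```python
# Mean attention span (%) per variation, for each analyzed segment.
results = {
    "all visitors": {"control": 12, "treatment 1": 10, "treatment 2": 15},
    "mobile visitors": {"control": 6, "treatment 1": 14, "treatment 2": 13},
}


def winner(scores):
    """Return the variation with the highest score in one segment."""
    return max(scores, key=scores.get)


winners = {segment: winner(scores) for segment, scores in results.items()}

for segment, variation in winners.items():
    print(f"{segment}: {variation}")
```

This only ranks the observed means; whether each gap is statistically significant within a segment still has to be checked, since segments are smaller samples than the whole experiment.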

A|B testing has been around for over two decades. Yet not everyone is getting value from it. A lot of businesses try A|B testing early and give up because it does not produce results. It takes time to collect data and analyze results: typical A|B tests span anywhere from 1 to 9 weeks. Only 14% of A|B tests produce statistically significant improvements, per ConversionXL (conversion optimization agencies do get higher results).

With the exception of large players who run thousands of A|B tests simultaneously, very few businesses account for outliers when calculating objectives, segment their users to find the right segment to A|B test, or perform meaningful post A|B test analysis using segmentation.