Reducing experiment duration with predicted control variates
In 2021, we published a blog post titled “Increasing experimentation accuracy and speed by using control variates,” describing how we reduce the variance of metrics using CUPED in our experimentation platform. This is a follow-up on how CUPED has evolved at Etsy since then. Spoiler – It’s changed a lot, decreasing our average experiment duration by 3 days!
Etsy’s mission is to Keep Commerce Human. To achieve this, we need to understand the impact each change to our platform has on our buyers' and sellers' experience. Whether that involves changing the color of the “Buy Now” button on the Etsy app or updating elements of how our algorithms rank search results, we leverage large-scale online experimentation to iterate on and improve the things we build.
However, running an experiment can be a long process. From design and setup to running the experiment and analyzing results, the entire experimentation process can take weeks to months. Experiments must run long enough to collect sufficient data for the results to be statistically significant – ensuring we can confidently attribute observed changes to the treatment, rather than random chance. On the other hand, being able to learn from an experiment quickly is a crucial step in the product development lifecycle, enabling faster improvements to Etsy. Fortunately, there are tools to reduce experiment runtime. CUPED is one of them! Variance reduction techniques like CUPED can help reduce the time to run an experiment, shortening the overall experimentation lifecycle and time to learning, as visualized below.
A recap of CUPED
CUPED is a variance reduction technique that estimates experiment outcomes with greater speed and accuracy compared to a direct comparison between control and treatment groups. In 2021, Etsy implemented CUPED (Controlled-Experiment Using Pre-Experiment Data) for key metrics like Conversion Rate (the percentage of visitors that make a purchase).
CUPED leverages historical visitor data collected before the experiment begins – for example, the number of purchases in the week prior to the experiment – to explain some of the natural variation in the outcome metric. These pre-experiment factors are used as covariates in a linear regression model to remove some of the “noise” that is not attributable to the treatment. By accounting for this variation, CUPED reduces the variance of the treatment effect estimator, increasing statistical power and improving sensitivity without introducing bias.
The CUPED correction can be conceptualized as Y_adj = Y − θ · (X − mean(X)), where Y is the in-experiment metric, X is the pre-experiment covariate, and θ = Cov(X, Y) / Var(X) is the regression coefficient that minimizes the variance of the adjusted metric.
The CUPED-adjusted metric will have a smaller variance than the original metric, as visualized below, providing more precise estimates of a mean or treatment effect.
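As a minimal sketch of the correction (in Python, using simulated data rather than real Etsy metrics):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated visitors: pre-experiment purchases (x) are correlated with
# in-experiment purchases (y). Purely illustrative, not real Etsy data.
x = rng.poisson(3.0, size=10_000).astype(float)
y = 0.8 * x + rng.normal(0.0, 1.0, size=10_000)

# theta = Cov(x, y) / Var(x) is the coefficient that minimizes the
# variance of the adjusted metric.
theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
y_adj = y - theta * (x - x.mean())

print(np.var(y), np.var(y_adj))  # the adjusted metric is less noisy
```

Note that the mean of `y_adj` equals the mean of `y`: the adjustment removes noise without biasing the estimate.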
Sample size, power, and variance are all related. Holding everything else unchanged, the smaller the variance of a metric, the smaller the sample size required to reach a desired power. Since we can reduce the variance of our metric by applying CUPED, we can achieve the same amount of power with a smaller sample size. In practice, a smaller sample size corresponds to a shorter experiment duration.
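To make the relationship concrete, here is a back-of-the-envelope sample-size calculation using the standard two-sample z-test approximation (the variances and effect size below are illustrative, not Etsy numbers):

```python
def required_n_per_arm(sigma2, delta, z_alpha=1.96, z_beta=0.84):
    """Approximate per-arm sample size for a two-sample test:
    n ~ 2 * sigma^2 * (z_alpha + z_beta)^2 / delta^2,
    at alpha = 0.05 (two-sided) and 80% power."""
    return 2 * sigma2 * (z_alpha + z_beta) ** 2 / delta ** 2

n_raw = required_n_per_arm(sigma2=1.00, delta=0.05)
n_reduced = required_n_per_arm(sigma2=0.93, delta=0.05)  # 7% less variance

# Sample size (and hence experiment duration) scales linearly with variance:
print(n_reduced / n_raw)  # ~0.93
```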
Etsy’s initial implementation of CUPED yielded an average variance reduction of 7% across all experiments, with some experiments achieving up to 30% variance reduction. Experiments that used CUPED-adjusted metrics in decision-making yielded a decision about 1 day earlier, on average. However, we’re always iterating to improve our buyers’ and sellers’ experience on Etsy, and we knew we could do even better. Enter: CUPAC.
Leveling up further with CUPAC
During our research and implementation of CUPED in 2020, scientists at DoorDash published a blog post describing a novel statistical method, building on CUPED, called Control Using Predictions as Covariate, or “CUPAC.”
When performing CUPAC, the pre-experiment data is first input into a non-linear machine learning model that captures more complex relationships than a linear model can. The non-linear model is trained to predict the outcome metric of interest – for example, if an experiment is measuring the observed Conversion Rate, the model would predict Conversion Rate. The prediction captures the impact of pre-experiment behaviors on experimental outcomes more effectively than the raw pre-experiment data does, because it encodes complex relationships in the data that linear regression alone cannot. The prediction Ŷ_pre is then used as an “ML-based covariate” in a linear regression to perform the CUPED correction: Y_adj = Y − θ · (Ŷ_pre − mean(Ŷ_pre)), with θ = Cov(Ŷ_pre, Y) / Var(Ŷ_pre).
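A dependency-free sketch of the idea follows. The “model prediction” below is a stand-in for a trained non-linear model such as gradient-boosted trees; the data and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Two pre-experiment features whose effect on the outcome is non-linear,
# so a single raw feature used linearly captures little of the signal.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = np.maximum(x1, 0.0) * x2 + rng.normal(0.0, 0.5, size=n)

# Stand-in for an ML model's prediction of y from pre-experiment data.
y_hat = np.maximum(x1, 0.0) * x2

def cuped_adjust(y, covariate):
    """Standard CUPED correction with an arbitrary covariate."""
    theta = np.cov(covariate, y)[0, 1] / np.var(covariate, ddof=1)
    return y - theta * (covariate - covariate.mean())

var_cuped = np.var(cuped_adjust(y, x1))     # linear covariate: raw feature
var_cupac = np.var(cuped_adjust(y, y_hat))  # ML-based covariate: prediction
print(var_cuped, var_cupac)
```

Because the prediction explains the non-linear signal that the raw feature cannot, the CUPAC-style adjustment removes far more variance.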
The CUPAC-adjusted outcome has an even smaller variance than the CUPED-adjusted outcome, as visualized below.
Empirically, our CUPAC-adjusted metrics showed even lower variance than their CUPED counterparts. In our initial prototype, the CUPAC-adjusted metric's variance was a further 10% smaller than that of our original CUPED estimator. Despite the added complexity, these results justified incorporating CUPAC into our experimentation pipeline. We hypothesized it would cut average experiment duration by an additional day, enabling teams to run more experiments and ship changes to Etsy faster.
Training and implementation
The first step was to train the CUPAC models to predict the ML-based covariate. We identified over 100 pre-experiment features, increasing from 3 features in CUPED, to capture more behavior prior to the experiment. Using these features, we iteratively trained and tuned the models in Vertex AI. Hyperparameters were optimized on a validation dataset to maximize the median correlation between the model’s predictions and the observed in-experiment metrics across experiments.
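That tuning objective can be expressed as a small helper (a hypothetical illustration, not Etsy's actual code):

```python
import numpy as np

def median_prediction_correlation(preds_by_experiment, actuals_by_experiment):
    """Median Pearson correlation between model predictions and observed
    in-experiment metrics, taken across experiments. Hyperparameters are
    chosen to maximize this value on the validation set."""
    corrs = [
        np.corrcoef(preds, actuals)[0, 1]
        for preds, actuals in zip(preds_by_experiment, actuals_by_experiment)
    ]
    return float(np.median(corrs))

# Toy example: perfect predictions in one experiment, perfectly
# anti-correlated predictions in another.
score = median_prediction_correlation(
    [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]],
    [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]],
)
print(score)  # median of {1.0, -1.0} -> 0.0
```

Using the median rather than the mean keeps a few unusually easy or hard experiments from dominating the selection.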
We initially trained an XGBoost model, a popular gradient-boosted tree implementation, but found that LightGBM, a similar non-linear tree-based model, was better suited to predicting the covariate. When we tested the models at scale with billions of predictions, LightGBM demonstrated both rapid training and prediction times and strong validation results.
Once the models were trained, our next challenge was to implement them at scale. Our experimentation pipeline runs batch jobs for hundreds of experiments each day. From our original implementation, we had an Airflow DAG (directed acyclic graph) to orchestrate the CUPED variance reduction pipeline, as visualized below:
We evolved this pipeline to support CUPAC by adding a batch prediction step to produce the ML-based covariate.
In the above CUPAC pipeline, we perform the following steps:
1. Calculate pre-experiment features and in-experiment data using BigQuery SQL jobs.
2. Predict ML-based covariates from the pre-experiment features with our trained LightGBM models, via parallel Dataflow jobs.
3. Perform variance reduction with a Spark job that fits a linear regression model between the ML-based covariates and in-experiment data, creating the CUPAC-adjusted metrics.
4. Apply statistical t-tests using the CUPAC-adjusted metric to calculate the treatment effect, p-value, and power of the experiment.
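The final variance-reduction and t-test steps can be sketched as follows (a minimal, dependency-free stand-in for the Spark job, with simulated rather than real data):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000

# Toy experiment: a true treatment effect of 0.1 on metric y, plus an
# ML-based covariate correlated with y. Illustrative, not Etsy's data.
covariate = rng.normal(size=2 * n)
treat = np.repeat([0, 1], n)
y = 0.1 * treat + 0.9 * covariate + rng.normal(0.0, 1.0, size=2 * n)

# Variance reduction: fit the linear regression coefficient (theta)
# and form the CUPAC-adjusted metric.
theta = np.cov(covariate, y)[0, 1] / np.var(covariate, ddof=1)
y_adj = y - theta * (covariate - covariate.mean())

def welch_t(a, b):
    """Two-sample t statistic and standard error (Welch form)."""
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return (a.mean() - b.mean()) / se, se

# t-test on the raw metric vs. the CUPAC-adjusted metric.
t_raw, se_raw = welch_t(y[treat == 1], y[treat == 0])
t_adj, se_adj = welch_t(y_adj[treat == 1], y_adj[treat == 0])
print(se_raw, se_adj)  # the adjusted test has a smaller standard error
```

The adjusted metric yields the same effect estimate in expectation, but with a smaller standard error, so the same effect reaches significance with fewer visitors.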
Impact: Shortening average experiment duration by 3 days
We measured success through variance reduction: the percent change between the variance of the metric without CUPAC and the variance of the CUPAC-adjusted metric.
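In code, this success metric is simply:

```python
def variance_reduction_pct(var_unadjusted, var_adjusted):
    """Percent reduction in variance achieved by the adjustment."""
    return 100.0 * (var_unadjusted - var_adjusted) / var_unadjusted

# e.g. a metric whose variance drops from 1.00 to 0.73 under CUPAC:
print(variance_reduction_pct(1.00, 0.73))  # ~27%
```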
The original CUPED implementation achieved 7% variance reduction, shortening overall experiment duration by almost 1 day on average. After implementing CUPAC, we observed an average variance reduction of 27% – nearly 4x that of CUPED – exceeding our early research estimates.
The additional variance reduction shortens our average experiment duration by almost 3 days. This means a 10-day experiment could conclude in only 7 days due to the ability to reach power on a smaller sample size with CUPAC. These marginal time savings allow many teams to run 10 or more additional experiments each year. That translates to more opportunities to test and faster insights into how we can deliver the best experience for our community of millions of sellers and buyers.
Notably, there was a substantial spread in variance reduction among different metrics and experiments, ranging from 2% to 77%. In the chart below, each blue bar displays the percent variance reduction for a sampled metric on an experiment.
The large range is expected, because variance reduction is influenced by several factors, such as metric definition, data accessibility, experimental design, and market characteristics. These factors affect how predictive the pre-experiment data is of the outcome metric, which in turn determines the degree of variance reduction. For example, two common experimentation metrics are Mean Visits and Purchase Rate. In the e-commerce setting, an individual's visit behavior will almost always be more stable over time than their purchasing behavior. This implies that pre-experiment data is more correlated with in-experiment data for a visit-related metric than for a purchase-related metric, so CUPAC is more effective at reducing variance in a metric like Mean Visits than in a metric like Purchase Rate.
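This intuition can be checked with a quick simulation: for a single covariate, the achievable variance reduction is roughly the squared correlation between pre- and in-experiment data (the correlation values below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

def cuped_variance_reduction(rho):
    """Fraction of variance removed by CUPED when the pre-experiment
    covariate has correlation rho with the outcome (simulated)."""
    x = rng.normal(size=n)
    y = rho * x + np.sqrt(1.0 - rho ** 2) * rng.normal(size=n)
    theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    y_adj = y - theta * (x - x.mean())
    return 1.0 - np.var(y_adj) / np.var(y)

# A stable, "visits-like" metric vs. a noisier, "purchase-like" metric;
# the reduction is approximately rho**2 in each case.
r_stable = cuped_variance_reduction(0.8)  # ~0.64
r_noisy = cuped_variance_reduction(0.3)   # ~0.09
print(r_stable, r_noisy)
```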
What’s next?
Aligned with Etsy’s culture of experimentation, we’ll continue to evolve our pipeline to be nimble and flexible based on the needs of the teams that use it.
One challenge we face is that teams use metrics tailored to specific parts of the Etsy experience – like search, recommendations, and seller features – to make decisions on their experiment results. However, our CUPAC models take significant time to train and maintain for each metric, which limits the number of CUPAC-adjusted metrics we can develop. While we continue to grow CUPAC adoption, we also encourage teams to keep using CUPED, which is more scalable and has lower maintenance costs. To that end, we plan to extend CUPED to more metrics by automatically collecting pre-experiment data based on each metric's definition to reduce noise. In tandem with our work on CUPAC, this CUPED expansion will enable teams across Etsy to benefit from variance reduction across all their team-specific metrics, not just a select few.
Despite the success of CUPED and CUPAC thus far, there remains a need to explore additional variance reduction techniques for the current metrics that leverage CUPAC. In 2024, we released research findings exploring a novel approach: Variance reduction combining pre-experiment and in-experiment data. As we look to generalize our variance reduction architecture, we expect that incorporating such techniques will continue to strengthen our experimentation platform and enable product teams to iterate more quickly.
Lastly, it is important to recognize that applying variance reduction in practice can be a never-ending race to squeeze the most noise out of these estimators. In our experience, the craft lies in finding the sweet spot between variance reduction, implementation cost, and the impact on experimentation velocity. That intersection is context-dependent, and finding it is what makes experimentation code as craft.
We hope our experience inspires you to try out variance reduction techniques and determine which one is best suited to your needs!
Acknowledgements
Thank you to Alexander Tank and Stephane Shao for their work on initial research and implementation of CUPAC. Thanks to Pablo Crespo for his research into extending our CUPAC models with more predictive features. And, thanks to Julie Beckley, Kevin Gaan, and Mary Hu for supporting and prioritizing this project.
References
A. Deng, Y. Xu, R. Kohavi, T. Walker (2013). Improving the sensitivity of online controlled experiments by utilizing pre-experiment data.
J. Li (2020). Improving Experimental Power through Control Using Predictions as Covariate (CUPAC).
Source: https://www.etsy.com/codeascraft/reducing-experiment-duration-with-predicted-control-variates