Product Rollouts in Retail: What I Learned About Stakeholder Alignment and Why A/B Testing Isn't Enough

ciciodonnell
May 10

During my time at Target, I led data science support for several high-stakes software rollouts aimed at optimizing store fulfillment operations. These weren't minor tweaks—we're talking about changes to how 1,800 stores physically moved through picking workflows, where a 1-unit improvement in Units Per Hour (UPH) translated to nearly $4 million in annual savings for ship-from-store alone.

The projects taught me two hard lessons. First, the non-technical work of aligning stakeholders and managing expectations is harder than the actual analysis. Second, retail A/B testing is fundamentally different from the clean randomized experiments you see in web/mobile product development, and if you treat it the same way, you'll get bad answers.

This is the story of what worked, what didn't, and how I'd approach it differently now.

Part One: The Stakeholder Problem

A. The Setup

Approximately 80% of American households are located within 10 miles of a Target store. In the mid-2010s, as e-commerce was coming onto the scene, Target leadership ingeniously turned this fact to their advantage and started using their retail stores as fulfillment centers.

For you, as a customer, what does this mean? When you go to Target.com and order a tube of toothpaste, that order does not go to a warehouse. Instead, the order goes to your local Target store, where an in-store employee goes out onto the salesfloor, “picks” your order, takes it to the backroom, puts it in a box, slaps a label on it, and puts it on a pallet to be picked up by the post office for last-mile delivery to your home. Alternatively, you can request Order Pickup (OPU), where you drive to the store, park in the lot, and pick up your order.

Picking efficiency (in units per hour or UPH) improves when you let orders batch together. If you wait for 10 orders to accumulate, you can route an employee through the store more efficiently than if you send them out for every single order as it arrives.

But there's a 30-minute SLA for Order Pickup. If you let too many orders build up, you risk missing the SLA and delivering late to customers. So there's a fundamental tradeoff: batch for efficiency, but not so much that you blow the SLA.
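The tradeoff can be made concrete with a toy release rule. This is a minimal sketch, not the logic Target actually shipped: the function name, the batch size of 10, and the 12-minute pick-and-stage buffer are hypothetical. Only the 30-minute SLA comes from the description above.

```python
SLA_MINUTES = 30  # Order Pickup SLA described above


def should_release_batch(order_ages_min, target_batch_size=10, pick_buffer_min=12):
    """Decide whether to send a picker out now.

    order_ages_min: minutes each waiting order has spent in the queue.
    target_batch_size and pick_buffer_min are hypothetical tuning knobs.
    """
    if not order_ages_min:
        return False
    # Efficiency trigger: enough orders have accumulated to route a good batch.
    if len(order_ages_min) >= target_batch_size:
        return True
    # SLA trigger: the oldest order must still leave time to pick and stage.
    return max(order_ages_min) >= SLA_MINUTES - pick_buffer_min


print(should_release_batch([2, 5, 7]))         # False: small batch, no SLA risk yet
print(should_release_batch([2, 5, 19]))        # True: oldest order is nearing the SLA
print(should_release_batch(list(range(10))))   # True: batch is full
```

Raising the batch size favors UPH; raising the buffer favors the SLA. The rollout question is where those two knobs should sit.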

A typical product rollout team included:

A business process owner (non-technical, responsible for operational outcomes)
Field leaders (store managers and district leads who understood ground truth)
A software manager (technical, responsible for shipping code)
Data engineers (responsible for landing data in usable form)
Data analysts (responsible for reporting and insights)
Me, as the data scientist bridging all of them

The business team wanted magic. They expected new software to simultaneously reduce labor costs and increase picking speed. The engineering side knew that wasn't how tradeoffs work. Field leaders were afraid of a poorly thought-out product release that would temporarily tank day-to-day operations. And data engineers were frustrated because data capture was always an afterthought, leaving them scrambling to build pipelines after the software had already launched.

B. The Communication Gap

The first obstacle wasn't technical. It was getting everyone in the same conceptual space.

Business owners would say things like, "We need to know if this is working within the first week." Engineers would respond, "We won't have clean data for at least a month." Field leaders would push back with, "Our stores are all different. What works in Minneapolis won't work in Miami." All true. All unhelpful without a shared framework.

The breakthrough came when I stopped trying to answer their questions directly and started facilitating "what-if" scenario exercises.

Instead of saying "here's the metric we should track," I'd ask:

"What if speed improved by 10%, but labor costs increased by 5%—would that be worth it?"
"What if we hit the 30-minute SLA 95% of the time instead of 98%—how much cost savings would justify that?"
"What specific number would make you say 'we need to roll this back immediately'?"

This forced the team to articulate their actual priorities before we had any data. It also surfaced hidden disagreements early. The business owner and the field leader often had very different thresholds for acceptable tradeoffs, and it's much better to hash that out in a conference room than in a panic three weeks into a rollout.

C. Setting Quantitative Thresholds Before Launch

The single most valuable thing I did on these projects was to get the team to commit to explicit go/no-go thresholds before the software went live.

We'd write them down. Something like this:

Roll forward if UPH improves by ≥5% with no SLA degradation
Hold and investigate if UPH improves but late orders increase by >2%
Roll back if late orders increase by >5% regardless of UPH gains
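Writing the thresholds as code is a useful test of whether they are actually unambiguous. The sketch below encodes the three example rules above. The function name is hypothetical, both inputs are assumed to be percent changes versus baseline, and the fallback for cases the written rules don't cover (say, a 3% UPH gain with no change in late orders) is my assumption, not part of the agreed thresholds.

```python
def rollout_decision(uph_change_pct, late_orders_change_pct):
    """Map observed changes (in percent vs. baseline) to a go/no-go call."""
    # Roll back takes precedence: applies regardless of UPH gains.
    if late_orders_change_pct > 5:
        return "roll back"
    # UPH improved, but late orders rose past the investigation threshold.
    if uph_change_pct > 0 and late_orders_change_pct > 2:
        return "hold and investigate"
    # Clear win: UPH up at least 5% with no SLA degradation.
    if uph_change_pct >= 5 and late_orders_change_pct <= 0:
        return "roll forward"
    # Assumption: anything the written rules do not cover defaults to a hold.
    return "hold and investigate"


print(rollout_decision(6.0, 0.0))   # roll forward
print(rollout_decision(6.0, 3.0))   # hold and investigate
print(rollout_decision(10.0, 6.0))  # roll back
```

Notice how many input combinations land in the fallback branch. Each of those is a conversation worth having with the business owner and field leaders before launch.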

Having those numbers in writing prevented the "let's wait and see" trap that kills so many rollouts. When you don't have pre-defined thresholds, every data point becomes a negotiation. Someone will always argue for more time, more stores, more data. But if you've agreed upfront that a 5% increase in late orders is unacceptable, then when it happens, the decision is already made.

This approach became the default template for how Target ran these rollouts. I kept getting pulled into new projects not because I was the best analyst, but because I'd built a process that prevented the stakeholder chaos that usually derailed these initiatives.

Part Two: The Technical Problems

A. Why A/B Testing in Retail Is Harder Than You Think

In web and mobile product development, A/B testing is relatively straightforward. You randomize users into treatment and control groups, show them different experiences, and measure the difference in conversion rates or engagement. The infrastructure is mature. The sample sizes are huge. The results are clean.

Retail doesn't work that way.

The Operational Constraints

First, you can't randomize at the individual level. Picking workflows are store-level processes. If you change the software for one employee in a store, you're effectively changing it for everyone, because they're all operating in the same system, sharing the same workload, and responding to the same batch of orders.

So you randomize at the store level. But now your sample size isn't millions of users—it's hundreds of stores. And stores aren't identical. They vary by layout, volume, staffing levels, customer demographics, and a dozen other factors that all confound your treatment effect.

Second, you can't isolate the intervention. In web A/B testing, you can show User A the blue button and User B the red button, and nothing else changes. In a store, changing the picking workflow is a bundle of process changes that all interact.

The Diff-in-Diff Solution

To handle this, we used a difference-in-differences (DiD) approach. The basic logic goes like this: instead of just comparing test vs. control after the rollout, we also compare the change in performance from before the rollout to after.

Let's say:

Control stores had UPH of 30 before, 32 after (change = +2)
Test stores had UPH of 31 before, 35 after (change = +4)

The naive comparison would say the treatment effect is 35 - 32 = 3 units. But that assumes the stores were identical to begin with, and they weren't: test stores started 1 UPH ahead of control.

The DiD estimate is (35 - 31) - (32 - 30) = 4 - 2 = 2 units. We're isolating the incremental change in test stores beyond what we'd expect from general trends.
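The same arithmetic as a few lines of Python, using the example numbers above. The function name is just for illustration.

```python
def did_estimate(test_before, test_after, control_before, control_after):
    """Difference-in-differences: change in test minus change in control."""
    return (test_after - test_before) - (control_after - control_before)


naive = 35 - 32                   # post-period comparison only
did = did_estimate(31, 35, 30, 32)

print(naive)  # 3
print(did)    # 2
```

The 1-unit gap between the two estimates is exactly the head start the test stores had before the rollout.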

This approach accounts for two things at once: fixed baseline differences between test and control stores, and time-varying factors that hit all stores alike, such as seasonality, promotional calendars, or broader operational changes. As long as those trends would have been the same in test and control stores (the "parallel trends" assumption), DiD gives you a cleaner causal estimate.

It's not perfect. The parallel trends assumption is strong, but it's a hell of a lot better than a naive mean comparison.
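One informal way to sanity-check parallel trends is to look at the pre-rollout period only and see whether the gap between test and control was stable. The sketch below fits a least-squares slope to that gap. The weekly UPH numbers are made up for illustration, the function names are hypothetical, and a flat pre-period gap is supporting evidence rather than proof that the assumption holds after launch.

```python
def slope(ys):
    """Ordinary least-squares slope of ys against time steps 0, 1, 2, ..."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den


def pre_trend_gap_slope(test_pre, control_pre):
    """Slope of the (test - control) gap across pre-rollout periods."""
    gaps = [t - c for t, c in zip(test_pre, control_pre)]
    return slope(gaps)


# Illustrative weekly mean UPH before the rollout (not real data).
parallel = pre_trend_gap_slope([30.5, 30.8, 31.0, 31.3], [29.5, 29.8, 30.0, 30.3])
diverging = pre_trend_gap_slope([30, 31, 32, 33], [30, 30, 30, 30])

print(abs(parallel) < 1e-9)  # True: gap is flat, trends look parallel
print(diverging)             # 1.0: test stores were already pulling away
```

A gap slope well away from zero means the test stores were already on a different trajectory, so a DiD estimate would credit the software with a trend that predates it.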

B. The Data Capture Problem

One recurring failure mode was leaving data pipeline work until the last minute.

The software team would spend months building the new application, then realize two weeks before launch that nobody had thought about how to capture the data needed to evaluate it. Data engineers would get a frantic request to "just pull the picking times" without any context about what data was even being generated, where it lived, or what format it was in.

I started asking the engineers, "What events are we logging? What does the data model look like? Where does this land?" I’d also ask the business teams, “What does the UI look like? What’s the typical workflow for a store employee?” Then I’d have to connect those two pieces together in the transformation logic that turned data points into interpretable human events.

The Fragmented Data Landscape

Target's on-premise infrastructure was a patchwork of APIs, Kafka streams, virtual machines, and one-off databases maintained by different teams. For this particular rollout, the engineering team was transitioning to a microservices architecture using Kafka. That meant I'd have to:

Land the JSON event stream in Hadoop
Parse the nested JSON structure
Deduplicate events (Kafka streams often "re-ping" and create duplicates)
Join across multiple event types to reconstruct a picking workflow
Transform it into a tabular dataset that analysts could actually query
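The parse, join, and transform steps above can be sketched in plain Python. Everything about the schema here is an assumption for illustration: the event types (`pick_started`, `pick_completed`), the nested `payload` field, and the field names are hypothetical, not Target's actual data model. Landing the stream in Hadoop and deduplication are left out.

```python
import json

# Hypothetical raw events, one JSON document per line.
raw_lines = [
    '{"event_type": "pick_started", "payload": {"employee_id": "E1", "item_id": "I9", "ts": 100}}',
    '{"event_type": "pick_completed", "payload": {"employee_id": "E1", "item_id": "I9", "ts": 145}}',
]


def flatten(event):
    """Lift the nested payload fields up into one flat record."""
    payload = event["payload"]
    return {
        "event_type": event["event_type"],
        "employee_id": payload["employee_id"],
        "item_id": payload["item_id"],
        "ts": payload["ts"],
    }


def build_pick_rows(lines):
    """Join start and completion events into one tabular row per pick."""
    events = [flatten(json.loads(line)) for line in lines]
    starts = {
        (e["employee_id"], e["item_id"]): e["ts"]
        for e in events
        if e["event_type"] == "pick_started"
    }
    rows = []
    for e in events:
        if e["event_type"] != "pick_completed":
            continue
        key = (e["employee_id"], e["item_id"])
        if key in starts:  # skip completions with no matching start
            rows.append({
                "employee_id": e["employee_id"],
                "item_id": e["item_id"],
                "pick_seconds": e["ts"] - starts[key],
            })
    return rows


print(build_pick_rows(raw_lines))
```

The join key is where the engineering questions and the business questions meet: you can only pick it once you know both what the software logs and what a store employee actually does between those two events.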

Event-Based Versus Time-Based Data Capture

Kafka is event-based, not time-based. Events generally arrive in the order they're generated (Kafka only guarantees ordering within a partition), but the same event can arrive more than once if the stream gets re-pinged.

Our internal tools would sometimes pull the same event twice, creating exact duplicates. I built deduplication logic that dropped exact duplicates based on a composite key (employee_id + item_id + timestamp).
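A minimal sketch of that deduplication, assuming events have already been flattened into dictionaries carrying the three key fields. The function name is illustrative; the composite key is the one described above.

```python
def dedupe_events(events):
    """Keep the first occurrence of each (employee_id, item_id, timestamp)."""
    seen = set()
    unique = []
    for event in events:
        key = (event["employee_id"], event["item_id"], event["timestamp"])
        if key in seen:
            continue  # exact duplicate from a re-pinged stream
        seen.add(key)
        unique.append(event)
    return unique


events = [
    {"employee_id": "E1", "item_id": "I9", "timestamp": 100},
    {"employee_id": "E1", "item_id": "I9", "timestamp": 100},  # duplicate
    {"employee_id": "E1", "item_id": "I9", "timestamp": 145},
]

print(len(dedupe_events(events)))  # 2
```

Keeping the first occurrence preserves arrival order, which matters when later steps reconstruct the workflow from the event sequence.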

Part Three: What I Learned

1. The real outcome is trust. Regardless of whether the product rolls out to all 1,800 stores or ends up getting pulled back, using this product rollout process improved communication, collaboration, and trust between teams.

2. Get data engineers involved at the software design stage, not the deployment stage. The marginal cost of adding one more person to a design meeting is tiny. The cost of rebuilding a data pipeline because nobody thought about logging structure is huge.

3. With today's technologies we can improve causal inference. Even with DiD, retail A/B tests are noisy. A small pilot lets you catch data quality issues, refine your metrics, and test your assumptions before you're committed to a 200-store experiment. Today, I would also add store-level fixed effects to control for persistent differences between stores (layout, volume, staffing) or stratify the randomization to ensure test and control groups are balanced on key observables. With modern data orchestration tooling, this is much easier than it was even five years ago.
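Both ideas are small in code. The sketch below is illustrative only: the store records, the `volume_tier` stratum, and the function names are hypothetical. Stratified assignment shuffles stores within each stratum and alternates arms so the groups stay balanced, and within-store demeaning is the simplest form of a store fixed effect.

```python
import random
from collections import defaultdict


def stratified_assign(stores, stratum_key, seed=0):
    """Randomize stores to test/control separately within each stratum."""
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    by_stratum = defaultdict(list)
    for store in stores:
        by_stratum[store[stratum_key]].append(store["store_id"])
    assignment = {}
    for stratum in sorted(by_stratum):
        ids = sorted(by_stratum[stratum])
        rng.shuffle(ids)
        # Alternate arms after shuffling: each stratum splits as evenly as possible.
        for i, store_id in enumerate(ids):
            assignment[store_id] = "test" if i % 2 == 0 else "control"
    return assignment


def demean_by_store(rows):
    """Subtract each store's own mean UPH (a within-store fixed effect)."""
    totals = defaultdict(list)
    for store_id, uph in rows:
        totals[store_id].append(uph)
    means = {sid: sum(v) / len(v) for sid, v in totals.items()}
    return [(sid, uph - means[sid]) for sid, uph in rows]


# Hypothetical stores: four high-volume, four low-volume.
stores = [
    {"store_id": f"S{i}", "volume_tier": "high" if i < 4 else "low"}
    for i in range(8)
]
arms = stratified_assign(stores, "volume_tier")
print(sorted(arms.items()))
print(demean_by_store([("A", 30.0), ("A", 32.0), ("B", 40.0)]))
```

With simple randomization across a few hundred stores, an unlucky draw can put most of the high-volume stores in one arm. Stratifying removes that risk by construction rather than leaving it to the analysis stage.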