Causal discovery for product analytics

Sean J. Taylor
August 1, 2024

Two years ago, I wrote about bringing more causality to analytics. My observation was that product analytics is filled with causal questions but that causal modeling and causal inference methods were not particularly popular. Since then, we have talked to hundreds of data scientists and analysts about their challenges and those conversations have further convinced me that we can usher in a big change in how we ask and answer analytics questions.

Today’s post is about the progress we have made in fulfilling that original vision: how can we leverage techniques from causal inference to learn more from our data and make better business decisions? At Motif we have some exciting developments to share which leverage GLEAM, our foundation model for event sequence data. Before we get into our solution, let me recap the problem we have been trying to solve.

Forward and reverse causal questions

In my previous post, I introduced the idea that causal analysis tasks can be categorized based on whether the cause and outcome are defined or undefined (at the time I called them “known” or “unknown”). I proposed a 2x2 framework, now shown in terms of DAGs:

The primary challenge with the three “undefined” quadrants is that they require updating our causal DAG, not estimating parameters of a DAG we have already specified. Every data team should strive to provide a more complete view of the process that generates value (and negative outcomes) and that means gradually improving causal models over time.

In our discussions with data scientists and analysts this framework has held up well in categorizing their work. I can roughly summarize our findings here (skipping quadrant 4 which I view as enabled by addressing quadrants 2 and 3):

Quadrant 1: Experimentation is the gold standard

The comfort zone for analysts is the lower-left quadrant, where cause and outcome are both defined. The task is particularly routine when experiments are run, and analysts can use A/B testing tools like Eppo and Statsig. When experiments can’t be run, some data scientists turn to weaker evidence or try to use causal inference strategies, such as synthetic control methods.

Quadrant 2: Experiments aren’t always straightforward

Many teams we have talked to identified Quadrant 2 as a pain point. It is common for organizations to struggle with tests or launches that don’t go as anticipated, leading to exploratory analysis intended to understand the story. This can become a time-consuming project because experimentation platforms generally rely on a limited set of pre-configured metrics that may not cover every possible outcome of interest.

Motif is already helping teams with these kinds of analyses. Sequence comparisons allow users to investigate the effects of launches and experiments in a more flexible and expansive way than experimentation tools, frequently yielding important insights and leading to faster turnaround on follow-up experiments.

Quadrant 3: What drives my key product outcomes?

We discovered strong demand for tackling Quadrant 3. Analysts commonly frame these tasks as finding opportunities for improving signup, retention, or revenue. Often they would like to generate ideas and help prioritize product and engineering work; other times they are searching for intermediate metrics that can be used as milestones of success. In either case, there are no standard tools or approaches available.

Most causal inference methods are tasked with measuring effects of already-hypothesized causes rather than uncovering new causes. In 2013, two titans of causal inference, Andrew Gelman and Guido Imbens, called this problem “reverse causal questions.” In that paper they share an excellent quote from Kaiser Fung summarizing the situation:

A lot of real world problems are of the reverse causality type, and it’s an embarrassment for us to ignore them… Most business problems are reverse causal… If sales amount suddenly drops, then the executives will want to know what caused the drop. By a process of elimination, one can drill down to a small set of plausible causes. This is all complex work that gives approximate answers.

Kaiser’s example introduces an additional subtlety — some reverse causal questions are about what tends to cause an outcome of interest (e.g. sales) while others are about what causes a specific change in that outcome that has been observed (e.g. a drop in sales). They are obviously related and answering either requires a method for discovering and prioritizing new causes in the causal DAG.

Discovering new causes enables product improvement

Hopefully I have convinced you that causal discovery is a task with a lot of room for improvement. Here is the hero story for how learning a new cause yields an impactful product change. There are three main phases:

Phase 1: Discover a new potential cause

In this phase, we generate and select a new hypothesis for a cause of the outcome we’re interested in. Graphically, the cause is a new node in the DAG that potentially has a directed edge toward our outcome node.

Example: In the free version of Motif, we provide example data sets but also allow users to work with data locally (in their browser) without ever transmitting data to us. We observed that users who loaded their own data were more likely to use the product again in the following week. This is a plausible cause because direct experience with familiar data may help users learn the value of the tool more quickly. However, there is a clear confounding story as well because users with higher intent are more likely to jump through the hoop of gathering and loading data.

Phase 2: Generate ideas for product changes that act on that cause

In the second phase, we use our imaginations to come up with hypothetical interventions which can act on the cause we’ve discovered. This is creative work because the node in the DAG does not exist in the current data; it is something we will build or reconfigure in order to act on the hypothesized mechanism. In causal inference terminology, we can think of this intervention as an instrument which acts on the cause we’ve identified.

Example: We brainstormed ways to reduce friction for loading data and to encourage users to do so. A variety of ideas emerged, including redesigning our data loading form and changing our onboarding dialogs to nudge users to bring their own data. We even considered removing the demo data sets as a very strong encouragement.

Phase 3: Run an experiment and validate the hypothesis

The final phase is to validate the hypotheses we generated. The improvement in the outcome relies on two things being true:

  • First stage: the product change we developed increases the probability of the hypothesized cause.
  • Second stage: the hypothesized cause has a practically significant effect on the outcome of interest.

These two stages mirror the steps in estimating an instrumental variables model, which is because this is exactly the setting we are in: we have introduced an exogenous instrument which affects the outcome of interest via the causal variable we have discovered. The hope is that the instrument is strong enough and also that the causal relationship we identified in Phase 1 is strong enough to yield benefits.
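To make the instrumental-variables framing concrete, here is a minimal two-stage least squares sketch on simulated data loosely mirroring the data-loading example; the variable names and coefficients are assumptions for illustration, not Motif’s actual estimation code.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical world mirroring the example: an onboarding nudge (instrument),
# loading your own data (cause), and retention (outcome), confounded by intent.
intent = rng.normal(size=n)                       # unobserved confounder
nudge = rng.binomial(1, 0.5, size=n)              # randomized product change (instrument)
p_load = 1 / (1 + np.exp(-(-1.0 + 1.2 * nudge + 1.5 * intent)))
loaded_data = rng.binomial(1, p_load)             # first stage: nudge raises P(load)
retained = 0.5 * loaded_data + intent + rng.normal(size=n)  # true effect is 0.5

# Naive regression of retention on loading data is biased upward by intent.
naive_slope = np.polyfit(loaded_data, retained, 1)[0]

# Two-stage least squares: regress the cause on the instrument, then the
# outcome on the predicted cause.
X1 = np.column_stack([np.ones(n), nudge])
loaded_hat = X1 @ np.linalg.lstsq(X1, loaded_data, rcond=None)[0]
X2 = np.column_stack([np.ones(n), loaded_hat])
iv_slope = np.linalg.lstsq(X2, retained, rcond=None)[0][1]

print(f"naive estimate: {naive_slope:.2f}")   # inflated by confounding
print(f"2SLS estimate:  {iv_slope:.2f}")      # close to the true 0.5
```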

Example: We launched simplified data loading in Motif. Data loading rates went up by about 100% and retention improved as a result. (This is a bit of creative storytelling: we did not actually run an experiment because we are moving pretty quickly!)

How can we reliably discover new causes?

If you agree with the process I laid out in the previous section, you might wonder how we can make that story play out faster and more reliably.

  • Phase 2 is generally the purview of PMs, designers, and engineers, who are very skilled at the task of developing and implementing useful product improvements.
  • Phase 3 is well addressed by experimentation, a reliable technology for testing causal hypotheses. Across the industry, thousands of data scientists, analysts, designers, and product managers have converged on using experiments to evaluate product changes and make decisions in a principled way. We have industry-wide conviction about this approach because it is well-designed to produce business value and solves an ongoing decision problem teams face.

That leaves us with Phase 1: the biggest opportunity to use data to improve products is to discover and prioritize hypotheses which lead to successful product changes with high probability. For analytics teams which have invested in rich instrumentation and data quality, it can be frustrating that there are no standard approaches for turning that data into value for their cross-functional partners.

In practice, the most common approach is (nebulously) exploratory data analysis (EDA) intended to help generate new insights. Indeed, providing “actionable insights” is described as the specific value proposition of BI software beyond its reporting capabilities, as if somehow the same slicing, dicing, and aggregation could do everything that’s needed.

Practitioners need a reliable procedure for generating useful hypotheses from data, but have no rigorous, scientific approach to this task. This leads to a systematic underinvestment in causal discovery work. Exploratory projects can drag on for a long time and end without strong conclusions so they are not given many resources.

How can we create a process for generating hypotheses that is as reliable as experimentation is for testing hypotheses?

Approach #1: Group by, aggregate

The most common approach to hypothesis generation (and to product analytics in general) is comparing groups of users across attributes that are usually convenient to define and measure.

  • Country
  • Device or application
  • Cohort based on signup date
  • Referral source (like search, organic, ads, etc)

Breakdowns by (slow moving or never-changing) user-level attributes are easy to create in most BI tools and explain significant variation in outcomes of interest. We can think of this breakdown as estimating a structural model:

$$\text{Outcome} = f(\text{Country}, \text{Device}, \text{Cohort})$$
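As a minimal sketch, this is what the breakdown looks like in pandas, assuming a hypothetical user-level table with those attribute columns and a binary outcome:

```python
import pandas as pd

# Hypothetical user-level table: slow-moving attributes plus an outcome flag.
users = pd.DataFrame({
    "country": ["US", "US", "BR", "BR", "DE", "DE"],
    "device":  ["ios", "web", "web", "ios", "web", "ios"],
    "cohort":  ["2024-01", "2024-02", "2024-01", "2024-02", "2024-01", "2024-02"],
    "outcome": [1, 0, 0, 1, 1, 0],
})

# "Group by, aggregate": the outcome rate (and count) within each attribute cell.
breakdown = (
    users.groupby(["country", "device", "cohort"])["outcome"]
         .agg(["mean", "count"])
         .reset_index()
)
print(breakdown)
```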

There are limitations inherent in this approach:

  1. There are only so many variables readily available for breakdowns. They are probably not particularly rich because they are unchanging user-level facts. There is a low ceiling for this type of model.
  2. The model does not predict any useful counterfactuals. Even if we observe that users in a specific cohort are more likely to sign up for a premium plan, this does not straightforwardly imply a course of action.
  3. The estimated relationships have no clear causal interpretation. For instance, we could attempt to recruit more users from a high-performing country, but it is unclear how that would scale and whether the difference would persist if we did.

Adding a time dimension can increase the richness of the structural model:

$$\text{Outcome} = f(\text{Time}, \text{Country})$$

We can sometimes spot changes in time series data in particular subgroups, suggesting a cause was introduced at a specific time, which can be a reasonable pointer for discovering a cause. However, this method can generate more questions than answers — we then have to investigate what else changed at that time and form a theory about why it may have affected only one group of users.
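A sketch of that kind of time breakdown, assuming a hypothetical event-level table with timestamp, country, and outcome columns (the file name is made up):

```python
import pandas as pd

# Hypothetical event-level data with timestamp, country, and outcome columns.
events = pd.read_parquet("events.parquet")

# Weekly outcome rate by country; a level shift in a single country at a
# specific week hints that a cause was introduced there at that time.
weekly = (
    events.assign(week=pd.to_datetime(events["timestamp"]).dt.to_period("W"))
          .groupby(["week", "country"])["outcome"]
          .mean()
          .unstack("country")
)
weekly.plot(title="Weekly outcome rate by country")  # eyeball for change points
```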

Approach #2: Behavioral correlations

A great example of gaining an insight through a behavioral correlation comes from an Amplitude case study about a valuable insight from the Calm app. They write:

Calm built a Daily Reminder feature that allowed users to set a reminder for their daily meditation session, but the reminder feature was buried deep on the Settings page of the app. Very few users, less than 1%, were finding and using reminders. To their surprise, they found an almost 3x increase in retention for users who set Daily Reminders. With such a small sample size of users, they couldn’t know whether this was a causal relationship. It could be that the power users of their app, who would have been well-retained anyways, were the ones digging into the Settings page and finding the Reminders feature.

This case study fits our causal discovery story quite well, and they later explain that they ran an experiment to confirm the generated hypothesis. The story is a bit unclear about how the behavioral correlation was found, but we can imagine the approach was to segment users into groups who used a feature and those who did not, and compare them on some key outcome (e.g. retention). The structural model is roughly:

$$\text{Retention} = f(\text{Set Daily Reminders})$$
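The naive version of that comparison is one line of analysis; here is a sketch with made-up data, where set_reminder and retained are hypothetical per-user flags:

```python
import pandas as pd

# Hypothetical per-user flags: did they set a daily reminder, and did they retain?
users = pd.DataFrame({
    "set_reminder": [1, 0, 0, 1, 0, 0, 1, 0, 0, 0],
    "retained":     [1, 0, 1, 1, 0, 0, 1, 1, 0, 0],
})

# The behavioral correlation: retention rate by whether the feature was used.
# This gap is not a causal effect; high-intent users may both find the feature
# and retain anyway, which is exactly the confounding concern discussed above.
print(users.groupby("set_reminder")["retained"].mean())
```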

This model addresses a key limitation in Approach 1: the hypothesized cause is a behavior captured by event logging rather than a more persistent user property. However, it comes with its own set of potential limitations:

  1. How did they decide to look at the correlation for this particular feature? Many apps have quite a rich set of functionality; were they going on a hunch, or did Amplitude surface this proactively from among a large set of hypotheses?
  2. Is the correlation they estimated really a causal effect? Usage of almost every feature will be correlated with retention, so what makes this particular one special? (Conveniently in the case study the correlation turns out to be unbiased 🤯)

In the next section, I explain how we address these limitations.

Approach #3: Correlations adjusted for prior behaviors

Finding correlations between behaviors and outcomes in product data is almost a given, because the following graph is often a good representation of the causal structure:

The unobserved confounder here will usually bias the estimated correlation between feature usage and outcome. The opportunity we have is to adjust for the user’s earlier behaviors, which can mitigate the bias. In an ideal world, the pre-feature usage behaviors are a perfect measure of the latent intent variable, in which case adjusting for them will block the backdoor path and eliminate the bias. Experienced students of causal inference will know that there are other possible DAGs here, including one where this adjustment increases bias, though in reasonable simulations we have run, the bias reduction is generally far larger than the amplification.
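To illustrate why adjusting for earlier behaviors helps, here is a toy simulation of the DAG above; the structure and coefficients are assumptions for illustration only, not results from real data:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Latent intent drives earlier behaviors, feature usage, and the outcome.
intent = rng.normal(size=n)                               # unobserved confounder
prior_activity = intent + rng.normal(scale=0.3, size=n)   # observed pre-feature behavior
used_feature = rng.binomial(1, 1 / (1 + np.exp(-(intent - 1))))
true_effect = 0.2
outcome = true_effect * used_feature + intent + rng.normal(size=n)

# Naive: regress outcome on feature usage alone.
X_naive = np.column_stack([np.ones(n), used_feature])
naive = np.linalg.lstsq(X_naive, outcome, rcond=None)[0][1]

# Adjusted: also condition on the observed proxy for intent (prior behaviors).
X_adj = np.column_stack([np.ones(n), used_feature, prior_activity])
adjusted = np.linalg.lstsq(X_adj, outcome, rcond=None)[0][1]

print(f"true effect:       {true_effect}")
print(f"naive estimate:    {naive:.2f}")     # inflated by the backdoor path
print(f"adjusted estimate: {adjusted:.2f}")  # much closer once the proxy is included
```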

Adjusting for pre-event behaviors automatically

The key to automatically adjusting the correlations and minimizing bias comes from the foundation models for event sequences we introduced earlier this year. Recall that these models are trained with the same self-supervised objective as generative AI models of text. So for any event (or pattern of events) we view as a cause, the model can estimate the probability of it happening at any position in the event sequence.

We can use this estimate to take events in the “treatment” group and assign them a probability of treatment (propensity score), as well as to construct a control group — positions in event sequences where the treatment could have occurred but didn’t:

In a world without the GLEAM estimates, we would have a treatment group but we would have to use a non-representative group of users as control — those who are systematically less likely to have exhibited the cause we are interested in. By balancing the propensity score distribution, we can achieve a substantial reduction in bias, mimicking as closely as possible the thought experiment where we intervene to prevent the cause from happening for some users.

The intuition for the bias reduction achieved by this procedure is captured in the following diagram:

Note that the treatment group estimate does not change from this procedure. We are constructing a more realistic control group, which will tend to move its estimate closer to the treatment group’s. We are targeting the ATT (average treatment effect on the treated) in this setting. For some treatments, we are unable to find enough users with similar propensity scores to the treatment group; in those cases the effect estimates are high variance and it is natural to ignore them.
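Here is a minimal sketch of how those scores could be turned into an ATT estimate; the propensity scores and outcomes are random placeholders standing in for the model’s output, and the trimming and weighting follow the standard propensity-weighting recipe rather than being Motif’s exact implementation:

```python
import numpy as np

rng = np.random.default_rng(2)

# Placeholder inputs: propensity scores and binary outcomes for treated positions
# (the event occurred) and candidate control positions (it could have, but didn't).
p_treated = rng.beta(5, 2, size=2_000)
y_treated = rng.binomial(1, 0.25, size=2_000)
p_control = rng.beta(2, 5, size=20_000)
y_control = rng.binomial(1, 0.10, size=20_000)

# Trim controls to the treated group's region of common support.
lo, hi = np.quantile(p_treated, [0.01, 0.99])
keep = (p_control >= lo) & (p_control <= hi)
p_c, y_c = p_control[keep], y_control[keep]

# ATT weighting: treated units keep weight 1; controls are weighted by the odds
# p / (1 - p), which reshapes the control propensity distribution to match the
# treated one.
w_c = p_c / (1 - p_c)
att = y_treated.mean() - np.average(y_c, weights=w_c)
print(f"estimated ATT: {att:.3f}")
```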

A world with automatic causal inference

As we have seen, we can use the propensity scores estimated from a transformer architecture to reduce the bias from confounding, and this can be done automatically. The assumptions aren’t always realistic, but we have a powerful tool that can adjust data for prior events in a very flexible way (without feature engineering or even training a new model). Let’s look at what it means in practice to wield this capability to generate hypotheses.

Empirical example: e-commerce activity

Since we can’t share results from customer data, we have been experimenting with a simple e-commerce data set shared for the 2021 SIGIR eCom Data Challenge (h/t Jacopo), which is one of the best public data sets we’ve been able to find for evaluating this approach. You can think of a simple web shop where users can browse items, run searches, and add items they want to purchase to their cart.

Interestingly, that SIGIR competition included an intent-prediction task: a model is shown a session containing an add-to-cart event, and it is asked to predict whether the item will be bought before the end of the session. One can think of our causal discovery task as pivoting this from a predictive task to an interpretability task — to which events can we attribute the purchase outcome? Data scientists are used to using something like partial dependence plots to show how one feature influences the prediction, and our approach adapts this concept to the sequential setting.

This figure shows the distribution of probability scores from GLEAM for each of the main events, broken down into what I think of as “factual” and “counterfactual” cases.

The differences between these distributions are a source of confounding — the sequences where the event occurred are different in various ways that we need to correct for. We can discard the non-overlapping regions and then balance the two distributions by applying weights to the control cases, making the grey distribution match the green.

Below we see the estimates of the effects of events on the user’s probability of making a purchase.

Observe the following:

  • Adjustment makes the effect estimates more conservative in all cases. The unadjusted correlations are all larger in magnitude than the adjusted ones. The effects of pageview and result_click events are close to zero after adjustment.
  • The purchase effect is mechanical because the outcome is defined to be 1 if it occurs. However, the estimate has an interesting interpretation: at the moments when users are about to purchase, the control group has about a 1 − 0.63 = 37% probability of purchasing. One can think of the correlational estimate as using the roughly 10% probability of purchasing as the baseline.
  • The adjusted estimates tend to be estimated with greater error (see the wider confidence intervals). We have made a bias-variance tradeoff, reducing bias through adjustment at the cost of some variance. The results seem favorable if we believe the adjusted estimates have strictly less bias, since the variance is not so much larger.
  • The sign of the remove treatment reverses after adjustment — removing an item from the cart has a strong positive unadjusted association but a negative effect after adjustment. The positive association has a clear confounding story: users are likely very close to purchasing if they are removing items from their carts. The negative estimate seems more plausible in this case.

Overall these estimates seem to have face validity in a world where we could nudge users to make these events happen more frequently (e.g. by changing the design of the store). Certainly they make more sense and are more conservative than the raw correlations.

Scaling up to more and more complex hypotheses

The next step is to scale up the number of hypotheses we consider — casting a wide net to find potentially useful causes. There are a variety of ways to add complexity to the hypothesis space:

  • Sequences of events: we can define patterns of events (e.g. something that could be matched in SOL) as treatments. For instance “what is the effect of using this feature twice in an hour?”
  • Accounting for properties of events: events typically have many properties which could plausibly modify their effects.
  • Contextual effects: Users in certain states (having used other features previously) have smaller/larger effects.
  • Time-variation in effects: the effects of certain events can change over time, revealing latent changes driven by product changes.
  • Dose-response effects: effects of events can accumulate, and we can estimate these curves.

Regardless of how we define the cause and its context, the machinery for estimating its treatment effect is the same, completely automatic, and relatively fast. We can estimate hundreds or thousands of effects and look at their distribution.

Strikingly, the adjustment procedure tends to result in effects which are centered at zero and relatively small, which fits our prior that most of our hypothetical treatments do not influence the outcome. There is also an asymmetry that makes sense — for rare outcomes like purchase it is unlikely to find treatments that lower the probability by a lot, but some treatments can have strong positive effects.

Applying a false discovery rate control procedure helps us select potentially interesting effects for investigation. By applying Benjamini-Yekutieli to the p-values from the estimates, we can set a false discovery rate threshold we are comfortable with. Here with over 300 treatment definitions, we set the rate to 0.1% and find 27 interesting effects to follow up on.
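A sketch of that selection step using the Benjamini-Yekutieli implementation in statsmodels; the p-values here are random stand-ins for the ones produced by the effect estimates:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(3)

# Stand-in p-values for ~300 treatment definitions: mostly null, a few real effects.
pvals = np.concatenate([rng.uniform(size=280), rng.uniform(0, 1e-6, size=27)])

# Benjamini-Yekutieli controls the false discovery rate under arbitrary dependence
# between tests, which is prudent when the treatment definitions overlap heavily.
reject, p_adjusted, _, _ = multipletests(pvals, alpha=0.001, method="fdr_by")
print(f"{reject.sum()} effects selected at a 0.1% false discovery rate")
```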

A human analyst and product manager with knowledge of the product and its instrumentation form a human-in-the-loop process to make sense of the estimates. We find that often a few effects can be immediately discarded as either obvious or not amenable to intervention; these can be incrementally filtered out, leaving more actionable hypotheses to follow up on.

Validating a new causal discovery method

In my post from two years ago, I wrote:

If you’re reading this and you’re anything like me, alarm bells are going off: what makes any of this “causal”?

And in this post I have likely triggered more alarm bells for the folks worried about multiple comparisons problems. So now there are two important reasons we should expect this method to lead to occasional false discoveries: the bias we’ve been unable to remove through adjustment, and the noise that would allow us to find something interesting by sheer chance.

I find it important to emphasize the difference between estimating effects to inform a decision or policy, and estimating effects to generate a new hypothesis about a cause-effect relationship:

  • Hypothesis testing: we experience estimation error as making the wrong decision about what to launch, leading to worse products and unhappy users.
  • Hypothesis generation: we experience estimation error as not knowing how or what to focus on in our products, leading to poor product prioritization and a “guess and check” development cycle.

My contention is that we should shift the weight in our loss function from caring mostly about false positives to caring about false negatives for cause-effect relationships. Put differently: if the status quo is using human intuition and correlations to decide what contributes to the success of our product, what kinds of mistakes could we be making? Wouldn’t it be worth seeing if we missed something through that methodology?

A healthy scientific attitude toward our products is that we face bottlenecks in both testing ideas AND generating new ideas. The strides we have made in routinizing experimentation have not yet been met with a corresponding innovation in discovery. But as of today, Motif’s new causal discovery engine is ready for testing with our customers and we’re excited to see what we can learn from your event sequence data. If you'd like to try Motif's approach on your data, reach out to us.

Acknowledgements

This has been a long project with a lot of conceptual challenges. I’d like to thank a few people who’ve given lots of valuable feedback as I fumbled through this, including: Alex Chin, Avi Feller, Dean Eckles, Cam Bruggeman, Alex Deng, and David Robinson.
