Tuesday, December 15, 2015

How to Find Unknown Unknowns

Imagine you're tracking the number of new user sign-ups by day on a website, and it looks like this:
[Figure: Hypothetical New Users by Day]
Basically random noise, right? Not exactly.

I sometimes tell people that I don't believe in randomness. As a data scientist, this earns me quite a few surprised looks. The point I try to make is that very few things in our everyday lives are truly random in the quantum mechanical sense. Even a coin flip or die roll can be reliably predicted with a high-speed camera; neither is very chaotic. In our everyday experience, most things can be reliably predicted IF we have the necessary information. The question, then, is what information is necessary.

Let's go back to that hypothetical graph of new user sign-ups by day. What would you need to know in order to accurately predict the number of new user sign-ups today? Here are just a few possibilities:
- A list of new people who visit the site
- What each of those people was looking at before they came to the site
- What each of those people was looking for before they came to the site
- Comprehensive psychological, hormonal, and metabolic profiles for each person
- What each of those people had for breakfast
...
Obviously, all this information would be rather difficult to obtain. But there's a point here: most of this is information which varies per person. We could wrap all that stuff in a black box and build a very simple approximate model like this:
number of new user sign-ups ~ (number of new people visiting site) x (probability a new visitor signs up)
The really big thing about this simple model is that the probability that a new visitor signs up should be independent from person to person. Whether or not one new visitor signs up should not have anything to do with whether a different new visitor signs up. This wouldn't be true for something like Facebook, where people sign up because their friends did, but for many kinds of sites it should be pretty accurate.
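Here's a minimal sketch of that black-box model in Python. The visitor count (8,000) and the sign-up probability (5%) are made-up values chosen only so the average lands near 400; `simulate_signups` is a hypothetical helper, not anything from a real system.

```python
import random
import statistics

def simulate_signups(n_visitors, p_signup, rng):
    """Count sign-ups when every visitor decides independently."""
    return sum(1 for _ in range(n_visitors) if rng.random() < p_signup)

rng = random.Random(0)   # fixed seed so the run is repeatable
N_VISITORS = 8000        # hypothetical number of new visitors per day
P_SIGNUP = 0.05          # hypothetical probability that a visitor signs up

# One simulated count per day, for 100 days
daily = [simulate_signups(N_VISITORS, P_SIGNUP, rng) for _ in range(100)]

print("first ten days:", daily[:10])
print("average per day:", statistics.mean(daily))
```

Each visitor gets their own random draw, which is exactly the independence assumption: nothing one visitor does affects another.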

Now we need just a little bit of math; you may have to dust off some high school stats. When counting lots of events (e.g. new user sign-ups) which occur independently over a fixed time period (e.g. 1 day), the resulting total should be Poisson-distributed. The details aren't too important; what IS important is that the standard deviation should be roughly the square root of the mean. In other words, if we're averaging around 400 new users per day, then that number should vary by about 20 new users per day. Let's take another look at that graph:
[Figure: Hypothetical New Users by Day]
Hmm. That looks like it's varying by over 100 new users per day. Just for fun, here's what it might look like if the count had a standard deviation of only 20:
[Figure: Hypothetical Counts with Standard Deviation = 20]
So where the heck is all that extra noise coming from?
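Before chasing that question, the square-root rule itself is easy to sanity-check. The sketch below draws Poisson counts for a few means and compares the simulated standard deviation with the square root of the mean. Python's standard library has no Poisson sampler, so `poisson_sample` is my own small helper using Knuth's multiplication method; the means and sample size are arbitrary choices.

```python
import math
import random
import statistics

def poisson_sample(mean, rng):
    """Draw one Poisson count using Knuth's multiplication method.

    Only suitable for means below roughly 700, because exp(-mean)
    underflows to zero for larger values.
    """
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

rng = random.Random(42)  # fixed seed so the run is repeatable
results = {}
for mean in (100, 400, 625):
    counts = [poisson_sample(mean, rng) for _ in range(1000)]
    results[mean] = statistics.stdev(counts)
    print(f"mean {mean}: sqrt(mean) = {math.sqrt(mean):.0f}, "
          f"simulated sd = {results[mean]:.1f}")
```

For a mean of 400 the square root is 20, which is where the "about 20 new users per day" figure comes from.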

These numbers are made up, but I've run into very similar problems at work. There's too much noise to be explained by independent user variables. Whatever's causing all that noise, it HAS to be correlated between different users. In one particular case just like this, looking at new user sign-ups, it turned out that our website's servers were sometimes slow to respond, which drastically decreased new user sign-ups. Because the servers were slow for everyone at the same time, some days would get way more sign-ups than other days. Once we improved the servers, new user sign-ups were much higher!
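To see how a shared cause inflates the noise, here's a rough simulation of the slow-server story. Every number in it is invented for illustration (8,000 visitors, a 6% sign-up probability on a normal day, 3% on a slow day, and a 30% chance that any given day is slow); none of it comes from the real incident.

```python
import math
import random
import statistics

def simulate_day(n_visitors, p_signup, rng):
    """Count sign-ups when every visitor decides independently."""
    return sum(1 for _ in range(n_visitors) if rng.random() < p_signup)

rng = random.Random(1)   # fixed seed so the run is repeatable
N_VISITORS = 8000        # hypothetical visitors per day
P_NORMAL = 0.06          # hypothetical sign-up probability on a normal day
P_SLOW = 0.03            # hypothetical sign-up probability on a slow day
P_SLOW_DAY = 0.3         # hypothetical chance that a given day is slow

daily = []
for _ in range(100):
    # One draw per day, shared by every visitor: this is the correlation.
    slow = rng.random() < P_SLOW_DAY
    daily.append(simulate_day(N_VISITORS, P_SLOW if slow else P_NORMAL, rng))

mean = statistics.mean(daily)
observed_sd = statistics.stdev(daily)
expected_sd = math.sqrt(mean)   # what independent users alone would give

print(f"mean = {mean:.0f}")
print(f"expected sd if users were independent = {expected_sd:.0f}")
print(f"observed sd = {observed_sd:.0f}")
```

Within any single day the visitors are still independent. The extra spread comes entirely from the one per-day draw that hits everybody at once.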

But this isn't really about one special trick for one special problem.

In project management there's a concept called "unknown unknowns". As the saying goes, there's stuff you know you don't know, and then there's stuff that you don't even realize you don't know. In general, it's the unknown unknowns which are hard to deal with. They'll sneak up and bite you, and you'll never realize it. It won't even occur to you to deal with them. It'll just seem like random noise out in the world.

Of course, if you don't believe in randomness, then nothing seems like random noise out in the world. There's always SOME chain of cause and effect. The trick is to follow the same general pattern above:
1. Explicitly write down everything you think you'd need to know to predict something (i.e. write out the known unknowns). In our new users example, the individual user characteristics we listed were all known unknowns.
2. Put together a simple model to estimate how much noise you'd expect to see from those unknowns. In our example, we used a Poisson distribution to model independent users. This is where math skills help.
3. If there's more noise out in the world than your model predicted, then you're missing something: you have successfully identified unknown unknowns.
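For count data like the sign-ups example, steps 2 and 3 can be boiled down to a few lines. This is only a sketch: `noise_check` is a hypothetical helper, the 1.5x tolerance is an arbitrary threshold I picked, and both lists of daily counts are made up. With only a handful of days the sample standard deviation is itself noisy, so treat the flag as a hint rather than a verdict.

```python
import math
import statistics

def noise_check(counts, tolerance=1.5):
    """Compare observed noise with what a Poisson model predicts.

    Returns (observed_sd, expected_sd, flagged), where flagged is True
    when the observed sd exceeds tolerance * expected sd.
    """
    mean = statistics.mean(counts)
    observed_sd = statistics.stdev(counts)
    expected_sd = math.sqrt(mean)   # Poisson: sd is about sqrt(mean)
    return observed_sd, expected_sd, observed_sd > tolerance * expected_sd

# Made-up daily counts, both averaging 400 per day
quiet = [412, 389, 401, 377, 420, 395, 408, 383, 399, 416]
noisy = [520, 310, 455, 260, 498, 402, 289, 531, 340, 395]

for name, counts in (("quiet", quiet), ("noisy", noisy)):
    observed, expected, flagged = noise_check(counts)
    print(f"{name}: observed sd {observed:.1f}, expected sd {expected:.1f}, "
          f"missing something? {flagged}")
```

The quiet series varies by about 14 per day, below the Poisson expectation of 20, so it isn't flagged. The noisy series varies by about 99 per day, so it is.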

This may seem like a lot of work just to realize that you're missing something. But as the saying goes, finding the right answer is easy. The hard part is asking the right question. As a data scientist, most of what I do at work on a day-to-day basis is look at little bits and pieces of data, then sit around thinking and staring into space, going through exactly this procedure in my head. The biggest opportunities I find usually begin as a little itch in the back of my mind, telling me there's too much noise in the data and I must be missing something. That's when I know it's time to go Seeking Questions.