As an integrator, I had to design an MES dashboard at a customer's factory for two champagne production lines. The line manager wanted to compare the scrap quantity of these two lines.
Here is what I got.
In this case, we can see that Line 1 generates the most scrap. Fine, I have my answer! The second line is the better of the two.
I then went back to him with this result, and he asked whether it was possible to split it by year, 2018 and 2019. I did, and here is what I got.
In this second case, simply adding one more criterion produces the opposite result: in 2018 it is Line 2 that generates the most scrap, and the same is true in 2019. So, which view is right? I was surprised by this phenomenon, and after a little research I discovered that it is a known paradox: Simpson's Paradox!
It describes the fact that a correlation can disappear, or even be reversed, depending on whether we consider the data as a whole or segmented by groups.
For the paradox to occur, two ingredients are needed:
First, you need a variable that influences the end result (the "group") and that is not necessarily identified at the start. This is called a confounder.
Second, the sample we are studying must be unevenly distributed across those groups.
When these two conditions are met, Simpson's paradox can happen! That is to say, because of the uneven distribution of the sample, the pooled data points to a trend that may be false, and that disappears when we analyze the data separately for each value of the confounding factor.
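To make the two ingredients concrete, here is a minimal sketch in Python. The figures are invented for illustration and are not the customer's data. It also assumes that scrap is compared as a rate (units scrapped divided by units produced), since a reversal of this kind needs a ratio rather than a raw total. The `production` table and the `scrap_rate` helper are hypothetical names.

```python
# Hypothetical figures, invented for illustration only (not the customer's data).
# Each entry: (line, year) -> (units produced, units scrapped)
production = {
    ("Line 1", 2018): (9000, 720),
    ("Line 1", 2019): (1000, 20),
    ("Line 2", 2018): (1000, 90),
    ("Line 2", 2019): (9000, 270),
}

def scrap_rate(line, year=None):
    """Scrap rate of a line, for one year or pooled over all years."""
    rows = [value for (l, y), value in production.items()
            if l == line and (year is None or y == year)]
    produced = sum(p for p, _ in rows)
    scrapped = sum(s for _, s in rows)
    return scrapped / produced

# Compare the two lines on the pooled data, then year by year.
for year in (None, 2018, 2019):
    label = "all years" if year is None else str(year)
    r1, r2 = scrap_rate("Line 1", year), scrap_rate("Line 2", year)
    worst = "Line 1" if r1 > r2 else "Line 2"
    print(f"{label}: Line 1 {r1:.1%}, Line 2 {r2:.1%} -> worst: {worst}")
```

With these made-up numbers, Line 1 looks worse on the pooled data (7.4% against 3.6%), while Line 2 is worse in both 2018 (9.0% against 8.0%) and 2019 (3.0% against 2.0%). The year acts as the confounder, and the production volumes are very unevenly split between the years.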
In my case, I encountered this paradox while handling a large amount of data in order to summarize it and extract information.
So, I looked for an explanation in the data.
This is where the confounding factor appears. In this case, the production year reveals that in 2018 we favored production on Line 1, which was expected to run faster than Line 2. But if we look at 2019, Line 2 was heavily favored over Line 1. This could be explained by work carried out on the equipment with the goal of improving the production rate.
I imagine that you can easily see the potential for manipulation behind this paradox: the aggregated figures can make you believe one thing, yet when you look at the figures in detail, the effect can disappear or reverse!
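One way to see why the production mix matters: a line's overall scrap rate is the average of its yearly rates, weighted by how much of its output was produced in each year. The sketch below uses invented figures, assumes scrap is expressed as a rate, and uses hypothetical helper names (`decompose`, `overall_rate`).

```python
# Hypothetical (units produced, units scrapped) per year, invented for illustration.
line_1 = {2018: (9000, 720), 2019: (1000, 20)}
line_2 = {2018: (1000, 90), 2019: (9000, 270)}

def decompose(line):
    """Return {year: (share of the line's total output, scrap rate that year)}."""
    total = sum(produced for produced, _ in line.values())
    return {year: (produced / total, scrapped / produced)
            for year, (produced, scrapped) in line.items()}

def overall_rate(line):
    """Overall scrap rate as the production-weighted average of yearly rates."""
    return sum(share * rate for share, rate in decompose(line).values())

for name, line in (("Line 1", line_1), ("Line 2", line_2)):
    for year, (share, rate) in decompose(line).items():
        print(f"{name} {year}: {share:.0%} of output, scrap rate {rate:.1%}")
    print(f"{name} overall: {overall_rate(line):.1%}")
```

In this made-up example, Line 1 produced 90% of its output in the year with high scrap rates, and Line 2 produced 90% of its output in the year with low scrap rates. The weights, not the lines themselves, drive the overall difference.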
How to prevent Simpson's Paradox?
My best advice is to know what you want to highlight, and in which context. The more criteria you add, the closer you get to the real picture. Otherwise, if you just want a single value such as “Which line is the best?”, keep in mind that this may not be the truth.
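As a rough illustration of this advice, the sketch below compares the answer given by the pooled data with the answer inside each group of a chosen criterion, and reports the groups that disagree. The records are invented, scrap is assumed to be measured as a rate, and `scrap_rates` and `check_reversal` are hypothetical helpers, not part of any MES product.

```python
from collections import defaultdict

# Hypothetical records, invented for illustration only.
records = [
    {"line": "Line 1", "year": 2018, "produced": 9000, "scrapped": 720},
    {"line": "Line 1", "year": 2019, "produced": 1000, "scrapped": 20},
    {"line": "Line 2", "year": 2018, "produced": 1000, "scrapped": 90},
    {"line": "Line 2", "year": 2019, "produced": 9000, "scrapped": 270},
]

def scrap_rates(rows, keys):
    """Scrap rate (scrapped / produced) per combination of the given keys."""
    totals = defaultdict(lambda: [0, 0])
    for row in rows:
        bucket = totals[tuple(row[k] for k in keys)]
        bucket[0] += row["produced"]
        bucket[1] += row["scrapped"]
    return {key: scrapped / produced
            for key, (produced, scrapped) in totals.items()}

def check_reversal(rows, criterion):
    """Compare the worst line overall with the worst line inside each group."""
    overall = scrap_rates(rows, ["line"])
    overall_worst = max(overall, key=overall.get)[0]
    detailed = scrap_rates(rows, [criterion, "line"])
    worst_per_group = {}
    for group in sorted({key[0] for key in detailed}):
        in_group = {line: rate for (g, line), rate in detailed.items() if g == group}
        worst_per_group[group] = max(in_group, key=in_group.get)
    disagreeing = [g for g, worst in worst_per_group.items() if worst != overall_worst]
    return overall_worst, worst_per_group, disagreeing

overall_worst, worst_per_group, disagreeing = check_reversal(records, "year")
print("Worst line overall:", overall_worst)
print("Worst line per year:", worst_per_group)
print("Groups contradicting the overall answer:", disagreeing)
```

A check like this only covers the criteria you think of testing. It does not replace someone who knows the process and can suggest which criteria are worth splitting on.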
When you are presented with figures, you must therefore be critical, and be particularly wary when these figures come from data analyzed a posteriori, rather than from an experimental sample that you have constructed yourself a priori.
Finally, remember that this paradox occurs when there is a highly influential hidden variable. This means that the raw numbers make little sense on their own and should be reviewed by an expert in the field, who is likely to point out the existence of such a factor. At a time when fact-checking is flourishing, we tend to be led to believe that figures are the "naked" truth. No, the naked truth does not exist, and there will always be a need for knowledgeable people to correctly interpret numbers, whether scientific, economic or medical.
"The bed is the most dangerous place in the world, this is where most people die"
Do you think this is really the truth?