We Use Comparison Studies to Fool Others...

Simpson’s Paradox says that for almost any comparison study you can find two or more subgroups of participants, together adding up to the total, in which every subgroup shows the opposite of the overall result. So if our product comes out better than a competitor’s in a comparison trial, you can usually find two or more subgroups adding up to the total where the competitor’s product beats ours in each subgroup. If I'm a member of any of the subgroups (and by definition I always will be), I can point to the results for that subgroup as evidence that I should be using the competitor's product.
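To make the "find the subgroups" step concrete, here is a minimal sketch of a brute-force search for a reversing split. The function name `find_reversing_split` and the counts (26 of 40 tired in arm A, 18 of 40 tired in arm B) are assumptions made up for illustration; they are not taken from the article or its figures.

```python
from itertools import product


def find_reversing_split(a_tired, a_total, b_tired, b_total):
    """Look for a two-way split in which arm A has the LOWER tired rate in
    both subgroups even though it has the HIGHER tired rate overall.

    Returns (subgroup_1, subgroup_2), each a dict arm -> (tired, total),
    or None if no such split exists. Hypothetical helper, for illustration.
    """
    # Compare rates by cross-multiplication to stay in exact integers.
    if a_tired * b_total <= b_tired * a_total:
        raise ValueError("expected arm A to have the higher tired rate overall")
    a_rested, b_rested = a_total - a_tired, b_total - b_tired
    ranges = (range(a_tired + 1), range(a_rested + 1),
              range(b_tired + 1), range(b_rested + 1))
    # at/ar: tired/rested patients of arm A placed in subgroup 1 (same for B).
    for at, ar, bt, br in product(*ranges):
        a1, b1 = at + ar, bt + br            # arm sizes in subgroup 1
        a2, b2 = a_total - a1, b_total - b1  # arm sizes in subgroup 2
        if 0 in (a1, b1, a2, b2):
            continue  # each subgroup needs both arms to allow a comparison
        lower_in_1 = at * b1 < bt * a1
        lower_in_2 = (a_tired - at) * b2 < (b_tired - bt) * a2
        if lower_in_1 and lower_in_2:
            return ({"A": (at, a1), "B": (bt, b1)},
                    {"A": (a_tired - at, a2), "B": (b_tired - bt, b2)})
    return None


# Made-up overall result: A is 26/40 tired, B is 18/40 tired.
split = find_reversing_split(26, 40, 18, 40)
print(split)
```

The split it returns is simply the first one the scan reaches, so it may be lopsided. The search can also come back empty: if every patient in the worse-looking arm is tired, that arm's rate is 100% in any subgroup and no split can reverse it.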

Figure 1. Smokers and Sleep I. Illustration of different sensationalist headlines for different cuts of the same data. Simply by differentiating patients by obesity, results of this smoking trial can be completely reversed.
Let's look at a case of Simpson’s Paradox with specific calculations. An excerpt of the original Washington Post newspaper article is shown at the top of Figure 1. The modified article is shown on the bottom. The bottom article uses study data developed by picking obesity as a differentiator, but could just as easily have used male/female, tall/short, young/old, watches CNN / doesn’t watch CNN, etc. There is no shortage of tags with which to slice an 80-patient, side-by-side study. Calculations are shown in Figure 2. Simply by artificially splitting the study group into two subgroups, the findings of the original article are completely reversed in both of the subgroups. Now, if you don’t smoke, you are more tired (that is, you sleep less) than if you do. Many striking examples of this paradox are found here.
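The arithmetic behind such a reversal can be sketched in a few lines of Python. The counts below are hypothetical, chosen only so that the 80 patients add up and the reversal appears; they are an assumption for illustration, not the numbers behind Figure 2 or the newspaper article.

```python
from fractions import Fraction

# Hypothetical (tired, total) counts per subgroup and arm: 80 patients in all.
counts = {
    "obese":     {"smoker": (24, 30), "non_smoker": (9, 10)},
    "non_obese": {"smoker": (2, 10),  "non_smoker": (9, 30)},
}
ARMS = ("smoker", "non_smoker")


def pooled(counts, arm):
    """Add up tired and total patients for one arm across all subgroups."""
    tired = sum(group[arm][0] for group in counts.values())
    total = sum(group[arm][1] for group in counts.values())
    return tired, total


def summarize(counts):
    """Tired rate for each arm, per subgroup and for the pooled study."""
    rates = {name: {arm: Fraction(*group[arm]) for arm in ARMS}
             for name, group in counts.items()}
    rates["all"] = {arm: Fraction(*pooled(counts, arm)) for arm in ARMS}
    return rates


rates = summarize(counts)
for group, r in rates.items():
    print(f"{group:>9}: smokers {float(r['smoker']):.0%} tired, "
          f"non-smokers {float(r['non_smoker']):.0%} tired")
```

With these made-up counts, smokers are more tired in the pooled study (26 of 40 versus 18 of 40) yet less tired within each subgroup (24 of 30 versus 9 of 10 among the obese, 2 of 10 versus 9 of 30 among the non-obese). The reversal depends on the arms being unevenly spread across the subgroups: most smokers here are obese, and obese patients are more tired whichever arm they are in.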

Simpson’s Paradox operates even if you include the entire population of the planet in the comparison. Simply multiply each of the numbers in Figure 2 by 100 million. So if you run a global study, everyone participating, then smokers who are obese will be less tired, and smokers who are not obese will be less tired, than their non-smoking counterparts. If you’re getting a good night’s sleep then you’re most likely a smoker, but only if you’re obese or not-obese.
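The scaling step works because every rate is a ratio, so a common multiplier cancels. A minimal sketch, again using hypothetical counts for an 80-patient study (an assumption for illustration, not the values in Figure 2):

```python
from fractions import Fraction

SCALE = 100_000_000  # the multiplier suggested in the text

# Hypothetical (tired, total) counts per (subgroup, arm) cell.
small = {
    ("obese", "smoker"): (24, 30),
    ("obese", "non_smoker"): (9, 10),
    ("non_obese", "smoker"): (2, 10),
    ("non_obese", "non_smoker"): (9, 30),
}


def scale(cells, k):
    """Multiply every tired count and every total by the same factor k."""
    return {key: (tired * k, total * k) for key, (tired, total) in cells.items()}


def rate_table(cells):
    """Exact tired rate for every cell."""
    return {key: Fraction(tired, total) for key, (tired, total) in cells.items()}


big = scale(small, SCALE)
participants = sum(total for _, total in big.values())

print(participants)                          # size of the scaled-up study
print(rate_table(big) == rate_table(small))  # are all rates unchanged?
```

Since each rate in the scaled table equals its counterpart in the small one, every comparison between rates, pooled or within a subgroup, comes out the same way at either scale.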

Figure 2. Smokers and Sleep II. Calculation showing how to reverse findings using Simpson's Paradox. The contention of the study is that smokers are more tired. Simply by differentiating patients by obesity (obese vs. non-obese) the results of the trial can be completely reversed. Both obese and non-obese smokers are now less tired than their non-smoking comparators.
Simpson's Paradox gives a powerful jumping-off point for discussing the use of statistics in industrial R&D. What claims are being made by the statisticians, and under what conditions are those claims valid? This matters given the weight often placed on statistical results in many industries, particularly in the life science industries. Statistical results and reasoning are still very influential on the treatments that most patients receive, from drugs (Phase III Clinical Trials) to procedures in medical facilities (Evidence-Based Medicine). It's often a numbers game.

You selected your demographics of Watches CNN versus Doesn’t Watch CNN after the fact, something the FDA would never allow.

That’s correct. They never allow it because they recognize the fallacy of mining the data looking for the results you want. So, for example, when Nitroxed found no benefit for BiXil in a general population, they wanted the FDA to approve it because they found retrospectively that African-Americans demonstrated a 43% increase in efficacy on their drug over the placebo. The FDA obliged them to go back and run an African-American-only trial to substantiate their findings.

That’s my point, so you need to go back and run a Watches CNN-only trial to prove your point. You merely did what Nitroxed did but for Watches CNN and Doesn’t Watch CNN. You need to randomly give 1/2 of the CNN-watching population the drug and the other 1/2 the placebo. Then you can look at the results.

But think about it: I didn’t just selectively pull out individuals who performed better on a drug. Let’s assume I was able to enroll every person on the planet in a comparator trial. If a drug comes out 30% better than a placebo (e.g., a sugar pill) I can still Simpsonize the results and find two subgroups where the placebo wins. For each of these subgroups, which together add up to the entire population of the planet, the placebo is better. So it looks more like your conflation of these two subgroups into one larger group is a problem of false generalization.

But that’s the magic of statistics. You’re not allowed to pick the people going into your subgroups post hoc. If you don’t randomize participation in each of the subgroups before the trial begins then the results are meaningless. What happened was that through sheer chance the CNN-watchers that received the drug in your whole-planet trial happened to be less 'responsive' to the drug, and so the placebo looked better. Run the trial with just the CNN-watchers of the world, randomized between those receiving the drug and those receiving the placebo, and the difference will likely disappear. Similarly for all the non-CNN-watchers of the world. You did for CNN-watchers what Nitroxed attempted to do for African-Americans. And you simultaneously did it for non-CNN-watchers.

If I were to run the study again, I'm sure I would get different results, but for reasons different from our current discussion. What if I were to 'reveal' to you that this planet-wide study was indeed originally targeted at CNN-watchers but we suspected the investigators were secretly fans of Fox News? I have a sheet of paper in my filing cabinet stating just that. We 'blinded' the investigators to the true nature of the study so they wouldn't bias the results. So the CNN-watchers were actually more randomized than you could achieve through more deliberate means.

I refuse to be pulled into metaphysical arguments as to whether or not original intentions influence study results. Since I have no way of knowing post hoc your original intentions or how many sheets of paper you have in your filing cabinet, I will merely point out that the number of CNN-watchers who received the drug is significantly different from the number who received the placebo. And this fact alone says to me that CNN-watchers were not your original intention. Nitroxed would have liked to argue they too had a sheet of paper in a filing cabinet.

So now you're adding yet another condition: that both 'arms' of the study have to have the same or similar numbers of patients. Seems to me that you have abandoned randomization to avoid outcomes that can prove embarrassing to the statistical discipline. Is the only way that statistics 'works' by following arbitrary preconditions designed to ensure it can't be refuted?

What I’m saying is we’re trying to extrapolate from the particular to the general. You’re still leaping to a general statement even when you have 4 billion people in your subgroup. And as mathematicians and scientists, the best we’ve been able to come up with are the preconditions and rules we find in studying random activities like rolling dice or flipping coins. Based on these analogies we are 100% certain you need a precondition of randomization in each of the subgroups and similar numbers of participants on each side of the comparison. We use the same randomization algorithms for the patients that we use for the dice games. Maybe the fundamental basis of statistics from a logical standpoint is still murky, but it works. And isn't that one of your prized beliefs: Whatever Works?

Yes, but I can’t help thinking that statisticians take credit for conclusions that would be apparent even if statistics had never been invented. Many of the supposed victories for statistics are really interocular differences: conclusions that hit you between the eyes. Simple domestication of wheat and farm animals in pre-industrial ages achieved just as much incremental improvement in yield as our modern-day statistical approaches. I'm beginning to become more sympathetic to Nitroxed's arguments. Do we have studies that show how the use of statistical reasoning improves on decision-making using other approaches?

Of course, there are dozens of studies that show how the statistical approach is extremely effective at eliminating bias in studies.

Granted, but we can eliminate bias through other means. Let me rephrase. Are there studies that show how statistics beats common sense in the absence of bias? If there were no FDA, and Nitroxed were sure it would later be sued for promoting BiXil to African-Americans under false pretenses, that would seem to me a much more reliable way to control for bias than giving the company the cover of an FDA stamp of approval with another clinical trial.

Now who's placing preconditions? Elimination of bias is a core competency of the statistics discipline. Nothing beats it. If you eliminate it as a reason for using statistics then you've largely undermined the power of statistics. Within the Nitroxed family the calculation is cold cash. They would be idiotic not to try to eliminate all bias before making their internal decisions. But when the FDA is involved, we need an objective test, one that may not give you the right answer, but that gives you an answer that cannot be easily 'gamed' by outsiders. Officials at the FDA have no way of knowing if Nitroxed, or the particular individuals involved with the BiXil case, were acting scrupulously. It's not a matter of fact, it's a matter of Cover Your Backside (aka CYA).


So, now that you know how statistics works, how would you change your approach with the FDA (Phase III clinical trials) in the pharmaceutical industry? If it's a game of chance, then how do you play the game well? Where would you look to recruit your statisticians in order to improve your odds of winning? ...we should not fool ourselves!

Further Reading