Ed: Doctors say he’s got a 50/50 chance at living.
Frank: Well there’s only a 10% chance of that
There are several motivations for choosing a topic about which to write. One is to educate others about a topic about which I am expert. Another motivation is amusement; some posts I write solely for the glee I experience in deconstructing a particular piece of nonsense. Another motivation, and the one behind this entry, is to educate me.
I hope that the process of writing this entry will help me to better understand a topic with which I have always had difficulties: statistics. I took, and promptly dropped, statistics four times in college. Once they got past the bell-shaped curve derived from flipping a coin I just could not wrap my head around the concepts presented. I think the odds are against me, but I am going to attempt, and likely fail, to discuss some aspects of statistics that I want to understand better. Or, as is more likely, learn for the umpteenth time, only to forget or confuse them in the future.
In medicine it is the p <= 0.05 that rules. If the results of the study meet that requirement the results are statistically significant. Maybe not clinically relevant or even true, but you have to have pretty good reasons not to bow before the power of a p <= 0.05. It is SIGNIFICANT, dammit.
But what does that mean? First you have to consider the null hypothesis: that two events are totally unrelated, that there is no difference between the two treatments in terms of their effect. The p value is often loosely described as the likelihood that any observed difference away from the null hypothesis is due to chance. Or, more precisely:
The probability of the observed result, plus more extreme results, if the null hypothesis were true.
So if the p value is small, say 0.05, then a difference at least as large as the one observed between two treatments would turn up by chance only 5% of the time if the treatments were truly equivalent. If the p value is ≤ 0.05 then the result is significant, the null hypothesis may be rejected, and the alternative hypothesis, that there is a difference in the treatments, might be accepted.
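Since the coin flip is the one part of statistics I did manage to follow, here is a minimal Python sketch of "the observed result, plus more extreme results" for that case. The function name binomial_p_value and the flip counts are my own illustration, not from any of the papers quoted here, and it assumes a two-sided test against a fair coin.

```python
from math import comb


def binomial_p_value(heads: int, flips: int) -> float:
    """Two-sided exact p value, null hypothesis: the coin is fair."""
    # Probability of every possible head count if the coin is fair.
    probs = [comb(flips, k) * 0.5 ** flips for k in range(flips + 1)]
    observed = probs[heads]
    # Add up the observed outcome plus every outcome at least as unlikely.
    # The small tolerance guards against floating point ties.
    return sum(p for p in probs if p <= observed * (1 + 1e-9))


# 15 heads in 20 flips: 43400 / 2**20, about 0.041, so "significant".
print(round(binomial_p_value(15, 20), 4))
# 60 heads in 100 flips lands just on the other side of 0.05.
print(round(binomial_p_value(60, 100), 4))
```

Note what the number is conditioned on: it assumes the coin is fair and asks how surprising the flips are. It does not say how likely it is that the coin is fair.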
The cut-off of significance, 0.05, it should be emphasized, is an arbitrary boundary that was established by Fisher in 1925 but has since become dogma set in rebar-reinforced cement.
And what is significant, at least as initially formulated by Fisher?
Personally, the writer prefers to set a low standard of significance at the 5 percent point … A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.
In other words, the operational meaning of a P value less than .05 was merely that one should repeat the experiment. If subsequent studies also yielded significant P values, one could conclude that the observed effects were unlikely to be the result of chance alone. So “significance” is merely that: worthy of attention in the form of meriting more experimentation, but not proof in itself.
The p value has numerous problems as a method for determining whether the null hypothesis can be rejected. There are at least 12 misconceptions about the p value, all of which are common and all of which I would have thought true once upon a time:
1. If P = .05, the null hypothesis has only a 5% chance of being true.
2. A nonsignificant difference (eg, P ≥.05) means there is no difference between groups.
3. A statistically significant finding is clinically important.
4. Studies with P values on opposite sides of .05 are conflicting.
5. Studies with the same P value provide the same evidence against the null hypothesis.
6. P =.05 means that we have observed data that would occur only 5% of the time under the null hypothesis.
7. P =.05 and P ≤ .05 mean the same thing.
8. P values are properly written as inequalities (eg, “P ≤.02” when P = .015)
9. P = .05 means that if you reject the null hypothesis, the probability of a type I error is only 5%.
10. With a P = .05 threshold for significance, the chance of a type I error will be 5%.
11. You should use a one-sided P value when you don’t care about a result in one direction, or a difference in that direction is impossible.
12. A scientific conclusion or treatment policy should be based on whether or not the P value is significant.
My head is already starting to hurt. It appears from reading writers wiser than I that the p value is a piss-poor criterion to judge biomedical results.
Above all, what the p value does not include is a measure of the quality of the study. If there is garbage in, there will be garbage out. I would wager that the most popular way to find a significant p value is subgroup analysis, the Xigris study perhaps being the most expensive example of that bad habit. Just this week I was reading an article on high dose oseltamivir for treatment of influenza and
Subanalysis of influenza B patients showed faster RNA decline rate (analysis of variance, F = 4.14; P = .05) and clearance (day 5, 80.0% vs 57.1%) with higher-dose treatment.
And I have no end of colleagues who will see that meaningless p value and up the dose of the oseltamivir. Same as it ever was. To my mind a p of 0.05 from a subgroup analysis is almost certainly random noise and clinically irrelevant.
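Why subgroup analysis is such a reliable way to find a p of 0.05 can be sketched with a little arithmetic. The Python below is only an illustration under a loudly stated assumption: the subgroups are independent and the treatment does nothing in any of them. The function name chance_of_false_positive is mine; real subgroups overlap and correlate, so the exact figures will differ.

```python
def chance_of_false_positive(n_subgroups: int, alpha: float = 0.05) -> float:
    """Chance that at least one of n independent null comparisons hits p <= alpha."""
    # Each comparison misses "significance" with probability (1 - alpha);
    # all of them missing is that value raised to the number of comparisons.
    return 1 - (1 - alpha) ** n_subgroups


for n in (1, 5, 10, 20):
    print(n, "subgroups:", round(chance_of_false_positive(n), 2))
```

Under those assumptions, slicing a negative trial into ten subgroups gives roughly a 40% chance of at least one "significant" result, and twenty subgroups pushes it past 60%.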
The most important foundational issue to appreciate is that there is no number generated by standard methods that tells us the probability that a given conclusion is right or wrong. The determinants of the truth of a knowledge claim lie in combination of evidence both within and outside a given experiment, including the plausibility and evidential support of the proposed underlying mechanism. If that mechanism is unlikely, as with homeopathy or perhaps intercessory prayer, a low P value is not going to make a treatment based on that mechanism plausible. It is a very rare single experiment that establishes proof. That recognition alone prevents many of the worst uses and abuses of the P value. The second principle is that the size of an effect matters, and that the entire confidence interval should be considered as an experiment’s result, more so than the P value or even the effect estimate.
So what’s a science-based medical practitioner to do?
If comprehending a p value gives me a headache, Bayes gives me a migraine. Bayes is evidently a superior conceptual framework for determining whether a result is ‘true’ and has none of the flaws of the p value. But Bayes also lacks the simplicity of a simple number and I prefer a simple number, especially given the volume of papers I read. The p value is a shortcut, and unfortunately an unreliable shortcut.
If you Google Bayes theorem you always get that formula, an expression of the concept that is both concise and, for a practicing clinician, eminently forgettable and impossible to apply without help. I have a Bayes calculator on my iPhone and I re-read blog entries on the topic over and over. It is still difficult to comprehend and apply, at least for me.
As a clinician, as someone who takes care of sick people for a living, and not a scientist, how do I apply Bayes?
Simplistically how valid, how true, a result might be depends in part on the prior plausibility. In a world of false positives and false negatives, it is not always so simple to determine if a positive test result makes a diagnosis likely or a treatment effective. Many of the Bayes explanation sites use cancer screening as an example and I cannot retain that example any longer than while I read it.
The problem with Bayes, although it is superior to p values, is that of pretest plausibility. Often it seems people are pulling pretest plausibility out of thin air. In the old days we would do V/Q scans to diagnose pulmonary embolism (PE), not a great test, and there was always the issue of how to interpret the result based on whether you thought, by risk factors, that the patient was likely to have had a pulmonary embolism. I always felt vaguely binary during the discussions. Either they had a PE or they didn’t. The pre-test probability didn’t matter.
But it does. And that is my problem.
I have found that the Rx-Bayes program for iOS gives a nice visual for understanding Bayes. Even a highly sensitive and specific test is close to worthless if the pretest probability is low. I deal with this most often with Lyme testing. There is virtually no Lyme in Oregon, so a positive test is far more likely to be a false positive than the real deal. It is striking how high the pretest probability has to be before even a sensitive and specific test has good reliability. And most tests have only a middling sensitivity and specificity.
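The Lyme problem can be put into a few lines of Python. This is a sketch only: the 95% sensitivity and specificity and the pretest probabilities are numbers I made up for illustration, not the performance of any real Lyme assay, and post_test_probability is a hypothetical helper name.

```python
def post_test_probability(pretest: float, sensitivity: float, specificity: float) -> float:
    """Probability of disease given a positive test (Bayes theorem)."""
    # Patients who have the disease and test positive.
    true_positives = pretest * sensitivity
    # Patients who do not have the disease but test positive anyway.
    false_positives = (1 - pretest) * (1 - specificity)
    return true_positives / (true_positives + false_positives)


# Hypothetical test: 95% sensitive, 95% specific.
for pretest in (0.001, 0.01, 0.1, 0.5):
    post = post_test_probability(pretest, 0.95, 0.95)
    print(f"pretest {pretest:.1%} -> after a positive test {post:.1%}")
```

With a pretest probability of 1 in 1,000, a positive result from this imaginary test still leaves the patient with under a 2% chance of disease; only at a pretest probability of 50% does the post-test probability reach 95%.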
The p value is so much nicer as it gives a single number rather than a range of probabilities. It is interesting to see what happens when you apply Bayes to p values:
If one starts with a chance of no effect of 50%, a result with a minimum Bayes factor of 0.15 (corresponding to a P value of 0.05) can reduce confidence in the null hypothesis to no lower than 13%. The last row in each entry turns the calculation around, showing how low initial confidence in the null hypothesis must be to result in 5% confidence after seeing the data (that is, 95% confidence in a non-null effect). With a P value of 0.05 (Bayes factor = 0.15), the prior probability of the null hypothesis must be 26% or less to allow one to conclude with 95% confidence that the null hypothesis is false. This calculation is not meant to sanctify the number “95%” in the Bayesian approach but rather to show what happens when similar benchmarks are used in the two approaches.
These tables show us what many researchers learn from experience and what statisticians have long known; that the weight of evidence against the null hypothesis is not nearly as strong as the magnitude of the P value suggests. This is the main reason that many Bayesian reanalyses of clinical trials conclude that the observed differences are not likely to be true
As I understand it as a clinician, the take home is that while a p of 0.05 or even 0.01 may be statistically significant, it is unlikely to mean the result is ‘true’, that you can reject the null hypothesis. In part this is due to the unfortunate fact that many clinical studies stink on ice.
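The numbers in the quoted passage can be reproduced in a short Python sketch. I am assuming the minimum Bayes factor is computed as exp(-z²/2), where z is the normal deviate for a two-sided p value; that form gives the quoted 0.15 for p = 0.05, but treat it as my assumption about where the figure comes from. The function names are mine.

```python
from math import exp
from statistics import NormalDist


def minimum_bayes_factor(p_two_sided: float) -> float:
    """Smallest Bayes factor for the null, assuming the exp(-z^2 / 2) form."""
    z = NormalDist().inv_cdf(1 - p_two_sided / 2)
    return exp(-z * z / 2)


def posterior_null(prior_null: float, bayes_factor: float) -> float:
    """Probability of the null after the data: posterior odds = prior odds x Bayes factor."""
    posterior_odds = prior_null / (1 - prior_null) * bayes_factor
    return posterior_odds / (1 + posterior_odds)


def prior_needed(target_posterior: float, bayes_factor: float) -> float:
    """Prior probability of the null that would end at target_posterior."""
    prior_odds = target_posterior / (1 - target_posterior) / bayes_factor
    return prior_odds / (1 + prior_odds)


print(round(minimum_bayes_factor(0.05), 2))  # about 0.15
print(round(posterior_null(0.5, 0.15), 2))   # 0.13: the null is still 13% likely
print(round(prior_needed(0.05, 0.15), 2))    # 0.26: prior needed for 95% confidence
```

Starting from a coin-toss prior, a p of 0.05 cannot push the null below about 13%, and to finish 95% sure of an effect you had to be about 74% sure before the study was run.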
The Bayes factor is a method by which the subjective prior probability is removed:
The Bayes factor is a comparison of how well two hypotheses predict the data. The hypothesis that predicts the observed data better is the one that is said to have more evidence supporting it.
How do the two values compare? A p of 0.05 corresponds to a Bayes factor of 3 to 5, considered weak evidence. The take home is that p of 0.005 is probably a better threshold for significant and 0.001 for highly significant if you really want to reject the null hypothesis. Not quite the 5 sigma criterion that CERN used to find the Higgs boson, but better and more likely to be true (true meaning the null hypothesis is unlikely).
Wallowing in the medical literature these last 30 years I have been struck by how studies wobble about the zero. Some studies of a given intervention show benefit, some do not, all with slightly different designs but all with middling p values.
My big take home from the above is to consider an intervention effective, as true, if the p is 0.005 AND has been replicated. But a p of 0.05 in a single study? Pffffftttttttt.
I think the above analysis probably excludes a big chunk of real medicine and all the topics covered by this blog as being true. I wonder, is there a SCAM intervention that has a p of 0.005, much less 0.001? Not that I can find, but I am sure the comments will correct me about this and the numerous other errors I have made in this essay.
I think I have a simple rule of thumb with a sophisticated background. As Barbie noted, math is hard.