## 5 out of 4 Americans Do Not Understand Statistics

*Ed: Doctors say he’s got a 50/50 chance at living.*

*Frank: Well there’s only a 10% chance of that.*

Naked Gun

There are several motivations for choosing a topic about which to write. One is to educate others about a topic about which I am expert. Another motivation is amusement; some posts I write solely for the glee I experience in deconstructing a particular piece of nonsense. Another motivation, and the one behind this entry, is to educate me.

I hope that the process of writing this entry will help me to better understand a topic with which I have always had difficulties: statistics. I took, and promptly dropped, statistics 4 times in college. Once they got past the bell-shaped curve derived from flipping a coin I just could not wrap my head around the concepts presented. I think the odds are against me, but I am going to attempt, and likely fail, to discuss some aspects of statistics that I want to understand better. Or, as is more likely, learn for the umpteenth time, only to be forgotten or confused in the future.

### Frequentist

In medicine it is the p <= 0.05 that rules. If the results of the study meet that requirement the results are statistically significant. Maybe not clinically relevant or even true, but you have to have pretty good reasons not to bow before the power of a p <= 0.05. It is SIGNIFICANT, dammit.

But what does that mean? First you have to consider the null hypothesis: that two events are totally unrelated, that there is no difference between the two treatments in terms of their effect. The p value is often loosely described as the likelihood that any observed difference away from the null hypothesis is due to chance. Or, more precisely:

The probability of the observed result, plus more extreme results, if the null hypothesis were true.

So if there is a small p value, 0.05, then a difference at least as large as the one observed between two treatments would turn up by chance 5% of the time if the null hypothesis were true. If the p-value is ≤ 0.05 then the result is significant and the null hypothesis may be rejected and the alternative hypothesis, that there is a difference in the treatments, might be accepted.
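Since the coin flip is the one part of statistics I managed to retain, here is a minimal sketch of that definition (the probability of the observed result, plus more extreme results, if the null hypothesis were true) applied to a coin. The function name and the 60-heads-in-100-flips example are mine, chosen purely for illustration; the null hypothesis is that the coin is fair.

```python
from math import comb


def two_sided_binomial_p(heads: int, flips: int) -> float:
    """P(a result at least as far from flips/2 as the observed one | fair coin)."""
    expected = flips / 2
    observed_gap = abs(heads - expected)
    # Count every outcome that is at least as extreme as what we saw,
    # in either direction, weighting each by the number of ways it can occur.
    extreme_ways = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - expected) >= observed_gap
    )
    return extreme_ways / 2 ** flips


p = two_sided_binomial_p(heads=60, flips=100)
print(f"p = {p:.3f}")
```

Note what the number is conditioned on: it assumes the coin is fair and asks how surprising the data are. It says nothing directly about how likely it is that the coin is fair.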

The cut-off of significance, 0.05, it should be emphasized, is an arbitrary boundary that was established by Fisher in 1925 but has since become dogma set in rebar-reinforced cement.

And what is significant, at least as initially formulated by Fisher?

Personally, the writer prefers to set a low standard of significance at the 5 percent point … A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.

In other words, the operational meaning of a P value less than .05 was merely that one should repeat the experiment. If subsequent studies also yielded significant P values, one could conclude that the observed effects were unlikely to be the result of chance alone. So “significance” is merely that: worthy of attention in the form of meriting more experimentation, but not proof in itself.

The p value has numerous problems as a method for determining whether the null hypothesis can be rejected. There are at least 12 misconceptions about the p value, all of which are common and all of which I would have thought true once upon a time:

1. If P = .05, the null hypothesis has only a 5% chance of being true.

2. A nonsignificant difference (eg, P ≥.05) means there is no difference between groups.

3. A statistically significant finding is clinically important.

4. Studies with P values on opposite sides of .05 are conflicting.

5. Studies with the same P value provide the same evidence against the null hypothesis.

6. P =.05 means that we have observed data that would occur only 5% of the time under the null hypothesis.

7. P =.05 and P ≤ .05 mean the same thing.

8. P values are properly written as inequalities (eg, “P ≤.02” when P = .015)

9. P = .05 means that if you reject the null hypothesis, the probability of a type I error is only 5%.

10. With a P = .05 threshold for significance, the chance of a type I error will be 5%.

11. You should use a one-sided P value when you don’t care about a result in one direction, or a difference in that direction is impossible.

12. A scientific conclusion or treatment policy should be based on whether or not the P value is significant.

My head is already starting to hurt. It appears from reading writers wiser than I that the p value is a piss-poor criterion to judge biomedical results.

Above all, what the p value does not include is a measure of the quality of the study. If there is garbage in, there will be garbage out. I would wager that the most popular way to find a significant p value is subgroup analysis, the Xigris study perhaps being the most expensive example of that bad habit. Just this week I was reading an article on high dose oseltamivir for treatment of influenza and

Subanalysis of influenza B patients showed faster RNA decline rate (analysis of variance, F = 4.14; P = .05) and clearance (day 5, 80.0% vs 57.1%) with higher-dose treatment.

And I have no end of colleagues who will see that meaningless p value and up the dose of the oseltamivir. Same as it ever was. To my mind most p values of 0.05 are almost certainly random noise and clinically irrelevant.

The most important foundational issue to appreciate is that there is no number generated by standard methods that tells us the probability that a given conclusion is right or wrong. The determinants of the truth of a knowledge claim lie in combination of evidence both within and outside a given experiment, including the plausibility and evidential support of the proposed underlying mechanism. If that mechanism is unlikely, as with homeopathy or perhaps intercessory prayer, a low P value is not going to make a treatment based on that mechanism plausible. It is a very rare single experiment that establishes proof. That recognition alone prevents many of the worst uses and abuses of the P value. The second principle is that the size of an effect matters, and that the entire confidence interval should be considered as an experiment’s result, more so than the P value or even the effect estimate.

So what’s a science-based medical practitioner to do?

### Bayes

If comprehending a p value gives me a headache, Bayes gives me a migraine. Bayes is evidently a superior conceptual framework for determining whether a result is ‘true’ and has none of the flaws of the p value. But Bayes also lacks the simplicity of a simple number and I prefer a simple number, especially given the volume of papers I read. The p value is a shortcut, and unfortunately an unreliable shortcut.

If you Google Bayes theorem you always get that formula, an expression of the concept that is both concise and, for a practicing clinician, eminently forgettable and impossible to apply without help. I have a Bayes calculator on my iPhone and I re-read blog entries on the topic over and over. It is still difficult to comprehend and apply, at least for me.

As a clinician, as someone who takes care of sick people for a living, and not a scientist, how do I apply Bayes?

Simplistically how valid, how true, a result might be depends in part on the prior plausibility. In a world of false positives and false negatives, it is not always so simple to determine if a positive test result makes a diagnosis likely or a treatment effective. Many of the Bayes explanation sites use cancer screening as an example and I cannot retain that example any longer than while I read it.

The problem with Bayes, although superior to p values, is that of pretest plausibility. Often it seems people are pulling pretest plausibility out of thin air. In the old days we would do V/Q scans to diagnose pulmonary embolism (PE), not a great test, and there was always the issue of how to interpret the result based on whether you thought by risk factors that the patient was likely to have had pulmonary embolism. I always felt vaguely binary during the discussions. Either they had a PE or they didn’t. The pre-test probability didn’t matter.

But it does. And that is my problem.

I have found the Rx-Bayes program for iOS gives a nice visual for understanding of Bayes. Even a highly sensitive and specific test is worthless if the pretest probability is low. I deal with this most often with Lyme testing. There is virtually no Lyme in Oregon, so a positive test is so much more likely to be a false positive than represent the real deal. It is striking how high the pretest probability has to be before even a sensitive and specific test has good reliability. And most tests have only a middling sensitivity and specificity.
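For those without the app, the same picture can be sketched in a few lines. This is only an illustration: the function name is mine, and the 95% sensitivity and 95% specificity are hypothetical numbers for a "good" test, not the characteristics of any actual Lyme assay.

```python
def positive_predictive_value(pretest: float, sensitivity: float, specificity: float) -> float:
    """Chance that a positive result is a true positive (Bayes' theorem)."""
    true_positives = pretest * sensitivity
    false_positives = (1 - pretest) * (1 - specificity)
    return true_positives / (true_positives + false_positives)


# Hypothetical test: 95% sensitive and 95% specific.
for pretest in (0.001, 0.01, 0.1, 0.5):
    ppv = positive_predictive_value(pretest, sensitivity=0.95, specificity=0.95)
    print(f"pretest probability {pretest:>5.1%} -> positive is real {ppv:.1%}")
```

With these made-up numbers a positive result is real less than 2% of the time at a pretest probability of 1 in 1000, and only reaches 95% when the pretest probability is already a coin flip.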

### Synthesis

The p value is so much nicer as it gives a single number rather than a range of probabilities. It is interesting to see what happens when you apply Bayes to p values:

If one starts with a chance of no effect of 50%, a result with a minimum Bayes factor of 0.15 (corresponding to a P value of 0.05) can reduce confidence in the null hypothesis to no lower than 13%. The last row in each entry turns the calculation around, showing how low initial confidence in the null hypothesis must be to result in 5% confidence after seeing the data (that is, 95% confidence in a non-null effect). With a P value of 0.05 (Bayes factor = 0.15), the prior probability of the null hypothesis must be 26% or less to allow one to conclude with 95% confidence that the null hypothesis is false. This calculation is not meant to sanctify the number “95%” in the Bayesian approach but rather to show what happens when similar benchmarks are used in the two approaches.

These tables show us what many researchers learn from experience and what statisticians have long known; that the weight of evidence against the null hypothesis is not nearly as strong as the magnitude of the P value suggests. This is the main reason that many Bayesian reanalyses of clinical trials conclude that the observed differences are not likely to be true
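The arithmetic behind those quoted figures is short enough to check. The sketch below is my own reconstruction, not the cited author's code: it multiplies the prior odds of the null hypothesis by the Bayes factor to get the posterior odds. The `min_bayes_factor` helper assumes the quoted 0.15 comes from the Gaussian minimum Bayes factor, exp(-z²/2), which is an assumption on my part.

```python
from math import exp
from statistics import NormalDist


def min_bayes_factor(p_two_sided: float) -> float:
    """Assumed form: smallest Bayes factor for the null under a normal model."""
    z = NormalDist().inv_cdf(1 - p_two_sided / 2)
    return exp(-z * z / 2)


def posterior_null(prior_null: float, bayes_factor: float) -> float:
    """Posterior probability of the null: posterior odds = prior odds x Bayes factor."""
    posterior_odds = prior_null / (1 - prior_null) * bayes_factor
    return posterior_odds / (1 + posterior_odds)


def prior_null_needed(target_posterior: float, bayes_factor: float) -> float:
    """Prior probability of the null that yields the target posterior."""
    prior_odds = target_posterior / (1 - target_posterior) / bayes_factor
    return prior_odds / (1 + prior_odds)


print(f"minimum Bayes factor at p = 0.05: {min_bayes_factor(0.05):.2f}")
print(f"posterior for the null from a 50% prior: {posterior_null(0.50, 0.15):.0%}")
print(f"prior needed to end at 5%: {prior_null_needed(0.05, 0.15):.0%}")
```

Under those assumptions the three lines reproduce the 0.15, 13% and 26% in the quoted passage.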

As I understand it as a clinician, the take home is that although a p of 0.05 or even 0.01 may be statistically significant, it is unlikely to mean the result is ‘true’, that you can reject the null hypothesis. In part this is due to the unfortunate fact that many clinical studies stink on ice.

Recently an article in PNAS, way over my head, discussed how Bayes, by way of the Bayes factor, and the p value can be related.

The Bayes factor is a method by which the subjective prior probability is removed:

The Bayes factor is a comparison of how well two hypotheses predict the data. The hypothesis that predicts the observed data better is the one that is said to have more evidence supporting it.

How do the two values compare? A p of 0.05 corresponds to a Bayes factor of 3 to 5, considered weak evidence. The take home is that a p of 0.005 is probably a better threshold for significant and 0.001 for highly significant if you really want to reject the null hypothesis. Not quite the 5 sigma criterion that CERN used to find the Higgs boson, but better and more likely to be true (true meaning the null hypothesis is unlikely).

Wallowing in the medical literature these last 30 years I have been struck how studies wobble about the zero. Some studies show benefit, some do not, of a given intervention, all with slightly different designs but all with middling p values.

My big take home from the above is to consider an intervention effective, as true, if the p is 0.005 AND has been replicated. But a p of 0.05 in a single study? Pffffftttttttt.

I think the above analysis probably excludes a big chunk of real medicine and all the topics covered by this blog as being true. I wonder, is there an SCAM intervention that has a p of 0.005, much less 0.001? Not that I can find, but I am sure the comments will correct me about this and the numerous other errors I have made in the essay.

I think I have a simple rule of thumb with a sophisticated background. As Barbie noted, math is hard.

The problem with aiming for lower p values is that it means running much larger trials. In order to establish with an extremely high degree of statistical certainty that you’re not just seeing random noise, you need to run an experiment on a larger group of participants. For a lot of types of research, that means that a lot of people are going to have to suffer a bad outcome – up to and including cancer recurrence, stroke, heart attack, and death. If you already believe, based on the preponderance of basic science work, clinical experience, and, perhaps, small or limited clinical trials, that a given therapy is an effective treatment for a dangerous and potentially fatal condition, but have not yet gotten a p value of less than, say, 0.01, can you in good conscience randomize people into an arm where they don’t get this treatment? What if you’ve gotten a p of 0.01, but want to hold out for 0.001, or 0.0001? What if a nearly identical therapy has already been shown to work well? What if the mechanism is so strikingly obviously helpful, like a parachute for surviving a gravitational challenge, that no randomized trials have ever been attempted? Striking the appropriate balance between the need to do right by our patients and the desire to achieve scientific certainty is difficult, and I think that ultimately there’s no single p value that will work in all circumstances.

I often feel like people argue about p values and evidence as a way to escape responsibility for exercising their judgement and their rational minds. With a good p value, one doesn’t need to justify one’s thought process, or talk about how plausible the therapy is, or the scientific basis or clinical experience supporting one’s choices. But I think we owe it to our patients to put those very things into making decisions in uncertain circumstances.

These are interesting questions. One reply I suppose is

http://cid.oxfordjournals.org/content/48/4/407.long

I fully agree with your concerns about trial size, etc. However, according to Arrowsmith, about 50% of all drug candidates that reach Phase III studies fail. And 90% of those failures are due to efficacy and/or safety problems (as opposed to business reasons, for example). In other words, even if you have a drug that works in nonhuman models and has promising results in early clinical studies (through Phase II), you *still* can’t be more than ~50% confident (on average) that it really works!

Based on that, I don’t see how to escape the conclusion that p < 0.05 is inadequate. Believe me, I don't like the implications any more than you, especially since I'm in the business of trying to develop new therapeutics. But I think it's clear that our ability to select good drugs based on science, experience, and limited clinical testing is pretty damn poor.

If you already believe, based on the preponderance of basic science work, clinical experience, and, perhaps, small or limited clinical trials, that a given therapy is an effective treatment for a dangerous and potentially fatal condition, but have not yet gotten a p value of less than, say, 0.01, can you in good conscience randomize people into an arm where they don’t get this treatment?

It occurs to me that such criteria apply to my favorite example, internal mammary ligation for angina back in the 50s. But medicine is tricky.

If I remember correctly, mammary artery ligation was used due to uncontrolled single studies or case series showing reduction in angina. Once randomized against a control it was rejected because equal reductions were seen with sham surgery. I don’t think it was a p value issue.

I agree with you. The author is still a bit fuzzy about statistics. Statistical power, the ability to identify groups that are likely to be different, depends on sample size, how many differences you want to look at at once, etc. The larger the sample, the smaller the difference between the groups needs to be for it to be detected as likely real. A tiny sample size requires huge differences; differences that look huge but are not statistically significant might go away if you were to repeat the study with a huge sample size. You can’t just eyeball two means and say they must mean something because the differences between the two groups are so big. You need to take into consideration all the other factors that affect whether effects between treatment groups are highly likely to be real and not due to chance. And yes, something can be statistically significant but the practical significance may be questionable.
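To put rough numbers on the sample size point, here is a sketch using the standard normal-approximation formula for comparing two means with equal-sized arms. The function name, the 80% power default, and the effect sizes (differences measured in standard deviations) are my own illustrative choices, not anything from the studies discussed here.

```python
from math import ceil
from statistics import NormalDist


def patients_per_arm(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate patients needed in each arm of a two-arm trial.

    effect_size is the true difference between the arms in standard deviations.
    Uses n = 2 * ((z_alpha + z_power) / effect_size)**2, a two-sided z-test approximation.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # threshold for "significant"
    z_power = std_normal.inv_cdf(power)          # margin needed to hit the power target
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


for effect_size in (0.8, 0.5, 0.2):
    print(f"difference of {effect_size} SD -> {patients_per_arm(effect_size)} per arm")

# Tightening the threshold from 0.05 to 0.005 at the same effect size:
print(patients_per_arm(0.5, alpha=0.005))
```

Halving the size of the difference you hope to detect roughly quadruples the number of patients needed, which is why small trials can only see big effects.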

Another problem is that the P chosen affects the kind of error you have. You could have the kind of error where you mistakenly identify something as having an effect when it does not OR you can mistakenly identify something as NOT having an effect when it actually does. These are called type 1 and type 2 errors. They do not operate independently from each other and your p value influences which kind of error you are more likely to fall prey to… hold out for a p=0.000001 and you will be making, big time, the error of saying that something that is effective is not.
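That trade-off can be sketched by holding the trial fixed and only moving the significance threshold. The trial below is hypothetical (63 patients per arm, a true difference of half a standard deviation), and the calculation is a normal approximation that ignores the negligible chance of a significant result in the wrong direction.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def type_2_error_rate(alpha: float, effect_size: float, n_per_arm: int) -> float:
    """Approximate chance of missing a real effect in a two-arm z-test."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # How far the real effect shifts the test statistic, in standard errors.
    shift = effect_size * sqrt(n_per_arm / 2)
    power = 1 - STD_NORMAL.cdf(z_alpha - shift)
    return 1 - power


for alpha in (0.05, 0.005, 0.000001):
    beta = type_2_error_rate(alpha, effect_size=0.5, n_per_arm=63)
    print(f"type 1 rate {alpha:g} -> type 2 rate {beta:.0%}")
```

In this made-up trial, demanding p = 0.000001 means a treatment that really works would be declared ineffective nearly every time, unless the trial is made far larger.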

They say 3% of the people use 56% of their brain

97% use just 3% and the rest goes down the drain

I’ll never know which one I am, but I’ll bet you my last dime

99% think we’re 3% 100% of the time

64% of the world’s statistics are made up right there on the spot

82.4% of people believe them whether they’re accurate statistics or not

I don’t know what you believe, but I do know there’s no doubt

I need another double shot of something 90 proof, I got too much to think about

Todd Snyder, Statistician’s Blues

****

Congratulations on your safe return from the land of boiled white food.

Anchorman: Ron Burgundy the Legend of Ron Burgundy (2004)

Brian Fantana: They’ve done studies, you know. 60% of the time, it works every time.

[cheesy grin]

Ron Burgundy: That doesn’t make sense.

I’m with Barbie.

Just skimming this gave me a headache.

I’m sure this is annoying, but I’m going to be self-centered. If you think it’s hard to understand the implications of statistics with the education of a doctor, try understanding it as a patient with the education of an artist. I have to pause and imagine crocodiles eating numbers every time I see a > or < sign.

"A result might be depends in part on the prior plausibility. In a world of false positives and false negatives, it is not always so simple to determine if a positive test result makes diagnosis likely or a treatment effective. Many of the Bayes explanation sites uses cancer screening as an example and I cannot retain that example any longer than while I read it."

Here's a testing issue I'd love an explanation for. I will use anti-nuclear antibody (ANA) testing as an example (because it's all about me). One common thing I have heard from doctors is that a person with Lupus is, statistically, highly likely to have a positive ANA test (the test is 95%ish sensitive). But up to 30%ish (depending upon how you split them up by age or sex) of healthy people have a positive ANA test. Therefore (they say) the majority of people with a positive ANA are healthy (do not have an autoimmune disease).

Is this true? I don't get how 70% of the people with a positive test (who, I guess are not healthy) aren't in the majority?

The other thing that I've experienced with the application of prior plausibility in testing is this. One doctor might say to me, "Well mouse, your symptoms suggest an auto-immune disease. We will run an ANA test." (a few tests later) "Well, your ANA test is high positive, but your dsDNA test (sensitivity for SLE 57.3% – specificity 97.4%) is negative. You are not sick enough to have Lupus, therefore congratulations, you are well!"

If I'm not sick enough to have a disease, then why the heck run the tests for that disease?

I'm not working with these doctors anymore, mostly because my other doctors seem to have different interpretations of the same test results and their approach seems to help me more.

But I don't actually know which set of doctors is correct. (And I suspect that's because I don't get statistics). Or if there even is a correct. :p

Barbie was right.

I’ll take this one! (I’m a statistics teacher, and this is like my favorite conditional probability problem EVER.)

If you do an ANA test looking for lupus, there are 4 possible situations.

1) No lupus, negative test (true negative)

2) No lupus, positive test (false positive)

3) Patient has lupus, negative test (false negative)

4) Patient has lupus, positive test (true positive)

Working the probabilities requires a reasonable estimate of the prevalence of lupus in the population. On the basis of a quick Google and to make math easy, I’m going to go with 1 per 1,000 people. So, 999 per 1000 don’t.

So, let’s say I want to round up EVERYONE and test them for lupus. After all, lupus is dangerous, people who have it need to be diagnosed and treated.

Probability of a true negative is the chance that a randomly selected person does not have lupus, times the probability that a healthy person has a negative ANA.

(999/1000)*(.70) = .6993

Probability of false positive is (999/1000)*(.30) = .2997, roughly 30%.

Probability of true positive (1/1000)*(.95) = .00095, roughly 1%.

Probability of a false positive is (1/1000)*(.05)= .00005, way crazy small.

So, if you have a negative test, it is almost certain that you don’t have lupus. But if a randomly selected person has a positive test, false positive is a much more likely explanation than actual lupus. In particular, .2997/(.2997+.00095) = 99.7% of positives will be false positives!

This is why you shouldn’t test randomly selected people for lupus. (We do test random people for drugs, TB, certain cancers, and other things, though.)

Here’s where Bayes comes into it: If you’re testing a 40-year old woman with joint pain and a butterfly rash, the prior plausibility of lupus is quite high. Hence, if it comes back positive, it’s MUCH more likely that the patient actually has lupus. (I’m not a rheumatologist, I don’t have the data to generate a specific probability.)

When you test the entire population for a disease, false positives pop out all over the place. When a doctor tests one particular patient because the doctor thinks he might have it, then true positives are more common.
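The four-way breakdown is easy to script. This is only a sketch of the screening scenario with the round numbers used in this comment (prevalence 1 in 1,000, 95% sensitivity, 30% of healthy people testing positive); the function and variable names are made up for illustration.

```python
def screening_outcomes(prevalence: float, sensitivity: float, false_positive_rate: float) -> dict:
    """Probability that a randomly tested person lands in each of the four cells."""
    return {
        "no lupus, negative test (true negative)": (1 - prevalence) * (1 - false_positive_rate),
        "no lupus, positive test (false positive)": (1 - prevalence) * false_positive_rate,
        "lupus, negative test (false negative)": prevalence * (1 - sensitivity),
        "lupus, positive test (true positive)": prevalence * sensitivity,
    }


cells = screening_outcomes(prevalence=1 / 1000, sensitivity=0.95, false_positive_rate=0.30)
for outcome, probability in cells.items():
    print(f"{outcome}: {probability:.5f}")

false_pos = cells["no lupus, positive test (false positive)"]
true_pos = cells["lupus, positive test (true positive)"]
share_false = false_pos / (false_pos + true_pos)
print(f"share of positives that are false: {share_false:.1%}")
```

The four cells add up to 1, and raising `prevalence` to reflect a patient with suggestive symptoms is all it takes to see the share of false positives fall.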

Yes, it’s the famous one that doctors get wrong so often.

I forget what the classic link is.

We now have meta-analysis (2004): http://qjmed.oxfordjournals.org/content/97/1/53.full

Fun quote (PPV=positive predictive value):

“In one study from Australia, 13 of 50 (26%) physicians stated that they could describe PPV, although on direct interviewing only one could actually illustrate it with an example.”

The money quote, saying that it might matter.

“Physician innumeracy remains an impediment in popularizing EBM. Inattention to pre-test probability, and inability to assess the PPV accurately, could result in increased anxiety in patients by generating unnecessary tests and consultations.”

. . . “could result in patient anxiety” when the patient realizes that the doctor is innumerate.

…actually it’s more like 0.1% probability of true positive if you test the whole population – and also, under a “test the whole population” circumstance, if you get a negative test, it’s something like 99.993% likely to be a true negative.

It cuts both ways: as you reduce the tested population, a positive is more likely to be a true result, and a negative is more likely to be a false result. If, out of that 1000 people, 10 of them have symptoms that make it a good idea to give the ANA test, and one of those 10 is the one from the original group who actually has lupus, now you’ve got:

0.1 * 0.95 = 9.5% probability of a true positive

0.1 * 0.05 = 0.5% probability of a false negative

0.9 * 0.7 = 63% probability of a true negative

0.9 * 0.3 = 27% probability of a false positive

So your “positive” test is now about 25% likely to be real (a much better scenario for sending people through to more specific but maybe more expensive or painful testing), but you’re up to nearly 1% of negative results being false. There are always cracks to fall through…

cc prof – I’m reading this and there is a glimmer of hope that I might understand it…but

“Probability of false positive is (999/1000)*(.30) = .2997, roughly 30%.

Probability of true positive (1/1000)*(.95) = .00095, roughly 1%.

Probability of a false positive is (1/1000)*(.05)= .00005, way crazy small.”

You meant probability of false *negative* is 1/1000=crazy small – right?

Yes, sorry, that was a typo. And you caught it. So I think you get it!

Bing! Actually I DO get it. My mistake has been that I have been thinking that the false positive rate of 30% was a percentage of all positives, not all tested. So – I thought that the remaining 70% were true positives. The remaining 70% are actually true NEGATIVES. (F*%ck!) *

I did understand that SLE is a low probability in the general population. I just thought that when you tested only people with a high suspicion of auto-immune disease that a positive ANA was true 70% of the time and false 30% of the time. In fact, I thought that’s how it was explained to me (by the dermatologist). That there was a 30% chance that I didn’t have an auto-immune disease. But then the negative dsDNA (also negative RF, SS, SSA, SSB, and scl-70) showed that I was in that 30%. But the fact that my C3 and C4 were mildly low along with the other stuff suggested I was in the 70%. Urgh.

I think I was confusing the statistic for the general population screening and some other statistic for diagnostic testing, that I don’t have numbers for.

It’s a rabbit hole. (or maybe a honey badger) I probably shouldn’t have brought it up.

But at least I do get my original error. Thanks!

And sorry for TMI – sometimes commenting helps me think (although probably it should be the other way around).

*I’m not yelling at you, just a bit frustrated at how far off I was.

“Here’s where Bayes comes into it: If you’re testing a 40-year old woman with joint pain and a butterfly rash, the prior plausibility of lupus is quite high. Hence, if it comes back positive, it’s MUCH more likely that the patient actually has lupus. (I’m not a rheumatologist, I don’t have the data to generate a specific probability.)”

Okay, I think I’m getting it. Oh wait I lost it again….

My ANA test(s) weren’t done as screening. They were done because I had symptoms. The first time (in my early 30′s) it was bilateral pain in hands, elbows and feet and bouts of flu like symptoms followed by fatigue. No face rash, though. The second test was ordered by a dermatologist when I was 47 (1 and a half years ago). Because of an intermittent rash he said may be a heliotrope rash, plus prickling/itching skin, eye irritation, fatigue, shortness of breath, mild anemia, raynaud’s that had developed two years prior….(yada yada, sorry). The doctors were looking at SLE, Sjogren’s, dermatopolymyositis and scleroderma*.

Wouldn’t the statistics on false positive/negative of the general population be irrelevant, if they are doing the test because of a constellation of signs and symptoms related to auto-immune disease? Shouldn’t they use a different set of statistics that are based on a group of people that have those symptoms? But I guess that’s not practical? Particularly since the symptoms are so non-specific.

*Really? Medical folks? who comes up with these unspellable disease names?

Yes, it’s important to differentiate between screening tests and diagnostic tests. Example: rheumatologists know that 80% of patients with rheumatoid arthritis have a positive test for rheumatoid factor and that a smaller number of patients with other diseases also have a positive test. They try to take all such factors into account. No one would consider doing a screening test on the entire population for RF, and it is not diagnostic of RA, but in patients with symptoms compatible with RA, it can help the rheumatologist by raising or lowering the probability of a diagnosis post-test.

Not sure about unspellable disease names, but let me take another crack at the “positive predictive value” calculations, which as has been noted, are misunderstood by many, including physicians.

For some it’s easier to understand these calculations when done not as probabilities, but as frequencies and relative frequencies. (See Gerd Gigerenzer’s books and papers for lots about this.) This applies both to how the sensitivity and specificity are presented as well as to the calculations themselves. Since we already have started with the former as percentages, I’ll stick with that, but will reason not with probabilities, but with frequencies.

So imagine that there are 10,000 people who will be tested, all from the general population, where the prevalence of lupus is 1 out of 1000.

We’d expect that 10 of these 10,000 people would have lupus, and the remaining 9,990 would not.

Let’s think about each of these groups separately:

Of the 10 who have lupus, we’d expect 95%, so about 9 or 10, to have a positive test result. To be concrete, let’s say we’d expect 9 to have a positive result and 1 to have a negative result.

Of the 9,990 who don’t have lupus, we’d expect 30%, so a bit less than 3,000, to have a positive test result, and we’d expect the remaining 70% to have a negative test result.

Let’s look at the positives now: We have about 9+3000 of these, of whom 9 actually have lupus, and the remaining 3,000 do not. So of the 3,009 who test positive, only 9 have lupus, leading to a positive predictive rate of

9/3009

which is about 0.003 or 0.3%.

Now think about screening versus diagnostic testing. (I hope I’ve got the terminology right. I’m a statistics professor, not a doctor!)

For diagnostic testing we’d expect a much higher prevalence. Let’s suppose for illustration this is 1 in 20 rather than 1 in 1000. (I have no idea whether this is a reasonable prevalence in this context, but the ideas are the same in any case.)

Think again about giving the test to 10,000 people. Now we’d expect about 500 (1 in 20) to have lupus, and the remaining 9,500 not to have lupus.

Of the 500 with lupus, we’d expect 95%, or 475, to have a positive test.

Of the 9,500 without lupus we’d expect 30%, or 2850, to have a positive test.

Now the positive predictive value, i.e., the proportion of those who test positive who actually have lupus, is 475/(2850+475) which is about 0.14 or 14%. Still not great, but better than 0.003.

If you look more closely at the calculations, the thing that makes the PPV so low in both cases is the low specificity, i.e, the high proportion (30%) of those without lupus who test positive. That’s what gives us so many positives who don’t have lupus, even when the prevalence is 1 in 20.

I hope this illuminates rather than obscures…
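For anyone who wants to redo the head count with other numbers, here is a minimal sketch of the same frequency reasoning. The function name is invented for illustration, and it keeps expected counts as decimals (9.5 people rather than "9 or 10"), so the screening figure comes out a hair different from 9/3009.

```python
def expected_counts(tested: int, prevalence: float, sensitivity: float, false_positive_rate: float):
    """Expected head counts behind a positive predictive value."""
    with_lupus = tested * prevalence
    without_lupus = tested - with_lupus
    true_positives = with_lupus * sensitivity
    false_positives = without_lupus * false_positive_rate
    ppv = true_positives / (true_positives + false_positives)
    return with_lupus, true_positives, false_positives, ppv


for label, prevalence in (("screening, 1 in 1000", 1 / 1000), ("diagnostic, 1 in 20", 1 / 20)):
    with_lupus, true_pos, false_pos, ppv = expected_counts(10_000, prevalence, 0.95, 0.30)
    print(f"{label}: {with_lupus:.0f} with lupus, "
          f"{true_pos:.1f} true positives, {false_pos:.0f} false positives, PPV {ppv:.1%}")
```

Lowering `false_positive_rate` in the call shows how much of the problem is the 30% of healthy people who test positive rather than the prevalence alone.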

@Vince Melfi – That IS a very good explanation. Someone should make an animated chart that takes patients through that explanation, when relevant. I think it would be more helpful than many articles I’ve found on the Internet.

Although, on that chart, I think it would be good to see percentiles of people with positive ANA with auto-immune disease (broken down into subgroups) vs healthy positives rather than divided into Lupus, not Lupus, since some percentage of folks with positive ANA end up diagnosed with other AI disease or UCTD.

http://library.mpib-berlin.mpg.de/ft/gg/GG_Helping_2008.pdf

might be classic. Docs did terribly when given conditional probabilities. They did better when given what some folks call the natural frequencies.

http://www.ncbi.nlm.nih.gov/pubmed/9609869 reached the same conclusion, their German docs correctly estimating the chances of being a true positive given a positive test result (the PPV) only 10% of the time.

The first one discusses some other common problems too like lead time and overdiagnosis biases (well known to readers here), and in an entertaining way.

“5-year survival for colon cancer was 60% in the United States compared to 35% in Britain. Experts dubbed this finding ‘disgraceful’ and called for government spending on cancer treatment to be doubled. In response, then-Prime Minister Tony Blair set a target to increase survival rates by 20% over the next 10 years, saying, ‘We don’t match other countries in its prevention, diagnosis and treatment.’”

Of course the mortality rates were almost no different; it was just lead-time bias and overdiagnosis. Likewise, Giuliani used 82% survival in the US compared to 44% in England to show how great our medical system was when running for president in 2007. America was exceptional.

Also you have tests that do not discriminate well – perhaps they discriminate better than chance and so they are used, in conjunction with other things, but there are many things out there where you can not say everyone with this marker will get this disease – most of them with the disease have the marker but so do some people with out the disease. Some people have that marker and never get the disease. Something else, in addition to the marker, is clearly going to better help discriminate between those with the marker and those with the marker who have the disease.

That is why, with this test for lupus you can say that 95% of the people with lupus are positive but that in the general population 30% of the people have the marker and many of them do not have the disease. Maybe the people in the group with marker, without disease have a higher chance to get the disease and just don’t have it yet. Maybe they need to have a second marker we haven’t identified yet and the two together are needed and the folks with marker #1 and no disease don’t have marker #2. Maybe there are some environmental factors that people with the marker but no disease haven’t been exposed to in order to get the disease (and so the people without the marker exposed to the environmental things won’t get the disease even if exposed…).

Liz B – “Maybe they need to have a second marker we haven’t identified yet and the two together are needed and the folks with marker #1 and no disease don’t have marker #2.”

Yes – In the case of SLE (and other AI diseases) they diagnose based on criteria that include both signs and symptoms.

http://www.lupusresearchinstitute.org/lupus-facts/lupus-diagnosis

…and they work to set criteria for diseases that seem like a disease but don’t fit the other criteria, like Undifferentiated Connective Tissue Disease (formerly known as Latent Lupus or Incomplete Lupus, I think)

http://emedicine.medscape.com/article/334482-overview

These criteria are used to research the diseases. But, you have to watch out, because sometimes statistics for outcomes or prognosis can be based on groups who met different parts of the criteria.

So, to some extent, these diseases are human constructs, labels based on grouping signs and symptoms. There is obviously a biological basis for the disease, but the exact biological basis is not established, so they work from a sketch based on hints.

From a layman’s standpoint, it’s really very interesting looking at how science/medicine deals with the uncertainties in this situation.

I had a similar experience… My wife is a special educator and is good at her work at applying standardized tests and proposing courses of action based on the results.

She was venting one day that someone had scored below the first percentile. Like a geeky husband, I latched on to an irrelevant part of her story and would not let go. I explained to her that it was impossible for someone to score below the first percentile. By definition, the lowest percentile score you can get is 1. She would not agree even when I presented her with the math. We discussed it with a couple of co-workers and they disagreed with me as well, regardless of the math.

When pointed to a reference I quickly found the source of my misunderstanding. You can score below the first percentile when compared to the sample group the score was being compared against. Once again my wife was quick to point out how wrong I was and that I had missed the point of the original story which dealt more with the situation and not the test result.

My takeaway was: these educators are skilled in giving and interpreting the results of the tests they give but don’t understand the math (statistics) behind them. Maybe they don’t need to, or maybe they were taught and could articulate it at one time but it was forgotten because it was not directly relevant.

Anyway, a buddy of mine who teaches math feels that math scores in Canada and the US would be better if math teachers taught math starting in the elementary grades. I tend to agree….

Thanks for the post, Dr. Crislip. Thought provoking as always…

I became more interested in some of these issues after reading Nassim Taleb’s books (Black Swan, Fooled by Randomness, etc.)

Taleb has a really interesting argument about probability and GMOs (it’s an anti-GMO argument) that I am always wanting to post on the GMO threads around here…but I’m not smart enough to understand either his argument or the SBM thread in enough detail to comment, usually. Dangit! Anyway, with that awful preamble, I will just say I think lots of people around here would find his books interesting.

I read Swan but not Fooled. Maybe I need to read that as well as I don’t recall the anti-GMO argument in Swan.

That said I found Swan interesting in the descriptive sense but disappointing in any prescriptive sense. But maybe I was too dense to get the prescriptive part. Would that make it a case of pearls before swan?

I could not get through Swan. To me, the author came across as incredibly arrogant and self promotional…a self prescribed statistical guru. Which immediately undermined his credibility, in my mind and sucked all the joy from the novel concepts.

Or maybe it’s just because math makes me ornery.

His argument about GMOs seems to be essentially that if something goes wrong, it could go really horribly wrong, ecological disaster kind of thing.

I’m not terribly convinced that GMOs are more likely to cause this than anything else we’re doing agriculturally (or industrially.) In the past, humans HAVE caused major ecological disruptions by importing organisms from one continent or region to another either accidentally or deliberately. This will continue to happen as long as we have the ability to travel long distances and cross oceans.

With GMOs:

1) Most of the changes are pretty small

2) We actually do test things.

So yeah, if you want to talk about remote but catastrophic risks, talk about someone carrying his favorite potted plant from Asia into California’s Central Valley, or the corn fields of the Great Plains, with an unknown fungus hiding in the roots. Way more plausible than the GMO apocalypse.

That was in Swan, wasn’t it?! It has been a number of years since I read it and like mouse I was not nearly as impressed as Taleb seems to be.

Following that logic we would never try anything revolutionary because a black swan lurking could send a ripple through space-time that might change the value of c and then the poo would really hit the fan.

Yeah, he’s dressed up the utopian fallacy with game theory, throws in a few straw men about using untested completely new organisms, and then sits there smugly because the scientists don’t seem to share his concerns.

If our current agricultural practices were perfectly effective or at all environmentally sustainable, then I’d buy his argument about being absolutely certain of safety before we change anything. As it is, problems exist, and I’m willing to accept a solution once it’s been reasonably well demonstrated that the potential risks of action are smaller than the risks of inaction.

I’m heartily on board for the importation of wholly new, complex organisms into an existing ecosystem (particularly from China, where all manner of horrors and ecological disasters seem to lurk) being an utter disaster for the ecosystem. There are lots of examples – nautalis (?) and the Panama Canal, zebra mussels in the Great Lakes, the Potato Famine, kudzu vines, etc. etc.

Changing a single gene so a plant is immune to a human intervention? You’re giving it a couple generations of immunity, after which point the remaining herbs and pests will evolve around it. Not something to be concerned about in terms of whole ecosystems, definitely something to be concerned about when it comes to producing enough food to feed the world.

I’ll take a stab at explaining Mouse’s conundrum.

Suppose we simplify the problem and assume everyone who doesn’t have lupus is normal/healthy and that 1% of the population have lupus. (In actuality, there are other auto-immune diseases that also correlate with high ANA–but lupus and RA are the most common ones.)

So (0.01 fraction of pop w/ lupus) * (.95 true positive )

+ (0.99 healthy) * (.30 false positive)

= .0095 + .297

= .3065

But, only .0095/.3065 = .031 or 3.1% of those that have high ANA actually have lupus.
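Written out as Bayes' theorem, the arithmetic above looks like this. The notation is mine; the numbers are the simplified 1% prevalence, 95% true positive and 30% false positive figures from this comment.

```latex
P(\text{lupus} \mid \text{high ANA})
  = \frac{P(\text{high ANA} \mid \text{lupus})\, P(\text{lupus})}
         {P(\text{high ANA} \mid \text{lupus})\, P(\text{lupus})
          + P(\text{high ANA} \mid \text{healthy})\, P(\text{healthy})}
  = \frac{0.95 \times 0.01}{0.95 \times 0.01 + 0.30 \times 0.99}
  = \frac{0.0095}{0.3065} \approx 0.031
```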

So relax. I had high ANA when I was training really really hard in grad school. In midlife, my ANA levels are normal. But, I was probably healthier when I worked out more often and harder.

BMGM – Thanks – if I am reading correctly, you are showing me something similar to cc prof’s explanation.

As to relaxing. Well, to be honest, in each case (see above) I was relaxing before I had the symptoms that took me to several doctors. I wasn’t so much worried by the test results as bothered by my symptoms and wondering about the likelihood of them actually being related to the ANA test vs. the ANA test being a red herring.

Not that I expect an opinion on that. I’m sure I’ve bored everyone enough with my list of ailments without bringing out specific test results.

Qualifying this by saying I’m only a patient, my test results and symptoms were unclear and they’re still not sure exactly what I have, but I was told that my ANA level (1:2560) was too high not to be an autoimmune disease of some sort – at least by two of the three doctors I consulted. The first just kept retesting and waiting for something to become apparent; the second sent me to a rheumatologist who started me on Plaquenil, a tremendous help. But I still don’t have a clear diagnosis.

@Denise and Mouse

I’m sorry that you are experiencing health difficulties. I guess you had ANA tests in conjunction with a rheumatoid factor test and neither were definitive? Did they do a test for HLA-B27?

http://www.nlm.nih.gov/medlineplus/ency/article/003551.htm

First off, I am a PhD, not an MD. So I am NOT giving medical advice.

I write a mommy science blog where I go on ad nauseam about my statistical pet peeve, correlation does not imply causality. My physics professor friend, Eric, wrote a series about his pet peeve, the misinterpretation of p-values. Seriously, I didn’t really understand p-values until I read:

http://badmomgoodmom.blogspot.com/2009/11/what-is-p-value-of-bullshit-part-one_16.html

http://badmomgoodmom.blogspot.com/2009/11/what-is-p-value-of-bullshit-part-two.html

And then he followed that up with the tongue in cheek:

http://badmomgoodmom.blogspot.com/2012/10/what-is-p-value-of-chocolate.html

Lastly, I hope you get some medical answers to what ails you. Feel better.

PS I have a non-statistical pet peeve about the representation of theoretical physicists in pop culture. Please, please set people straight. We do not behave like on the Big Bang Theory.

Finally, you can get a small p-value for a very slight (or nonexistent) difference by using a large sample size. So take studies that have a very large sample size and very small effect with a grain of salt.
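To make the sample size point concrete, here is a rough sketch using only the Python standard library. It assumes a two-sided z-test with a known, equal standard deviation in both groups, which is a simplification; the 0.02 SD observed difference and the group sizes are made-up numbers for illustration.

```python
import math


def two_sample_z_pvalue(mean_diff, sd, n_per_group):
    """Two-sided p-value for an observed difference in means (known, equal SDs)."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = mean_diff / standard_error
    # For a standard normal Z, P(|Z| >= |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


# Same tiny observed difference (0.02 SD), increasingly large groups.
for n in (100, 10_000, 1_000_000):
    p = two_sample_z_pvalue(mean_diff=0.02, sd=1.0, n_per_group=n)
    print(f"n per group = {n:>9,}: p = {p:.3g}")
```

The observed difference never changes, yet the p-value goes from nowhere near significant to astronomically small as the groups grow. The p-value says nothing about whether a 0.02 SD difference matters.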

@Denise –

only a patient, pfft… :) Thanks for the info. My titer was 1280 speckled/homogeneous. The derm said no worries; my internist said strong suspicion of AI disease and sent me to the Rheum. He said, not sure, but something could be going on, and offered Plaquenil, which he says is low risk. That seemed to help; many of my symptoms went away. But, you know, it’s a slow-acting medication. It was gradual. So, I wonder if they may have gone away on their own.

The Rheum runs tests every 6 months for organ involvement, just to be on the safe side. At this point, I feel pretty confident that if there is something going on, it’s not severe.

I could stop the Plaquenil and see if the symptoms return, but they were such a drag, it’s hard to work up any enthusiasm for that.

@Mouse If you are seeing a rheumatologist AND a dermatologist, why hasn’t anyone done a work-up for psoriatic arthritis and the HLA-B27 marker?

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1079257/

http://www.mayoclinic.com/health/psoriatic-arthritis/DS00476

Don’t take medical advice over the internet from a non-MD (like me). Go ask your rheumatologist if that is a possibility.

BMGM – My skin isn’t (wasn’t) like psoriasis. It was more like this. http://www.flickr.com/photos/dokidok/2368946075/ but could also be rosacea. It’s sun sensitive. Since it was intermittent, I didn’t get a biopsy.

Also, at this point, I don’t have joint swelling or arthritis, or the kind of spine issues that would lead to the HLA-B27 test. Rheum says my MRI didn’t look like RA.

I don’t think the HLA-B27 diseases are typically associated with low C3/C4 and my ANA titers above 320.

And in the end, I’d probably be taking the same medication, so not much difference.

BTW, this former math major wasn’t bothered by Barbie’s “Math class is tough!” statement. Something else rankled.

http://badmomgoodmom.blogspot.com/2010/08/math-class-is-tough.html

I’m with her on the Barbie thing. It’s just a fact that math is not easy for most people. If she were to recite a claim that math is easy, that might even be taken to mean that anyone who finds it difficult must be an idiot (which would be both harmful and false).

I think this is one of those ironic things where the actual sexism was inserted by the reactions of the people who object. The doll does not say “…because I’m a girl”, but people imagined that and threw a fit over their own assumptions.

The other problem (I have taught stats but am no math genius – I struggled with calculus, and stats in grad school took me 20 hours a week, each week, to get through the homework and master it – I did well in those classes but spent a lot of time beating my head against the wall) is that many people click their brains off when faced with math.

I had students who claimed they couldn’t do formulas so couldn’t do stats. So I broke down the formulas into a big grid of steps to do and they could get it that way. Also I found that many people who teach stats (I got stuck teaching stats, my PhD is not in this area) presume that the math will explain the meaning. If students are afraid of math to begin with, that is not likely to happen because they are not going to think about what they are doing – just hope they can get through it by rote.

I found it works better to help people to understand stats by explaining first what the meaning is of a test/concept (e.g. what it does) and when to use it, when not to use it, what the limits are (explaining in words and examples) and then and only then tackle the math (remembering that we all use computer software and class is probably the first and last time they will do most of that math by hand, not to mention you couldn’t easily or quickly do that math with a large dataset).

I *hated* math all through high school. It was just tedium, this long, arduous process of solving 75 damn quadratic equations in a row…. Then I got to college and reluctantly signed up for Calculus, since it was required for my chemistry major. (Which I ended up switching for Computer Science anyway, but that’s another story.) And it was much harder, but also so much more FUN! Spending an hour and a half on a couple of calculus problems was a lot more enjoyable than forty-five minutes of trigonometry.

Math *is* hard. But that doesn’t mean it isn’t fun or worthwhile or that you have to be special to do it.

I must give a shameless plug to a previous SBM post explaining the importance of prior probability in interpreting P-values. Intended to clarify the issue, even for the statistically and mathematically challenged.

http://www.sciencebasedmedicine.org/the-plausibility-problem/

Just a heads up. It’s R.A. Fisher, not Fischer… http://en.wikipedia.org/wiki/Ronald_Fisher

Mark Crislip wrote:

Apparently, you still believe misconception #1 above, because you say, “[I]f there is a small p value, 0.05, then the chance that a difference observed in two treatments is random is 5%,” and that statement is synonymous with misconception #1. Saying that an observed effect is “random,” or “due to chance” is saying that the observed effect is due to the null hypothesis being true. Thus, saying “p=.05 means that the observed difference has a 5% chance of being random” is saying that there is a 5% chance that the null is true, which is misconception #1.

Thank you for bringing this paper to my attention. I haven’t read it yet, but the statement that a p-value of .05 corresponds to Bayes factors of 3–5 is somewhat incorrect. Berger & Sellke (ref. 13 in the PNAS paper) proved that for all intents and purposes a p-value of .05 cannot correspond to a Bayes factor greater than about 2.5. And usually the Bayes factor will be less than that. In fact it is not uncommon that p-values in the range of .01–.05 actually correspond to Bayes factors that favor the null hypothesis over the alternative. So, you’ve got the take-home message right: p-values need to be much smaller than most researchers think to provide convincing evidence against a null hypothesis.

You might find the Bayes factor calculators at Jeff Rouder’s website useful. They can easily be used to convert commonly reported statistics (like those from t-tests) to Bayes factors. Also, on my website, I present a method to convert hazard ratios from survival analysis (or logistic regression) to Bayes factors. Don’t be put off by the derivation. You only need equations 13, 14, and 15 to do the calculation. And I include a worked example to illustrate the method.
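For a feel of where a figure like 2.5 can come from, here is a small sketch of one widely quoted calibration, the −e·p·ln(p) lower bound on the Bayes factor in favor of the null (valid for p < 1/e, usually credited to Sellke, Bayarri and Berger). I am assuming this is the kind of bound meant here; the comment does not state a formula, and the function name is hypothetical.

```python
import math


def max_bayes_factor_against_null(p_value):
    """Upper bound on the Bayes factor against the null under the
    -e * p * ln(p) calibration, which applies only for 0 < p < 1/e."""
    if not 0 < p_value < 1 / math.e:
        raise ValueError("calibration applies only for 0 < p < 1/e")
    # Smallest possible Bayes factor in favor of the null...
    min_bf_for_null = -math.e * p_value * math.log(p_value)
    # ...so its reciprocal is the largest possible factor against it.
    return 1.0 / min_bf_for_null


for p in (0.05, 0.01, 0.005, 0.001):
    bound = max_bayes_factor_against_null(p)
    print(f"p = {p}: Bayes factor against the null is at most about {bound:.1f}")
```

Under this calibration p = .05 tops out near 2.5, and it takes p around .005 before the bound exceeds 10. These are best cases for the alternative; actual Bayes factors depend on the prior and can be much less favorable.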

So is this wrong because:

a) it leaves out the “at least” part (i.e., it’s a 5% chance of getting data that strong or higher, so saying 5% chance for the exact data is too specific), or

b) it is worded in reverse, implying a conclusion when the actual cause might be something not considered?

I think this is the 4th or 5th time I’ve heard Dr. Crislip make the Naked Gun joke. Luckily I share his sense of humor so it never gets old. And thanks to the parachute effectiveness example that always gets brought up when we talk about RCTs, I look forward to the inevitable “best ways to die” Naked Gun joke in a future post.

On a more serious note, I have a nagging concern that always pops up whenever people talk about raising the bar for statistical significance. Better standards are great for high-quality, multi-center, well-funded clinical trials. But what about scientists that are just starting their careers? How the hell are they going to get publications when the standards go up by a factor of ten? Most graduate students I have met struggle with crappy results for years before they really get the hang of things, and I’m worried higher standards will only encourage dishonesty.

There are related issues regarding funding high-risk research, particularly in the United States where politicians seem to think 100% of science research should yield an economic return in 3 years. If more preliminary research results are suddenly deemed “not significant” it may cause a shift in what the NSF/NIH is willing to throw money at (considering they ALREADY do this with the current system).

Any solutions to these problems?

“If more preliminary research results are suddenly deemed “not significant” it may cause a shift in what the NSF/NIH is willing to throw money at (considering they ALREADY do this with the current system).”

What are the chances of that?

I have no clue. I’m hoping people that interact with funding agencies on a regular basis could offer some insight, because everything I know is stuff I heard through the grapevine 5 years ago.

As a professor of statistics, first let me say: thanks for writing on this subject.

You write that Bayes is

“evidently a superior conceptual framework for determining whether a result is ‘true’ and has none of the flaws of the p value.” I am afraid this is not true.

Bayes has its own problems. It has plenty of flaws, which is why the majority of statisticians do not use it.

More importantly, there is much more to frequentist statistics than p-values.

In fact, p-values are probably the least useful part of frequentist statistics.

Despite all this nit-picking, I do appreciate that you posted on this topic.

I’m about halfway through the Bunge that Dr. Hall reviewed last week and in it Bunge takes a pretty good rip at Bayes as a useful tool in clinical research. If you’d be interested in expanding on this or directing us to further reading, I’d certainly be interested in reading it.

I use Bayesian methods and I agree with this assessment. Bayesian methods have different problems to frequentist methods, and I’ve never met a pure Bayesian who wouldn’t use something classified as frequentist when that fit the problem.

Right on!

Frequentist versus Bayesian is a philosophical dispute that has only a weak connection to the actual practice of statistics. It is common for a good statistician to construct a Bayesian model to solve a particular problem but then evaluate that model by considering its frequency properties.

Carl wrote:

The main reason that Mark’s stated interpretation of the p-value, “[I]f there is a small p value, 0.05, then the chance that a difference observed in two treatments is random is 5%,” is wrong, is because saying that a result “is random” (or “due to chance”) is synonymous with saying that the result is due to the null hypothesis being true. He is thus interpreting a p-value of .05 to mean that the probability that the null hypothesis is true is 5%. That is a probability about a hypothesis H0, given the observed data D, P(H0|D), which is the (Bayesian) posterior probability that H0 is true. But a p-value is the converse; it’s the probability of obtaining the observed data (or more extreme data) D*, given that the null hypothesis is true, P(D*|H0). The p-value is calculated on the assumption that the null hypothesis is true; therefore, a p-value cannot possibly be the probability that the null hypothesis is true.

Make sense?
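A rough numerical illustration of the gap between P(D*|H0) and P(H0|D), as a hedged sketch: it conditions on the coarser event "the result was significant at alpha" rather than on an exact p-value, and the prior probabilities and the 80% power are assumed numbers, not anything taken from a real study.

```python
def prob_null_given_significant(prior_null, alpha, power):
    """P(H0 | significant result) by Bayes' theorem.

    prior_null: assumed prior probability that the null is true
    alpha:      P(significant | H0), the false positive rate
    power:      P(significant | H1), assumed power of the study
    """
    false_alarms = prior_null * alpha
    true_hits = (1 - prior_null) * power
    return false_alarms / (false_alarms + true_hits)


# Same alpha = 0.05 throughout; only the prior plausibility changes.
for prior_null in (0.5, 0.9, 0.99):
    posterior = prob_null_given_significant(prior_null, alpha=0.05, power=0.8)
    print(f"prior P(H0) = {prior_null}: P(H0 | significant) = {posterior:.2f}")
```

With these assumptions the probability that the null is true after a significant result runs from about 6% to about 86%, even though alpha is 5% every time. The 5% only describes what happens when the null is true; it is not the chance that the null is true.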

Discussing stats here never works out very well.

“is random” here was merely trying to say that under the null hypothesis, p will be less than .05 5% of the time – it will be small some of the time due to randomness in the data. If we repeat the experiment 100 times, we expect p to be less than .05 5 times.
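The repeated-experiment claim is easy to check by simulation. This is a sketch under stated assumptions: both groups are drawn from the same normal distribution (so the null is true by construction), the test is a two-sided z-test with known SD of 1, and the experiment count, group size and seed are arbitrary choices of mine.

```python
import math
import random


def z_test_pvalue(sample_a, sample_b, sd=1.0):
    """Two-sided z-test p-value for a difference in means with known SD."""
    n_a, n_b = len(sample_a), len(sample_b)
    diff = sum(sample_a) / n_a - sum(sample_b) / n_b
    standard_error = sd * math.sqrt(1.0 / n_a + 1.0 / n_b)
    return math.erfc(abs(diff / standard_error) / math.sqrt(2.0))


def fraction_significant_under_null(n_experiments=2000, n_per_group=30,
                                    alpha=0.05, seed=0):
    """Run many experiments where the null is true; count how often p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        group_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        group_b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        if z_test_pvalue(group_a, group_b) < alpha:
            hits += 1
    return hits / n_experiments


print(f"fraction with p < 0.05 under the null: {fraction_significant_under_null():.3f}")
```

The fraction should land close to 0.05, wobbling a little with the seed. That is a statement about how often p is small when the null is true, which is not the same as the probability that the null is true given a small p.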

It’s only trouble with words, and not very much trouble. The first assertion at Jay’s #12 seems wrong.

On the other hand, have you ever met a skilled statistician who would make a good life coach?

Dr. Crislip,

Nice post.

Sawyer said:

“But what about scientists that are just starting their careers? How the hell are they going to get publications when the standards go up by a factor of ten? Most graduate students I have met struggle with crappy results for years before they really get the hang of things, and I’m worried higher standards will only encourage dishonesty.”

Are you suggesting that it’s okay that the medical literature keep publishing misleading papers, so that beginners who haven’t gotten the hang of things, can get publications on their CV, and stay in science? And that they are likely to respond with dishonesty, if they can’t get easy publication of insignificant results? At the risk of stating the obvious: the literature isn’t really supposed to be a tool to manipulate for career gain; it’s supposed to tell us what’s happening in nature.

Also, if really big trials are needed to improve the p value, that probably means that your hypothesis isn’t true (or that it’s such a small part of the situation that it is unimportant). Don’t keep piling up numbers of patients, or numbers of similar trials: rethink the hypothesis and the trial designs, and start over. (Please don’t say that is unacceptable because it might embarrass someone, or ruin someone’s career.)

No, of course not. I naively expected people here would understand the dilemma I’m talking about without me having to go into much detail, but I guess I should have known better by now.

Nobel prize winning scientists don’t materialize out of thin air. Neither do life saving doctors. They begin their career just like everyone else in the world: by making A LOT of mistakes. Whether or not you think the “publish or perish” model that universities employ is a good thing, there needs to be some sort of outlet for new scientists to publish their work and gain recognition for it. And that work, initially, is not going to be a double-blinded, randomized control trial, P<0.005. What are they supposed to do with 4 years worth of data that doesn't make the cut? That data may not be clinically relevant, but there may very well be hundreds of other scientists that are interested in their work.

In an ideal world, they publish their work, get the recognition they deserve, and their results are recognized by more experienced researchers as being PRELIMINARY and not worthy of clinical significance. But we don't live in that world. The line between great research and mediocre research is blurry, as is the line between top-notch and second-tier journals. Doctors and drug regulators should absolutely demand really good evidence for treatments, but what happens when journals follow suit and start jacking up the p-value they expect for every single paper they publish? Does that actually make the overall literature get better? Does that really bring the best and brightest and most honest scientists into the field?

I'm a curious person and wonder about this stuff. I want to hear from other people that are equally curious, or those that have real solutions to these challenges. If you're not in that group we're not having a discussion.

@Sawyer:

Well said. I have actually been thinking about the issue myself. I think qetzal makes a very good point that should be expanded – we need to teach people about the statistical nature of the universe and what that actually implies. Most people have a very binary frequentist view of the world around them. Something is or isn’t and once we hit a certain p-value we can assert with confidence whether it is or isn’t and that’s all there is to it.

But beyond that, I believe that the issue with peer review is not so much that it is broken but that it was a model built around an infrastructure based in print journals with far fewer articles and much less data to sift through. It isn’t that the process is suddenly bad, it is that the world has changed around the process faster than it could accommodate. I’ve actually been in a bit of a discussion with an acquaintance of mine who is a researcher in Edinburgh and I had this to say:

“There must be some middle ground that will work well – something like a post-hoc peer review and a somewhat open journal model. I could envision something where you pay extra for immediate access to the newest articles, pay less for slightly older articles, and anything over a year old is completely free. There will still be up-front peer review, but we abridge the process in order to get articles out faster. Then, anybody can pay a small fee to have their identity and credentials verified and a small fee to comment in a post-hoc peer review process. (something like $20-30 per year or so). Then, the journal editors can look through the comments and come up with a quarterly print journal that puts out the best of the best – those that really stood up to the post-hoc bashing and with an addendum and notes (plus an opportunity for the original authors to respond/amend/etc) the data and then charge for that with the same stipulation that after some time period those articles become open access. ”

There is much more refinement needed but it seems at least plausible as a start to me.

Sawyer,

I don’t think we need to demand randomized double blind phase 3 studies with p < 0.001 from everyone. It's not that we should quit doing small studies or totally disregard results with p < 0.05. But I think we do need to start realizing (and teaching) that positive results with p < 0.05 are, at best, just an indication to do more study. NOT an indication that the result is likely to be true.

In fact, I think the sooner a young researcher learns that, the better their chances of being successful.

I agree with Andrew Lang about not using statistics as a drunken man uses a lamp post, for support instead of illumination.

I’ll have to spend more time talking to my friends in med school to see how well they understand this problem. I always forget that I’ve just been spoiled by years of listening to skeptical podcasts and websites that this stuff now seems obvious to me, and might not be apparent to students trudging through their coursework.

The best way to learn a topic is to teach it so you are definitely taking the right steps. Sounds like you have a pretty good grasp on statistics though.

Never forget the Bonferroni correction!

The traditional test of significance is P<= 0.05

If you're using P values, but you entertain two hypotheses, then you should use P<= 0.025.

If you're considering four hypotheses, then you should use P<= 0.0125

If you're considering n hypotheses, then you should use P<= (0.05/n).
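The rule above is a one-liner in code. A minimal sketch, with function names of my own choosing and a made-up list of ten p-values standing in for a fishing trip:

```python
def bonferroni_threshold(n_hypotheses, alpha=0.05):
    """Per-test significance cutoff when n hypotheses are considered."""
    return alpha / n_hypotheses


def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value as significant or not after Bonferroni correction."""
    threshold = bonferroni_threshold(len(p_values), alpha)
    return [p <= threshold for p in p_values]


for n in (1, 2, 4, 10):
    print(f"{n:>2} hypotheses: use P <= {bonferroni_threshold(n):.4f}")

# Hypothetical fishing trip: ten associations, one of which looks tempting.
fishing_trip = [0.04, 0.21, 0.33, 0.47, 0.52, 0.60, 0.71, 0.78, 0.85, 0.93]
print(bonferroni_significant(fishing_trip))  # 0.04 does not clear 0.05 / 10
```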

So, if somebody does a fishing trip, and considers 10 possible associations, and one has a P value of 0.4, it's not significant – you should only consider associations with a P value of <= 0.005. […] 160mmHg by 1mmHg. The result of the study might be statistically meaningful – you can say with great confidence that the drug does this. But that doesn’t mean it’s clinically meaningful or significant.

Say I have 5 conditions A B C D E, and test all pairwise contrasts (A vs B, A vs C, etc, there are 10 of them). I’m most interested in whether D and E give bigger values than in A,B,C, which is 6 of the contrasts and for all those I get p=.01, and for the other 4 I get p=.5.

Bonferroni corrections are probably not what I want there. Often folks instead compute the false discovery rates (FDR) – how many of my 6 claims of differences are expected to be false. In the example I have, I would probably just give the p-values and make sure the reader gets how many tests I am doing – there’s no lying in doing that. Cherrypicking one small p-value while not being crystal clear about how many tests you tried – that is a problem.

The more classic examples for me come from arrays where we measure about 50000 transcripts. If we get 400 p<.001 we'd (naively) guess that 50 of those are false positive. Bonferroning for corrected p's less than .05/50000 =.000001 is making a list with the property that the chance that EVEN ONE of my transcript differences is false positive is <.05. Sometimes that's the error rate you are interested in, but not always, and for this example it might be never.
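The arithmetic in that array example, spelled out as a small sketch. The variable and function names are mine; the "naive" expected count assumes every transcript is null, which is exactly the rough guess described above.

```python
def naive_expected_false_positives(n_tests, p_cutoff):
    """Expected number of tests passing the cutoff if every null were true."""
    return n_tests * p_cutoff


n_transcripts = 50_000
n_hits = 400        # transcripts with p < .001 in the example
p_cutoff = 0.001

expected_false = naive_expected_false_positives(n_transcripts, p_cutoff)
naive_fdr = expected_false / n_hits
bonferroni_cutoff = 0.05 / n_transcripts

print(f"expected false positives: about {expected_false:.0f}")
print(f"naive false discovery rate among the hits: about {naive_fdr:.1%}")
print(f"Bonferroni per-test cutoff: {bonferroni_cutoff:.0e}")
```

So roughly 50 of the 400 hits (about one in eight) would be expected to be false, while the Bonferroni cutoff of one in a million controls the chance of even a single false positive. Which of those two error rates you care about is the whole question.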

Conclude: inappropriately demanding a Bonferroni correction is the hallmark of a dickish reviewer, who knows just enough to be a danger.

Further study: folks often use Benjamini-Hochberg corrections to compute FDRs. When possible perform (and demand) permutation testing – in tricky situations we don't really believe our test statistics have the theoretic distribution we imagine they do under the null. Only the best reviewers and readers will feel when that is likely to be a big problem. Good example is tests of transcription factor binding sites being within a certain distance of some given set of gene transcription start points in ChIP-Seq experiments – our models for what "random binding locations" means won't be perfect, and not admitting that can let you fool yourself something awful. Simpler example is T-testing log of transcript abundances – we don't really believe those distributions are actually perfect Gaussians, so our statistics don't really have T-distribution under the null.
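For reference, a bare-bones sketch of the Benjamini-Hochberg step-up procedure in plain Python. The function name is mine, and the p-values are the ones from the five-condition example above (six contrasts at p=.01, four at p=.5); real analyses would normally use a vetted statistics package rather than this.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans: True where the hypothesis is rejected
    while controlling the false discovery rate at the given level."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value is <= k * fdr / m.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * fdr / m:
            largest_k = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for idx in order[:largest_k]:
        rejected[idx] = True
    return rejected


contrasts = [0.01] * 6 + [0.5] * 4
bh = benjamini_hochberg(contrasts)
bonferroni = [p <= 0.05 / len(contrasts) for p in contrasts]
print("Benjamini-Hochberg rejections:", sum(bh))
print("Bonferroni rejections:", sum(bonferroni))
```

On these numbers Benjamini-Hochberg keeps all six p=.01 contrasts while Bonferroni (cutoff .005) keeps none, which is the practical difference between controlling the false discovery rate and controlling the chance of even one false positive.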

Question: the Bonferroni correction (like regular p-value tests) assumes these hypotheses were defined before the experiment began, right?

I don’t think this is doing much to solve the problem. When you see research with multiple outcomes, the major problem is that those outcomes were selected during or after an experiment began. This correction factor would reduce this ability, but it doesn’t eliminate it. Maybe if the data sets are perfectly normally distributed the odds stay the same?

I always mix up my p-tests, t-tests, and chi-squared tests, but I remember encountering this problem when I helped a friend with ANOVA analysis.

Right, and right. No adjustment of p-values will help if you’ve made up your hypothesis based on the data. It seems innocent enough to find an unexpected result in your data, and say, “That looks interesting; let’s see if it’s statistically significant”; but doing so irreparably poisons the analysis. Why? Because in the typical dataset there are a vast number of things that could have looked interesting but happened not to, and you would have to adjust your p-value to account for all the tests you could have done had the data looked differently, a number that is essentially impossible to determine.

An article in Slate by statistician Andrew Gelman discussing this problem of implicit multiple comparisons is here (be sure to read both pages 1 and 2).

This article and the comments section would be a good discussion starter in a grad class in statistics (or research methods). So many students think everything is cut and dried, that equivocal results are due to bad data, bad Ho’s, bad whatever, and not because fundamentally we are talking about a fuzzy world where we don’t know all the variables and/or can’t test for all of them at once… incidentally don’t forget we never prove anything, we can only disprove something or not disprove it and so let the hypothesis stand for the moment… because sometime in the future we may find out something that disproves what we cannot, at this instant in time with this study, disprove. If we triangulate, using different approaches, etc. then we increase the likelihood that we will not be able to disprove it and the London Bookies will stop taking bets on it not being true. LOL

The inability to accumulate evidence in favor of the null hypothesis is a limitation of the standard frequentist null hypothesis significance test, which can only reject, but never accept, the null hypothesis. Bayesian hypothesis tests don’t suffer from this handicap, and permit the conclusion that a null hypothesis is true.

Obligatory XKCD reference:

XKCD #882: Significant

My focus is lab research in molecular biology, so I have the luxury of controlling variables individually, to get yes/no answers from my experiments. Not needing to use statistical models to interpret my research, I don’t have a sophisticated understanding of them. But I recognize that eliminating the need for statistics at the stage of designing the experiments, is not possible in medical research on humans, so all these statistical models are needed. But I think sometimes they risk obscuring some simple problems that common sense might detect, were we not distracted by the effort to follow the numbers.

One thing I’ve never understood is how the most simple, common kind of Bayesian analysis that I’ve seen, which includes the assignment of “prior probability,” (or “pretest probability”) can be considered convincing, in many of the situations in which it is invoked. As Dr. Crislip noted, it often looks as though the assigned prior probability is a subjective number, chosen according to whatever the current guesses or biases might be, about a disease’s prevalence among some population, or some other probability that hasn’t been realistically ascertained.

At best, a calculation that includes any number that is only a guess, can only generate another guess, of no more precision than the first one. But it seems more convincing, because it’s got a bunch of math around it.

It’s like significant digits, which many otherwise intelligent people don’t seem to understand; for example, if you only measure one of your input figures to the nearest inch then any output stated in fractions (say, 6.47 inches) must be rounded off to 6. The numbers after the decimal point are an artifact of arithmetic.

At worst, Bayesian calculations can cause the generation of self-fulfilling prophecies, that get ensconced in medical textbooks as empirical “facts.” For example, if the “prior probability” of someone having a disease is assigned a small number like 0.1, because the disease is thought to be rare, then the disease will often not be diagnosed even if the patient tests positive, because that small “prior probability” will bias the decision toward regarding the test as a “false positive”. The bias toward failure to diagnose will reinforce both the notions that the disease is “rare”, and that the tests are prone to “false positives”, and lead to ever-increasing, self-reinforcing failures to diagnose. The same thing occurs in reverse for diseases that are perceived to be more common than they really are. So when looking at the reported numbers for a disease, rare diseases seem to get even rarer, and common diseases seem to get even more common, while actually this is all an artifact of formal, or informal, Bayesian analysis.

It’s an amplifying loop, where one guesstimated number (the “prior probability”) plugged into an equation in year 1, by year 10 has generated a whole literature full of mistaken assumptions and conclusions that keep getting further from reality each time the cycle operates.

To put it simply, if one number in a calculation is questionable, then the whole analysis is tainted. A chain of logic or math is only as strong as its weakest link. And if there’s a feed-back loop that amplifies the effect of the bad number, then perception gets divorced from physical reality very quickly, and in a predictable direction, all under cover of science-y looking math.

Or to put it another way, failure to recognize actual zebras because one has been told to avoid thinking about them, makes them seem even more rare than they are. And the inclination to attribute all signs and symptoms to some popular “horse” disease, makes the horse/zebra ratio look even bigger than it really is. And the distortion gets worse, every year that passes.

In thinking about statistics, we shouldn’t underestimate the sheer PR value of math: I think Ioannidis was a genius, to pepper with equations, his famous article “Why Most Published Research Findings Are False.”

http://www.plosmedicine.org/article/fetchObject.action?uri=info%3Adoi%2F10.1371%2Fjournal.pmed.0020124&representation=PDF

Those equations may have gotten him from Greece to Stanford.

Don’t get me wrong, I like the article, and the rest of Ioannidis’s output; for example, he recently weighed in, on the new statin guidelines, in a very balanced manner that I found exemplary:

http://jama.jamanetwork.com/article.aspx?articleid=1787389

Here’s a fun example of how guessed-at assumptions, like prior probabilities, can result in different conclusions:

http://www.technologyreview.com/view/510126/the-statistical-puzzle-over-how-much-biomedical-research-is-wrong/

@Self skeptic:

You are writing as if none of us here, nor any of the medical researchers out there, realize the limitations of statistical models and Bayesian inference. There are, in fact, ways to rigorously calculate the Bayesian prior but in most cases it is just a subjective “probable, possible, improbable, highly improbable.” The thing is that, for most cases, that is actually plenty good enough! You must realize that the systems we deal with actually have a much larger tolerance than the systems you deal with. If you are off by a factor of 0.001 that can totally muck up your work. If we are off by a much larger factor the outcome is (often) the same.

But this is exactly why we argue the points and keep doing studies. Is it ideal? Of course not. But you are falling for the Nirvana Fallacy. It may not be ideal, but it is the best we’ve got. Additionally, whilst the Bayesian prior is subjective it is (and, our big point here should be) informed by actual science and lower levels of evidence.

As for your zebras and the prior probability… of course, you are exactly right. Where you go completely wrong is how we handle it. There is a joke which in part goes “… the family practitioner says ‘Looks like a duck, flies like a duck, must be a duck’ the internist says ‘Yes, probably, but we must rule out flamingoes, penguins, and kiwi birds first’” The point is that we recognize that we may miss zebras. But we must balance that with the resources available and the resources necessary to root out all zebras. We also balance that out with the harms of trying to chase zebras that aren’t there. Are we perfect about it? No. But once again, that is the best we can do within the confines of reality. We always strive to improve and hone that edge better. But until we have better evidence that a rare disease is less rare or vice versa, what would you have us do? Changing our thoughts on the pre test probability is *worse* than being wrong about it based on the available evidence. Because at least in the latter we may be wrong, but we are using the best possible rationale. In the former we are literally just making things up.

TL;DR: You are taking correct statements about the limitations of statistical modeling, inference, and Bayesian priors and then making assumptions about how those play out in real life medical care that are simply wrong, compounded by the fact that you think medicine should be rigorous like a bench science. I also think we should *strive* for that but, for reasons that you can’t seem to grasp, is very, very far from where we are in our understanding and technology. There are only two options – wait till medicine becomes as rigorous as particle physics before doing anything or do the best we can with the best we got. The latter is obviously better, in particular because it would be extremely difficult to achieve the former without it unless you are advocating for Nazi type human experimentation. Then we can sure be a lot more rigorous and get better answers faster.

“For example, if the “prior probability” of someone having a disease is assigned a small number like 0.1, because the disease is thought to be rare, then the disease will often not be diagnosed even if the patient tests positive, because that small “prior probability” will bias the decision toward regarding the test as a “false positive”. ”

I don’t understand this at all. Why would you order a test if the results will not alter what you do?

This might help: http://araw.mede.uic.edu/cgi-bin/testcalc.pl

@Weing,

Yes, many physicians wouldn’t even order the test, because they’ve been taught they shouldn’t, if the prior probability is said to be low. So the patient who actually *has* the supposedly rare disease would *never* be diagnosed; unless he happened to find a doctor who can think outside the EBM/SBM box. Thanks for illustrating my point.

And, since the disease would go undiagnosed repeatedly in the wider medical community (because the theoretical prior probability (say, the prevalence) was guessed too low), it would come to be regarded as even more rare, and the theoretical prior probability would sink even lower, as the years go by. This would happen, even if the actual incidence of the disease was increasing. It’s a perfect Catch-22.

Meanwhile, the most likely misdiagnosis for that disease (some diagnosis of exclusion like fibromyalgia, some popular diagnosis like an autoimmune disease, or something psychiatric like “somatization disorder”, all of which are chronic and have no known cure) would be given, and the prevalence of those diseases or syndromes would appear to increase. Hence they would be (mistakenly) diagnosed more often, and the errors would be propagated and amplified, ad infinitum.

Sorry, I know it’s bad news.

This is an example that is badly in need of specifics, what rare disease(s) are you referring to?

In my experience, doctors pretty much follow the clues. They will test for a rare disorder if they see signs or if there is a red flag of a dangerous condition (even if rare). They don’t test for every illness when you have non-specific symptoms, because of the problem of false positives… too much data means too much noise to make sense of the problem.

Sure, there are blow-off doctors. But there are also specialists and researchers who are looking pretty hard for under-diagnosis of some of these rare diseases. Ex: SLE in the African American population.

@MTR,

I see your point, but if I give a specific disease example (and there are a couple, in my immediate family) we’ll be off to the races, on that particular disease, and the concept won’t ever sink in. I think it’s better, in this case, to leave the disease as a hypothetical, so as to make the general concept prominent. This is a case where common sense is enough to get the point; and it’s a general problem that clinicians and patients need to understand, in cases where a patient’s presentation doesn’t exactly match any existing diagnostic algorithm, so that judgment is needed.

I saw a good slide set about statistics, in which examples of calculations involving the new drugs “cleverstatin” and “blundamycin” were used to make a point about the statistics justifying drug use:

http://www.slideshare.net/tdemerdash/medical-statistics-made-easy

We need some imaginary diseases, like that; but I find, here at SBM, that if you trigger any emotional response, even sardonic humor, it throws people off from thinking logically (i.e., following a train of inevitable consequences that follow from one erroneous assumption.)

This has nothing to do with blow-off doctors. This is a general problem that will occur with all doctors, scientists, and patients, when a widely-circulated number, like the prevalence of a disease in a certain population, is wrong.

SS – “I see your point, but if I give a specific disease example (and there are a couple, in my immediate family) we’ll be off to the races, on that particular disease, and the concept won’t ever sink in.”

But specific diseases need to be addressed in specific ways and the risks of over-diagnosis and under-diagnosis are different depending upon treatment. If you stick to a general hypothetical statement, ‘it’s possible that doctors are missing rare diseases’, then you have no way of concluding if that is better than or worse than testing more and possibly over-diagnosing the hypothetical disease.

“This has nothing to do with blow-off doctors. This is a general problem that will occur with all doctors, scientists, and patients, when a widely-circulated number, like the prevalence of a disease in a certain population, is wrong.”

And yet, people DO get diagnosed with rare or mismatched diseases. Women do get diagnosed with ankylosing spondylitis, a disease that’s thought to occur mostly in men. Caucasians do occasionally get diagnosed with sickle cell anemia, etc. And it seems like your family did get diagnosed with some rare disease. How does that happen if doctors don’t generally test for rare diseases? It’s multifactorial, yes? Factors can include the quality of the doctor’s approach, the information available on diseases and how diseases present themselves.

One fact of life is that research is generally focused on common diseases or very dangerous diseases. If you happen to have a less common + less dangerous disease, then it’s less likely that science has found a way to help you. The flip side is, if you have a less dangerous disease, it’s more likely that fiddling about with experimental treatments could cause more harm than good. Some folks do get lucky. There may be a well documented low risk plausible treatment that can be prescribed off label. Or you may have a rare disease that has a documented easy cure. That is supposed to be in most doctors’ differential diagnoses, though.

See, without specifics, there are tons of variables. You seem to want to keep things vague to prove your point. But it only gets in the way of finding real practical safe solutions.

Just an additional thought. There are tons of rare diseases. If the symptoms are non-specific or not well matched to the disease, how is the doctor supposed to know which rare disease to test for? Is the doctor just supposed to test for ALL rare diseases and then run the follow-up tests for ALL the inconclusive or possible false positive results (also considering the chance of false negatives)? Some of these tests aren’t risk free, or comfortable: biopsies, endoscopies, etc. Wouldn’t that be incredibly expensive, exhausting for the patients and possibly more dangerous than going undiagnosed?

That “slide set” is, sadly, an elementary medical statistics handbook aimed at physicians and medical students. They start their chapter on p-values, “The P (probability) value is used when we wish to see how likely it is that a hypothesis is true,” which, of course is completely misleading. In fact, their whole chapter on p-values is wrong. No wonder physicians are confused.

@Jay:

No wonder indeed. I overheard a conversation with a med student and a resident where they were trying to figure out what a “p-value” actually meant. After 15 minutes of back and forth they finally settled on the wrong definition.

In fact, if it hadn’t been for SBM and my own interest in learning more and more about the topic I would have been exactly one of those confused physicians.

@self skeptic:

Do you really think we clinicians are *that* stupid? How do you think we estimate the prior probability of a disease? Do you really think it is initially just a pure guess and then we use population level data from self reported cases by community physicians? Because the only way your Catch-22 scenario would play out is if that were the case. And do you not think that we update incidence and prevalence statistics and actually *look* to update them? How about autism?

Seriously, we don’t just make up random prior probabilities. If we don’t know we say we don’t know and try and find out. If we know, it is based on some sort of actual data. And hey, guess how literature works? If someone comes out and proves that the incidence is actually higher or lower, they get big kudos! So there is an incentive to find it. Just like how we showed that older kids are getting whooping cough and being misdiagnosed because the vaccine is actually effective for a shorter duration than we’d thought. Or how about the fact that we *think* up to 20% of all chronic cough in adults may be due to whooping cough? There is no data to really support this and the number is shaky. But enough clinicians wondered if the incidence may be a lot higher than we previously thought and so now the American Thoracic Society has a note to all pulmonologists to keep an eye out for and consider pertussis in their adult patients with chronic cough. Then we are much more willing to do a test for it and can write up case series and individual hospital data (yep, we do that too!) and thus get a better sense of the numbers.

You act like we get locked into this cycle and a) never consider that this may be the case and b) have no way to extricate ourselves from it. Like we are helplessly watching as we test less and less for a disease as the incidence increases around us.

Seriously SS, you do know the basics of science and stats, but you are really and truly naive about how the process works from the clinical side of things. You aren’t a clinician and that is painfully obvious.

“@self skeptic:

Do you really think we clinicians are that stupid?”

Yes. I really think he does think that.

@weing:

Yes, more and more I am seeing that. Which is why my comments are getting angrier and angrier. LOL.

“At best, a calculation that includes any number that is only a guess, can only generate another guess, of no more precision than the first one.”

This is crazy talk.

Give me an experiment like tossing a coin where probability of heads is p (and “unknown”).

Before obtaining data I specify a prior on p, and that will take the form of a statement like “The mean of my prior is .5 (or something else), and with variation as if I had seen 1 head out of 2 tosses (or such).” (Technically: it’s a beta distribution, and interestingly “has to be” or else you will be super-burning stupid – de Finetti’s work). You could instead have a prior mean along with variation as if you had tossed the coin thousands of times, but if you are uncertain you don’t do that. For super-uncertain you could use a prior corresponding to having seen .0005 heads (h) out of .001 tosses (t).

Now toss the coin to collect data, and form your posterior distribution. You don’t get more uncertain, you get less uncertain as you toss – the variance of the posterior (also a beta distribution) goes down, on average, with every toss. You act as though you’ve seen h+H heads in t+T tosses, where H is how many heads you got in T tosses of the coin. Posterior mean is (h+H)/(t+T). It’s so beautiful I think I may weep.
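A minimal sketch of that update, reading the prior "as if h heads in t tosses" as a Beta(h, t − h) distribution. The function name and the 7-heads-in-10-tosses data are hypothetical, chosen only to show the bookkeeping.

```python
def beta_update(h, t, heads, tosses):
    """Combine a prior 'as if h heads in t tosses' with observed coin data.

    The prior is Beta(h, t - h); after seeing `heads` heads in `tosses`
    tosses the posterior is Beta(h + heads, (t - h) + (tosses - heads)).
    Returns the posterior mean and variance.
    """
    a = h + heads                   # pseudo-heads plus observed heads
    b = (t - h) + (tosses - heads)  # pseudo-tails plus observed tails
    mean = a / (a + b)
    variance = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, variance


# Prior from the comment: as if 1 head had been seen in 2 tosses.
h, t = 1, 2
prior_mean, prior_var = beta_update(h, t, 0, 0)

# Hypothetical data: 7 heads in 10 tosses.
post_mean, post_var = beta_update(h, t, 7, 10)

print("prior:    ", prior_mean, prior_var)
print("posterior:", post_mean, post_var)
```

The posterior mean comes out as (1+7)/(2+10), and for this data the posterior variance is well below the prior variance.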

To see the burning stupid of non-Bayesian methods toss the coin just once, and ask what odds you will give (to a fellow gambler) on the second toss.

(To fellow math police: yes I know frequentist point estimate of p is admissible, so it’s not super-burning stupid. I can’t make them lose money with probability 1.)

I’d like to see examples of yes/no answers from a biology experiment to see if it is really so, or if you don’t measure and test just cause you are lazy. Like bands on a gel – you really should quantitate them in replicate experiments. Just cause some folks don’t do that doesn’t mean it shouldn’t be done. If there’s no mathematical model, how do you estimate the chances of your findings being false positive? (Hint: there’s likely a fuzzy feeling that those chances are small. Good scientists estimate those chances explicitly – it’s the p-value.)

You don’t even have an elementary understanding of them. Unfortunately, that has not prevented you from forming strong opinions against them.

Uh, no. Because we’re modifying our prior “guess” in light of new data, our posterior “guess” is more precise than either our prior guess or the new data itself. Under Bayes, we summarize our beliefs using probability distributions. For example, our prior beliefs about the effect of a new treatment compared with an old one might be represented by a normal distribution with a mean of 0 and a variance of 6 (in some clinically relevant unit), the variance representing our uncertainty about the true treatment effect. We collect some data and find that the observed treatment effect is normally distributed with a mean of 5 and a variance (SEM^2) of 2. Then, by Bayes Theorem, our posterior beliefs about the true treatment effect will be a normal distribution with a mean close to 4, reflecting both our prior beliefs and the data, with a variance of 1.5, indicating that our uncertainty about the true treatment effect is reduced in light of the data, as common sense suggests it must be. In fact, the precision, which by definition is 1/variance, of our posterior “guess” is the sum of the precisions of our prior “guess” and the data, so your assertion that the posterior “guess” can be no more precise than the prior “guess” is literally and mathematically false.
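The arithmetic in that example can be checked with a short sketch of the standard normal-prior, normal-data update with known variances. The function name is mine; the inputs are the numbers quoted above.

```python
def normal_update(prior_mean, prior_var, data_mean, data_var):
    """Posterior mean and variance for a normal prior and normal data.

    Precisions (1 / variance) add, and the posterior mean is the
    precision-weighted average of the prior mean and the observed mean.
    """
    prior_precision = 1 / prior_var
    data_precision = 1 / data_var
    post_precision = prior_precision + data_precision
    post_mean = (prior_precision * prior_mean
                 + data_precision * data_mean) / post_precision
    return post_mean, 1 / post_precision


# Prior N(0, 6) and observed effect N(5, 2), as in the comment.
post_mean, post_var = normal_update(0, 6, 5, 2)
print(post_mean, post_var)
```

The posterior mean is 3.75 (the "close to 4" above) and the posterior variance is 1.5, smaller than both the prior variance of 6 and the data variance of 2.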

To summarize the above, Ioannidis’s model showed, to the horror of just about everyone, that more than half the medical literature could be wrong. Much less well known is the criticism of Goodman and Greenland revealing flaws in the model (notably underestimation of the p-values in the literature), suggesting that Ioannidis’s estimates were probably too high.

Then, just recently, Jager and Leek, using a completely different approach than Ioannidis, estimated the false positive rate in the medical literature to be just 16%. Because Jager and Leek were so open about their data and their methodology, flaws in their approach were quickly discovered, and critics showed that a more reasonable estimate from Jager and Leek’s model is about 30%.

Prior to Ioannidis’s paper, most doctors and medical researchers probably thought that the actual false positive rate in the literature was small, perhaps mistakenly believing it to be as low as 5%, due to the usual error in interpreting significance level. Now, thanks to the work of Ioannidis, Goodman and Greenland, and Jager and Leek, we know that the actual error rate is much higher. A reasonable belief would be that there is a 90% chance that the true error rate is bounded by, say, 30% and 60%. What you deride as a “fun example” of “guessed at assumptions” looks to me like substantial progress. We now know that the error rate is much higher than we previously believed, and we’ve got a handle on likely bounds for it.

Thankfully, Andrew Pavlov has handled your remaining bloviations.

Thanks Jay. Your response was excellent as well and it always helps me to re-read and crystallize that sort of knowledge. It still just boggles my mind that anyone, let alone someone who clearly has some significant scientific knowledge, thinks that we would somehow get caught in such an insanely obvious mathematical “trap.” I mean, we understand that there is a LOT of variability and change in our field. And if something isn’t working right it doesn’t mean we just keep doing it until we run the train off the cliff. We do have the ability to jump off the damned train.

“Yes, many physicians wouldn’t even order the test, because they’ve been taught they shouldn’t, if the prior probability is said to be low. So the patient who actually has the supposedly rare disease would never be diagnosed; unless he happened to find a doctor who can think outside the EBM/SBM box. Thanks for illustrating my point.”

Again, I don’t understand. Which is it? First the test was ordered and the narrow-minded physician considers it a false positive, now you say it’s not even ordered because he’s narrow minded and wouldn’t even bother. I think you need to be a little more skeptical of your imagined representation of the mind of a clinician.

Weing,

Um, you’re the one that brought up the physician who doesn’t even run the test. I’m the one who pointed out that even if the physician suspects that the supposedly “rare” diagnosis is the correct one, and runs the test to gather more evidence (not clinching evidence, but another clue) his colleagues may tell him the clue is worthless because of the supposed prior probability.

So basically, the patient with the “rare” disease is out of luck, unless he manages to get himself out of this circular system of logic that is disabling his doctors from recognizing his illness.

This is really pretty simple. I think the problem is, you’re letting emotions cloud your ability to think through the problem logically. See my reply to MTR, above. This doesn’t have anything to do with any particular clinician, or with clinicians in general. A scientist or a patient following the same path, will make the same mistake, for the same reason. It’s a systemic problem, not a personal one.

It’s about the consequences of using a guessed number, to make calculations about probabilities, and then being lulled into thinking that the result is solid, because the guessed number had some math done to it.

From your point of view, I suppose it all comes back to trusting that the “experts” guessed the prior probability (say, the disease prevalence) correctly, so that the number that is disseminated is correct. I’m just pointing out that when that number is wrong, it affects everything that happens to those patients. Many of these numbers are estimated, as placeholder fictions. But once people start thinking they are real, they get set in stone, and then the patient is out of luck.

“his colleagues may tell him the clue is worthless” – perhaps you mean “not worth very much”. That’ll be because the test lacks specificity or sensitivity, and that’ll just be a fact, not a fallacy.

“circular system of logic that is disabling his doctors”. There are no problems of logic except perhaps yours, in thinking that “low quality data should be worth more than it really is” or equating low-quality data to no data at all, and then noting that undervalues it (tautology).

We all know Wagner’s Opera music is better than it sounds.

Alternatively you might be mostly thinking “priors can be unrealistic, exhibiting more certainty than is warranted, or even be wildly wrong about the expected value” – that can happen. When it does there should be dispute about what prior to use among different people. In your examples there might be data to inform the prior, or we could try to obtain more information about that.

Bayesian statistical theory is the only real theory of learning I know of. Frequentist theory, when any good, corresponds to Bayesians using very-uninformative priors.

If the physician suspects the rare diagnosis is correct, then obviously his personal prior probability is fairly high. When the test comes back positive, his posterior probability is substantially higher.

Why would this MD then fail to follow through? Because his colleagues disagree? I’m not an MD, but that’s not my understanding of how patients are treated at all! I understand that MDs are taught to take responsibility for their patients (at least within the MD’s specialty) and provide the treatment that they (the MD) consider best. It’s not something that’s done by consensus, in my experience.

I would also suggest that you follow an implicitly Bayesian approach much more than you realize. You just do it much more crudely. To illustrate, imagine you do an experiment and get some interesting new result. Do you automatically believe that result? Or do you first try to repeat it a couple more times? Presumably the latter. That’s what I was taught in grad school, anyway. (I’m a fellow molecular biologist by training.)

But why repeat it? Because we know from experience that new interesting results aren’t trustworthy after only a single observation. There’s too much chance that it’s a fluke or a mistake or a misinterpretation or whatever. Put in Bayesian terms, the prior probability of any substantially new and interesting result is quite low, so a single observation (akin to a single test result) isn’t sufficiently convincing. Only after 2 or 3 consistent results do we believe the result. And even then we typically require data that supports the new finding in several complementary ways.

As I see it, that’s essentially a Bayesian approach. It’s just that we don’t typically articulate a value for our low prior probability, and we don’t formally analyze when the cumulative data leads to a sufficiently positive posterior probability. We just subjectively assess whether our colleagues and target publication journals are likely to believe us given the available data.

Qetzal said:

“Why would this MD then fail to follow through? Because his colleagues disagree? I’m not an MD, but that’s not my understanding of how patients are treated at all! I understand that MDs are taught to take responsibility for their patients (at least within the MD’s specialty) and provide the treatment that they (the MD) consider best. It’s not something that’s done by consensus, in my experience.”

Well, my experience has been that now MDs are expected to follow standardized guidelines. The guidelines are supposed to be “evidence-based,” and they are issued by relatively small panels of experts, often suspiciously like-minded, who work them out privately, in closed-door sessions. We have just seen the introduction of new cholesterol guidelines; take a look at the complaints made about that process. Doctors who strictly follow guidelines are protected against lawsuits if outcomes are bad, because they were following the “standard of care.” Also, it is much faster to process patients using guidelines, and speed is highly valued. They’re calling it productivity now. The reasons for that, are obvious; medical care groups are businesses, after all.

There are doctors who are skeptical of expert consensus, and don’t follow all the guidelines, but you won’t find them here at SBM, except as dissenting voices.

I don’t have anything against Bayes, and I’m sure you’re right, that I use it all the time, informally. It’s kind of a common-sense process. I just want people to realize that if the prior is a guess, and the guess is wrong, then the result will be wrong, too. Even if a panel of experts handed down the guess from their committee meeting. I want people to be more skeptical and inquiring, about the quality of their priors.

“Um, you’re the one that brought up the physician who doesn’t even run the test.”

Again you misunderstand. I can see that by misunderstanding, you are actually illustrating the behavior in yourself, that you are attributing to physicians. Your prior probability of a physician knowing what he is doing is so low, that any evidence presented is explained away and ignored, to let you continue in holding on to your prior belief.

Running tests is most valuable when you are uncertain of the diagnosis. If you are certain, then the patient response will in most cases, let you know if your certainty was warranted. If you want to rule out a diagnosis you use a test that has a sensitivity of over 99%. If the test is negative, that makes your diagnosis highly unlikely. Conversely, a test with a specificity of over 99%, if positive, will most likely rule in your diagnosis.
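To put rough numbers on that rule of thumb, here is a small Python sketch. The 30% pre-test probability and the 80% figures for the "other" test characteristic are assumptions for illustration only; the comment above specifies only the 99% values.

```python
def post_test_probability(pretest, sensitivity, specificity, positive):
    """Probability of disease after a test result, by Bayes' theorem."""
    if positive:
        true_pos = sensitivity * pretest
        false_pos = (1 - specificity) * (1 - pretest)
        return true_pos / (true_pos + false_pos)
    false_neg = (1 - sensitivity) * pretest
    true_neg = specificity * (1 - pretest)
    return false_neg / (false_neg + true_neg)


PRETEST = 0.30  # assumed pre-test probability, for illustration

# Highly sensitive test (99%), assumed 80% specific, result negative: rules out.
ruled_out = post_test_probability(PRETEST, 0.99, 0.80, positive=False)

# Highly specific test (99%), assumed 80% sensitive, result positive: rules in.
ruled_in = post_test_probability(PRETEST, 0.80, 0.99, positive=True)

print(f"negative on the sensitive test: {ruled_out:.3f}")
print(f"positive on the specific test:  {ruled_in:.3f}")
```

With these assumed inputs the negative result drops the probability below 1% and the positive result raises it above 95%. A different pre-test probability would shift both figures, which is the point of the surrounding argument about priors.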

@Weing,

Ha! It occurred to me, also, that we’re using different priors, based on our self-interest. It’s more utilitarian for a physician to trust the experts and follow the guidelines, no matter what. This is definitely safer for you, legally.

But I’ve seen enough bad outcomes for patients, from this method, that I don’t trust it anymore. There’s no safety for me, as a patient, from having followed the standard of care, if it’s wrong. So my safety is served better by questioning, while yours is served better by obeying.

So of course we get different results when we run the numbers about the probability of any medical encounter having a “good” outcome. A good outcome for the physician (safety from blame) may simultaneously be a bad outcome for the patient (poorer health). Let me emphasize again, that I’ve seen this divergent outcome happen, in real life, more than once. It’s not merely theoretical.

Well, I’m glad we cleared that up.

“It’s more utilitarian for a physician to trust the experts and follow the guidelines, no matter what. This is definitely safer for you, legally.”

How did guidelines get in here? Now you are talking about a situation where the diagnosis is correct but you disagree with the established treatment as indicated by guidelines. For example, you have coronary artery disease and high cholesterol and your medical cocktail includes the dreaded statin. You develop muscle aching and dark urine and your doctor says to stay on the statin no matter what because that is the guideline and the elevated CK and creatinine are obviously false positives? That would not be safe for you nor for the doctor. The alternative is your doctor stops the statin and after hydration or even dialysis, your renal function improves but does not entirely return to normal. Is this the type of divergent outcome you are talking about? You are upset because the doctor, by following the guidelines here, is legally protected? If the doctor had not slavishly prescribed the statin, this never would have happened. That’s your beef? Sorry. Outcomes are not guaranteed. Absence of adverse reactions is not guaranteed.

@self skeptic:

Ha! It occurs to me that you don’t have a single friggin’ clue as to what you are talking about.

Yes, defensive medicine is a reality. No, it is not a huge consideration. And more importantly it actually works against your argument. We would be much more likely to cover our asses and order all the tests or even the rarest of things regardless of what our priors tell us. It doesn’t work the other way around where we let our patients languish, refusing to order the test because we feel the prior to be so low.

And it is also downright insulting to insinuate that we care so little about our patients that we would sacrifice their health for our professional security and the field is downright littered with counter examples.

I have been in school or doing research for a total of 11 years after high school. I have $400k in debt. I have 6-8 more years of training where my maximum earning capacity will be around $75-85k per annum. I have friends who already own homes and make double that and have no debt and are able to put away for their retirement. I have a Roth-IRA with around $6k in it and had to scramble to pay my bills this month. If you think I am doing this so I can go on living the easy life at the expense of my patients well… eff you.

Your hypothesis is based on the rather insulting premise that all doctors care about is not being sued. Many doctors also would like to heal their patients.

Is your skepticism based on a specific personal experience?

@weing:

Sadly too many of us run tests because we don’t know any better. But the reality is that you should be able to make the PDx and have a very good DDx from H&P alone. Tests should be used only to confirm a diagnosis for insurance or treatment purposes (there are times, as you know, when being wrong on the treatment could make things worse so we confirm it just to be sure), decide between indifferentiable PDx and DDx’s, provide a baseline so as to monitor progress, determine severity for prognostic reasons, and in very rare circumstances as a fishing trip because we just really don’t know what is going on. Except in the last case we should always be able to predict what the outcome of the test is and what that means and what we would expect to see if it were actually something else. And in all cases we should be able to state how the results of the test would change our management.

SS, I don’t think you understand clinical thought processes very well. You remind me of myself when I look at a painting. I see a picture, but an artist sees lines directing the viewer to the focal point and different values heightening the impact of the painting.

EVERY condition has common and rare causes. If I were to use an example, hypertension is extremely common. A very few individuals with hypertension have hypertension due to an underlying condition, most of which are quite rare – a pheochromocytoma, an aldosterone producing adrenal tumor, Cushing’s disease, renal artery stenosis,etc. Much of medical training is to pick up the clues that one of these conditions might be present (low potassium for hyperaldosteronism, family history of other endocrine tumors for pheo, body habitus or presence of diabetes for Cushing’s disease, blood pressure that widely vacillates or is difficult to control, to give examples). When we are presented with a condition such as hypertension the differential diagnosis includes the common and uncommon possibilities, and one of our jobs is to avoid missing the rare and hopefully curable causes without subjecting large numbers of individuals to potentially dangerous studies with minimal yield. Much of training deals with this exact issue. Often in training the academic attending will focus on the rare conditions (the patient may obviously have a common disease but the attending will hammer the residents with questions about the rare conditions. This is variously known as “pimping the resident” or as “mental masturbation” but there is a point to this approach) and the boards certainly have this focus (a joke among internists is that the likelihood of a condition appearing on the internal medicine boards is inversely proportional to the chance of it occurring in real life).

I do admit that some doctors are better at it than others and there’s definitely the art of medicine involved also.

Hmm…Dr. Crislip said,

“Above all what the p value does not include is measure of the quality of the study. If there is garbage in, there will be garbage out. ”

and

“The problem with Bayes, although superior to p values, is that of pretest plausibility. Often it seems people are pulling pretest plausibilty out of thin air.”

So I’m agreeing with one of the SBM bloggers, that we shouldn’t get so absorbed in statistical methods, that we lose sight of the quality and accuracy of the input data and assumptions.

The point I’m making, is that when you’re adjusting your priors over time (as somebody said, learning), you’re likely to be adjusting them in a predetermined direction. Initially low priors will get lower, and high priors will get higher, over time, because of the habit of underdiagnosing diseases perceived to be rare, and overdiagnosing those perceived to be common. That’s true whether you collect the data from experts or from community practitioners.

Why is that so controversial? It’s a simple GIGO observation, with a little thought about how the input assumptions will increase their bias, in a non-random way, over time.

Your various comments about how complex and discerning the clinical process is, and how I’m failing to appreciate its excellence and difficulty, might be more convincing if I were younger, and had fewer decades of experience with various friends and family members getting medical care. I do agree that it’s difficult and complex. Believe me, I’d much rather delegate all this research I’ve been doing, to the medical experts. But that hasn’t worked out very well, several times in the past, and the consequences are ongoing. So I think I’d better keep a closer eye on the ball, myself. And report back, on what I’m finding.

I’m sure you’re all doing the best you can, with “the army you have.” Maybe we non-clinicians can help, by providing a wider perspective.

First you need to actually have a clue what it is the clinicians do and how they act and think. You can’t go help a rocket scientist build an engine if you think they run on farts and are made of cheese. And with your current level of understanding of clinical and translational medical science that’s about where you are.

I think that SS is making the mistake of thinking providing a wider perspective means telling a doctor how to do their job – micromanaging.

Using the services of a professional means being willing to accept that they have expertise beyond yours in the field. This does not mean that you completely hand over all the decision making and power.

Often the best way to work out a concern or disagreement, and still have access to the skills of a professional, is to present the problem and allow the professional to use their training to find a workable solution.

Unfortunately, I don’t really have a clear idea what SS thinks the problem is. Delayed diagnoses? We have no idea if his expectations for diagnoses timing are even realistic.

I don’t expect you to have a full understanding of clinical thought processes – you’d have to go to medical school for that. I do think you have a caricatured idea of how doctors think, whereas the reality is that the processes are deeper than you seem to be aware of. You are obviously well read and intelligent but don’t seem to know physicians very well and give them no credit for any in-depth thinking, at least from your comments. I’m sorry you had such bad experiences – I also have with some of my family members, so I have some insight as to where you’re coming from.

Most physicians I know agree with the following statement:

“The preface to each clinical guideline emphasizes the fact that these publications are presented as aids to clinical care and should not be used as a “one size fits all” recipe for clinical care of all patients. Individual patients will have a variety of contraindications and comorbidities that could and should result in an individualized approach to diagnosis and therapy that might differ importantly from what was suggested in the published guidelines”.

Joseph S. Alpert, MD – from the American Journal of Medicine, the current (December 2013) issue.

Guidelines are very helpful. I think most physicians try to abide by them as far as possible, but depart from them when there’s a good reason to do so. There are obvious dangers to slavishly following them, which is the reason for Dr Alpert’s editorial.

Precisely. We are taught explicitly that we must understand the limitations of data and guidelines as to how that applies to our specific patient in front of us. If the patient fits the data and guidelines well, then we should follow the algorithm. If not, then we should use our knowledge to exercise clinical judgement and adjust accordingly. We should also use our clinical judgement to suspect uncommon things and always keep them in mind so that when we get stymied – or if our patient is in serious trouble – we can check for it. And even then, in the end, if the data do not support the rare diagnosis but it is still the only one that makes sense, guess what?

That’s the diagnosis!

By example, I have sent off a number of anti-liver/kidney microsomal type 1 antibody screens on patients with obvious liver disease (one had a Tbili of 15 that had been rising by over a point a day for a week) but nothing else was coming up. Cryptogenic autoimmune hepatitis with a positive anti-LKM1 is actually pretty darned rare. Yet I’ve tested for it.

Or an even better example. We had a patient who went into respiratory failure with a very ARDS type picture. We intubated and saw blood coming out of the ETT. Did a bronch and found blood and did a triple lavage and found pulmonary hemorrhage. We sent off the tests for Wegener’s and pretty much every other autoimmune marker we could think of and all came back negative. Well, we shrugged our shoulders, called it “pauci-immune Wegeners” (the patient had a h/o kidney failure) and gave steroids. Two days later we had her off the vent.

Now, the very next week we get another kidney patient who came in for what was called cardiogenic pulmonary edema. Our admitting diagnosis was fluid overload from her peritoneal dialysis getting backed up and not filtering well. She also had a not-too-distant h/o SBP. So all of this led us to attribute SBP in the past leading to clogging of the PD machine and decreasing its filtration efficiency so she built up fluid that backed up into her lungs. We also considered a possible recurrence of the SBP. So, of course, we dialyzed her. The next morning she is not doing much better despite having a few liters off and we scratch our heads again. She continues to deteriorate and we have to intubate. Another ARDS like picture. Well, I’ll cut to the chase… it was another pulmonary hemorrhage, this time with more autoimmune markers, but still not the “right” ones as per the textbooks and board questions. So steroids were given and Robert’s your mother’s brother.

Wegener’s by itself is extremely rare. Wegener’s without the markers is even rarer still. Yet somehow, unfathomably, we stupid clinicians didn’t get locked into the more likely scenarios and ignore the unlikely, eschewing the tests necessary to diagnose it. By Self Skeptic’s logic we would have done the blood tests, had them come up negative, and left it at that, chasing our tails down the more likely pathways. Yet we did a bronch, diagnosed our patients clinically and well outside the established guidelines, and properly treated them and sent them home alive and well. All this despite it being known as an extremely low prior probability. Shocking, I know.

The other thing SS misses is that while it may be a very low pre-test probability in the general population, we, as physicians, know that we do not see the general population. We see a self-selected population that needs medical attention.

An FP may reasonably miss a much more rare diagnosis, but then it gets passed on up. Sooner rather than later (typically, though we can always be better and always do strive to be better) it gets to someone who realizes how selected their population is and considers the rare cases. As someone else (maybe you) said before, we are “pimped” for the rare cases to make sure we don’t miss them. FP guys a little less, but we are all taught to think of what is most common and what can kill our patients and never miss either. Missing something rare that won’t kill you may delay diagnosis but that is all. And in cases where a delay means death or serious injury we are taught to go ahead and pull the trigger without gold-standard proof to back you up.

If you are even thinking of temporal arteritis, compartment syndrome, ischemic bowel, necrotizing fasciitis, necrotizing enterocolitis, epiglottitis, tension pneumo… you treat! Regardless of how rare or unlikely or what evidence you have to support it. You just can’t afford to miss it and we know that.

Sorry SS, but we as clinicians are not as dumb and locked in as you think we are.

just out of curiosity–if you were to go chasing rare stuff (that ‘zebra/horse thing’) too soon, wouldn’t people start being harmed because they actually had a common issue that was not treated promptly enough? as IANAD, I can’t think of any relevant examples though.

What makes you think that? Do you have any sound basis for claiming this is true? It suggests that people aren’t actually learning at all, they’re just adjusting priors independent of the data.

I mean, I agree that this would be problematic if it were generally true, but I see no reason to believe it is. In fact, I think that applies to most of the concerns you’ve raised here.

Hi folks,

It seems to me, you shouldn’t be focusing on me and my imagined shortcomings as an observer, but on what I’m saying. This glitch in the Bayesian process leaped out at me the first time I saw it explained: the problem in diagnostic algorithms, of biased, tentative “priors” about the relative likeliness of a diagnosis being true, affecting diagnostic decisions, and then the diagnostic decisions (correct or incorrect) further biasing the prior toward high- or low-probability assignment. To me, this is similar to the theory of evolution; once you understand the concept, you realize that it is inevitable that it will happen, over time.

But then I realize you don’t share my view that there is a big chance of misdiagnosis occurring, and never being corrected, whenever one of us goes to the doctor with a non-obvious health problem. So you are not on the look-out for sources of error in diagnosis.

There’s no diagnosis fairy that goes around correcting wrong diagnoses, so that the data on who sickened or died of disease x, y, or z, ends up being correct. The percentage of wrong diagnoses that go to the grave without ever being discovered or corrected, is pretty big. Especially since the US autopsy rate is so low (around 5%, depending on how you measure it).

Maybe the information about how common misdiagnosis is, isn’t widely circulated and discussed, in clinical circles. That must be true, as you all seem to think that I am deficient in understanding the realities of clinical medicine, and that I am unduly concerned about diagnostic errors, caused not by sloppiness, but by strictly adhering to current algorithms, many (most?) of which depend on estimated Bayesian priors, formally or informally.

Having seen, by now, many such misdiagnoses play out (and when you see the percentages, you’ll realize that any alert person in their late 50′s, will have witnessed plenty) I have to think that you clinicians are the ones who are being unrealistic about how well your system is working. I can understand your bias, as you have to be able to get up and go to work in the morning, and look at yourselves in the mirror at night. But when you’re so aggressive about rejecting a scientist’s observations about where some of the problems might lie, I think maybe we potential patients have to be a little more aggressive about trying to get medical culture out of denial and into fix-it mode.

Since according to you, I’m making extraordinary claims, I’ll have to present a lot of evidence in support of them. So don’t give me the old TL;DR excuse again, Andrey. Maybe I’ll have to make a few posts on this. I’ll give a recent review first, and then some of the other evidence I’ve been collecting since I first started researching this matter, in 2007.

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3786666/

BMJ Qual Saf. 2013 Oct;22 Suppl 2:ii21-ii27. doi: 10.1136/bmjqs-2012-001615. Epub 2013 Jun 15.

The incidence of diagnostic error in medicine.

Graber ML.

Abstract

A wide variety of research studies suggest that breakdowns in the diagnostic process result in a staggering toll of harm and patient deaths. These include autopsy studies, case reviews, surveys of patient and physicians, voluntary reporting systems, using standardised patients, second reviews, diagnostic testing audits and closed claims reviews. Although these different approaches provide important information and unique insights regarding diagnostic errors, each has limitations and none is well suited to establishing the incidence of diagnostic error in actual practice, or the aggregate rate of error and harm. We argue that being able to measure the incidence of diagnostic error is essential to enable research studies on diagnostic error, and to initiate quality improvement projects aimed at reducing the risk of error and harm. Three approaches appear most promising in this regard: (1) using ‘trigger tools’ to identify from electronic health records cases at high risk for diagnostic error; (2) using standardised patients (secret shoppers) to study the rate of error in practice; (3) encouraging both patients and physicians to voluntarily report errors they encounter, and facilitating this process.

KEYWORDS: Decision making, Diagnostic errors, Medical error, measurement/epidemiology, Patient safety

PMID: 23771902 [PubMed - in process] PMCID: PMC3786666 Free PMC Article

The following paper is about misdiagnosis among ICU patients, determined by autopsy. You’d think ICU patients would be less likely to “fall through the cracks” – but when there’s a systemic reason for the misdiagnosis (that is, according to the rulebook the diagnosis is “correct”) there is no way to fix it except for some courageous individual doctor to go out on a limb, question the rulebook, and do something outside it. With a patient who is that sick, this is asking for legal trouble, so it’s only going to happen on TV. Bayes is a big part of the rulebook, explicitly or implicitly, and a big part of how doctors reach their diagnoses, formally or informally. (See Jerome Groopman’s book “How Doctors Think” for example.) So estimated priors, and the feedback loop that exaggerates their bias over time, is likely a significant factor here. IMHO.

The autopsy rate is frighteningly low in the US these days, so most of these errors are never discovered. That means that even death certificates that attempt to be exact are not accurate, and the official statistics on who dies of what, are also not accurate, which affects estimated priors; but I’ll deal with that in another post.

http://qualitysafety.bmj.com/content/early/2012/07/23/bmjqs-2012-000803.abstract

The following paper is a study of how primary care providers can sometimes get around “system-related barriers” to prompt diagnosis. They are calling these successful efforts to get patients diagnosed despite the system “resilient actions.”

http://www.ncbi.nlm.nih.gov/pubmed/23813210

There is nothing in the quotes from the papers you posted that supports your “humble” [sic] opinion. There may be a high rate of misdiagnoses, but it isn’t the fault of Bayes.

You have imagined a “feedback loop,” whereby evidence in favor of a disease with a low prior probability magically lowers that probability, but that is mathematically impossible. If a doctor has a differential diagnosis of disease A, P(A)=.1, and disease B, P(B)=.9, and he orders a test that can discriminate between A and B; then, if the test comes back positive for A, by Bayes’ Theorem this must increase the probability of disease A, and it’s hard to imagine how a doctor could think the opposite. Of course the doctor might, on the basis of the prior, doubt the correctness of the test and order additional tests, but that sounds like generally good practice.

“Of course the doctor might, on the basis of the prior, doubt the correctness of the test and order additional tests, but that sounds like generally good practice.”

I don’t know about others, but in the situation you describe, my probability for the diseases would change to 0.5. I would then need another test. It’s hard for 1 test alone to be able to change the probability of a disease from 0.1 to 0.9 and vice-versa.
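For what it is worth, the arithmetic behind that intuition can be sketched in Python using the odds form of Bayes' theorem. The test characteristics are not given in the comments, so the likelihood ratio of 9 below is an assumption (roughly a test with 90% sensitivity and 90% specificity), and the function names are made up for this example.

```python
def posterior_from_lr(prior, likelihood_ratio):
    """Posterior probability from a prior and a positive likelihood ratio."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)


def required_lr(prior, target):
    """Likelihood ratio a single result would need to move prior to target."""
    return (target / (1 - target)) / (prior / (1 - prior))


P_A = 0.1         # prior for disease A, from the example above
ASSUMED_LR = 9.0  # hypothetical test: sensitivity 0.9, specificity 0.9

after_one_test = posterior_from_lr(P_A, ASSUMED_LR)
lr_needed_for_090 = required_lr(P_A, 0.9)

print(f"P(A) after one positive test: {after_one_test:.2f}")
print(f"LR needed to reach 0.9 in one step: {lr_needed_for_090:.0f}")
```

Under that assumption one positive result takes P(A) from 0.1 to about 0.5, and getting from 0.1 to 0.9 in a single step would need a likelihood ratio of about 81. Any likelihood ratio above 1 raises the probability of A, so a positive result cannot lower it.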

I checked out the references you gave. I couldn’t find the Bayesian process being faulted. Although what you describe is Bayesian non-learning, if there is such a thing. It sounds more like refusal to learn. There are many factors that can lead to errors; cognitive biases are a well-known cause.

http://qualitysafety.bmj.com/content/22/Suppl_2/ii58.full

http://qualitysafety.bmj.com/content/22/Suppl_2/ii65.full

Thanks, Weing, I’ll read those.

Time to do some holiday prep, heading out tomorrow.

“So you are not on the look-out for sources of error in diagnosis.

There’s no diagnosis fairy that goes around correcting wrong diagnoses, so that the data on who sickened or died of disease x, y, or z, ends up being correct.”

If a series of studies on diagnostic error ranges from 5.5% to 100%, what conclusions can be drawn?

Dr Donald Snead, former director of the Internal Medicine program at Duke University 30 some years ago, once made a comment that 80% of the time the diagnosis was not absolutely certain and some educated guesswork was involved. Hopefully we have better diagnostic tools and are doing better now – your one figure of 6.3% of missed class I diagnoses would seem to support that. However, doctors have to live with a great deal of uncertainty (I cannot tell you how distressing it was to me when I discovered this. I had assumed patients had classic histories pointing to particular disease, would have definite physical findings supporting a diagnosis which could be confirmed with lab testing, and had a predictable and positive response to treatment. How naive I was. Humans are really variable, their presentations are variable, some of them (especially the 400 pounders) are almost impossible to examine and the response to treatment is often unpredictable. The classic patient with coronary insufficiency has chest pain. About 10% do not, to use an example. It’s much worse in pediatrics where the history often is “The babysitter said he was sick”. Pediatricians have to have ESP.).

Most physicians are acutely aware of this uncertainty. We also know that nothing affects the care of a patient more than the working diagnosis. We formulate a differential diagnosis, develop a treatment plan based on the most likely causes, try to not miss the less likely but potentially fatal causes, and if things do not go well question the diagnosis. This can take the form of obtaining subspecialty consultations or further testing, of which there’s a lot. There’s a real danger in holding onto a flawed diagnosis. You think doctors are either unaware of this or too intellectually lazy to care. Give me a break! This is what keeps us awake at night.

In your own case if you think something is being missed you can and should always ask for a second opinion.

Exactly right Dave. One thing to bear in mind – and what you have just crystallized in my own head about Self Skeptic – is that a diagnosis can be wrong. A good differential, however, is much less likely to be wrong. It seems that SS thinks we just pick a diagnosis and stick to our guns. Ha. I once got pimped on “causes of chronic cough.” I made it out to about 15 or so differentials, my attending pushed me hard to about 18, and then he filled in another 3 or 4. We pick a PROVISIONAL diagnosis and then at LEAST a few of the next most likely plus a few that we shouldn’t miss. We then do what we think will elucidate the PDx and what we think will help tease apart the DDx. We simply don’t get locked into a single diagnosis and then feedback loop and leave people undiagnosed. Well, some of us do, but they are generally considered to be still learning (med students, residents) or crappy doctors.

“We simply don’t get locked into a single diagnosis and then feedback loop and leave people undiagnosed. Well, some of us do, but they are generally considered to be still learning (med students, residents) or crappy doctors.”

I think fatigue plays a huge role in premature diagnostic closure. That’s why I try to be extra-vigilant with the patient who walks-in just before you close the door, especially on a Friday evening.

@weing:

You are absolutely correct. There are other factors indeed that play into early dx closure. I was being brief and overly simplistic in my statement.

Not trying to be a smartass, but you forgot a third possibility.

While you correctly identified that there is the frequentist and the Bayesian approach to hypothesis testing, the frequentist approach should really be subdivided into the Neyman–Pearson approach and the Fisherian approach, as they are clearly distinct. Those two being mixed up is imho one of the sources of many myths and common misunderstandings.

In no particular order, if you read those three articles you may get a better idea about the issue:

Goodman, Steven N. “P values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate.” American Journal of Epidemiology 137.5 (1993): 485-496.

Goodman, Steven N. “Toward evidence-based medical statistics. 1: The P value fallacy.” Annals of internal medicine 130.12 (1999): 995-1004.

Hubbard, Raymond, and M. J. Bayarri. “P values are not error probabilities.” Available in Internet (2003).