## Statistical Errors in Mainstream Journals

While on SBM we frequently target the worst abuses of science in medicine, it’s important to recognize that doing rigorous science is complex and mainstream scientists often fall short of the ideal. In fact, one of the advantages of exploring pseudoscience in medicine is developing a sensitive detector for errors in logic, method, and analysis. Many of the errors we point out in so-called “alternative” medicine also crop up elsewhere in medicine – although usually to a much lesser degree.

It is not uncommon, for example, for a paper to fail to adjust for multiple comparisons – if you compare many variables, you have to take that into consideration when doing the statistical analysis; otherwise the probability of a chance correlation will be increased.
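To make the multiple-comparisons point concrete, here is a minimal sketch. It assumes the tests are independent and that every null hypothesis is true; the function names are illustrative, not from any particular statistics package, and the Bonferroni correction shown is only one of several possible adjustments.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across n_tests independent
    tests when every null hypothesis is actually true."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold that keeps the family-wise
    error rate at or below alpha (Bonferroni correction)."""
    return alpha / n_tests


for n in (1, 5, 20):
    fwer = familywise_error_rate(0.05, n)
    per_test = bonferroni_threshold(0.05, n)
    print(f"{n:2d} tests: P(at least one chance hit) = {fwer:.3f}, "
          f"Bonferroni per-test threshold = {per_test:.4f}")
```

With twenty unadjusted comparisons at p < 0.05, the chance of at least one spurious "finding" is well over one half.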

I discussed just yesterday on NeuroLogica the misapplication of meta-analysis – in this case to the question of whether or not CCSVI correlates with multiple sclerosis. I find this very common in the literature, essentially a failure to appreciate the limits of this particular analysis tool.

Another example comes recently from the journal Nature Neuroscience (an article I learned about from Ben Goldacre over at the Bad Science blog). *Erroneous analyses of interactions in neuroscience: a problem of significance* investigates the frequency of a subtle but important statistical error in high profile neuroscience journals.

The authors, Sander Nieuwenhuis, Birte U Forstmann, and Eric-Jan Wagenmakers, report:

We reviewed 513 behavioral, systems and cognitive neuroscience articles in five top-ranking journals (Science, Nature, Nature Neuroscience, Neuron and The Journal of Neuroscience) and found that 78 used the correct procedure and 79 used the incorrect procedure. An additional analysis suggests that incorrect analyses of interactions are even more common in cellular and molecular neuroscience.

The incorrect procedure is this – looking at the effects of an intervention to see if they are statistically significant when compared to a no-intervention group (whether it is rats, cells, or people). Then comparing a placebo intervention to the no-intervention group to see if it has a statistically significant effect. Then comparing the results. This seems superficially legitimate, but it isn’t.

For example, if the intervention produces a barely statistically significant effect, and the placebo produces a barely not statistically significant effect, the authors might still conclude that the intervention is statistically significantly superior to placebo. However, the proper comparison is to directly compare the differences to see if the difference of difference is itself statistically significant (which it likely won’t be in this example).

This is standard procedure, for example, in placebo-controlled medical trials – the treatment group is compared to the placebo group. But what more than half of the researchers were doing in the articles reviewed is comparing both groups to a no-intervention group but not comparing them to each other. This creates the illusion of a statistically significant difference where none exists: a false positive error (erroneously rejecting the null hypothesis).
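The error is easier to see with numbers. The sketch below uses invented group means and standard errors (they are not from the Nieuwenhuis paper) and a simple normal approximation rather than whatever test a real study would use. It runs the two tempting comparisons against no intervention, then the comparison that actually matters.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(estimate: float, std_error: float) -> float:
    """Two-sided p-value for estimate / std_error, normal approximation."""
    z = abs(estimate / std_error)
    return 2.0 * (1.0 - NormalDist().cdf(z))


def compare(mean_a: float, se_a: float, mean_b: float, se_b: float) -> float:
    """p-value for the difference between two independent group means."""
    return two_sided_p(mean_a - mean_b, sqrt(se_a ** 2 + se_b ** 2))


# Hypothetical outcome means and standard errors, invented for illustration.
none_mean, placebo_mean, treat_mean = 0.0, 1.9, 2.0
se = 0.7  # assumed equal in all three groups

p_treat_vs_none = compare(treat_mean, se, none_mean, se)
p_placebo_vs_none = compare(placebo_mean, se, none_mean, se)
p_treat_vs_placebo = compare(treat_mean, se, placebo_mean, se)

print(f"treatment vs no intervention: p = {p_treat_vs_none:.3f}")
print(f"placebo vs no intervention:   p = {p_placebo_vs_none:.3f}")
print(f"treatment vs placebo:         p = {p_treat_vs_placebo:.3f}")
```

With these made-up numbers the treatment is "significant" against no intervention and the placebo is not, yet the direct treatment-versus-placebo comparison is nowhere near significant.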

The frequency of this error is huge, and there is no reason to believe that it is unique to neuroscience research or more common in neuroscience than in other areas of research.

I find this article to be very important, and I thought it deserved more play than it seems to be getting. Keeping to the highest standards of scientific rigor is critical in biomedical research. The authors do an important service in pointing out this error, and researchers, editors, and peer reviewers should take note. This should, in fact, be part of a check list that journal editors employ to ensure that submitted research uses legitimate methods. (And yes, this is a deliberate reference to The Checklist Manifesto – a powerful method for minimizing error.)

I would also point out that one of the authors on this article, Eric-Jan Wagenmakers, was the lead author on an interesting paper analyzing the psi research of Daryl Bem. (You can also listen to a very interesting interview I did with Wagenmakers on my podcast here.) To me this is an example of how it pays for mainstream scientists to pay attention to fringe science – not because the subject of the research itself is plausible or interesting, but because it often provides excellent examples of pathological science. Examining pathological science is a great way to learn what makes legitimate science legitimate, and also gives one a greater ability to detect logical and statistical errors in mainstream science.

What the Nieuwenhuis et al. paper shows is that more scientists should be availing themselves of the learning opportunity afforded by analyzing pseudoscience.

“513 behavioral … articles in five top-ranking journals … and found that 78 used the correct procedure and 79 used the incorrect procedure.”

78 correct procedures plus 79 incorrect procedures is 157. What was the status of the remaining 356 articles and the statistical techniques they used?

The article in Nature Neuroscience reported correct and incorrect analyses in five leading scientific journals in their Table 1. They did not report whether there were differences between error rates in studies of animals and studies of humans. It was so jarring to read their article because the errors they report rarely occur in randomized clinical trials involving human participants.

They were trying to avoid identifying specific published studies (giving fictive examples) for diplomatic reasons, and this deprives the reader of the chance to see which kinds of studies fall into the traps they describe. From the examples they provide, it sounds as if the errors were found in the cellular and molecular neuroscience studies. I cannot recall the last time I read a study of a clinical intervention in patients which drew conclusions based on p values less than 0.05 in one group and greater than 0.05 in the other. The listed journals do seem to be weighted toward laboratory science. I suspect that statistical errors involving research at the bedside occur less frequently than for research at the bench.

It would be interesting (and worthwhile) to have this comparison. It may take some time to do, but would be publishable if it were done.

sCAM/acupuncture researchers tend to take this technique one step further. They compare “real” acupuncture and a sham technique to a no-intervention group, and when “real” and sham produce statistically equivalent results that significantly differ from no intervention, they conclude that the sham technique and “real” acupuncture are both effective.

It seems to me that the (typically unblinded) no intervention group in such studies is usually only useful as an indicator of the presence/strength of a placebo response/effect/factor and really shouldn’t be used in statistical analysis if it is done at all. The placebo is supposed to be the control, not the no intervention.

This post meshes well with Prometheus’ two part post “Anatomy of a Study: A Dissection Guide”

http://photoninthedarkness.com/?p=228

http://photoninthedarkness.com/?p=230

I think “How to critically/skeptically evaluate a study” would make an excellent workshop or presentation at the next TAM.

Great article. I wonder if part of the problem is that many in the cell and molecular programs do not emphasize statistical training at the undergraduate and graduate levels. Our pre-med undergrads (Tier I research university) must take *either* calculus or stats, and nearly all opt for Calculus. (It is required for our grad students.) I was never required to take stats and to this day deeply regret it. I use stats daily and calculus almost never, and I think I’m typical of my colleagues (yes, arguing from experience, sorry). I essentially had to teach myself.

Another limitation may be the access to a statistician. My college (heavy in cell and molecular PIs and not the med school) underwrites statistical help via a few stats grad students, but access is limited and their knowledge thin – after all, they are students. In contrast my med school colleagues have access to full time statisticians and they have the good sense to build 5% salary into their grants. This is seldom done in the many basic research budgets I’ve reviewed over the years.

Then again, there is that old joke. Ask three statisticians for an answer, and you will get five opinions.

I think this is less common with human drug trials because the placebo-controlled paradigm is deeply entrenched. Partly because it is mandated by the FDA – so it has become the standard. (This is a good thing, BTW)

Regarding statistics – I find that this should be taught more in medical school and science programs. But perhaps it needs to be taught in a more accessible and practical way. Instead of getting bogged down in the complex math (i.e. teaching statistics to scientists so that they can crunch the numbers themselves) it might be better to split it up to a basic course that everyone takes and is designed to improve understanding of how to use and read statistics, and an advanced course that gets into the math to the point that you can actually do statistics.

If you go just for the latter, you lose a lot of students who end up not understanding the basics.

Dr. Novella,

This is my first post here, but as someone who teaches applied statistics courses in the social sciences, I want to point out a few problems I have encountered when splitting the material the way you suggest. They are mostly from the behavior induced, and I don’t know which of them would be applicable to medical students.

The first is that the stronger students are able to opt into the more advanced class first, since they know most of what is being taught the first semester (usually from undergrad) and therefore the first class slows down. (It is hard to resist the urge to pitch the class at the median student actually there instead of the median student in the cohort.) Second, the math serves as a signal to students that they need to take the class seriously. In my opinion, the hardest parts of the material to get students to really take seriously are causal inference and identification. There are no tests or problems with exact answers to do, but if you get them wrong you can be technically perfect and still have complete nonsense. Again, I don’t know whether med students would fall prey to these problems, but they can be significant practical impediments to teaching a “how to read and use stats” course before teaching the numbers.

That said, I am constantly seeing technically perfect nonsense published where the authors do exactly what they claim but don’t seem to realize that what they are doing cannot be what they want, so I am very sympathetic to the need to change how we present statistics.

AS – Thanks. I was just throwing that out there as a suggestion, but your points seem valid.

Another way to go is to include statistical analysis in a course on how to interpret studies and the literature. One way to analyze a paper is to ask – are they using the correct statistical methods? This would include issues I raised in this blog – are they doing multiple analyses? Are they properly applying a meta-analysis? Are they data mining? What does statistical significance mean? What is a Bayesian analysis?

You can teach this to doctors and scientists without making them learn the advanced math.

Dr. Novella – I sort of agree, but I don’t think you need to split the class, but rather emphasize the uses, strengths and weaknesses of different statistical tests. The stats class that I took that was based in my field was much easier to follow than the one taught by the math department (sorry mathematicians!). The department I am in now is trying to develop a biostatistics course, focusing on field-specific procedures and software to distinguish it from the math course.

Your second suggestion actually fits with some seminar courses I have taken, both as a grad and an undergrad. That’s something you can easily do with upperclassmen. (I will confess my skepticism overprepared me for those courses since I already knew how to spot shaky science.)

This error is tricky to understand without visuals. It badly needs a diagram, so I took a stab at it.

diagram of the difference error on SaveYourself.ca [~75K]

Feedback, corrections, suggestions are most welcome. Have I got it?

“Another way to go is to include statistical analysis in a course on how to interpret studies and the literature.”

Combine this post with Prometheus’s two parter on evaluating studies and you’ve got the basis of a potential killer CME workshop for TAM X.

Many times I have wished that authors put the data they used in the analysis on the journal web site, so that the reader could have a chance to analyze it. No identifying data, obviously, but if the data in the tables and figures were in an ASCII file, then, no matter what software the reader has, there would be an option to check for main effects and interactions on one’s own.

One paper on my desk now reports that an important outcome variable changed by 1.9 hours in the text, but in the table it is reported as 1.9%. There is obviously a typo in one or the other place, but I cannot tell which. Also, they reported doing multiple regression using 5 covariates, but there were not enough participants in the study to support that kind of analysis; I would like to check the analyses myself.

This is the age of the internet; let us exploit its capabilities. Posting the text of the article online is revolutionary in that you do not have to go out at night and trudge to the library and wait in line to use the Xerox machine in the periodicals section, but this amounts to nothing more than a change in delivery of the same content as before. A real revolution would be to enhance the actual content which is delivered.

In the research-review work that I do the differences between groups, and their statistical significance, are not as telling as the effect sizes (eg, Cohen’s d or similar), which also take into account the dispersion (standard deviation) of the data. All too often, authors report significant findings based on differences when the effect sizes (which they do not report) turn out to be of minimal clinical importance, at best. In some cases, the data to calculate effect sizes are not provided, which is either deceptive or careless (one can never tell for sure).
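For readers who have not computed one, here is a rough sketch of Cohen’s d using the pooled standard deviation, plus the standard relationship between d and the two-sample t statistic for equal group sizes. The data and function names are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample SD."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def t_from_d(d: float, n_per_group: int) -> float:
    """Two-sample t statistic implied by effect size d when both
    groups have n_per_group observations: t = d * sqrt(n / 2)."""
    return d * sqrt(n_per_group / 2)


# Tiny made-up example: means 4 and 3, pooled SD 2, so d = 0.5.
print(cohens_d([2, 4, 6], [1, 3, 5]))

# A trivially small effect becomes "significant" with enough subjects.
for n in (20, 200, 2000):
    print(f"d = 0.1, n = {n:4d} per group -> t = {t_from_d(0.1, n):.2f}")
```

The second loop is the point of the comment above: with d fixed at 0.1, the test statistic climbs past the usual significance cutoffs purely because the sample grows, which is why the effect size needs to be reported alongside the p value.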

As for Paul Ingraham’s chart – which is very nicely done – it seems to overlook the possibility of having a no-treatment condition (eg, “wait-list group”) as a valid cohort for comparison with an active-treatment group. This allows evaluation of actual treatment effects compared with the normal course of disease in similarly selected subjects who are enrolled in the research study but awaiting any treatment. Anyway… I’ve seen that done and it seems to be a valid approach in the pain management field.

Of course, having a third group – receiving placebo – would shed further light on whether treatment effects reflect (a) influences of receiving any therapy, even one expectedly inert, within the context of the study environment (ie placebo response) and/or (b) the natural course of the disease/condition or one which might spontaneously improve merely by being a part of a research trial (ie, Hawthorne effect).

Some people have proposed eliminating calculus as a medical school prerequisite and substituting statistics. Apart from a few very superficial references to derivatives in physiology class, like dV/dt, I encountered precious few allusions to the concepts of the calculus.

How about making linear algebra a prerequisite instead of calculus, with a statistics course that can make the concepts of multiple regression accessible, either as a second undergraduate math course or maybe a first or second year medical school course? The med school course would focus on the assumptions of statistical tests and the pitfalls into which one may fall if unwary. If there are pitfalls in diagnosis which clinicians are taught to be wary of, there are traps in inference which future readers of new research should be taught to recognize and avoid. This does not mean that every medical student would be expected to run a lot of analyses, but just as we hope that they can make sense of radiologists’ reports and can look at a few images themselves, we can hope that the methods section of published literature will not look like a bunch of crazy shapes and shadows with no rhyme or reason.

“This should, in fact, be part of a check list that journal editors employ ”

Damn it, what happened to the reviewers? It is they that are incompetent. It is they that failed, either through ignorance or laziness.

Next the proposal to teach stats without calculus disgusts me. Let’s just have Gaussian distributions be magical? That is the “best fitting line” cause my calculating machine says so?

It is however indisputable that scientists need more courses on the design and analysis of experiments – I see folks designing experiments that have absolutely no business doing so – they go ahead doing it without a serious idea about how the data will be analyzed. When I tell my scientists that what they really want is to test an interaction, some don’t know what I’m talking about (this is the issue of the article, using different words). I explain it 3 different ways, and they sometimes still refuse to get it. I write down a model, and their brains shut down. Wanna do science? Get a clue. Stop letting folks do biology while (and because) they hate the math. We are making docs and scientists that can’t even properly read the literature. We are making a literature where statistical cheating is common, cause the reader, the reviewer, and the editor all can’t spot it.

Some case studies. We have a culture problem.

I am having to explain why I take logarithms to post-docs, or having to explain what I mean when I ask them why they took the anti-logs of the RT-PCR data. They compare the cycle thresholds, which is good, since it is already in log space. They even compute the standard deviations and do T-tests in log space – good, though they may not know why it’s good. They then plot in non-log space, and do a bastard thing that they have no clue about to obtain error bars for that plot. They argue that “everybody does it that way” – popularity, or “my gigantic non-peer reviewed book says to do it this way” – authority.

They have 3 vs 3 observations of something – and plot the means and the standard deviations rather than showing the data. “everybody does it that way” again. They call the usual plot of the actual data “the dot plot” – maybe they’ve never seen someone actually show the damn data, and need new words for that peculiar situation.

They run blots and show one, and write that it was “representative”. Why not design a good experiment with replication and analyze the data using (gasp) a model, and estimate the chance that rejecting the null hypothesis is an accident – so we can actually call it science? And god forbid saying what the nature of the replicates was.

Folks do t-tests on 3 vs. 3 designs where the replicates are merely technical and not biological replicates. They fail to say what the replicates are of course. Their lab mates have taught them to do the experiment this way, cause it works – you get small p-values. If it doesn’t, repeat it, being sure to know which things you want to have bigger values – it will work eventually. Then say it is representative. Doing actual biological replicates is hazardous to your desired conclusion.

They would rather break up a linear Y-axis into 3 ranges, where we still can’t see how different the low-valued samples were, rather than make Y be log scale. Reader might not get how big the difference you want to tell them about is – cause they aren’t used to log-scale plots. “everybody does it that way.”

Cell counting: we see that at day 5 we have 20 million of type A cells but only 10 million of type B, the average of triplicates that give p=.01. Be sure to plot on a linear scale so people can’t tell 1) that we started with about 20,000 type A and about 10,000 type B, all we can see is that they were very small numbers, and 2) we can’t tell if we get straight lines on the log-scale. When I plot log-scale and they see perfectly straight lines that are perfectly parallel, they don’t like my methods. If I suggest a dumbed-down version, taking log(final/starting), they may get it, but may not want me to take logs. Thankfully this last example is more rare, and it is true that there are smart and good people out there who do understand the math pretty well for all my examples.
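The log(final/starting) version of that cell-counting example takes only a few lines. This sketch plugs in the round numbers from the comment (which were themselves approximate) and assumes simple exponential growth over the five days.

```python
from math import log2


def doublings(start_count: float, final_count: float) -> float:
    """Number of population doublings between two counts: log2(final/start)."""
    return log2(final_count / start_count)


days = 5
type_a = doublings(20_000, 20_000_000)  # about 20,000 -> 20 million
type_b = doublings(10_000, 10_000_000)  # about 10,000 -> 10 million

print(f"type A: {type_a:.2f} doublings, "
      f"doubling time about {days * 24 / type_a:.1f} h")
print(f"type B: {type_b:.2f} doublings, "
      f"doubling time about {days * 24 / type_b:.1f} h")
```

Both cell types grew a thousandfold, so on a log scale the two lines are parallel; the twofold difference at day 5 is just the twofold difference in what was seeded.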

The real reason they do these things is never because it is good. Whether it is good is usually never asked. It is in fact bad – misleading or suppressing information. And that’s why it is popular. It gets passed on like certain rhymes on the kindergarten playground – the adults are powerless to stop it.

An often whimsical biostatistics book by Geoffrey Norman and David Streiner tells of a colleague whose master’s thesis involved looking at the constipative effects of medications used by elderly patients. Because the dependent variable (whether or not the patient had a bowel movement that day) was binomially distributed, the arc sine transformation should be used to analyze the data. A supervisor asked “If a clinician were to ask you what the number means, are you going to tell him, ‘It is two times the angle whose sine is the square root of the number of patients (plus 0.5) who shat that day?’”

The moral was that even when it is mathematically rigorous to transform the data, it may make it harder for non-mathematicians to make sense of the results.

You can require only so much undergraduate work from applicants to medical school. Lots of math is great. Making sense of the methods sections of studies is a valuable skill, and the math which is needed to do that should have priority.

@Rork

You are absolutely right on this. I have the same experiences with the students and postdocs; we are wrestling right now on the “right” way to normalize western blots. Here’s an example: we recently published an echocardiography study. The control mean & SD on ventricle size was tight. The experimentals had a similar mean but huge SD. The student insisted there was no difference until I forced her to make the dot plot. Shabam! She suddenly discovered that half the experimental hearts segregated into dilatative failure. The means were only the same thanks to the law of averages.
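A toy version of that situation, with entirely invented numbers in arbitrary units (not the data from the study mentioned), shows why the dot plot matters: the summary statistics hide a split that a one-line strip plot makes obvious.

```python
from statistics import mean, stdev

# Hypothetical measurements, invented for illustration only.
control = [4.0, 4.1, 3.9, 4.0, 4.2, 3.8]
experimental = [3.0, 3.1, 2.9, 5.0, 5.1, 4.9]


def dot_plot(values, lo, hi, width=40):
    """One-line text strip plot: each observation becomes an 'o'
    placed proportionally between lo and hi."""
    line = [" "] * (width + 1)
    for v in values:
        line[round((v - lo) / (hi - lo) * width)] = "o"
    return "".join(line)


for name, data in (("control", control), ("experimental", experimental)):
    print(f"{name:>12}: mean = {mean(data):.2f}, SD = {stdev(data):.2f}  "
          f"|{dot_plot(data, 2.5, 5.5)}|")
```

The two groups have the same mean, but the experimental values fall into two separate clusters, which is the pattern that only appeared once the raw data were plotted.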

But I think there’s another culprit here – the Journal Editors and reviewers. There is no longer space for ambiguity or less-than-perfect results. Editors seek excuses to reject and review competition is fierce. Remember when we used to say, “Those data look too perfect”? Not any more. The students and mentors are responding not only out of ignorance or laziness, but out of artificial standards of “data perfection.”

Angora Rabbit:

Getting the student to draw the data by hand is a nice device. Look at the data before you calculate anything.

When the mean in the samples is the same but the variance is different, it can confuse students who have the idea that the t-test is done to see if the means are the same in the two samples. You need to remind them that the test is really done to see if the samples are drawn from the same population, not to see if the means are equal.

The world population of wild sewer rats has a mean weight. You could, through many generations of careful inbreeding, develop a strain of rats with that same mean weight, but with a much smaller variance. If samples of wild and laboratory rats are weighed, and the mean weights do not differ, most students will agree that the means are the same, but few will agree that the samples are drawn from the same population. This thought experiment is easy to carry out. Stating the null hypothesis correctly can cause at least some clouds to disperse.

Rork:

It may sound disgusting to skip calculus and present the normal distribution as “magical,” but the methods needed to do integration of the Gaussian distribution are well beyond the scope of undergraduate calculus courses, so even if you have taken three semesters of calculus, that distribution has some elements of magic (or faith).

In Thomas Gilovich’s 1991 classic book, “How We Know What Isn’t So”, he spends some time at the end talking about learning proper use of statistical methods and inferences.

He discusses a study of graduate students of various sorts, and how they developed in their ability to properly interpret mixed data. The outcome was that the most growth in the ability to interpret complex data was among students in the social sciences, who get used to dealing with complex data without simple cause and effect relationships, where multiple causal chains have to be teased apart experimentally. Medical students did a little worse, and students in the physical sciences and in the humanities (law) did the worst.

I certainly didn’t learn my stats in med school or much before. Grad school, and then working in the world of clinical trials, was what it took for me.

I am not sure that I am following this correctly, having only a very limited grasp of statistics, but is the original complaint really a statistical problem, or a more fundamental conceptual one: that the researchers have forgotten, or never quite understood, the purpose of various elements in study design?

I mean, in a study testing for intrinsic efficacy in a treatment method with subjective end-points, then obviously only the comparison to a completely inert but indistinguishable placebo will do.

So why should an untreated group be included in studies of some intervention, and especially in lab studies where there are usually far fewer variables? What is its function? It implies a more complex scenario than exists in the usual drug study.

Thanks for the suggestion, Vera Montanumom. I definitely agree that a no-treatment group is a “valid cohort for comparison.” It’s not wrong to compare no-treatment to treatment … it’s just wrong to only make that comparison, and pass off the statistically significant difference as a statistically significant treatment effect size. So you’re right, but that subtlety may be beyond the power of one diagram to convey. This is why a scientist buddy of mine called this more of an error of inference than of statistics. As she put it:

And the stakes for getting it right tend to go way up studying anything where treatment requires a lot of interaction (i.e. psychiatry, massage therapy), where placebo can loom large, powered by a metric buttload of nonspecific effects, and accounting for nearly all of what was presumed to be a treatment effect. If you don’t include a placebo comparison there, and a no-treatment group, and do an ANOVA … well, shoot, experimental design fail. You’re doing it wrong.

I’ve revised that diagram more, based on feedback received so far, here and elsewhere. More criticism still most welcome. I am determined to nail this.

diagram of the difference error on SaveYourself.ca [~125K]

Paul: – it’s just wrong only make that comparison, and pass off the statistically significant difference as a statistically significant treatment effect size.

I like the diagram, but you need to specify what question has been asked of the data. That surely determines which are “right” and “wrong” approaches, also what is “relevant” and “significant”.

Those judgments often coincide, as in questions of intrinsic drug efficacy, which is what everyone here usually has in mind.

But other questions do arise.

For example patients and CAM practitioners may ask why the enormous difference between both placebo/treatment arms and “no intervention” is regarded as “significant but not relevant”. Not relevant to what? Why could that not be seen as evidence of important “mind-body” responses from the overall program of care? The “no intervention” arm will control for most other influences.

There is at minimum often the false presumption from such data that nothing of value has occurred, as implied by somewhat loose “it doesn’t work” statements (should at least be qualified by “via the mechanisms typically claimed”).

This should NOT be seen as a defence of pseudoscience. It is a matter of scientific precision, which should operate all ways, especially in our dealings with CAM.

pmoran, quick partial response: it sounds like you haven’t seen the more recent version of the diagram, which addresses some aspects of your suggestions (i.e. I ditched “right” and “wrong”). Visit again and refresh your browser to make sure you’re not seeing a cached old version.

“Next the proposal to teach stats without calculus disgusts me. Let’s just have Gaussian distributions be magical? That is the “best fitting line” cause my calculating machine says so?”

I am disturbed that people can use computers without knowing anything about how they work. Most can’t even program a simple search routine! They regard their working as little better than magic. Yet it would be insane to insist everyone who is going to use a computer needs at least second year IT papers!

You cannot teach everyone everything. And calculus is well down the list of necessary skills for a doctor. If they learn that, they are not learning something else.

I would suggest that the issue is that people who go into medical research need skills that most doctors do not. They shouldn’t even be doing the same degree.

(For the record I like calculus and am good at it, passing level 2 university with ease, so I am not biased against the subject.)

For what it’s worth:

I took a year of calculus, was good at it and enjoyed it, but the only time I ever used it was to solve problems in a physics class, and there were ways to do that without calculus. Looking back, I wish I had had a year’s education in statistics instead. There is an excellent course at The Teaching Company that explains what calculus is all about without getting into equations. I think understanding those basic concepts would be all most doctors would need. The same really goes for statistics: doctors need to understand the concepts and how to interpret the statistics they read in published studies, but they don’t necessarily need to know how to do the statistical math themselves.

I can completely second what Dr. Hall said. I completely breezed through calculus and did quite well at it, but never found too many applications for it. And when it came to physics, I almost always just converted everything to energy and used conservation to solve my problems. It was just easier.

But, I did also have a reasonably solid foundation in statistics – though not as much as I would have liked in retrospect.

@ pmoran et al:

Intuitive Biostatistics by Harvey Motulsky (ISBN 978-0-19-973006-3) is now in its second edition, and might be a good investment. The place of statistics within the framework of science is well-discussed and many common problems with study interpretation are nicely presented. Check it out and see what you think.

Regarding the flaw mentioned in the OP: testing two interventions vs. placebo, and not against each other:

Robinson and colleagues made this mistake in a test of meds vs therapy for post-stroke depression prevention:

JAMA, May 28, 2008—Vol 299, No. 20.

They were taken to task for it. This is the JAMA article that drew heavy fire for failure to disclose drug funding, and so this study got raked over the coals a bit more than most do. [This is where the journals amped up their COI standards, and DeAngelis had a couple editorials on the issue.]

Do journals even now look at statistical significance?

I thought in the top journals you have to use confidence intervals for RCT’s.

Preface from Michael Woodroofe’s “Probability with Applications”, which is just a beginning to learning statistics:

“The prerequisite for an intelligent reading of this book is 2 years of calculus.”

Folks claiming not really needing it for physics weren’t doing difficult enough stuff. Do you take elliptical orbits on faith? That sphere’s mass acts similar to a point mass? Is it OK to do physics but not understand Newton? Einstein or Schrodinger stuff is clearly impossible.

Paul: Your diagram doesn’t reflect the classic situation, where there are four groups, and one tests an interaction to see if (A-B)-(C-D) = 0. People instead are showing A>B is significant and that C>D is not. They will occasionally instead show that A>C (which for you would be treated vs. placebo, possibly hailed as the solution to the problem) but that is still not enough. For example imagine C and D are placebo and no treatment but performed by group 2, and C is lower than A, but D is much lower than B, a no treatment arm performed by group 1. It’s OK, so long as you know that.
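As a rough sketch of that four-group test: the interaction is the contrast (A-B)-(C-D), and for independent groups its standard error combines all four group standard errors. The means and standard errors below are invented, and a normal approximation stands in for the ANOVA or regression model a real analysis would fit.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(estimate: float, std_error: float) -> float:
    """Two-sided p-value for estimate / std_error, normal approximation."""
    z = abs(estimate / std_error)
    return 2.0 * (1.0 - NormalDist().cdf(z))


# Hypothetical group means and standard errors (four independent groups).
means = {"A": 10.0, "B": 7.0, "C": 9.0, "D": 7.5}
ses = {"A": 1.0, "B": 1.0, "C": 1.0, "D": 1.0}

# The two simple effects people tend to report separately.
p_ab = two_sided_p(means["A"] - means["B"],
                   sqrt(ses["A"] ** 2 + ses["B"] ** 2))
p_cd = two_sided_p(means["C"] - means["D"],
                   sqrt(ses["C"] ** 2 + ses["D"] ** 2))

# The interaction contrast (A - B) - (C - D) and its standard error.
interaction = (means["A"] - means["B"]) - (means["C"] - means["D"])
interaction_se = sqrt(sum(se ** 2 for se in ses.values()))
p_interaction = two_sided_p(interaction, interaction_se)

print(f"A vs B:            p = {p_ab:.3f}")
print(f"C vs D:            p = {p_cd:.3f}")
print(f"(A-B)-(C-D) = {interaction:.1f}: p = {p_interaction:.3f}")
```

Here A>B is significant and C>D is not, but the interaction itself is far from significant, so these invented data give no grounds for saying the two effects differ.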

@rork,

“Folks claiming not really needing it for physics”

I think you are referring to me, but all I said was that I did not need it for the college physics course I took as a pre-med requirement. Physics majors definitely need calculus; but I don’t think the average medical student or clinician does.

@rork:

I chimed in as well. And yeah, I was not a physics major, so the year of non-major physics I needed to take was not exactly taxing on my calculus skills. I did use it a few times. But I don’t need to solve every basic problem using the calculus to trust that my answer is representative of the real world. Fortunately for the physicists of the world there aren’t too many “complementary and alternative physics” people out there.

So of course we weren’t doing difficult enough stuff – human beings and their biology do not involve the calculus of elliptical orbits. And yes, I did take it on “faith” (more like trust in my college physics professor) that a sphere’s mass acts similarly to a point mass. In fact, IIRC, he specifically said that such calculations were tough, required calculus to solve, and if we were curious enough to go ahead and do it, otherwise just accept that others have and move on (or something to that effect).

Dr. Novella, I think your introduction of the third group, “no intervention,” is confusing the issue. A no-intervention group is irrelevant in most medical studies, and the only relevant comparison is the pre-treatment–post-treatment difference between the active treatment and placebo (or another active treatment). As Goldacre and the authors of the original article point out, showing that the pre-post difference is significant in the active treatment group, but not in the placebo group, does not imply that the active treatment was more effective than placebo. Such a claim requires that the difference between these pre-post differences is significant.

The error is prevalent in journals in many fields.