## Reporting Preliminary Findings

While scanning through recent science press releases I came across an interesting study looking at the use of a pharmaceutical-grade antioxidant, N-Acetylcysteine (NAC), in the treatment of certain symptoms of autism. This is a small pilot study, but it did have a double-blind, placebo-controlled design. The press release reports:

During the 12-week trial, NAC treatment decreased irritability scores from 13.1 to 7.2 on the Aberrant Behavior Checklist, a widely used clinical scale for assessing irritability. The change is not as large as that seen in children taking antipsychotics. “But this is still a potentially valuable tool to have before jumping on these big guns,” Hardan said.

But concluded:

“This was a pilot study,” Hardan said. “Final conclusions cannot be made before we do a larger trial.”

I also noticed that two of the authors list significant conflicts of interest – patents on the use of NAC, and one has equity in the company that makes it. It occurred to me that a larger question than the efficacy of NAC for these autism symptoms is this – if this is a pilot study only and we should not base any firm conclusions on the results, then why the press release?

In a broader context, we are currently going through an information revolution. There are many positive aspects to the easy access to information provided by the internet, but it is a disruptive technology in many ways. The news industry is desperately trying to adapt to the collapse of their traditional business model, for example. Relevant to my question above, the scientific community is now exposed to public scrutiny in a dramatic way that is very different from the pre-internet age.

It used to be that studies were published in print scientific journals that would largely be read by experts in the field. Journal articles and presentations at scientific meetings were (and continue to be) an important conversation that each scientific community has with itself, to review new findings and ideas. This conversation happened largely in isolation from public view, because it really is only interesting and relevant to researchers. Most new ideas in science are not going to pan out. The meat grinder of peer-review, replication, and analysis by the scientific community weeds out many hypotheses (which in the field of medicine are potential new treatments).

As new hypotheses survive through the process of peer-review and further research, the valuable ideas will tend to survive while the rest fade away (at least ideally that is what should happen). At some point scientific ideas become established enough that they begin to filter their way into the science section of newspapers, articles in popular scientific magazines, and then finally into documentaries and textbooks. I won’t pretend this system ever worked perfectly, but at least there was a relationship between the degree to which scientific ideas were validated and their penetration into public awareness. Anyone really interested in the latest research could read primary journals or attend meetings, but preliminary findings were not often presented as news to the public.

The internet has changed this. Now scientific journals often publish their abstracts online, and there are some journals that have full public online access. On the whole I think this is a very good thing. It facilitates research and the exchange of information. I certainly cannot imagine maintaining an information source like Science-Based Medicine, with almost daily articles, without instant online access to published articles.

The scientific community has to realize, however, that as a consequence of this the conversation that scientists used to largely have with each other is now happening more in the public eye. In medicine this means that patients are reading preliminary studies and making health care decisions based upon the findings.

The issue within medicine is similar to the broader issue created by the democratization of information by the internet – the explosion of access (both on the creation and consumption end) increases freedom but decreases quality control (which in medicine we call the standard of care). Journalists argue that a million amateur bloggers are not a replacement for even a single full time, trained, investigative journalist, and they have a point.

I am not arguing for limiting access to information. That genie is out of the bottle, and I think the positive effects of information access far outweigh the negatives. But I do think that the scientific and medical communities need to recognize the new reality they are living in and adapt to it. I have some specific recommendations.

The first is not to send out press releases for preliminary pilot studies. The studies are there for those who are interested to find, but sending out a press release almost guarantees the results will be published as news, and the caveat at the very end about the results being preliminary is likely to be lost on most of the public. Essentially the authors are saying, here are some interesting results, now don’t make any health decisions based upon them. It’s not only unrealistic, it’s irresponsible. Overreporting of preliminary results, most of which are likely to be wrong or misleading, is a disservice to the public. The noise it generates will lead to confusion, and may lead to a decrease in public confidence in the scientific community.

In the case of NAS and autism, the authors acknowledge that NAS as a supplement is already being used by “alternative” practitioners to treat autism. Further, the NAS used in the study is pharmaceutical grade and specially handled to prevent break down of the chemical, and the results (in addition to being preliminary) cannot necessarily be applied to the supplements being sold. So this study is a setup to be misused to promote an existing market for an unproven therapy.

Journals should also consider more clearly labeling preliminary studies they publish, and perhaps even creating a special section in the journal for such studies. The reason for this is that the “peer-reviewed” label gets attached to definitive and preliminary research alike, and often this label is used to promote an idea to the public. If they were published in a section of the journal designated for preliminary findings, that might help reduce confusion over what peer-reviewed actually means. It would at least provide an easy alert to journalists and news outlets – this article comes with a huge warning at the top that the results are preliminary only (not a short caveat at the end).

I also believe that the scientific community needs to be more involved in reporting science news to the public. Press releases are often handled by non-scientists working in the press office of a journal or university, and often they distort or overhype the research. Their goal is not to accurately report results, but to promote their institution and grab headlines. The dwindling science journalism infrastructure means that many news outlets just pass along the press releases, without doing any actual journalism themselves. Often generalist reporters are reporting science news stories they are not equipped to understand. We need to rebuild the science journalism infrastructure by finding ways to support journalists who are trained to report science news, and by training scientists to be their own journalists, or at least to deal effectively with the media.

Finally we need to include in our science education curriculum how to skeptically read science news. The public needs to become more scientifically literate, because no matter what steps we take to improve how science news is reported to the public, there will always be a tremendous amount of bad or misleading science reporting (again – the internet genie is out of the bottle).

**Conclusion**

The scientific community needs to recognize that the world has changed around them. The conversation that scientists used to have mainly with each other is now increasingly happening in full public view. This can ultimately be a powerful boon to science, which needs to be transparent and open. The scientific community, however, has to adapt to this new reality. It has to retool the conversation to account for the public attention, and it has to engage more energetically with the public. Further it should become more directly involved in promoting higher quality science education in our society.

Great article! I totally agree with it!

Sometimes the very reporting of something at all lends more credibility to its significance than exists. It would be nice if preliminary work didn’t end up in the news and was just reserved for conferences and seminars. I’m not sure how useful a separate section of a journal reserved for preliminary data would be, though I don’t have too many better ideas. It seems like reporters untrained in science wouldn’t realize that it was separated, or even if they did, they would misunderstand its significance simply because, “Hey, it’s in a peer-reviewed journal!”

Btw, I think later in your article you meant to say NAC instead of NAS. Funny because I was JUST thinking about NAC yesterday and wondering if SBM ever did a post about it.

I found the study at clinicaltrials.gov. It is listed as a phase 2 study with an estimated enrollment of 40 subjects.

There were four primary and four secondary endpoints in the study. The irritability score was among the four secondary endpoints.

So, they got a hit on one of 8 endpoints, and a secondary one at that. I wouldn’t tape it to the fridge, much less issue a press release.

@Wholly Father:

The irritability score is one of five subscales of the ABC scale. The entry at clinicaltrials.gov lists total ABC score as a primary outcome and irritability score as a secondary outcome. In the paper, though, the ABC irritability subscale score is called a primary outcome, and the total ABC score is not mentioned at all. That seems strange, but perhaps it was an honest tweak of the objectives after the filing with clinicaltrials.gov.

Of the five ABC subscales, only the irritability subscale was listed as a specific outcome on the clinicaltrials.gov listing, so it appears that the investigators had a specific hypothesis about it; and of the five subscales it was the one, and only one, to achieve a statistically significant improvement, and it did so at a very highly statistically significant level (p<.001). So, it does seem like this hypothesis could merit further study.

@jt512

The change in priority of the endpoints is suspicious for a post-hoc hocus pocus to enhance the impact of the one positive result.

“Journalists argue that a million amateur bloggers are not a replacement for even a single full time, trained, investigative journalist, and they have a point.”

Maybe so in principle, but in practice a lot of the reporting on medical research is not done by “trained, investigative journalist[s].” It’s breathless, uncritical, overhyped, and hagiographic toward the heroic medical and scientific geniuses who are doing the research. Work in animal models that will at best pay off in 15 years and likely fade away with nothing to show for it is routinely presented as the breakthrough that will cure cancer, Alzheimer’s and MS. So really, the corporate media is not a helpful filter but much more often a distorting amplifier.

It strikes me that early reports do also help fend off the “Why hasn’t this been reported? Obviously the medical community is trying to hide something!” mentality.

People need to be trained to tell the difference between scholarship, journalism, and Hollywood script writing.

@Wholly Father:

You may be right, but I think you’re jumping to conclusions. There’s no law that says that you can only publish one paper per registered clinical trial. For all we know, the investigators may be preparing another paper about the original primary outcome.

Or not, but perhaps we should give them the benefit of the doubt.

Jay

@jt512:

There is no such thing as “very highly statistically significant.” It either meets significance or not. A study that is examining whether A and B are different and finds p=0.00000001 does not mean anything different than a study of A and B that finds p=0.045. Both are still significant, and the difference between A and B (what you are actually looking for) will remain the same. We can say that we are more confident that the difference is not a product of chance but is really there, except that since the alpha level is an arbitrary cutoff anyway, that really has little bearing on it. What would make something more interesting is the power of the study. I would value a highly powered study with p=0.045 much more than a lower powered study with p=0.000000001.

@nybgrus:

For a given statistical test and sample size, the smaller the p-value, the greater the evidence against the null hypothesis. That’s why most journals require authors to report the p-value exactly, rather than as just “p<.05." We routinely use phrases like "very highly statistically significant" to verbally communicate that the p-value was very small.

You’ve got that backwards, I’m afraid. For a given p-value, the greater the power of the study, the less evidence the study provides against the null hypothesis. The relative weight of the evidence in favor of the null hypothesis vs. the alternative hypothesis is quantified by a statistic known as the Bayes factor. The smaller the Bayes factor is, the greater the evidence favors the alternative hypothesis.

Now, consider three hypothetical studies’ findings, each with a two-tailed p-value from a paired t-test of 0.01. For consistency, assume that each study found the same effect size, say 0.46. However, the studies differed in power to detect the observed effect; let’s say they had power of .2, .5, and .9, respectively. Then, the corresponding Bayes factors (which I calculated here) would be .33, .42, and .62. As you can see, for the same p-value, the greater the power, the less evidence the finding provides against the null hypothesis. Indeed, in the study with the highest power, .9, the Bayes factor, being greater than .5, indicated that the evidence actually favored the null hypothesis over the alternative, in spite of the “significant” p-value.

I’ll be the first to admit that while versed in statistics I am by no means an expert.

However, I think you missed my point. Your original claim was that the very low P-value meant that there was more clinical significance to the finding – that there was something worth pursuing because the P-value was so low. My comment was that the P-value only tells us how confidently we can reject the null hypothesis – it does not indicate a greater degree of clinical significance. In other words, if A and B are found to be two units apart, whether it has a P-value of .00001 or .045 that will not change – we can just more confidently say that indeed there is a real difference between A and B of two units.

As for power, I’m pretty certain it is you who have it backwards. Power can be looked at prospectively or retrospectively. In a prospective sense it is used to determine the n necessary to adequately detect differences. In a retrospective sense it can be used to demonstrate that a study did not have enough power to make the determination in the first place.

In other words, just as a lower P value will give you greater confidence in rejecting the null, a highly powered study will give you greater confidence that the P value is actually legitimate. So when I say that I would feel more confident in a study that is highly powered but has a higher P-value what I am saying is that with an inadequately powered study you cannot trust the P-value as much, therefore a lower P-value does not inspire confidence that one can actually reject the null.

Bayes factors are indeed separate from powers, though you can obviously use the power to help infer a Bayes factor. However, a higher Bayes factor means that the a priori likelihood of the outcome is higher and thus would support whatever you applied it to – either the null or the alternative. In other words Bayes factors work like powers except that they can be applied either way. If the Bayes factor for the null is very high (homeopathy doesn’t differ at all from placebo) then even a very low P-value would be rendered insignificant. But if the Bayes factor for the alternative is very high (null = arsenic is safe to ingest, alternative = arsenic is unsafe to ingest) then things swing the other way. It just isn’t done that way by convention.

So I think I am still correct in my statement. To make it more concrete I’ll offer examples:

Study A is testing to see if homeopathic Arnica gel can prevent bruises from occurring. It is well designed, but has a small sample size and thus underpowered. After crunching the numbers, a P-value of 0.01 is calculated. This would indicate that we can reject the null (=Arnica gel does not prevent bruises). However, because it is low powered, we have much less confidence in the validity of the P-value. It could easily be noise in the signal.

Study B is testing to see if homeopathic Arnica gel can prevent bruises from occurring. It is well designed but has a very large sample size and is very well powered. After crunching the numbers, a P-value of 0.04 is calculated. This would indicate we can reject the null. This time it is well powered, so I would pay more attention to it since the P-value is more likely to be accurate. On the surface it would indicate much more strongly that Arnica gel actually does prevent the formation of bruises.

The prior likelihood of homeopathy working is extremely low. So when you apply a Bayes factor to Study B you find that the P-value is not very impressive and we can no longer reject the null.

The problem of Type I error inflation (i.e. the increased chance of a false positive) is magnified depending on how many endpoints are looked at. In the specific example that prompted our discussion, there were multiple primary and secondary endpoints. With correction (such as the Bonferroni correction) some nominally significant results may no longer be significant. But even with correction, the confidence in the results is automatically decreased.
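[As a rough sketch of the correction being discussed: the trial above had eight endpoints (four primary, four secondary), so a Bonferroni correction divides the familywise alpha by eight. The p-values below are invented purely for illustration; only the endpoint count comes from the discussion.]

```python
# Sketch of a Bonferroni correction for multiple endpoints. The eight
# hypothetical p-values are made up; only the count of eight endpoints
# mirrors the trial discussed above.

def bonferroni(p_values, alpha=0.05):
    """Return (adjusted alpha, per-endpoint significance) under Bonferroni."""
    m = len(p_values)
    adjusted_alpha = alpha / m
    return adjusted_alpha, [p < adjusted_alpha for p in p_values]

# Eight hypothetical endpoint p-values; only the first is very small.
p_values = [0.0008, 0.03, 0.12, 0.21, 0.35, 0.48, 0.60, 0.77]

adj_alpha, significant = bonferroni(p_values)
print(adj_alpha)    # 0.00625 (= 0.05 / 8)
print(significant)  # only the 0.0008 endpoint survives correction
```

Note that a nominally significant p of 0.03 fails after correction, while a p well under .001 would still survive the stricter threshold.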

I am aware that people say “very statistically significant” and that this is another way of saying there is a low P-value. What I was saying is that this doesn’t actually really mean very much when analyzing studies, for the reasons I have stated above.

@nybgrus: sorry, but I think jt512 is right about high vs. low power. Look at it this way: imagine two studies, one with 100 patients, and one with 1,000,000 patients. The study with 100 patients shows that drug A prevents death better than placebo, and the P-value is 0.04. The study with 1,000,000 patients shows that drug B prevents death with an identical P-value of 0.04. What that means is that we are equally confident that drug A and drug B beat placebo. However, for a study with N=100 and N=1,000,000 to have the same p-value, it means that the low-powered study must have a much larger effect size than the high-powered study, so the low-power study is actually more likely to be clinically significant.

The most important take home point is that clinical significance is determined by effect size, not p-value. If I had a novel antihypertensive drug that lowered your BP by one point, I could get a p-value of 10^-30 if I had a large enough sample size. It doesn’t mean it’s clinically significant.
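[A quick sketch of this point. The one-sample z-test and the assumed SD of 15 mmHg are illustrative choices, not from any real trial; the point is only that a fixed, clinically trivial 1-point effect drives the p-value toward zero as the sample grows.]

```python
import math

# A clinically trivial effect (lowering BP by 1 mmHg, assumed SD 15 mmHg)
# yields an arbitrarily small p-value with a big enough sample.

def z_test_p(effect, sd, n):
    """Two-sided p-value for a one-sample z-test of the given mean effect."""
    z = effect / (sd / math.sqrt(n))
    return math.erfc(z / math.sqrt(2))  # = 2 * (1 - Phi(z))

for n in (100, 10_000, 1_000_000):
    # p shrinks from ~0.5 toward 0 as n grows, with the effect size fixed.
    print(n, z_test_p(effect=1.0, sd=15.0, n=n))
```

The effect is 1 mmHg in every row; only the sample size, and hence the p-value, changes.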

@nybgrus

What you are addressing is a concept called the positive predictive value. This is the idea Ioannidis uses in his oft-cited paper on this blog.

The Positive Predictive Value of rejecting the null hypothesis = True positive rate / [True positive rate + False positive rate]. The P value tells us something about the False positive rate, but by itself tells us nothing about the True positive rate. Calculation of the Positive Predictive Value requires: Study Power, Critical P Value, and Prior Plausibility.

All other things being equal, a study with a low Power will yield fewer True Positives; therefore, if the null hypothesis is rejected, the Positive Predictive Value will be lower than in a study of greater Power.
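[The formula above is easy to put into code. The specific priors and powers below are illustrative assumptions, chosen only to show how an underpowered study of an implausible hypothesis yields a low positive predictive value.]

```python
# Positive predictive value of rejecting the null, per the formula above:
# PPV = true positive rate / (true positive rate + false positive rate),
# where TP rate = power * prior and FP rate = alpha * (1 - prior).

def ppv(prior, power, alpha=0.05):
    """P(hypothesis is true | null rejected), given the prior plausibility,
    the study's power, and the critical p-value (alpha)."""
    true_pos = power * prior
    false_pos = alpha * (1 - prior)
    return true_pos / (true_pos + false_pos)

# A well-powered study of a plausible hypothesis:
print(ppv(prior=0.5, power=0.9))   # ~0.95
# An underpowered pilot study of a long-shot hypothesis:
print(ppv(prior=0.1, power=0.2))   # ~0.31
```

With a low prior and low power, a “significant” result is more likely than not to be a false positive, which is the Ioannidis point in miniature.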

I made an embarrassing mistake in my last post. I said that the weight of the evidence changes from favoring the alternative hypothesis to the null hypothesis when the Bayes factor exceeds .5. The Bayes factor is the ratio of the probability of the data under the null hypothesis to the probability of the data under the alternative hypothesis, so, obviously, the pivot point is not .5, but 1.

With sufficient power, we can indeed get a Bayes factor for a test that shows that the data favors the null hypothesis over the alternative in spite of the p-value being significant. Under the conditions stated in my previous post, we need power of greater than .999. For example, with a power of .99986, the Bayes factor is 1.04, indicating that the data slightly favor the null over the alternative.
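[To make the pivot point at 1 concrete, here is a toy Bayes factor for two point hypotheses. This is not the t-test calculation discussed above; the coin-flip numbers are mine, chosen only to show a case where the data slightly favor the null.]

```python
from math import comb

# Toy Bayes factor: is a coin fair (H0: p = 0.5) or biased (H1: p = 0.7)?
# We observe 60 heads in 100 flips. BF = P(data | H0) / P(data | H1);
# BF > 1 favors the null, BF < 1 favors the alternative.

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

bf = binom_pmf(60, 100, 0.5) / binom_pmf(60, 100, 0.7)
print(bf)  # ~1.28: the data slightly favor the fair coin, even though
           # 60/100 heads looks like a lean toward the biased coin
```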

Ah yes, I believe I understand now. I was indeed not using the term quite correctly and misunderstood what jt meant. I was thinking of a different scenario than evilroboto posited (i.e. the one I posited).