I love reading quotes by the likes of Karl Popper in the scientific literature. A recent replication of Bem’s infamous psi research, Feeling the Future, gives us this quote:
Popper (1959/2002) defined a scientifically true effect as that “which can be regularly reproduced by anyone who carries out the appropriate experiment in the way prescribed.”
The paper is the latest replication of Daryl Bem’s 2011 series of 9 experiments in which he claimed consistent evidence for a precognitive effect, or the ability of future events to influence the present. The studies were published in The Journal of Personality and Social Psychology, a prestigious psychology journal. All of the studies followed a similar format, reversing the usual direction of standard psychology experiments to determine if future events can affect past performance.
In the 9th study, for example, subjects were shown a list of words in sequence on a computer screen. They were then asked to recall as many of the words as possible. Following that, they were given two practice sessions with half of the words, chosen by the computer at random. The results were then analyzed to see if practicing the words improved the subjects’ earlier recall of those words. Bem found that it did, with the largest effect size of any of the 9 studies.
Needless to say, these results were met with widespread skepticism. There are a number of ways to assess an experiment to determine if the results are reliable. You can examine the methods and the data themselves to see if there are any mistakes. You can also replicate the experiment to see if you get the same results.
Bem’s studies have not fared well in replication. Earlier this year Ritchie, Wiseman, and French published three independent replications of Bem’s 9th study, all negative. Of further interest is that the journal that originally published Bem’s article had declined to publish Ritchie et al.’s paper, on the grounds that it does not publish replications. This decision (and editorial policy) was widely criticized, as it reflects an undervaluing of replications.
It’s good to see that the journal has relented and agreed to publish a replication. Galak, LeBoeuf, Nelson, and Simmons should be commended, not only on their rigorous replication but their excellent article, which hits all the key points of this entire episode.
The researchers replicated experiments 8 and 9 of Bem (they chose these protocols because they were the most objective). They conducted 7 precise replications involving a total of 3,289 subjects (Bem’s studies involved 950 subjects). Six of the seven studies, when analyzed independently, were negative, while the last was slightly statistically significant. However, when the data are taken together, they are dead negative. The authors concluded that their experiments found no evidence for psi.
Lessons from Bem
A Bayesian Analysis
At Science-Based Medicine we have advocated taking a more Bayesian approach to scientific data. This involves considering a claim for plausibility and prior probability. Bem and others are fairly dismissive of plausibility arguments and feel that scientists should be open to whatever the evidence states. If we dismiss results because we have already decided the phenomenon is not real, then how will we ever discover new phenomena?
On the other hand, it seems like folly to ignore the results of all prior research and act as if we have no prior knowledge. There is a workable compromise – be open to new phenomena, but put any research results into the context of existing knowledge. What this means is that we make the bar for rigorous evidence proportional to the implausibility of the phenomenon being studied. (Extraordinary claims require extraordinary evidence.)
One specific manifestation of this issue is the nature of the statistical analysis of research outcomes. Some researchers propose that we use a Bayesian analysis of data, which in essence puts the new research data into the context of prior research. A Bayesian approach essentially asks – how much does this new data affect the prior probability that an effect is real?
Wagenmakers et al reanalyzed Bem’s data using a Bayesian analysis and concluded that the data are not sufficient to reject the null hypothesis. They further claim that the currently in vogue P-value analysis tends to overcall positive results. In reply, Bem claimed that Wagenmakers used a ridiculously low prior probability in his analysis. In reality, it doesn’t matter what you think the prior probability is: the Bayesian analysis showed that Bem’s data have very little effect on the probability that retrocausal cognition is real.
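To make this point concrete, here is a minimal Python sketch of how a Bayesian update works. The Bayes factor of 2 is a purely illustrative number I have assumed (it is not Wagenmakers’ actual figure); the point is that weak evidence barely moves any prior probability, high or low:

```python
def posterior_probability(prior, bayes_factor):
    """Bayesian update: convert the prior probability to odds,
    multiply by the Bayes factor, and convert back to a probability."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1 + posterior_odds)

# A weak Bayes factor (here an assumed, illustrative BF = 2 in favor
# of the effect) leaves every prior close to where it started.
for prior in (0.5, 0.1, 0.01, 0.001):
    post = posterior_probability(prior, bayes_factor=2.0)
    print(f"prior = {prior:<6} -> posterior = {post:.4f}")
```

Whether you start at 50% or at 0.1%, a small Bayes factor leaves the posterior close to the prior, which is the sense in which arguing over the exact prior does not rescue Bem’s data.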
Galak et al in the new study also perform a Bayesian analysis of their own data and conclude that this analysis strongly favors the null hypothesis. They also specifically cite Wagenmakers’ support for the Bayesian approach.
Criticisms of Bem
It’s possible for a study to “look good on paper” – meaning that the details that get reported in the published paper may make the study look more rigorous than it actually was. There may also be alternate ways to analyze the data that give a different picture. Ritchie, Wiseman, and French outline several criticisms of Bem’s methods. They mention the Bayesian issue, but also that an analysis of the data shows an inverse relationship between effect size and subject number. In other words, the fewer the number of subjects, the greater the effect size. This could imply a process called optional stopping.
This is potentially very problematic. Related to this is the admission by Bem, according to the article, that he peeked at the data as it was coming in. The reason peeking is frowned upon is precisely because it can result in things like optional stopping, which is stopping the collection of data in an experiment because the results so far are looking positive. This is a subtle way of cherry picking positive data. It is preferred that a predetermined stopping point is chosen to prevent this sort of subtle manipulation of data.
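A quick simulation illustrates why peeking is frowned upon. This is a hypothetical sketch, not Bem’s protocol: the batch size, the sample cap, and the simple z-test are my own assumed values. Under a true null effect, stopping whenever an interim test happens to look significant produces far more than the nominal 5% rate of false positives:

```python
import math
import random

def p_value(sample):
    """Two-sided z-test of the sample mean against 0 (sigma assumed to be 1)."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def false_positive_rate(n_experiments, peeking, batch=10, max_n=100, alpha=0.05):
    """Fraction of simulated null experiments declared 'significant'."""
    hits = 0
    for _ in range(n_experiments):
        data = []
        significant = False
        while len(data) < max_n:
            data.extend(random.gauss(0, 1) for _ in range(batch))
            if peeking and p_value(data) < alpha:
                significant = True  # stop early: the data "look good" so far
                break
        if not peeking:
            significant = p_value(data) < alpha  # one pre-planned test at the end
        hits += significant
    return hits / n_experiments

random.seed(1)
print("fixed stopping point  :", false_positive_rate(2000, peeking=False))
print("peek every 10 subjects:", false_positive_rate(2000, peeking=True))
```

The fixed-stopping-point arm comes out near the nominal 5%, while the peeking arm declares a null effect “significant” several times as often, which is exactly the cherry-picking problem described above.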
Another issue raised was the use of multiple analyses. Researchers can collect lots of data by looking at many variables, and then make many comparisons among those variables. Sometimes they only publish the positive correlations, and may or may not disclose that they even looked at other comparisons. Sometimes they publish all the data, but the statistical analysis treats each comparison independently. In short, this means that if you look at 20 comparisons, each with a 1 in 20 chance of reaching statistical significance, then on average one comparison will come out significant by chance alone. You can then declare a real effect. What should happen instead is that the statistical analysis is adjusted to account for the fact that 20 different comparisons were made, which can render the apparently positive results non-significant.
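The arithmetic here is easy to check by simulation. Under the null hypothesis each p-value is uniformly distributed, so with 20 independent comparisons the chance of at least one “significant” result is 1 − 0.95^20 ≈ 64%; a Bonferroni correction (dividing the threshold by the number of comparisons) restores roughly the intended 5%:

```python
import random

random.seed(0)
N_SIMS, N_TESTS, ALPHA = 10000, 20, 0.05

naive = bonferroni = 0
for _ in range(N_SIMS):
    # Under the null, each p-value is uniform on [0, 1]
    pvals = [random.random() for _ in range(N_TESTS)]
    if min(pvals) < ALPHA:
        naive += 1                       # at least one uncorrected "hit"
    if min(pvals) < ALPHA / N_TESTS:
        bonferroni += 1                  # Bonferroni-corrected threshold

print(f"at least one 'hit', uncorrected : {naive / N_SIMS:.2f}")   # ~0.64
print(f"with Bonferroni correction      : {bonferroni / N_SIMS:.2f}")  # ~0.05
```

Nearly two out of three simulated null studies yield a “significant” comparison if no correction is applied, which is why uncorrected multiple comparisons are such a reliable generator of false positives.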
Finally, there was a serious issue raised with how the data were handled. Subjects occasionally made spelling errors when listing the words they recalled. The result may have been a non-word (like ctt for cat) or another word (like car for cat). Researchers had to go through and correct these misspellings manually.
The authors point out that these corrections were done in a non-blinded fashion, creating the opportunity to fudge the data toward the positive by how correction choices are made. Bem countered that even if you removed the corrected words the results would still be positive, but that is still methodologically sloppy and is likely still relevant, for reasons I will now get into.
Researcher Degrees of Freedom
As we see, there were many problems with the methods and statistical analysis of Bem’s original paper. Bem argues that each problem was small and by itself would not have changed the results. This argument, however, misses a critical point, made very clear in another recent paper – one that was also cited and discussed in the Galak paper.
Simmons et al published a paper demonstrating how easy it is to achieve false positive results by exploiting (consciously or unconsciously) so-called “researcher degrees of freedom.” In the abstract they write:
“In this article, we accomplish two things. First, we show that despite empirical psychologists’ nominal endorsement of a low rate of false-positive findings (≤ .05), flexibility in data collection, analysis, and reporting dramatically increases actual false-positive rates. In many cases, a researcher is more likely to falsely find evidence that an effect exists than to correctly find evidence that it does not. We present computer simulations and a pair of actual experiments that demonstrate how unacceptably easy it is to accumulate (and report) statistically significant evidence for a false hypothesis.”
In my opinion this is a seminal paper that deserves wide distribution and discussion among skeptics, scientists, and the public. The paper discusses the fact that researchers make many decisions when designing and executing a study, and analyzing and reporting the data. Each individual decision may have only a small effect on the final outcome. Each decision may be made perfectly innocently, and can be reasonably justified.
However, the cumulative effect of these decisions (degrees of freedom) could be to systematically bias the results of a study toward the positive. The power of this effect is potentially huge, and likely results in a significant bias towards positive studies in the published literature.
But even worse, this effect can also be invisible. As the authors point out – each individual decision can seem quite reasonable by itself. The final published paper may not reflect the fact that the researchers, for example, looked at three different statistical methods of analysis before choosing the one that gave the best results.
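One such degree of freedom – measuring two correlated outcome variables and reporting whichever one “works” – can be simulated directly. This is a sketch in the spirit of the Simmons et al simulations, not their actual code; the sample size, the correlation of 0.5, and the z-test are assumed values chosen for illustration:

```python
import math
import random

def p_value(xs):
    """Two-sided z-test of the sample mean against 0 (sigma assumed to be 1)."""
    n = len(xs)
    z = (sum(xs) / n) * math.sqrt(n)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(2)
N_SIMS, N_SUBJ, ALPHA, RHO = 4000, 30, 0.05, 0.5

strict = flexible = 0
for _ in range(N_SIMS):
    # Two correlated null outcome measures per subject (true effect = 0)
    x1, x2 = [], []
    for _ in range(N_SUBJ):
        shared = random.gauss(0, 1)
        x1.append(math.sqrt(RHO) * shared + math.sqrt(1 - RHO) * random.gauss(0, 1))
        x2.append(math.sqrt(RHO) * shared + math.sqrt(1 - RHO) * random.gauss(0, 1))
    strict += p_value(x1) < ALPHA                                # one pre-specified outcome
    flexible += (p_value(x1) < ALPHA or p_value(x2) < ALPHA)     # report whichever "works"

print(f"pre-specified outcome   : {strict / N_SIMS:.3f}")    # ~0.05
print(f"pick the better of two  : {flexible / N_SIMS:.3f}")  # noticeably above 0.05
```

A single, seemingly innocent choice already pushes the false-positive rate well past the nominal level, and such choices compound across a study, which is exactly the cumulative bias described above.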
The authors lay out some fixes for this problem, such as researchers disclosing their methods prior to collecting data (and no peeking). But another check on this kind of bias in research is replication.
The Power of Replication
The degrees of freedom issue is one big reason that replicating studies, especially precise replications, is so important. A precise replication should have no degrees of freedom, because all the choices were already made by the original study. If the effect being researched is real, then the results should still come out positive. If they were the result of exploiting the degrees of freedom, then they should vanish.
There are also the other recognized benefits of replication. The most obvious is that any unrecognized quirky aspects of study execution or researcher biases should average out over multiple replications. For this reason it is critical for replications to be truly independent.
Another often missed reason why replications are important is simply to look at a fresh set of data. It is possible for a researcher, for example, to notice a trend in data that generates a hypothesis. That trend may have been entirely due to random clustering, however. If the data in which the trend was initially observed is used in a study, then the original random clustering can be carried forward, creating the false impression that the hypothesis is confirmed.
Replication involves gathering an entirely new data set, so any prior random patterns would not carry forward. Only if there is a real effect should the new data reflect the same pattern.
Prominently displayed at the top of the Society for Psychical Research’s website is this quote:
“I shall not commit the fashionable stupidity of regarding everything I cannot explain as a fraud.” – C. G. Jung
Clearly that quote reflects the prevailing attitude among psi researchers toward the external skepticism of their claims and research. Every skeptic who has voiced their opinion has likely been met with accusations of being dismissive and closed-minded.
But this is a straw man. Skeptics are open to new discoveries, even paradigm-changing revolutionary ideas. Often I am asked specifically – what would it take to make me accept psi claims? I have given a very specific answer – one that applies to any extraordinary claim within medicine. It would take research simultaneously displaying the following characteristics:
1 – Solid methodology (proper blinding, fresh data set, clearly defined end points, etc.)
2 – Statistically significant results
3 – Absolute magnitude of the effect size that is greater than noise level (a sufficient signal to noise ratio)
4 – Consistent results with independent replication.
Most importantly, it would need to display all four of these characteristics simultaneously. Psi research, like most research into alternative medicine modalities like homeopathy and acupuncture, cannot do that, and that is why I remain skeptical. These are the same criteria that I apply to any claim in science.
In addition, I do think that prior probability should play a role – not in accepting or rejecting any claim a priori, but in setting the threshold for the amount and quality of evidence that will be convincing. This is reasonable – it would take more evidence to convince me that someone hit Bigfoot with their car than that they hit a deer with their car. There is a word for someone who accepts the former claim with a low threshold of evidence.
You can convince me that psi phenomena are real, but it would take evidence that is at least as solid as the evidence that implies that such phenomena are probably not possible.
It is also important to recognize that the evidence for psi is so weak, and of such a nature, that it is reasonable to conclude it is not real even without considering plausibility. But it is probably not a coincidence that we consistently see either poor-quality or negative research in areas that do have very low plausibility.
The least important implication of the recent paper by Galak et al is that it provides further evidence against psi as a real phenomenon, and specifically against the claims of Daryl Bem. Psi is a highly implausible hypothesis that has already been sufficiently refuted, in my opinion, by prior research.
The paper is perhaps a milestone, however, in other important respects:
– It is an admission of sorts by The Journal of Personality and Social Psychology of the importance of precise replication and a reversal of their prior decision not to publish such research.
– The paper highlights the role, and possible superiority, of Bayesian analysis as a method for looking at experimental data.
– It highlights the role of researcher degrees of freedom in generating false positive data, and replication as one solution to this problem.
– This large, rigorous, and negative replication establishes that studies published in peer-reviewed journals with positive and solid-appearing results can still be entirely wrong. It therefore justifies initial skepticism toward any such data, especially when extraordinary claims are involved.
The insights provided by this excellent paper reflect many of the points we have been making at SBM, and should be applied broadly and vigorously to alternative medicine claims.