Methodologists had an interesting summer this past year, thanks in part to a bombshell paper by Benjamin and 71 others, shared as a preprint on July 22^{nd}, 2017. The authors argued for reducing the ‘default threshold’ α for statistical significance from 5% to 0.5% (i.e., from 0.05 to 0.005).

To refresh your memory, null hypothesis significance testing (NHST) works as follows:

- Postulate a null and an alternative hypothesis (H_{0} and H_{A});
- Collect your data;
- Compute (using software) the *p*-value based on the data;
- If *p* < α, reject H_{0}; if *p* > α, don’t reject H_{0}.

The *p*-value is the probability of finding a sample result as extreme or more extreme than the current result, given that H_{0} actually is true. This implies that when there is no effect (that is, when H_{0} is true), there is a probability α of (incorrectly) rejecting H_{0}. When the α level is set to 5%, as is customary in the social sciences, this means that—when there was actually no effect—there’s a 5% chance that you will incorrectly claim to have found one: a false positive.
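A quick simulation makes this concrete (a sketch in Python; the sample sizes are illustrative): when H_{0} is true by construction, about 5% of tests still come out ‘significant’ at α = .05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, n_sims, rejections = 0.05, 10_000, 0

for _ in range(n_sims):
    # Two groups drawn from the SAME distribution: H0 is true by construction
    a = rng.normal(0, 1, 30)
    b = rng.normal(0, 1, 30)
    if stats.ttest_ind(a, b).pvalue < alpha:
        rejections += 1

print(rejections / n_sims)  # hovers around 0.05
```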

If you perform a study and find a significant result, it is difficult to find out whether this result is ‘real’ (a true positive) or ‘coincidence’ (a false positive). There are two ways to gain additional information about which may be the case:

- Simply do the study again, based on another sample. If your original result was a false positive, it is very unlikely that your new result will again be significant. (And if you’re still unsure, just do the study again, and again, and again….) In the past few years, several large-scale studies—most notably the OSF’s Reproducibility Project—have done exactly this, and found that only roughly one-third of significant results ‘replicate.’
- If you make educated guesses about (i) the probability that H_{0} actually is true, and (ii) the statistical power of your experiment, then you can estimate the probabilities of a false positive, a false negative, a true positive, and a true negative. You can then estimate the *false positive rate* (FPR): the estimated proportion of significant results that are false positives.

Method 1 requires *a lot* of work: you need to re-do many studies. Benjamin et al. (2017) focused on method 2. This, however, has a drawback: you have to estimate numbers without really knowing if your estimates are any good.

Benjamin et al. showed that, under certain reasonable conditions, the FPR will be as large as 50% if α = 5%, which is to say that half of the significant findings are actually false positives. When working with α = 0.5%, however, this FPR drops to 9%. Intuitively, this is also clear: if you make it much more difficult to claim a significant effect, it will be much less likely that you incorrectly claim a significant effect. This is the main reason why Benjamin et al. suggest lowering the default alpha-threshold by a factor of 10.
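The arithmetic behind numbers like these is just Bayes’ rule. The scenario below (10 null effects for every true one, 50% power) is an illustrative choice of mine that happens to reproduce the 50% and 9% figures; it is not necessarily the exact setting Benjamin et al. used.

```python
def false_positive_rate(alpha, power, prior_h0):
    """Proportion of significant results that are false positives:
    P(H0 | significant) via Bayes' rule."""
    false_pos = prior_h0 * alpha          # H0 true, yet significant
    true_pos = (1 - prior_h0) * power     # H1 true, and detected
    return false_pos / (false_pos + true_pos)

# Illustrative scenario: 10 null effects for every true one, 50% power
print(round(false_positive_rate(0.05, 0.5, 10 / 11), 2))   # 0.5
print(round(false_positive_rate(0.005, 0.5, 10 / 11), 2))  # 0.09
```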

This sounds great. Let’s do it!

However, there is a trade-off between false positives and false negatives: by shifting the boundary between ‘significant’ and ‘non-significant’, we reduce the FPR but increase the false *negative* rate (FNR). In other words: more often than before, we would fail to label a true effect as significant. This is obviously a problem too.

You can compensate for the higher FNR by increasing statistical power. For this, you need to increase the sample size. But it has to go up *by a lot*: you’d need 70% to 88% more participants for your study. That new standard would then eliminate a lot of labs from less-wealthy universities. And this of course comes with further problems (like the association between wealth and WEIRDness).
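To see where figures of this magnitude come from, here is a back-of-the-envelope power calculation (a normal approximation; the effect size d = 0.5 and 80% power are my own illustrative assumptions, not numbers from the paper):

```python
from math import ceil
from scipy.stats import norm

def n_per_group(d, alpha, power=0.80):
    """Normal-approximation sample size per group for a two-sided,
    two-sample test with standardized effect size d."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    return 2 * ((z_alpha + z_power) / d) ** 2

n_old = n_per_group(d=0.5, alpha=0.05)
n_new = n_per_group(d=0.5, alpha=0.005)
print(ceil(n_old), ceil(n_new))          # 63 107 per group
print(round(100 * (n_new / n_old - 1)))  # 70 (% more participants)
```

Other choices of effect size, power, and test push the increase toward the upper end of the 70%–88% range.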

It’s no surprise that Benjamin et al.’s paper received criticism from various sources. Some authors (Amrhein & Greenland, 2017; McShane et al., 2017) suggested stopping after the third step: if you don’t draw conclusions, then you never draw false conclusions. According to them, it’s not the scientist’s task to decide whether the evidence to reject H_{0} is strong enough. We, however, think this advice is impractical: sometimes you just have to make a decision.

This ‘we’ is a team of 84 people, led by Daniël Lakens. In a nice example of open and transparent science, we’ve been working on a reply in a publicly accessible Google Document. At the time of writing, this working document contains 110(!) pages of carefully considered arguments, which we distilled into a reply to Benjamin et al. (2017) of about 18 pages.

In this reply, we outline why holding onto a default α-level—whether it is 5%, 0.5%, or something else that we might agree upon and then accept as a new convention—is not a good idea. Instead, we argue that the choice should always be carefully considered. Indeed, the chosen α-level should be informed by the context.

Consider the following two situations:

- You’re studying for an exam and you want to study efficiently. You want to put in sufficient hours to pass the exam, but no more than that: you’re happy with a 6, and you would rather put the rest of your time into studying for other courses. If you fail the exam, you can do the resit in two weeks.
- Alas, you failed the exam. You’re now studying for the resit and still want to study efficiently (you have other things to do too). But if you fail the exam this time, you have to wait until next year for another attempt. And this study delay costs you another year of tuition fees, as well as the ire of your parents.

In both cases, a false positive would be thinking you had put in enough effort for the course, while still failing the exam. The consequences of that are clearly much bigger in the second example than in the first, so you will decide in advance to put in some extra effort to make sure that the false-positive probability is smaller in case 2 than in case 1.

You should set your α to a stricter level when the stakes are higher.

The same line of reasoning holds in NHST. You can interpret your α-level roughly as the answer to “How bad is it if I accidentally (and incorrectly) claim an effect in this study?” It is not logical to give the same answer (α = 5%) to this question in every situation. You should justify your decision.

When the stakes are high, use a small α. When they are low, use a larger one.

This doesn’t sound like a groundbreaking suggestion. But that’s because it isn’t: mathematical statisticians have been saying this since the birth of NHST.

More than 50 years ago, Sir R. A. Fisher himself said much the same thing: *“no scientific worker has a fixed level of significance at which, from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas”* (see Lakens et al., 2017, p. 14). Because the urge to hold on to default values, rather than to put in the effort to motivate specific choices, is so strong, it is good to remember where the defaults came from.

Our *p*’s are contemporary conventions, and nothing more. We don’t need to ban them. We just need to be a bit smarter about how we use them.

References

Amrhein, V., & Greenland, S. (2017). Remove, rather than redefine, statistical significance. *Nature Human Behaviour*. doi: 10.1038/s41562-017-0224-0

Benjamin, D. J., Berger, J., Johannesson, M., Nosek, B. A., Wagenmakers, E.-J., Berk, R., … Johnson, V. (2017, July 22). Redefine statistical significance. Preprint: doi: 10.17605/OSF.IO/MKy9J. Postprint in *Nature Human Behaviour*.

Henrich, J., Heine, S. J., & Norenzayan, A. (2010). Most people are not WEIRD. *Nature, 466*, 29. doi: 10.1038/466029a

Lakens, D., Adolfi, F. G., Albers, C. J., Anvari, F., Apps, M. A. J., Argamon, S. E., … Zwaan, R. A. (2017, September 18). Justify Your Alpha: A Response to “Redefine Statistical Significance”. doi:10.17605/OSF.IO/9S3Y6

McShane, B. B., Gal, D., Gelman, A., Robert, C., & Tackett, J. L. (2017). Abandon statistical significance. Preprint at https://arxiv.org/abs/1709.07588

In my last post, “To *p* or not to *p*”, I promised to write a follow-up post in which I would tell you all about Bayesian hypothesis testing. Although I can do that (and I will), I recognize that there is something dry about listing a bunch of properties of some statsy technique. You can learn about that during study-time already; that’s not why you’re on Mindwise, right?

As it happens, I have a much more pertinent issue to discuss with you: the idea that statistics provides some kind of ground truth, and is separate from the rest of science in the sense that “we all agree on how it should be done” and “there are no more developments of note”. This is a notion I often encounter among my students: statistics is a tool, a technique we learn to facilitate the real thing, science. Much like plugging a hole in a bike tyre serves no other purpose than to facilitate the actual riding of the bike.

I propose to you, however, that statistics is not like that at all; it is a dynamic and fascinating field. Just words from a stuffy guy working in the stats department? I hope to convince you of the opposite. And I shall actually use Bayesian hypothesis testing to make my case.

As you may recall from “To *p* or not to *p*”, traditional Null Hypothesis Significance Testing – as it is routinely taught in undergraduate courses – suffers from four problems:

- You cannot quantify evidence in favor of the null hypothesis;
- You over-reject the null hypothesis in cases where both the null and the alternative are unlikely (i.e., the data is just unlikely, regardless of the true state of the world);
- *P*-values are hard to interpret;
- *P*-values do not allow for sequential testing.

However, Bayesian statistics do not fall prey to any of these problems. To illustrate the idea behind Bayes factors (again, briefly, this is not a statistics class), consider the following example, adapted from Dan Navarro’s wonderful textbook “Learning statistics with R”.

You wish to know whether or not it’s raining outside. Specifically, you have two hypotheses about the world:

- H0: it is dry
- H1: it is raining

You have one datum (singular of data), which is that you have seen me going out with an umbrella.

How does a Bayesian statistician approach this problem? They combine the *prior* (what we believe about the world before seeing data) with the *likelihood* (what the data tell us we should believe about the world) to create a *posterior* (what we now believe about the world, after having seen the data). Applied to our rain problem, it might go something like this:

- Prior: there is a 45% chance it is raining (and therefore a 55% chance it is dry). You might base this on data from the KNMI (Dutch weather institute), who have catalogued that it typically rains on 45% of the days in May.
- Likelihood: what is the chance of me walking out with an umbrella, *given that it rains*? Let’s say that probability is 80% (I sometimes forget). Secondly, we need to know the chance of me walking out with an umbrella, *given that it is dry*. Let’s say that probability is 10% (time to dust off that scattered-professor stereotype).
- Posterior: we can calculate the probability that it is raining *and* I go out with an umbrella (.45 × .8 = .36), as well as the probability that it is dry *and* I go out with an umbrella (.55 × .1 = .055). These two scenarios encompass all states of the world (we already know that I did in fact go out with an umbrella), so our posterior belief about the hypotheses is:
  - H0 (it is dry): .055/(.055 + .36) = .13
  - H1 (it is raining): .36/(.055 + .36) = .87
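The posterior calculation above is only a few lines of arithmetic:

```python
# Umbrella example: all numbers are taken from the text above
prior_rain = 0.45
p_umbrella_given_rain = 0.80
p_umbrella_given_dry = 0.10

joint_rain = prior_rain * p_umbrella_given_rain        # .45 * .8 = .36
joint_dry = (1 - prior_rain) * p_umbrella_given_dry    # .55 * .1 = .055
evidence = joint_rain + joint_dry                      # P(umbrella)

posterior_rain = joint_rain / evidence
posterior_dry = joint_dry / evidence
print(round(posterior_rain, 2), round(posterior_dry, 2))  # 0.87 0.13
```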

So based only on your observation of my umbrella, coupled with the information we have from the KNMI (as well as the likelihood of my remembering to use it), you conclude there is an 87% chance that it is raining (or that it will rain). In other words, your single observation of my behaviour greatly increases your estimate of the risk of rain (the KNMI data alone suggested only a 45% chance). And there is no *p*-value involved.

If the numbers dazzle you, do not worry. The point of this is to provide you with an intuition of how this works: we combine our prior belief about the world with what we learn from the data to end up with a new (and hopefully more refined) belief about the world.

Bayesian hypothesis testing is not plagued by any of the four problems that make the *p*-value-wielding researcher’s life so difficult: (1) We explicitly quantified evidence in favor of the null hypothesis (13% chance that it is dry); (2) We explicitly specify both our null and our alternative hypothesis, so that we do not bias against the null hypothesis in the face of unlikely data. We saw in the previous post how conventional NHST struggles with really unlikely data (Sally Clark’s two infant children dying), and as a result biases against the null hypothesis; (3) Bayesian posteriors are easy to interpret: the probability that it is dry, given your a-priori information and the data you have observed, is 13%. The *p*-value does not allow one to calculate probabilities that any hypothesis is true; and (4) We can continue collecting new data, calculate our new posterior each time along the way, and stop whenever we want without issues. The *p*-value requires one to specify in advance exactly how many data points one will collect.

Case closed then? Perhaps not. The use of *p*-values is ubiquitous. One context in which they are routinely used is the endorsement of new medications and medical treatments. It should come as no surprise that I believe better tools exist. Last summer, I went on a bit of a mission, visiting Prof. John Ioannidis’ lab at Stanford University. My purpose was to join forces with him and write a paper flagging some of the issues that come with using *p*-values as a yardstick for the efficacy of new medicine.

Why go to Stanford for this? In 2005, Prof. Ioannidis rocked the scientific world with what has since become his most iconic publication: Why most published research findings are false. As the name suggests, the paper shows, using simulations, “that for most study designs and settings, it is more likely for a research claim to be false than true.” Partly as a result, he is presently regarded as one of the world’s leading authorities on proper research methodology. I knew from his previous work that he has similar objections to the use of *p*-values, so an unholy alliance was forged.

During my stay at Stanford, Prof. Ioannidis and I wrote a paper on the US Food and Drug Administration’s (FDA) policy for endorsing new medication. Our paper, which can be found here, provides a simulation-based critique of the errors that can occur when endorsing medications based on a certain policy for combining *p*-value results. In one sentence, we conclude that strict adoption of the FDA’s policy to endorse new medications following two statistically significant trials may lead to a large proportion of ineffective medication on the market.

This conclusion led to a wave of controversy over whether it was really as bad as we claimed. Published only two months ago, the paper has been viewed over 8,000 times and shared on social media over 200 times. It also led to two blog posts that were quite critical of the main message. One of them, available here, was written by Stephen Senn, one of the most respected statisticians in the field of medicine. He questions the value of our conclusions, stating that we use Bayesian statistics to quantify evidence in simulations that are based on a traditional frequentist premise, that we greatly exaggerate the number of non-significant medical trials seen in practice (or left unpublished in a file drawer), and that we use an unrealistic prior distribution for the problem at hand.

Responses to the article (and the blogpost) were quite heated, as can be seen from the comments sections at both websites. And that’s for good reason! FDA policies have an enormous impact on people’s lives. And we showed that their process needs improvement: It was inevitable that this would evoke some emotion!

In my previous blog post, I talked about how uncritical thinking about statistics cost Sally Clark her life. My suggestion here, now, is that this is a widespread and rampant problem: given the influence of the FDA, it’s clear that this is not an individual problem. The FDA policy feeds the idea that statistics is finished; that it’s a tool you can apply unthinkingly. That, ultimately, is the point I wanted to make.

Statistics is not ‘done-and-dusted’. It is a healthy and vibrant area of science that includes a number of cutting-edge topics about which some very smart people intensely disagree. That does not mean that ‘everything you learned is wrong’ or that ‘we may as well not do any statistics at all’. Rather, it means that it is important that we don’t turn off our brains, and that we continue to think about how best to quantify evidence, how best to make a smart generalization from a small sample to the entire population, and how to optimally carve out the hidden gem of information from the raw data so painstakingly obtained. In my opinion, statistics and methodology is – hands-down – the most exciting area in science right now. Consider doing your post-graduate work in this area: you can make a difference that affects all of science!

Take a piece of paper. Draw a few points, and then connect them. Congratulations: you have drawn yourself a network! More formally, networks are simplified representations of how the elements in a system are interconnected. So, in essence, everything that can be understood as being in relation with something else—and represented using dots connected with lines (i.e., nodes or vertices connected by edges, ties, or links)—can be seen as a network.^{1}

Because networks are so broadly defined, it is no surprise that the field of network research covers all disciplines of science. Recently, it has also extended to psychological science, and especially the study of psychopathology. This provides a new way of thinking about mental disorders.

It had long been assumed that the symptoms of depression, such as *sadness* and *suicidal thoughts*, co-occur because they are caused by the same underlying disorder. However, the network perspective suggests a different possibility. Instead of being due to some unobserved common cause (A), the symptoms might co-occur because they are themselves influencing each other directly.^{2} If I am having sleeping problems (B), then I am also more likely to experience *feelings of sadness* (C). This is not because whatever disorder is causing my sleeping problems also makes me sad (A->B, A->C), but because sleeping poorly will itself affect how I feel (B->C).

Consequently, if we want to get a better understanding of psychological disorders, we should refocus on the relation between the symptoms and elucidate the patterns of interactions among symptoms.^{3} For example: using a network approach, my colleagues and I have created a network of depression symptoms.^{4,5}

This figure illustrates how having the symptom *loss of interest* leads to another symptom *loss of pleasure*, which in turn leads to *sadness,* and – as the downward spiral spreads through the network, and the symptoms come to reinforce each other – the eventual result can be a full-blown depression.

This broader structural perspective is useful. But so too is the analytical toolbox that the network approach affords. For example, centrality analyses answer questions about how important a variable is in a network.^{6} A central symptom could be especially interesting for clinicians, as it may give an indication as to which symptom should be intervened upon in order to disrupt a dysfunctional symptom network. Indeed, because the network approach reveals the dynamics of symptom interactions—how the pieces fit together, and reinforce each other—it affords a whole new perspective of psychopathology.^{6}
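As a toy illustration (the symptoms and edges below are made up for the example, not estimated from data), the simplest centrality measure, degree centrality, just counts each node’s direct connections:

```python
# Hypothetical symptom network; the edges are illustrative only
edges = [
    ("loss of interest", "loss of pleasure"),
    ("loss of pleasure", "sadness"),
    ("sadness", "sleep problems"),
    ("sleep problems", "fatigue"),
    ("sadness", "fatigue"),
]

# Degree centrality: how many direct connections each symptom has
degree = {}
for a, b in edges:
    degree[a] = degree.get(a, 0) + 1
    degree[b] = degree.get(b, 0) + 1

most_central = max(degree, key=degree.get)
print(most_central, degree[most_central])  # sadness 3
```

In this toy network, *sadness* would be the prime candidate for intervention. Real analyses use weighted, estimated edges and more refined centrality measures.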

Of course, a big tree attracts the woodsman’s axe. As the popularity of networks has increased, so too has the criticism. Critics point out, for example, that psychopathological networks represent a fundamental overreach: this application, they argue, represents a generalization of methods developed for social networks, where network research partly has its origins. And it’s not clear that the move from one type of network to another is valid.

In social networks, the variables are people. These are really distinct entities, and the relations between them can often be directly observed. For example, co-authoring a paper together provides unambiguous empirical evidence of a connection between two authors. In contrast, psychopathology has to deal with fuzzy variables. So difficult questions arise, such as: How distinct are symptoms such as *loss of pleasure* and *loss of interest* really? And if we are unsure whether our symptoms are really different things, does it make sense to separate them and draw lines between them?

Furthermore, the lines that connect variables in a psychopathological network are not given, but have to be inferred from some kind of a dynamic model. And the estimation and interpretation of such models is itself still a topic of debate.^{7} Thus, critics argue that researchers are heaping problem upon problem when using psychopathological networks. The result is pretty pictures, but these illustrate relations that we don’t really know how to interpret.

So should we throw in the towel and quit doing network research in psychopathology? Well, no. It is important to make a distinction between statistical and conceptual issues. From a statistical point of view, there are still many hurdles to cross. But conceptually, the network idea seems very plausible; it opens up a whole new way of thinking about psychopathology, and enables us to ask new questions. Thus, at least in the latter sense, networks in psychopathology have already provided much more than pretty pictures.

References

1. Bringmann, L. F. (2016). Dynamical networks in psychology: More than a pretty picture? (Doctoral dissertation). doi: 10.13140/RG.2.2.28223.10404
2. Borsboom, D., & Cramer, A. O. J. (2013). Network analysis: An integrative approach to the structure of psychopathology. Annual Review of Clinical Psychology, 9, 91-121.
3. Fried, E. I., & Nesse, R. M. (2015). Depression sum-scores don’t add up: Why analyzing specific depression symptoms is essential. BMC Medicine, 13, 1-11.
4. Bringmann, L. F., Lemmens, L. H. J. M., Huibers, M. J. H., Borsboom, D., & Tuerlinckx, F. (2015). Revealing the dynamic network structure of the Beck Depression Inventory-II. Psychological Medicine, 45, 747-757.
5. Epskamp, S., Cramer, A. O. J., Waldorp, L. J., Schmittmann, V. D., & Borsboom, D. (2012). qgraph: Network visualizations of relationships in psychometric data. Journal of Statistical Software, 48, 1-18.
6. Borgatti, S., Everett, M., & Johnson, J. (2013). Analyzing social networks. Los Angeles: Sage Publications.
7. Bulteel, K., Tuerlinckx, F., Brose, A., & Ceulemans, E. (2016). Using raw VAR regression coefficients to build networks can be misleading. Multivariate Behavioral Research, 51, 330-344.

In Psychology, inferential statistics are predominantly conducted by means of the Null Hypothesis Significance Test (NHST). In NHST, statistical evidence is often communicated with the so-called *p*-value. *P*-values indicate the probability of obtaining a data pattern at least as extreme as the one that was observed, *given that the null hypothesis is true*.

Let us say, for example, that we are interested in relieving the symptoms of depression. We experimentally compare the effects of a new medication to the effects of a placebo on relieving these symptoms. We find that people in the medicated group have fewer symptoms than people in the placebo group. The between-group difference in this sample is associated with a *p*-value of .12.

This means that *if* the new medication is *just as* effective as the placebo (not better), then the probability of observing the difference between the new medication and the placebo—or a difference even more extreme—is 12%. By convention, this is taken to be insufficient evidence against the null hypothesis. And thus, by convention, we “fail to reject” the null hypothesis: we do not find evidence to reject the notion that the new medication is just as effective as the placebo.
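The definition can be made concrete with a simulation: build the world in which the null hypothesis is true, and count how often a difference at least as extreme as the observed one occurs. (The group size and observed difference below are hypothetical, chosen by me so the answer lands near the .12 from the example.)

```python
import numpy as np

rng = np.random.default_rng(7)
n_per_group, observed_diff = 30, 0.40   # hypothetical study result

# Under H0 both groups share one distribution; simulate that world many
# times and see how often a mean difference at least this extreme occurs.
diffs = np.array([
    rng.normal(0, 1, n_per_group).mean() - rng.normal(0, 1, n_per_group).mean()
    for _ in range(20_000)
])
p_value = np.mean(np.abs(diffs) >= observed_diff)
print(p_value)  # roughly .12 for these settings
```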

A low *p*-value, typically below .05, is considered “statistically significant”. Such a finding can then be interpreted as evidence against the null hypothesis: the difference is large enough that it is very unlikely to have been produced by chance alone. Unfortunately, *p*-values are plagued by a series of problems (e.g., Wagenmakers, 2007; van Ravenzwaaij & Ioannidis, 2017). Below, I list what I consider to be the four most pertinent.

Using *p*-values, researchers are not able to quantify evidence in *favor *of the null hypothesis. This is because a non-significant *p*-value (by convention, any *p*>.05) can be the result either of evidence in favor of the null hypothesis, or the result of a lack of statistical power (that is, if we had collected more data, the results of our inference *would* have been statistically significant).

Clinically, it is important to be able to quantify evidence in favor of the null hypothesis: this treatment is *good* for that problem. But there is an equally important, albeit different, interest in research. To wit: the same experiment might be carried out by twenty different labs, with one “lucky” lab concluding—by chance—that there actually is an effect. Relying solely on *p*-values then allows a random accident to be treated as true knowledge, with potentially harmful consequences.

*P*-values lead to over-rejecting the null hypothesis. The underlying statistical problem is that evidence *is a relative concept*, and only considering the probability of the data under the null hypothesis leads to biases in decision making. When this null is the presumption of innocence, people go to jail who should not.

Consider, for instance, the case of Sally Clark, a British solicitor whose two sons died, in separate incidents, in their infancy. She was prosecuted for and initially convicted of their murder. The argument for her guilt was statistical: the likelihood of two infants in a row dying of Sudden Infant Death Syndrome was calculated to be extremely low (about 1 in 73 million, or *p* < .001). So the null hypothesis was rejected with great confidence. Should it have been?

The prosecution’s statistical expert had not taken into account the probability of the data under an alternative hypothesis: a mother is also very unlikely to murder her two infant children. Subsequent calculations showed this second probability to be even smaller than the former (by a factor of 4.5 to 9; see Hill, 2004 for details).
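Hill’s comparison boils down to a likelihood ratio: a tiny probability under one hypothesis means nothing until it is compared with the probability under the other. Using the numbers from the text (and taking 4.5 as the lower end of Hill’s range):

```python
p_two_sids = 1 / 73_000_000        # prosecution's figure: P(evidence | SIDS)
p_two_murders = p_two_sids / 4.5   # Hill (2004): double murder rarer still

# Both probabilities are minuscule, yet the ratio favours SIDS over murder
likelihood_ratio = p_two_sids / p_two_murders
print(round(likelihood_ratio, 1))  # 4.5
```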

Clark’s original conviction was overturned, and the expert disgraced, but only after she had already spent four years in prison. She then later died of alcohol poisoning. In reporting on her death, *The Guardian* quoted a statement from her family: “Having suffered what was acknowledged by the court of appeal to be one of the worst miscarriages of justice in recent years… she was never able to return to being the happy, kind and generous person we all knew and loved.”

In other words, over-reliance on the improbability of one piece of evidence is not merely a problem for researchers. It has real-world implications.

*P*-values produce results that are not intuitive to interpret. Researchers generally want to use the data to infer something about their hypotheses, such as: what evidence do the data provide for the null hypothesis versus the alternative hypothesis? The *p*-value cannot answer questions like this. It can only give an abstract number that quantifies the probability of obtaining a data pattern “at least as extreme” as the one observed *if* the null hypothesis were true. This definition proves to be so cryptic that most researchers in the social sciences interpret *p*-values incorrectly (e.g., Gigerenzer, 2004; Hoekstra et al., 2014).

*P*-values do not allow for optional stopping, based on examining the preliminary evidence. This means that a *p*-value can only be properly interpreted when the sample size for testing was determined beforehand and the statistical inference was carried out on the data of that exact sample size. In practice, additional participants are often tested when “the *p*-value approaches significance”, after which the *p*-value is calculated again.

In clinical trials, this problem takes the form of interim analyses with the potential of early stopping at different points (Mueller, Montori, Bassler, Koenig, & Guyatt, 2007). Alternatively, sometimes testing is discontinued when “an intermediate analysis fails to show a trend in the right direction”. These practices produce a bias against the null hypothesis: if researchers retest often enough, they are guaranteed to obtain a statistically significant result even if in reality the null hypothesis is true!
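A small simulation illustrates this bias (the ‘peeking’ schedule below is my own illustrative choice): even though the null hypothesis is true throughout, repeatedly testing as the data accumulate inflates the rejection rate well past the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_sims, looks, rejections = 5_000, (20, 40, 60, 80, 100), 0

for _ in range(n_sims):
    data = rng.normal(0, 1, 100)       # H0 is true: the mean really is 0
    # Peek after every 20 observations; stop at the first p < .05
    if any(stats.ttest_1samp(data[:n], 0).pvalue < 0.05 for n in looks):
        rejections += 1

print(rejections / n_sims)  # well above the nominal .05
```

Add enough looks and the rejection rate creeps toward 1, even with no effect at all.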

So if *p*-values are so riddled with problems, why is it that we get taught about *p*-values from our first year statistics courses on?

- Existing textbooks on statistics for the social sciences explain the state of the art in statistics from two or three decades ago. The reason for this is simple: textbooks are written by relatively seasoned researchers who did not have the privilege of learning about currently state-of-the-art statistical techniques in their own undergraduate degrees. As a result, statistical textbooks are a little “behind the times”.
- Because of the unrepresentative textbook issue, it is difficult to get exposed to different (and better) ways of conducting statistical inference. I myself only learned of these techniques as a PhD-student: there was no room for it in my undergraduate curriculum.
- Finally, the best alternative (Bayesian hypothesis testing, which was used to get Sally Clark acquitted) requires computational power that has become available on our computers only relatively recently. As a result, better alternatives may have existed in the past, but they were never really feasible!

So what is this Bayesian hypothesis testing, and how does it work? Bayesian hypothesis testing quantifies the evidence for two competing hypotheses relative to one another by means of a *Bayes Factor* (e.g., Kass & Raftery, 1995). The Bayes Factor provides an attractive answer to each of the four problems I listed above. In a follow-up post, I shall tell you all about it!
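As a tiny preview (my own toy example, not taken from the follow-up post): suppose a coin is flipped n times. H0 says the coin is fair; H1 puts a uniform prior on its bias. The Bayes Factor is then the ratio of how well each hypothesis predicted the observed data, and in this simple case it has a closed form.

```python
from math import comb

def bayes_factor_01(k, n):
    """BF01 for H0: theta = .5 versus H1: theta ~ Uniform(0, 1),
    after observing k heads in n coin flips."""
    likelihood_h0 = comb(n, k) * 0.5 ** n
    marginal_h1 = 1 / (n + 1)   # the binomial integrated over theta
    return likelihood_h0 / marginal_h1

print(round(bayes_factor_01(k=14, n=20), 2))  # 0.78: mild evidence for H1
```

A BF01 below 1 means the data favour H1; here 14 heads in 20 flips shift belief only slightly away from a fair coin.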

Gigerenzer, G. (2004). Mindless statistics. *The Journal of Socio-Economics, 33*, 587-606.

Hill, R. (2004). Multiple sudden infant deaths – coincidence or beyond coincidence? *Paediatric and Perinatal Epidemiology, 18, *320-326.

Hoekstra, R., Morey, R. D., Rouder, J. N., & Wagenmakers, E.-J. (2014). Robust misinterpretation of confidence intervals. *Psychonomic Bulletin & Review, 21*, 1157-1164.

Kass, R. E., & Raftery, A. E. (1995). Bayes factors. *Journal of the American Statistical Association, 90*, 773-795.

van Ravenzwaaij, D., & Ioannidis, J. P. A. (2017). A simulation study of the strength of evidence in the endorsement of medications based on two trials with statistically significant results. *PLoS ONE, 12*, e0173184.

Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of *p* values. *Psychonomic Bulletin & Review, 14*, 779-804.


Last month, Marieke Timmerman gave her Inaugural Lecture at the University’s Academy building, laying out her academic vision and presenting an overview of her research work. We asked her to write an article inspired by her lecture, so we can share it with Mindwise readers. – ed.

Psychology seems to be in crisis. Alarming messages are spread about the lack of reproducibility of findings reported in the literature. The media spread the sobering results of the ambitious *Reproducibility Project*, in which researchers tried to replicate the findings of 100 psychology studies that were published in prominent psychology journals. Only a meagre 39 studies could actually be replicated (Open Science Collaboration, 2015).

The findings are shocking, as they give rise to doubts about the foundations of psychological theories. They sparked, and continue to spark, intense debates in the Psychology community. Various factors contributing to these problems have been identified, ranging from the incentive system to improper use of methodology and statistics. To remedy the latter, methodologists argued strongly in favour of defining strict hypotheses before data collection, deciding in advance which statistical tests to apply, and deciding what to conclude on the basis of the associated results (Wagenmakers, Wetzels, Borsboom, van der Maas, & Kievit, 2012). This strategy applies perfectly to purely confirmatory research, but it falls short as soon as the research becomes more exploratory in nature.

Deliberately, I choose the term *more exploratory*, rather than adhering to the common distinction between strictly confirmatory and exploratory types of research (De Groot, 2014). All empirical research in psychology builds upon earlier observations, notions, ideas, theories, and empirical results. This implies that background knowledge is used in defining the objectives of the study and building the expectations about the results. The latter may be strong, resulting in strict hypotheses, or weaker, and very often one finds combinations of stronger and weaker expectations in the same study. In my view, these combinations are vital for deepening our understanding and expanding our knowledge. Confirmatory research keeps you on the safe side (a low risk of false findings), but it also prevents you from gaining exciting new insights.

Now, the key issue is what a proper analysis strategy is in the absence of strict hypotheses. For sure, one needs to stay away from hypothesising after the results are known (so-called *HARKing*; Kerr, 1998), which boils down to just trying out various statistical analyses to trace "the most interesting and promising aspects of the data" (Sijtsma, 2015). Such an approach is dangerous, as one runs a serious risk of coming up with incorrect and non-replicable results.

But what approach is to be advised for more exploratory research? It is of key importance to write a solid research plan, including the objectives, research questions, design, and analysis. In this respect the strategy is similar to what is required in a confirmatory study. What differs is that the proposed analysis will be more exploratory in nature. Often, the analysis will include various steps to arrive at an interpretable statistical model of the data. Here, it is essential that the proposed analysis matches the objectives, the type of data to be collected, and all available knowledge—which is precisely why it is so important to have a good overview of the available knowledge on the topic at hand. And, of course, to achieve a proper match one needs a thorough understanding of the statistical model itself.

An example illustrates the issue. It is known that prematurity at birth is a risk factor for lower levels of functioning in childhood. Of course, one could perform a confirmatory study, administering test X, which is indicative of a particular aspect of functioning, among random samples of children born preterm and at term, all at the age of Y. The null hypothesis would then be that "average scores on test X are equal among children born preterm and at term at the age of Y", which can be tested with some suitable form of t-test. There is nothing really wrong with testing this hypothesis, but it fails to provide really interesting insights, such as whether all preterm children are affected similarly, whether all aspects of functioning are equally affected, or whether there are protective factors. Using exploratory statistical modeling, it was found that a large proportion of children born moderately preterm actually functioned within the normal range, with only a minority below it, while boys in particular appeared to be at risk (Cserjesi et al., 2012). More exploratory approaches are very often well worth the effort, as one can achieve insights that would otherwise remain hidden.
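As an aside, the confirmatory test sketched above is easy to run in practice. The following is a minimal illustration with entirely simulated (hypothetical) scores, using Welch's t-test so that equal variances need not be assumed; the group means, sample sizes, and spread are made up for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical test-X scores for two groups of children at age Y.
preterm = rng.normal(loc=95, scale=15, size=80)
at_term = rng.normal(loc=100, scale=15, size=80)

# Welch's t-test: compares group means without assuming equal variances.
t_stat, p_value = stats.ttest_ind(preterm, at_term, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

Note that this test only addresses the single hypothesis of equal means; it says nothing about the more interesting questions above, which is exactly the author's point.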

Cserjesi, R., Van Braeckel, K. N. J. A., Timmerman, M. E., Butcher, P. R., Kerstjens, J. M., Reijneveld, S. A., Bouma, A., Bos, A. F., & Geuze, R. H. (2012). Patterns of functioning and predictive factors in children born moderately preterm or at term. *Developmental Medicine & Child Neurology, 54*(8), 710-715. doi:10.1111/j.1469-8749.2012.04328.x

De Groot, A. (2014). The meaning of "significance" for different types of research [translated and annotated by E.-J. Wagenmakers, D. Borsboom, J. Verhagen, R. Kievit, M. Bakker, A. Cramer, D. Matzke, G. J. Mellenbergh, and H. L. J. van der Maas]. *Acta Psychologica, 148*, 188-194.

Kerr, N. L. (1998). HARKing: Hypothesizing after the results are known. *Personality and Social Psychology Review, 2*(3), 196-217. doi:10.1207/s15327957pspr0203_4

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. *Science, 349*(6251), aac4716. doi:10.1126/science.aac4716

Sijtsma, K. (2015). Playing with data—Or how to discourage questionable research practices and stimulate researchers to do things right. *Psychometrika*, 1-15.

Wagenmakers, E. J., Wetzels, R., Borsboom, D., van der Maas, H. L., & Kievit, R. A. (2012). An agenda for purely confirmatory research. *Perspectives on Psychological Science, 7*(6), 632-638. doi:10.1177/1745691612463078

How can we determine whether someone will be successful as a student, a manager, a policeman, a musician, an artist, a football player, and so on? This is a highly relevant question for psychology researchers, as well as for human resources professionals in many organizations. To answer it, most lay persons would prefer to meet that someone, talk to the person, size him or her up, and figure out what that person is like. In general, most people have great confidence in their ability to judge others' future performance after such an 'assessment'. However, most people, even experts, are not very good at making correct judgments based on such unstandardized information [1].

Answering the question "How can we determine whether someone will be successful?" from a scientific point of view is a central topic in my current research project, as well as in the MSc program Talent Development and Creativity. As psychologists we tend to approach this question by (roughly) following three steps. First, we try to find a theory that could help explain successful performance. Then, we identify specific traits and characteristics that are theoretically related to performance, such as cognitive ability, conscientiousness, or creativity. Next, we search for reliable instruments to measure these traits, and we statistically determine whether the trait measurements indeed predict successful performance. If this is the case, then someone with the "right" level of certain traits is likely to become successful.

The approach mentioned above is known as the 'signs approach' to predicting future behavior, and it is the dominant approach for predicting performance in psychological research and practice. It seems a sound approach that is preferable to (expert) judgment, and it usually is [2]. However, should we always go through all of the steps mentioned above? I argue that, in order to know whether someone is good at something, there is a simpler scientific approach that can be just as effective, if not more so.

If you want to know if someone will be successful, just find out if the person is good at the task he or she will need to perform.

In a nutshell, if you want to know if someone will be successful, just find out if the person is good at the task he or she will need to perform. This is known as the samples approach [3] to predicting performance, and it is based on the notion that past or current behavior is an excellent, and perhaps even the best, predictor of future performance. According to this approach, we should take samples of relevant behavior, rather than distinguish presumably relevant traits. It is a simple idea and often very effective, as I will illustrate below.

For example, in the American National Football League (NFL), the annual selection round among college-level players is a sign-based carousel including physical ability tests such as jumping, sprinting, and strength, an intelligence test, and an interview. However, research has shown that performance on most of these tests does not predict performance in the NFL very well. If you want to predict professional football performance, the most effective way is to just look at past football performance [4].

Similarly, if you want to know if someone will be good at a certain job, work sample tests that simulate the tasks of the job as realistically as possible are the best predictors of future job performance [5]. In our own research we have adopted this approach to predict academic performance, using trial-studying tests that mimic a representative course in the study program. For admission to the psychology program, for example, future students have to study two chapters from the book used in *Introduction to psychology*, view an online lecture, and take exams on this material. This is very similar to what they have to do as actual students, and that is exactly why this approach to predicting academic performance is effective [6]! Indeed, the score on this test was a good predictor of academic performance in the first year. Additional advantages of sample-based tests are that they are often rated as highly face-valid, and that they offer the possibility of self-selection. That is to say, these tests also provide insights for the assessees (e.g., do I find this interesting? can I handle the level?), and not just for the assessors [7]. As a student stated after taking the trial-studying test: "It helped me indicate what studying at the University of Groningen would be like and I really liked that."

In sum, to know if someone will be good at something, we do not always need to distinguish traits and characteristics. Indeed, one could say that when we take samples of relevant behavior, all relevant traits are being measured together (be it implicitly). For instance, a trial-studying test, like the one we used in our own research, may measure both ability (does the student have the necessary skills and capacities) and motivation (does the student take the effort to prepare well). As a result, the first two steps that psychologists usually take, finding a theory and specifying traits, are not necessary to make good predictions. By designing assessments that are comparable to the tasks that performers will actually have to carry out, we may measure all relevant traits at once.

So, when it comes to predicting future performance, perhaps we should not spend all our time untangling potentially relevant underlying traits, but just assess the actual relevant behavior.

NOTE: Image by Travis Wise, licensed under CC BY 2.0

Let’s say that you are helping your niece with the French vocabulary that she has to learn for school. She has an important test in two days and has already learned quite a lot of words. You ask her the translation of “fromage” and she, without a bit of doubt, says “cheese”, then you ask her “maison” and after some thinking, she says that she doesn’t know the answer, which you provide to her. Clearly, your niece still needs to study the meaning of “maison”. Intuitively, you will probably wait for a longer period of time before asking her again about the translation of “fromage” and will ask her quickly for the translation of “maison”.

“The system, in all its simplicity, works remarkably well”

If one looks at online learning systems that help with learning factual materials, they typically use this type of data — correct or incorrect — to decide which word to present you on the next trial. If a word was answered correctly, it will be put at the back of the stack, only to return once you have seen all other words; but if a word was answered incorrectly, the system will show you the correct translation and revisit the word after a couple of other words have been presented. This system, in all its simplicity, works remarkably well. The reason why it works so well is that it adheres to the principles of the *testing* and *spacing* *effects*. The *spacing effect* is the finding that the more time there is between repeated presentations of an item, the better that item will be stored in memory. The *testing effect*, on the other hand, reflects the fact that one of the best ways to learn factual material is to test yourself to see if you actually know that item. Related to this is the finding that learning is optimal if, during testing, you still know the item. If we return to the “fromage”/”maison” example, it’s clear that you are helping your niece to learn using the testing method, and that, by waiting longer before revisiting “maison” than “fromage”, you are increasing the chances of her still knowing both items at the next presentation, while also adhering to the spacing principle.
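The correct/incorrect scheduling just described fits in a few lines of code. This is a sketch only: the gap of three intervening items after an incorrect answer is an assumption for illustration, not a value taken from any particular learning system.

```python
from collections import deque

def next_position(correct: bool, queue_len: int) -> int:
    """Correct answers go to the back of the stack; incorrect answers
    return after a couple of other words (here: 3, an assumed gap)."""
    return queue_len if correct else min(3, queue_len)

# Toy drill over a small stack; the niece's answers are hard-coded.
stack = deque(["fromage", "maison", "saucisse"])
answers = {"fromage": True, "maison": False, "saucisse": True}

word = stack.popleft()                       # present "fromage"
pos = next_position(answers[word], len(stack))
stack.insert(pos, word)                      # correct -> back of the stack
```

After this single trial the stack is ["maison", "saucisse", "fromage"]: the known word has moved behind the words still to be drilled.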

Although these computerized learning systems can help in learning, you are probably a better tutor for your niece than most of them. For example, let’s assume that you now ask for the translation of “saucisse” and your niece takes a long time, but eventually comes up with the right answer. You immediately pick up on that, realize that “saucisse” is encoded, but not that well, and will repeat this item quite soon. Most computerized learning systems will simply encode this item as “correct”, and put “saucisse” at the end of the stack, just like “fromage”.

“it is difficult to determine how to translate thinking time into how well the learner knows an item“

The reason why most learning systems use only correct and incorrect answers is that it is difficult to determine how to translate “thinking time” into “how well the learner knows an item”, a translation which needs to be made so that the system can determine how long it can wait before presenting the item again. A couple of years ago, together with Master students from Experimental Psychology and Artificial Intelligence, my group started exploring whether we could use modern memory theories to help make this translation. The idea is pretty straightforward: any time someone learns something new, that item is stored in memory and initially has a very high “activity” (a word used to describe how easily available something is in memory), but after the very first learning trial, that activity will quickly decrease until the item cannot be retrieved anymore. For every subsequent presentation, there will be another temporary increase in activation, but each time the decrease will be slower. Although the activation itself cannot be measured, it can be translated into how long it will take the simulated memory system to retrieve that item. Therefore, we can measure the time it takes your niece to answer an item and figure out how “active” that item is. If we know how active an item is, and how long ago we presented that item previously, we can calculate how quickly your niece forgets the item. By calculating this person-specific forgetting rate, our system, called “SlimStampen”, can present personalized learning to anyone aiming for optimal fact-learning. Obviously, the proof of the pudding is in the eating, and we have tested this system in the lab, in an experiment during which university students had to learn factual information that sometimes was relevant to them (materials for a course) and sometimes was not (such as Swahili-English translations), but also in school settings. In all these settings, SlimStampen outperformed the correct/incorrect based systems, sometimes improving the eventual grade by a whole grade point (i.e., an 8 instead of a 7, roughly corresponding to an 80% instead of a 70% correct score).
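The activation idea above can be sketched concretely. The following is a minimal, ACT-R-style illustration of the general mechanism, not the actual SlimStampen equations: each past presentation leaves a trace that decays as a power function of time, activation is the log of the summed traces, and a person-specific decay rate makes activation (and hence predicted retrieval time) fall faster for fast forgetters. All numbers are made up.

```python
import math

def activation(ages: list[float], decay: float) -> float:
    """Activation of an item: log of summed power-decayed traces,
    one trace per past presentation. `ages` holds the seconds elapsed
    since each presentation; `decay` is the person-specific rate."""
    return math.log(sum(age ** -decay for age in ages))

def expected_latency(act: float, scale: float = 1.0) -> float:
    """Assumed mapping from activation to retrieval time:
    higher activation means faster retrieval."""
    return scale * math.exp(-act)

# The same item, seen 10 and 100 seconds ago, for two hypothetical learners:
slow_forgetter = activation([10.0, 100.0], decay=0.3)
fast_forgetter = activation([10.0, 100.0], decay=0.6)

# Faster decay -> lower activation -> slower (or failed) retrieval,
# so the system should repeat the item sooner for the fast forgetter.
assert slow_forgetter > fast_forgetter
```

In the real system this logic runs in reverse as well: the measured answer latency is used to infer the current activation, which in turn pins down the learner's decay rate for that item.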

Obviously, this is of interest to many different types of learners, and as a start, we have teamed up with one of the largest Dutch publishers of secondary-school educational materials, Noordhoff Publishers. Starting this school year, all students using the online learning systems of Noordhoff Publishers can use the SlimStampen method. Admittedly, this won’t magically make learning fun, but because of its effectiveness, your niece will have to study for a shorter period of time, so that the two of you can do more interesting things instead.

NOTE: Image from the U.S. National Archives

In 1892 Gerard Heymans founded the Psychological Institute in Groningen and, with that, empirical psychology in the Netherlands. By conducting experiments in his laboratory, he gained valuable insights into a wide range of psychological problems. Over a century later, we teach our students essentially the same approach to empirical research: develop a test or a questionnaire, randomly assign your “random sample” (read: fellow students) to treatment groups, let them take the test or complete the questionnaire, and perform adequate statistical analyses. Sometimes a follow-up measurement several months later is performed to study the longer-term effects of treatment.

All this is extremely useful in finding *inter-individual* patterns: differences between (groups of) persons. However, these methods are not helpful when you are interested in *intra-individual* patterns: differences (over time) within a single person.

Why would you want to study intra-individual patterns? Suppose you are interested in (long-term patterns in) Positive Affect (PA) and study two persons, Red and Blue. You measure their PA scores on day 1, a few days later, and 1, 2, and 3 months later. The first plot below, based on virtual data, shows that their PA scores at these respective time points (indicated by the dots) are very similar: in your sample you did not find evidence that Red and Blue behave differently with respect to PA. Further, the measured PA scores are fairly stable; there are no steep increases or decreases in scores.

However, suppose you didn’t measure Red and Blue just five times, but daily for a 100-day period. Now it is clear, from the second plot, that Red and Blue are actually quite different. For (nearly) every day, Red’s PA score is quite similar to that of the day before, whereas for Blue, a positive day is usually followed by a negative day and vice versa. The extent to which two subsequent days are similar is called inertia. It is known that inertia in PA is related to a wide range of psychological traits, such as depression, neuroticism, and rumination. Thus, based on the inertia differences between Red and Blue, psychologists might infer something about their personalities.
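Inertia as described here can be quantified as the lag-1 autocorrelation of the daily scores: how strongly today's (mean-centered) score correlates with yesterday's. A sketch with simulated Red-like and Blue-like series (the autoregressive coefficients of ±0.8 are invented for illustration):

```python
import numpy as np

def inertia(scores) -> float:
    """Lag-1 autocorrelation: how similar each day is to the day before."""
    x = np.asarray(scores, dtype=float)
    x = x - x.mean()
    return float(np.sum(x[1:] * x[:-1]) / np.sum(x * x))

rng = np.random.default_rng(1)

# Red: each day close to the previous one (positive inertia).
red = [0.0]
for _ in range(99):
    red.append(0.8 * red[-1] + rng.normal())

# Blue: a positive day usually followed by a negative one (negative inertia).
blue = [0.0]
for _ in range(99):
    blue.append(-0.8 * blue[-1] + rng.normal())

print(f"inertia(Red)  = {inertia(red):.2f}")    # clearly positive
print(f"inertia(Blue) = {inertia(blue):.2f}")   # clearly negative
```

With only five widely spaced measurements, both estimates would be hopelessly noisy; 100 daily measurements make the difference between the two persons obvious.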

Static psychological experiments are useful for understanding between-person differences in psychological *outcomes*. Measurement-intensive longitudinal studies such as the one above are essential for understanding within-person psychological *processes*. Up to a decade or two ago, it was very difficult to conduct such studies: you can’t expect your study participants to go to the basement of the Heymans building 100 days in a row to complete a questionnaire. Thanks to advances in computing and Internet technology, however, nowadays you can measure variables highly intensively with relatively little effort: answering a short online questionnaire is easy, and applying *smart apps* to automatically measure how much people walk, sleep, or consume electricity is even easier.

When collecting this non-conventional type of data, you also need a non-conventional method for analysing it. The Bayesian Dynamic Linear Model (DLM) is extremely suitable here. This model can be used both to accurately estimate parameters of longitudinal data and to accurately forecast the value(s) of the next measurement(s). The DLM gained popularity after Mike West and Jeff Harrison published a book on it in 1989, but it was mainly applied in economics and biology. Applying the DLM in psychology has been rare until now.

The above example about Red and Blue is obviously an oversimplification of the type of data the modern psychologist might consider. More realistic examples would include some of the following ingredients: multiple dependent variables (e.g., both Positive and Negative Affect); multiple predictors (age, gender, personality scores); latent variables (i.e., variables that cannot be observed directly); many more than two persons in a possibly hierarchical setting (such as a multilevel model); strange patterns of missing data (due to non-response, drop-out, faulty *apps*, etc.); sudden changes in measurement due to therapeutic intervention; and so on. In the past decades, there have been many additions to the theory of the DLM that accommodate its use in these types of situations. The DLM is comparable to a box of LEGO bricks: once you know how it works, you can build whatever you like.

Thanks to two grants from NWO, our research group is now extending the DLM for application into psychological practice, with promising results so far.
