The blessing and the curse of classifying neuroimaging data
Machine learning in cognitive neuroscience
In modern cognitive neuroscience, it has become common practice to apply machine learning techniques to data obtained through neuroimaging. Despite this widespread use, however, there is something amazingly enigmatic about it. On the one hand, there is this organ that for millennia has eluded scholars: billions of neurons connected in myriads of ways too complex to comprehend, firing in intricate patterns to communicate and give rise to thought. This firing is then in some capacity picked up by machines that are as advanced as they are noisy. Methods such as the electroencephalogram (EEG) and functional magnetic resonance imaging (fMRI) try their best at capturing neural firing. But the former is too far away to reliably say where activity is coming from, and the latter is too sluggish to say much about when (or in what order) neural processes were happening.
And yet, it seems that researchers nowadays just throw these patterns of activity in some mystical machine learning algorithm, and infer what a participant has been thinking about.
But what are we really learning from classifying neural data?
Classical neuroimaging research
Classically, cognitive psychology and cognitive neuroscience have followed more or less a similar logic in how they conduct experiments and treat their data. The cognitive psychologist will hypothesize that, say, the mind has more difficulty processing images of green apples than of red apples. They will devise an experiment to test this, and using reaction times and accuracy they will infer whether the hypothesis is correct.
Next, the cognitive neuroscientist steps in, and repeats the same experiment in participants who are wearing an EEG cap. By averaging across different trials with red or green apples, the scientist may observe that certain ERP components are consistently weaker for green than for red apples. As such, this difference in neural activity is very likely to be involved in the difference in performance – a result worthy of a scientific publication!
It is worth stressing that this conventional method hinges on two criteria. That is, the differences in neural activity between green and red apples should be somewhat consistent across trials, but also across participants: if in different participants, the neural activity between green and red apples manifests in completely different ways, then there would probably not be an observable, statistically reliable difference in the grand average. The conventional method, and in fact most statistical tests of neuroimaging data, assumes that given different conditions, we will observe systematic differences in the data that are somewhat comparable across different participants.
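This two-level consistency requirement can be made concrete with a toy simulation. The sketch below is purely illustrative (the component shape, effect size, and all numbers are my own assumptions, not from any study): each participant's condition averages are computed across trials, and those averages are then combined into a grand average. The effect survives only because it points in the same direction for every simulated participant.

```python
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_trials, n_times = 20, 100, 50

def simulate_subject(effect=0.5):
    """Simulate one participant's per-condition trial averages.

    The green-apple condition evokes a weaker version of the SAME
    component shape in every participant (the key assumption of the
    conventional method).
    """
    component = np.exp(-((np.arange(n_times) - 25) ** 2) / 50)  # shared ERP-like bump
    red = component + rng.normal(0, 1.0, (n_trials, n_times))
    green = component * (1 - effect) + rng.normal(0, 1.0, (n_trials, n_times))
    return red.mean(axis=0), green.mean(axis=0)  # average across trials first

averages = [simulate_subject() for _ in range(n_subjects)]
grand_red = np.mean([red for red, green in averages], axis=0)
grand_green = np.mean([green for red, green in averages], axis=0)

# The difference is only visible in the grand average because it is
# consistent across trials AND across participants.
peak_diff = (grand_red - grand_green).max()
print(f"peak grand-average difference: {peak_diff:.2f}")
```

If the effect instead pointed in a different direction for each participant, the subject-level averages would cancel out at the second averaging step – which is exactly the scenario discussed below.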
Machine learning: the great switch-up
One of the greatest tricks employed by most machine learning methods is that they flip the logic of the conventional method on its head. Instead of using the conditions (red or green apples) to say something about the data (a high or a low amplitude measured at the scalp), classifiers do the opposite: they take the neural data, and try to determine which condition was presented. If they can do so successfully, the classifier is able to tell from a pattern of neural data whether a participant saw a red or a green apple. In more popular jargon: the algorithm has read your mind.
A core reason for any new technique to become popular in a branch of science is that it yields results. Indeed, classifiers have allowed researchers to point out differences between conditions that had previously proven difficult to find. As it turns out, merely looking at averages and average differences is sometimes too short-sighted. Classifiers allowed researchers to identify whether the brain treats red and green apples differently “in any measurable way”.
Trick or treat?
However, there is something misleading about how many classification results are interpreted. It is quite an intricate issue, but I am sometimes afraid this misconception might have been key to the success and popularity of neural classifiers as an analysis tool. That issue is that classifiers are developed, trained and evaluated separately for data from different participants.
To see why this is relevant, recall the phrase from above about the classical averaging method: if in different participants, the neural activity between green and red apples manifests in completely different ways, then there would probably not be an observable, statistically reliable difference. In the case of separate classifiers per participant, this would be different: now, regardless of the (vast) individual differences in neural data, the classifiers could all consistently say that green and red apples are different. As a result, the average classification performance might be very high across participants – even if they share hardly any commonalities in neural data.
In fact, this feature of classifiers is often hailed as one of the strengths of the technique. Classifiers are able to signal that there is a difference, without pinpointing what that difference is. Interesting as that may be, it is in a way at odds with the goals of the cognitive neuroscientist – to look for reliable neural correlates of behavior. In fact, a popular topic of discussion at the moment is how to interpret the results of classifiers that are able to accurately decode neural data. What is it they are actually basing this decoding on? And – interestingly – are there any commonalities across classifiers fit to different participants? However, these developments are still in their infancy, and are often specific to the exact type of classifier used. Moreover, they are only now slowly gaining popularity, over ten years after classifiers burst onto the scene of cognitive neuroscience.
What to learn from all of this?
Does this make classification analyses worthless for our field? Of course not. There are many inferences about neural processing that have been made possible by means of classification analyses: many of these inferences are valuable and would have been impossible with conventional methods. Furthermore, the growing insight that predicting conditions from data can be just as valuable as predicting data from conditions has led to a new appreciation of what can be done with the data we collect. However, when a new technique garners widespread popularity in a short time span, it is worth critically evaluating where this popularity comes from. This is especially the case when a new technique is promoted as being ‘more sensitive’ – which may sometimes only mean that it produces more (false?) positives.
At the very least, it is worth asking ourselves what it is a classifier really does. If we don’t, we simply end up using tools we don’t understand, to make inferences about an organ we don’t understand.