February 27, 2017

The Cosyne Workshops, day 1: Deep Learning and the Brain

Today, I went to the ‘Deep Learning and the Brain’ workshop. This post collects some excellent considerations I heard there on the analogies between deep artificial networks and the cortex. I have also included yesterday’s talk by Yoshua Bengio at the main meeting.

Yoshua Bengio

Yoshua Bengio (introduced as “the second most cited Canadian computer scientist”) gave the “last but not least” talk at the Cosyne main meeting. Quite a few neuroscientists I talked to were decidedly unimpressed by it, essentially because (I think) they feel the subject is not too relevant for neuroscience. He himself explained, towards the end, why his approach differs from the typical neuroscientist’s. The talk was very general and mostly non-technical, which is something I like a lot at conferences. After all, papers exist for the details, and it’s extremely important to reflect on where a branch of science is going, openly and publicly.

So, what follows is what I got from his talk.

There has been great interest in machine learning recently, due to its incredible success (including commercial success). Successful training of deep neural networks only started in recent years, but it descends from the early connectionist approaches of the 80s. Before connectionism, a lot of work in early AI was done without any learning at all: knowledge was instead given to the computer manually, and representations were not distributed across the model. Now the learned function is seen as a composition of simpler operations (inspired by neural computation). Note that learning, as in connectionism, still comes from the global minimisation of an objective function, by seeing multiple examples, multiple times. What is new, though, with deep learning, compared to early connectionism? Many things, some of which have a biological inspiration:

  • ReLUs instead of sigmoids, enabling gradients to propagate through much deeper networks (see the small sketch after this list).
  • Dropout.
  • Unsupervised learning, including generative models; this is what lets GANs “dream”.
  • Attention mechanisms, which now form part of machine translation systems; this also relates to the recent development of neural networks that can read from and write to memory.

Not only do neural networks work incredibly, crazily better (e.g. at image recognition) than 5 years ago, but we also have a better theoretical understanding of them: how they generalise, how they optimise towards their goal. We now know that non-convexity and local minima are not an insurmountable problem; we have stochastic gradient descent, curriculum learning, and so on.
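To illustrate the first point in the list above: the sigmoid’s derivative is at most 0.25 and vanishes for large inputs, so gradients shrink multiplicatively across layers, while a ReLU passes a gradient of exactly 1 wherever the unit is active. A tiny numpy sketch of my own (not from the talk):

```python
import numpy as np

# Derivatives of the two activation functions on a small grid of inputs.
x = np.linspace(-6, 6, 13)
sigmoid = 1 / (1 + np.exp(-x))
d_sigmoid = sigmoid * (1 - sigmoid)     # at most 0.25, ~0 for large |x|
d_relu = (x > 0).astype(float)          # exactly 1 for active units

print("max sigmoid gradient:", d_sigmoid.max())              # ~0.25
print("after 10 sigmoid layers:", d_sigmoid.max() ** 10)     # vanishes
print("after 10 active ReLU layers:", d_relu.max() ** 10)    # stays 1
```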

But there’s still a big gap between deep learning and neuroscience. Techniques such as backprop through time are impossible in a real brain. Backprop in general does not seem biologically plausible as it is. And, of course, there are spikes. And neuron types. And neurotransmitter types. Note that backprop is not only about supervised learning: it can be used in the unsupervised case too, as in GANs, or in a model predicting the next bit of data given the previous ones. Can we think of a sort of GAN in the brain? (Whoops, I missed a slide here, no answer!)

Learning rules that could work for both machines and brains would have to be biologically plausible implementations of backprop. It’s not clear how to do this: joint training at the network level in the brain is a mystery. Some progress has been made bridging the gap between deep learning, RBMs and STDP; I won’t report the technical details of his work here. An important point about the connection with STDP (see Hinton’s talk at Stanford, 2016) is the following: at the moment, ML algorithms have a prediction phase and an update phase, which are separate, and weights/parameters are updated only in one of them.

So what else could be done in neuroscience to bridge this gap? Maybe deep learning predicts that there could be distinct phases: one with no weight update for pyramidal neurons, and another where feedback is received and neural activation moves in a direction that somehow “corrects” the prediction. If something like this is shown, then we have evidence for backprop-like learning in the brain (though the opposite may also turn out to be true, I think).

Questions from the audience

Do we understand why and how DNNs learn?

Yes. We understand quite a lot. A lot more than in the brain.

What about social and philosophical impacts of automation and AI?

The AI community is aware of this and there’s a lot of discussion about it, up to the level of the United Nations. See, for example, the “Beneficial AI” conference.

I found it interesting that, quite a few times during the talk, he differentiated between “we” and “you”, meaning “you neuroscientists”. As I said above, he neatly explained how his approach differs: instead of starting from the brain and trying to make sense of what it does (perhaps using the results in applications later), Bengio tries to find something that can work (in learning, in reasoning, etc.) and then looks at whether these ideas apply to the brain. Matthias Bethge’s talk elaborates on this below.

The workshop: “Deep learning and the brain”

The short introduction to the workshop quickly listed the (quite important!) differences between deep learning and a computational neuroscience approach: the need for unrealistic amounts of training data, the emphasis on supervised learning, and biologically unrealistic learning rules are all properties of deep networks that set them apart from brain networks.

Receptive fields in DCNNs and the cortex

A big topic, which I find very fascinating (even if, admittedly, it isn’t everything), is the comparison between the role of a given unit in a deep network and in a brain network. I’m mostly thinking about deep convolutional neural networks for image recognition and the visual pathway, respectively, because they are the best-known systems, at least as far as I know. I mean, think about it: they are trained in completely different ways, and they are implemented in completely different ways; the only thing they have in common is being made of a large number of nonlinear units arranged in layers (and the cortex is not actually feedforward). But still, there seems to be a striking resemblance between receptive fields in the two cases; at a bare minimum, a qualitative one: units respond to simple, local features (differences of Gaussians, Gabors) in the first layer, gratings and patterns in the second, and then increasingly complex patterns up to abstract objects in the last layers. It’s exciting! While many people may not agree, I find this adds a lot to our understanding of the sensory areas of the brain.
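For reference, this is the kind of “simple, local feature” meant here: a Gabor filter is a Gaussian envelope multiplied by an oriented sinusoid, and both V1 simple cells and first-layer DCNN units tend to look like one. A minimal numpy sketch of my own, with arbitrary parameter values:

```python
import numpy as np

def gabor(size=31, wavelength=8.0, theta=0.0, sigma=4.0, phase=0.0):
    """A 2D Gabor filter: Gaussian envelope times an oriented sinusoidal carrier."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_rot = x * np.cos(theta) + y * np.sin(theta)               # rotated coordinate
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))          # localisation
    carrier = np.cos(2 * np.pi * x_rot / wavelength + phase)    # orientation/frequency tuning
    return envelope * carrier

filt = gabor(theta=np.pi / 4)   # a 45-degree oriented Gabor, 31x31 pixels
```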

Nikolaus Kriegeskorte proposes that understanding what goes on in the various layers is essentially doing synthetic neurophysiology. Neural networks are not just a tool: they are models that perform a task, even a task that scales up to real-world challenges. It is true that computation in them is not very transparent, but theoretical understanding is improving, and, he says, it’s in any case better than having to stick electrodes into a living brain. His idea, it seems, is that we can use a deep network in the same way as we would use electrophysiological recordings from the cortex of a mouse or fMRI data from the human visual cortex. So, how well do DCNNs correspond to the ventral stream of the visual cortex? Their work (the same presented at the main conference, which I believe I wrote about in previous posts) shows that the area of the inferior temporal cortex known to work on faces is better matched by a neural network trained on faces. How do we measure what a unit is doing? They use an approach similar to deepdream, where they propagate derivatives back to pixel space. Despite being feedforward, DCNNs are so far the best analogue of the ventral stream.
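As a rough illustration of what “propagating derivatives back to pixel space” means (my own sketch, not the speaker’s code, and assuming a pretrained torchvision VGG16 with arbitrarily chosen layer and channel indices): start from a noise image and do gradient ascent on it so that one chosen unit’s activation grows, which yields an image of that unit’s preferred stimulus.

```python
import torch
import torchvision.models as models

model = models.vgg16(pretrained=True).eval()
for p in model.parameters():
    p.requires_grad_(False)            # we optimise the image, not the weights

img = torch.randn(1, 3, 224, 224, requires_grad=True)   # start from noise
optimizer = torch.optim.Adam([img], lr=0.05)

layer, channel = 10, 42                # which feature map to visualise (arbitrary choice)
for step in range(200):
    optimizer.zero_grad()
    x = img
    for i, module in enumerate(model.features):
        x = module(x)
        if i == layer:
            break
    loss = -x[0, channel].mean()       # maximise the mean activation of that channel
    loss.backward()                    # derivatives flow back to pixel space
    optimizer.step()

# 'img' now approximates the preferred stimulus of that unit.
```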

The second talk, by Michael Oliver (Berkeley?), is again about the ventral stream and convolutions. These are some very rough notes. The canonical models of V1 are the simple cell and complex cell models; in both cases, the features selected are Gabors. They want to work on higher areas (V2), first of all acquiring large datasets of neural activity with implantable large arrays, and then fitting their models to these data. It’s about finding spatio-temporal-chromatic filters, which they do under some assumptions on their structure and factorisability. Their model dominates over other neural network models. V2 filters turn out to be more elongated and grating-like, and also include more complicated, curvy features. The same thing is then done for V4.

Before the coffee break: Marcel van Gerven on modelling of sensory streams, with a bit of learning and behaviour. They model the response to complex, semantically rich stimuli with a stacked (deep) network of linear-nonlinear units. The model is trained with up to a million images. Again, the approach is similar to the others seen above. The layers’ sensitivities are analysed: Gabors at level 1, then patterns, complex patterns, objects. A simple but very nice thing: they analyse these using a few measures (Kolmogorov complexity, size of the receptive field, and another one that I missed), and all of them increase for higher and higher areas. This is not a new result, but it quantifies how cells in higher areas are less sensitive to location in the visual field (whence the larger RFs) and sensitive to more and more specific aspects of the stimulus (whence the Kolmogorov complexity). Similar work can be done for the dorsal stream, using fMRI data of people watching movies and recognising the actions in them. A question they ask is whether you can fit a model on one person’s data and see it perform well on another person’s recordings.

They also work on the auditory stream and on non-neural, behavioural data: they use a model to predict personality traits based on a short audiovisual clip. The ground truth is given by independent human annotators who watch the same video. The networks reach 90% accuracy. Now, what is interesting is to look at what the network focuses on in order to make these predictions. Naturally, one finds that certain areas of the face are the most relevant. This is an interesting result for social psychology, which tries to interpret what aspects are used for stereotyping, and can be applied there. Again, this would mean using a neural network as a guinea pig for something where you can’t really measure things, because you can’t really open up a human brain while it is judging personalities.

Perhaps thanks to the coffee, the first talk after the break, by Alex Kell, is the clearest so far. It is about a CNN for real-world auditory tasks (not visual, for once!): recognising a word or the genre of a song. The approach is also different: first they try to find architectures that work well at either of these tasks, then they try to combine them. They end up with a network with some shared layers and some task-specific layers. Sharing 6/7 layers gives similar performance to completely distinct (and therefore specialised) networks; at 10/11 shared layers, performance decreases significantly. Then they compare human performance on the same tasks with the network’s: it actually seems to perform better than humans. Note, importantly, that it was trained to optimise task performance and not, in any sense, to behave in the same way as a brain would. But does it do something similar to the auditory cortex? Both humans and the network listened to the same sounds, and the human brain was recorded by fMRI. It turns out the CNN predicts fMRI responses better than a traditional spatiotemporal filter model. Why does this approach work at all? Let’s study the relationship between task performance and neural predictiveness: the better a model predicts the activity of the cortex, the better it performs in the task! Finally, the model predicts a hierarchical organisation of the auditory cortex.
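A minimal sketch of the shared-trunk, task-specific-heads idea (my own, in PyTorch; the layer sizes and the split point are illustrative, not the actual architecture from the talk): early layers are common to both tasks, and each task gets its own output branch.

```python
import torch.nn as nn

class SharedAudioNet(nn.Module):
    """Shared early layers plus two task-specific heads (word and genre recognition)."""
    def __init__(self, n_words=100, n_genres=10):
        super().__init__()
        self.shared = nn.Sequential(                      # task-general representation
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.word_head = nn.Sequential(                   # task-specific branch 1
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, n_words))
        self.genre_head = nn.Sequential(                  # task-specific branch 2
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, n_genres))

    def forward(self, spectrogram):                       # input: (batch, 1, freq, time)
        h = self.shared(spectrogram)
        return self.word_head(h), self.genre_head(h)
```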

Matthias Bethge

This was, once again, a talk about the general vision Matthias Bethge has of the field and its aims (that’s also because I didn’t really keep any notes on the technical parts). I liked the evolutionary ideas at the beginning, which, I’d say, complement and justify Yoshua Bengio’s approach to understanding the brain. It was the talk I enjoyed the most, at least for today.

Understanding the brain means understanding the neural basis of an animal’s fitness. This is the evolutionary perspective on the brain. The problem with this perspective is that we don’t know the global objective (fitness) function according to which the brain has evolved. However, we do know how to benchmark individual skills with test tasks. We can then turn the problem into “understanding how the brain achieves high performance in specific tasks”. AlexNet is an example of a neural network, and is somewhat brain-inspired. But such networks existed already in the 1980s: what has changed? The fitness, of course, which from the ML point of view is the performance. So this can teach us something about the brain, in the sense that it provides a benchmark for our models. Suppose we find a model of the visual cortex: can it work for actual image recognition? It’s an important question, even from the neuroscience point of view! Of course, the problem is then moving to multi-task performance with a single network: the visual cortex performs object detection, depth estimation, image segmentation, and all sorts of inferences. In general, playing with artificial neural networks is something neuroscientists can and should do, thanks to the availability of pre-trained models and powerful tools such as TensorFlow. The rest of the talk goes into actual work on the prediction of fixations and many other things. Of course, according to these notes, what was said in this talk is more or less the same as what Bengio said, but that’s because I didn’t go into the actual work they presented. So deep neural networks are a useful tool in the sense that:

  • they focus attention on why we learn, the objectives, the fitness;
  • they can be used as yet another experimental system (as when recording from animal or human cortex) to try to understand how networks learn representations.

Finally, he goes into testing GLMs as an approximation of neural computation. If we replace a layer of a DNN with a GLM, what’s the impact on object recognition performance? If the filters are fine-tuned for object recognition, performance dramatically increases. So linear receptive field estimation can be complemented with an evaluation of the impact of the nonlinearities we ignored.

Other talks of the day

I don’t have notes for quite a few of the talks given after the lunch break. Notably, one of the speakers was Konrad Körding, who seemed to be considered one of the protagonists of the day. Sorry!

One of the talks after coffee was a tremendously mathematical one, about recognition despite stimulus variability, where the term “object manifold” was used, with the aim, of course, of telling these manifolds apart. Very theoretical results.

The last one I followed properly was by Friedemann Zenke, about learning in spiking neural networks. There are opposing motivations for working on neural networks: some people are interested in biological plausibility, spikes, STDP, all unsupervised; others are more interested in machine learning for the applications. These two schools have different goals, use different architectures, and see learning (called training in the other case) in very different ways. This talk is by someone who came from the neuroscientific interest towards the machine learning one.

First point: a Hebbian plasticity rule (like Oja 1982, or the BCM rule, also of 1982) needs a compensatory process. Most people think of compensation as homeostatic plasticity. The problem, though, is that by homeostatic plasticity people mean processes on the timescale of hours or days, whereas for stability in Oja’s rule we need very fast compensation (Zenke and Gerstner 2017): the activity can blow up in a matter of seconds. Put another way, bio-inspired learning rules are often unstable and underconstrained, and people add ad-hoc “fast homeostatic processes” of dubious origin. On the other hand, functionally motivated rules derive from an objective function, which doesn’t exist in a neuronal network.

Second part: can we do the opposite, that is, work with a spiking neural network, but one capable of solving a complex task, and with a learning rule that is local in time and space (a prerequisite for synaptic plasticity)? The problems with the ML approach are:

  • cost function
  • non-differentiability of spiking networks
  • credit assignment in hidden layers.

How to define a cost function? We can simply choose a distance between a target spike train and the one we have; the obvious choice is the van Rossum distance, which is smooth. The problem is still how to take the derivative of a spike train with respect to the weights; the approach is to smooth everything out with approximations. (Side note: there exists an integral form of the LIF model, called SRM0, which is written in terms of convolutions instead of an ODE.) Summarising: given some spike times, we want to optimise a LIF neuron to fire at those times despite random inputs, and then do the same with many such LIF neurons. What about a network with multiple layers, where the last one must fire in a given fashion? They introduce a “random feedback” system. So the learning rule does have a simple interpretation in a biological context, despite being based on an optimisation problem.
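For reference, the van Rossum distance mentioned above filters both spike trains with a causal exponential kernel and then takes the L2 distance between the filtered traces, which is why it is smooth. A minimal numpy sketch of my own (the time step and time constant are arbitrary illustrative values):

```python
import numpy as np

def van_rossum_distance(spikes_a, spikes_b, dt=1e-3, tau=20e-3):
    """Distance between two binned spike trains after exponential filtering."""
    t = np.arange(0, 10 * tau, dt)
    kernel = np.exp(-t / tau)                              # causal exponential kernel
    trace_a = np.convolve(spikes_a, kernel)[:len(spikes_a)]
    trace_b = np.convolve(spikes_b, kernel)[:len(spikes_b)]
    return np.sqrt(np.sum((trace_a - trace_b) ** 2) * dt / tau)

# Two binary spike trains with 1 ms bins: similar at the start, diverging later.
a = np.zeros(1000); a[[100, 400, 700]] = 1
b = np.zeros(1000); b[[110, 420, 900]] = 1
print(van_rossum_distance(a, b))
```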

Panel discussion

This panel discussion will actually happen tomorrow, from which you can guess that I shamefully lied about this post’s publication date.

Do cool pictures constitute an understanding?

This refers to the fancy receptive fields learned by deep networks in the various layers. According to Konrad Körding, even if we recorded the receptive fields of all neurons, we still wouldn’t understand representations. We need to understand the brain in terms of learning, of cost functions. The rationale behind this is that we can’t make sense of everything in a representation. A point they don’t seem able to answer is what counts as understanding, but they did seem to agree that what is really interesting is how representations form, how they change, how they are learned; this may have a compact explanation in terms of learning rules. So, to answer a question posed by an experimental neuroscientist in the room, the experiments we need are the ones about learning mechanisms.

What are the key experiments for the coming decade?

It’s not obvious that nature came up with an optimal solution in the brain; actually, it probably didn’t. Still, what, according to theoreticians, can a wet-lab experiment provide to ML? The answers were along the lines of error/reward signals and credit assignment signals. In biology, it’s basically unknown how the error signal is produced; in ML, it can come from a variety of things, depending on whether learning is supervised or unsupervised, etc.

Someone also points out that for deep learning we can indeed draw inspiration from the brain, but we don’t have to follow everything the brain does. The brain includes a lot of accidents of evolution, since it grew bit by bit, which is something artificial neural networks don’t have to do.

Other questions

These were proposed at the beginning of the discussion but weren’t really debated:

  • Spikes, what are they good for?

  • If you do RL, do you need backprop, and what’s the interface?

  • Are there specialised error units in the brain?