Transformers have taken the field of sequence modelling with deep networks completely by storm, becoming the standard for text processing, video, even images. RNNs, once a very active engineering field, have slowly faded into the void. All of them? No, some RNNs are bravely fighting back to claim state-of-the-art results in sequence tasks. The most surprising part? They are linear…
In this post, I want to give people who do not closely follow the (crazily fast) field of deep learning a broad introduction to recent developments in sequence modelling: explain why Transformers have imposed themselves, explain why and how RNNs are coming back, and describe these new classes of models that are starting to create a buzz.
Many thanks for help and comments to Nicolas Zucchet.
Table of contents:
We will first recall the precedents, i.e. the demise of the RNN and the success of the Transformer, and the first apparent limits of the latter. You may skip this part if you are already familiar with Transformers.
Let’s start with the basics. What does a deep network do? It typically aims to model an unknown mapping $f$ from input vectors $x \in \mathbb{R}^{N_{in}}$ to output vectors $y \in \mathbb{R}^{N_{out}}$, given a set of datapoints $\{(x_1, y_1), \dots, (x_N, y_N)\}$. In the simplest case, this is done with a multilayer perceptron (MLP), which is simply a series of stacked layers of the form $h^{(k+1)} = \phi\left(W_k h^{(k)}\right)$, with the first $h$ equal to the input and the last to the output. The representational capacity of such models is guaranteed by the universality property, which states that for any unknown function $f$ there exists a set of parameters $W_1, \dots, W_L$ such that the MLP approximates the target function arbitrarily closely (and this actually has a fairly simple proof). It doesn’t mean the solution will be easy to learn, especially depending on the data at hand, but at least it does exist. Very well.
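To make this concrete, here is a minimal numpy sketch of such a stack of layers (the function name, the tanh non-linearity and the linear final layer are illustrative choices on my part, not prescribed above):

```python
import numpy as np

def mlp_forward(x, weights, phi=np.tanh):
    """Forward pass of an MLP: h^{(k+1)} = phi(W_k h^{(k)}).

    The last layer is kept linear so outputs are unconstrained.
    """
    h = x
    for W in weights[:-1]:
        h = phi(W @ h)
    return weights[-1] @ h

rng = np.random.default_rng(0)
# Map R^4 -> R^3 through one hidden layer of width 16
weights = [rng.normal(size=(16, 4)),
           rng.normal(size=(16, 16)),
           rng.normal(size=(3, 16))]
y = mlp_forward(rng.normal(size=4), weights)
```

The universality property guarantees that, for wide enough hidden layers, some choice of `weights` approximates any continuous target on a compact set.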
But this universality property has its limits within the realm of vectors, of fixed dimension. Most real-world problems, alas, cannot be formulated like this: what if you want to process texts of varying sizes, videos of varying lengths, audios of varying durations? Translate sentences in French to sentences in English, when we don’t even know if the output length will match the input’s? All these problems can fit into a more general formulation though: learning an unknown mapping $f$ from input sequences of vectors $(x_1, \dots, x_T)$ for $T$ taking any value from 0 to infinity, to output sequences $(y_1, \dots, y_{T’})$ with the output length not necessarily equal to $T$. Now, this turns out to be a very general task: it is equivalent to learning an arbitrary program, and as such a system able to learn all such mappings is Turing-complete^{1}.
But it is not actually so difficult, and it turns out that adding a simple non-linear recurrence to the perceptron is sufficient: one can define the vanilla RNN as the system obeying the equations $h_{t+1} = \phi(W_r h_t + W_i x_t)$ and $y_t = W_o h_t$. As simple as it seems, this system has a dynamical universality property: it can approximate arbitrarily closely any non-linear dynamical system $y_{t+1} = f(y_t, x_t)$ (again, the proof derives quickly from the perceptron case). This suffices to show that these RNNs are actually Turing-complete, and can hence solve the above problem.
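Unrolling these two equations over a sequence can be sketched as follows (a minimal illustration, assuming a zero initial hidden state):

```python
import numpy as np

def vanilla_rnn(xs, W_r, W_i, W_o, phi=np.tanh):
    """Unroll h_{t+1} = phi(W_r h_t + W_i x_t), y_t = W_o h_t over a sequence."""
    h = np.zeros(W_r.shape[0])  # assumed initial state h_0 = 0
    ys = []
    for x in xs:
        h = phi(W_r @ h + W_i @ x)
        ys.append(W_o @ h)
    return np.stack(ys)

rng = np.random.default_rng(0)
W_r = 0.3 * rng.normal(size=(8, 8))   # small spectral radius keeps dynamics stable
W_i = rng.normal(size=(8, 2))
W_o = rng.normal(size=(2, 8))
ys = vanilla_rnn(rng.normal(size=(5, 2)), W_r, W_i, W_o)
```

Note that the same three matrices handle sequences of any length $T$: the loop simply runs longer.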
RNNs with several architectural adaptations (LSTMs, GRUs) have actually ruled the world (or at least language, audio, time series, etc.) for a few years. They have long been very difficult to train, but the literature had slowly progressed to better behaved systems that were able to do basic text classification, language modelling, or translation. Now, they are not even mentioned in the latest textbooks (e.g. Simon Prince’s Understanding Deep Learning). What happened?
Although the above seems like the only obvious solution to learning on sequences, another one exists: what if we processed every vector of the sequence with an MLP? Then we lose all global information about the sequence; not great. What if, instead, we allow this MLP to sometimes look at the other elements, for example through a learned aggregator function $a(x_t, \{x_1, \dots, x_T\})$? This function would be responsible for summarizing how the rest of the sequence relates to the element at hand. We could call it “attention”: each $x_t$ would look at the parts of the sequence most relevant to its own information. This is exactly what a Transformer does.
Provided that the function $a()$ can take sequences of any length, we thus end up with a system able to learn mappings on sequences as above, but without the tedious part of having to backpropagate through long iterated dynamics, with all the attendant risks of vanishing or exploding gradients. Another advantage is parallelization: before, to compute the state of the RNN at timestep $t$, you had to wait until the calculation at the previous timestep was finished. If you wanted to process a sequence of length 1M, this would take a minimum of 1M clock cycles. Now everything can be massively parallelized: if it fits in memory, you can compute the network layers on all timesteps at the same time, just letting them share information when needed.
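A minimal sketch of such an aggregator $a()$, here standard single-head softmax attention (the weight names are mine, and real Transformers add multiple heads, causal masking and normalization layers on top):

```python
import numpy as np

def attention(X, W_q, W_k, W_v):
    """a(x_t, {x_1..x_T}): every position attends to the whole sequence.

    X has shape (T, d); the same weights work for any sequence length T.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[1])            # (T, T) pairwise relevance
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)     # softmax over the sequence
    return weights @ V                                # weighted summary per position

rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(size=(4, 4)) for _ in range(3))
out = attention(rng.normal(size=(7, 4)), W_q, W_k, W_v)
```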
With all this, Transformers really seem like the cool kid. Why would anyone bother with sad RNNs anymore? Well, despite all their properties, Transformers have their dark sides too. First problem: note that $a()$ has to be applied for each element $x_t$, and each application processes the whole sequence. That means one application of $a()$ does on the order of $T$ computations, and there are $T$ of them, from $a(x_1, \{x_1, \dots, x_T\})$ to $a(x_T, \{x_1, \dots, x_T\})$. To apply attention we thus actually have to perform on the order of $T^2$ computations. Compare to the RNN: we just go through each timestep once, each time applying a constant-sized computation, so we get $\mathcal{O}(T)$. First point of the revenge! There are actually a few workarounds to this issue: for example, if $a$ is made completely linear, there are ways to compute it faster… by showing that the computation can be reformulated as an RNN^{2}! We will come back to it…
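That linear-attention trick can be illustrated in a few lines: with the softmax removed (and ignoring the normalization used in practice), causal attention computed the quadratic way coincides exactly with a recurrence carrying a constant-size state:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 3
Q, K, V = rng.normal(size=(3, T, d))

# Quadratic form: causal linear attention, O(T^2)
mask = np.tril(np.ones((T, T)))          # position t only sees s <= t
y_quadratic = (mask * (Q @ K.T)) @ V

# Recurrent form: same output, O(T) with a constant-size state
S = np.zeros((d, d))                     # running sum of outer products k_s v_s^T
y_recurrent = np.empty((T, d))
for t in range(T):
    S = S + np.outer(K[t], V[t])
    y_recurrent[t] = Q[t] @ S

assert np.allclose(y_quadratic, y_recurrent)
```

The state $S_t = \sum_{s \le t} k_s v_s^{\top}$ plays the role of the RNN hidden state: its size does not grow with $T$.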
Second issue: in the Transformer as formulated here, each element sees the same series of computations, and hence all information about the ordering of elements in the sequence is lost. This information has to be added artificially through another input vector $p_t$, called a positional embedding, which can for example be a combination of sine waves of time. The problem is that positional embeddings generalize very poorly out-of-distribution: if a Transformer is trained on sequences of length less than 2000 and tested on longer sequences, performance seems to break, even when the architecture and the definition of the positional embeddings allow longer inputs. Finding ways around this is a very active area of research, and it is possible that it will not be an issue for very long^{3}, but it is the reason why we have so far had fixed (and fairly limited) context windows on all deployed use cases.
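For reference, here is a sketch of the classic sinusoidal positional embeddings (one common choice among many; the 10000 base is the convention from the original Transformer paper):

```python
import numpy as np

def sinusoidal_embedding(T, d):
    """p_t as interleaved sines and cosines of geometrically spaced frequencies."""
    pos = np.arange(T)[:, None]                          # positions 0..T-1
    freq = 1.0 / (10000 ** (np.arange(0, d, 2) / d))     # one frequency per pair
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(pos * freq)
    pe[:, 1::2] = np.cos(pos * freq)
    return pe

pe = sinusoidal_embedding(2000, 64)
```

Nothing in this formula caps $T$, yet models trained with it still struggle past the lengths seen in training.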
Are these mere roadblocks or hard limits for Transformers? Future will tell, but they have already been sufficient to revive interest in RNNs…
RNNs have always remained an active area of research, but a particular benchmark has been a boon for them: the Long-Range Arena benchmark, released in 2020^{4}, which combines reasoning and classification tasks over sequences of several thousand tokens. The hardest task, Path-X, over 16K tokens, was simply out of range for any model back then! Surprisingly, all state-of-the-art results on this benchmark have since been attained by RNNs^{5}. Who are they, and how do they do it?
We will talk here of a small set of influential publications: notably work from the lab of Chris Ré at Stanford, in particular by Albert Gu, who developed the HiPPO network^{6} and then S4^{7}; the RWKV indie project^{8}; and a more “first-principles” LRU approach by Antonio Orvieto and coauthors^{9}. These all seem to rely on very similar principles, and to give a very concrete perspective I will focus on the formalism of the latter publication, although most details will be identical up to minor tweaks; I will hence refer to this module below as an “LRU” (standing for Linear Recurrent Unit).
The fantastic core idea is the following: RNNs are prone to vanishing/exploding gradients because of the non-linearity, and because of eigenvalues of the recurrence above 1 (exploding directions) or close to 0 (fast-decaying directions). Solution: get rid of both! Which means effectively using linear RNNs. The basic equation of an LRU is thus:

$$h_{t+1} = \Lambda h_t + B x_t$$

with $\Lambda$ the recurrence matrix, $B$ the input matrix, $h_t$ the hidden states and $x_t$ the inputs.
The problem is that linear RNNs are quite boring by themselves: they can only exhibit a fixed point at 0, and activity that either explodes (eigenvalue > 1), decays to 0 (eigenvalue < 1), stays idle (eigenvalue = 1), or oscillates in clean concentric circles (pair of complex eigenvalues with modulus 1). There is some variety, but not enough to even get close to the diversity of dynamical systems out there, and all context-dependent operations (for example discovering that “not bad” is positive, rather than the summed valences of “not” and “bad”) are out of reach.
We of course need an additional trick, and it consists of adding a non-linear mapping (in general an MLP, or even just a Gated Linear Unit) on top of the hidden state. Effectively, the output will then be:

$$y_t = \hat{f}_{\text{out}}\left(C h_t + D x_t\right)$$

where $\hat{f}_{\text{out}}$ is the output non-linearity, which can be made as complex as one wants, and which takes as input a linear readout of the hidden state, $Ch_t$ (plus a skip connection $Dx_t$ that helps training without modifying representational capacity). What matters is that all non-linearities stay within single tokens: they never enter computations that affect the time dynamics of the internal states. The dynamics here are thus always linear.
That’s all, as simple as it seems! Then stack a few of those one above the other, and you’re good to crush the long-range arena, and even design competitive LLMs! It is astonishing that a system relying only on linear dynamics, supposed to already be boring past the second year of undergrad can reach state-of-the-art results. And they can be made even simpler as we will see below.
In any case, this deserves a few more explanations. In the following, we will cover important tricks, the big universality question (with a demo), and some more ideas about these networks.
Recurrence was already linear; if that is not simple enough, you can just make it diagonal. Effectively, this means that in the equations above we can parametrize $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_d)$, with the important caveat that the $\lambda_i$ are complex numbers. This means that all tensors in the computation graph will be complex, which torch and jax handle pretty well, gradients included. Even accounting for the fact that eigenvectors can be pushed into the input and output matrices of the recurrence ($B$ and $C$), it remains a surprising fact (for the linear algebra nerds, it means that the nilpotent components in the Dunford decomposition of $\Lambda$ can be thrown away). References for this fact are the paper by Gupta et al.^{10} and Orvieto et al.^{9} too.
Each eigenvalue can furthermore be written as

$$\lambda_j = \exp\left(-\exp(\nu_j) + i\theta_j\right)$$

where $i$ is the imaginary unit, and hence the modulus, equal to $\exp(-\exp(\nu_j))$, is necessarily smaller than 1! This is called the exponential parametrization, and people have noticed it considerably facilitates training.
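This is easy to check numerically: whatever the value of the unconstrained real parameter $\nu$, the resulting eigenvalue lies strictly inside the unit disk:

```python
import numpy as np

rng = np.random.default_rng(0)
nu = rng.normal(size=100)                  # unconstrained real parameters
theta = rng.uniform(0, 2 * np.pi, 100)     # phases
lam = np.exp(-np.exp(nu) + 1j * theta)     # exponential parametrization

# Stability holds by construction, for any nu:
assert np.all(np.abs(lam) < 1.0)
```

Gradient descent on $\nu$ and $\theta$ can thus never push the recurrence into the unstable regime.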
Unrolling the recurrence from $h_0 = 0$ gives

$$h_t = \sum_{k=0}^{t-1} \Lambda^{t-1-k} B x_k$$

and this can be seen as a convolution of sorts on the inputs $x_k$. This was actually one of the big motivations for removing the non-linearity from the recurrence in the first place, and it leads to a fast algorithm described in detail in section 2.4 of the S4 paper^{7}. Note that the goal is not to compute all tokens in parallel (we would get the same quadratic explosion as for Transformers past a point), but to parallelize the recurrence by chunks to better use the tensor-handling capabilities of GPUs. Also note that for diagonal $\Lambda$, computing the matrix powers simply means exponentiating the $\lambda_i$, so again, everything falls together nicely!
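A toy check that the recurrent and convolutional views agree (real implementations use FFTs or parallel scans; this quadratic-memory version is only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_h, d_x = 12, 4, 3
lam = 0.8 * np.exp(1j * rng.uniform(0, np.pi, d_h))  # diagonal recurrence
B = rng.normal(size=(d_h, d_x))
xs = rng.normal(size=(T, d_x))

# Sequential view: h_{t+1} = diag(lam) h_t + B x_t
h = np.zeros(d_h, dtype=complex)
hs_rec = []
for x in xs:
    h = lam * h + B @ x
    hs_rec.append(h.copy())

# Convolution view: h_t = sum_{k<=t} lam^{t-k} (B x_k), all timesteps at once
Bx = xs @ B.T                                         # (T, d_h)
t_idx = np.arange(T)
kernel = lam[None, None, :] ** (t_idx[:, None, None] - t_idx[None, :, None])
kernel *= (t_idx[:, None, None] >= t_idx[None, :, None])  # causal mask
hs_conv = np.einsum('tkd,kd->td', kernel, Bx)

assert np.allclose(np.stack(hs_rec), hs_conv)
```

Because $\Lambda$ is diagonal, the "matrix powers" in the kernel are just element-wise powers of the $\lambda_i$.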
Now for the big question: our system so far is essentially a simple linear DS followed by a token-wise non-linearity. Is this a universal approximator of dynamical systems, and a Turing-complete system just like the vanilla RNN? This idea seemed crazy at first, given the limited nature of linear dynamics, but we can just test it! I ran some tests with a one-layer LRU network (with a three-layer MLP as its output), parametrized as above, and trained it to do some simple but very non-linear things, like reproducing a bistable, double-well 1D system^{11}. After some tinkering, here are some quick results:
There is no doubt that the LRU manages to capture intrinsically non-linear dynamics with its little linear engine. Understanding exactly how this kind of phenomenon arises will be an interesting research project for the future, as will proving mathematically what the exact capabilities and limits of such networks are (a recent proof of universality can for example be found in this preprint^{12}). Here’s a quick intuition: we aim to reproduce unknown dynamics $y_{t+1} = f(y_t, x_t)$ with a system such that $y_t = g(h_t)$ and $h_{t+1} = Ah_t + Bx_t$, by learning a universal mapping $g$ and matrices $A$ and $B$. Let us assume that for any sequence $x_1,\dots, x_T$ we can guarantee that the sequences $h_1,\dots,h_T$ will always be different, i.e. we won’t get the same $h_i$ for two different input sequences. Then it becomes trivial to just learn a $g$ such that for a given $y$, for any $h$ in the preimage set $g^{-1}(y)$, we have $g(Ah + Bx) = f(y, x)$. Our assumption is not trivial, but by keeping $T$ fixed and increasing the hidden dimensionality $N$ we can always achieve it, an example solution being a block-circulant matrix such that $A^T = Id$ and such that all $A^tB$ have orthogonal column spaces. This obviously requires $N = T \times N_{in}$.
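The block-circulant construction at the end can be checked numerically in the simplest case $N_{in} = 1$, where $A$ is a cyclic shift (an illustrative toy of the injectivity argument, not the construction of any of the cited papers):

```python
import numpy as np

T = 5
A = np.roll(np.eye(T), 1, axis=0)   # cyclic shift matrix: A e_k = e_{k+1 mod T}
B = np.eye(T)[:, :1]                # first canonical basis vector

# A^T = Id, so the state never decays or explodes
assert np.allclose(np.linalg.matrix_power(A, T), np.eye(T))

# The A^t B are mutually orthogonal: each input lands in its own direction
cols = np.hstack([np.linalg.matrix_power(A, t) @ B for t in range(T)])
assert np.allclose(cols.T @ cols, np.eye(T))

# Hence h_T = sum_t A^{T-1-t} B x_t stores the whole sequence: the map from
# (x_1..x_T) to h_T is injective, and g can read everything off h_T.
```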
Addendum: I thought any approximation result of this style would only hold on finite time intervals, but I was pointed to a fantastic proof by Stephen Boyd and Leon Chua^{13} showing that approximations of non-linear dynamics with a linear DS can actually hold on infinite time intervals, provided the target has a property called fading memory, which is exactly what one would think it is: differences of inputs in the far past have decreasing influence on the present state (so nothing chaotic). However… if you care about long-term dependencies, up to what point do you want the fading memory property to hold? #foodforthought
Addendum 2: At this point I think it is essential to add a few words about the “HiPPO” theory I mentioned along the way, because it is a very beautiful and slightly orthogonal view of the problem. The idea behind the design of S4 and its precursor^{6} is to summarize the input sequence as well as possible into the final hidden state. We know ways to efficiently summarize sequences or signals in a Euclidean representation, for example through Fourier series or the less-known Legendre polynomials. Early works proposed recurrent units that computed Fourier^{14} or Legendre^{15} coefficients of an input signal in an online fashion, but the general framework was formalized in Albert Gu’s “High-order Polynomial Projection Operators” work, which forms the basis for all these developments^{6}. It requires quite a bit of maths, but it also makes for one of the most theoretically grounded frameworks in DL.
In fact it is fascinating to see that given enough memory space a linear DS can accomplish all sorts of interesting things. A paper by Zucchet, Kobayashi, Akram and colleagues^{16} for example shows how they can reproduce a Transformer-like attention mechanism. Theoretical constructions always require capacities that would preclude any advantages of such networks, but in practice they seem to get along pretty well. In the end, how do they compute?
One thing that is particularly striking now is that these networks seem to have a fundamentally different nature than traditional non-linear RNNs. The latter are believed to compute mostly by exploiting interesting dynamical structures, like fixed points^{17}, fancier topological shapes like rings, spheres and toroids, and rich non-normal transients. None of this is possible with LRUs. Their only possibility is to throw inputs into a large bunch of slowly oscillating modes, and then learn useful patterns from these internal rich melodies. As a weird analogy, it reminds me of the concept of epicyclic computing, by which Ptolemy was able to fit very complex astronomical motions using carefully adjusted sets of numerous rotating gears. Similarly, it might be that by having enough oscillating modes of diverse frequencies and phases, LRUs are epicyclic computers able to generate useful patterns from data, through which non-linear dynamics are ultimately learned.
A quick demonstration of the phenomenon is the following: in one of the papers cited above^{17}, RNNs were trained to perform the “flip-flop task”, which consists in receiving upwards or downwards pulses and keeping in memory the direction of the last pulse received. Very striking dynamical landscapes appeared when dissecting these RNNs, with notably fixed point attractors that encoded the memory of last pulse received. An LRU net can perfectly be trained to do this task, as demonstrated below, but this time no internal bistable attractors are to be found, and recurrent units simply keep oscillating, as they should do.
I cannot close without strongly recommending a recent paper by Il Memming Park’s group^{18}, which demonstrates continuous-attractor-like behavior without internal dynamical attractors, with oscillating dynamical modes instead, and outlines a rich theory. All this points, I think, to a deep dichotomy between two different ways of computing with high-dimensional dynamics; the advantages and disadvantages of each will be interesting to understand, as will figuring out which one brains are using.
Kenji Doya, Universality of fully-connected recurrent neural networks, 1993 ↩
Katharopoulos et al., Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention, 2020 ↩
See this write-up: https://kaiokendev.github.io/context ↩
Tay et al., Long Range Arena: A Benchmark for Efficient Transformers, 2020 ↩
Gu, Dao et al., HiPPO: Recurrent Memory with Optimal Polynomial Projections, 2020 ↩ ↩^{2} ↩^{3}
Gu et al., Efficiently Modeling Long Sequences with Structured State Spaces, 2021 ↩ ↩^{2}
RWKV project, Peng et al., RWKV: Reinventing RNNs for the Transformer Era, 2023 ↩
Orvieto et al., Resurrecting Recurrent Neural Networks for Long Sequences, 2023 ↩ ↩^{2}
Gupta et al., Diagonal State Spaces are as Effective as Structured State Spaces, 2022 ↩
Network trained to reproduce at its outputs trajectories sampled from the target DS with random initial points, and additionally an MLP encoder that maps from the initial state to an initial $h_0$ for the LRU. ↩
Orvieto et al., On the Universality of Linear Recurrences Followed by Nonlinear Projections, 2023. My understanding is that the proof involves showing that there can be a bijection from input sequence to final hidden state if N is large enough, but I’ll have to read it again. ↩
Boyd & Chua, Fading Memory and the Problem of Approximating Nonlinear Operators with Volterra Series, 1985 ↩
Zhang et al., Learning Long Term Dependencies via Fourier Recurrent Units, 2018 ↩
Voelker et al., Legendre Memory Units: Continuous-Time Representation in Recurrent Neural Networks, 2019 ↩
Zucchet, Kobayashi, Akram et al. Gated recurrent neural networks discover attention, 2023 ↩
Sussillo and Barak, Opening the Black Box: Low-Dimensional Dynamics in High-Dimensional Recurrent Neural Networks, 2012 ↩ ↩^{2}
Park et al., Persistent learning signals and working memory without continuous attractors, 2023 ↩
This post is an attempt at summarizing results obtained over five years of research in the laboratory of Srdjan Ostojic, at ENS Paris, on the uses of low-rank RNNs, in a ten-minutes read. Let’s see if that’s enough to catch your interest!
This post has been written with invaluable inputs from several lab members, in particular Friedrich Schuessler, Francesca Mastrogiuseppe and Srdjan Ostojic. Many thanks to them!
Artificial neural networks are super cool. They are known for all sorts of computational feats, and they also happen to model brain processes quite well. Among artificial neural networks, there are recurrent neural networks (RNNs), which contain a pool of interconnected neurons whose activity evolves over time. These networks can be trained to perform all sorts of cognitive tasks, and they exhibit activity patterns that are quite similar to what is observed in many brain areas!
A common approach nowadays to understand neural computations is the state-space approach: one looks at the collective activity of many neurons in a network as a vector $(x_1(t), \dots, x_N(t))$ in a high-dimensional state-space. It turns out the neural trajectories in this state-space are not random, but stay confined to particular low-dimensional subspaces, or manifolds. If you are not familiar with these concepts, a lot of places on the internet summarize them very well^{1}. However, a big remaining mystery is how connections in a network of neurons are able to generate this very organized collective activity, able to solve complex tasks. Indeed, computer scientists have figured out how to train neural connections to get a network to do a certain task, but this doesn’t provide a deep understanding of why the obtained connections implement the task, leading some people to label RNNs “black-box models”.
A first paper, published by Francesca Mastrogiuseppe and Srdjan Ostojic in 2018^{2}, has shown that low-rank RNNs could be a solution to the mystery. A low-rank RNN is a network whose connections obey particular algebraic properties, namely its connectivity matrix is low-rank, and can be written formally as follows:
$$\boldsymbol{J} = \frac{1}{N}\sum_{r=1}^{R} \boldsymbol{m}^{(r)} {\boldsymbol{n}^{(r)}}^{\top}$$

where the $\boldsymbol{m}^{(1)}, \dots, \boldsymbol{m}^{(R)}$ and $\boldsymbol{n}^{(1)}, \dots, \boldsymbol{n}^{(R)}$ are all $N$-dimensional vectors, and $R$ is the rank of the matrix $\boldsymbol{J}$. Such a decomposition may look cumbersome, but it actually has a lot of advantages. Indeed, although they were not a general subject of study before, low-rank matrices had made many appearances in the history of computational neuroscience and machine learning, and that is no coincidence^{3}.
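In code, building such a matrix from its connectivity vectors and checking its rank takes a few lines (I use the $1/N$ scaling common in this literature):

```python
import numpy as np

rng = np.random.default_rng(0)
N, R = 200, 2
m = rng.normal(size=(N, R))   # connectivity vectors m^{(r)}, stacked as columns
n = rng.normal(size=(N, R))   # connectivity vectors n^{(r)}

J = (m @ n.T) / N             # J = (1/N) sum_r m^{(r)} n^{(r)}^T

assert J.shape == (N, N)
assert np.linalg.matrix_rank(J) == R
```

Despite having $N^2$ entries, $J$ is entirely determined by the $2RN$ numbers in the connectivity vectors.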
In their paper, Francesca and Srdjan show that low-rank RNNs have provably low-dimensional patterns of activity, and that they can be designed to accomplish many interesting tasks like Go-NoGo or stimulus detection. An extensive mean-field theory showed that the statistics of connectivity could be linked to the dynamics of the network, paving the way for a deeper understanding of neural computations. Many papers followed, deepening different aspects of that theory, and it is now a good time to wrap up what we know about them, and what they can bring to neuroscience. I will cover some interesting tidbits about low-rank RNNs in a first part, and then give a quick overview of what the different papers are about.
Low-dimensional dynamics
Low-rank RNNs are defined in terms of vectors. There are the recurrent connectivity vectors, namely the $\boldsymbol{m}^{(r)}$ and the $\boldsymbol{n}^{(r)}$ we mentioned before, and also some input vectors $\boldsymbol{I}^{(s)}$ feeding external signals to the RNN. An essential property, easy to prove mathematically, is that the neural activity vector $\boldsymbol{x}(t)$ in a low-rank RNN is constrained to lie in the subspace spanned by the $\boldsymbol{m}^{(r)}$ vectors and the $\boldsymbol{I}^{(s)}$ vectors. We can decompose this space into a recurrently-driven subspace and an input-driven subspace, which together explain all of the activity in the RNN.
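This confinement is easy to verify numerically on a toy rank-one network (a discrete-time simplification of the continuous-time rate dynamics used in the papers; the sinusoidal input is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 300
m, n, I_vec = rng.normal(size=(3, N))
J = np.outer(m, n) / N                     # rank-one connectivity

# Simulate x_{t+1} = J phi(x_t) + I u(t) from x_0 = 0
x = np.zeros(N)
for t in range(50):
    x = J @ np.tanh(x) + I_vec * np.sin(0.3 * t)

# x must lie in span{m, I}: residual after projection is ~0
basis, _ = np.linalg.qr(np.stack([m, I_vec], axis=1))
residual = x - basis @ (basis.T @ x)
assert np.linalg.norm(residual) < 1e-8 * np.linalg.norm(x)
```

The reason is visible in the update itself: $J\phi(x) = \boldsymbol{m}\,(\boldsymbol{n}\cdot\phi(x))/N$ always points along $\boldsymbol{m}$, and the input term along $\boldsymbol{I}$.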
In particular, when the network is not receiving any external inputs, hence generating spontaneous activity, it forms an $R$-dimensional dynamical system. If the network happens to be of rank 1 or 2, the whole set of possible dynamics can actually be visualized through a phase portrait, in which the direction and velocity of the dynamics are plotted at every point of the recurrent subspace as an arrow. Here are some example phase portraits, for a bistable network, a network implementing an oscillatory cycle, and a network implementing a ring attractor.
Even for networks of rank 3, dynamics can still be usefully visualized, although it requires more creativity. Now what happens if we add inputs? It turns out that tonic inputs (meaning input signals that stay at a constant value for periods of time) simply move the whole recurrent subspace along an input axis towards a new region of the state-space, like an elevator, transforming the dynamics in the process:
Thanks to this property, we can still visualize the dynamics of the same network when it is receiving different inputs, and observe how these inputs modify its activity patterns. This explains how external cues could act as contextual modulators on a network, accelerating or slowing its dynamics, turning certain attractors on and off, or changing its behavior altogether.
The connectivity space, and the role of correlations
We have seen how low-rank RNNs can give a very visual, geometrical understanding of neural activity, but it still remains to be explained how to wire neurons to obtain some desired dynamics. We will here dissect low-rank RNNs a bit further to see what they can tell about this question!
First, let’s look at the scale of our problem. For a full-rank RNN, there is one connection between every pair of neurons, which makes at least $N^2$ connections to train and understand. One would have to wake up very early to understand how they affect the dynamics. Fortunately, for low-rank RNNs the only trainable parameters are the entries of the connectivity vectors, so for each neuron the parameters $m_i^{(1)}, n_i^{(1)}, \dots, m_i^{(R)}, n_i^{(R)}$, which makes $2R$ parameters per neuron, plus $N_{in}$ entries if we have as many input vectors. Overall, the number of free parameters in the network is $\mathcal{O}(N)$, which is much, much better than before, but still a lot to look at.
Thankfully there is a way to look at these parameters which is quite illuminating. As we mentioned, every neuron is characterized by $2R + N_{in}$ connectivity parameters. Each neuron can thus be visualized as a point in a $(2R + N_{in})$-dimensional space, which we will call the connectivity space. A rough visualization of this parameter space can be obtained by plotting the pairwise distributions of parameters, giving this kind of plots:
The obtained cloud of points seems rather random, but of an organized kind of randomness. A natural idea is thus to approximate it with a multivariate probability distribution. Let’s start by considering the most well-known distribution, the Gaussian. Formally, we will say that the vector $\boldsymbol{c}$ of connectivity parameters of a neuron follows a multivariate Gaussian distribution, characterized by its mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$, which writes as:

$$P(\boldsymbol{c}) = \frac{1}{\sqrt{(2\pi)^{d}\,|\boldsymbol{\Sigma}|}} \exp\left(-\frac{1}{2}(\boldsymbol{c}-\boldsymbol{\mu})^{\top}\boldsymbol{\Sigma}^{-1}(\boldsymbol{c}-\boldsymbol{\mu})\right), \qquad d = 2R + N_{in}$$
Every neuron is then a random sample of this global distribution, which summarizes the statistics of connections in the network. The whole complexity of the RNN is thus reduced to a handful of parameters: the entries of $\boldsymbol{\mu}$ (there are $2R + N_{in}$ of them) and of $\boldsymbol{\Sigma}$ (there are fewer than $(2R + N_{in})^2$ of them!). This description is thus compact, but is it interpretable? It turns out it is! A mean-field theory applied to such a Gaussian network can give a formulation of the dynamics in terms of the parameters of the above distribution.
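A sketch of this generative view of connectivity, for a rank-one network with correlated $m$ and $n$ entries (the particular covariance values are illustrative, not taken from the papers):

```python
import numpy as np

rng = np.random.default_rng(0)
N, R = 500, 1
mu = np.zeros(2 * R)                    # means of (m_i, n_i)
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])          # correlation between m and n entries

# Each neuron's connectivity parameters are one sample from the joint Gaussian
params = rng.multivariate_normal(mu, Sigma, size=N)
m, n = params[:, 0], params[:, 1]
J = np.outer(m, n) / N                  # the resulting rank-one connectivity

# The empirical connectivity statistics match the prescribed distribution
assert np.allclose(np.cov(params.T), Sigma, atol=0.25)
```

Mean-field theory then expresses the network's dynamics directly in terms of `mu` and `Sigma`, rather than the $N^2$ entries of `J`.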
The dynamics that can be obtained by replacing the connectivity space of a low-rank RNN with a multivariate Gaussian are quite diverse - many attractors and limit cycles can be explained in this way - but it would be too easy if they could explain everything. It turns out that certain dynamical landscapes cannot be obtained in this way, and the obtained networks lack flexibility. Fortunately, we can enrich them without complexifying the framework too much by replacing the Gaussian distribution by a mixture-of-Gaussians. Here is an example of a connectivity space with a mixture of two Gaussians, colored in green and purple:
It turns out that with enough components in the mixture, a rank-$R$ RNN with connectivity parameterized in this way is a universal approximator of $R$-dimensional dynamical systems, so that is theoretically “all we need”! Moreover, the mean-field theory extends naturally to this case, providing explanations of how the different parameters of components of the mixture can be tuned to obtain complex dynamics or modify the network behavior with inputs^{4}. This turns out to be very related to the idea of selective attention via gain modulation: for example a two-population network can implement two tasks by having contextual inputs selectively decrease the gain of the “irrelevant” population in each context. Mean-field theory shows that the “gain” can be decreased without any complex synaptic mechanisms, simply by setting neurons’ activity to the flat part of their non-linear transfer function^{5}.
The connectivity space gives many more insights into the relation between connectivity and dynamics. For example, by introducing certain symmetries in the connectivity space, we can obtain related symmetries in the dynamics, and implement symmetric attractors like rings and spheres, or polyhedral patterns of fixed points^{4}. And it has probably many more secrets to reveal.
What do low-rank RNNs tell us about “normal” networks?
At this point, you might think that low-rank RNNs are an interesting subject per se, but still be skeptical about their concrete applications to neuroscience or machine learning. Indeed, they are not particularly easy to train, and machine learning people use RNNs that are full-rank for concrete applications. And if you are a neuroscientist, you might have arguments to believe the brain is not a low-rank network: its neurons are spiking and not rate neurons, they have non-symmetric transfer functions, sparse connections, excitatory and inhibitory populations…
Many of these points are being worked out to push low-rank RNNs into the real world, and turn them into practical tools. Let us first tackle the machine learning-side concerns. Do low-rank networks teach us anything about more standard full-rank networks? A first answer is that standard networks have a secret low-rank life! Indeed, when training full-rank networks one can verify that the learned part of their connectivity can be approximated with a matrix that has a very low rank, without any loss of performance^{6}. This shows that low-rank connectivity might arise as a natural solution to computational problems.
Moreover, ongoing research tends to show that full-rank networks can be reverse-engineered very well by training low-rank RNNs to reproduce their activity (Valente et al in prep).
On the biological side of things, a recently published preprint shows that low-rank RNNs can be made sparse while keeping their properties! Ongoing research shows that positive transfer functions, spiking neurons and Dale’s law can all be added to low-rank RNNs without affecting their computational properties. This is of course to be continued, and there are many more problems to solve if we want the insights of low-rank RNNs to go further in neuroscience, but we hope these results can convince you of their interest in biological modeling.
Here is a quick summary of the research carried on low-rank networks these last few years, hoping you can find the paper that answers your questions:
Mastrogiuseppe & Ostojic 2018^{2}: in this paper that started the research direction, Francesca and Srdjan introduce the dynamic mean-field theory for low-rank RNNs with a fixed random full-rank noise in the connectivity. They show that these networks exhibit low-dimensional spontaneous dynamics, and they exploit a Gaussian parametrization of connectivity space to build example networks that solve interesting computational neuroscience tasks.
Mastrogiuseppe & Ostojic 2019^{7}: here, Francesca and Srdjan relate the low-rank framework to reservoir computing, and in particular to the FORCE paradigm. Notably, they show how the mean-field theory introduced in the preceding paper can be applied to networks trained on a fixed-point task using the least-squares and recursive least-squares methods.
Schuessler et al. 2020a^{4}: in (Mastrogiuseppe & Ostojic 2018), the low-rank structure was independent of the fixed random connectivity noise (in a probabilistic sense). Here, Friedrich and his co-authors study the case where the low-rank structure is correlated with the full-rank noise, showing that richer and more interesting dynamics arise in this case.
Schuessler et al. 2020b^{6}: Friedrich and his co-authors study the dynamics of RNNs trained on a range of cognitive tasks, both experimentally and theoretically. They show that the learned part of the RNN connectivity can be well approximated by a low-rank matrix, and that this phenomenon can be explained by analytical results on low-rank RNNs.
Susman, Mastrogiuseppe et al. 2021^{8}: Lee, Francesca and co-authors apply a similar paradigm as in (Mastrogiuseppe & Ostojic 2019) to reservoir networks, both open-loop and with a rank-one feedback loop, trained to reproduce a sinusoidal signal. They reveal a “resonance” phenomenon for these networks, analytically deriving a very simple expression for the preferred frequency, and revealing interactions between task and connectivity properties.
Beiran et al. 2021a^{9}: Manuel and co-authors extend the Gaussian parametrization of connectivity space to mixture-of-Gaussians distributions. They show the universality of such networks, and make explicit which dynamics can be obtained with a single or with several Gaussian components through a detailed mean-field theory. They also show how symmetries in connectivity space translate to symmetries in the dynamics of the network, showing in particular how to build networks with symmetric families of attractors.
Dubreuil, Valente et al. 2022^{5}: Here, Alexis, Adrian and co-authors focus on training low-rank RNNs to do particular tasks and reverse-engineering the obtained solutions. This method reveals that some computational tasks can be accomplished with a single population while others rely on several populations (meaning a mixture with several components in connectivity space, as studied in (Beiran et al. 2021)). In particular, the tasks that require several populations are those that need a flexible input-output mapping, which can be explained through a gain-modulation mechanism.
Beiran et al. 2021b^{10} (preprint): Manuel and co-authors train RNNs to perform flexible timing tasks, both full-rank and low-rank, and study their generalization abilities. These analyses show that low-rank RNNs can generalize better when they rely on tonic inputs, which, as mentioned above, can predictably modify the network dynamics. They also reverse-engineer the low-rank solutions, showing how networks rely on slow manifolds to implement their tasks.
Valente et al. 2022^{11}: Adrian and co-authors study the relationship between a classical latent dynamics model, the latent LDS (latent linear dynamical system), and linear low-rank RNNs. Although very similar, the two turn out to be technically quite different. The authors show theoretically and experimentally that they are equivalent when the number of neurons is much higher than the dimensionality of the dynamics.
Herbert & Ostojic 2022^{12} (preprint): Here, Elizabeth and Srdjan add an element of biological plausibility to low-rank RNNs by showing that they can be made sparser while keeping their interesting properties. Random matrix theory is used to study the effect of sparsity on the eigenspectra of connectivity matrices, and some results of (Mastrogiuseppe & Ostojic 2018) are retrieved with sparse networks.
For an informal introduction through a youtube video, see here. For more formal reviews, see for example Yuste’s historic perspective (Yuste, R. (2015). From the neuron doctrine to neural networks. Nature reviews neuroscience, 16(8), 487-497.) or this more recent review (Vyas, S., Golub, M. D., Sussillo, D., & Shenoy, K. V. (2020). Computation through neural population dynamics. Annual review of neuroscience, 43, 249.). ↩
Mastrogiuseppe, F., & Ostojic, S. (2018). Linking connectivity, dynamics, and computations in low-rank recurrent neural networks. Neuron, 99(3), 609-623. ↩ ↩^{2}
Notably in the foundational Hopfield networks paper (Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences, 79(8), 2554-2558.), but also in position-encoding circuits (Seung, H. S. (1996). How the brain keeps the eyes still. Proceedings of the National Academy of Sciences, 93(23), 13339-13344.) or in learning procedures like FORCE (Sussillo, D., & Abbott, L. F. (2009). Generating coherent patterns of activity from chaotic neural networks. Neuron, 63(4), 544-557.) or the dynamics of gradient descent (Saxe, A. M., McClelland, J. L., & Ganguli, S. (2019). A mathematical theory of semantic development in deep neural networks. Proceedings of the National Academy of Sciences, 116(23), 11537-11546.). ↩
Schuessler, F., Dubreuil, A., Mastrogiuseppe, F., Ostojic, S., & Barak, O. (2020). Dynamics of random recurrent networks with correlated low-rank structure. Physical Review Research, 2(1), 013111. ↩ ↩^{2} ↩^{3}
*Dubreuil, A., *Valente, A., Beiran, M., Mastrogiuseppe, F., & Ostojic, S. (2022). The role of population structure in computations through neural dynamics. Nature Neuroscience, in press. ↩ ↩^{2}
Schuessler, F., Mastrogiuseppe, F., Dubreuil, A., Ostojic, S., & Barak, O. (2020). The interplay between randomness and structure during learning in RNNs. Advances in neural information processing systems, 33, 13352-13362. ↩ ↩^{2}
Mastrogiuseppe, F., & Ostojic, S. (2019). A geometrical analysis of global stability in trained feedback networks. Neural computation, 31(6), 1139-1182. ↩
Susman, L., Mastrogiuseppe, F., Brenner, N., & Barak, O. (2021). Quality of internal representation shapes learning performance in feedback neural networks. Physical Review Research, 3(1), 013176. ↩
Beiran, M., Dubreuil, A., Valente, A., Mastrogiuseppe, F., & Ostojic, S. (2021). Shaping dynamics with multiple populations in low-rank recurrent networks. Neural computation, 33(6), 1572-1615. ↩
Beiran, M., Meirhaeghe, N., Sohn, H., Jazayeri, M., & Ostojic, S. (2021). Parametric control of flexible timing through low-dimensional neural manifolds. bioRxiv. ↩
Valente, A., Ostojic, S., & Pillow, J. (2022). Probing the relationship between linear dynamical systems and low-rank recurrent neural network models. Neural Computation, in press. ↩
Herbert, E., & Ostojic, S. (2022). The impact of sparsity in low-rank recurrent neural networks. bioRxiv. ↩
Disclaimer: don’t set your expectations too high, this is just a grad student’s attempt at making sense of his frenetic field.
An algorithm is then trained to match each corrupted item to the corresponding original one. One example application is the training of modern language models like GPT-3 (see last year’s post): they are usually trained on huge swaths of text where some words are masked, and a deep network has to find the most probable words for filling the gaps. This approach was actually used in most natural language processing innovations since 2013, and was notably critical in developing word embeddings like word2vec or GloVe.
More recently, these methods have been used in computer vision, as with the SimCLR network published by Chen et al., which was the state-of-the-art self-supervised vision algorithm in 2020. You can read more about it in the excellent post by Amit Chaudhary. More recent developments include the SwAV method developed at Inria and Faceb… sorry, Meta.
Imitation learning: As its name suggests, imitation learning is a form of learning where an algorithm has access to an expert exhibiting the exact desired behavior that we want to reproduce. This is particularly used in a reinforcement learning context, in particular when the rewards are too sparse for an agent to retrieve a meaningful signal out of them. While imitation learning is still quite a niche among the huge corpus of ML research, it seems an important aspect of many forms of intelligence (and particularly PhD student intelligence: what would we be without postdocs and PIs that we can try to imitate?), and is one of the promising ways to solve challenges in complex and highly hierarchical environments. It was featured these last few years in the MineRL competition which aims at developing agents capable of mining a diamond in the video game Minecraft. Contrary to many other challenges solved by AI, this involves extremely sparse rewards, unattainable without careful planning conducted over hours of playing.
Lottery tickets: one of the most fascinating subjects of the last few years in the domain of ML, which I already covered last year here. It shook the community in many ways, showing how small networks that happened to have the right initialization weights (Frankle and Carbin said they had “won the initialization lottery”) were able to match the performance of networks 10 times larger here, or even that pruning could replace gradient descent as a learning algorithm (!!!) here (see also “torch.manual_seed(3407) is all you need” and watch deep learning researchers sweat). It kept leading to interesting developments and made its way closer to theoretical neuroscience, in particular through the work of Brett Larsen, which linked it to dimensionality in parameter space, showed the existence of a critical dimension above which training can succeed in a random subspace, and gave insights into how lottery tickets could be built. This all shows how little we know yet about learning in networks.
Self-supervised learning: This is really THE buzzword of the last few years, THE subject that exploded inside the community of machine learning, driven by the unbridled enthusiasm of many leading researchers like Yann LeCun, who recently called it the dark matter of intelligence. You can see below, in a small count of published papers I put together, how its popularity exploded and may well reach its peak soon.
Now, what is this new frenzy about? Although hard to define exactly, self-supervised learning seems to consist in the application of supervised learning methods without any use of manually annotated data. That is, a self-supervised method must contain a way to generate labels from a raw dataset, and then train a supervised algorithm on that data. Given that it works on unlabelled data, self-supervised learning can technically be considered as a subset of unsupervised learning.
An archetypal example is found in the field of natural language processing, where both word embeddings (like word2vec, GLoVe) and general-purpose models like GPT-3 are trained by the following procedure: take a huge corpus of text, remove some words, and train a standard feedforward network to predict the most probable words for each gap. You will get a model which has a quite fine understanding of the relationships between words in written language, able among other things to generate text or perform question answering, and all that without any human intervention. Isn’t that wonderful?
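The label-generation step is simple enough to sketch in a few lines. Here is a toy (and deliberately naive) Python illustration of how masked-word training pairs can be produced from raw text alone, with no human annotation; real pipelines use subword tokenizers and far larger corpora.

```python
import random

def make_masked_examples(text, mask_rate=0.15, seed=0):
    """Turn raw text into a self-supervised training pair: each selected
    position is replaced by '<mask>' and its original word becomes the
    prediction target -- labels generated from the data itself."""
    rng = random.Random(seed)
    tokens = text.split()
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked[i] = "<mask>"
            targets[i] = tok
    return masked, targets

masked, targets = make_masked_examples(
    "the quick brown fox jumps over the lazy dog", mask_rate=0.3)
print(masked)   # tokens with some positions replaced by '<mask>'
print(targets)  # position -> original word, to be predicted by the network
```

A network trained to fill those gaps at scale is exactly what yields word embeddings and GPT-style models.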
I say a little more about the use of self-supervised learning in the realm of computer vision in the “contrastive learning” entry of this post. Contrastive learning is indeed one of the main approaches to these methods. Other applications exist, for example in the domain of video: one can train an algorithm on a dataset of videos simply by asking it to predict the next frame, or in time-series in general. It will then be able to generate similar data or create predictions “online”.
Now, if you are a pure neuroscientist, should this trend matter to you? I would say that self-supervised learning has already made its way into neuroscience, for example through the work of Stefano Recanatesi in Nature Communications, who precisely trained a model to generate predictions about its future states as it explored an artificial environment. This is an exact application of self-supervised learning, which reveals interesting latent representations in the obtained model, like some form of grid cell code. But more generally, self-supervised learning is by no means a stranger to the worlds of neuroscience and psychology, where it has been known under different terms, like the extremely famous “predictive coding” framework, and probably many others which I don’t know about. It is interesting to see that this old, often theoretical, idea about brain function has become successful and practical among computer scientists, and could come back to neuroscience to shed light on biological intelligence, just as deep learning and RNNs did before.
Semi-supervised learning: Paradigm that tries to make the best of a big dataset of which only a small portion is labelled. This can occur in quite a lot of settings: for example, one could have access to a huge bank of images, but a limited amount of human workforce to label them and train a model. Another situation is automatic translation between languages for which few bilingual texts exist, but where we have large corpora in each of the two languages. The methodology often goes by the motto “unsupervised pre-train, supervised fine-tune”, which means that you start by applying a self-supervised algorithm to the large corpus while ignoring the labels, and then fine-tune your model on the small labelled portion. Moreover, you can take the same pre-trained model and fine-tune it on different domains of expertise: for example, a language model can then be applied to generate word embeddings, perform question answering, or generate poetry. This method also applies very well in computer vision, and all the better with overparametrized models, hence the title of the paper presenting the SimCLRv2 algorithm, “Big self-supervised models are strong semi-supervised learners” (which needs only 1% of ImageNet labels to achieve very good performance). Or in other words, these models have learned to learn quickly.
Other things that didn’t make it here: research on the use of manifolds in neuroscience and AI, the use of topological methods like persistent homology (see this), progress in BCI with for example Frank Willett’s performance in decoding handwriting from motor cortex here, transformers for computer vision…
To understand a tool, I like to understand the problem it solves, and CCA is already peculiar from this point of view, since one can come at it as the answer to many different questions (actually 3 big ones).
Let us first consider the question of correlation. When one has two scalar variables in a dataset, $\{x_i\}$ and $\{y_i\}$, for example vaccination status and hospitalization status, it is easy to check whether they are related by computing their Pearson correlation coefficient, given by:

$$ r_{xy} = \frac{\sum_i (x_i - \overline{x})(y_i - \overline{y})}{\sqrt{\sum_i (x_i - \overline{x})^2}\sqrt{\sum_i (y_i - \overline{y})^2}} $$

(with $\overline{x}$ and $\overline{y}$ representing the means of $x$ and $y$). Everybody knows of course that this is a dangerous tool to be used with extreme care, but it is also unavoidable in science, and a good first step in exploring datasets.
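As a quick sanity check, here is a minimal numpy implementation of Pearson's coefficient on synthetic data with a known correlation, compared against numpy's built-in:

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: covariance of x and y divided by the product
    of their standard deviations (the data is centered first)."""
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

rng = np.random.default_rng(1)
x = rng.standard_normal(500)
y = 0.8 * x + 0.6 * rng.standard_normal(500)  # population correlation is 0.8

print(pearson(x, y))            # close to 0.8
print(np.corrcoef(x, y)[0, 1])  # numpy agrees
```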
Now, let us say one wants to compare not two scalar random variables, but two random vectors in a dataset, $\{\mathbf{x}_i\}$ and $\{\mathbf{y}_i\}$. Sometimes the coordinates of those vectors have a definite role and can be matched. For example, if you are comparing the multivariate scores of a parent and their child on some psychological test, it makes sense to compute a correlation for each coordinate of the test, and then consider an average correlation. This is also a good solution when one wants to verify the fit of a model to some experimental data, when each variable of the model specifically targets one of the measured variables.
But as one moves towards more complex models that represent the interaction of many simple entities, as a neural network does, one may lose track of individual variables when trying to capture global behavior. For example, let us say we have an artificial neural network model that we want to match to a neural recording. A first solution is to match each neuron of the recording to one “best-fitting” neuron in the model, as is done in most of Yamins and DiCarlo’s works. This solution requires extreme care to avoid weird statistical biases. It would be nice to have a number, like the $r$ correlation coefficient, that summarizes the “similarity” of unaligned data.
One way to look at it is by thinking of two shapes in 3D space that are not necessarily aligned. It would be nice to have a summary value that tells how similar the shapes are when they are “as aligned as possible”. That is exactly what CCA does. But before explaining how, let’s come at it from another perspective.
Let us again consider the problem of the two shapes in 3D, but this time consider the two shapes in figure 2: if you look at the points from above, they look exactly the same, but when looking from the side, you see how they differ: the depth values of the points are all scrambled. A two-dimensional being looking at these shapes from random perspectives might be interested in a method telling it that they look identical when seen from above. This is exactly the problem that CCA solves: it finds directions along which the two sets of vectors are maximally correlated. From this perspective one can understand it either as a dimensionality reduction method, telling you which projections of the data to look at to find the most correlation, or as an alignment method. It is also obvious how this relates to our first question: once the shapes are aligned axis by axis, it is straightforward to compute their correlation.
To summarize, CCA can be seen as three things :
NB: for the alignment purpose, it is important to note that CCA works for datasets that are matched (ie. each sample $\mathbf{x}_i$ corresponds to exactly one sample $\mathbf{y}_i$). If they aren’t matched, and are just two a priori unrelated point clouds, one has to look at Procrustes analysis instead.
As we mentioned, CCA takes as input two linked sets of vectors $\{\mathbf{x}_i\}$ and $\{\mathbf{y}_i\}$, with the $\mathbf{x}$’s and $\mathbf{y}$’s of two not necessarily equal dimensionalities $d_1$ and $d_2$. We can put them into two data matrices $\mathbf{X} \in \mathbb{R}^{N \times d_1}$ and $\mathbf{Y} \in \mathbb{R}^{N \times d_2}$. Note that contrary to the dimensionality, the number of samples $N$ has to be equal across the two datasets (as is the case for Pearson’s correlation coefficient as well). Also note that in everything that follows, I will consider the data to be centered (the means of $\mathbf{x}$ and $\mathbf{y}$ are 0 on each coordinate).
The goal of the method is to first find a direction $\mathbf{a}_1 \in \mathbb{R}^{d_1}$ and a direction $\mathbf{b}_1 \in \mathbb{R}^{d_2}$ such that $\mathbf{X}$ and $\mathbf{Y}$ projected on these directions are maximally aligned, ie. such that:

$$ \mathbf{a}_1, \mathbf{b}_1 = \underset{\mathbf{a},\, \mathbf{b}}{\operatorname{argmax}}\ \operatorname{Pearson}(\mathbf{X}\mathbf{a}, \mathbf{Y}\mathbf{b}) $$
Once we have found this first pair of directions, we may wish to repeat the process on the rest of the data, ie. $\mathbf{X}$ projected on the orthogonal complement of $\mathbf{a}_1$ and $\mathbf{Y}$ projected on the orthogonal complement of $\mathbf{b}_1$. This gives two new directions, on which the correlation will be equal or lower than the first one, and so on until we obtain $m$ pairs of directions, on which the two datasets are progressively less and less correlated.
In the end, CCA’s output is quite similar to PCA’s: we get two orthogonal matrices $\mathbf{A}$ and $\mathbf{B}$, of respective shapes $d_1 \times m$ and $d_2 \times m$ where $m = \min(d_1, d_2)$, which map the original datasets $\mathbf{X}$ and $\mathbf{Y}$ to a common subspace where they are maximally aligned, via the transformations $\mathbf{X}\mathbf{A}$ and $\mathbf{Y}\mathbf{B}$. We also obtain a series of $m$ numbers $1 \geq \rho_1 \geq \dots \geq \rho_m \geq 0$, which are the decreasing correlation coefficients of each of the aligned directions: $\rho_i = \operatorname{Pearson}(\mathbf{X}\mathbf{a}_i, \mathbf{Y}\mathbf{b}_i)$.
The easiest way to make CCA work is to install the excellent statsmodels library for Python and use its CanCorr class. A few lines of code will be more telling:
from statsmodels.multivariate.cancorr import CanCorr

cc = CanCorr(X, Y)  # X and Y are numpy arrays, of shapes (n, d1) and (n, d2)
A = cc.y_cancoef    # projection matrix for X (the first argument)
B = cc.x_cancoef    # projection matrix for Y (the second argument)
X_al = X @ A        # the two aligned datasets
Y_al = Y @ B
print(cc.cancorr)   # these are the ordered canonical correlations
That way, CCA performs its role as a dimensionality reduction and alignment tool. A summary similarity statistic between the two datasets can easily be obtained by taking for example the average of the canonical correlations, but it is of course more telling to keep the whole distribution of these correlations.
If you wish for more details about how the algorithm works, you can find two very nicely explained derivations here. I cannot do any better than this report, so I will instead try to give some personal thoughts about how to interpret this algorithm.
I am going to look at the SVD implementation to get some insights. Briefly, the algorithm takes the SVD of each dataset:

$$ \mathbf{X} = \mathbf{U}_1\mathbf{S}_1\mathbf{V}_1^T, \qquad \mathbf{Y} = \mathbf{U}_2\mathbf{S}_2\mathbf{V}_2^T $$
Then it takes the SVD of the matrix $\mathbf{U}_1^T\mathbf{U}_2$:

$$ \mathbf{U}_1^T\mathbf{U}_2 = \mathbf{U}\mathbf{S}\mathbf{V}^T $$
and finally $\mathbf{S}$ here contains the canonical correlations while $\mathbf{A} = \mathbf{V}_1\mathbf{S}_1^{-1}\mathbf{U}$ and $\mathbf{B} = \mathbf{V}_2\mathbf{S}_2^{-1}\mathbf{V}$. This looks awfully complicated, but it actually makes a lot of sense. Let me explain.
The way PCA typically works is by considering the diagonalization, which happens to also be the SVD, of the covariance matrix of some data: $\mathbf{X}^T\mathbf{X} = \mathbf{U}\mathbf{S}\mathbf{U}^T$. When looking at two datasets, it seems natural to do an SVD of the covariance $\mathbf{X}^T\mathbf{Y}$. This can perfectly be done and will give some insights, but the problem is that it mixes the internal variance of each dataset with their correlations: if one axis of $\mathbf{X}$ has a huge variance compared to the others, it will dominate the SVD of the covariance, even if it is not particularly correlated with any direction of $\mathbf{Y}$.
The most reasonable workaround is to start by “whitening” each dataset, that is, applying a linear transform so that (i) each coordinate of the data is independent of the others (orthogonality) and (ii) each coordinate of the data has a variance of 1 (normalization). Multiplying by $\mathbf{V}_1\mathbf{S}_1^{-1}$ corresponds exactly to those two steps, so that $\mathbf{U}_1$ and $\mathbf{U}_2$ are simply the whitened versions of $\mathbf{X}$ and $\mathbf{Y}$. It then suffices to apply the SVD to the covariance matrix of these two whitened matrices to obtain the canonical correlations (and one can show that those singular values are contained between 0 and 1). Finally, $\mathbf{A}$ simply works by applying first the whitening transform, and then the transformation into the left canonical basis found by the covariance SVD, and similarly for $\mathbf{B}$. This whitening step is actually very similar to the notion that Pearson’s correlation is simply the covariance of whitened data, as explained in my previous post.
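The whitening-then-SVD recipe fits in a dozen lines of numpy. Here is a minimal sketch (it assumes both datasets are full rank, so that $\mathbf{S}_1$ and $\mathbf{S}_2$ are invertible), tested on two datasets sharing one latent signal:

```python
import numpy as np

def cca(X, Y):
    """CCA via whitening + SVD: whiten each dataset, then take the SVD
    of the covariance of the whitened versions."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    U1, S1, V1t = np.linalg.svd(X, full_matrices=False)  # whitened X is U1
    U2, S2, V2t = np.linalg.svd(Y, full_matrices=False)  # whitened Y is U2
    U, S, Vt = np.linalg.svd(U1.T @ U2, full_matrices=False)
    A = V1t.T @ np.diag(1 / S1) @ U       # whitening, then rotation
    B = V2t.T @ np.diag(1 / S2) @ Vt.T
    return A, B, S                        # S holds the canonical correlations

rng = np.random.default_rng(0)
Z = rng.standard_normal((1000, 1))                 # shared latent signal
X = np.hstack([Z, rng.standard_normal((1000, 2))])
Y = np.hstack([Z, rng.standard_normal((1000, 3))])
A, B, rho = cca(X, Y)
print(rho)  # first canonical correlation near 1, the others near 0
```

The first canonical pair recovers the shared signal; the remaining correlations only reflect chance alignment of the noise dimensions.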
I have to admit however that I am still not convinced that this whitening step is always desirable when we are comparing datasets, and I would say it is still useful to keep the “SVD on covariance” algorithm in mind when treating data. The amazing Kornblith paper I mentioned in the introduction actually tackles part of this question: in their framework, the “SVD on covariance” algorithm is called Linear CKA (Centered Kernel Alignment), and avoids some pitfalls, one of which I will mention in the next paragraph.
As with any good statistical method, there are probably a million ways CCA can be misused. I noticed one while using it with high-dimensional data, which I will illustrate here: take two perfectly random matrices $\mathbf{X}$ and $\mathbf{Y}$ of increasing dimensionality, as in the code below, and you will observe something similar to figure 3.
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.multivariate.cancorr import CanCorr

N = 2000
fig, ax = plt.subplots(1, 3, figsize=(12, 4))
for i, d in enumerate((50, 200, 1000)):
    X1 = np.random.randn(N, d)  # two completely independent datasets
    X2 = np.random.randn(N, d)
    cc = CanCorr(X1, X2)
    ax[i].bar(x=np.arange(d) + 1, height=cc.cancorr, width=1.)
    ax[i].set_ylim(0, 1)
    ax[i].set_xlabel('component')
    ax[i].set_ylabel('can. corr.')
    ax[i].set_title(f'd={d}')
As we can see, when the data becomes very high dimensional, the highest canonical correlations will become as high as 1, probably because it becomes likely that among all these possible dimensions the algorithm will find some that match well between the two datasets (like you would easily find two people with the same birthday in a big enough group).
The simple workaround for this issue is to apply a first step of PCA to each of the datasets, forming $\tilde{\mathbf{X}}$ and $\tilde{\mathbf{Y}}$ by keeping only the first $m$ principal components of $\mathbf{X}$ and $\mathbf{Y}$, where $m$ is a reasonably small number chosen by the user. This keeps the “main signal” part of the data, as it is likely that for very high dimensional data most of the dimensions simply contain noise anyway. With the 1000-dimensional datasets above, choosing $m=10$ for example brings the canonical correlations back to the expected null values. This method is called SVCCA, and is summarized in a nice paper (Raghu et al. 2017).
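A minimal sketch of this PCA-then-CCA workaround, reusing statsmodels' CanCorr as in the snippets above, applied to the same pure-noise setting:

```python
import numpy as np
from statsmodels.multivariate.cancorr import CanCorr

def svcca_corrs(X, Y, m=10):
    """SVCCA sketch: reduce each dataset to its top-m principal components
    before running CCA, to suppress spurious high-dimensional correlations."""
    def top_pcs(Z, m):
        Z = Z - Z.mean(axis=0)
        U, S, Vt = np.linalg.svd(Z, full_matrices=False)
        return Z @ Vt[:m].T          # scores on the first m components
    return CanCorr(top_pcs(X, m), top_pcs(Y, m)).cancorr

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 1000))  # pure noise, as in figure 3
Y = rng.standard_normal((2000, 1000))
print(svcca_corrs(X, Y, m=10))  # all correlations now close to 0, as expected
```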
Globally however, since the algorithm maximizes its objective, it seems reasonable to think that it will always overestimate the real underlying correlations (a classic case of maximization bias). Mmh… Keep safe when doing stats everyone!
(Gallego & Perich et al 2020): Long-term stability of cortical population dynamics underlying consistent behavior, Nature Neuroscience, 2020
(Kornblith et al 2019): Similarity of Neural Network Representations Revisited, ICML 2019
(Maheswaranathan & Williams 2019): Universality and individuality in neural dynamics across large populations of recurrent networks, NeurIPS 2019
(Raghu et al. 2017): SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability, NIPS 2017
(Semedo et al 2019): Cortical Areas Interact through a Communication Subspace, Neuron, 2019
Disclaimer: I have the derivations for all that’s written here. At some point I might find the time to add them in an appendix to the post.
Given a two-dimensional cloud of points, one can choose to fit several lines to it. Let us consider some data $\{(x_i, y_i)\}_{1 \leq i \leq n}$. We will assume it is centered, ie. $\sum x_i = 0 = \sum y_i$, without loss of generality. Sometimes I will assume standardized data (ie. such that $\operatorname{Var}[x] = \operatorname{Var}[y] = 1$), which gives more elegant formulas, but will mark all formulas that hold only in this case in blue to avoid confusion.
One could wish to predict $y$ from $x$, thus performing a linear regression of $y$ onto $x$. This corresponds to fitting the $\beta_0$ parameter in the model:

$$ y = \beta_0 x + \epsilon $$

where $\epsilon \sim \mathcal{N}(0, \sigma^2)$, while minimizing the error variance $\sigma^2$. This corresponds to a causal model where the target variable $y$ is generated from fixed values of $x$ and a source of noise (see figure 2). This gives the classical linear regression model, whose solution is given by:

$$ \hat{\beta}_0 = r_{xy}\frac{\sigma_y}{\sigma_x} $$
where $\sigma_x$ and $\sigma_y$ are the standard deviations of $x$ and $y$ (the square roots of their variances) and $r_{xy}$ is the Pearson correlation coefficient between $x$ and $y$, given in general by:

$$ r_{xy} = \frac{\sum_i (x_i - \overline{x})(y_i - \overline{y})}{\sqrt{\sum_i (x_i - \overline{x})^2}\sqrt{\sum_i (y_i - \overline{y})^2}} $$
And if the data is moreover standardized, this gives the very simple relationship:

$$ \hat{\beta}_0 = r_{xy} $$
Alternatively, one could want to consider the inverse prediction problem: predicting $x$ from $y$, with the model:

$$ x = \tilde{\beta} y + \tilde{\epsilon} $$
Naively one could think: “well, if $y \approx \beta_0 x$, then surely $x \approx \beta_0^{-1} y$, and we can simply take $\tilde{\beta} = \beta_0^{-1}$?”. Ayayay, nothing could be more wrong! To understand this, consider the simple example where $\beta_0 = 0$, which leads to $y = \epsilon$. In this situation, $x$ and $y$ are actually two independent random variables, so we can also say that $x$ does not depend on $y$ and conclude that $\tilde{\beta} = 0$ too. Actually, just as above, we can write the optimal solution for $\hat{\tilde{\beta}}$ as:

$$ \hat{\tilde{\beta}} = r_{xy}\frac{\sigma_x}{\sigma_y} $$
I am going to invert it so that we can plot it while still keeping $x$ on the abscissa and $y$ on the ordinate, so that we obtain our second fitting line with slope:

$$ \frac{1}{\hat{\tilde{\beta}}} = \frac{1}{r_{xy}}\frac{\sigma_y}{\sigma_x} $$
And we note that if the data is standardized we get:

$$ \frac{1}{\hat{\tilde{\beta}}} = \frac{1}{r_{xy}} $$
In this case we see that we obtain two different lines fitted to the same cloud of points, symmetric to each other with respect to the central diagonal $y=x$ (or $y=-x$ if our correlations and slopes are negative). Which means that… in the end, what is the best-fitting line for our cloud of points? Is any line in between those two also valid?
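The two slopes are easy to check numerically. Here is a small numpy sketch on synthetic data: the direct regression gives $r \sigma_y/\sigma_x$, the inverse regression, replotted in the $(x, y)$ plane, gives $\sigma_y/(r \sigma_x)$, and the two only coincide when $|r| = 1$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
y = 0.5 * x + rng.standard_normal(10_000)   # true beta_0 = 0.5

r = np.corrcoef(x, y)[0, 1]
sx, sy = x.std(), y.std()

beta_yx = np.polyfit(x, y, 1)[0]    # regression of y on x
beta_xy = np.polyfit(y, x, 1)[0]    # regression of x on y
print(beta_yx, r * sy / sx)         # these two agree
print(1 / beta_xy, sy / (r * sx))   # slope of the second line in the (x, y) plane
```

Note also that the product of the two raw slopes is exactly $r_{xy}^2$, which foreshadows the coefficient of determination discussed below.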
Actually, let us get rid of the causal models whereby $x$ generates $y$ or the opposite. After all, “data is all there is to science”, as Judea Pearl says. A third idea could be to perform a PCA on our cloud of points and consider its first component as the line that best explains our data. We can derive this first component in the case of standardized data (for non-standardized data, simply multiply by $\sigma_y/\sigma_x$) and obtain:

$$ \beta_{\text{PCA}} = \operatorname{sign}(r_{xy}) $$
Yes, amazingly, once we have standardized the data, one of the best-fitting lines turns out to simply have slope ±1. Then what is linear regression all about? What information do we get in the end when fitting a line?
Note how the three lines here are all optimal in a different way: the first one minimizes the distance from each point to the line along the $y$ axis, the second one minimizes the distance from each point to the line along the $x$ axis, and the last one minimizes the distance from each point to the line orthogonally to the line (the real “distance to the line” according to mathematicians).
I mentioned earlier the coefficient of determination, usually noted $r^2$. It is a number between 0 and 1 indicating how accurate the linear regression is, usually magically provided by linear regression routines. It is often explained as the “proportion of variance explained by the model”, which mathematically would be, if we take our first model (caution with this relation: for linear regression it happens that noise variance + model variance = $\operatorname{Var}[y]$, but this won’t necessarily be the case for other types of models):

$$ r^2 = \frac{\operatorname{Var}[\hat{\beta}_0 x]}{\operatorname{Var}[y]} = 1 - \frac{\sigma^2}{\operatorname{Var}[y]} $$
From this we obtain immediately:

$$ r^2 = \frac{\hat{\beta}_0^2 \sigma_x^2}{\sigma_y^2} = r_{xy}^2 $$
So the notation used for this coefficient is of course no coincidence! We find Pearson’s correlation coefficient hidden behind it again! Actually, a more general result is that for multivariate regression with several explanatory variables $\mathbf{x}_i = (x_{1,i}, \dots, x_{d,i})^T$ and one single target variable $y$, when the model $y = \boldsymbol{\beta}^T\mathbf{x} + \epsilon$ is fit, its coefficient of determination is the square of the Pearson correlation coefficient between $y$ and $\hat{y} = \hat{\boldsymbol{\beta}}^T\mathbf{x}$.
Note that we would obtain the same $r^2$ when trying to predict $x$ from $y$ instead. Now everything comes back together: the fact that we obtained two very different lines earlier was a manifestation of the fact that a linear model was rather imprecise for our data. What a linear regression does is assess how close the data is to a line, and this question is actually answered by the correlation coefficient $r_{xy}$. So in the end, the slope of a simple linear regression, when stripped of its scaling factor $\sigma_y/\sigma_x$, simply indicates its own goodness of fit, which I find to be an amazing phenomenon.
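The multivariate identity is also worth a numerical sanity check. This sketch fits an ordinary least-squares model (with an intercept, for which the identity holds exactly) on synthetic data and compares the $r^2$ computed from residuals to the squared correlation between $y$ and $\hat{y}$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(1000)

# OLS with an intercept column, so that r^2 = corr(y, y_hat)^2 holds exactly
X1 = np.column_stack([np.ones(len(y)), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
y_hat = X1 @ beta

r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
r_y_yhat = np.corrcoef(y, y_hat)[0, 1]
print(r2, r_y_yhat ** 2)  # the two numbers coincide
```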
Before going further, I wanted to add a few words on PCA. First, what problem does it solve, and how does it relate to the data generation process of $y$ and $x$? You may have heard that PCA solves the problem of finding a low-dimensional (here 1D) subspace that best explains our data. There are several ways to formalize it (see (Udell 2014), appendix A p.98). PCA actually optimizes the following latent variable problem:

$$\begin{pmatrix} x \\ y \end{pmatrix} = \mathbf{w} z + \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \end{pmatrix}$$
where $z \sim \mathcal{N}(0, 1)$ and $\begin{pmatrix}\epsilon_1 \\ \epsilon_2\end{pmatrix} \sim \mathcal{N}(0, \sigma^2\mathbf{Id})$. (Note that if we assume the noise terms can have correlations and a general covariance matrix $\Psi$ instead of $\sigma^2\mathbf{Id}$, then we obtain factor analysis instead of PCA. More on this in (Tipping & Bishop 1999).) In other words, it assumes that $x$ and $y$ are both generated by a common latent $z$ via two independent noisy processes.
Now, a note on fitting an ellipse to our data. As stated in appendix B, PCA works by performing an eigendecomposition of the covariance $\mathbf{X}^T\mathbf{X}$ of our data matrix (which I note $\mathbf{X}$, but whose rows are actually $(x_i, y_i)$ so that it encompasses the 2D data). The empirical covariance $\mathbf{X}^T\mathbf{X}$ is what one would use to fit a bivariate normal distribution to the data (this distribution would have $(0, 0)$ mean because our data is centered, and that matrix as covariance). An intuitive way to visualize multivariate Gaussian distributions is via their isoprobability curves, which happen to be ellipses (sometimes called confidence ellipses or standard deviation ellipses). The major and minor axes of these ellipses align with the first and second principal components of the data (which are eigenvectors of $\mathbf{X}^T\mathbf{X}$), and we find after derivations that, for standardized data, their widths are proportional to $2\sqrt{1 + |r_{xy}|}$ and $2\sqrt{1 - |r_{xy}|}$ respectively.
From this result, one could conclude that the proportion of error variance (as the variance on the minor axis divided by the variance on the major axis) is $\frac{1 - |r_{xy}|}{1 + |r_{xy}|}$.
Another interesting measure of dispersion used in PCA is the proportion of variance explained by the first component. This would be $\frac{1 + |r_{xy}|}{2}$.
Anyway, it is again Pearson’s correlation coefficient hiding there!
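For standardized 2D data the covariance matrix is just $\begin{pmatrix}1 & r \\ r & 1\end{pmatrix}$, whose eigenvalues are $1 + |r|$ and $1 - |r|$, which gives both the ellipse widths and the explained-variance ratio above. A quick numerical check (the correlation value is arbitrary):

```python
import numpy as np

r = 0.6  # arbitrary correlation value for illustration
C = np.array([[1.0, r], [r, 1.0]])  # covariance of standardized data

# Eigenvalues of the covariance, sorted from largest to smallest
eigvals = np.sort(np.linalg.eigvalsh(C))[::-1]
assert np.allclose(eigvals, [1 + abs(r), 1 - abs(r)])

# Widths of the confidence ellipse along its axes scale as 2*sqrt(eigenvalue)
major, minor = 2 * np.sqrt(eigvals)

# Proportion of variance explained by the first principal component
explained = eigvals[0] / eigvals.sum()
assert np.isclose(explained, (1 + abs(r)) / 2)
```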
The conclusion is that the correlation coefficient is everywhere, and that simple linear regression turned out to be a little more complex than we thought, but only to turn out to be even simpler in the end.
Let’s start with one of the most remarkable and buzz-generating AI releases of the year, namely the infamous GPT-3, the latest language model of a series developed by OpenAI. It is essentially a huge neural network with a Transformer architecture that is trained to accomplish any language task (question answering, text generation, chatbot…). Here when I say huge I really mean of mythological proportions: we are talking about a 175 billion-parameter model, which according to rumors was trained on something like 570 GB of text data, with a 5 to 10M$ budget for training alone! Although a beta access for selected researchers was opened this summer, the model remains unavailable to the general public, and details about it have been published in the NeurIPS paper “Language Models are Few-Shot Learners” by Brown et al..
The way it works is that you provide GPT-3 with a prompt (aka context) that typically consists of a few examples (2 or 3 are enough) of the task you want it to accomplish, or a simple question, or a text with questions about it, and GPT-3 will infer both what task it is supposed to do, and a text that fulfills the task. An example given in the paper is to give GPT-3 two example poems, and then a title and author name, as illustrated below. GPT-3 will then understand that it is supposed to generate a poem titled Shadows on the Way in the style of Wallace Stevens and throw out the best verses it can come up with (this is the few-shot learning mentioned in the paper’s title).
But what’s fascinating with language models are the endless possibilities they offer, and their capacity to amaze us in unpredictable, funny, or AI-will-take-over-the-world-scary ways. And GPT-3 is very, very good at this game. People have gotten it to generate startup ideas, share life advice from personalities, give creative writing lessons, share intimate worries about its own future, write a summary of experiments with itself and much more, the list could really go on forever. One particularity is it seems to abhor the sentence “I don’t know” and would rather improvise a plausible answer to anything (and it is quite an improvisational genius), unless you explicitly ask it for some honesty.
Despite being really good at spotting and imitating patterns, this model still seems to only be exploiting statistical regularities of an enormous language dataset, and doesn’t seem to actually understand the concepts it is talking about. It remains to be seen whether those quirks will always be around as long as we stick to purely statistical, system-1 AI, or if we just need to keep adding more parameters. In any case, I just hope we let the next language model choose itself a more inspiring pen name than GPT-4.
Meta-learning has been gaining a lot of traction recently, with researchers designing algorithms that find new neural network architectures or that rediscover backpropagation. However, researchers have always prompted their meta-learning algorithms to explore a restricted and pre-designed set of learning algorithms (e.g. neural nets). In a paper presented at the ICML conference, Google Brain researchers Esteban Real, Chen Liang et al. exhibited AutoML-Zero, an algorithm that rediscovers machine learning algorithms from basic computational bricks alone: vector operations, memory manipulation, and a few mathematical functions.
Their approach uses an evolutionary algorithm to rediscover such ML algorithms as linear regression or 2-layer perceptrons with backpropagation: a population of algorithms undergoes random mutations, and the best performing ones are selected at each generation. They remarkably end up exhibiting a really good performance on image recognition tasks and re-inventing “tricks of the trade” like gradient normalization or stochastic gradient descent.
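To give the flavor of the approach, here is a toy evolutionary loop in the same spirit (mutate a population, keep the fittest), applied to a much simpler problem (fitting a single coefficient) rather than evolving whole programs as AutoML-Zero does; everything here is a made-up illustration, not the authors’ setup:

```python
import random

random.seed(0)

def fitness(a):
    # Toy task: how well does y = a*x fit data generated with a = 3?
    data = [(x, 3 * x) for x in range(1, 6)]
    return -sum((y - a * x) ** 2 for x, y in data)

# Random initial population of candidate coefficients
population = [random.uniform(-10, 10) for _ in range(20)]
for generation in range(200):
    # Random mutations produce offspring
    offspring = [a + random.gauss(0, 0.5) for a in population]
    # Selection: keep the 20 fittest among parents and offspring
    population = sorted(population + offspring, key=fitness, reverse=True)[:20]

best = population[0]
assert abs(best - 3) < 0.1  # the population converges near the true coefficient
```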
So far all meta-learning research has focused on re-inventing human ideas. It will be really interesting to see if they can come up with ideas of novel algorithms some day.
Another mind-blowing AI news this year has been DeepMind’s announcement of their AlphaFold 2 model (no paper available yet) which really nailed the CASP protein-folding competition achieving a score higher than 80/100, more than a 2-fold improvement on pre-AlphaFold 1 models. Although I do not understand the full implications of this discovery, or even what this score exactly means, it is certainly uplifting to see AI tackle a very concrete problem in another science. Let’s hope for an AlphaCold model next to find a solution to global warming since humans don’t seem very good at it (although it would probably just end up designing new bat viruses if it’s smart enough. Meh, everything has its perks…).
DeepMind has decidedly had a productive year, also releasing a new paper in their reinforcement learning series (although this one had been on arXiv since 2019). The gist of it is that reinforcement learning algorithms so far have mostly been divided into model-based algorithms and policy-gradient ones, at least until MuZero implemented ideas to reconcile those two approaches.
To understand this, here is a super quick recap of the reinforcement learning formalism: an agent evolves in an environment, which we can generally understand as something being in a certain state \(s_t\) at each timestep (the state can be modelled for example by a vector). The agent has to choose an action \(a_t\) at each timestep from its set of possible actions, and this will lead the environment to move to another state \(s_{t+1}\), and maybe give a reward \(r_t\) (positive or negative) to the agent. The objective of the agent is to maximize the total reward during a trial, \(\sum_{t=0}^T r_t\) (sometimes using discounted rewards, but let’s not go into the details here).
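In code, the interaction loop looks like this (a made-up one-dimensional environment with a random policy standing in for the agent, just to fix the formalism):

```python
import random

random.seed(0)

class WalkEnv:
    """Hypothetical toy environment: a position on a line, rewarding moves to the right."""
    def __init__(self):
        self.state = 0

    def step(self, action):              # action is -1 or +1
        self.state += action             # transition to the next state s_{t+1}
        reward = 1 if action == 1 else 0 # reward r_t for this step
        return self.state, reward

env = WalkEnv()
total_reward = 0
for t in range(100):
    action = random.choice((-1, 1))      # a random policy, standing in for the agent
    state, reward = env.step(action)
    total_reward += reward               # the agent's objective: maximize this sum

assert 0 <= total_reward <= 100
```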
Model-based algorithms rely on a model of the exact rules of the environment. The algorithm has to learn 2 things: how the environment transitions from one state to another (mathematically, a mapping from a state and an action to the next state, \((s_t, a_t) \to s_{t+1}\)), and which states lead to more reward (a mapping from states to expected final reward, aka the value function, \(s_t \to v_t = \sum_{\tau > t} r_\tau\)). Once the algorithm has learned these mappings (which it does by exploring its environment), it can start exploitation, meaning choosing at each step the action that leads to the state with the highest value. Since the value function incorporates knowledge about future timesteps, the algorithm is naturally planning several steps ahead. Thanks to these planning capacities, these algorithms are very good for board games like chess and go, but they behave badly when the state space cannot be described succinctly, as in complex video games.
On the other hand, policy-gradient algorithms just give up trying to build a model of their environment and try to focus on getting instincts, predicting which action to take next depending on environment variables. They just learn a state -> action mapping which maximizes some performance measure. This led to the famous actor-critic algorithms like A3C which holds one of the best performances on a set of 57 Atari games often used in RL research, or AlphaStar which famously reached superhuman capacities at the StarCraft game in 2019.
The interesting novelty of MuZero is that it is able to combine the best of both worlds. For this, it relies on a hidden state representation of the environment (called \(s_t\) in the paper; I will call it here \(\hat{s}_t\) to emphasize that it is model-built), for which a dynamics model is learned, along with a policy and a value function. If we simplify it a little, the model essentially learns these three mappings jointly.
The fact that the state representations are learned from observations of the environment enables the algorithm to learn in very complex environments, but the fact that this state representation still exists gives the possibility of planning ahead, which is an interesting compromise. There is a bit more to it, but if I continue I will be longer than the paper itself.
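Schematically, the three learned mappings can be typed out as below. This is a sketch of the interfaces only: the names follow the paper’s representation/dynamics/prediction decomposition, but all bodies here are placeholder stubs, not the real networks.

```python
import numpy as np

# Representation: observation -> hidden state s_hat
def represent(observation):
    return np.tanh(observation)             # placeholder for a learned encoder

# Dynamics: (hidden state, action) -> (next hidden state, predicted reward)
def dynamics(s_hat, action):
    return np.tanh(s_hat + action), 0.0     # placeholder for a learned model

# Prediction: hidden state -> (policy, value)
def predict(s_hat):
    return np.array([0.5, 0.5]), 0.0        # placeholder policy/value heads

# Planning unrolls the dynamics model several steps ahead in hidden space:
s_hat = represent(np.zeros(4))
for action in (1, -1, 1):
    s_hat, r = dynamics(s_hat, action)
policy, value = predict(s_hat)
assert policy.shape == (2,)
```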
Lottery tickets were a remarkable theoretical insight of 2019, published by Frankle and Carbin. Briefly, it is a story that started as researchers were looking for ways to prune very big networks after training, in order to obtain more lightweight and faster networks whose performance could match that of the big one. The authors discovered that they could train a big network, prune it, reset the weights of the pruned network to their values before training, retrain, and at the end of this procedure obtain networks with the same performance as the big one with as little as 5% of the initial number of parameters. This phenomenon is probably caused by the fact that, by chance, the initial weights of this subnetwork were relevant to the task, so they dubbed their discovery the “lottery ticket hypothesis”, because the pruned subnetworks had won the “initialization lottery”.
This paper opened a whole new area of research, and led to even more surprising developments this year. New discoveries appeared at AAAI 2020, where Yulong Wang et al. presented Pruning from Scratch, a paper showing that the first step of the lottery ticket procedure (pre-training the network) is not necessary, and that winning tickets can be found directly in the randomly initialized network. The authors claim that their pipeline is extremely fast, although I have not found a clear comparison between the pruning + training procedure and a normal training of the big network.
But the most surprising development came at the CVPR conference, where Vivek Ramanujan, Mitchell Wortsman and others published What’s Hidden in a Randomly Weighted Neural Network?, showing that big networks actually contain subnetworks that are efficient at solving the task without any learning involved!! As a striking example, they claim that “Hidden in a randomly weighted Wide ResNet-50 we find a subnetwork (with random weights) that is smaller than, but matches the performance of a ResNet-34 trained on ImageNet” (to give you an idea, a Wide ResNet-50 contains roughly 69M trainable parameters and a ResNet-34 about 21M, so we get a performing network without any weight training). They present an algorithm to find these miraculous subnetworks, which I have to admit bears a lot of resemblance to gradient descent (it essentially involves updating scores for each connection by a rule based on a gradient, and then selecting for each layer the k% of connections having the highest score), although it is now entirely focused on pruning connections. Again, it would be interesting to know how this procedure compares to training a network that achieves the same accuracy. However, even if we put aside all practical considerations, this remains an important theoretical milestone in our understanding of deep networks. It is an entirely new form of learning algorithm, which underlines the importance of random structures in learning, and is of course quite reminiscent of the idea of “synaptic pruning” in the brain. It reminds me a lot of the “reverse learning” hypothesis formulated by Crick and Mitchison in 1983, according to which the brain learns by erasing unwanted memories during REM sleep. If pruning turns out to be enough, might it be the long-awaited learning algorithm of our brains?
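The selection step can be illustrated in isolation: keep, in each layer, only the k% of connections with the highest score. The scores and layer below are random stand-ins (the real method updates the scores with a gradient-based rule, which is not shown here):

```python
import numpy as np

rng = np.random.default_rng(0)

weights = rng.normal(size=(100, 100))    # a random, never-trained layer
scores = rng.normal(size=weights.shape)  # one score per connection (stand-in values)
k = 0.10                                 # keep the top 10% of connections

# Mask selecting the connections whose score is in the top k%
threshold = np.quantile(scores, 1 - k)
mask = scores >= threshold
pruned = weights * mask                  # the subnetwork: random weights, most zeroed out

assert np.isclose(mask.mean(), k, atol=0.01)  # about k% of connections survive
```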
But let’s not get carried away too far. A third impressive novelty 2020 has brought to this lottery ticket hypothesis is a flurry of theoretical bounds showing how big a network has to be in order to contain those magical winning subnetworks. They culminated in the similar concurrent discoveries published at NeurIPS by Orseau et al. and Pensia et al., who both proved that one can find a good subnetwork of width \(d\) and depth \(l\) by pruning a big random network of size \(\mathcal{O}(\mathrm{Poly}(d, l))\) (a polynomial in \(d\) and \(l\)), which is not that big!
Computer vision research always has a twist to amaze us, and here are some of my favourite releases of the year:
DeepFaceDrawing: Chen, Su et al. presented at SIGGRAPH an interesting project that generates realistic pictures from very rough drawings. I haven’t looked at the details, but it essentially looks like a GAN conditioned on an input sketch, with the sketches and realistic images both sharing a common abstract feature space. Now I can’t wait for “xkcd: the movie” to be generated with this network! (Project page)
Neural Re-Rendering of Humans from a Single Image: Sarkar et al. from the Saarland MPI presented at ECCV what looks like a new DL-based motion capture technology: take a still image of Alice, and a live footage of Bob dancing the macarena, and the network shall combine them into a live footage of Alice dancing the macarena (keeping Alice’s clothes and silhouette, in contrast with previous deepfakes which only applied a new face onto the footage). This builds on a lot of complex modules (pose estimation, texture inference and adversarial rendering of images), and gives overall very convincing results for pose transfer. (Project page)
PULSE: Self-supervised Photo Upsampling via Latent Space Exploration of Generative Models: this grandiloquent name hides a very nice project presented at CVPR by Menon, Damian et al. from Duke for generating high-res images from low-res ones (and maybe finally giving some credibility to those scenes where FBI agents zoom into the sunglass reflections of the bad guy to see who he was talking to). As you can see from the image below, it really works with only a bunch of pixels as input, and just like DeepFaceDrawing it can be understood as a conditional generative model with a common latent space for blurry and high-res images. Charles Isbell pointed to interesting biases of this network in his NeurIPS keynote, where a downsampled image of Obama would essentially generate an image of a white man with a tan. (Paper)
To my greatest regret I have to leave you here, while I am sure there are a lot of awesome stories I have missed. Feel free to reach me on twitter if you have any comments or want to point to a mistake!
Oh, and here are some robots taking over the dancefloor. Looks like I can learn from their leg twist!
This is what I thought for a long time, until I understood that parallelization in Python can actually be made extremely easy with a simple pipeline that you can apply to any bit of code in less than one minute. That’s right: with one minute of moving lines of code around, you will be able to win huuuge amounts of computation time!
In order to do this you need a loop that repeats the same operation several times, in which the result of an iteration is not needed in the next one, so that you can run all iterations in parallel. Typically, this applies very well to computations on Monte-Carlo or bootstrap samples, or optimization processes such as simulated annealing.
So let us take for example the following code which will simulate 10000 times a simple 2D random walk for 1000 steps and plot the distribution of the endpoints (not a very mesmerizing example but minimal enough for our purposes):
import seaborn as sns
import matplotlib.pyplot as plt
import random
T = 1000
endpoints = []
for i in range(10000):
    x = [0, 0]
    for t in range(T):
        x[0] += random.choice((-1, 1))
        x[1] += random.choice((-1, 1))
    endpoints.append(x)
x0, x1 = zip(*endpoints)
sns.jointplot(x0, x1, kind='hex')
plt.show()
Here, the outer loop is a clear target for parallelization, since it is just the same operation repeated many times (not the inner loop of course: you need one iteration to go to the next). So to parallelize it, simply copy-paste the code inside the loop into a separate function called task. Add all the needed variables as arguments and the ones that will be reused as return values:
def task(T):
    x = [0, 0]
    for t in range(T):
        x[0] += random.choice((-1, 1))
        x[1] += random.choice((-1, 1))
    return x
Now, only the 3 following lines are needed:
pool = mp.Pool(mp.cpu_count())
args = [1000] * 10000
endpoints = pool.map(task, args)
with the required import statements:
import multiprocessing as mp
The first one creates the Pool, which is the object that will distribute the tasks across CPU cores. You can give as argument the number of worker processes that you want running concurrently. Usually it is fair to go for the number of cores given by mp.cpu_count(). In the second line, you create an iterable of arguments, one for each task. Here it is simple: it is the same argument for all tasks, repeated the number of times we want the task to run. Finally, the last line will magically execute the task in parallel with the argument list you provided and return all the return values bundled in a list.
Important note: if your task has not one but several arguments, you will have to put the arguments in an iterable of tuples, and use the starmap function instead of map:
args = [(a, b)] * 100
res = pool.starmap(task, args)
Finally, to build the iterable less brutally, the functions of itertools can be useful, for example:
args = itertools.repeat((a, b), 100)
Here is the final code for the example program. Enjoy your fast computations!
import multiprocessing as mp
import seaborn as sns
import matplotlib.pyplot as plt
import random

def task(T):
    x = [0, 0]
    for t in range(T):
        x[0] += random.choice((-1, 1))
        x[1] += random.choice((-1, 1))
    return x

if __name__ == '__main__':  # required on Windows/macOS, where workers re-import this file
    pool = mp.Pool(mp.cpu_count())
    args = [1000] * 10000
    endpoints = pool.map(task, args)
    x0, x1 = zip(*endpoints)
    sns.jointplot(x0, x1, kind='hex')
    plt.show()
Final note: one could imagine that the overhead caused by creating and scheduling so many small tasks would become an issue. I tried to make a smarter program by, for example, submitting 8 tasks with balanced load instead of brutally asking for 10000 of them like above. This brings no improvement, suggesting that the multiprocessing module takes care of these optimizations by itself! (Indeed, the Pool only ever creates cpu_count() worker processes and dispatches the tasks to them in chunks.)
So let us make the presentations: git is software originally developed in 2005 by Linus Torvalds, who is also the lead developer of the Linux kernel. It has two main functions:
keeping a history of the versions of your code, so that you never again end up with files named model3_12022018.py;
synchronizing code between several collaborators.
For this last capacity, an online platform supporting git is needed, since the shared version of the code has to be stored on a server. The most well-known one is github.com, but there are many others like gitlab.com. Git and GitHub are often confused, but they are 2 distinct things: git is an open-source software, while GitHub is a website offering services for users of the git software.
git --version
which will check if it is already installed, or offer to install it.
Now we shall explain the concepts used by git with an example. Let us first create a repository, meaning a project in git’s vocabulary. For this, simply create a directory where you add a few code files. In my example I have 3 files in my directory: model.py, plot.py and data.npz. Now open a terminal (or git bash for Windows users), move to your project’s directory and type:
git init
Voilà: what was a mere directory has become a repository. This means that git is ready to handle a history of versions of your directory. Concretely, this history consists of a succession of what git calls commits. You can think of those commits as checkpoints or snapshots of the state of your project that you will always be able to retrieve. Let us create a first commit that will contain the initial state of our project.
git add *.py
git commit -m "initial commit"
Ok, so there are 2 steps to this operation. The first command is git add, and this is where you tell git which files you want to include in this commit. Indeed, maybe there are some files that you want git to ignore, for example a data file that is very heavy (in this case you should NOT include it), or some png plots. So in this first step you should only include what you want to track across versions, for example code files, notebooks, or maybe simulation parameters or results. Here we use the terminal wildcard * to add all files ending with .py. We will see later easier ways to select which files are tracked or not.
The second command effectively adds the commit to the git history. You have to add a small message after the -m option to remember what this commit was about (if you forget the -m option, git will force you to write a message by opening the vim text editor. It can be hard to exit this editor).
That’s it, now this initial version of your code is stored forever. Now let us say that the next day we have an idea to make better plots. We modify the file plot.py, and once we are satisfied we want to store the changes in a new commit. You can first look at what has been modified with:
git status
which should tell you that plot.py is tracked and has been modified, and that data.npz is not tracked (meaning git has never stored a version of it). I suggest always running git status before committing, to remember what you have modified since last time. That’s all good, we can add our new commit with:
git add plot.py
git commit -m "better plots"
Now there are 2 versions of your code. You can visualize the history of your project with:
git log --oneline --decorate
which outputs:
936d456 (HEAD -> master) better plots
40229b2 initial commit
Here you can see the history of your project: the commits are listed from most recent to oldest, with a short hash for each, the message you chose, and an indication (HEAD -> master) for the last commit that essentially marks the current state of your project.
One of the main features of git is its synchronization capacities. Let us see how this works with GitHub. Open a GitHub account and create a new repository, named test for example (a big green button normally). Now there are 2 cases:
First case: the project already exists on your computer (you have already run git init). You can link it to your online repository with
git remote add origin https://github.com/username/test.git
(the command is listed on the page that opens when you create the new repository). It simply adds a remote, which is how git calls an online version of the repository. So far, there is still nothing online. To push your commits online, execute:
git push -u origin master
The first push has to be made like this to link a specific remote (called origin; we will use only one remote here) to a specific branch (the default branch is always called master. I might tell more about branches some other time, it is not essential). Once the repository is set up we will just use git push.
Second case: the project already exists online but not yet on your computer. Then retrieve it with
git clone https://github.com/username/test.git
which will create a new directory containing the same contents as on the online repository.
Note that to push you might have to enter your github identifier and password. If a push is rejected, it probably means you are not listed as a collaborator of the repository, which can be modified on github in the settings page of the repository.
Now the repository is online. You can see the code files, but most of all you can now see all the history of your modifications in a very intuitive way. If you click on the number of commits on the top left of the list of code files, you can see the history of commits. If you click on one of the commits you can visualize all the modifications that have been done in this commit.
Back to the project home page, if you click on a file name, you will notice that you can click on a history function that will list the commits that have modified this file. This can be quite handy if after a few months you get lost in your project.
Now you and your collaborator want to work while keeping a synchronized version of the project. It is all based on pushing your changes online, and pulling your collaborator’s changes.
If you have made some modifications, add a new commit and push it with:
git add *.py
git commit -m "new stuff"
git push
If your collaborator has pushed things online, retrieve them with:
git pull
This will synchronize all the files in your directory with their online version. It also adds the commits that have been pushed online to your project’s history.
Of course, no collaborative system can work without a few rules. Here are the main ones:
you cannot push if your collaborator has already pushed changes that you have not pulled. Said otherwise, the commits that are added online have to follow each other: git cannot accept two concurrent commits. So be sure to always pull right before you push.
when you pull, your local files are overwritten by the online version if they are in the same state as the last commit. For example, say there is a commit 1 and your collaborator pushed a commit 2 that modifies model.py. If you have not touched this file since commit 1, it will simply be replaced by its version from commit 2. This will work even if in the meanwhile you have modified plot.py: this file is not in the commit, so your local changes to plot.py are kept. Hence you and your collaborator can gracefully work in parallel, as long as it is on different files.
the tricky part is of course if your collaborator pushes a commit 2 modifying plot.py, and you have also modified plot.py on your computer. Then when you execute git pull you will get a message saying there is a conflict. This will eventually happen if you use git, and it can be a very discouraging situation for a beginner. Here are a few options to save the day:
the radical way: discard your local changes with git checkout -- plot.py, after which git pull will work.
the dirty but effective way: rename your version of plot.py to plot2.py. Redo git pull, which will now work without conflict. Then you can manually add your changes back to plot.py from your personal copy, and push them afterwards.
the clean way: commit your modifications and then pull:
git add plot.py
git commit -m "my modifications to plot.py"
git pull
If the modifications are not overlapping, git will intelligently merge them with this sequence. However, if they are, you will get a message saying that the automatic merge failed and that you have to fix the conflicts yourself. This means git will have added lines in the conflicting files like:
<<<<<<< HEAD
# some stuff...
=======
# some stuff...
>>>>>>> a sequence of letters and digits
The stuff above the ======= marker are your local changes, and below are the remote changes. git kept both because it judged it couldn’t choose between them. So you have to choose what to keep yourself and remove the lines <<<<<<< HEAD, ======= and >>>>>>>. When this is done, you will have to add a merge commit:
git add plot.py
git commit -m "merge completed"
git push
After all this has been done, you can visualize what happened with the graph view of the git log:
git log --oneline --decorate --graph --all
you will see that your local version and the remote one diverged, and then merged back again.
One important trick is to manage which files should be tracked and which should be ignored by git. For this, git uses a hidden file in your project’s directory called .gitignore (for those who don’t know, hidden files are files whose name starts with a dot; they are not shown by file explorers by default. You can see them with the command ls -a in the terminal).
To use this functionality, open a text editor and create a file called .gitignore in your project’s directory. You can add to it names of files, patterns, or directories, for example:
*.npz
plots/
Actually, to make your life easier, a collection of ready-to-use gitignores for different types of projects is available at gitignore.io.
The advantage is that then to add a commit you can simply run
git add .
git commit -m "my message"
the dot signifying to git that it should add everything except what is in the gitignore.
I want to finish with how to set up some useful aliases I use:
git config --global alias.st status # cause I use it so often
git config --global alias.logp "log --oneline --decorate"
git config --global alias.logg "log --oneline --decorate --graph --all"
This means you can use the commands git st, git logp and git logg (the latter two corresponding to the two ways of visualizing the project’s history in the terminal shown above).