## Neural Turing Machines
Alex Graves (gravesa@google.com), Greg Wayne (gregwayne@google.com), Ivo Danihelka (danihelka@google.com)
Google DeepMind, London, UK
## Abstract
We extend the capabilities of neural networks by coupling them to external memory resources, which they can interact with by attentional processes. The combined system is analogous to a Turing Machine or Von Neumann architecture but is differentiable end-to-end, allowing it to be efficiently trained with gradient descent. Preliminary results demonstrate that Neural Turing Machines can infer simple algorithms such as copying, sorting, and associative recall from input and output examples.
## 1 Introduction
Computer programs make use of three fundamental mechanisms: elementary operations (e.g., arithmetic operations), logical flow control (branching), and external memory, which can be written to and read from in the course of computation (Von Neumann, 1945). Despite its wide-ranging success in modelling complicated data, modern machine learning has largely neglected the use of logical flow control and external memory.
Recurrent neural networks (RNNs) stand out from other machine learning methods for their ability to learn and carry out complicated transformations of data over extended periods of time. Moreover, it is known that RNNs are Turing-Complete (Siegelmann and Sontag, 1995), and therefore have the capacity to simulate arbitrary procedures, if properly wired. Yet what is possible in principle is not always what is simple in practice. We therefore enrich the capabilities of standard recurrent networks to simplify the solution of algorithmic tasks. This enrichment is primarily via a large, addressable memory, so, by analogy to Turing's enrichment of finite-state machines by an infinite memory tape, we
dub our device a 'Neural Turing Machine' (NTM). Unlike a Turing machine, an NTM is a differentiable computer that can be trained by gradient descent, yielding a practical mechanism for learning programs.
In human cognition, the process that shares the most similarity to algorithmic operation is known as 'working memory.' While the mechanisms of working memory remain somewhat obscure at the level of neurophysiology, the verbal definition is understood to mean a capacity for short-term storage of information and its rule-based manipulation (Baddeley et al., 2009). In computational terms, these rules are simple programs, and the stored information constitutes the arguments of these programs. Therefore, an NTM resembles a working memory system, as it is designed to solve tasks that require the application of approximate rules to 'rapidly-created variables.' Rapidly-created variables (Hadley, 2009) are data that are quickly bound to memory slots, in the same way that the number 3 and the number 4 are put inside registers in a conventional computer and added to make 7 (Minsky, 1967). An NTM bears another close resemblance to models of working memory since the NTM architecture uses an attentional process to read from and write to memory selectively. In contrast to most models of working memory, our architecture can learn to use its working memory instead of deploying a fixed set of procedures over symbolic data.
The organisation of this report begins with a brief review of germane research on working memory in psychology, linguistics, and neuroscience, along with related research in artificial intelligence and neural networks. We then describe our basic contribution, a memory architecture and attentional controller that we believe is well-suited to the performance of tasks that require the induction and execution of simple programs. To test this architecture, we have constructed a battery of problems, and we present their precise descriptions along with our results. We conclude by summarising the strengths of the architecture.
## 2 Foundational Research
## 2.1 Psychology and Neuroscience
The concept of working memory has been most heavily developed in psychology to explain the performance of tasks involving the short-term manipulation of information. The broad picture is that a 'central executive' focuses attention and performs operations on data in a memory buffer (Baddeley et al., 2009). Psychologists have extensively studied the capacity limitations of working memory, which is often quantified by the number of 'chunks' of information that can be readily recalled (Miller, 1956).¹ These capacity limitations lead toward an understanding of structural constraints in the human working memory system, but in our own work we are happy to exceed them.
In neuroscience, the working memory process has been ascribed to the functioning of a system composed of the prefrontal cortex and basal ganglia (Goldman-Rakic, 1995). Typical experiments involve recording from a single neuron or group of neurons in prefrontal cortex while a monkey is performing a task that involves observing a transient cue, waiting through a 'delay period,' then responding in a manner dependent on the cue. Certain tasks elicit persistent firing from individual neurons during the delay period or more complicated neural dynamics. A recent study quantified delay period activity in prefrontal cortex for a complex, context-dependent task based on measures of 'dimensionality' of the population code and showed that it predicted memory performance (Rigotti et al., 2013).

¹ There remains vigorous debate about how best to characterise capacity limitations (Barrouillet et al., 2004).
Modeling studies of working memory range from those that consider how biophysical circuits could implement persistent neuronal firing (Wang, 1999) to those that try to solve explicit tasks (Hazy et al., 2006) (Dayan, 2008) (Eliasmith, 2013). Of these, Hazy et al.'s model is the most relevant to our work, as it is itself analogous to the Long Short-Term Memory architecture, which we have modified ourselves. As in our architecture, Hazy et al.'s has mechanisms to gate information into memory slots, which they use to solve a memory task constructed of nested rules. In contrast to our work, the authors include no sophisticated notion of memory addressing, which limits the system to storage and recall of relatively simple, atomic data. Addressing, fundamental to our work, is usually left out from computational models in neuroscience, though it deserves to be mentioned that Gallistel and King (Gallistel and King, 2009) and Marcus (Marcus, 2003) have argued that addressing must be implicated in the operation of the brain.
## 2.2 Cognitive Science and Linguistics
Historically, cognitive science and linguistics emerged as fields at roughly the same time as artificial intelligence, all deeply influenced by the advent of the computer (Chomsky, 1956) (Miller, 2003). Their intentions were to explain human mental behaviour based on information or symbol-processing metaphors. In the early 1980s, both fields considered recursive or procedural (rule-based) symbol-processing to be the highest mark of cognition. The Parallel Distributed Processing (PDP) or connectionist revolution cast aside the symbol-processing metaphor in favour of a so-called 'sub-symbolic' description of thought processes (Rumelhart et al., 1986).
Fodor and Pylyshyn (Fodor and Pylyshyn, 1988) famously made two barbed claims about the limitations of neural networks for cognitive modeling. They first objected that connectionist theories were incapable of variable-binding , or the assignment of a particular datum to a particular slot in a data structure. In language, variable-binding is ubiquitous; for example, when one produces or interprets a sentence of the form, 'Mary spoke to John,' one has assigned 'Mary' the role of subject, 'John' the role of object, and 'spoke to' the role of the transitive verb. Fodor and Pylyshyn also argued that neural networks with fixed-length input domains could not reproduce human capabilities in tasks that involve processing variable-length structures . In response to this criticism, neural network researchers including Hinton (Hinton, 1986), Smolensky (Smolensky, 1990), Touretzky (Touretzky, 1990), Pollack (Pollack, 1990), Plate (Plate, 2003), and Kanerva (Kanerva, 2009) investigated specific mechanisms that could support both variable-binding and variable-length
structure within a connectionist framework. Our architecture draws on and potentiates this work.
Recursive processing of variable-length structures continues to be regarded as a hallmark of human cognition. In the last decade, a firefight in the linguistics community staked several leaders of the field against one another. At issue was whether recursive processing is the 'uniquely human' evolutionary innovation that enables language and is specialized to language, a view supported by Fitch, Hauser, and Chomsky (Fitch et al., 2005), or whether multiple new adaptations are responsible for human language evolution and recursive processing predates language (Jackendoff and Pinker, 2005). Regardless of recursive processing's evolutionary origins, all agreed that it is essential to human cognitive flexibility.
## 2.3 Recurrent Neural Networks
Recurrent neural networks constitute a broad class of machines with dynamic state; that is, they have state whose evolution depends both on the input to the system and on the current state. In comparison to hidden Markov models, which also contain dynamic state, RNNs have a distributed state and therefore have significantly larger and richer memory and computational capacity. Dynamic state is crucial because it affords the possibility of context-dependent computation; a signal entering at a given moment can alter the behaviour of the network at a much later moment.
A crucial innovation to recurrent networks was the Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997). This very general architecture was developed for a specific purpose, to address the 'vanishing and exploding gradient' problem (Hochreiter et al., 2001a), which we might relabel the problem of 'vanishing and exploding sensitivity.' LSTM ameliorates the problem by embedding perfect integrators (Seung, 1998) for memory storage in the network. The simplest example of a perfect integrator is the equation $x(t+1) = x(t) + i(t)$, where $i(t)$ is an input to the system. The implicit identity matrix $\mathbf{I}\,x(t)$ means that signals do not dynamically vanish or explode. If we attach a mechanism to this integrator that allows an enclosing network to choose when the integrator listens to inputs, namely, a programmable gate depending on context, we have an equation of the form $x(t+1) = x(t) + g(\mathrm{context})\, i(t)$. We can now selectively store information for an indefinite length of time.
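As a concrete sketch of this idea (the function name and gating form are ours for illustration, not LSTM's full gating equations), a gated perfect integrator can be written in a few lines of NumPy:

```python
import numpy as np

def gated_integrator(x, i, g):
    """One step of a gated perfect integrator: x(t+1) = x(t) + g * i(t).

    The gate g, computed from context by an enclosing network, decides
    whether the integrator listens to its input at this step: g = 0
    holds the state unchanged, g = 1 accumulates the full input.
    """
    return x + g * i
```

With the gate open the input is stored exactly; with it closed, subsequent inputs leave the state untouched, so the stored signal neither vanishes nor explodes.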
Recurrent networks readily process variable-length structures without modification. In sequential problems, inputs to the network arrive at different times, allowing variable-length or composite structures to be processed over multiple steps. Because they natively handle variable-length structures, they have recently been used in a variety of cognitive problems, including speech recognition (Graves et al., 2013; Graves and Jaitly, 2014), text generation (Sutskever et al., 2011), handwriting generation (Graves, 2013) and machine translation (Sutskever et al., 2014). Considering this property, we do not feel that it is urgent or even necessarily valuable to build explicit parse trees to merge composite structures greedily (Pollack, 1990) (Socher et al., 2012) (Frasconi et al., 1998).
Other important precursors to our work include differentiable models of attention (Graves,
Figure 1: Neural Turing Machine Architecture. During each update cycle, the controller network receives inputs from an external environment and emits outputs in response. It also reads from and writes to a memory matrix via a set of parallel read and write heads. The dashed line indicates the division between the NTM circuit and the outside world.
2013) (Bahdanau et al., 2014) and program search (Hochreiter et al., 2001b) (Das et al., 1992), constructed with recurrent neural networks.
## 3 Neural Turing Machines
A Neural Turing Machine (NTM) architecture contains two basic components: a neural network controller and a memory bank. Figure 1 presents a high-level diagram of the NTM architecture. Like most neural networks, the controller interacts with the external world via input and output vectors. Unlike a standard network, it also interacts with a memory matrix using selective read and write operations. By analogy to the Turing machine we refer to the network outputs that parametrise these operations as 'heads.'
Crucially, every component of the architecture is differentiable, making it straightforward to train with gradient descent. We achieved this by defining 'blurry' read and write operations that interact to a greater or lesser degree with all the elements in memory (rather than addressing a single element, as in a normal Turing machine or digital computer). The degree of blurriness is determined by an attentional 'focus' mechanism that constrains each read and write operation to interact with a small portion of the memory, while ignoring the rest. Because interaction with the memory is highly sparse, the NTM is biased towards storing data without interference. The memory location brought into attentional focus is determined by specialised outputs emitted by the heads. These outputs define a normalised weighting over the rows in the memory matrix (referred to as memory 'locations'). Each weighting, one per read or write head, defines the degree to which the head reads or writes
at each location. A head can thereby attend sharply to the memory at a single location or weakly to the memory at many locations.
## 3.1 Reading
Let $\mathbf{M}_t$ be the contents of the $N \times M$ memory matrix at time $t$, where $N$ is the number of memory locations, and $M$ is the vector size at each location. Let $\mathbf{w}_t$ be a vector of weightings over the $N$ locations emitted by a read head at time $t$. Since all weightings are normalised, the $N$ elements $w_t(i)$ of $\mathbf{w}_t$ obey the following constraints:

$$\sum_i w_t(i) = 1, \qquad 0 \le w_t(i) \le 1, \;\; \forall i.$$
The length $M$ read vector $\mathbf{r}_t$ returned by the head is defined as a convex combination of the row-vectors $\mathbf{M}_t(i)$ in memory:

$$\mathbf{r}_t \leftarrow \sum_i w_t(i)\, \mathbf{M}_t(i),$$
which is clearly differentiable with respect to both the memory and the weighting.
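The read operation translates directly into code. The following NumPy sketch (function and argument names are ours) treats the weighting as a length-$N$ vector and the memory as an $N \times M$ matrix:

```python
import numpy as np

def read(memory, w):
    """Blurry read: a convex combination of the memory's row-vectors.

    memory: (N, M) matrix M_t; w: length-N weighting, with non-negative
    elements summing to one.  Returns the length-M read vector r_t.
    """
    return w @ memory  # r_t = sum_i w_t(i) * M_t(i)
```

A sharp weighting (all mass on one location) recovers a single row exactly; a blurred weighting returns a mixture of rows.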
## 3.2 Writing
Taking inspiration from the input and forget gates in LSTM, we decompose each write into two parts: an erase followed by an add .
Given a weighting $\mathbf{w}_t$ emitted by a write head at time $t$, along with an erase vector $\mathbf{e}_t$ whose $M$ elements all lie in the range $(0, 1)$, the memory vectors $\mathbf{M}_{t-1}(i)$ from the previous time-step are modified as follows:

$$\tilde{\mathbf{M}}_t(i) \leftarrow \mathbf{M}_{t-1}(i) \left[ \mathbf{1} - w_t(i)\, \mathbf{e}_t \right],$$
where $\mathbf{1}$ is a row-vector of all ones, and the multiplication against the memory location acts point-wise. Therefore, the elements of a memory location are reset to zero only if both the weighting at the location and the erase element are one; if either the weighting or the erase is zero, the memory is left unchanged. When multiple write heads are present, the erasures can be performed in any order, as multiplication is commutative.
Each write head also produces a length $M$ add vector $\mathbf{a}_t$, which is added to the memory after the erase step has been performed:

$$\mathbf{M}_t(i) \leftarrow \tilde{\mathbf{M}}_t(i) + w_t(i)\, \mathbf{a}_t.$$
Once again, the order in which the adds are performed by multiple heads is irrelevant. The combined erase and add operations of all the write heads produce the final content of the memory at time $t$. Since both erase and add are differentiable, the composite write operation is differentiable too. Note that both the erase and add vectors have $M$ independent components, allowing fine-grained control over which elements in each memory location are modified.
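The erase-then-add decomposition can be sketched in NumPy as follows (function and argument names are illustrative):

```python
import numpy as np

def write(memory, w, erase, add):
    """Blurry write, decomposed as an erase followed by an add.

    memory: (N, M) matrix; w: length-N weighting; erase, add: length-M
    vectors, with erase elements in (0, 1).
    """
    # Erase: M~_t(i) = M_{t-1}(i) * (1 - w_t(i) e_t), applied point-wise.
    memory = memory * (1.0 - np.outer(w, erase))
    # Add:   M_t(i)  = M~_t(i) + w_t(i) a_t.
    return memory + np.outer(w, add)
```

With a sharp weighting and an all-ones erase vector, the focused row is fully overwritten by the add vector while every other row is untouched.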
Figure 2: Flow Diagram of the Addressing Mechanism. The key vector, $\mathbf{k}_t$, and key strength, $\beta_t$, are used to perform content-based addressing of the memory matrix, $\mathbf{M}_t$. The resulting content-based weighting is interpolated with the weighting from the previous time step based on the value of the interpolation gate, $g_t$. The shift weighting, $\mathbf{s}_t$, determines whether and by how much the weighting is rotated. Finally, depending on $\gamma_t$, the weighting is sharpened and used for memory access.
## 3.3 Addressing Mechanisms
Although we have now shown the equations of reading and writing, we have not described how the weightings are produced. These weightings arise by combining two addressing mechanisms with complementary facilities. The first mechanism, 'content-based addressing,' focuses attention on locations based on the similarity between their current values and values emitted by the controller. This is related to the content-addressing of Hopfield networks (Hopfield, 1982). The advantage of content-based addressing is that retrieval is simple, merely requiring the controller to produce an approximation to a part of the stored data, which is then compared to memory to yield the exact stored value.
However, not all problems are well-suited to content-based addressing. In certain tasks the content of a variable is arbitrary, but the variable still needs a recognisable name or address. Arithmetic problems fall into this category: the variable $x$ and the variable $y$ can take on any two values, but the procedure $f(x, y) = x \times y$ should still be defined. A controller for this task could take the values of the variables $x$ and $y$, store them in different addresses, then retrieve them and perform a multiplication algorithm. In this case, the variables are addressed by location, not by content. We call this form of addressing 'location-based addressing.' Content-based addressing is strictly more general than location-based addressing, as the content of a memory location could include location information inside it. In our experiments, however, providing location-based addressing as a primitive operation proved essential for some forms of generalisation, so we employ both mechanisms concurrently.
Figure 2 presents a flow diagram of the entire addressing system that shows the order of operations for constructing a weighting vector when reading or writing.
## 3.3.1 Focusing by Content
For content-addressing, each head (whether employed for reading or writing) first produces a length $M$ key vector $\mathbf{k}_t$ that is compared to each vector $\mathbf{M}_t(i)$ by a similarity measure $K[\cdot, \cdot]$. The content-based system produces a normalised weighting $\mathbf{w}^c_t$ based on the similarity and a positive key strength , $\beta_t$, which can amplify or attenuate the precision of the focus:

$$w^c_t(i) \leftarrow \frac{\exp\big(\beta_t K[\mathbf{k}_t, \mathbf{M}_t(i)]\big)}{\sum_j \exp\big(\beta_t K[\mathbf{k}_t, \mathbf{M}_t(j)]\big)}.$$
In our current implementation, the similarity measure is cosine similarity:

$$K[\mathbf{u}, \mathbf{v}] = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}.$$
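A minimal NumPy sketch of content-based addressing, combining the similarity measure and the normalised weighting (names are ours; the small epsilon guards against zero-norm rows and is not part of the formulation above):

```python
import numpy as np

def content_weighting(memory, key, beta):
    """Normalised content-based weighting: a softmax over beta-scaled
    cosine similarities between the key and each memory row."""
    sims = memory @ key / (
        np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-12)
    scores = beta * sims
    e = np.exp(scores - scores.max())  # subtract max for numerical stability
    return e / e.sum()
```

Raising the key strength beta concentrates more of the weighting on the best-matching location, as described above.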
## 3.3.2 Focusing by Location
The location-based addressing mechanism is designed to facilitate both simple iteration across the locations of the memory and random-access jumps. It does so by implementing a rotational shift of a weighting. For example, if the current weighting focuses entirely on a single location, a rotation of 1 would shift the focus to the next location. A negative shift would move the weighting in the opposite direction.
Prior to rotation, each head emits a scalar interpolation gate $g_t$ in the range $(0, 1)$. The value of $g_t$ is used to blend between the weighting $\mathbf{w}_{t-1}$ produced by the head at the previous time-step and the weighting $\mathbf{w}^c_t$ produced by the content system at the current time-step, yielding the gated weighting $\mathbf{w}^g_t$:

$$\mathbf{w}^g_t \leftarrow g_t\, \mathbf{w}^c_t + (1 - g_t)\, \mathbf{w}_{t-1}.$$
If the gate is zero, then the content weighting is entirely ignored, and the weighting from the previous time step is used. Conversely, if the gate is one, the weighting from the previous iteration is ignored, and the system applies content-based addressing.
After interpolation, each head emits a shift weighting $\mathbf{s}_t$ that defines a normalised distribution over the allowed integer shifts. For example, if shifts between -1 and 1 are allowed, $\mathbf{s}_t$ has three elements corresponding to the degree to which shifts of -1, 0 and 1 are performed. The simplest way to define the shift weightings is to use a softmax layer of the appropriate size attached to the controller. We also experimented with another technique, where the controller emits a single scalar that is interpreted as the lower bound of a width-one uniform distribution over shifts. For example, if the shift scalar is 6.7, then $s_t(6) = 0.3$, $s_t(7) = 0.7$, and the rest of $\mathbf{s}_t$ is zero.
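The scalar-to-distribution technique can be sketched with a hypothetical helper that reproduces the worked example above:

```python
import numpy as np

def shift_weights(shift_scalar):
    """Interpret a scalar as the lower bound of a width-one uniform
    distribution over integer shifts, returning {shift: weight}.

    A scalar of 6.7 puts weight 0.3 on shift 6 and 0.7 on shift 7.
    """
    lower = int(np.floor(shift_scalar))
    frac = shift_scalar - lower
    return {lower: 1.0 - frac, lower + 1: frac}
```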
If we index the $N$ memory locations from $0$ to $N-1$, the rotation applied to $\mathbf{w}^g_t$ by $\mathbf{s}_t$ can be expressed as the following circular convolution:

$$\tilde{w}_t(i) \leftarrow \sum_{j=0}^{N-1} w^g_t(j)\, s_t(i - j),$$
where all index arithmetic is computed modulo $N$. The circular convolution can cause leakage or dispersion of weightings over time if the shift weighting is not sharp. For example, if shifts of -1, 0 and 1 are given weights of 0.1, 0.8 and 0.1, the rotation will transform a weighting focused at a single point into one slightly blurred over three points. To combat this, each head emits one further scalar $\gamma_t \geq 1$ whose effect is to sharpen the final weighting as follows:

$$w_t(i) \leftarrow \frac{\tilde{w}_t(i)^{\gamma_t}}{\sum_j \tilde{w}_t(j)^{\gamma_t}}.$$
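Both steps, the circular convolution and the sharpening, are easy to express with NumPy's `roll` (a sketch under the assumption that the shift distribution is given as a map from integer shifts to weights):

```python
import numpy as np

def shift_and_sharpen(w_g, s, gamma):
    """Rotate the gated weighting by circular convolution with the shift
    distribution, then sharpen with exponent gamma >= 1.

    w_g: length-N gated weighting; s: dict {integer shift: weight}.
    np.roll(w_g, k)[i] == w_g[i - k], so rolling implements
    w~(i) = sum_j w_g(j) s(i - j) with indices taken modulo N.
    """
    w_tilde = sum(weight * np.roll(w_g, shift) for shift, weight in s.items())
    # Sharpening: renormalised element-wise power.
    w = w_tilde ** gamma
    return w / w.sum()
```

A sharp shift distribution simply rotates the focus; a soft one blurs it, and a large gamma pulls the blurred weighting back toward a single point.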
The combined addressing system of weighting interpolation and content- and location-based addressing can operate in three complementary modes. One, a weighting can be chosen by the content system without any modification by the location system. Two, a weighting produced by the content addressing system can be chosen and then shifted. This allows the focus to jump to a location next to, but not on, an address accessed by content; in computational terms this allows a head to find a contiguous block of data, then access a particular element within that block. Three, a weighting from the previous time step can be rotated without any input from the content-based addressing system. This allows the weighting to iterate through a sequence of addresses by advancing the same distance at each time-step.
## 3.4 Controller Network
The NTM architecture described above has several free parameters, including the size of the memory, the number of read and write heads, and the range of allowed location shifts. But perhaps the most significant architectural choice is the type of neural network used as the controller. In particular, one has to decide whether to use a recurrent or feedforward network. A recurrent controller such as LSTM has its own internal memory that can complement the larger memory in the matrix. If one compares the controller to the central processing unit in a digital computer (albeit with adaptive rather than predefined instructions) and the memory matrix to RAM, then the hidden activations of the recurrent controller are akin to the registers in the processor. They allow the controller to mix information across multiple time steps of operation. On the other hand, a feedforward controller can mimic a recurrent network by reading and writing at the same location in memory at every step. Furthermore, feedforward controllers often confer greater transparency to the network's operation because the pattern of reading from and writing to the memory matrix is usually easier to interpret than the internal state of an RNN. However, one limitation of
a feedforward controller is that the number of concurrent read and write heads imposes a bottleneck on the type of computation the NTM can perform. With a single read head, it can perform only a unary transform on a single memory vector at each time-step, with two read heads it can perform binary vector transforms, and so on. Recurrent controllers can internally store read vectors from previous time-steps, so do not suffer from this limitation.
## 4 Experiments
This section presents preliminary experiments on a set of simple algorithmic tasks such as copying and sorting data sequences. The goal was not only to establish that NTM is able to solve the problems, but also that it is able to do so by learning compact internal programs. The hallmark of such solutions is that they generalise well beyond the range of the training data. For example, we were curious to see if a network that had been trained to copy sequences of length up to 20 could copy a sequence of length 100 with no further training.
For all the experiments we compared three architectures: NTM with a feedforward controller, NTM with an LSTM controller, and a standard LSTM network. Because all the tasks were episodic, we reset the dynamic state of the networks at the start of each input sequence. For the LSTM networks, this meant setting the previous hidden state equal to a learned bias vector. For NTM the previous state of the controller, the value of the previous read vectors, and the contents of the memory were all reset to bias values. All the tasks were supervised learning problems with binary targets; all networks had logistic sigmoid output layers and were trained with the cross-entropy objective function. Sequence prediction errors are reported in bits-per-sequence. For more details about the experimental parameters see Section 4.6.
## 4.1 Copy
The copy task tests whether NTM can store and recall a long sequence of arbitrary information. The network is presented with an input sequence of random binary vectors followed by a delimiter flag. Storage and access of information over long time periods has always been problematic for RNNs and other dynamic architectures. We were particularly interested to see if an NTM is able to bridge longer time delays than LSTM.
The networks were trained to copy sequences of eight-bit random vectors, where the sequence lengths were randomised between 1 and 20. The target sequence was simply a copy of the input sequence (without the delimiter flag). Note that no inputs were presented to the network while it received the targets, to ensure that it recalled the entire sequence with no intermediate assistance.
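A single training episode for this task might be constructed as follows; this is a sketch under our reading of the setup (8-bit vectors, lengths 1 to 20, delimiter on a separate input channel), and the exact channel layout is an assumption rather than a detail given in the paper:

```python
import numpy as np

def make_copy_example(rng, width=8, max_len=20):
    """One copy-task episode, under our reading of the setup.

    Inputs use width+1 channels: the extra channel carries the
    delimiter flag marking the end of the sequence. During the
    answer phase the network receives no input (all zeros) and
    must emit the stored sequence.
    """
    length = int(rng.integers(1, max_len + 1))
    seq = rng.integers(0, 2, size=(length, width)).astype(float)

    total = 2 * length + 1                   # inputs + flag + outputs
    inputs = np.zeros((total, width + 1))
    inputs[:length, :width] = seq            # present the sequence
    inputs[length, width] = 1.0              # delimiter flag

    targets = np.zeros((total, width))
    targets[length + 1:, :] = seq            # recall after the flag
    return inputs, targets

rng = np.random.default_rng(0)
x, y = make_copy_example(rng)
```

The key property is that the targets are nonzero only after the delimiter step, so the network must hold the whole sequence in memory before any of it is scored.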
As can be seen from Figure 3, NTM (with either a feedforward or LSTM controller) learned much faster than LSTM alone, and converged to a lower cost. The disparity between the NTM and LSTM learning curves is dramatic enough to suggest a qualitative,
Figure 3: Copy Learning Curves.
rather than quantitative, difference in the way the two models solve the problem.
We also studied the ability of the networks to generalise to longer sequences than seen during training (that they can generalise to novel vectors is clear from the training error). Figures 4 and 5 demonstrate that the behaviour of LSTM and NTM in this regime is radically different. NTM continues to copy as the length increases, 2 while LSTM rapidly degrades beyond length 20.
The preceding analysis suggests that NTM, unlike LSTM, has learned some form of copy algorithm. To determine what this algorithm is, we examined the interaction between the controller and the memory (Figure 6). We believe that the sequence of operations performed by the network can be summarised by the following pseudocode:
```
initialise: move head to start location
while input delimiter not seen do
    receive input vector
    write input to head location
    increment head location by 1
end while
return head to start location
while true do
    read output vector from head location
    emit output
    increment head location by 1
end while
```
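The same procedure can be mimicked directly with an explicit memory array and a head index. The sketch below is a hand-written illustration of the algorithm the network appears to learn, not the trained network itself:

```python
def copy_via_memory(sequence, memory_size=128):
    """Hand-coded analogue of the learned copy algorithm.

    Writes each input vector to the next memory location, then
    rewinds the head and reads the sequence back out. The wrap-around
    (modulo memory_size) mirrors the cyclical shifts that limit
    generalisation once the sequence exceeds the memory size.
    """
    memory = [None] * memory_size
    head = 0
    for vector in sequence:                 # input phase: write loop
        memory[head] = vector
        head = (head + 1) % memory_size
    head = 0                                # return head to start
    output = []
    for _ in range(len(sequence)):          # output phase: read loop
        output.append(memory[head])
        head = (head + 1) % memory_size
    return output

seq = [[0, 1], [1, 1], [1, 0]]
assert copy_via_memory(seq) == seq
```

For sequences longer than the memory, the write head wraps around and overwrites the earliest locations, which is exactly the failure mode noted in the footnote about the 128-location memory.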
This is essentially how a human programmer would perform the same task in a low-
2 The limiting factor was the size of the memory (128 locations), after which the cyclical shifts wrapped around and previous writes were overwritten.
Figure 4: NTM Generalisation on the Copy Task. The four pairs of plots in the top row depict network outputs and corresponding copy targets for test sequences of length 10, 20, 30, and 50, respectively. The plots in the bottom row are for a length 120 sequence. The network was only trained on sequences of up to length 20. The first four sequences are reproduced with high confidence and very few mistakes. The longest one has a few more local errors and one global error: at the point indicated by the red arrow at the bottom, a single vector is duplicated, pushing all subsequent vectors one step back. Despite being subjectively close to a correct copy, this leads to a high loss.
level programming language. In terms of data structures, we could say that NTM has learned how to create and iterate through arrays. Note that the algorithm combines both content-based addressing (to jump to the start of the sequence) and location-based addressing (to move along the sequence). Also note that the iteration would not generalise to long sequences without the ability to use relative shifts from the previous read and write weightings (Equation 7), and that without the focus-sharpening mechanism (Equation 9) the weightings would probably lose precision over time.
## 4.2 Repeat Copy
The repeat copy task extends copy by requiring the network to output the copied sequence a specified number of times and then emit an end-of-sequence marker. The main motivation was to see if the NTM could learn a simple nested function. Ideally, we would like it to be able to execute a 'for loop' containing any subroutine it has already learned.
The network receives random-length sequences of random binary vectors, followed by a scalar value indicating the desired number of copies, which appears on a separate input channel. To emit the end marker at the correct time the network must both interpret the extra input and keep count of the number of copies it has performed so far. As with the copy task, no inputs are provided to the network after the initial sequence and repeat number. The networks were trained to reproduce sequences of eight-bit random binary vectors, where both the sequence length and the number of repetitions were chosen randomly from one to ten. The input representing the repeat number was normalised to have mean zero and variance one.
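An episode for this task might be built as follows; the channel layout, the placement of the end-of-sequence marker on its own output channel, and the exact normalisation of the repeat scalar are our assumptions for illustration, not details fixed by the paper:

```python
import numpy as np

def make_repeat_copy_example(rng, width=8, max_len=10, max_reps=10):
    """One repeat-copy episode, under our reading of the setup.

    After the sequence, a scalar on a separate input channel gives
    the desired number of copies, normalised to zero mean and unit
    variance over the training range 1..max_reps. The target is the
    sequence repeated that many times, followed by an end-of-sequence
    marker on its own output channel.
    """
    length = int(rng.integers(1, max_len + 1))
    reps = int(rng.integers(1, max_reps + 1))
    seq = rng.integers(0, 2, size=(length, width)).astype(float)

    rep_values = np.arange(1, max_reps + 1)
    norm_rep = (reps - rep_values.mean()) / rep_values.std()

    steps_in = length + 1                    # sequence + repeat scalar
    steps_out = reps * length + 1            # copies + end marker
    inputs = np.zeros((steps_in + steps_out, width + 1))
    inputs[:length, :width] = seq
    inputs[length, width] = norm_rep         # normalised repeat count

    targets = np.zeros((steps_in + steps_out, width + 1))
    targets[steps_in:steps_in + reps * length, :width] = np.tile(seq, (reps, 1))
    targets[-1, width] = 1.0                 # end-of-sequence marker
    return inputs, targets

x, y = make_repeat_copy_example(np.random.default_rng(0))
```

Representing the repeat count as a single normalised scalar is what makes counting hard to extrapolate: values outside the training range map to inputs the network has never seen, which is consistent with the end-marker failures reported below.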
Figure 5: LSTM Generalisation on the Copy Task. The plots show inputs and outputs for the same sequence lengths as Figure 4. Like NTM, LSTM learns to reproduce sequences of up to length 20 almost perfectly. However it clearly fails to generalise to longer sequences. Also note that the length of the accurate prefix decreases as the sequence length increases, suggesting that the network has trouble retaining information for long periods.
Figure 6: NTM Memory Use During the Copy Task. The plots in the left column depict the inputs to the network (top), the vectors added to memory (middle) and the corresponding write weightings (bottom) during a single test sequence for the copy task. The plots on the right show the outputs from the network (top), the vectors read from memory (middle) and the read weightings (bottom). Only a subset of memory locations are shown. Notice the sharp focus of all the weightings on a single location in memory (black is weight zero, white is weight one). Also note that the translation of the focal point over time reflects the network's use of iterative shifts for location-based addressing, as described in Section 3.3.2. Lastly, observe that the read locations exactly match the write locations, and the read vectors match the add vectors. This suggests that the network writes each input vector in turn to a specific memory location during the input phase, then reads from the same location sequence during the output phase.
Figure 7: Repeat Copy Learning Curves.
Figure 7 shows that NTM learns the task much faster than LSTM, but both were able to solve it perfectly. 3 The difference between the two architectures only becomes clear when they are asked to generalise beyond the training data. In this case we were interested in generalisation along two dimensions: sequence length and number of repetitions. Figure 8 illustrates the effect of doubling first one, then the other, for both LSTM and NTM. Whereas LSTM fails both tests, NTM succeeds with longer sequences and is able to perform more than ten repetitions; however it is unable to keep count of how many repeats it has completed, and does not predict the end marker correctly. This is probably a consequence of representing the number of repetitions numerically, which does not easily generalise beyond a fixed range.
Figure 9 suggests that NTM learns a simple extension of the copy algorithm in the previous section, where the sequential read is repeated as many times as necessary.
## 4.3 Associative Recall
The previous tasks show that the NTM can apply algorithms to relatively simple, linear data structures. The next order of complexity in organising data arises from 'indirection'; that is, when one data item points to another. We test the NTM's capability for learning an instance of this more interesting class by constructing a list of items so that querying with one of the items demands that the network return the subsequent item. More specifically, we define an item as a sequence of binary vectors that is bounded on the left and right by delimiter symbols. After several items have been propagated to the network, we query by showing a random item, and we ask the network to produce the next item. In our experiments, each item consisted of three six-bit binary vectors (giving a total of 18 bits
3 It surprised us that LSTM performed better here than on the copy problem. The likely reasons are that the sequences were shorter (up to length 10 instead of up to 20), and the LSTM network was larger and therefore had more memory capacity.
Figure 8: NTM and LSTM Generalisation for the Repeat Copy Task. NTM generalises almost perfectly to longer sequences than seen during training. When the number of repeats is increased it is able to continue duplicating the input sequence fairly accurately; but it is unable to predict when the sequence will end, emitting the end marker after the end of every repetition beyond the eleventh. LSTM struggles with both increased length and number, rapidly diverging from the input sequence in both cases.
per item). During training, we used a minimum of 2 items and a maximum of 6 items in a single episode.
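As a concrete illustration, the episode structure described above can be sketched in NumPy. This is a hypothetical data generator, not the authors' pipeline; the exact channel layout and the trailing query delimiter are assumptions, loosely following the row-7/row-8 delimiter convention visible in the paper's figures:

```python
import numpy as np

def associative_recall_episode(num_items=4, item_len=3, bits=6, rng=None):
    """One associative-recall episode. Channels 0..bits-1 carry data;
    channel `bits` is the item delimiter (row 7 in the paper's figures)
    and channel `bits+1` the query delimiter (row 8)."""
    if rng is None:
        rng = np.random.default_rng()
    items = [rng.integers(0, 2, size=(item_len, bits)) for _ in range(num_items)]

    width = bits + 2
    steps = []
    for item in items:
        delim = np.zeros(width)
        delim[bits] = 1                      # item delimiter before each item
        steps.append(delim)
        for vec in item:
            row = np.zeros(width)
            row[:bits] = vec
            steps.append(row)

    # Query with a random item (never the last one); the target is its successor.
    q = int(rng.integers(0, num_items - 1))
    qdelim = np.zeros(width)
    qdelim[bits + 1] = 1                     # query delimiter
    steps.append(qdelim)
    for vec in items[q]:
        row = np.zeros(width)
        row[:bits] = vec
        steps.append(row)
    steps.append(qdelim)                     # closing query delimiter

    return np.stack(steps), items[q + 1]
```

With the defaults above (4 items of three six-bit vectors), one episode spans 21 time steps over 8 input channels, and the target is the 3x6 item following the queried one.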
Figure 10 shows that NTM learns this task significantly faster than LSTM, terminating at near-zero cost within approximately 30,000 episodes, whereas LSTM does not reach zero cost after a million episodes. Additionally, NTM with a feedforward controller learns faster than NTM with an LSTM controller. These two results suggest that NTM's external memory is a more effective way of maintaining the data structure than LSTM's internal state. NTM also generalises much better to longer sequences than LSTM, as can be seen in Figure 11. NTM with a feedforward controller is nearly perfect for sequences of up to 12 items (twice the maximum length used in training), and still has an average cost below 1 bit per sequence for sequences of 15 items.
In Figure 12, we show the operation of the NTM memory, controlled by an LSTM with one head, on a single test episode. In 'Inputs,' we see that the input denotes item delimiters as single bits in row 7. After the sequence of items has been propagated, a
Figure 9: NTM Memory Use During the Repeat Copy Task. As with the copy task, the network first writes the input vectors to memory using iterative shifts. It then reads through the sequence to replicate the input as many times as necessary (six in this case). The white dot at the bottom of the read weightings seems to correspond to an intermediate location used to redirect the head to the start of the sequence (the NTM equivalent of a goto statement).
Figure 10: Associative Recall Learning Curves for NTM and LSTM.
Figure 11: Generalisation Performance on Associative Recall for Longer Item Sequences. The NTM with either a feedforward or LSTM controller generalises to much longer sequences of items than the LSTM alone. In particular, the NTM with a feedforward controller is nearly perfect for item sequences of twice the length of sequences in its training set.
delimiter in row 8 prepares the network to receive a query item. In this case, the query item corresponds to the second item in the sequence (contained in the green box). In 'Outputs,' we see that the network crisply outputs item 3 in the sequence (from the red box). In 'Read Weightings,' on the last three time steps, we see that the controller reads from contiguous locations that each store the time slices of item 3. This is curious because it appears that the network has jumped directly to the correct location storing item 3. However, we can explain this behaviour by looking at 'Write Weightings.' Here we see that the memory is written to even when the input presents a delimiter symbol between items. One can confirm in 'Adds' that data are indeed written to memory when the delimiters are presented (e.g., the data within the black box); furthermore, each time a delimiter is presented, the vector added to memory is different. Further analysis of the memory reveals that the network accesses the location it reads after the query by using a content-based lookup that produces a weighting that is shifted by one. Additionally, the key used for content-lookup corresponds to the vector that was added in the black box. This implies the following memory-access algorithm: when each item delimiter is presented, the controller writes a compressed representation of the previous three time slices of the item. After the query arrives, the controller recomputes the same compressed representation of the query item, uses a content-based lookup to find the location where it wrote the first representation, and then shifts by one to produce the subsequent item in the sequence (thereby combining content-based lookup with location-based offsetting).
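The lookup-then-shift behaviour described above can be sketched in NumPy. This is a toy illustration with a hypothetical 8-slot memory: the cosine content addressing follows the NTM's general scheme, while the hard +1 rotation stands in for the network's soft convolutional shift:

```python
import numpy as np

def content_lookup(memory, key, beta=10.0):
    """Cosine-similarity content addressing: softmax weighting over rows."""
    mem_norm = memory / (np.linalg.norm(memory, axis=1, keepdims=True) + 1e-8)
    key_norm = key / (np.linalg.norm(key) + 1e-8)
    scores = beta * (mem_norm @ key_norm)
    e = np.exp(scores - scores.max())
    return e / e.sum()

def shift_by_one(w):
    """Circular rotation of a weighting by +1 location (a hard version
    of the NTM's soft convolutional shift)."""
    return np.roll(w, 1)

# Hypothetical memory: row i stores the compressed code of item i.
rng = np.random.default_rng(0)
memory = rng.standard_normal((8, 16))
query_code = memory[2] + 0.01 * rng.standard_normal(16)  # recomputed code of item 2

w = content_lookup(memory, query_code)   # weighting focuses on location 2
w_next = shift_by_one(w)                 # redirected to location 3: the next item
```

Reading with `w_next` instead of `w` is precisely the combination of content-based lookup and location-based offsetting that the analysis attributes to the trained network.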
## 4.4 Dynamic N-Grams
The goal of the dynamic N-Grams task was to test whether NTM could rapidly adapt to new predictive distributions. In particular, we were interested to see whether it was able to use its
Figure 12: NTM Memory Use During the Associative Recall Task. In 'Inputs,' a sequence of items, each composed of three consecutive binary random vectors, is propagated to the controller. The distinction between items is designated by delimiter symbols (row 7 in 'Inputs'). After several items have been presented, a delimiter that designates a query is presented (row 8 in 'Inputs'). A single query item is presented (green box), and the network target corresponds to the subsequent item in the sequence (red box). In 'Outputs,' we see that the network correctly produces the target item. The red boxes in the read and write weightings highlight the three locations where the target item was written and then read. The solution the network finds is to form a compressed representation (black box in 'Adds') of each item that it can store in a single location. For further analysis, see the main text.
memory as a re-writable table that it could use to keep count of transition statistics, thereby emulating a conventional N-Gram model.
We considered the set of all possible 6-Gram distributions over binary sequences. Each 6-Gram distribution can be expressed as a table of 2^5 = 32 numbers, specifying the probability that the next bit will be one, given all possible length-five binary histories. For each training example, we first generated random 6-Gram probabilities by independently drawing all 32 probabilities from the Beta(1/2, 1/2) distribution.
We then generated a particular training sequence by drawing 200 successive bits using the current lookup table. 4 The network observes the sequence one bit at a time and is then asked to predict the next bit. The optimal estimator for the problem can be determined by
4 The first 5 bits, for which insufficient context exists to sample from the table, are drawn i.i.d. from a Bernoulli distribution with p = 0.5.
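The data-generation procedure above can be sketched as follows. This is a minimal NumPy version; the function name and defaults are illustrative:

```python
import numpy as np

def sample_ngram_sequence(length=200, context=5, rng=None):
    """Sample one training sequence for the dynamic N-Gram task: draw a
    6-Gram lookup table with all 2**5 = 32 probabilities i.i.d. from
    Beta(1/2, 1/2), then draw `length` bits from it. The first `context`
    bits are Bernoulli(0.5), since no full history exists for them."""
    if rng is None:
        rng = np.random.default_rng()
    table = rng.beta(0.5, 0.5, size=2 ** context)   # P(next bit = 1 | history)
    bits = list(rng.integers(0, 2, size=context))
    for _ in range(length - context):
        # Encode the previous `context` bits as an integer table index.
        h = int("".join(map(str, bits[-context:])), 2)
        bits.append(int(rng.random() < table[h]))
    return np.array(bits), table
```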
Figure 13: Dynamic N-Gram Learning Curves.
Bayesian analysis (Murphy, 2012):
$$P(B = 1 \mid N_0, N_1, c) = \frac{N_1 + \tfrac{1}{2}}{N_0 + N_1 + 1}$$
where c is the five-bit previous context, B is the value of the next bit, and N_0 and N_1 are respectively the number of zeros and ones observed after c so far in the sequence. We can therefore compare NTM to the optimal predictor as well as to LSTM. To assess performance we used a validation set of 1000 sequences of length 200, sampled from the same distribution as the training data. As shown in Figure 13, NTM achieves a small but significant performance advantage over LSTM, but never quite reaches the optimum cost.
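The estimator above is straightforward to implement by maintaining per-context counts; a minimal sketch (assuming the sequence is presented as a list of bits):

```python
from collections import defaultdict

def optimal_predictions(bits, context=5):
    """Optimal Bayesian predictor for the dynamic N-Gram task:
    P(next bit = 1 | c) = (N1 + 1/2) / (N0 + N1 + 1), where N0 and N1
    count the zeros and ones observed after context c so far."""
    counts = defaultdict(lambda: [0, 0])      # context c -> [N0, N1]
    preds = []
    for t in range(context, len(bits)):
        c = tuple(bits[t - context:t])
        n0, n1 = counts[c]
        preds.append((n1 + 0.5) / (n0 + n1 + 1))
        counts[c][bits[t]] += 1               # update counts after predicting
    return preds
```

On a constant sequence such as twenty zeros, the first prediction for the all-zero context is the prior 0.5, and subsequent predictions shrink toward 0 as the zero count accumulates, which is the rapid adaptation the task probes for.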
The evolution of the two architectures' predictions as they observe new inputs is shown in Figure 14, along with the optimal predictions. Close analysis of NTM's memory usage (Figure 15) suggests that the controller uses the memory to count how many ones and zeros it has observed in different contexts, allowing it to implement an algorithm similar to the optimal estimator.
## 4.5 Priority Sort
This task tests whether the NTM can sort data, an important elementary algorithm. A sequence of random binary vectors is input to the network along with a scalar priority rating for each vector. The priority is drawn uniformly from the range [-1, 1]. The target sequence contains the binary vectors sorted according to their priorities, as depicted in Figure 16.
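To make the task concrete, here is a sketch of how one such example pair could be generated. The names, the vector length, and the descending sort direction are our assumptions for illustration; the paper does not fix these details.

```python
import numpy as np

def make_priority_sort_example(num_vectors=20, vector_size=8,
                               num_targets=16, rng=None):
    """One (input, target) pair for the priority sort task: random binary
    vectors, each paired with a priority drawn uniformly from [-1, 1].
    The target is the num_targets highest-priority vectors, emitted here
    in descending priority order (an illustrative choice)."""
    if rng is None:
        rng = np.random.default_rng()
    vectors = rng.integers(0, 2, size=(num_vectors, vector_size))
    priorities = rng.uniform(-1.0, 1.0, size=num_vectors)
    order = np.argsort(-priorities)          # indices by descending priority
    targets = vectors[order[:num_targets]]   # the highest-priority vectors, sorted
    return vectors, priorities, targets
```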
Each input sequence contained 20 binary vectors with corresponding priorities, and each target sequence was the 16 highest-priority vectors in the input.⁵ Inspection of NTM's
⁵ We limited the sort to size 16 because we were interested to see if NTM would solve the task using a binary heap sort of depth 4.
Figure 14: Dynamic N-Gram Inference. The top row shows a test sequence from the N-Gram task, and the rows below show the corresponding predictive distributions emitted by the optimal estimator, NTM, and LSTM. In most places the NTM predictions are almost indistinguishable from the optimal ones. However, at the points indicated by the two arrows, NTM makes clear mistakes, one of which is explained in Figure 15. LSTM follows the optimal predictions closely in some places but appears to diverge further as the sequence progresses; we speculate that this is due to LSTM 'forgetting' the observations at the start of the sequence.
Figure 15: NTM Memory Use During the Dynamic N-Gram Task. The red and green arrows indicate points where the same context is repeatedly observed during the test sequence ('00010' for the green arrows, '01111' for the red arrows). At each such point the same location is accessed by the read head, and then, on the next time-step, accessed by the write head. We postulate that the network uses the writes to keep count of the fraction of ones and zeros following each context in the sequence so far. This is supported by the add vectors, which are clearly anti-correlated at places where the input is one or zero, suggesting a distributed 'counter.' Note that the write weightings grow fainter as the same context is repeatedly seen; this may be because the memory records a ratio of ones to zeros, rather than absolute counts. The red box in the prediction sequence corresponds to the mistake at the first red arrow in Figure 14; the controller appears to have accessed the wrong memory location, as the previous context was '01101' and not '01111.'
Figure 16: Example Input and Target Sequence for the Priority Sort Task. The input sequence contains random binary vectors and random scalar priorities. The target sequence is a subset of the input vectors sorted by the priorities.
Figure 17: NTM Memory Use During the Priority Sort Task. Left: Write locations returned by fitting a linear function of the priorities to the observed write locations. Middle: Observed write locations. Right: Read locations.
memory use led us to hypothesise that it uses the priorities to determine the relative location of each write. To test this hypothesis we fitted a linear function of the priority to the observed write locations. Figure 17 shows that the locations returned by the linear function closely match the observed write locations. It also shows that the network reads from the memory locations in increasing order, thereby traversing the sorted sequence.
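The fit described above is an ordinary least-squares regression of write location on priority. A minimal sketch (the function name is ours):

```python
import numpy as np

def fit_write_locations(priorities, write_locations):
    """Least-squares fit of write_location ~ a * priority + b, used to test
    whether the memory address written to is a linear function of the
    input vector's priority."""
    # Design matrix with a column of priorities and a column of ones (bias).
    A = np.stack([priorities, np.ones_like(priorities)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, write_locations, rcond=None)
    return a, b
```

The hypothesised locations in Figure 17 (left) are then simply `a * priority + b` evaluated for each input vector.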
The learning curves in Figure 18 demonstrate that NTM with both feedforward and LSTM controllers substantially outperform LSTM on this task. Note that eight parallel read and write heads were needed for best performance with a feedforward controller on this task; this may reflect the difficulty of sorting vectors using only unary vector operations (see Section 3.4).
## 4.6 Experimental Details
For all experiments, the RMSProp algorithm was used for training, in the form described by Graves (2013), with momentum 0.9. Tables 1 to 3 give details of the network configurations and learning rates used in the experiments. All LSTM networks had three stacked hidden layers. Note that the number of LSTM parameters grows quadratically with
Figure 18: Priority Sort Learning Curves.
Table 1: NTM with Feedforward Controller Experimental Settings
| Task | #Heads | Controller Size | Memory Size | Learning Rate | #Parameters |
|---------------|----------|-------------------|---------------|-----------------|---------------|
| Copy | 1 | 100 | 128 × 20 | 10⁻⁴ | 17,162 |
| Repeat Copy | 1 | 100 | 128 × 20 | 10⁻⁴ | 16,712 |
| Associative | 4 | 256 | 128 × 20 | 10⁻⁴ | 146,845 |
| N-Grams | 1 | 100 | 128 × 20 | 3 × 10⁻⁵ | 14,656 |
| Priority Sort | 8 | 512 | 128 × 20 | 3 × 10⁻⁵ | 508,305 |
the number of hidden units (due to the recurrent connections in the hidden layers). This contrasts with NTM, where the number of parameters does not increase with the number of memory locations. During the training backward pass, all gradient components are clipped elementwise to the range (-10, 10).
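As a sketch of the training update, the following combines a Graves (2013)-style RMSProp step with the elementwise gradient clipping described above. The state layout and exact hyperparameter placement are our reading, not a verbatim reproduction of the paper's implementation.

```python
import numpy as np

def rmsprop_step(w, g, state, lr=1e-4, momentum=0.9, decay=0.95, eps=1e-4):
    """One parameter update: clip the gradient elementwise to (-10, 10),
    normalise it by a running estimate of its standard deviation, and
    apply it through a momentum buffer (Graves-2013-style RMSProp sketch)."""
    g = np.clip(g, -10.0, 10.0)                             # elementwise clip
    state['n'] = decay * state['n'] + (1 - decay) * g * g   # running 2nd moment
    state['m'] = decay * state['m'] + (1 - decay) * g       # running 1st moment
    scaled = g / np.sqrt(state['n'] - state['m'] ** 2 + eps)
    state['d'] = momentum * state['d'] - lr * scaled        # momentum buffer
    return w + state['d']
```

Normalising by the running variance rather than the raw second moment makes the step size insensitive to the gradient's scale, which is one reason this variant was popular for training RNNs.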
## 5 Conclusion
We have introduced the Neural Turing Machine, a neural network architecture that takes inspiration from both models of biological working memory and the design of digital computers. Like conventional neural networks, the architecture is differentiable end-to-end and can be trained with gradient descent. Our experiments demonstrate that it is capable of learning simple algorithms from example data and of using these algorithms to generalise well outside its training regime.
Table 2: NTM with LSTM Controller Experimental Settings
| Task | #Heads | Controller Size | Memory Size | Learning Rate | #Parameters |
|---------------|----------|-------------------|---------------|-----------------|---------------|
| Copy | 1 | 100 | 128 × 20 | 10⁻⁴ | 67,561 |
| Repeat Copy | 1 | 100 | 128 × 20 | 10⁻⁴ | 66,111 |
| Associative | 1 | 100 | 128 × 20 | 10⁻⁴ | 70,330 |
| N-Grams | 1 | 100 | 128 × 20 | 3 × 10⁻⁵ | 61,749 |
| Priority Sort | 5 | 2 × 100 | 128 × 20 | 3 × 10⁻⁵ | 269,038 |
Table 3: LSTM Network Experimental Settings
| Task | Network Size | Learning Rate | #Parameters |
|---------------|----------------|-----------------|---------------|
| Copy | 3 × 256 | 3 × 10⁻⁵ | 1,352,969 |
| Repeat Copy | 3 × 512 | 3 × 10⁻⁵ | 5,312,007 |
| Associative | 3 × 256 | 10⁻⁴ | 1,344,518 |
| N-Grams | 3 × 128 | 10⁻⁴ | 331,905 |
| Priority Sort | 3 × 128 | 3 × 10⁻⁵ | 384,424 |
## 6 Acknowledgments
Many have offered thoughtful insights, but we would especially like to thank Daan Wierstra, Peter Dayan, Ilya Sutskever, Charles Blundell, Joel Veness, Koray Kavukcuoglu, Dharshan Kumaran, Georg Ostrovski, Chris Summerfield, Jeff Dean, Geoffrey Hinton, and Demis Hassabis.
## References
- Baddeley, A., Eysenck, M., and Anderson, M. (2009). Memory. Psychology Press.
- Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- Barrouillet, P., Bernardin, S., and Camos, V. (2004). Time constraints and resource sharing in adults' working memory spans. Journal of Experimental Psychology: General , 133(1):83.
- Chomsky, N. (1956). Three models for the description of language. Information Theory, IEEE Transactions on , 2(3):113-124.
- Das, S., Giles, C. L., and Sun, G.-Z. (1992). Learning context-free grammars: Capabilities and limitations of a recurrent neural network with an external stack memory. In Proceedings of The Fourteenth Annual Conference of Cognitive Science Society. Indiana University .
- Dayan, P. (2008). Simple substrates for complex cognition. Frontiers in neuroscience , 2(2):255.
- Eliasmith, C. (2013). How to build a brain: A neural architecture for biological cognition . Oxford University Press.
- Fitch, W., Hauser, M. D., and Chomsky, N. (2005). The evolution of the language faculty: clarifications and implications. Cognition , 97(2):179-210.
- Fodor, J. A. and Pylyshyn, Z. W. (1988). Connectionism and cognitive architecture: A critical analysis. Cognition , 28(1):3-71.
- Frasconi, P., Gori, M., and Sperduti, A. (1998). A general framework for adaptive processing of data structures. Neural Networks, IEEE Transactions on , 9(5):768-786.
- Gallistel, C. R. and King, A. P. (2009). Memory and the computational brain: Why cognitive science will transform neuroscience , volume 3. John Wiley & Sons.
- Goldman-Rakic, P. S. (1995). Cellular basis of working memory. Neuron , 14(3):477-485.
- Graves, A. (2013). Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850 .
- Graves, A. and Jaitly, N. (2014). Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the 31st International Conference on Machine Learning (ICML-14) , pages 1764-1772.
- Graves, A., Mohamed, A., and Hinton, G. (2013). Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on , pages 6645-6649. IEEE.
- Hadley, R. F. (2009). The problem of rapid variable creation. Neural computation , 21(2):510-532.
- Hazy, T. E., Frank, M. J., and O'Reilly, R. C. (2006). Banishing the homunculus: making working memory work. Neuroscience , 139(1):105-118.
- Hinton, G. E. (1986). Learning distributed representations of concepts. In Proceedings of the eighth annual conference of the cognitive science society , volume 1, page 12. Amherst, MA.
- Hochreiter, S., Bengio, Y., Frasconi, P., and Schmidhuber, J. (2001a). Gradient flow in recurrent nets: the difficulty of learning long-term dependencies.
- Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural computation , 9(8):1735-1780.
- Hochreiter, S., Younger, A. S., and Conwell, P. R. (2001b). Learning to learn using gradient descent. In Artificial Neural Networks: ICANN 2001, pages 87-94. Springer.
- Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8):2554-2558.
- Jackendoff, R. and Pinker, S. (2005). The nature of the language faculty and its implications for evolution of language (reply to Fitch, Hauser, and Chomsky). Cognition, 97(2):211-225.
- Kanerva, P. (2009). Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors. Cognitive Computation , 1(2):139-159.
- Marcus, G. F. (2003). The algebraic mind: Integrating connectionism and cognitive science . MIT press.
- Miller, G. A. (1956). The magical number seven, plus or minus two: some limits on our capacity for processing information. Psychological review , 63(2):81.
- Miller, G. A. (2003). The cognitive revolution: a historical perspective. Trends in cognitive sciences , 7(3):141-144.
- Minsky, M. L. (1967). Computation: finite and infinite machines . Prentice-Hall, Inc.
- Murphy, K. P. (2012). Machine learning: a probabilistic perspective . MIT press.
- Plate, T. A. (2003). Holographic Reduced Representation: Distributed representation for cognitive structures . CSLI.
- Pollack, J. B. (1990). Recursive distributed representations. Artificial Intelligence , 46(1):77-105.
- Rigotti, M., Barak, O., Warden, M. R., Wang, X.-J., Daw, N. D., Miller, E. K., and Fusi, S. (2013). The importance of mixed selectivity in complex cognitive tasks. Nature , 497(7451):585-590.
- Rumelhart, D. E., McClelland, J. L., and the PDP Research Group (1986). Parallel distributed processing, volume 1. MIT Press.
- Seung, H. S. (1998). Continuous attractors and oculomotor control. Neural Networks , 11(7):1253-1258.
- Siegelmann, H. T. and Sontag, E. D. (1995). On the computational power of neural nets. Journal of computer and system sciences , 50(1):132-150.
- Smolensky, P. (1990). Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial intelligence , 46(1):159-216.
- Socher, R., Huval, B., Manning, C. D., and Ng, A. Y. (2012). Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning , pages 1201-1211. Association for Computational Linguistics.
- Sutskever, I., Martens, J., and Hinton, G. E. (2011). Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11) , pages 1017-1024.
- Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural networks. arXiv preprint arXiv:1409.3215.
- Touretzky, D. S. (1990). Boltzcons: Dynamic symbol structures in a connectionist network. Artificial Intelligence , 46(1):5-46.
- Von Neumann, J. (1945). First draft of a report on the EDVAC.
- Wang, X.-J. (1999). Synaptic basis of cortical persistent activity: the importance of NMDA receptors to working memory. The Journal of Neuroscience, 19(21):9587-9603.