Abstract and 1. Introduction
1.1 Syllogisms composition
1.2 Hardness of long compositions
1.3 Hardness of global reasoning
1.4 Our contributions
2. Results on the local reasoning barrier
2.1 Defining locality and auto-regressive locality
2.2 Transformers require low locality: formal results
2.3 Agnostic scratchpads cannot break the locality
3. Scratchpads to break the locality
3.1 Educated scratchpad
3.2 Inductive Scratchpads
Conclusion, Acknowledgments, and References
A. Further related literature
B. Additional experiments
C. Experiment and implementation details
D. Proof of Theorem 1
E. Comment on Lemma 1
F. Discussion on circuit complexity connections
G. More experiments with ChatGPT
In all of the experiments, we use GPT-2-style [29] decoder-only Transformers, trained from scratch in an autoregressive manner with the cross-entropy loss and the AdamW optimizer [90]. Our implementation uses the PyTorch framework [91] and is largely built on NanoGPT's implementation [92]. In particular, our Transformers use causal attention masking and absolute learnable positional embeddings. For most experiments, we use a small model with 6 layers, 6 heads, and an embedding dimension of 384, which results in a model with approximately 10M parameters.[12] We only change the model size in Figure 1b, where we use models with 8 layers, 8 heads, and an embedding dimension of 512 (approximately 25M parameters), and with 12 layers, 12 heads, and an embedding dimension of 768 (roughly 85M parameters).
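\ For reference, the three model sizes above can be summarized as nanoGPT-style configurations. The sketch below uses our own `GPTConfig` stand-in (the field names follow nanoGPT's conventions); the block size and vocabulary size are task-dependent placeholders, not values from the paper:

```python
# Minimal sketch of the model sizes used in the experiments, expressed as
# nanoGPT-style configs. block_size and vocab_size are placeholders.
from dataclasses import dataclass

@dataclass
class GPTConfig:
    n_layer: int
    n_head: int
    n_embd: int
    block_size: int = 1024   # task-dependent context length (placeholder)
    vocab_size: int = 1024   # task-dependent vocabulary size (placeholder)
    dropout: float = 0.0
    bias: bool = True

small  = GPTConfig(n_layer=6,  n_head=6,  n_embd=384)   # ~10M parameters
medium = GPTConfig(n_layer=8,  n_head=8,  n_embd=512)   # ~25M parameters
large  = GPTConfig(n_layer=12, n_head=12, n_embd=768)   # ~85M parameters
```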
\ For the cycles task, we use 1000 node names formatted like v123. For simplicity of analysis, we regard each node name as a single token. Apart from the node names, and for all other tasks, we treat each character as a single token.
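\ A minimal sketch of this tokenization convention (the regular expression, the `>` edge separator in the example, and the function itself are ours, for illustration only):

```python
import re

NODE_TOKEN = re.compile(r"v\d+")  # node names such as "v123" become single tokens

def tokenize(text: str, node_level: bool) -> list[str]:
    """Node names are single tokens for the cycles task; otherwise character-level."""
    if not node_level:
        return list(text)
    tokens, i = [], 0
    while i < len(text):
        m = NODE_TOKEN.match(text, i)
        if m:
            tokens.append(m.group())
            i = m.end()
        else:
            tokens.append(text[i])
            i += 1
    return tokens

# e.g. tokenize("v12>v345", node_level=True) -> ["v12", ">", "v345"]
```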
\ For the length generalization experiments, we found that the performance of the model depends on the random seed of the network, so we repeated each experiment 10 times and report the median in the plots (along with the other runs). For the other experiments, we did not observe much variation across seeds; we repeated each experiment 5 or 10 times and report 95% confidence intervals. We used various Nvidia GPU devices for running our experiments, including H100, A100, and RTX 4090. We estimate the total runtime of the experiments presented in this paper at around 200 hours (excluding hyperparameter search).
\ Our code is publicly available at https://github.com/aryol/inductive-scratchpad.
\ C.1.1 Hyperparameter tuning and sensitivity
\ In general, we tried different values for the learning rate (with different schedules), batch size, dropout, and weight decay. For each task, we used the hyperparameters that gave the most stable and fastest training. The most significant sensitivity is to the batch size: we often find that larger batch sizes help training, to the point that some tasks cannot be learned with batch sizes smaller than 256.[13] Also, the length generalization experiments are rather sensitive to the dropout and weight decay parameters. Generally, strong regularization increases the uncertainty of the model; given the large number of tokens in the scratchpad of the length generalization experiments and their sensitivity, this uncertainty can increase the probability of error. Of course, weak regularization can also result in worse generalization. In addition, for most of the experiments, we used either fresh samples or a rather large number of samples, as our objective is mostly to measure time complexity or OOD generalization. The exact hyperparameter values for each task are available in our code.
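\ To illustrate the batch-size point above and the gradient-accumulation trick of footnote [13], here is a hedged sketch with placeholder values (the model, loss, data, and listed hyperparameters are stand-ins, not the tuned settings):

```python
import torch

model = torch.nn.Linear(384, 384)  # stand-in for the Transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

micro_batch = 32     # per-step batch that fits in memory
accum_steps = 8      # effective batch size = 32 * 8 = 256

def loss_fn(out):    # placeholder loss
    return out.square().mean()

for step in range(100):
    optimizer.zero_grad(set_to_none=True)
    for _ in range(accum_steps):
        x = torch.randn(micro_batch, 384)        # placeholder batch of fresh samples
        loss = loss_fn(model(x)) / accum_steps
        loss.backward()                          # gradients accumulate across micro-batches
    optimizer.step()                             # one update with the effective batch size
```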
Here we describe the implementation of the inductive scratchpad. Assume that the question and scratchpad sequence is given by Q<s>s[1]#s[2]#…#s[k]. Note that Q and the s[i]'s can each contain several tokens. Moreover, note that Q can also include a part of the scratchpad besides the question: anything that comes before <s> acts like a permanent memory, and the model can attend to it when generating all states. For example, if some information is shared between all states, it is more efficient to put it in the scratchpad before <s> rather than including it in every state. Note that, in general, our goal is to learn only the induction function. In other words, we want to generate the tokens of the i-th state s[i] as if the sequence up to this point were only Q<s>s[i-1]#. We now explain how this can be achieved during training and generation.
\ First, we provide two solutions for the training time. One simple way is to break the original sequence Q<s>s[1]#s[2]#…#s[k] into
\ Q<s>s[1]#, Q<s>s[1]#s[2]#, . . . , Q<s>s[k-1]#s[k].
\ We also need to make sure that no loss is computed for the Q<s>s[i]# part of Q<s>s[i]#s[i+1]# for 1 ≤ i < k, which can easily be done using a loss mask. This approach ensures that the loss is computed once for the question and each state, and that each state is generated from the previous state and the question in an identical manner. Note that all of these operations are done once as a preprocessing step. Now, we describe a second implementation method for the training time. We first duplicate each state other than the last state, i.e., Q<s>s[1]#s[1]#s[2]#s[2]#…#s[k]. Next, we group the consecutive states, reindex the positions of the tokens, and design the attention masks and loss masks as follows:
\ (Table: the duplicated sequence Q<s>s[1]#s[1]#s[2]#s[2]#…#s[k] annotated with its attention groups, reindexed positional indices, and loss mask.)
\ where we have assumed that t tokens have appeared before the first state s[1]. Note that this number may vary across samples. We can easily create an attention mask based on the attention groups: each group only attends to tokens in its own group and to tokens in group 0 (everything that comes before <s>). Using the loss mask vector, we do not compute the loss for the second copy of each state, and for each state generation we reset the positional indices of the tokens. Using the appropriate loss mask, attention mask, and positional indices explained above guarantees that the loss is computed exactly once for each state, and that each state is generated from the previous state and the tokens before <s> (i.e., the question Q) in an identical manner. Note that both approaches that we discussed for the training time can be implemented easily for conventional Transformer models. Further, they do not change the models' behavior on non-inductive samples, and they work with both absolute and relative positional embeddings. Note that the first approach favors Transformers with a small block size (context length), while the second approach is more suitable when the block size is large; based on the block size, a mixture of the two approaches can be used. In our implementation, we use the second approach as we do not have an issue with the block size.
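\ As a hedged sketch of this preprocessing step (the function below, its 0-indexed positions, and the choice to place no loss on the question tokens are our own assumptions, not the exact implementation):

```python
import torch

def build_inductive_sample(q_tokens, states):
    """Second training approach: duplicate every state except the last, group
    consecutive copies, reset positions within each group, and build the
    attention and loss masks. `q_tokens` holds the tokens of Q<s> (t tokens)
    and `states` is a list of token lists [s1#, s2#, ..., sk]."""
    t = len(q_tokens)
    tokens, groups, positions, loss_mask = list(q_tokens), [0] * t, list(range(t)), [0] * t

    for i, s in enumerate(states, start=1):
        prev = states[i - 2] if i >= 2 else []        # second copy of s[i-1]; absent for group 1
        grp = list(prev) + list(s)
        tokens += grp
        groups += [i] * len(grp)
        positions += range(t, t + len(grp))           # reset: every group starts right after Q<s>
        loss_mask += [0] * len(prev) + [1] * len(s)   # loss only on the newly generated state

    g = torch.tensor(groups)
    n = len(tokens)
    causal = torch.tril(torch.ones(n, n)).bool()
    same_group = g.unsqueeze(0) == g.unsqueeze(1)
    to_group0 = (g == 0).unsqueeze(0).expand(n, n)    # every token may attend to group 0 (the question)
    attn_mask = causal & (same_group | to_group0)
    return tokens, torch.tensor(positions), attn_mask, torch.tensor(loss_mask)
```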
\ Now, we discuss token generation. Assume that we are generating the tokens of the i-th state s[i], i.e., Q<s>s[1]#s[2]#…s[i-1]# has already been generated. To obtain the inductive behavior, we need the model to generate the tokens of s[i] using only the last state and the tokens before <s>. As at training time, this can be achieved in two ways. The first method is to change the input of the model: we give Q<s>s[i-1]# as the input when generating the tokens of s[i]. Alternatively, we can keep the input intact and use an attention mask that prevents the tokens of s[i] from attending to any tokens other than those of Q<s> and s[i-1]#. Similar to the training time, one also needs to reindex the positions of the tokens of s[i-1]# and s[i] so that they appear right after Q<s>. Note that it is still possible to use key-value (KV) caching [93, 94] to increase decoding speed. KV caching stores the keys and values of previous tokens so that they are not recomputed, at the expense of memory. For KV caching, the only difficulty arises when transitioning from the (i−1)-th state to the i-th state, because the cached keys and values of the tokens of s[i-1] were computed based on s[i-2]; for the generation of s[i], however, we only attend to s[i-1] and do not want s[i-1] to attend to any of the previous states, in order to preserve the inductive behavior. One solution is to recompute the key-value pairs of the tokens of s[i-1] with the correct positional indices and attention masking at the transition from s[i-1] to s[i]. Alternatively, one can always cache two versions of the keys and values: one for the generation of the current state and one for the generation of the next state.
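\ The first of these two decoding methods can be sketched as follows (the greedy decoding, the `stop_id` convention, and the assumed `model(tokens) -> logits` interface are our simplifications, not the paper's exact code):

```python
import torch

@torch.no_grad()
def inductive_generate(model, q_tokens, sep_id, stop_id, max_states=50, max_state_len=128):
    """Each state s[i] is generated by feeding only Q<s>s[i-1]# to the model, so every
    state sees the question and the previous state in an identical way. `sep_id` is the
    '#' token, `stop_id` ends generation, and `model(tokens)` is assumed to return
    logits of shape (batch, seq, vocab)."""
    prev_state = []                          # the first state is generated from Q<s> alone
    for _ in range(max_states):
        new_state = []
        for _ in range(max_state_len):
            ctx = torch.tensor([q_tokens + prev_state + new_state])
            nxt = int(model(ctx)[0, -1].argmax())     # greedy decoding for simplicity
            if nxt == stop_id:
                return prev_state + new_state         # final state / answer reached
            new_state.append(nxt)
            if nxt == sep_id:                         # '#' closes the current state
                break
        prev_state = new_state                        # older states are discarded entirely
    return prev_state
```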
In this section, we discuss whether an inductive structure for the scratchpad can be obtained using relative positional embeddings [46, 47] instead of the explicit inductive scratchpad format introduced in this paper. For an inductive structure to work, we want each state to be generated using only the tokens of the previous state and the question, in an identical manner across states.
\ More precisely, assume that the question and scratchpad are given by Q<s>s[1]#s[2]#…#s[k] (one can also decide to remove the <s> and # tokens). For simplicity, assume that the size of all states (i.e., the number of tokens in each state) is equal to T (where we also count # if it is used). Relative positional embeddings compute the attention between two tokens based on their distance instead of their absolute positions in the sequence. Therefore, the attention pattern between s[i+1] and s[i] is similar to the attention pattern between s[i] and s[i-1], and one could hope for an inductive structure to emerge.
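\ For reference, a toy sketch of this mechanism, with a single attention head and a learned bias indexed by token distance (the function and tensor shapes are our illustration, not any specific embedding from [46, 47]):

```python
import torch

def relative_bias_attention(q, k, bias, causal=True):
    """Toy single-head attention with a relative-position bias: the extra score
    between positions i and j depends only on the distance i - j, not on the
    absolute positions themselves."""
    n, d = q.shape
    scores = (q @ k.T) / d ** 0.5
    dist = (torch.arange(n).unsqueeze(1) - torch.arange(n).unsqueeze(0)).clamp(0, bias.numel() - 1)
    scores = scores + bias[dist]                      # same bias whenever i - j is the same
    if causal:
        scores = scores.masked_fill(torch.triu(torch.ones(n, n), 1).bool(), float("-inf"))
    return torch.softmax(scores, dim=-1)

# e.g.: q = k = torch.randn(10, 16); bias = torch.zeros(10); relative_bias_attention(q, k, bias)
```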
\ There are a few obstacles, however:
\
(1) The distance between the state tokens and the question tokens increases as more states are generated. However, for an inductive structure, we want the attention pattern between each state and the question not to change. One could think of using an encoder-decoder architecture where the question is placed in the encoder, and the cross-attention between the states and the question ignores the positions of the state tokens in the decoder. However, even this approach loses some information: the attention between the scratchpad tokens and the question tokens can then no longer use the position of the scratchpad tokens within a state.
\
(2) Most relative positional embeddings [48, 49, 50] allow tokens to attend to all other tokens. Even if one uses an attention mechanism that limits attention to the D previous tokens, after L Transformer layers, tokens up to distance DL apart can attend to each other. So, in general, it is likely that the tokens of s[i] can attend to tokens of other states besides s[i-1], which hinders the inductive behavior.
\
(3) Similarly, assume that the T-th (last) token of s[i] attends to all tokens of s[i-1], including its first token. Note that the distance between the last token of s[i] and the first token of s[i-1] is 2T − 1. But then the first token of s[i] would, in the same manner, also attend to all tokens of s[i-2] except the first one (because the distance between the first token of s[i] and the second token of s[i-2] is also 2T − 1), which is undesirable; see the worked example below.
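\ For concreteness, take T = 3 and suppose s[i-2], s[i-1], and s[i] occupy positions p+1 to p+3, p+4 to p+6, and p+7 to p+9, respectively. The last token of s[i] (position p+9) is at distance 2T − 1 = 5 from the first token of s[i-1] (position p+4); but the first token of s[i] (position p+7) is then also within distance 5 of position p+2, i.e., of every token of s[i-2] except its first one.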
\ Note that in the analysis above we assumed a fixed number of tokens per state. If the number of tokens per state varies, issues like (2) and (3) above become even harder to handle for a model that relies only on relative positional embeddings. To sum up, there is only a minor similarity between relative positional embeddings and the idea of induction in the scratchpad. Factors such as pretraining and the recency bias of attention may promote inductive behavior in the model to some extent; nevertheless, there is no reason to think that the model can implement the inductive behavior for a general inductive task relying only on relative positional embeddings. In general, one can use the inductive scratchpad idea along with relative positional embeddings to get better length generalization: the inductive scratchpad can generalize in the number of induction steps, while understanding an input of growing length still requires relying on relative positional embeddings.
\ As an alternative solution, independent of whether the positional embeddings are absolute or relative, one can use the special tokens <s> and # without hard-coding their meaning and hope that the model figures out the meaning of these two tokens on its own and implements a soft inductive structure (the same attention masking, but in a soft manner). However, such an outcome is also very unlikely with moderate amounts of inductive data.
\
:::info Authors:
(1) Emmanuel Abbe, Apple and EPFL;
(2) Samy Bengio, Apple;
(3) Aryo Lotfi, EPFL;
(4) Colin Sandon, EPFL;
(5) Omid Saremi, Apple.
:::
:::info This paper is available on arXiv under a CC BY 4.0 license.
:::
[12] The exact number depends on the task and its vocabulary size.
\ [13] For some experiments, we increased the gradient accumulation steps instead of the batch size to get the same effect with less memory consumption.


