# Supplementary material for: Reinforcement learning for traversing chemical structure space: Optimizing transition states and minimum energy paths of molecules
**Authors**: Rhyan Barrett, Julia Westermayr
> Institute of Chemistry, Faculty of Chemistry and Mineralogy, University of Leipzig, Johannisallee 29, 04103 Leipzig, Germany
> Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI), Dresden/Leipzig, Germany
> julia.westermayr@uni-leipzig.de
(October 5, 2023)

## Contents
- I Loss Implementation
- II Actor Implementation Details
- III Critic Implementation Details
- IV Hyperparameters and Training
  - IV.1 PaiNN Model Training Details
  - IV.2 Claisen Rearrangement Dataset Error Graph
  - IV.3 $S_{N}2$ Dataset Error Graph
  - IV.4 Model Architecture and Training Details
- V Details of Implemented Nudged Elastic Band (NEB) Method
- VI Quantum chemical reference calculations
  - VI.1 Frequency Calculations
## I Loss Implementation
To reiterate from the paper, $A_{k+n}>0$ and $A_{k+n}<0$ correspond to the associated action being positively and negatively reinforced, respectively. The weight updates of a general actor are then constructed by weighting the policy gradient of each sample by its advantage value,
$$
\nabla_{\theta}\mathcal{L}_{a}=\nabla_{\theta}V(S_{k})=\mathop{\mathbb{E}}_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(a|s)A_{\pi_{\theta}}(s,a)]. \tag{1}
$$
The loss $\mathcal{L}_{c}$ and the weight updates for a general critic are calculated using the mean squared error (MSE) between the bootstrapped $n$-step return and the predicted value function $V(S_{k})$:
$$
\mathcal{L}_{c}=\left(\sum_{t=k}^{k+n}\gamma^{t}r_{t}+V(S_{k+n})-V(S_{k})\right)^{2}. \tag{2}
$$
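As a concrete illustration, the $n$-step TD target and the resulting critic loss of equation (2) can be sketched in plain Python. The function and variable names are ours, not from the released code; this is a minimal sketch, not the actual implementation:

```python
def critic_loss(rewards, v_start, v_end, gamma=0.99):
    """Squared n-step TD error (sketch of Eq. (2)).

    rewards: observed rewards along the rollout
    v_start: predicted value V(S_k)
    v_end:   bootstrapped value V(S_{k+n}) at the end of the rollout
    """
    # Discounted sum of observed rewards plus the bootstrapped tail.
    td_target = sum(gamma**t * r for t, r in enumerate(rewards)) + v_end
    advantage = td_target - v_start  # also used to weight the actor loss
    return advantage**2, advantage
```

The advantage returned here is exactly the quantity used to weight the policy gradients in equation (1).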
The training process in our actor-critic implementation proceeds as follows. The loss is computed using the advantage values obtained from the temporal-difference targets (TD targets). The total loss equals the sum of the actor and critic losses, $\mathcal{L}_{a}$ and $\mathcal{L}_{c}$, respectively:
$$
\mathcal{L}_{total}=\mathcal{L}_{a}+\mathcal{L}_{c} \tag{3}
$$
The total loss function for the actor, $\mathcal{L}_{policy}$, is calculated using an adapted version of the loss given below. The value $\mathcal{L}_{policy}$ weights the policy gradients by the advantages of the corresponding actions. The entropy loss, i.e., the second term in equation (4), encourages exploration by penalizing overly confident predictions. To this end, we utilize a hyperparameter $\beta$ that controls the strength of the entropy penalty and thereby the amount of exploration.
$$
\mathcal{L}_{policy}(action_{Pos})=\sum_{k=1}^{samples}\left(\sum_{i=1}^{n}A_{k,i}\,\log(\mathrm{Softmax}(t_{k,i}))+\beta\sum_{i=1}^{n}t_{k,i}\,\log(\mathrm{Softmax}(t_{k,i}))\right) \tag{4}
$$
Here, $t_{k,i}$ and $A_{k,i}$ denote the probability logits associated with the action and its advantage, respectively, for element $k$ in the batch at step $i$ in the episode. Since the actions correspond to position changes through $\alpha_{i}$ and $\beta_{i}$, values that are similar to other $\alpha_{i}$ and $\beta_{i}$ pairs will likely correspond to similar rewards. Thus, for each action chosen, we weight the corresponding policy gradients by their distance to the chosen action. More precisely, the final actor loss for a single sample becomes:
$$
\sum_{i=1}^{n}A_{k,i}\sum_{a=1}^{N_{a}}\frac{1}{1+|a^{\prime}_{k,i}-a_{k,i}|}\log(\mathrm{Softmax}(t_{k,i,a}))+\beta\sum_{i=1}^{n}t_{k,i}\,\log(\mathrm{Softmax}(t_{k,i})) \tag{5}
$$
The values $a^{\prime}_{k,i}$ and $a_{k,i}$ are the chosen action and an arbitrary action, respectively; $N_{a}$ is the total number of actions. Additionally, to generate an initial starting guess for the reaction pathway, we use geodesic interpolation followed by high-speed sampling using the model's initial distribution over the actions. This produces a guess much closer to the minimum energy pathway, meaning less training time is wasted on pathways that are not useful. The critic is trained using standard temporal-difference learning; in this case, the rewards are obtained from the $(F_{max})^{-1}$ values.
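The distance-weighted policy term together with the entropy-style penalty of equations (4) and (5) can be sketched for a single sample with numpy. All names here are illustrative assumptions, not taken from the released code:

```python
import numpy as np

def softmax(t):
    e = np.exp(t - t.max())
    return e / e.sum()

def actor_loss_sample(logits, advantages, chosen, beta=0.01):
    """Distance-weighted policy loss for one sample k (sketch of Eq. (5)).

    logits:     (n, N_a) raw outputs t_{k,i,a} over the action grid per step
    advantages: (n,) advantage A_{k,i} of the action taken at each step
    chosen:     (n,) index a'_{k,i} of the chosen action at each step
    beta:       strength of the entropy penalty
    """
    loss = 0.0
    for t, A, a_prime in zip(logits, advantages, chosen):
        log_p = np.log(softmax(t))
        a = np.arange(len(t))
        w = 1.0 / (1.0 + np.abs(a_prime - a))  # weight by closeness to the chosen action
        loss += A * np.sum(w * log_p)          # distance-weighted policy term
        loss += beta * np.sum(t * log_p)       # entropy-style penalty of Eq. (4)
    return loss
```

Actions close to the chosen one on the grid receive weights near 1, while distant actions are damped, reflecting that nearby $\alpha_{i}$, $\beta_{i}$ pairs yield similar rewards.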
## II Actor Implementation Details
Based on the representation mentioned in the paper, we aim to predict changes to the positions of atoms in each image along the reaction pathway at each step, in order to move closer to the minimum energy pathway. To achieve changes in atomic positions, we generate distributions of position changes around each atom. More precisely, around each atom in the molecule we construct a fine discrete grid of $\alpha_{i}$ and $\beta_{i}$ values that produce a linear combination of $F_{int}\perp$ and $F_{spr}\parallel$. A distribution is then constructed by the model on this grid, and the new position is sampled from the respective distribution. It is important to note that these $\alpha_{i}$ and $\beta_{i}$ values can be generated per atom in each image or be the same for every atom in the image, with the former being the more computationally intensive. In the following we describe the case where an $\alpha_{i}$ and $\beta_{i}$ value are generated for each individual atom.
In more detail, let us take the initial state $S_{t}$ of a reaction pathway, which contains the atomic types $\{Z_{i}\}$ and associated positions $\{\{R_{i_{1}}\}^{1},\{R_{i_{2}}\}^{2},...,\{R_{i_{n}}\}^{n}\}$ of all atoms in each molecule along the reaction pathway, where $n$ is the number of images and $i_{j}$ corresponds to the index of the atom in image $j$. The probability distributions of the positions of the atoms in the reaction pathway at step $T$ are given by $\{\{P(R_{i})\}_{j}\;\forall j\in[0,n]\}$. To accommodate the auto-regressive nature of the reinforcement model, we can write this as a product of conditional probabilities of the positions at each step, which can be decomposed into changes in positions:
$$
\{P(R_{i})\}_{j}=\left[\prod_{t=0}^{T}\prod_{i=0}^{N_{atoms}}P(R^{t-1}_{i,j}+\delta R^{t}_{i,j}|R^{t-1}_{i,j-1},R^{t-1}_{i,j},R^{t-1}_{i,j+1})\right]_{j}\quad\forall j\in[0,n]. \tag{6}
$$
From now on we will abbreviate $R^{t-1}_{i,j-1},R^{t-1}_{i,j},R^{t-1}_{i,j+1}$ with $Y^{t-1}_{i,j}$ . Additionally, we can decompose the distribution into the joint probability distribution of $\alpha_{i}$ and $\beta_{i}$ values from which the changes in positions for each molecule along the reaction path are constructed:
$$
\left[\prod_{t=0}^{T}\prod_{i=0}^{N_{atoms}}P(\delta R^{t}_{i,j}|Y^{t-1}_{i,j})\right]_{j}=\left[\prod_{t=0}^{T}\prod_{i=0}^{N_{atoms}}P(\alpha_{i},\beta_{i}|Y^{t-1}_{i,j})\right]_{j}\quad\forall j\in[0,n] \tag{7}
$$
Furthermore, we add a dependency of $\beta_{i}$ on the chosen $\alpha_{i}$ value. The joint distribution is then rewritten so that $\beta_{i}$ is conditioned on $\alpha_{i}$:
$$
\left[\prod_{t=0}^{T}\prod_{i=0}^{N_{atoms}}P(\alpha_{i},\beta_{i}|Y^{t-1}_{i,j})\right]_{j}=\left[\prod_{t=0}^{T}\prod_{i=0}^{N_{atoms}}P(\beta_{i}|Y^{t-1}_{i,j},\alpha_{i})P(\alpha_{i}|Y^{t-1}_{i,j})\right]_{j}\quad\forall j\in[0,n]. \tag{8}
$$
These distributions are produced as an output of the actor model and combined into a categorical distribution over a fine grid. A new position is then sampled from this grid for each atom in every image. The methodology is then implemented into a deep learning architecture as seen in the paper.
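Sampling from the factorisation in equation (8) amounts to first drawing an $\alpha_{i}$ grid point and then a $\beta_{i}$ grid point conditioned on it. A hypothetical numpy sketch (grid sizes and function names are ours):

```python
import numpy as np

def sample_alpha_beta(p_alpha, p_beta_given_alpha, rng):
    """Draw (alpha, beta) grid indices from P(beta|alpha)P(alpha), cf. Eq. (8).

    p_alpha:            (G,) categorical distribution over the alpha grid
    p_beta_given_alpha: (G, G) row a holds P(beta | alpha = a)
    """
    a = rng.choice(len(p_alpha), p=p_alpha)  # sample alpha first
    b = rng.choice(p_beta_given_alpha.shape[1], p=p_beta_given_alpha[a])
    return a, b
```

In the per-atom variant this sampling step is repeated for every atom in every image of the band.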
## III Critic Implementation Details
As mentioned in the paper, the critic model employs the same PaiNN representation layers as the actor model, with the exception that the final layers are used to provide an estimate of the value function
$$
V^{\pi}(S_{k})=\mathop{\mathbb{E}}_{\pi}(R_{k}|S_{k}=S)=\mathop{\mathbb{E}}_{\pi}\left[\sum_{t=k}^{\infty}\gamma^{t}r_{t}\,\Big|\,S_{k}=S\right]. \tag{9}
$$
At each step in the auto-regressive structure of the RL algorithm, the reward is obtained from the NEB calculation using the $F_{max}$ quantity and a conventional PaiNN model. The atom and image at which $F_{max}$ occurs in future steps are highly uncertain. However, the value function can be decomposed as the probability of each atom causing the largest force multiplied by the estimated force in that case, i.e., the expected $F_{max}$ contribution of a particular atom. More formally, the value function decomposes as follows:
$$
\mathop{\mathbb{E}}_{\pi}\left[\sum_{t=k}^{\infty}\gamma^{t}r_{t}\,\Big|\,S_{t}=S\right]=\sum_{j=0}^{n}\sum_{i=0}^{N_{atoms}}\left[\sum_{t=k}^{\infty}\gamma^{t}\mathop{\mathbb{E}}_{\pi}(f_{max}^{z}|S_{t}=S)\right]. \tag{10}
$$
Using the shared representation layer between the actor and critic models, we then predict the expected value of each atom in each image, as shown in Figure 1. The resulting expectations are summed in the final layers to give an estimate of the value function.
Figure 1: Critic architecture. The PaiNN model provides an initial feature representation, $(\mathbf{x_{1}},...,\mathbf{x_{m}})^{j}$ with $j$ referring to the index of the image. For each image in the reaction path we pass the feature vectors for each atom through a series of atom-wise layers. The output contains the expectation of each atom in a molecule. The expectations are then summed over all atoms in each image and finally over all images to deliver the final expectation.
## IV Hyperparameters and Training
All experiments and training runs were carried out on an NVIDIA RTX A2000 12 GB GPU as well as 20 cores of an Intel i7-12700 (12th Gen) CPU.
### IV.1 PaiNN Model Training Details
The datasets used for training the models are freely available and comprise the allyl-p-tolyl ether Claisen rearrangement reaction [5] as well as $S_{N}2$ reactions [15]. The data were generated at the PBE0/def2-TZVP [1, 17] and DSD-BLYP-D3(BJ)/def2-TZVP [9, 6] levels of theory, respectively. The target properties used for training on the Claisen rearrangement and $S_{N}2$ reactions were energies and forces. The loss weighting of energies and forces was 0.02 and 0.98, respectively, for both datasets. Each dataset was split randomly into training, validation, and test sets. For the Claisen rearrangement reaction, 50000 data points were used for training, 6000 for validation, and 5000 for testing, with a batch size of 50. For the $S_{N}2$ dataset, 350000 data points were used for training, 50000 for validation, and 52709 for testing, with a batch size of 500.
For both datasets a cutoff function was applied. In addition, for the Claisen rearrangement reaction, the mean of each target property was subtracted from the dataset. Both models were built using SchNetPack [13] in combination with the equivariant PaiNN [12] representation. The hyperparameters of the models used for the Claisen rearrangement and $S_{N}2$ reactions are stated in Table 1.
Table 1: Hyperparameters of the final PaiNN models trained on the Claisen rearrangement reaction and $S_{N}2$ reactions.
| Parameter | Claisen rearrangement | $S_{N}2$ Reaction |
| --- | --- | --- |
| Representation | PaiNN | PaiNN |
| n_atom_basis | 64 | 128 |
| n_interactions | 9 | 9 |
| Radial basis function | Gaussian | Gaussian |
| n_rbf | 20 | 20 |
| cutoff_fn | CosineCutoff | CosineCutoff |
| cutoff | 10 | 10 |
| loss_fn | MSE | MSE |
| Optimizer | AdamW | AdamW |
| Learning rate | $10^{-4}$ | $10^{-4}$ |
Different hyperparameters were tested, starting from the default values of PaiNN and changing them such that a good compromise between accuracy and computational efficiency could be achieved. In SchNetPack, the interaction layers are followed by a series of atom-wise dense layers [14], where each following layer is composed of half the number of neurons of the previous one until 1 is reached, i.e., 64, 32, 16, 8, 4, 2, 1.
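The halving scheme for the atom-wise head can be written down explicitly; a small sketch (the helper name is ours, not part of SchNetPack):

```python
def halving_widths(n_in=64):
    """Neuron counts of the atom-wise dense stack: each layer has half the
    neurons of the previous one until a single output neuron is reached."""
    widths = [n_in]
    while widths[-1] > 1:
        widths.append(widths[-1] // 2)
    return widths
```

For `n_in=64` this reproduces the sequence 64, 32, 16, 8, 4, 2, 1 stated above; for the $S_{N}2$ model with 128 features the stack simply gains one extra layer.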
Figures 2 and 3 show the energies and forces in the test datasets plotted against the energies and forces predicted by the PaiNN [12] models.
### IV.2 Claisen Rearrangement Dataset Error Graph
The test set consists of 5000 data points. The original dataset reports energies in Hartree and forces in Hartree/Bohr. The mean absolute errors (MAEs) of the energies and forces are 0.0003438 Ha and 0.000316 Ha/Bohr, respectively.
Figure 2: Claisen rearrangement reaction dataset: Scatter plot of a) energies and b) forces showing the test data against the values predicted by the PaiNN model.
### IV.3 $S_{N}2$ Dataset Error Graph
The test set consists of 52709 data points, with energies measured in eV and forces in eV/Å. The MAEs of the energies and forces are 0.03756 eV and 0.004517 eV/Å, respectively.
Figure 3: $S_{N}2$ dataset: Scatter plots of a) energies and b) forces showing the test dataset against the values predicted by the PaiNN model.
### IV.4 Model Architecture and Training Details
The model utilizes the pretrained PaiNN representation for each dataset, introduced in the previous section, as its initial representation. These representation vectors remain constant throughout the training episodes of the model.
Figure 4: Actor architecture. The PaiNN representation forms the initial feature vectors for the actor. The positions of each atom in the current image are passed through a self interaction layer and a dense layer to produce the $\alpha_{i}$ distribution. A similar process is used to construct the associated $\beta_{i}$ distribution. The two distributions are then sampled from to produce the change in positions at the current time step.
We show the actor architecture again in Figure 4 for better understanding of the following text. The self-interaction layers of the model are composed of multi-head attention [16], with one head specifically designed to combine neighboring image representation vectors. The atom-wise dense layers follow the self-interaction layers, maintaining the construction discussed in the previous section: each dense layer is composed of half the number of neurons of the previous layer. The output values are then sampled from the generated distributions. All energies and forces throughout were predicted using the PaiNN model from the previous section.
To streamline performance, the spring force vector is precalculated and passed as an input to the model. In addition, the internal force vector, derived using the pretrained PaiNN model, serves as a further input. Incorporating both force vectors reduces the computational time needed for training. For the Claisen rearrangement reaction and the $S_{N}2$ reactions, the model utilizes the hyperparameters provided in Table 2.
Table 2: Hyperparameters used for the model when training on the Claisen rearrangement reaction and $S_{N}2$ reactions.
| Parameter | Claisen rearrangement reaction | $S_{N}2$ reactions |
| --- | --- | --- |
| Batch size | 10 | 20 |
| Grid size | $10\times 10$ | $10\times 10$ |
| $\alpha,\beta$ range | 100 | 3 |
| Spring constant ($k$) | 0.1 | 0.1 |
| Learning rate | $10^{-4}$ | $10^{-4}$ |
| Preoptimise length | 50 | 0 |
| Episode length | 10 | 10 |
| Optimizer | Adam | Adam |
During the training period, two pathways found by the model are stored: the pathways corresponding to the minimum activation energy and to the minimum $F_{max}$ found across all episodes. Due to the stochastic sampling procedure, it is worth mentioning that results are likely not exactly reproducible, but only within a certain statistical accuracy. For better reproducibility, however, the associated trained PaiNN models are provided in addition to this document and code.
## V Details of Implemented Nudged Elastic Band (NEB) Method
The reaction pathway is a crucial concept in understanding chemical reactions. It provides insights into the sequence of structural changes occurring between the initial and final states of a reaction. In order to construct a reaction pathway, a method called the nudged elastic band (NEB) [8] is commonly used.
The reaction pathway is formed starting from the start and end structures and using a series of $n+1$ images connected by virtual spring forces, forming an elastic band. The elastic band with $n+1$ images can be written as $\{\textbf{R}_{0},\textbf{R}_{1},...,\textbf{R}_{n-1},\textbf{R}_{n}\}$, where $\textbf{R}_{0}$ and $\textbf{R}_{n}$ are the initial and final structures, respectively. The initial and final images correspond to specific molecular configurations at local energy minima. The positions of the endpoints are thus held fixed so that their energies remain constant throughout the process.
To connect these images and simulate the structural changes along the pathway, virtual spring forces are introduced. The key component of these spring forces is the tangent vector, denoted as $\tau_{i}$ , which describes the direction of the pathway at each image. Initially, a simple estimate of the tangent vector can be obtained by calculating the normalized line segments between neighboring images and summing them:
$$
\tau_{i}=\frac{\textbf{R}_{i}-\textbf{R}_{i-1}}{|\textbf{R}_{i}-\textbf{R}_{i-1}|}+\frac{\textbf{R}_{i+1}-\textbf{R}_{i}}{|\textbf{R}_{i+1}-\textbf{R}_{i}|} \tag{11}
$$
However, this basic approach can lead to the formation of kinks or irregularities in the pathway; more details can be found in Ref. [8]. To address this issue, a more sophisticated tangent definition is used. Let $i_{max}$ be the index of the image with the maximum potential energy; this corresponds to the transition state.
$$
\begin{gathered}
\tau_{i}^{-}=\frac{\textbf{R}_{i}-\textbf{R}_{i-1}}{|\textbf{R}_{i}-\textbf{R}_{i-1}|}\quad i<i_{max}\\
\tau_{i}^{+}=\frac{\textbf{R}_{i+1}-\textbf{R}_{i}}{|\textbf{R}_{i+1}-\textbf{R}_{i}|}\quad i>i_{max}\\
\tau_{i_{max}}=\frac{\textbf{R}_{i}-\textbf{R}_{i-1}}{|\textbf{R}_{i}-\textbf{R}_{i-1}|}+\frac{\textbf{R}_{i+1}-\textbf{R}_{i}}{|\textbf{R}_{i+1}-\textbf{R}_{i}|}\quad i=i_{max}
\end{gathered} \tag{12}
$$
With the tangent vectors established, the spring forces between neighboring images can be calculated. These forces are parameterized by a spring constant, denoted $k$, which quantifies the strength of the interaction between neighboring images. The spring force, denoted $\textbf{F}^{i}_{spr}\parallel$, can be expressed as:
$$
\textbf{F}^{i}_{spr}\parallel=k(|\textbf{R}_{i+1}-\textbf{R}_{i}|-|\textbf{R}_{i}-\textbf{R}_{i-1}|)\tau_{i} \tag{13}
$$
Additionally, the perpendicular force, denoted $\textbf{F}_{int}\perp$, which arises from the internal molecular forces, is considered. This term is defined as the component of the internal force vector acting orthogonal to the pathway. The combined force acting on each image, denoted $\textbf{F}^{i}$, is then the sum of the spring force and the internal molecular force acting perpendicular to the pathway:
$$
\begin{gathered}
\textbf{F}^{i}=\textbf{F}^{i}_{spr}\parallel+\textbf{F}_{int}\perp\\
\text{with }\textbf{F}_{int}\perp=-\nabla E(R_{i})\perp=-\left(\nabla E(R_{i})-\nabla E(R_{i})\parallel\right).
\end{gathered} \tag{14}
$$
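The per-image force of equations (13) and (14) can be assembled with numpy, taking the internal force as the negative energy gradient, as is conventional in NEB (all names are illustrative):

```python
import numpy as np

def neb_force(R_prev, R_i, R_next, grad_E, tau, k=0.1):
    """NEB force on one image: spring force along the tangent (Eq. (13)) plus
    the internal force component perpendicular to it (Eq. (14))."""
    tau_hat = tau / np.linalg.norm(tau)
    f_spr = k * (np.linalg.norm(R_next - R_i)
                 - np.linalg.norm(R_i - R_prev)) * tau_hat
    grad_par = np.dot(grad_E, tau_hat) * tau_hat  # gradient component along the path
    f_int_perp = -(grad_E - grad_par)             # internal force = -grad E, perpendicular part
    return f_spr + f_int_perp
```

The model's modified force of equation (15) then simply rescales the two components, $\alpha_{i}$ times the perpendicular internal force plus $\beta_{i}$ times the spring force, before the BFGS step.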
After each episode, the molecular positions are optimised. In the NEB experiments performed, the optimisation was done using the BFGS algorithm [18]. By applying BFGS to the force vector, one can iteratively adjust the positions of the images along the reaction coordinate, seeking the minimum energy pathway. The algorithm uses the current force vector and an approximated Hessian matrix to determine the search direction and step length of each iteration. The positions of the images are updated accordingly, aiming to converge towards a pathway with a lower reaction barrier; more information can be found in Ref. [18]. The model constructs a new force vector from the predicted $\alpha_{i}$ and $\beta_{i}$ values;
$$
F^{i}=\alpha_{i}F_{int}\perp+\beta_{i}F^{i}_{spr}\parallel \tag{15}
$$
One step is then taken with the BFGS algorithm to generate the new positions given the relevant force vectors. The BFGS algorithm and the script to calculate the $F^{i}$ values from the $\alpha_{i}$ and $\beta_{i}$ values use the Atomic Simulation Environment (ASE) [10].
## VI Quantum chemical reference calculations
All quantum chemical calculations were conducted using the ORCA 5.0.3 [11] program suite. For the SCF cycles, the convergence level was set to tight SCF and the integration grid to defgrid2. For the Claisen rearrangement dataset, all calculations were conducted using the TZVP basis set and the PBE0 [1] functional. For the $S_{N}2$ dataset, all calculations were performed with the TZVP basis set, the DSD-BLYP functional [9], and the D3BJ [6] dispersion correction. Details of the methods used in constructing the respective datasets for the Claisen rearrangement reaction and the $S_{N}2$ reactions can be found in Refs. [5] and [15], respectively.
The minimum energy pathway for the Claisen rearrangement reaction was calculated using the NEB implementation in ORCA. Initially the NEB was run with the PBE functional [4] and SVP basis set. This pathway was used as a starting point for the following NEB. This NEB was done using the TZVP basis set with the PBE0 functional and D4 dispersion correction [3]. The NEB calculations were run using the climbing image implementation [7].
To calculate the root mean squared deviation (RMSD) between the generated transition states and the reference transition states, we performed a transition state optimisation with DFT. For the $S_{N}2$ reactions, the transition states predicted by the model were taken as initial starting points for the transition state optimisation. An eigenvalue-following algorithm was applied using the OptTS option in ORCA. The TZVP basis set was used along with the RI-C auxiliary basis set [2]. Again, the DSD-BLYP functional and the D3BJ dispersion correction were used.
### VI.1 Frequency Calculations
The frequency calculations of the generated structures were performed using the DSD-BLYP functional, the D3BJ dispersion correction, the TZVP basis set, and the RI-C auxiliary basis set. The frequencies were computed numerically. The frequencies associated with each of the $S_{N}2$ reactions of the following form are listed in Table 3.
$$
X^{-}+H_{3}C{-}Y\rightarrow X{-}CH_{3}+Y^{-}\qquad X,Y\in\{F,Cl,Br,I\}
$$
Table 3: Normal modes of the found transition state for the $S_{N}2$ reactions.
| Degrees of Freedom | $X=Cl$ , $Y=Br$ [cm ${}^{-1}$ ] | $X=Cl$ , $Y=I$ [cm ${}^{-1}$ ] | $X=Br$ , $Y=I$ [cm ${}^{-1}$ ] |
| --- | --- | --- | --- |
| 0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 |
| 5 | 0.0 | 0.0 | 0.0 |
| 6 | -795.92 | -718.09 | -604.47 |
| 7 | 142.54 | 132.18 | 104.66 |
| 8 | 149.33 | 157.69 | 134.27 |
| 9 | 150.75 | 160.16 | 137.94 |
| 10 | 791.66 | 801.02 | 801.50 |
| 11 | 792.68 | 802.21 | 803.23 |
| 12 | 895.63 | 907.19 | 885.53 |
| 13 | 1387.45 | 1392.63 | 1390.87 |
| 14 | 1387.92 | 1394.92 | 1391.26 |
| 15 | 3352.51 | 3336.25 | 3338.38 |
| 16 | 3575.85 | 3553.72 | 3560.16 |
| 17 | 3577.45 | 3556.60 | 3560.77 |
The frequency calculation for the transition state of the Claisen rearrangement reaction generated by the model was performed with the TZVP basis set and the PBE0 functional. As for the $S_{N}2$ reactions, all frequencies were computed numerically. The molecule has 63 normal modes; the results are therefore provided in an additional document. The structure has 6 negative frequencies, which reveals that it is not exactly the transition state. However, most of these frequencies are very small in magnitude, i.e., only two are below $-300$ cm$^{-1}$, which indicates that the structure is close to the transition state. In fact, transition states found with NEB and quantum chemical reference methods often comprise more than one negative frequency. The RMSD between the calculated transition state and the transition state obtained with NEB using DFT is about 0.79 Å, which is reasonable given the many degrees of freedom of the molecule.
## References
- [1] Carlo Adamo and Vincenzo Barone. Toward reliable density functional methods without adjustable parameters: The pbe0 model. J. Chem. Phys., 110(13):6158–6170, 1999.
- [2] Francesco Aquilante, Roland Lindh, and Thomas Bondo Pedersen. Unbiased auxiliary basis sets for accurate two-electron integral approximations. J. Chem. Phys., 127(11):114107, 2007.
- [3] Eike Caldeweyher, Sebastian Ehlert, Andreas Hansen, Hagen Neugebauer, Sebastian Spicher, Christoph Bannwarth, and Stefan Grimme. A generally applicable atomic-charge dependent london dispersion correction. J. Chem. Phys., 150(15):154122, 2019.
- [4] Matthias Ernzerhof and Gustavo E Scuseria. Assessment of the perdew–burke–ernzerhof exchange-correlation functional. J. Chem. Phys., 110(11):5029–5036, 1999.
- [5] Michael Gastegger, Kristof T Schütt, and Klaus-Robert Müller. Machine learning of solvent effects on molecular spectra and reactions. Chem. Sci., 12(34):11473–11483, 2021.
- [6] Lars Goerigk. A comprehensive overview of the dft-d3 london-dispersion correction. Non-Covalent Interactions in Quantum Chemistry and Physics, pages 195–219, 2017.
- [7] Graeme Henkelman, Blas P Uberuaga, and Hannes Jónsson. A climbing image nudged elastic band method for finding saddle points and minimum energy paths. J. Chem. Phys., 113(22):9901–9904, 2000.
- [8] Hannes Jónsson, Greg Mills, and Karsten W Jacobsen. Nudged elastic band method for finding minimum energy paths of transitions. In Classical and quantum dynamics in condensed phase simulations, pages 385–404. World Scientific, 1998.
- [9] Sebastian Kozuch, David Gruzman, and Jan ML Martin. Dsd-blyp: A general purpose double hybrid density functional including spin component scaling and dispersion correction. J. Phys. Chem. C., 114(48):20801–20808, 2010.
- [10] Ask Hjorth Larsen, Jens Jørgen Mortensen, Jakob Blomqvist, Ivano E Castelli, Rune Christensen, Marcin Dułak, Jesper Friis, Michael N Groves, Bjørk Hammer, Cory Hargus, et al. The atomic simulation environment—a python library for working with atoms. J. Phys. Condens., 29(27):273002, 2017.
- [11] Frank Neese. The orca program system. WIREs: Comput. Mol. Sci., 2(1):73–78, 2012.
- [12] Kristof Schütt, Oliver Unke, and Michael Gastegger. Equivariant message passing for the prediction of tensorial properties and molecular spectra. In International Conference on Machine Learning, pages 9377–9388. PMLR, 2021.
- [13] Kristof T Schütt, Stefaan SP Hessmann, Niklas WA Gebauer, Jonas Lederer, and Michael Gastegger. Schnetpack 2.0: A neural network toolbox for atomistic machine learning. arXiv:2212.05517, 2022.
- [14] Kristof T Schütt, Huziel E Sauceda, P-J Kindermans, Alexandre Tkatchenko, and K-R Müller. Schnet–a deep learning architecture for molecules and materials. Chem. Phys., 148(24), 2018.
- [15] Oliver T Unke and Markus Meuwly. Physnet: A neural network for predicting energies, forces, dipole moments, and partial charges. J. Chem. Theory Comput., 15(6):3678–3693, 2019.
- [16] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
- [17] Florian Weigend and Reinhart Ahlrichs. Balanced basis sets of split valence, triple zeta valence and quadruple zeta valence quality for h to rn: Design and assessment of accuracy. Phys. Chem. Chem. Phys., 7(18):3297–3305, 2005.
- [18] Ya-xiang Yuan. A modified bfgs algorithm for unconstrained optimization. IMA J. Numer. Anal., 11(3):325–332, 1991.