# Automatic Change-Point Detection in Time Series via Deep Learning
**Authors**: Jie Li (address for correspondence: Department of Statistics, London School of Economics and Political Science, London, WC2A 2AE; email: j.li196@lse.ac.uk), Paul Fearnhead, Piotr Fryzlewicz, Tengyao Wang
> Department of Statistics, London School of Economics and Political Science, London, UK
> Department of Mathematics and Statistics, Lancaster University, Lancaster, UK
Abstract
Detecting change-points in data is challenging because of the range of possible types of change and types of behaviour of data when there is no change. Statistically efficient methods for detecting a change will depend on both of these features, and it can be difficult for a practitioner to develop an appropriate detection method for their application of interest. We show how to automatically generate new offline detection methods based on training a neural network. Our approach is motivated by many existing tests for the presence of a change-point being representable by a simple neural network, and thus a neural network trained with sufficient data should have performance at least as good as these methods. We present theory that quantifies the error rate for such an approach, and how it depends on the amount of training data. Empirical results show that, even with limited training data, its performance is competitive with the standard CUSUM-based classifier for detecting a change in mean when the noise is independent and Gaussian, and can substantially outperform it in the presence of auto-correlated or heavy-tailed noise. Our method also shows strong results in detecting and localising changes in activity based on accelerometer data.
Keywords— Automatic statistician; Classification; Likelihood-free inference; Neural networks; Structural breaks; Supervised learning
[To be read before The Royal Statistical Society at the Society’s 2023 annual conference held in Harrogate on Wednesday, September 6th, 2023, the President, Dr Andrew Garrett, in the Chair.]
[Accepted (with discussion), to appear]
1 Introduction
Detecting change-points in data sequences is of interest in many application areas such as bioinformatics (Picard et al., 2005), climatology (Reeves et al., 2007), signal processing (Haynes et al., 2017) and neuroscience (Oh et al., 2005). In this work, we are primarily concerned with the problem of offline change-point detection, where the entire data set is available to the analyst beforehand. Over the past few decades, various methodologies have been extensively studied in this area; see Killick et al. (2012); Jandhyala et al. (2013); Fryzlewicz (2014, 2023); Wang and Samworth (2018); Truong et al. (2020) and references therein. Most research on change-point detection has concentrated on detecting and localising different types of change, e.g. change in mean (Killick et al., 2012; Fryzlewicz, 2014), variance (Gao et al., 2019; Li et al., 2015), median (Fryzlewicz, 2021) or slope (Baranowski et al., 2019; Fearnhead et al., 2019), amongst many others. Many change-point detection methods are based upon modelling data when there is no change and when there is a single change, and then constructing an appropriate test statistic to detect the presence of a change (e.g. James et al., 1987; Fearnhead and Rigaill, 2020).

The form of a good test statistic will vary with our modelling assumptions and the type of change we wish to detect. This can lead to difficulties in practice. If we use new models, it is unlikely that there will be a change-point detection method specifically designed for our modelling assumptions. Furthermore, developing an appropriate method under a complex model may be challenging, while in some applications an appropriate model for the data may be unclear but we may have substantial historical data that shows what patterns of data to expect when there is, or is not, a change. In these scenarios, a practitioner would currently need to choose the existing change detection method that seems most appropriate for the type of data they have and the type of change they wish to detect. To obtain reliable performance, they would then need to adapt its implementation, for example by tuning the choice of threshold for detecting a change. Often, this would involve applying the method to simulated or historical data.

To address the challenge of automatically developing new change detection methods, this paper is motivated by the question: can we construct new test statistics for detecting a change based only on having labelled examples of change-points? We show that this is indeed possible by training a neural network to classify whether or not a data set has a change of interest. This turns change-point detection into a supervised learning problem. A key motivation for our approach is that many common test statistics for detecting changes, such as the CUSUM test for detecting a change in mean, can be represented by simple neural networks. This means that, with sufficient training data, the classifier learnt by such a neural network will give performance at least as good as the classifiers corresponding to these standard tests. In scenarios where a standard test, such as CUSUM, is being applied but its modelling assumptions do not hold, we can expect the classifier learnt by the neural network to outperform it.

There has been increasing recent interest in whether ideas from machine learning, and methods for classification, can be used for change-point detection.
Within computer science and engineering, these include a number of methods designed for, and that show promise on, specific applications (e.g. Ahmadzadeh, 2018; De Ryck et al., 2021; Gupta et al., 2022; Huang et al., 2023). Within statistics, Londschien et al. (2022) and Lee et al. (2023) consider training a classifier as a way to estimate the likelihood-ratio statistic for a change. However, these methods train the classifier in an unsupervised way on the data being analysed, using the idea that a classifier would more easily distinguish between two segments of data if they are separated by a change-point. Chang et al. (2019) use simulated data to help tune a kernel-based change detection method. Methods that use historical, labelled data have been used to train the tuning parameters of change-point algorithms (e.g. Hocking et al., 2015; Liehrmann et al., 2021). Also, neural networks have been employed to construct similarity scores of new observations to learned pre-change distributions for online change-point detection (Lee et al., 2023). However, we are unaware of any previous work using historical, labelled data to develop offline change-point methods. As such, and for simplicity, we focus on the most fundamental aspect, namely the problem of detecting a single change. Detecting and localising multiple changes is considered in Section 6 when analysing activity data.

We remark that by viewing the change-point detection problem as a classification problem instead of a testing problem, we aim to control the overall misclassification error rate instead of handling the Type I and Type II errors separately. In practice, asymmetric treatment of the two error types can be achieved by suitably re-weighting misclassification in the two directions in the training loss function. The method we develop has parallels with likelihood-free inference methods (Gourieroux et al., 1993; Beaumont, 2019) in that one application of our work is to use the ability to simulate from a model so as to circumvent the need to analytically calculate likelihoods. However, the approach we take is very different from standard likelihood-free methods, which tend to use simulation to estimate the likelihood function itself. By comparison, we directly target learning a function of the data that can discriminate between instances that do or do not contain a change (though see Gutmann et al., 2018, for likelihood-free methods based on re-casting the likelihood as a classification problem). For an introduction to the statistical aspects of neural network-based classification, albeit not specifically in a change-point context, see Ripley (1994).

We now briefly introduce our notation. For any $n∈\mathbb{Z}^{+}$, we define $[n]\coloneqq\{1,...,n\}$. We take all vectors to be column vectors unless otherwise stated. Let $\boldsymbol{1}_{n}$ be the all-one vector of length $n$. Let $\mathbbm{1}\{\cdot\}$ represent the indicator function. The symbol $|\cdot|$ represents the absolute value or the cardinality of its argument, depending on the context. For a vector $\boldsymbol{x}=(x_{1},...,x_{n})^{\top}$, we define its $p$-norm as $\|\boldsymbol{x}\|_{p}\coloneqq\big(\sum_{i=1}^{n}|x_{i}|^{p}\big)^{1/p}$ for $p≥1$; when $p=∞$, define $\|\boldsymbol{x}\|_{∞}\coloneqq\max_{i}|x_{i}|$. All proofs, as well as additional simulations and real data analyses, appear in the supplement.
2 Neural networks
The initial focus of our work is on the binary classification problem for whether a change-point exists in a given time series. We will work with multilayer neural networks with Rectified Linear Unit (ReLU) activation functions and binary output. The multilayer neural network consists of an input layer, hidden layers and an output layer, and can be represented by a directed acyclic graph, see Figure 1.
Figure 1: A neural network with 2 hidden layers and width vector $\mathbf{m}=(4,4)$ .
Let $L∈\mathbb{Z}^{+}$ represent the number of hidden layers and $\boldsymbol{m}=(m_{1},...,m_{L})^{\top}$ the vector of hidden layer widths, i.e. $m_{i}$ is the number of nodes in the $i$th hidden layer. For a neural network with $L$ hidden layers we use the convention that $m_{0}=n$ and $m_{L+1}=1$. For any bias vector $\boldsymbol{b}=(b_{1},b_{2},...,b_{r})^{\top}∈\mathbb{R}^{r}$, define the shifted activation function $\sigma_{\boldsymbol{b}}:\mathbb{R}^{r}→\mathbb{R}^{r}$:
$$
\sigma_{\boldsymbol{b}}((y_{1},\ldots,y_{r})^{\top})=(\sigma(y_{1}-b_{1}),\ldots,\sigma(y_{r}-b_{r}))^{\top},
$$
where $\sigma(x)=\max(x,0)$ is the ReLU activation function. The neural network can be mathematically represented by the composite function $h:\mathbb{R}^{n}→\{0,1\}$ as
$$
h(\boldsymbol{x})\coloneqq\sigma^{*}_{\lambda}W_{L}\sigma_{\boldsymbol{b}_{L}}W_{L-1}\sigma_{\boldsymbol{b}_{L-1}}\cdots W_{1}\sigma_{\boldsymbol{b}_{1}}W_{0}\boldsymbol{x}, \tag{1}
$$
where $\sigma^{*}_{\lambda}(x)=\mathbbm{1}\{x>\lambda\}$, $\lambda>0$ and $W_{\ell}∈\mathbb{R}^{m_{\ell+1}× m_{\ell}}$ for $\ell∈\{0,...,L\}$ represent the weight matrices. We define the function class $\mathcal{H}_{L,\boldsymbol{m}}$ to be the class of functions $h(\boldsymbol{x})$ with $L$ hidden layers and width vector $\boldsymbol{m}$. The output layer in (1) employs the shifted Heaviside function $\sigma^{*}_{\lambda}(x)$, which is used for binary classification as the final activation function. This choice is guided by the fact that we use the 0-1 loss, which focuses on the percentage of samples assigned to the correct class, a natural performance criterion for binary classification. Besides its wide adoption in machine learning practice, another advantage of using the 0-1 loss is that it is possible to utilise the theory of the Vapnik–Chervonenkis (VC) dimension (see, e.g. Shalev-Shwartz and Ben-David, 2014, Definition 6.5) to bound the generalisation error of a binary classifier equipped with this loss; indeed, this is the approach we take in this work. Relevant results on the VC dimension of neural network classifiers can be found in, e.g., Bartlett et al. (2019). As in Schmidt-Hieber (2020), we work with the exact minimiser of the empirical risk. In both binary and multiclass classification, it is possible to work with other losses which make it computationally easier to minimise the corresponding risk; see, e.g., Bos and Schmidt-Hieber (2022), who use a version of the cross-entropy loss. However, loss functions different from the 0-1 loss make it impossible to use VC-dimension arguments to control the generalisation error, and more involved arguments, such as those using the covering number (Bos and Schmidt-Hieber, 2022), need to be used instead. We do not pursue these generalisations in the current work.
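To make the architecture in (1) concrete, the following sketch (ours, not from the paper; assuming PyTorch, with the hypothetical class name `ChangePointMLP`) builds a fully connected ReLU network with width vector $\boldsymbol{m}$ and a thresholded scalar output. In practice (see Section 5) training minimises a differentiable surrogate such as the cross-entropy loss, and the hard threshold $\sigma^{*}_{\lambda}$ is only applied when classifying.

```python
import torch
import torch.nn as nn

class ChangePointMLP(nn.Module):
    """Fully connected ReLU network representing the class H_{L,m} of (1).

    Hidden layers use the (shifted) ReLU activation; at prediction time the
    scalar output is thresholded at `lam`, mimicking the shifted Heaviside
    output activation sigma*_lambda.
    """

    def __init__(self, n, widths, lam=0.0):
        super().__init__()
        dims = [n] + list(widths)
        self.hidden = nn.ModuleList(
            nn.Linear(dims[i], dims[i + 1]) for i in range(len(widths))
        )
        self.out = nn.Linear(dims[-1], 1)
        self.lam = lam

    def forward(self, x):
        # Real-valued score; training would use a differentiable surrogate
        # loss (e.g. cross-entropy on a sigmoid of this score).
        for layer in self.hidden:
            x = torch.relu(layer(x))  # sigma_b(Wx), with the shift b absorbed into the bias
        return self.out(x).squeeze(-1)

    def classify(self, x):
        # Hard 0/1 decision corresponding to sigma*_lambda in (1).
        return (self.forward(x) > self.lam).long()

# Example: a network in H_{2,(4,4)}, matching Figure 1, applied to one series.
net = ChangePointMLP(n=100, widths=[4, 4])
x = torch.randn(1, 100)
print(net.classify(x))
```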
3 CUSUM-based classifier and its generalisations are neural networks
3.1 Change in mean
We initially consider the case of a single change-point with an unknown location $\tau∈[n-1]$ , $n≥ 2$ , in the model
$$
\boldsymbol{X}=\boldsymbol{\mu}+\boldsymbol{\xi},
$$
where $\boldsymbol{\mu}=(\mu_{\mathrm{L}}\mathbbm{1}\{i≤\tau\}+\mu_{\mathrm{R}}\mathbbm{1}\{i>\tau\})_{i∈[n]}$, $\mu_{\mathrm{L}},\mu_{\mathrm{R}}$ are the unknown signal values before and after the change-point, and $\boldsymbol{\xi}\sim N_{n}(0,I_{n})$. The CUSUM test is widely used to detect mean changes in univariate data. For the observation $\boldsymbol{x}$, the CUSUM transformation $\mathcal{C}:\mathbb{R}^{n}→\mathbb{R}^{n-1}$ is defined as $\mathcal{C}(\boldsymbol{x}):=(\boldsymbol{v}_{1}^{\top}\boldsymbol{x},...,\boldsymbol{v}_{n-1}^{\top}\boldsymbol{x})^{\top}$, where $\boldsymbol{v}_{i}\coloneqq\bigl(\sqrt{\frac{n-i}{in}}\boldsymbol{1}_{i}^{\top},-\sqrt{\frac{i}{(n-i)n}}\boldsymbol{1}_{n-i}^{\top}\bigr)^{\top}$ for $i∈[n-1]$. Here, for each $i∈[n-1]$, $(\boldsymbol{v}_{i}^{\top}\boldsymbol{x})^{2}$ is the log likelihood-ratio statistic for testing a change at time $i$ against the null of no change (e.g. Baranowski et al., 2019). For a given threshold $\lambda>0$, the classical CUSUM test for a change in the mean of the data is defined as
$$
h^{\mathrm{CUSUM}}_{\lambda}(\boldsymbol{x})=\mathbbm{1}\{\|\mathcal{C}(\boldsymbol{x})\|_{\infty}>\lambda\}.
$$
The following lemma shows that $h^{\mathrm{CUSUM}}_{\lambda}(\boldsymbol{x})$ can be represented as a neural network.
**Lemma 3.1**
*For any $\lambda>0$ , we have $h^{\mathrm{CUSUM}}_{\lambda}(\boldsymbol{x})∈\mathcal{H}_{1,2n-2}$ .*
The fact that the widely-used CUSUM statistic can be viewed as a simple neural network has far-reaching consequences: this means that given enough training data, a neural network architecture that permits the CUSUM-based classifier as its special case cannot do worse than CUSUM in classifying change-point versus no-change-point signals. This serves as the main motivation for our work, and a prelude to our next results.
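As an illustration of both the CUSUM transformation and the construction behind Lemma 3.1, the sketch below (our own, assuming NumPy; all function names are hypothetical) computes the classifier directly and via a one-hidden-layer ReLU network with $2n-2$ hidden nodes whose incoming weights are $\pm\boldsymbol{v}_{i}$, using that $\|\mathcal{C}(\boldsymbol{x})\|_{\infty}>\lambda$ holds exactly when $\sum_{i}\{\sigma(\boldsymbol{v}_{i}^{\top}\boldsymbol{x}-\lambda)+\sigma(-\boldsymbol{v}_{i}^{\top}\boldsymbol{x}-\lambda)\}>0$.

```python
import numpy as np

def cusum_vectors(n):
    """Rows are the CUSUM contrast vectors v_1, ..., v_{n-1}."""
    V = np.zeros((n - 1, n))
    for i in range(1, n):
        V[i - 1, :i] = np.sqrt((n - i) / (i * n))
        V[i - 1, i:] = -np.sqrt(i / ((n - i) * n))
    return V

def cusum_classifier(x, lam):
    """Direct implementation: 1{ ||C(x)||_inf > lam }."""
    V = cusum_vectors(len(x))
    return int(np.max(np.abs(V @ x)) > lam)

def cusum_as_relu_network(x, lam):
    """Same decision via a one-hidden-layer ReLU network with 2n-2 nodes.

    Hidden node j computes sigma(w_j^T x - lam) with w_j in {+v_i, -v_i};
    the output sums the (non-negative) activations and thresholds at 0.
    """
    V = cusum_vectors(len(x))
    W0 = np.vstack([V, -V])                    # first weight matrix, shape (2n-2, n)
    hidden = np.maximum(W0 @ x - lam, 0.0)     # shifted ReLU layer
    return int(hidden.sum() > 0.0)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 50), rng.normal(1, 1, 50)])
lam = np.sqrt(2 * np.log(100 / 0.05))          # threshold suggested by Lemma 4.1
assert cusum_classifier(x, lam) == cusum_as_relu_network(x, lam)
print(cusum_classifier(x, lam))
```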
3.2 Beyond the mean change model
We can generalise the simple change in mean model to allow for different types of change or for non-independent noise. In this section, we consider change-point models that can be expressed as a change in regression problem, where the model for data given a change at $\tau$ is of the form
$$
\boldsymbol{X}=\boldsymbol{Z}\boldsymbol{\beta}+\boldsymbol{c}_{\tau}\phi+\boldsymbol{\Gamma}\boldsymbol{\xi}, \tag{2}
$$
where for some $p≥ 1$, $\boldsymbol{Z}$ is an $n× p$ matrix of covariates for the model with no change, $\boldsymbol{c}_{\tau}$ is an $n× 1$ vector of covariates specific to the change at $\tau$, and the parameters $\boldsymbol{\beta}$ and $\phi$ are, respectively, a $p× 1$ vector and a scalar. The noise is defined in terms of an $n× n$ matrix $\boldsymbol{\Gamma}$ and an $n× 1$ vector of independent standard normal random variables, $\boldsymbol{\xi}$. For example, the change in mean problem has $p=1$, with $\boldsymbol{Z}$ a column vector of ones, and $\boldsymbol{c}_{\tau}$ being a vector whose first $\tau$ entries are zeros, and the remaining entries are ones. In this formulation $\beta$ is the pre-change mean, and $\phi$ is the size of the change. The change in slope problem (Fearnhead et al., 2019) has $p=2$, with the columns of $\boldsymbol{Z}$ being a vector of ones, and a vector whose $i$th entry is $i$; and $\boldsymbol{c}_{\tau}$ has $i$th entry equal to $\max\{0,i-\tau\}$. In this formulation $\boldsymbol{\beta}$ defines the pre-change linear mean, and $\phi$ the size of the change in slope. Choosing $\boldsymbol{\Gamma}$ to be proportional to the identity matrix gives a model with independent, identically distributed noise, but other choices would allow for auto-correlation. The following result, which generalises Lemma 3.1, shows that the likelihood-ratio test for (2), viewed as a classifier, can be represented by our neural network.
**Lemma 3.2**
*Consider the change-point model (2) with a possible change at $\tau∈[n-1]$ . Assume further that $\boldsymbol{\Gamma}$ is invertible. Then there is an $h^{*}∈\mathcal{H}_{1,2n-2}$ equivalent to the likelihood-ratio test for testing $\phi=0$ against $\phi≠ 0$ .*
Importantly, this result shows that for this much wider class of change-point models, we can replicate the likelihood-ratio-based classifier for change using a simple neural network. Other types of changes can be handled by suitably pre-transforming the data. For instance, squaring the input data would be helpful in detecting changes in the variance, and if the data followed an AR(1) structure, then changes in autocorrelation could be handled by including transformations of the original input of the form $(x_{t}x_{t+1})_{t=1,...,n-1}$. On the other hand, even if such transformations are not supplied as the input, a neural network of suitable depth is able to approximate these transformations and consequently successfully detect the change (Schmidt-Hieber, 2020, Lemma A.2). This is illustrated in Figure 7 of the appendix, where we compare the performance of neural network-based classifiers of various depths constructed with and without using the transformed data as inputs.
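The pre-transformations mentioned above are easy to construct explicitly. The sketch below (ours, assuming NumPy; the helper name is hypothetical and this is not the paper's exact pre-processing) augments a series with squared values, to expose variance changes, and lag-one products, to expose changes in autocorrelation; the augmented vector can then be fed to the same type of network.

```python
import numpy as np

def augmented_inputs(x):
    """Concatenate the raw series with transformations useful for other change types:
    squared values (variance changes) and lag-one products x_t * x_{t+1}
    (autocorrelation changes)."""
    x = np.asarray(x, dtype=float)
    squares = x ** 2
    lag_products = x[:-1] * x[1:]      # (x_t x_{t+1}) for t = 1, ..., n-1
    return np.concatenate([x, squares, lag_products])

x = np.random.default_rng(1).normal(size=100)
print(augmented_inputs(x).shape)       # (100 + 100 + 99,) = (299,)
```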
4 Generalisation error of neural network change-point classifiers
In Section 3, we showed that CUSUM and generalised CUSUM could be represented by a neural network. Therefore, with a large enough amount of training data, a trained neural network classifier that included CUSUM, or generalised CUSUM, as a special case, would perform no worse than it on unseen data. In this section, we provide generalisation bounds for a neural network classifier for the change-in-mean problem, given a finite amount of training data. En route to this main result, stated in Theorem 4.3, we provide generalisation bounds for the CUSUM-based classifier, in which the threshold has been chosen on a finite training data set. We write $P(n,\tau,\mu_{\mathrm{L}},\mu_{\mathrm{R}})$ for the distribution of the multivariate normal random vector $\boldsymbol{X}\sim N_{n}(\boldsymbol{\mu},I_{n})$ where $\boldsymbol{\mu}\coloneqq(\mu_{\mathrm{L}}\mathbbm{1}\{i≤\tau\}+\mu_{\mathrm{R}}\mathbbm{1}\{i>\tau\})_{i∈[n]}$. Define $\eta\coloneqq\tau/n$. Lemma 4.1 and Corollary 4.1 control the misclassification error of the CUSUM-based classifier.
**Lemma 4.1**
*Fix $\varepsilon∈(0,1)$. Suppose $\boldsymbol{X}\sim P(n,\tau,\mu_{\mathrm{L}},\mu_{\mathrm{R}})$ for some $\tau∈\mathbb{Z}^{+}$ and $\mu_{\mathrm{L}},\mu_{\mathrm{R}}∈\mathbb{R}$.
1. If $\mu_{\mathrm{L}}=\mu_{\mathrm{R}}$, then $\mathbb{P}\bigl\{\|\mathcal{C}(\boldsymbol{X})\|_{∞}>\sqrt{2\log(n/\varepsilon)}\bigr\}≤\varepsilon.$
2. If $|\mu_{\mathrm{L}}-\mu_{\mathrm{R}}|\sqrt{\eta(1-\eta)}>\sqrt{8\log(n/\varepsilon)/n}$, then $\mathbb{P}\bigl\{\|\mathcal{C}(\boldsymbol{X})\|_{∞}≤\sqrt{2\log(n/\varepsilon)}\bigr\}≤\varepsilon.$*
For any $B>0$ , define
$$
\Theta(B)\coloneqq\left\{(\tau,\mu_{\mathrm{L}},\mu_{\mathrm{R}})\in[n-1]\times\mathbb{R}\times\mathbb{R}:|\mu_{\mathrm{L}}-\mu_{\mathrm{R}}|\sqrt{\tau(n-\tau)}/n\in\{0\}\cup\left(B,\infty\right)\right\}.
$$
Here, $|\mu_{\mathrm{L}}-\mu_{\mathrm{R}}|\sqrt{\tau(n-\tau)}/n=|\mu_{\mathrm{L}}-\mu_{\mathrm{R}}|\sqrt{\eta(1-\eta)}$ can be interpreted as the signal-to-noise ratio of the mean change problem. Thus, $\Theta(B)$ is the parameter space of data distributions where there is either no change, or a single change-point in mean whose signal-to-noise ratio is at least $B$. The following corollary controls the misclassification risk of the CUSUM-based classifier:
**Corollary 4.1**
*Fix $B>0$ . Let $\pi_{0}$ be any prior distribution on $\Theta(B)$ , then draw $(\tau,\mu_{\mathrm{L}},\mu_{\mathrm{R}})\sim\pi_{0}$ and $\boldsymbol{X}\sim P(n,\tau,\mu_{\mathrm{L}},\mu_{\mathrm{R}})$ , and define $Y=\mathbbm{1}\{\mu_{\mathrm{L}}≠\mu_{\mathrm{R}}\}$ . For $\lambda=B\sqrt{n}/2$ , the classifier $h^{\mathrm{CUSUM}}_{\lambda}$ satisfies
$$
\mathbb{P}(h^{\mathrm{CUSUM}}_{\lambda}(\boldsymbol{X})\neq Y)\leq ne^{-nB^{2}/8}.
$$*
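As an indicative numerical illustration of this bound (our own, not from the paper): with $n=100$ and signal-to-noise ratio $B=1$, the classifier $h^{\mathrm{CUSUM}}_{\lambda}$ with $\lambda=B\sqrt{n}/2=5$ has misclassification probability at most
$$
100\,e^{-100/8}=100\,e^{-12.5}\approx 3.7\times 10^{-4}.
$$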
Theorem 4.2 below, which is based on Corollary 4.1, Bartlett et al. (2019, Theorem 7) and Mohri et al. (2012, Corollary 3.4), shows that the empirical risk minimiser in the neural network class $\mathcal{H}_{1,2n-2}$ has good generalisation properties over the class of change-point problems parameterised by $\Theta(B)$ . Given training data $(\boldsymbol{X}^{(1)},Y^{(1)}),...,(\boldsymbol{X}^{(N)},Y^{(N)})$ and any $h:\mathbb{R}^{n}→\{0,1\}$ , we define the empirical risk of $h$ as
$$
L_{N}(h)\coloneqq\frac{1}{N}\sum_{i=1}^{N}\mathbbm{1}\{Y^{(i)}\neq h(\boldsymbol{X}^{(i)})\}.
$$
**Theorem 4.2**
*Fix $B>0$ and let $\pi_{0}$ be any prior distribution on $\Theta(B)$. We draw $(\tau,\mu_{\mathrm{L}},\mu_{\mathrm{R}})\sim\pi_{0}$, $\boldsymbol{X}\sim P(n,\tau,\mu_{\mathrm{L}},\mu_{\mathrm{R}})$, and set $Y=\mathbbm{1}\{\mu_{\mathrm{L}}≠\mu_{\mathrm{R}}\}$. Suppose that the training data $\mathcal{D}:=\bigl((\boldsymbol{X}^{(1)},Y^{(1)}),...,(\boldsymbol{X}^{(N)},Y^{(N)})\bigr)$ consist of independent copies of $(\boldsymbol{X},Y)$ and $h_{\mathrm{ERM}}\coloneqq\operatorname*{arg\,min}_{h∈\mathcal{H}_{1,2n-2}}L_{N}(h)$ is the empirical risk minimiser. There exists a universal constant $C>0$ such that for any $\delta∈(0,1)$, (3) holds with probability $1-\delta$.
$$
\mathbb{P}(h_{\mathrm{ERM}}(\boldsymbol{X})\neq Y\mid\mathcal{D})\leq ne^{-nB^{2}/8}+C\sqrt{\frac{n^{2}\log(n)\log(N)+\log(1/\delta)}{N}}. \tag{3}
$$*
The theoretical results derived for the neural network-based classifier, here and below, all rely on the fact that the training and test data are drawn from the same distribution. However, we observe that in practice, even when the training and test sets have different error distributions, neural network-based classifiers still provide accurate results on the test set; see our discussion of Figure 2 in Section 5 for more details. The misclassification error in (3) is bounded by two terms. The first term represents the misclassification error of the CUSUM-based classifier (see Corollary 4.1), and the second term depends on the complexity of the neural network class measured by its VC dimension. Theorem 4.2 suggests that for training sample size $N\gg n^{2}\log n$, a well-trained single-hidden-layer neural network with $2n-2$ hidden nodes would have comparable performance to that of the CUSUM-based classifier. However, as we will see in Section 5, in practice, a much smaller training sample size $N$ is needed for the neural network to be competitive in the change-point detection task. This is because the $2n-2$ hidden layer nodes in the neural network representation of $h^{\mathrm{CUSUM}}_{\lambda}$ encode the components of the CUSUM transformation $(±\boldsymbol{v}_{t}^{\top}\boldsymbol{x}:t∈[n-1])$, which are highly correlated. By suitably pruning the hidden layer nodes, we can show that a single-hidden-layer neural network with $O(\log n)$ hidden nodes is able to represent a modified version of the CUSUM-based classifier with essentially the same misclassification error. More precisely, let $Q:=\lfloor\log_{2}(n/2)\rfloor$ and write $T_{0}:=\{2^{q}:0≤ q≤ Q\}\cup\{n-2^{q}:0≤ q≤ Q\}$. We can then define
$$
h^{\mathrm{CUSUM}_{*}}_{\lambda^{*}}(\boldsymbol{X})=\mathbbm{1}\Bigl\{\max_{t\in T_{0}}|\boldsymbol{v}_{t}^{\top}\boldsymbol{X}|>\lambda^{*}\Bigr\}.
$$
By the same argument as in Lemma 3.1, we can show that $h^{\mathrm{CUSUM}_{*}}_{\lambda^{*}}∈\mathcal{H}_{1,4\lfloor\log_{2}(n)\rfloor}$ for any $\lambda^{*}>0$. The following theorem shows that high classification accuracy can be achieved under a weaker training sample size condition compared to Theorem 4.2.
**Theorem 4.3**
*Fix $B>0$ and let the training data $\mathcal{D}$ be generated as in Theorem 4.2. Let $h_{\mathrm{ERM}}\coloneqq\operatorname*{arg\,min}_{h∈\mathcal{H}_{L,\boldsymbol{m}}}L_{N}(h)$ be the empirical risk minimiser for a neural network with $L≥ 1$ hidden layers and hidden layer widths $\boldsymbol{m}=(m_{1},...,m_{L})^{\top}$. If $m_{1}≥ 4\lfloor\log_{2}(n)\rfloor$ and $m_{r}m_{r+1}=O(n\log n)$ for all $r∈[L-1]$, then there exists a universal constant $C>0$ such that for any $\delta∈(0,1)$, (4) holds with probability $1-\delta$.
$$
\mathbb{P}(h_{\mathrm{ERM}}(\boldsymbol{X})\neq Y\mid\mathcal{D})\leq 2\lfloor\log_{2}(n)\rfloor e^{-nB^{2}/24}+C\sqrt{\frac{L^{2}n\log^{2}(Ln)\log(N)+\log(1/\delta)}{N}}. \tag{4}
$$*
Theorem 4.3 generalises the single hidden layer neural network representation in Theorem 4.2 to multiple hidden layers. In practice, multiple hidden layers help keep the misclassification error rate low even when $N$ is small; see the numerical study in Section 5. Theorems 4.2 and 4.3 are examples of how to derive generalisation errors of a neural network-based classifier in the change-point detection task. The same workflow can be employed for other types of change, provided that suitable representation results of likelihood-based tests in terms of neural networks (e.g. Lemma 3.2) can be obtained. In a general result of this type, the generalisation error of the neural network will again be bounded by a sum of the error of the likelihood-based classifier together with a term originating from the VC-dimension bound of the complexity of the neural network architecture. We further remark that for simplicity of discussion, we have focused our attention on data models where the noise vector $\boldsymbol{\xi}=\boldsymbol{X}-\mathbb{E}\boldsymbol{X}$ has independent and identically distributed normal components. However, since CUSUM-based tests are available for temporally correlated or sub-Weibull data, with suitably adjusted test threshold values, the above theoretical results readily generalise to such settings. See Theorems A.3 and A.5 in the appendix for more details.
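To illustrate the pruned classifier $h^{\mathrm{CUSUM}_{*}}_{\lambda^{*}}$ used above, the sketch below (ours, assuming NumPy; function names are hypothetical) evaluates the CUSUM statistics only on the dyadic grid $T_{0}$, which has $O(\log n)$ elements.

```python
import numpy as np

def dyadic_grid(n):
    """T_0 = {2^q : 0 <= q <= Q} ∪ {n - 2^q : 0 <= q <= Q}, with Q = floor(log2(n/2))."""
    Q = int(np.floor(np.log2(n / 2)))
    return sorted({2 ** q for q in range(Q + 1)} | {n - 2 ** q for q in range(Q + 1)})

def contrast(n, t):
    """The CUSUM contrast vector v_t from Section 3.1."""
    v = np.empty(n)
    v[:t] = np.sqrt((n - t) / (t * n))
    v[t:] = -np.sqrt(t / ((n - t) * n))
    return v

def pruned_cusum_classifier(x, lam_star):
    """1{ max_{t in T_0} |v_t^T x| > lam_star }, using only O(log n) contrasts."""
    n = len(x)
    return int(max(abs(contrast(n, t) @ x) for t in dyadic_grid(n)) > lam_star)

x = np.concatenate([np.zeros(60), np.ones(40)]) + np.random.default_rng(2).normal(size=100)
print(pruned_cusum_classifier(x, lam_star=np.sqrt(2 * np.log(100 / 0.05))))
```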
5 Numerical study
We now investigate empirically our approach of learning a change-point detection method by training a neural network. Motivated by the results from the previous section, we will fit neural networks with one or more hidden layers and consider how varying the number of hidden layers and the amount of training data affects performance. We will compare to a test based on the CUSUM statistic, both for scenarios where the noise is independent and Gaussian, and for scenarios where there is auto-correlation or heavy-tailed noise. The CUSUM test can be sensitive to the choice of threshold, particularly when we do not have independent Gaussian noise, so we tune its threshold based on training data. When training the neural network, we first standardise the data onto $[0,1]$, i.e. $\tilde{\boldsymbol{x}}_{i}=((x_{ij}-x_{i}^{\mathrm{min}})/(x_{i}^{\mathrm{max}}-x_{i}^{\mathrm{min}}))_{j∈[n]}$ where $x_{i}^{\mathrm{max}}:=\max_{j}x_{ij}$ and $x_{i}^{\mathrm{min}}:=\min_{j}x_{ij}$. This makes the neural network procedure invariant to either adding a constant to the data or scaling the data by a constant, which are natural properties to require. We train the neural network by minimising the cross-entropy loss on the training data. We run training for 200 epochs with a batch size of 32 and a learning rate of 0.001 using the Adam optimiser (Kingma and Ba, 2015). These hyperparameters are chosen based on a training dataset with cross-validation; more details can be found in Appendix B. We generate our data as follows. Given a sequence of length $n$, we draw $\tau\sim\mathrm{Unif}\{2,...,n-2\}$, set $\mu_{\mathrm{L}}=0$ and draw $\mu_{\mathrm{R}}|\tau\sim\mathrm{Unif}([-1.5b,-0.5b]\cup[0.5b,1.5b])$, where $b:=\sqrt{\frac{8n\log(20n)}{\tau(n-\tau)}}$ is chosen in line with Lemma 4.1 to ensure a good range of signal-to-noise ratios. We then generate $\boldsymbol{x}_{1}=(\mu_{\mathrm{L}}\mathbbm{1}_{\{t≤\tau\}}+\mu_{\mathrm{R}}\mathbbm{1}_{\{t>\tau\}}+\varepsilon_{t})_{t∈[n]}$, with the noise $(\varepsilon_{t})_{t∈[n]}$ following an $\mathrm{AR}(1)$ model with possibly time-varying autocorrelation: $\varepsilon_{1}=\xi_{1}$ and $\varepsilon_{t}=\rho_{t}\varepsilon_{t-1}+\xi_{t}$ for $t≥ 2$, where $(\xi_{t})_{t∈[n]}$ are independent, possibly heavy-tailed innovations. The autocorrelations $\rho_{t}$ and innovations $\xi_{t}$ come from one of the following scenarios:
1. Scenario S1: $n=100$, $N∈\{100,200,...,700\}$, $\rho_{t}=0$ and $\xi_{t}\sim N(0,1)$.
2. Scenario S1${}^{\prime}$: $n=100$, $N∈\{100,200,...,700\}$, $\rho_{t}=0.7$ and $\xi_{t}\sim N(0,1)$.
3. Scenario S2: $n=100$, $N∈\{100,200,...,1000\}$, $\rho_{t}\sim\mathrm{Unif}([0,1])$ and $\xi_{t}\sim N(0,2)$.
4. Scenario S3: $n=100$, $N∈\{100,200,...,1000\}$, $\rho_{t}=0$ and $\xi_{t}\sim\text{Cauchy}(0,0.3)$.
The above procedure is then repeated $N/2$ times to generate independent sequences $\boldsymbol{x}_{1},...,\boldsymbol{x}_{N/2}$ with a single change, and the associated labels are $(y_{1},...,y_{N/2})^{\top}=\mathbf{1}_{N/2}$. We then repeat the process another $N/2$ times with $\mu_{\mathrm{R}}=\mu_{\mathrm{L}}$ to generate sequences without changes $\boldsymbol{x}_{N/2+1},...,\boldsymbol{x}_{N}$ with $(y_{N/2+1},...,y_{N})^{\top}=\mathbf{0}_{N/2}$. The data with and without change $(\boldsymbol{x}_{i},y_{i})_{i∈[N]}$ are combined and randomly shuffled to form the training data. The test data are generated in a similar way, with a sample size $N_{\mathrm{test}}=30000$ and the slight modification that $\mu_{\mathrm{R}}|\tau\sim\mathrm{Unif}([-1.75b,-0.25b]\cup[0.25b,1.75b])$ when a change occurs. We note that the test data is drawn from the same distribution as the training set, though potentially having changes with signal-to-noise ratios outside the range covered by the training set. We have also conducted robustness studies to investigate the effect of training the neural networks on scenario S1 and testing on S1 ${}^{\prime}$, S2 or S3. Qualitatively similar results to Figure 2 have been obtained in this misspecified setting (see Figure 6 in the appendix).
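The data-generating mechanism above can be summarised in a short simulation sketch (our own, assuming NumPy; function names are hypothetical). For simplicity it uses a constant autocorrelation $\rho$ and covers the Gaussian and Cauchy innovation cases; allowing a time-varying $\rho_{t}$ as in Scenario S2 only requires passing a vector of autocorrelations.

```python
import numpy as np

def simulate_series(n, rng, change, rho=0.0, noise="gaussian"):
    """Generate one series of length n following the mechanism described above."""
    tau = rng.integers(2, n - 1)                       # tau ~ Unif{2, ..., n-2}
    b = np.sqrt(8 * n * np.log(20 * n) / (tau * (n - tau)))
    mu_R = rng.uniform(0.5 * b, 1.5 * b) * rng.choice([-1, 1]) if change else 0.0
    xi = rng.standard_cauchy(n) * 0.3 if noise == "cauchy" else rng.normal(0, 1, n)
    eps = np.empty(n)
    eps[0] = xi[0]
    for t in range(1, n):                              # AR(1): eps_t = rho * eps_{t-1} + xi_t
        eps[t] = rho * eps[t - 1] + xi[t]
    mu = np.where(np.arange(1, n + 1) > tau, mu_R, 0.0)
    return mu + eps

def minmax_standardise(x):
    """Map each series onto [0, 1], as done before training the network."""
    return (x - x.min()) / (x.max() - x.min())

rng = np.random.default_rng(0)
N, n = 200, 100
X = np.stack([minmax_standardise(simulate_series(n, rng, change=(i < N // 2)))
              for i in range(N)])
y = np.r_[np.ones(N // 2), np.zeros(N // 2)]           # labels: 1 = change, 0 = no change
perm = rng.permutation(N)                              # shuffle the training pairs
X, y = X[perm], y[perm]
print(X.shape, y.shape)
```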
(a) Scenario S1 with $\rho_{t}=0$ (b) Scenario S1 ${}^{\prime}$ with $\rho_{t}=0.7$
(c) Scenario S2 with $\rho_{t}\sim\text{Unif}([0,1])$ (d) Scenario S3 with Cauchy noise
Figure 2: Plot of the test set MER, computed on a test set of size $N_{\mathrm{test}}=30000$ , against training sample size $N$ for detecting the existence of a change-point on data series of length $n=100$ . We compare the performance of the CUSUM test and neural networks from four function classes: $\mathcal{H}_{1,m^{(1)}}$ , $\mathcal{H}_{1,m^{(2)}}$ , $\mathcal{H}_{5,m^{(1)}\mathbf{1}_{5}}$ and $\mathcal{H}_{10,m^{(1)}\mathbf{1}_{10}}$ where $m^{(1)}=4\lfloor\log_{2}(n)\rfloor$ and $m^{(2)}=2n-2$ respectively under scenarios S1, S1 ${}^{\prime}$ , S2 and S3 described in Section 5.
We compare the performance of the CUSUM-based classifier, with the threshold cross-validated on the training data, with neural networks from four function classes: $\mathcal{H}_{1,m^{(1)}}$, $\mathcal{H}_{1,m^{(2)}}$, $\mathcal{H}_{5,m^{(1)}\mathbf{1}_{5}}$ and $\mathcal{H}_{10,m^{(1)}\mathbf{1}_{10}}$, where $m^{(1)}=4\lfloor\log_{2}(n)\rfloor$ and $m^{(2)}=2n-2$ respectively (cf. Theorem 4.3 and Lemma 3.1). Figure 2 shows the test misclassification error rate (MER) of these procedures in the four scenarios S1, S1 ${}^{\prime}$, S2 and S3. We observe that when data are generated with independent Gaussian noise (Figure 2(a)), the trained neural networks with a single hidden layer of $m^{(1)}$ or $m^{(2)}$ nodes attain very similar test MER to the CUSUM-based classifier. This is in line with our Theorem 4.3. More interestingly, when the noise is either autocorrelated (Figure 2(b, c)) or heavy-tailed (Figure 2(d)), trained neural networks with $(L,\mathbf{m})$ equal to $(1,m^{(1)})$, $(1,m^{(2)})$, $(5,m^{(1)}\mathbf{1}_{5})$ and $(10,m^{(1)}\mathbf{1}_{10})$ outperform the CUSUM-based classifier, even after we have optimised the threshold choice of the latter. In addition, as shown in Figure 5 in the online supplement, when the first two layers of the network are set to carry out truncation, which can be seen as a composition of two ReLU operations, the resulting neural network outperforms the Wilcoxon statistics-based classifier (Dehling et al., 2015), which is a standard benchmark for change-point detection in the presence of heavy-tailed noise. Furthermore, from Figure 2, we see that increasing $L$ can significantly reduce the average MER when $N≤ 200$. Theoretically, as the number of layers $L$ increases, the neural network is better able to approximate the optimal decision boundary, but it becomes increasingly difficult to train the weights due to issues such as vanishing gradients (He et al., 2016). A combination of these considerations leads us to develop a deep neural network architecture with residual connections for detecting multiple changes and multiple change types in Section 6.
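As a side note on the truncation step mentioned above, one way (an illustration of ours, not necessarily the exact construction used in the supplement) to express truncation of each input coordinate to $[-c,c]$ with ReLU units is
$$
\mathrm{clip}_{[-c,c]}(x)=\sigma(x+c)-\sigma(x-c)-c,
$$
so two ReLU nodes per input coordinate in a single hidden layer suffice to implement it.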
6 Detecting multiple changes and multiple change types – case study
From the previous section, we see that single and multiple hidden layer neural networks can represent CUSUM or generalised CUSUM tests and may perform better than likelihood-based test statistics when the model is misspecified. This prompted us to seek a general network architecture that can detect, and even classify, multiple types of change. Motivated by the similarities between signal processing and image recognition, we employed a deep convolutional neural network (CNN) (Yamashita et al., 2018) to learn the various features of multiple change-types. However, stacking more CNN layers cannot guarantee a better network because of vanishing gradients in training (He et al., 2016). Therefore, we adopted the residual block structure (He et al., 2016) for our neural network architecture. After experimenting with various architectures with different numbers of residual blocks and fully connected layers on synthetic data, we arrived at a network architecture with 21 residual blocks followed by a number of fully connected layers. Figure 9 shows an overview of the architecture of the final general-purpose deep neural network for change-point detection. The precise architecture and training methodology of this network $\widehat{NN}$ can be found in Appendix C. Neural Architecture Search (NAS) approaches (see Paaß and Giesselbach, 2023, Section 2.4.3) offer principled ways of selecting neural architectures, and some of these approaches could be made applicable in our setting.

We demonstrate the power of our general-purpose change-point detection network in a numerical study. We train the network on $N=10000$ instances of data sequences generated from a mixture of no change-point in mean or variance, change in mean only, change in variance only, no-change in a non-zero slope and change in slope only, and compare its classification performance on a test set of size $2500$ against that of oracle likelihood-based classifiers (where we pre-specify whether we are testing for change in mean, variance or slope) and adaptive likelihood-based classifiers (where we combine likelihood-based tests using the Bayesian Information Criterion). Details of the data-generating mechanism and classifiers can be found in Appendix B. The classification accuracies of the three approaches in weak and strong signal-to-noise ratio settings are reported in Table 1. We see that the neural network-based approach achieves similar classification accuracy to the adaptive likelihood-based method for weak SNR and higher classification accuracy than the adaptive likelihood-based method for strong SNR. We would not expect the neural network to outperform the oracle likelihood-based classifiers as it has no knowledge of the exact change-type of each time series.
Table 1: Test classification accuracy of oracle likelihood-ratio based method (LR ${}^{\mathrm{oracle}}$ ), adaptive likelihood ratio method (LR ${}^{\mathrm{adapt}}$ ) and our residual neural network (NN) classifier for setups with weak and strong signal-to-noise ratios (SNR). Data are generated as a mixture of no change-point in mean or variance (Class 1), change in mean only (Class 2), change in variance only (Class 3), no-change in a non-zero slope (Class 4), change in slope only (Class 5). We report the true positive rate of each class and the accuracy in the last row.
| | Weak SNR: LR${}^{\mathrm{oracle}}$ | Weak SNR: LR${}^{\mathrm{adapt}}$ | Weak SNR: NN | Strong SNR: LR${}^{\mathrm{oracle}}$ | Strong SNR: LR${}^{\mathrm{adapt}}$ | Strong SNR: NN |
| --- | --- | --- | --- | --- | --- | --- |
| Class 1 | 0.9787 | 0.9457 | 0.8062 | 0.9787 | 0.9341 | 0.9651 |
| Class 2 | 0.8443 | 0.8164 | 0.8882 | 1.0000 | 0.7784 | 0.9860 |
| Class 3 | 0.8350 | 0.8291 | 0.8585 | 0.9902 | 0.9902 | 0.9705 |
| Class 4 | 0.9960 | 0.9453 | 0.8826 | 0.9980 | 0.9372 | 0.9312 |
| Class 5 | 0.8729 | 0.8604 | 0.8353 | 0.9958 | 0.9917 | 0.9147 |
| Accuracy | 0.9056 | 0.8796 | 0.8660 | 0.9924 | 0.9260 | 0.9672 |
We now consider an application to detecting different types of change. The HASC (Human Activity Sensing Consortium) project data contain motion sensor measurements during a sequence of human activities, including “stay”, “walk”, “jog”, “skip”, “stair up” and “stair down”. Complex changes in sensor signals occur during the transition from one activity to the next (see Figure 3). There are 28 labels in the HASC data; see Figure 10 in the appendix. To agree with the dimension of the output, we drop two dense layers “Dense(10)” and “Dense(20)” in Figure 9. The resulting network can be applied effectively for change-point detection in sensory signals of human activities, and can achieve high accuracy in change-point classification tasks (Figure 12 in the appendix). Finally, we remark that our neural network-based change-point detector can be utilised to detect multiple change-points. Algorithm 1 outlines a general scheme for turning a change-point classifier into a location estimator, where we employ an idea similar to that of MOSUM (Eichinger and Kirch, 2018) and repeatedly apply a classifier $\psi$ to data from a sliding window of size $n$. Here, we require $\psi$ applied to each data segment $\boldsymbol{X}^{*}_{[i,i+n)}$ to output both the class label $L_{i}∈\{0,1\}$, indicating whether a change is predicted, and the corresponding probability $p_{i}$ of having a change. In our particular example, for each data segment $\boldsymbol{X}^{*}_{[i,i+n)}$ of length $n=700$, we define $\psi(\boldsymbol{X}^{*}_{[i,i+n)})=0$ if $\widehat{NN}(\boldsymbol{X}^{*}_{[i,i+n)})$ predicts a class label in $\{0,4,8,12,16,22\}$ (see Figure 10 in the appendix) and 1 otherwise. The thresholding parameter $\gamma>0$ is chosen to be $1/2$.
Input: new data $\boldsymbol{x}_{1}^{*},...,\boldsymbol{x}_{n^{*}}^{*}∈\mathbb{R}^{d}$ , a trained classifier $\psi:\mathbb{R}^{d× n}→\{0,1\}$ , $\gamma>0$ .
1 Form $\boldsymbol{X}_{[i,i+n)}^{*}:=(\boldsymbol{x}_{i}^{*},...,\boldsymbol{x}_{i+n-1}^{*})$ and compute $L_{i}←\psi(\boldsymbol{X}^{*}_{[i,i+n)})$ for all $i=1,...,n^{*}-n+1$ ;
2 Compute $\bar{L}_{i}← n^{-1}\sum_{j=i-n+1}^{i}L_{j}$ for $i=n,...,n^{*}-n+1$ ;
3 Let $\{[s_{1},e_{1}],...,[s_{\hat{\nu}},e_{\hat{\nu}}]\}$ be the set of all maximal segments such that $\bar{L}_{i}≥\gamma$ for all $i∈[s_{r},e_{r}]$ , $r∈[\hat{\nu}]$ ;
4 Compute $\hat{\tau}_{r}←\operatorname*{arg\,max}_{i∈[s_{r},e_{r}]}\bar{L}_{i}$ for all $r∈[\hat{\nu}]$ ;
Output: Estimated change-points $\hat{\tau}_{1},...,\hat{\tau}_{\hat{\nu}}$
Algorithm 1 Algorithm for change-point localisation
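The following is a minimal numpy sketch of Algorithm 1, assuming a generic window classifier `psi` that returns the 0/1 label for a length-$n$ segment; the function name and indexing conventions are ours, and ties in the arg max are broken at the first maximiser.

```python
import numpy as np

def localise_change_points(x, psi, n, gamma=0.5):
    """Sketch of Algorithm 1: turn a window classifier into location estimates.

    x     : 1-d array of length n_star
    psi   : callable mapping a length-n window to a label in {0, 1}
    n     : window length
    gamma : threshold applied to the moving average of the labels
    """
    n_star = len(x)
    # Step 1: classify every sliding window of length n
    labels = np.array([psi(x[i:i + n]) for i in range(n_star - n + 1)], dtype=float)
    # Step 2: moving average of n consecutive labels; lbar[m] corresponds to
    # \bar{L}_i with i = m + n in the 1-indexed notation of the algorithm
    lbar = np.convolve(labels, np.ones(n) / n, mode="valid")
    # Steps 3-4: maximal segments where lbar >= gamma; one estimate per segment
    above = lbar >= gamma
    estimates, m = [], 0
    while m < len(above):
        if not above[m]:
            m += 1
            continue
        end = m
        while end + 1 < len(above) and above[end + 1]:
            end += 1
        k = m + int(np.argmax(lbar[m:end + 1]))
        estimates.append(k + n)  # back to 1-indexed time
        m = end + 1
    return estimates
```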
Figure 4 illustrates the result of multiple change-point detection in HASC data which provides evidence that the trained neural network can detect both the multiple change-types and multiple change-points.
Figure 3: The sequence of accelerometer data in the $x,y$ and $z$ axes. From left to right, there are 4 activities: “stair down”, “stay”, “stair up” and “walk”; the change-points at 990, 1691 and 2733 are marked by black solid lines. The grey rectangles represent the group of “no-change” with labels “stair down”, “stair up” and “walk”; the red rectangles represent the group of “one-change” with labels “stair down $→$ stay”, “stay $→$ stair up” and “stair up $→$ walk”.
Figure 4: Change-point detection in HASC data. The red vertical lines represent the underlying change-points; the blue vertical lines represent the estimated change-points. More details on multiple change-point detection can be found in Appendix C.
7 Discussion
Reliable testing for change-points and estimating their locations, especially in the presence of multiple change-points, other heterogeneities or untidy data, is typically a difficult problem for the applied statistician: they need to understand what type of change is sought, be able to characterise it mathematically, find a satisfactory stochastic model for the data, formulate the appropriate statistic, and fine-tune its parameters. This makes for a long workflow, with scope for errors at every stage. In this paper, we showed how a carefully constructed statistical learning framework could automatically take over some of those tasks, and perform many of them ‘in one go’ when provided with examples of labelled data. This turned the change-point detection problem into a supervised learning problem, and meant that the task of learning the appropriate test statistic and fine-tuning its parameters was left to the ‘machine’ rather than the human user.
The crucial question was that of choosing an appropriate statistical learning framework. The key factor behind our choice of neural networks was the discovery that the traditionally used likelihood-ratio-based change-point detection statistics could be viewed as simple neural networks, which (together with bounds on generalisation errors beyond the training set) enabled us to formulate and prove the corresponding learning theory. However, there is a plethora of other excellent predictive frameworks, such as XGBoost, LightGBM or Random Forests (Chen and Guestrin, 2016; Ke et al., 2017; Breiman, 2001), and it would be of interest to establish whether and why they could or could not provide a viable alternative to neural nets here. Furthermore, if we view the neural network as emulating the likelihood-ratio test statistic, in that it will create test statistics for each possible location of a change and then amalgamate these into a single classifier, then we know that test statistics for nearby changes will often be similar. This suggests that imposing some smoothness on the weights of the neural network may be beneficial.
A further challenge is to develop methods that can adapt easily to input data of different sizes, without having to train a different neural network for each input size. For changes in the structure of the mean of the data, it may be possible to use ideas from functional data analysis so that we pre-process the data, with some form of smoothing or imputation, to produce input data of the correct length.
If historical labelled examples of change-points, perhaps provided by subject-matter experts (who are not necessarily statisticians), are not available, one question of interest is whether simulation can be used to obtain such labelled examples artificially, based on (say) a single dataset of interest. Such simulated examples would need to come in two flavours: one batch ‘likely containing no change-points’ and the other containing some artificially induced ones. How to simulate reliably in this way is an important problem, which this paper does not solve. Indeed, we can envisage situations in which simulating in this way may be easier than solving the original unsupervised change-point problem involving the single dataset at hand, with the bulk of the difficulty left to the ‘machine’ at the learning stage when provided with the simulated data. For situations where there is no historical data, but there are statistical models, one can obtain training data by simulation from the model.
In this case, training a neural network to detect a change has similarities with likelihood-free inference methods in that it replaces analytic calculations associated with a model by the ability to simulate from the model. It is of interest whether ideas from that area of statistics can be used here. The main focus of our work was on testing for a single offline change-point, and we treated location estimation and extensions to multiple-change scenarios only superficially, via the heuristics of testing-based estimation in Section 6. Similar extensions can be made to the online setting once the neural network is trained, by retaining the final $n$ observations in an online stream in memory and applying our change-point classifier sequentially. One question of interest is whether and how these heuristics can be made more rigorous: equipped with an offline classifier only, how can we translate the theoretical guarantee of this offline classifier to that of the corresponding location estimator or online detection procedure? In addition to this approach, how else can a neural network, however complex, be trained to estimate locations or detect change-points sequentially? In our view, these questions merit further work.
Availability of data and computer code
The data underlying this article are available at http://hasc.jp/hc2011/index-en.html. The computer code and algorithms are available in the Python package AutoCPD.
Acknowledgement
This work was supported by the High End Computing Cluster at Lancaster University, and EPSRC grants EP/V053590/1, EP/V053639/1 and EP/T02772X/1. We greatly appreciate Yudong Chen's contribution to debugging our Python scripts and improving their readability.
Conflicts of Interest
We have no conflicts of interest to disclose.
References
- Ahmadzadeh (2018) Ahmadzadeh, F. (2018). Change point detection with multivariate control charts by artificial neural network. J. Adv. Manuf. Technol. 97 (9), 3179–3190.
- Aminikhanghahi and Cook (2017) Aminikhanghahi, S. and D. J. Cook (2017). Using change point detection to automate daily activity segmentation. In 2017 IEEE International Conference on Pervasive Computing and Communications Workshops (PerCom Workshops), pp. 262–267.
- Baranowski et al. (2019) Baranowski, R., Y. Chen, and P. Fryzlewicz (2019). Narrowest-over-threshold detection of multiple change points and change-point-like features. J. Roy. Stat. Soc., Ser. B 81 (3), 649–672.
- Bartlett et al. (2019) Bartlett, P. L., N. Harvey, C. Liaw, and A. Mehrabian (2019). Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. J. Mach. Learn. Res. 20 (63), 1–17.
- Beaumont (2019) Beaumont, M. A. (2019). Approximate Bayesian computation. Annu. Rev. Stat. Appl. 6, 379–403.
- Bengio et al. (1994) Bengio, Y., P. Simard, and P. Frasconi (1994). Learning long-term dependencies with gradient descent is difficult. IEEE T. Neural Networ. 5 (2), 157–166.
- Bos and Schmidt-Hieber (2022) Bos, T. and J. Schmidt-Hieber (2022). Convergence rates of deep ReLU networks for multiclass classification. Electron. J. Stat. 16 (1), 2724–2773.
- Breiman (2001) Breiman, L. (2001). Random forests. Mach. Learn. 45 (1), 5–32.
- Chang et al. (2019) Chang, W.-C., C.-L. Li, Y. Yang, and B. Póczos (2019). Kernel change-point detection with auxiliary deep generative models. In International Conference on Learning Representations.
- Chen and Gupta (2012) Chen, J. and A. K. Gupta (2012). Parametric Statistical Change Point Analysis: With Applications to Genetics, Medicine, and Finance (2nd ed.). New York: Birkhäuser.
- Chen and Guestrin (2016) Chen, T. and C. Guestrin (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794.
- De Ryck et al. (2021) De Ryck, T., M. De Vos, and A. Bertrand (2021). Change point detection in time series data using autoencoders with a time-invariant representation. IEEE T. Signal Proces. 69, 3513–3524.
- Dehling et al. (2015) Dehling, H., R. Fried, I. Garcia, and M. Wendler (2015). Change-point detection under dependence based on two-sample U-statistics. In D. Dawson, R. Kulik, M. Ould Haye, B. Szyszkowicz, and Y. Zhao (Eds.), Asymptotic Laws and Methods in Stochastics: A Volume in Honour of Miklós Csörgő, pp. 195–220. New York, NY: Springer New York.
- Dürre et al. (2016) Dürre, A., R. Fried, T. Liboschik, and J. Rathjens (2016). robts: Robust Time Series Analysis. R package version 0.3.0/r251.
- Eichinger and Kirch (2018) Eichinger, B. and C. Kirch (2018). A MOSUM procedure for the estimation of multiple random change points. Bernoulli 24 (1), 526–564.
- Fearnhead et al. (2019) Fearnhead, P., R. Maidstone, and A. Letchford (2019). Detecting changes in slope with an $l_{0}$ penalty. J. Comput. Graph. Stat. 28 (2), 265–275.
- Fearnhead and Rigaill (2020) Fearnhead, P. and G. Rigaill (2020). Relating and comparing methods for detecting changes in mean. Stat 9 (1), 1–11.
- Fryzlewicz (2014) Fryzlewicz, P. (2014). Wild binary segmentation for multiple change-point detection. Ann. Stat. 42 (6), 2243–2281.
- Fryzlewicz (2021) Fryzlewicz, P. (2021). Robust narrowest significance pursuit: Inference for multiple change-points in the median. arXiv preprint, arxiv:2109.02487.
- Fryzlewicz (2023) Fryzlewicz, P. (2023). Narrowest significance pursuit: Inference for multiple change-points in linear models. J. Am. Stat. Assoc., to appear.
- Gao et al. (2019) Gao, Z., Z. Shang, P. Du, and J. L. Robertson (2019). Variance change point detection under a smoothly-changing mean trend with application to liver procurement. J. Am. Stat. Assoc. 114 (526), 773–781.
- Glorot and Bengio (2010) Glorot, X. and Y. Bengio (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256. JMLR Workshop and Conference Proceedings.
- Gourieroux et al. (1993) Gourieroux, C., A. Monfort, and E. Renault (1993). Indirect inference. J. Appl. Econom. 8 (S1), S85–S118.
- Gupta et al. (2022) Gupta, M., R. Wadhvani, and A. Rasool (2022). Real-time change-point detection: A deep neural network-based adaptive approach for detecting changes in multivariate time series data. Expert Syst. Appl. 209, 1–16.
- Gutmann et al. (2018) Gutmann, M. U., R. Dutta, S. Kaski, and J. Corander (2018). Likelihood-free inference via classification. Stat. Comput. 28 (2), 411–425.
- Haynes et al. (2017) Haynes, K., I. A. Eckley, and P. Fearnhead (2017). Computationally efficient changepoint detection for a range of penalties. J. Comput. Graph. Stat. 26 (1), 134–143.
- He and Sun (2015) He, K. and J. Sun (2015). Convolutional neural networks at constrained time cost. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5353–5360.
- He et al. (2016) He, K., X. Zhang, S. Ren, and J. Sun (2016, June). Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
- Hocking et al. (2015) Hocking, T., G. Rigaill, and G. Bourque (2015). PeakSeg: constrained optimal segmentation and supervised penalty learning for peak detection in count data. In International Conference on Machine Learning, pp. 324–332. PMLR.
- Huang et al. (2023) Huang, T.-J., Q.-L. Zhou, H.-J. Ye, and D.-C. Zhan (2023). Change point detection via synthetic signals. In 8th Workshop on Advanced Analytics and Learning on Temporal Data.
- Ioffe and Szegedy (2015) Ioffe, S. and C. Szegedy (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, pp. 448–456. JMLR.org.
- James et al. (1987) James, B., K. L. James, and D. Siegmund (1987). Tests for a change-point. Biometrika 74 (1), 71–83.
- Jandhyala et al. (2013) Jandhyala, V., S. Fotopoulos, I. MacNeill, and P. Liu (2013). Inference for single and multiple change-points in time series. J. Time Ser. Anal. 34 (4), 423–446.
- Ke et al. (2017) Ke, G., Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu (2017). LightGBM: A highly efficient gradient boosting decision tree. Adv. Neur. In. 30, 3146–3154.
- Killick et al. (2012) Killick, R., P. Fearnhead, and I. A. Eckley (2012). Optimal detection of changepoints with a linear computational cost. J. Am. Stat. Assoc. 107 (500), 1590–1598.
- Kingma and Ba (2015) Kingma, D. P. and J. Ba (2015). Adam: A method for stochastic optimization. In Y. Bengio and Y. LeCun (Eds.), ICLR (Poster).
- Kuchibhotla and Chakrabortty (2022) Kuchibhotla, A. K. and A. Chakrabortty (2022). Moving beyond sub-Gaussianity in high-dimensional statistics: Applications in covariance estimation and linear regression. Inf. Inference: A Journal of the IMA 11 (4), 1389–1456.
- Lee et al. (2023) Lee, J., Y. Xie, and X. Cheng (2023). Training neural networks for sequential change-point detection. In IEEE ICASSP 2023, pp. 1–5. IEEE.
- Li et al. (2015) Li, F., Z. Tian, Y. Xiao, and Z. Chen (2015). Variance change-point detection in panel data models. Econ. Lett. 126, 140–143.
- Li et al. (2023) Li, J., P. Fearnhead, P. Fryzlewicz, and T. Wang (2023). Automatic change-point detection in time series via deep learning. submitted, arxiv:2211.03860.
- Li et al. (2023) Li, M., Y. Chen, T. Wang, and Y. Yu (2023). Robust mean change point testing in high-dimensional data with heavy tails. arXiv preprint, arxiv:2305.18987.
- Liehrmann et al. (2021) Liehrmann, A., G. Rigaill, and T. D. Hocking (2021). Increased peak detection accuracy in over-dispersed ChIP-seq data with supervised segmentation models. BMC Bioinform. 22 (1), 1–18.
- Londschien et al. (2022) Londschien, M., P. Bühlmann, and S. Kovács (2022). Random forests for change point detection. arXiv preprint, arxiv:2205.04997.
- Mohri et al. (2012) Mohri, M., A. Rostamizadeh, and A. Talwalkar (2012). Foundations of Machine Learning. Adaptive Computation and Machine Learning Series. Cambridge, MA: MIT Press.
- Ng (2004) Ng, A. Y. (2004). Feature selection, $l_{1}$ vs. $l_{2}$ regularization, and rotational invariance. In Proceedings of the Twenty-First International Conference on Machine Learning, ICML ’04, New York, NY, USA, pp. 78. Association for Computing Machinery.
- Oh et al. (2005) Oh, K. J., M. S. Moon, and T. Y. Kim (2005). Variance change point detection via artificial neural networks for data separation. Neurocomputing 68, 239–250.
- Paaß and Giesselbach (2023) Paaß, G. and S. Giesselbach (2023). Foundation Models for Natural Language Processing: Pre-trained Language Models Integrating Media. Artificial Intelligence: Foundations, Theory, and Algorithms. Springer International Publishing.
- Picard et al. (2005) Picard, F., S. Robin, M. Lavielle, C. Vaisse, and J.-J. Daudin (2005). A statistical approach for array CGH data analysis. BMC Bioinform. 6 (1).
- Reeves et al. (2007) Reeves, J., J. Chen, X. L. Wang, R. Lund, and Q. Q. Lu (2007). A review and comparison of changepoint detection techniques for climate data. J. Appl. Meteorol. Clim. 46 (6), 900–915.
- Ripley (1994) Ripley, B. D. (1994). Neural networks and related methods for classification. J. Roy. Stat. Soc., Ser. B 56 (3), 409–456.
- Schmidt-Hieber (2020) Schmidt-Hieber, J. (2020). Nonparametric regression using deep neural networks with ReLU activation function. Ann. Stat. 48 (4), 1875–1897.
- Shalev-Shwartz and Ben-David (2014) Shalev-Shwartz, S. and S. Ben-David (2014). Understanding Machine Learning: From Theory to Algorithms. New York, NY, USA: Cambridge University Press.
- Truong et al. (2020) Truong, C., L. Oudre, and N. Vayatis (2020). Selective review of offline change point detection methods. Signal Process. 167, 107299.
- Verzelen et al. (2020) Verzelen, N., M. Fromont, M. Lerasle, and P. Reynaud-Bouret (2020). Optimal change-point detection and localization. arXiv preprint, arxiv:2010.11470.
- Wang and Samworth (2018) Wang, T. and R. J. Samworth (2018). High dimensional change point estimation via sparse projection. J. Roy. Stat. Soc., Ser. B 80 (1), 57–83.
- Yamashita et al. (2018) Yamashita, R., M. Nishio, R. K. G. Do, and K. Togashi (2018). Convolutional neural networks: an overview and application in radiology. Insights into Imaging 9 (4), 611–629.
This is the appendix for the main paper Li, Fearnhead, Fryzlewicz, and Wang (2023), hereafter referred to as the main text. We present proofs of our main lemmas and theorems. Various technical details, results of the numerical study and the real data analysis are also presented here.
Appendix A Proofs
A.1 The proof of Lemma 3.1
Define $W_{0}\coloneqq(\boldsymbol{v}_{1},...,\boldsymbol{v}_{n-1},-\boldsymbol{v}_{1},...,-\boldsymbol{v}_{n-1})^{\top}$ and $W_{1}\coloneqq\boldsymbol{1}_{2n-2}$, $\boldsymbol{b}_{1}\coloneqq\lambda\boldsymbol{1}_{2n-2}$ and $b_{2}\coloneqq 0$. Then $h(\boldsymbol{x})\coloneqq\sigma^{*}_{b_{2}}W_{1}\sigma_{\boldsymbol{b}_{1}}W_{0}\boldsymbol{x}\in\mathcal{H}_{1,2n-2}$ can be rewritten as
$$
h(\boldsymbol{x})=\mathbbm{1}\biggl\{\sum_{i=1}^{n-1}\bigl\{(\boldsymbol{v}_{i}^{\top}\boldsymbol{x}-\lambda)_{+}+(-\boldsymbol{v}_{i}^{\top}\boldsymbol{x}-\lambda)_{+}\bigr\}>b_{2}\biggr\}=\mathbbm{1}\{\|\mathcal{C}(\boldsymbol{x})\|_{\infty}>\lambda\}=h_{\lambda}^{\mathrm{CUSUM}}(\boldsymbol{x}),
$$
as desired.
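To make the representation concrete, here is a small numerical sketch (not taken from the AutoCPD code) checking that a one-hidden-layer ReLU network with the weights $W_{0}$, $W_{1}$, $\boldsymbol{b}_{1}$ and $b_{2}$ constructed above reproduces the CUSUM test $h_{\lambda}^{\mathrm{CUSUM}}$; the helper `cusum_vectors`, which builds the unit-norm CUSUM projection vectors, is our own construction.

```python
import numpy as np

def cusum_vectors(n):
    """Unit l2-norm CUSUM projection vectors v_1, ..., v_{n-1}."""
    V = np.zeros((n - 1, n))
    for t in range(1, n):
        V[t - 1, :t] = np.sqrt((n - t) / (n * t))
        V[t - 1, t:] = -np.sqrt(t / (n * (n - t)))
    return V

def cusum_test(x, lam):
    """h_lambda^CUSUM(x) = indicator{ ||C(x)||_inf > lambda }."""
    return float(np.max(np.abs(cusum_vectors(len(x)) @ x)) > lam)

def relu_network_test(x, lam):
    """The network of the proof: W0 stacks (v_1,...,v_{n-1},-v_1,...,-v_{n-1}),
    the hidden bias is lambda, W1 sums the hidden units, and b2 = 0."""
    V = cusum_vectors(len(x))
    hidden = np.maximum(np.vstack([V, -V]) @ x - lam, 0.0)  # ReLU layer
    return float(np.sum(hidden) > 0.0)                      # output indicator

rng = np.random.default_rng(0)
x = rng.standard_normal(100)
x[50:] += 1.0                        # a change in mean at tau = 50
lam = np.sqrt(2 * np.log(100))
assert cusum_test(x, lam) == relu_network_test(x, lam)
```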
A.2 The Proof of Lemma 3.2
As $\boldsymbol{\Gamma}$ is invertible, (2) in the main text is equivalent to
$$
\boldsymbol{\Gamma}^{-1}\boldsymbol{X}=\boldsymbol{\Gamma}^{-1}\boldsymbol{Z}\boldsymbol{\beta}+\boldsymbol{\Gamma}^{-1}\boldsymbol{c}_{\tau}\phi+\boldsymbol{\xi}.
$$
Write $\tilde{\boldsymbol{X}}=\boldsymbol{\Gamma}^{-1}\boldsymbol{X}$ , $\tilde{\boldsymbol{Z}}=\boldsymbol{\Gamma}^{-1}\boldsymbol{Z}$ and $\tilde{\boldsymbol{c}}_{\tau}=\boldsymbol{\Gamma}^{-1}\boldsymbol{c}_{\tau}$ . If $\tilde{\boldsymbol{c}}_{\tau}$ lies in the column span of $\tilde{\boldsymbol{Z}}$ , then the model with a change at $\tau$ is equivalent to the model with no change, and the likelihood-ratio test statistic will be 0. Otherwise we can assume, without loss of generality that $\tilde{\boldsymbol{c}}_{\tau}$ is orthogonal to each column of $\tilde{\boldsymbol{Z}}$ : if this is not the case we can construct an equivalent model where we replace $\tilde{\boldsymbol{c}}_{\tau}$ with its projection to the space that is orthogonal to the column span of $\tilde{\boldsymbol{Z}}$ . As $\boldsymbol{\xi}$ is a vector of independent standard normal random variables, the likelihood-ratio statistic for a change at $\tau$ against no change is a monotone function of the reduction in the residual sum of squares of the model with a change at $\tau$ . The residual sum of squares of the no change model is
$$
\tilde{\boldsymbol{X}}^{\top}\tilde{\boldsymbol{X}}-\tilde{\boldsymbol{X}}^{\top}\tilde{\boldsymbol{Z}}(\tilde{\boldsymbol{Z}}^{\top}\tilde{\boldsymbol{Z}})^{-1}\tilde{\boldsymbol{Z}}^{\top}\tilde{\boldsymbol{X}}.
$$
The residual sum of squares for the model with a change at $\tau$ is
$$
\tilde{\boldsymbol{X}}^{\top}\tilde{\boldsymbol{X}}-\tilde{\boldsymbol{X}}^{\top}[\tilde{\boldsymbol{Z}},\tilde{\boldsymbol{c}}_{\tau}]([\tilde{\boldsymbol{Z}},\tilde{\boldsymbol{c}}_{\tau}]^{\top}[\tilde{\boldsymbol{Z}},\tilde{\boldsymbol{c}}_{\tau}])^{-1}[\tilde{\boldsymbol{Z}},\tilde{\boldsymbol{c}}_{\tau}]^{\top}\tilde{\boldsymbol{X}}=\tilde{\boldsymbol{X}}^{\top}\tilde{\boldsymbol{X}}-\tilde{\boldsymbol{X}}^{\top}\tilde{\boldsymbol{Z}}(\tilde{\boldsymbol{Z}}^{\top}\tilde{\boldsymbol{Z}})^{-1}\tilde{\boldsymbol{Z}}^{\top}\tilde{\boldsymbol{X}}-\tilde{\boldsymbol{X}}^{\top}\tilde{\boldsymbol{c}}_{\tau}(\tilde{\boldsymbol{c}}_{\tau}^{\top}\tilde{\boldsymbol{c}}_{\tau})^{-1}\tilde{\boldsymbol{c}}_{\tau}^{\top}\tilde{\boldsymbol{X}}.
$$
Thus, the reduction in the residual sum of squares of the model with the change at $\tau$ over the no-change model is
$$
\tilde{\boldsymbol{X}}^{\top}\tilde{\boldsymbol{c}}_{\tau}(\tilde{\boldsymbol{c}}_{\tau}^{\top}\tilde{\boldsymbol{c}}_{\tau})^{-1}\tilde{\boldsymbol{c}}_{\tau}^{\top}\tilde{\boldsymbol{X}}=\left(\frac{1}{\sqrt{\tilde{\boldsymbol{c}}_{\tau}^{\top}\tilde{\boldsymbol{c}}_{\tau}}}\tilde{\boldsymbol{c}}_{\tau}^{\top}\tilde{\boldsymbol{X}}\right)^{2}.
$$
Thus if we define
$$
\boldsymbol{v}_{\tau}=\frac{1}{\sqrt{\tilde{\boldsymbol{c}}_{\tau}^{\top}\tilde{\boldsymbol{c}}_{\tau}}}\tilde{\boldsymbol{c}}_{\tau}^{\top}\boldsymbol{\Gamma}^{-1},
$$
then the likelihood-ratio test statistic is a monotone function of $|\boldsymbol{v}_{\tau}\boldsymbol{X}|$ . This is true for all $\tau$ so the likelihood-ratio test is equivalent to
$$
\max_{\tau\in[n-1]}|\boldsymbol{v}_{\tau}\boldsymbol{X}|>\lambda,
$$
for some $\lambda$. This is of a similar form to the standard CUSUM test, except that the form of $\boldsymbol{v}_{\tau}$ is different. Thus, by the same argument as for Lemma 3.1 in the main text, we can replicate this test with $h(\boldsymbol{x})\in\mathcal{H}_{1,2n-2}$, but with different weights to represent the different form of $\boldsymbol{v}_{\tau}$.
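As an illustration of the construction above, the following numpy sketch computes $\boldsymbol{v}_{\tau}$ for a given invertible $\boldsymbol{\Gamma}$, design matrix $\boldsymbol{Z}$ and change direction $\boldsymbol{c}_{\tau}$, including the projection of $\tilde{\boldsymbol{c}}_{\tau}$ onto the orthogonal complement of the column span of $\tilde{\boldsymbol{Z}}$; the function name and input shapes are our own illustrative assumptions.

```python
import numpy as np

def generalised_cusum_vector(Gamma, Z, c_tau):
    """Compute v_tau such that the likelihood-ratio statistic for a change at
    tau is a monotone function of |v_tau @ X|, following the argument above.

    Gamma : (n, n) invertible noise transform
    Z     : (n, p) design matrix of the no-change model
    c_tau : (n,) change direction
    """
    Gamma_inv = np.linalg.inv(Gamma)
    Z_tilde = Gamma_inv @ Z
    c_tilde = Gamma_inv @ c_tau
    # project c_tilde onto the orthogonal complement of the column span of Z_tilde
    proj = Z_tilde @ np.linalg.solve(Z_tilde.T @ Z_tilde, Z_tilde.T @ c_tilde)
    c_perp = c_tilde - proj
    norm = np.sqrt(c_perp @ c_perp)
    if norm == 0.0:
        # c_tilde lies in the column span of Z_tilde: the statistic is zero
        return np.zeros_like(c_tau, dtype=float)
    return (c_perp / norm) @ Gamma_inv  # row vector v_tau
```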
A.3 The Proof of Lemma 4.1
*Proof.*
(a) For each $i\in[n-1]$, since $\|\boldsymbol{v}_{i}\|_{2}=1$, we have $\boldsymbol{v}_{i}^{\top}\boldsymbol{X}\sim N(0,1)$. Hence, by the Gaussian tail bound and a union bound,
$$
\mathbb{P}\Bigl\{\|\mathcal{C}(\boldsymbol{X})\|_{\infty}>t\Bigr\}\leq\sum_{i=1}^{n-1}\mathbb{P}\bigl(\bigl|\boldsymbol{v}_{i}^{\top}\boldsymbol{X}\bigr|>t\bigr)\leq n\exp(-t^{2}/2).
$$
The result follows by taking $t=\sqrt{2\log(n/\varepsilon)}$. (b) We write $\boldsymbol{X}=\boldsymbol{\mu}+\boldsymbol{Z}$, where $\boldsymbol{Z}\sim N_{n}(0,I_{n})$. Since the CUSUM transformation is linear, we have $\mathcal{C}(\boldsymbol{X})=\mathcal{C}(\boldsymbol{\mu})+\mathcal{C}(\boldsymbol{Z})$. By part (a), there is an event $\Omega$ with probability at least $1-\varepsilon$ on which $\|\mathcal{C}(\boldsymbol{Z})\|_{\infty}\leq\sqrt{2\log(n/\varepsilon)}$. Moreover, we have $\|\mathcal{C}(\boldsymbol{\mu})\|_{\infty}=|\boldsymbol{v}_{\tau}^{\top}\boldsymbol{\mu}|=|\mu_{\mathrm{L}}-\mu_{\mathrm{R}}|\sqrt{n\eta(1-\eta)}$. Hence on $\Omega$, we have by the triangle inequality that
$$
\|\mathcal{C}(\boldsymbol{X})\|_{\infty}\geq\|\mathcal{C}(\boldsymbol{\mu})\|_{\infty}-\|\mathcal{C}(\boldsymbol{Z})\|_{\infty}\geq|\mu_{\mathrm{L}}-\mu_{\mathrm{R}}|\sqrt{n\eta(1-\eta)}-\sqrt{2\log(n/\varepsilon)}>\sqrt{2\log(n/\varepsilon)},
$$
as desired. ∎
A.4 The Proof of Corollary 4.1
*Proof.*
From Lemma 4.1 in the main text with $\varepsilon=ne^{-nB^{2}/8}$, we have
$$
\mathbb{P}(h_{\lambda}^{\mathrm{CUSUM}}(\boldsymbol{X})\neq Y\mid\tau,\mu_{\mathrm{L}},\mu_{\mathrm{R}})\leq ne^{-nB^{2}/8},
$$
and the desired result follows by integrating over $\pi_{0}$ . ∎
A.5 Auxiliary Lemma
**Lemma A.1**
*Define $T^{\prime}\coloneqq\{t_{0}\in\mathbb{Z}^{+}:|t_{0}-\tau|\leq\min(\tau,n-\tau)/2\}$. Then we have
$$
\min_{t_{0}\in T^{\prime}}|\boldsymbol{v}_{t_{0}}^{\top}\boldsymbol{\mu}|\geq\frac{\sqrt{3}}{3}|\mu_{\mathrm{L}}-\mu_{\mathrm{R}}|\sqrt{n\eta(1-\eta)}.
$$*
*Proof.*
For simplicity, let $\Delta\coloneqq|\mu_{\mathrm{L}}-\mu_{\mathrm{R}}|$. We can compute the CUSUM test statistics $a_{i}=|\boldsymbol{v}_{i}^{\top}\boldsymbol{\mu}|$ as
$$
a_{i}=\begin{cases}\Delta(1-\eta)\sqrt{\frac{ni}{n-i}}&1\leq i\leq\tau,\\ \Delta\eta\sqrt{\frac{n(n-i)}{i}}&\tau<i\leq n-1.\end{cases}
$$
It is easy to verify that $\max_{i}a_{i}=a_{\tau}=\Delta\sqrt{n\eta(1-\eta)}$. Next, we only discuss the case $1\leq\tau\leq\lfloor n/2\rfloor$, as the same result for $\lceil n/2\rceil\leq\tau\leq n$ can be obtained by a similar argument. When $1\leq\tau\leq\lfloor n/2\rfloor$, $|t_{0}-\tau|\leq\min(\tau,n-\tau)/2$ implies that $t_{l}\leq t_{0}\leq t_{u}$, where $t_{l}\coloneqq\lceil\tau/2\rceil$ and $t_{u}\coloneqq\lfloor 3\tau/2\rfloor$. Because $a_{i}$ is an increasing function of $i$ on $[1,\tau]$ and a decreasing function of $i$ on $[\tau+1,n-1]$, the minimum of $a_{t_{0}}$ over $t_{l}\leq t_{0}\leq t_{u}$ is attained at either $t_{l}$ or $t_{u}$. Hence, we have
$$
a_{t_{l}}\geq a_{\tau/2}=a_{\tau}\sqrt{\frac{n-\tau}{2n-\tau}}\qquad\text{and}\qquad a_{t_{u}}\geq a_{3\tau/2}=a_{\tau}\sqrt{\frac{2n-3\tau}{3(n-\tau)}}.
$$
Define $f(x)\coloneqq\sqrt{\frac{n-x}{2n-x}}$ and $g(x)\coloneqq\sqrt{\frac{2n-3x}{3(n-x)}}$. Both $f(x)$ and $g(x)$ are decreasing functions of $x\in[1,n]$; therefore, since $\tau\leq\lfloor n/2\rfloor\leq n/2$, we have $f(\tau)\geq f(n/2)=\sqrt{3}/3$ and $g(\tau)\geq g(n/2)=\sqrt{3}/3$, as desired. ∎
A.6 The Proof of Theorem 4.2
*Proof.*
Given any $L\geq 1$ and $\boldsymbol{m}=(m_{1},...,m_{L})^{\top}$, let $m_{0}:=n$ and $m_{L+1}:=1$ and set $W^{*}=\sum_{r=1}^{L+1}m_{r-1}m_{r}$. Let $d\coloneqq\mathrm{VCdim}(\mathcal{H}_{L,\boldsymbol{m}})$; then by Bartlett et al. (2019, Theorem 7), we have $d=O(LW^{*}\log(W^{*}))$. Thus, by Mohri et al. (2012, Corollary 3.4), for some universal constant $C>0$, we have with probability at least $1-\delta$ that
$$
\mathbb{P}(h_{\mathrm{ERM}}(\boldsymbol{X})\neq Y\mid\mathcal{D})\leq\min_{h\in\mathcal{H}_{L,\boldsymbol{m}}}\mathbb{P}(h(\boldsymbol{X})\neq Y)+\sqrt{\frac{8d\log(2eN/d)+8\log(4/\delta)}{N}}. \tag{5}
$$
Here, we have $L=1$, $m_{1}=2n-2$, $W^{*}=O(n^{2})$, so $d=O(n^{2}\log(n))$. In addition, since $h^{\mathrm{CUSUM}}_{\lambda}\in\mathcal{H}_{1,2n-2}$, we have $\min_{h\in\mathcal{H}_{L,\boldsymbol{m}}}\mathbb{P}(h(\boldsymbol{X})\neq Y)\leq\mathbb{P}(h^{\mathrm{CUSUM}}_{\lambda}(\boldsymbol{X})\neq Y)\leq ne^{-nB^{2}/8}$. Substituting these bounds into (5), we arrive at the desired result. ∎
A.7 The Proof of Theorem 4.3
The following lemma gives the misclassification error of the generalised CUSUM test, where we only test for changes on a grid of $O(\log n)$ values.
**Lemma A.2**
*Fix $\varepsilon∈(0,1)$ and suppose that $\boldsymbol{X}\sim P(n,\tau,\mu_{\mathrm{L}},\mu_{\mathrm{R}})$ for some $\tau∈[n-1]$ and $\mu_{\mathrm{L}},\mu_{\mathrm{R}}∈\mathbb{R}$ .
1. If $\mu_{\mathrm{L}}=\mu_{\mathrm{R}}$, then
$$
\mathbb{P}\Bigl\{\max_{t\in T_{0}}|\boldsymbol{v}_{t}^{\top}\boldsymbol{X}|>\sqrt{2\log(|T_{0}|/\varepsilon)}\Bigr\}\leq\varepsilon.
$$
2. If $|\mu_{\mathrm{L}}-\mu_{\mathrm{R}}|\sqrt{\eta(1-\eta)}>\sqrt{24\log(|T_{0}|/\varepsilon)/n}$, then we have
$$
\mathbb{P}\Bigl\{\max_{t\in T_{0}}|\boldsymbol{v}_{t}^{\top}\boldsymbol{X}|\leq\sqrt{2\log(|T_{0}|/\varepsilon)}\Bigr\}\leq\varepsilon.
$$*
*Proof.*
(a) For each $t\in[n-1]$, since $\|\boldsymbol{v}_{t}\|_{2}=1$, we have $\boldsymbol{v}_{t}^{\top}\boldsymbol{X}\sim N(0,1)$. Hence, by the Gaussian tail bound and a union bound,
$$
\mathbb{P}\Bigl\{\max_{t\in T_{0}}|\boldsymbol{v}_{t}^{\top}\boldsymbol{X}|>y\Bigr\}\leq\sum_{t\in T_{0}}\mathbb{P}\bigl(\bigl|\boldsymbol{v}_{t}^{\top}\boldsymbol{X}\bigr|>y\bigr)\leq|T_{0}|\exp(-y^{2}/2).
$$
The result follows by taking $y=\sqrt{2\log(|T_{0}|/\varepsilon)}$ . (b) There exists some $t_{0}∈ T_{0}$ such that $|t_{0}-\tau|≤\min\{\tau,n-\tau\}/2$ . By Lemma A.1, we have
$$
|\boldsymbol{v}_{t_{0}}^{\top}\mathbb{E}\boldsymbol{X}|\geq\frac{\sqrt{3}}{3}\|\mathcal{C}(\mathbb{E}\boldsymbol{X})\|_{\infty}\geq\frac{\sqrt{3}}{3}|\mu_{\mathrm{L}}-\mu_{\mathrm{R}}|\sqrt{n\eta(1-\eta)}\geq 2\sqrt{2\log(|T_{0}|/\varepsilon)}.
$$
Consequently, by the triangle inequality and result from part (a), we have with probability at least $1-\varepsilon$ that
$$
\max_{t\in T_{0}}|\boldsymbol{v}_{t}^{\top}\boldsymbol{X}|\geq|\boldsymbol{v}_{t_{0}}^{\top}\boldsymbol{X}|\geq|\boldsymbol{v}_{t_{0}}^{\top}\mathbb{E}\boldsymbol{X}|-|\boldsymbol{v}_{t_{0}}^{\top}(\boldsymbol{X}-\mathbb{E}\boldsymbol{X})|\geq\sqrt{2\log(|T_{0}|/\varepsilon)},
$$
as desired. ∎
Using the above lemma we have the following result.
**Corollary A.1**
*Fix $B>0$ . Let $\pi_{0}$ be any prior distribution on $\Theta(B)$ , then draw $(\tau,\mu_{\mathrm{L}},\mu_{\mathrm{R}})\sim\pi_{0}$ , $\boldsymbol{X}\sim P(n,\tau,\mu_{\mathrm{L}},\mu_{\mathrm{R}})$ , and define $Y=\mathbbm{1}\{\mu_{\mathrm{L}}≠\mu_{\mathrm{R}}\}$ . Then for $\lambda^{*}=B\sqrt{3n}/6$ , the test $h^{\mathrm{CUSUM}_{*}}_{\lambda^{*}}$ satisfies
$$
\mathbb{P}(h^{\mathrm{CUSUM}_{*}}_{\lambda^{*}}(\boldsymbol{X})\neq Y)\leq 2\lfloor\log_{2}(n)\rfloor e^{-nB^{2}/24}.
$$*
*Proof.*
Setting $\varepsilon=|T_{0}|e^{-nB^{2}/24}$ in Lemma A.2, we have for any $(\tau,\mu_{\mathrm{L}},\mu_{\mathrm{R}})∈\Theta(B)$ that
$$
\mathbb{P}(h^{\mathrm{CUSUM}_{*}}_{\lambda^{*}}(\boldsymbol{X})\neq\mathbbm{1}\{\mu_{\mathrm{L}}\neq\mu_{\mathrm{R}}\})\leq|T_{0}|e^{-nB^{2}/24}.
$$
The result then follows by integrating over $\pi_{0}$ and the fact that $|T_{0}|=2\lfloor\log_{2}(n)\rfloor$ . ∎
*Proof of Theorem 4.3.*
We follow the proof of Theorem 4.2 up to (5). From the conditions of the theorem, we have $W^{*}=O(Ln\log n)$. Moreover, we have $h^{\mathrm{CUSUM}_{*}}_{\lambda^{*}}\in\mathcal{H}_{1,4\lfloor\log_{2}(n)\rfloor}\subseteq\mathcal{H}_{L,\boldsymbol{m}}$. Thus,
$$
\mathbb{P}(h_{\mathrm{ERM}}(\boldsymbol{X})\neq Y\mid\mathcal{D})\leq\mathbb{P}(h^{\mathrm{CUSUM}_{*}}_{\lambda^{*}}(\boldsymbol{X})\neq Y)+C\sqrt{\frac{L^{2}n\log n\log(Ln)\log(N)+\log(1/\delta)}{N}},
$$
as desired. ∎
A.8 Generalisation to time-dependent or heavy-tailed observations
So far, for simplicity of exposition, we have primarily focused on change-point models with independent and identically distributed Gaussian observations. However, neural network-based procedures can also be applied to time-dependent or heavy-tailed observations. We first consider the case where the noise series $\xi_{1},...,\xi_{n}$ is a centred stationary Gaussian process with short-range temporal dependence. Specifically, writing $K(u):=\mathrm{cov}(\xi_{t},\xi_{t+u})$, we assume that
$$
\sum_{u=0}^{n-1}K(u)\leq D. \tag{6}
$$
**Theorem A.3**
*Fix $B>0$, $n>0$ and let $\pi_{0}$ be any prior distribution on $\Theta(B)$. We draw $(\tau,\mu_{\mathrm{L}},\mu_{\mathrm{R}})\sim\pi_{0}$, set $Y:=\mathbbm{1}\{\mu_{\mathrm{L}}\neq\mu_{\mathrm{R}}\}$ and generate $\boldsymbol{X}:=\boldsymbol{\mu}+\boldsymbol{\xi}$ such that $\boldsymbol{\mu}:=(\mu_{\mathrm{L}}\mathbbm{1}\{i\leq\tau\}+\mu_{\mathrm{R}}\mathbbm{1}\{i>\tau\})_{i\in[n]}$ and $\boldsymbol{\xi}$ is a centred stationary Gaussian process satisfying (6). Suppose that the training data $\mathcal{D}:=\bigl((\boldsymbol{X}^{(1)},Y^{(1)}),...,(\boldsymbol{X}^{(N)},Y^{(N)})\bigr)$ consist of independent copies of $(\boldsymbol{X},Y)$ and let $h_{\mathrm{ERM}}:=\operatorname*{arg\,min}_{h\in\mathcal{H}_{L,\boldsymbol{m}}}L_{N}(h)$ be the empirical risk minimiser for a neural network with $L\geq 1$ layers and $\boldsymbol{m}=(m_{1},...,m_{L})^{\top}$ hidden layer widths. If $m_{1}\geq 4\lfloor\log_{2}(n)\rfloor$ and $m_{r}m_{r+1}=O(n\log n)$ for all $r\in[L-1]$, then for any $\delta\in(0,1)$, we have with probability at least $1-\delta$ that
$$
\mathbb{P}(h_{\mathrm{ERM}}(\boldsymbol{X})\neq Y\mid\mathcal{D})\leq 2\lfloor\log_{2}(n)\rfloor e^{-nB^{2}/(48D)}+C\sqrt{\frac{L^{2}n\log^{2}(Ln)\log(N)+\log(1/\delta)}{N}}.
$$*
*Proof.*
By the proof of Wang and Samworth (2018, supplementary Lemma 10),
$$
\mathbb{P}\bigl\{\max_{t\in T_{0}}|\boldsymbol{v}_{t}^{\top}\boldsymbol{\xi}|>B\sqrt{3n}/6\bigr\}\leq|T_{0}|e^{-nB^{2}/(48D)}.
$$
On the other hand, for $t_{0}$ defined in the proof of Lemma A.1, since $|\mu_{\mathrm{L}}-\mu_{\mathrm{R}}|\sqrt{\tau(n-\tau)}/n>B$, we have $|\boldsymbol{v}_{t_{0}}^{\top}\mathbb{E}\boldsymbol{X}|\geq B\sqrt{3n}/3$. Hence, for $\lambda^{*}=B\sqrt{3n}/6$, the test $h_{\lambda^{*}}^{\mathrm{CUSUM}_{*}}$ satisfies
$$
\mathbb{P}(h_{\lambda^{*}}^{\mathrm{CUSUM}_{*}}(\boldsymbol{X})\neq Y)\leq|T_{0}|e^{-nB^{2}/(48D)}.
$$
We can then complete the proof using the same arguments as in the proof of Theorem 4.3. ∎
We now turn to non-Gaussian distributions and recall that the Orlicz $\psi_{\alpha}$ -norm of a random variable $Y$ is defined as
$$
\|Y\|_{\psi_{\alpha}}:=\inf\{\eta:\mathbb{E}\exp(|Y/\eta|^{\alpha})\leq 2\}.
$$
For $\alpha\in(0,2)$, the random variable $Y$ has a heavier tail than a sub-Gaussian random variable. The following lemma is a direct consequence of Kuchibhotla and Chakrabortty (2022, Theorem 3.1) (we state the version used in Li et al. (2023, Proposition 14)).
**Lemma A.4**
*Fix $\alpha\in(0,2)$. Suppose $\boldsymbol{\xi}=(\xi_{1},...,\xi_{n})^{\top}$ has independent components satisfying $\mathbb{E}\xi_{t}=0$, $\mathrm{Var}(\xi_{t})=1$ and $\|\xi_{t}\|_{\psi_{\alpha}}\leq K$ for all $t\in[n]$. There exists $c_{\alpha}>0$, depending only on $\alpha$, such that for any $1\leq t\leq n/2$, we have
$$
\mathbb{P}\bigl(|\boldsymbol{v}_{t}^{\top}\boldsymbol{\xi}|\geq y\bigr)\leq\exp\biggl\{1-c_{\alpha}\min\biggl\{\biggl(\frac{y}{K}\biggr)^{2},\,\biggl(\frac{y}{K\|\boldsymbol{v}_{t}\|_{\beta(\alpha)}}\biggr)^{\alpha}\biggr\}\biggr\},
$$
where $\beta(\alpha)=∞$ for $\alpha≤ 1$ and $\beta(\alpha)=\alpha/(\alpha-1)$ when $\alpha>1$ .*
**Theorem A.5**
*Fix $\alpha\in(0,2)$, $B>0$, $n>0$ and let $\pi_{0}$ be any prior distribution on $\Theta(B)$. We draw $(\tau,\mu_{\mathrm{L}},\mu_{\mathrm{R}})\sim\pi_{0}$, set $Y:=\mathbbm{1}\{\mu_{\mathrm{L}}\neq\mu_{\mathrm{R}}\}$ and generate $\boldsymbol{X}:=\boldsymbol{\mu}+\boldsymbol{\xi}$ such that $\boldsymbol{\mu}:=(\mu_{\mathrm{L}}\mathbbm{1}\{i\leq\tau\}+\mu_{\mathrm{R}}\mathbbm{1}\{i>\tau\})_{i\in[n]}$ and $\boldsymbol{\xi}=(\xi_{1},...,\xi_{n})^{\top}$ satisfies $\mathbb{E}\xi_{i}=0$, $\mathrm{Var}(\xi_{i})=1$ and $\|\xi_{i}\|_{\psi_{\alpha}}\leq K$ for all $i\in[n]$. Suppose that the training data $\mathcal{D}:=\bigl((\boldsymbol{X}^{(1)},Y^{(1)}),...,(\boldsymbol{X}^{(N)},Y^{(N)})\bigr)$ consist of independent copies of $(\boldsymbol{X},Y)$ and let $h_{\mathrm{ERM}}:=\operatorname*{arg\,min}_{h\in\mathcal{H}_{L,\boldsymbol{m}}}L_{N}(h)$ be the empirical risk minimiser for a neural network with $L\geq 1$ layers and $\boldsymbol{m}=(m_{1},...,m_{L})^{\top}$ hidden layer widths. If $m_{1}\geq 4\lfloor\log_{2}(n)\rfloor$ and $m_{r}m_{r+1}=O(n\log n)$ for all $r\in[L-1]$, then there exists a constant $c_{\alpha}>0$, depending only on $\alpha$, such that for any $\delta\in(0,1)$, we have with probability at least $1-\delta$ that
$$
\mathbb{P}(h_{\mathrm{ERM}}(\boldsymbol{X})\neq Y\mid\mathcal{D})\leq 2\lfloor\log_{2}(n)\rfloor e^{1-c_{\alpha}(\sqrt{n}B/K)^{\alpha}}+C\sqrt{\frac{L^{2}n\log^{2}(Ln)\log(N)+\log(1/\delta)}{N}}.
$$*
*Proof.*
For $\alpha\in(0,2)$, we have $\beta(\alpha)>2$, so $\|\boldsymbol{v}_{t}\|_{\beta(\alpha)}\leq\|\boldsymbol{v}_{t}\|_{2}=1$ and hence $(y/(K\|\boldsymbol{v}_{t}\|_{\beta(\alpha)}))^{\alpha}\geq(y/K)^{\alpha}$. Thus, from Lemma A.4, we have $\mathbb{P}(|\boldsymbol{v}_{t}^{\top}\boldsymbol{\xi}|\geq y)\leq e^{1-c_{\alpha}(y/K)^{\alpha}}$. Thus, following the proof of Corollary A.1, we obtain $\mathbb{P}(h_{\lambda^{*}}^{\mathrm{CUSUM}_{*}}(\boldsymbol{X})\neq Y)\leq 2\lfloor\log_{2}(n)\rfloor e^{1-c_{\alpha}(\sqrt{n}B/K)^{\alpha}}$. Finally, the desired conclusion follows from the same argument as in the proof of Theorem 4.3. ∎
A.9 Multiple change-point estimation
Algorithm 1 is a general scheme for turning a change-point classifier into a location estimator. While it is challenging to derive theoretical guarantees for the neural network-based change-point location estimation error, we motivate this methodological proposal here by showing that Algorithm 1, applied in conjunction with a CUSUM-based classifier, has an optimal rate of convergence for the change-point localisation task. We consider the model $x_{i}=\mu_{i}+\xi_{i}$, where $\xi_{i}\stackrel{{\scriptstyle\mathrm{iid}}}{{\sim}}N(0,1)$ for $i\in[n^{*}]$. Moreover, for a sequence of change-points $0=\tau_{0}<\tau_{1}<\cdots<\tau_{\nu}<n^{*}=\tau_{\nu+1}$ satisfying $\tau_{r}-\tau_{r-1}\geq 2n$ for all $r\in[\nu+1]$, we have $\mu_{i}=\mu^{(r-1)}$ for all $i\in(\tau_{r-1},\tau_{r}]$, $r\in[\nu+1]$.
**Theorem A.6**
*Suppose data $x_{1},...,x_{n^{*}}$ are generated as above satisfying $|\mu^{(r)}-\mu^{(r-1)}|>2\sqrt{2}B$ for all $r∈[\nu]$ . Let $h_{\lambda^{*}}^{\mathrm{CUSUM}_{*}}$ be defined as in Corollary A.1. Let $\hat{\tau}_{1},...,\hat{\tau}_{\hat{\nu}}$ be the output of Algorithm 1 with input $x_{1},...,x_{n^{*}}$ , $\psi=h_{\lambda^{*}}^{\mathrm{CUSUM}_{*}}$ and $\gamma=\lfloor n/2\rfloor/n$ . Then we have
$$
\mathbb{P}\biggl\{\hat{\nu}=\nu\text{ and }|\tau_{r}-\hat{\tau}_{r}|\leq\frac{2B^{2}}{|\mu^{(r)}-\mu^{(r-1)}|^{2}}\text{ for all }r\in[\nu]\biggr\}\geq 1-2n^{*}\lfloor\log_{2}(n)\rfloor e^{-nB^{2}/24}.
$$*
*Proof.*
For simplicity of presentation, we focus on the case where $n$ is a multiple of 4, so $\gamma=1/2$ . Define
$$
I_{0}:=\{i:\mu_{i+n-1}=\mu_{i}\},
$$
By Lemma A.2 and a union bound, the event
$$
\Omega=\bigl\{h_{\lambda^{*}}^{\mathrm{CUSUM}_{*}}(\boldsymbol{X}^{*}_{[i,i+n)})=k\text{ for all }i\in I_{k},\ k=0,1\bigr\}
$$
has probability at least $1-2n^{*}\lfloor\log_{2}(n)\rfloor e^{-nB^{2}/24}$. We work on the event $\Omega$ henceforth. Denote $\Delta_{r}:=2B^{2}/|\mu^{(r)}-\mu^{(r-1)}|^{2}$. Since $|\mu^{(r)}-\mu^{(r-1)}|>2\sqrt{2}B$, we have $\Delta_{r}<n/4$. Note that for each $r\in[\nu]$, we have $\{i:\tau_{r-1}<i\leq\tau_{r}-n\text{ or }\tau_{r}<i\leq\tau_{r+1}-n\}\subseteq I_{0}$ and $\{i:\tau_{r}-n+\Delta_{r}<i\leq\tau_{r}-\Delta_{r}\}\subseteq I_{1}$. Consequently, $\bar{L}_{i}$ defined in Algorithm 1 is below the threshold $\gamma=1/2$ for all $i\in(\tau_{r-1}+n/2,\tau_{r}-n/2]\cup(\tau_{r}+n/2,\tau_{r+1}-n/2]$, monotonically increases for $i\in(\tau_{r}-n/2,\tau_{r}-\Delta_{r}]$, monotonically decreases for $i\in(\tau_{r}+\Delta_{r},\tau_{r}+n/2]$, and is above the threshold $\gamma$ for $i\in(\tau_{r}-\Delta_{r},\tau_{r}+\Delta_{r}]$. Thus, exactly one change-point, say $\hat{\tau}_{r}$, will be identified on $(\tau_{r-1}+n/2,\tau_{r+1}-n/2]$, and $\hat{\tau}_{r}=\operatorname*{arg\,max}_{i\in(\tau_{r-1}+n/2,\tau_{r+1}-n/2]}\bar{L}_{i}\in(\tau_{r}-\Delta_{r},\tau_{r}+\Delta_{r}]$ as desired. Since the above holds for all $r\in[\nu]$, the proof is complete. ∎
Assuming that $\log(n^{*})\asymp\log(n)$ and choosing $B$ to be of order $\sqrt{\log n}$, the above theorem shows that using the CUSUM-based change-point classifier $\psi=h_{\lambda^{*}}^{\mathrm{CUSUM}_{*}}$ in conjunction with Algorithm 1 allows for consistent estimation of both the number and the locations of multiple change-points in the data stream. In fact, the rate of estimating each change-point, $2B^{2}/|\mu^{(r)}-\mu^{(r-1)}|^{2}$, is minimax optimal up to logarithmic factors (see, e.g. Verzelen et al., 2020, Proposition 6). An inspection of the proof of Theorem A.6 reveals that the same result would hold for any $\psi$ for which the event $\Omega$ holds with high probability. In view of the representability of $h_{\lambda^{*}}^{\mathrm{CUSUM}_{*}}$ in the class of neural networks, one would intuitively expect that a theoretical guarantee similar to that of Theorem A.6 would be available for the empirical risk minimiser in the corresponding neural network function class. However, the particular way in which we handle the generalisation error in the proof of Theorem 4.3 makes it difficult to proceed in this way, because the distributions of the data segments obtained via sliding windows have complex dependence and no longer follow the common prior distribution $\pi_{0}$ used in Theorem 4.2.
Appendix B Simulation and Result
B.1 Simulation for Multiple Change-types
In this section, we describe the numerical study for a single change-point with multiple change-types: change in mean, change in slope and change in variance. The data set with change/no-change in mean is generated from $P(n,\tau,\mu_{\mathrm{L}},\mu_{\mathrm{R}})$. We employ the model of change in slope from Fearnhead et al. (2019), namely
$$
x_{t}=f_{t}+\xi_{t}=\begin{cases}\phi_{0}+\phi_{1}t+\xi_{t}&\text{if }1\leq t\leq\tau,\\ \phi_{0}+(\phi_{1}-\phi_{2})\tau+\phi_{2}t+\xi_{t}&\text{if }\tau+1\leq t\leq n,\end{cases}
$$
where $\phi_{0},\phi_{1}$ and $\phi_{2}$ are parameters chosen so that the two linear pieces are continuous at time $t=\tau$. We use the following model to generate the data set with change in variance:
$$
y_{t}=\begin{cases}\mu+\varepsilon_{t},\quad\varepsilon_{t}\sim N(0,\sigma_{1}^{2}),&\text{ if }t\leq\tau,\\ \mu+\varepsilon_{t},\quad\varepsilon_{t}\sim N(0,\sigma_{2}^{2}),&\text{ otherwise,}\end{cases}
$$
where $\sigma_{1}^{2},\sigma_{2}^{2}$ are the variances of the two Gaussian distributions and $\tau$ is the change-point in variance. When $\sigma_{1}^{2}=\sigma_{2}^{2}$, there is no change in the model. The labels of no change-point, change in mean only, change in variance only, no-change in a non-zero slope and change in slope only are 0, 1, 2, 3, 4 respectively. For each label, we randomly generate $N_{sub}$ time series. In each replication, we update the parameters $\tau,\mu_{\mathrm{L}},\mu_{\mathrm{R}},\sigma_{1},\sigma_{2},\phi_{1},\phi_{2}$. To avoid the boundary effect, we randomly choose $\tau$ from the discrete uniform distribution $U(n^{\prime}+1,n-n^{\prime})$ in each replication, where $1\leq n^{\prime}<\lfloor n/2\rfloor$, $n^{\prime}\in\mathbb{N}$. The other parameters are generated as follows:
- $\mu_{\mathrm{L}},\mu_{\mathrm{R}}\sim U(\mu_{l},\mu_{u})$ and $\mu_{dl}≤\left|\mu_{\mathrm{L}}-\mu_{\mathrm{R}}\right|≤\mu_{du}$ , where $\mu_{l},\mu_{u}$ are the lower and upper bounds of $\mu_{\mathrm{L}},\mu_{\mathrm{R}}$ . $\mu_{dl},\mu_{du}$ are the lower and upper bounds of $\left|\mu_{\mathrm{L}}-\mu_{\mathrm{R}}\right|$ .
- $\sigma_{1},\sigma_{2}\sim U(\sigma_{l},\sigma_{u})$ and $\sigma_{dl}≤\left|\sigma_{1}-\sigma_{2}\right|≤\sigma_{du}$ , where $\sigma_{l},\sigma_{u}$ are the lower and upper bounds of $\sigma_{1},\sigma_{2}$ . $\sigma_{dl},\sigma_{du}$ are the lower and upper bounds of $\left|\sigma_{1}-\sigma_{2}\right|$ .
- $\phi_{1},\phi_{2}\sim U(\phi_{l},\phi_{u})$ and $\phi_{dl}≤\left|\phi_{1}-\phi_{2}\right|≤\phi_{du}$ , where $\phi_{l},\phi_{u}$ are the lower and upper bounds of $\phi_{1},\phi_{2}$ . $\phi_{dl},\phi_{du}$ are the lower and upper bounds of $\left|\phi_{1}-\phi_{2}\right|$ .
In addition, we let $\mu=0$, $\phi_{0}=0$, and the noise follows a normal distribution with mean 0. For flexibility, we let the noise variances for the change-in-mean and change-in-slope models be $0.49$ and $0.25$ respectively. Both Scenarios 1 and 2 defined below use the neural network architecture displayed in Figure 9.
**Benchmark.** Aminikhanghahi and Cook (2017) reviewed methodologies for detecting change-points of different types. For simplicity, we employ the Narrowest-Over-Threshold (NOT) (Baranowski et al., 2019) and single variance change-point detection (Chen and Gupta, 2012) algorithms to detect the changes in mean, slope and variance respectively. These two algorithms are available in the R packages not and changepoint. The oracle likelihood-based test $\text{LR}^{\mathrm{oracle}}$ means that we pre-specify whether we are testing for change in mean, variance or slope. For the construction of the adaptive likelihood-ratio-based test $\text{LR}^{\mathrm{adapt}}$, we first separately apply the three detection algorithms for change in mean, variance and slope to each time series; we then compute three values of the Bayesian information criterion (BIC), one for each change-type, based on the results of the change-point detection. Lastly, the label corresponding to the minimum BIC value is treated as the predicted label.
**Scenario 1: Weak SNR.** Let $n=400$, $N_{sub}=2000$ and $n^{\prime}=40$. The data are generated with the parameter settings in Table 2. We use the model architecture in Figure 9 to train the classifier. The learning rate is 0.001, the batch size is 64, the filter size in the convolution layer is 16, the kernel size is $(3,30)$ and the number of epochs is 500. The transformations are ($x,x^{2}$). We also use the inverse time decay technique to dynamically reduce the learning rate. The result, displayed in Table 1 of the main text, shows that the test accuracies of $\text{LR}^{\mathrm{oracle}}$, $\text{LR}^{\mathrm{adapt}}$ and NN based on 2500 test data sets are 0.9056, 0.8796 and 0.8660 respectively.
Table 2: The parameters for weak and strong signal-to-noise ratio (SNR).
| Change in mean | $\mu_{l}$ | $\mu_{u}$ | $\mu_{dl}$ | $\mu_{du}$ |
| --- | --- | --- | --- | --- |
| Weak SNR | -5 | 5 | 0.25 | 0.5 |
| Strong SNR | -5 | 5 | 0.6 | 1.2 |

| Change in variance | $\sigma_{l}$ | $\sigma_{u}$ | $\sigma_{dl}$ | $\sigma_{du}$ |
| --- | --- | --- | --- | --- |
| Weak SNR | 0.3 | 0.7 | 0.12 | 0.24 |
| Strong SNR | 0.3 | 0.7 | 0.2 | 0.4 |

| Change in slope | $\phi_{l}$ | $\phi_{u}$ | $\phi_{dl}$ | $\phi_{du}$ |
| --- | --- | --- | --- | --- |
| Weak SNR | -0.025 | 0.025 | 0.006 | 0.012 |
| Strong SNR | -0.025 | 0.025 | 0.015 | 0.03 |
**Scenario 2: Strong SNR.** The parameters for generating the strong-signal data are listed in Table 2. The other hyperparameters are the same as in Scenario 1. The test accuracies of $\text{LR}^{\mathrm{oracle}}$, $\text{LR}^{\mathrm{adapt}}$ and NN based on 2500 test data sets are 0.9924, 0.9260 and 0.9672 respectively. We can see that the neural network-based approach achieves higher classification accuracy than the adaptive likelihood-based method.
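The following is a minimal sketch of the data-generating mechanism for the weak-SNR setting described above (one series per call). The rejection-sampling helper and function names are our own, and the noise standard deviations 0.7 and 0.5 correspond to the variances 0.49 and 0.25 stated earlier.

```python
import numpy as np

rng = np.random.default_rng(2023)

def sample_pair(low, high, d_low, d_high):
    """Draw two values uniformly on [low, high] whose absolute difference
    lies in [d_low, d_high] (simple rejection sampling)."""
    while True:
        a, b = rng.uniform(low, high, size=2)
        if d_low <= abs(a - b) <= d_high:
            return a, b

def generate_series(label, n=400, n_prime=40):
    """label: 0 no change, 1 change in mean, 2 change in variance,
    3 non-zero slope without change, 4 change in slope (weak SNR, Table 2)."""
    tau = rng.integers(n_prime + 1, n - n_prime + 1)  # avoid boundary effects
    t = np.arange(1, n + 1)
    if label in (0, 1):
        mu_L, mu_R = sample_pair(-5, 5, 0.25, 0.5)
        mean = np.where(t <= tau, mu_L, mu_R) if label == 1 else np.full(n, mu_L)
        return mean + 0.7 * rng.standard_normal(n)
    if label == 2:
        s1, s2 = sample_pair(0.3, 0.7, 0.12, 0.24)
        return np.where(t <= tau, s1, s2) * rng.standard_normal(n)
    phi1, phi2 = sample_pair(-0.025, 0.025, 0.006, 0.012)
    signal = phi1 * t if label == 3 else np.where(t <= tau, phi1 * t, (phi1 - phi2) * tau + phi2 * t)
    return signal + 0.5 * rng.standard_normal(n)

# N_sub = 2000 series per label, i.e. N = 10000 training instances in total
training_data = [(generate_series(lab), lab) for lab in range(5) for _ in range(2000)]
```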
B.2 Some Additional Simulations
B.2.1 Simulation for simultaneous changes
In this simulation, we compare the classification accuracies of a likelihood-based classifier and an NN-based classifier in the presence of simultaneous changes. For simplicity, we only focus on two classes: no change-point (Class 1) and change in mean and variance at the same change-point (Class 2). The change-point location $\tau$ is randomly drawn from $\mathrm{Unif}\{40,...,n-41\}$, where $n=400$ is the length of the time series. Given $\tau$, to generate the data of Class 2, we use the parameter settings of change in mean and change in variance in Table 2 to randomly draw $\mu_{\mathrm{L}},\mu_{\mathrm{R}}$ and $\sigma_{1},\sigma_{2}$ respectively. The data before and after the change-point $\tau$ are generated from $N(\mu_{\mathrm{L}},\sigma_{1}^{2})$ and $N(\mu_{\mathrm{R}},\sigma_{2}^{2})$ respectively. To generate the data of Class 1, we simply draw the data from $N(\mu_{\mathrm{L}},\sigma_{1}^{2})$. We repeat the data generation for each of Classes 1 and 2 $2500$ times to form the training dataset. The test dataset is generated by the same procedure as the training dataset, but with a test size of 15000. We use two classifiers, a likelihood-ratio (LR) based classifier (Chen and Gupta, 2012, p.59) and the 21-residual-block neural network (NN) classifier displayed in Figure 9, to evaluate the classification accuracy of simultaneous change vs. no change. The results are displayed in Table 3. We can see that under weak SNR the NN performs better than the LR-based method, while under strong SNR it performs as well as the LR-based method.
Table 3: Test classification accuracy of the likelihood-ratio (LR) based classifier (Chen and Gupta, 2012, p.59) and our residual neural network (NN) based classifier with 21 residual blocks for setups with weak and strong signal-to-noise ratios (SNR). Data are generated as a mixture of no change-point (Class 1) and change in mean and variance at the same change-point (Class 2). We report the true positive rate of each class and the accuracy in the last row. The optimal threshold value of LR is chosen by the grid search method on the training dataset.
| | Weak SNR (LR) | Weak SNR (NN) | Strong SNR (LR) | Strong SNR (NN) |
| --- | --- | --- | --- | --- |
| Class 1 | 0.9823 | 0.9668 | 1.0000 | 0.9991 |
| Class 2 | 0.8759 | 0.9621 | 0.9995 | 0.9992 |
| Accuracy | 0.9291 | 0.9645 | 0.9997 | 0.9991 |
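A minimal sketch of this data-generating mechanism, under the weak-SNR parameter ranges of Table 2, is given below; the exact rule for drawing the post-change parameters is not fully specified in the text, so the symmetric random-sign jump used here is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_jump(lo, hi, d_lo, d_hi, rng):
    """Draw a pre-change parameter in [lo, hi] and a post-change parameter whose
    absolute difference from it lies in [d_lo, d_hi], with a random sign."""
    pre = rng.uniform(lo, hi)
    delta = rng.uniform(d_lo, d_hi) * rng.choice([-1.0, 1.0])
    return pre, pre + delta

def simulate(label, rng, n=400):
    """label 0: no change; label 1: simultaneous change in mean and variance at tau."""
    tau = int(rng.integers(40, n - 40))                    # Unif{40, ..., n-41}
    mu_l, mu_r = draw_jump(-5.0, 5.0, 0.25, 0.5, rng)      # weak-SNR mean parameters (Table 2)
    s1, s2 = draw_jump(0.3, 0.7, 0.12, 0.24, rng)          # weak-SNR standard deviations (Table 2)
    if label == 0:
        return rng.normal(mu_l, s1, size=n)
    x = np.empty(n)
    x[:tau] = rng.normal(mu_l, s1, size=tau)
    x[tau:] = rng.normal(mu_r, s2, size=n - tau)
    return x

# 2500 training series per class, as in the experiment.
X_train = np.stack([simulate(lab, rng) for lab in (0, 1) for _ in range(2500)])
y_train = np.repeat([0, 1], 2500)
```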
B.2.2 Simulation for heavy-tailed noise
In this simulation, we compare the performance of the Wilcoxon change-point test (Dehling et al., 2015), CUSUM, the simple neural network $\mathcal{H}_{L,\boldsymbol{m}}$ and its truncated version under heavy-tailed noise. Consider the model $X_{i}=\mu_{i}+\xi_{i}$, $i\geq 1$, where $(\mu_{i})_{i\geq 1}$ is the signal and $(\xi_{i})_{i\geq 1}$ is a noise process. We test the null hypothesis
$$
\mathbb{H}:\mu_{1}=\mu_{2}=\cdots=\mu_{n}
$$
against the alternative
$$
\mathbb{A}:\ \text{There exists } 1\leq k\leq n-1 \text{ such that } \mu_{1}=\cdots=\mu_{k}\neq\mu_{k+1}=\cdots=\mu_{n}.
$$
Dehling et al. (2015) proposed the Wilcoxon-type cumulative sum statistic
$$
T_{n}\coloneqq\max_{1\leq k<n}\left\lvert\frac{2\sqrt{k(n-k)}}{n}\,\frac{1}{n^{3/2}}\sum_{i=1}^{k}\sum_{j=k+1}^{n}\left(\mathbf{1}_{\{X_{i}<X_{j}\}}-\frac{1}{2}\right)\right\rvert \tag{7}
$$
to detect a change-point in time series with outliers or heavy tails. Under the null hypothesis $\mathbb{H}$, the limiting distribution of $T_{n}$ can be approximated by the supremum of a standard Brownian bridge process $(W^{(0)}(\lambda))_{0\leq\lambda\leq 1}$ up to a scaling factor (Dehling et al., 2015, Theorem 3.1). (The definition of $T_{n}$ in Dehling et al. (2015, Theorem 3.1) does not include the factor $2\sqrt{k(n-k)}/n$; however, the R package robts (Dürre et al., 2016) normalises the Wilcoxon test by this factor, see the function wilcoxsuk. In this simulation, we adopt the definition in (7).) In our simulation, we choose the optimal threshold value based on the training dataset via a grid search. The truncated simple neural network truncates the data by the $z$-score in the data preprocessing step: given a vector $\boldsymbol{x}=(x_{1},x_{2},\ldots,x_{n})^{\top}$, any entry with ${\lvert x_{i}-\bar{x}\rvert}>Z\sigma_{x}$ is replaced by $\bar{x}+\mathrm{sgn}(x_{i}-\bar{x})Z\sigma_{x}$, where $\bar{x}$ and $\sigma_{x}$ are the mean and standard deviation of $\boldsymbol{x}$. The training dataset is generated using the same parameter settings as Figure 2 (d) of the main text. The misclassification error rate (MER) of each method is reported in Figure 5. The truncated simple neural network has the best performance and, as expected, the Wilcoxon-based test outperforms the untruncated simple neural network based tests. However, we emphasise that the main focus of Figure 2 of the main text is to demonstrate that simple neural networks can replicate the performance of CUSUM tests. Even when prior information about heavy-tailed noise is available, we still encourage the practitioner to use the simple neural network with $z$-score truncation added in the data preprocessing step.
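To make these two ingredients concrete, the following is a minimal NumPy sketch of the Wilcoxon-type statistic in (7) and of the $z$-score truncation just described; it is a direct transcription of the formulas, not an optimised implementation.

```python
import numpy as np

def wilcoxon_cusum_stat(x):
    """Wilcoxon-type cumulative sum statistic T_n of (7); O(n^3) brute force, fine for n = 100."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    best = 0.0
    for k in range(1, n):
        # sum over i <= k < j of (1{X_i < X_j} - 1/2)
        s = np.sum(x[:k, None] < x[None, k:]) - 0.5 * k * (n - k)
        best = max(best, abs(2.0 * np.sqrt(k * (n - k)) / n * s / n ** 1.5))
    return best

def zscore_truncate(x, Z=3.0):
    """Replace entries with |x_i - mean| > Z * std by mean + sign(x_i - mean) * Z * std."""
    x = np.asarray(x, dtype=float)
    m, s = x.mean(), x.std()
    return np.clip(x, m - Z * s, m + Z * s)

# Example on Cauchy noise: truncation pulls extreme outliers back into the +/- 3 sigma band.
rng = np.random.default_rng(1)
x = rng.standard_cauchy(100)
print(wilcoxon_cusum_stat(x), wilcoxon_cusum_stat(zscore_truncate(x)))
```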
Figure 5: Scenario S3 with Cauchy noise, with the Wilcoxon-type change-point detection method (Dehling et al., 2015) and the simple neural network with truncation in the data preprocessing added to the comparison. The average misclassification error rate (MER), computed on a test set of size $N_{\mathrm{test}}=15000$, is plotted against the training sample size $N$ for detecting the existence of a change-point in data series of length $n=100$. We compare the performance of the CUSUM test, the Wilcoxon test, $\mathcal{H}_{1,m^{(2)}}$, and $\mathcal{H}_{1,m^{(2)}}$ with $Z=3$, where $m^{(2)}=2n-2$ and $Z=3$ denotes the $z$-score truncation: given a vector $\boldsymbol{x}=(x_{1},x_{2},\ldots,x_{n})^{\top}$, any entry with ${\lvert x_{i}-\bar{x}\rvert}>Z\sigma_{x}$ is replaced by $\bar{x}+\mathrm{sgn}(x_{i}-\bar{x})Z\sigma_{x}$, where $\bar{x}$ and $\sigma_{x}$ are the mean and standard deviation of $\boldsymbol{x}$.
B.2.3 Robustness Study
This simulation is an extension of the numerical study of Section 5 in the main text. We trained our neural network using training data generated under scenario S1 with $\rho_{t}=0$ (i.e. corresponding to Figure 2 (a) of the main text), but generated the test data under the settings corresponding to Figure 2 (a, b, c, d). In other words, apart from the top-left panel, in the remaining panels of Figure 6 the trained network is misspecified for the test data. We see that the neural networks continue to work well in all panels, and in fact have performance similar to those in Figure 2 (b, c, d) of the main text. This indicates that the trained neural network has likely learned features related to the change-point rather than any distribution-specific artefacts.
(a) Trained S1 ($\rho_{t}=0$) $→$ S1 ($\rho_{t}=0$) (b) Trained S1 ($\rho_{t}=0$) $→$ S1${}^{\prime}$ ($\rho_{t}=0.7$)
(c) Trained S1 ( $\rho_{t}=0$ ) $→$ S2 (d) Trained S1 ( $\rho_{t}=0$ ) $→$ S3
Figure 6: Plot of the test set MER, computed on a test set of size $N_{\mathrm{test}}=30000$, against training sample size $N$ for detecting the existence of a change-point in data series of length $n=100$. We compare the performance of the CUSUM test and neural networks from four function classes, $\mathcal{H}_{1,m^{(1)}}$, $\mathcal{H}_{1,m^{(2)}}$, $\mathcal{H}_{5,m^{(1)}\mathbf{1}_{5}}$ and $\mathcal{H}_{10,m^{(1)}\mathbf{1}_{10}}$, where $m^{(1)}=4\lfloor\log_{2}(n)\rfloor$ and $m^{(2)}=2n-2$, under scenarios S1, S1${}^{\prime}$, S2 and S3 described in Section 5. The subcaption “A $→$ B” means that we apply the classifier trained under setting “A” to test data generated under setting “B”.
B.2.4 Simulation for change in autocorrelation
In this simulation, we discuss how we can use neural networks to recreate test statistics for various types of changes. For instance, if the data follows an AR(1) structure, then changes in autocorrelation can be handled by including transformations of the original input of the form $(x_{t}x_{t+1})_{t=1,...,n-1}$ . On the other hand, even if such transformations are not supplied as the input, a deep neural network of suitable depth is able to approximate these transformations and consequently successfully detect the change (Schmidt-Hieber, 2020, Lemma A.2). This is illustrated in Figure 7, where we compare the performance of neural network based classifiers of various depths constructed with and without using the transformed data as inputs.
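As an illustration, the following is a minimal sketch of the data-generating mechanism of Figure 7 and of the extra input row formed from the lag-1 cross-products $(x_{t}x_{t+1})_{t=1,\ldots,n-1}$; the zero-padding of the product row to length $n$ and the function names are illustrative choices rather than the exact implementation used in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100

def ar1_with_change(n, tau, a_pre=0.2, a_post=0.8, sigma=0.25, rng=rng):
    """AR(1) series x_t = alpha_t x_{t-1} + eps_t with alpha switching from a_pre to a_post at tau."""
    x = np.zeros(n)
    for t in range(1, n):
        alpha = a_pre if t < tau else a_post
        x[t] = alpha * x[t - 1] + rng.normal(0.0, sigma)
    return x

def with_lag_products(x):
    """Stack the raw series and its lag-1 cross-products (zero-padded to length n) as two rows."""
    prod = np.append(x[:-1] * x[1:], 0.0)   # (x_t x_{t+1}) for t = 1, ..., n-1
    return np.stack([x, prod])              # shape (2, n): the inputs of Figure 7(b)

tau = int(rng.integers(10, 90))             # change-point from Unif{10, ..., 89}
features = with_lag_products(ar1_with_change(n, tau))
```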
(a) Original Input (b) Original and $x_{t}x_{t+1}$ Input
Figure 7: Plot of the test set MER, computed on a test set of size $N_{\mathrm{test}}=30000$, against training sample size $N$ for detecting the existence of a change-point in data series of length $n=100$. We compare the performance of neural networks from three function classes, $\mathcal{H}_{1,m^{(1)}}$, $\mathcal{H}_{1,m^{(2)}}$ and $\mathcal{H}_{5,m^{(1)}\mathbf{1}_{5}}$, where $m^{(1)}=4\lfloor\log_{2}(n)\rfloor$ and $m^{(2)}=2n-2$, as well as a neural network with 21 residual blocks. The change-points are drawn from $\mathrm{Unif}\{10,\ldots,89\}$. Given a change-point $\tau$, data are generated from the autoregressive model $x_{t}=\alpha_{t}x_{t-1}+\epsilon_{t}$ with $\epsilon_{t}\stackrel{{\scriptstyle\mathrm{iid}}}{{\sim}}N(0,0.25^{2})$ and $\alpha_{t}=0.2\mathbf{1}_{\{t<\tau\}}+0.8\mathbf{1}_{\{t\geq\tau\}}$.
B.2.5 Simulation on change-point location estimation
Here, we describe simulation results on the performance of the change-point location estimator constructed by combining a simple neural network-based classifier with Algorithm 1 from the main text. Given a sequence of length $n^{\prime}=2000$, we draw $\tau\sim\text{Unif}\{750,\ldots,1250\}$. We set $\mu_{\mathrm{L}}=0$ and draw $\mu_{\mathrm{R}}\mid\tau$ from two different uniform distributions: $\text{Unif}([-1.5b,-0.5b]\cup[0.5b,1.5b])$ (Weak) and $\text{Unif}([-3b,-b]\cup[b,3b])$ (Strong), where $b\coloneqq\sqrt{\frac{8n^{\prime}\log(20n^{\prime})}{\tau(n^{\prime}-\tau)}}$ is chosen in line with Lemma 4.1 to ensure a good range of signal-to-noise ratios. We then generate $\boldsymbol{x}=(\mu_{\mathrm{L}}\mathbbm{1}_{\{t\leq\tau\}}+\mu_{\mathrm{R}}\mathbbm{1}_{\{t>\tau\}}+\varepsilon_{t})_{t\in[n^{\prime}]}$, with the noise $\boldsymbol{\varepsilon}=(\varepsilon_{t})_{t\in[n^{\prime}]}\sim N_{n^{\prime}}(0,I_{n^{\prime}})$. We then draw independent copies $\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{N^{\prime}}$ of $\boldsymbol{x}$. For each $\boldsymbol{x}_{k}$, we randomly choose 60 segments of length $n\in\{300,400,500,600\}$; the segments that include $\tau_{k}$ are labelled ‘1’ and the others are labelled ‘0’. The training dataset size is $N=60N^{\prime}$, where $N^{\prime}=500$. We then draw another $N_{\text{test}}=3000$ independent copies of $\boldsymbol{x}$ as our test data for change-point location estimation. We study the performance of the change-point location estimator produced by Algorithm 1 together with a single-layer neural network, and compare it with the performance of the CUSUM, MOSUM and Wilcoxon statistics-based estimators. As we can see from Figure 8, under Gaussian models where CUSUM is known to work well, our simple neural network-based procedure is competitive. On the other hand, when the noise is heavy-tailed, our simple neural network-based estimator greatly outperforms the CUSUM-based estimator.
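The following is a minimal sketch of the construction of labelled training segments described above, under Gaussian noise; the jump size $\mu_{R}$ is fixed to an illustrative value rather than drawn from the weak/strong SNR distributions, and the 0-based labelling convention is an assumption.

```python
import numpy as np

rng = np.random.default_rng(3)
n_prime = 2000

def sample_segments(x, tau, n_segments=60, lengths=(300, 400, 500, 600), rng=rng):
    """Draw random sub-segments of x and label each by whether it covers the change at tau."""
    segments, labels = [], []
    for _ in range(n_segments):
        n = int(rng.choice(lengths))
        start = int(rng.integers(0, len(x) - n + 1))
        segments.append(x[start:start + n])
        # 1 if the jump (between indices tau - 1 and tau, 0-based) lies inside the segment
        labels.append(int(start < tau < start + n))
    return segments, labels

# One training series with a single change in mean at tau; the jump size here is illustrative.
tau = int(rng.integers(750, 1251))                     # Unif{750, ..., 1250}
mu_r = 0.5
x = np.concatenate([rng.normal(0.0, 1.0, tau), rng.normal(mu_r, 1.0, n_prime - tau)])
segments, labels = sample_segments(x, tau)
```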
(a) S1 with $\rho_{t}=0$ , weak SNR (b) S1 with $\rho_{t}=0$ , strong SNR
(c) S3, weak SNR (d) S3, strong SNR
Figure 8: Plot of the root mean square error (RMSE) of change-point location estimation (S1 with $\rho_{t}=0$ and S3), computed on a test set of size $N_{\text{test}}=3000$, against the segment length (bandwidth) $n$ for data series of length $n^{\prime}=2000$. We compare change-point estimation by CUSUM, MOSUM, Algorithm 1 and Wilcoxon (S3 only). The RMSE is defined as $\sqrt{N^{-1}\sum_{i=1}^{N}(\hat{\tau}_{i}-\tau_{i})^{2}}$, where $\hat{\tau}_{i}$ is the estimated change-point for the $i$-th test series and $\tau_{i}$ is the true change-point. The weak and strong signal-to-noise ratios (SNR) correspond to $\mu_{R}\mid\tau\sim\text{Unif}([-1.5b,-0.5b]\cup[0.5b,1.5b])$ and $\mu_{R}\mid\tau\sim\text{Unif}([-3b,-b]\cup[b,3b])$ respectively.
Appendix C Real Data Analysis
The HASC (Human Activity Sensing Consortium) project aims to understand human activities based on sensor data. The data include 6 human activities: “stay”, “walk”, “jog”, “skip”, “stair up” and “stair down”. Each activity lasts at least 10 seconds, and the sampling frequency is 100 Hz.
C.1 Data Cleaning
The HASC data are sequential data with multiple change-types and multiple change-points, see Figure 3 in the main text. Hence, we cannot feed them directly into our deep convolutional residual neural network: the training data fed into the network must have a fixed length $n$ and contain either one change-point or no change-point in each time series. Next, we describe how to obtain such training data from the HASC sequential data. In general, let $\boldsymbol{x}=(x_{1},x_{2},\ldots,x_{d})^{\top}$, $d\geq 1$, be a $d$-channel vector. Define $\boldsymbol{X}\coloneqq(\boldsymbol{x}_{t_{1}},\boldsymbol{x}_{t_{2}},\ldots,\boldsymbol{x}_{t_{n^{*}}})$ as a realization of a $d$-variate time series, where $\boldsymbol{x}_{t_{j}}$, $j=1,2,\ldots,n^{*}$, are the observations of $\boldsymbol{x}$ at $n^{*}$ consecutive time stamps $t_{1},t_{2},\ldots,t_{n^{*}}$. Let $\boldsymbol{X}_{i}$, $i=1,2,\ldots,N^{*}$, represent the observation from the $i$-th subject, and let $\boldsymbol{\tau}_{i}\coloneqq(\tau_{i,1},\tau_{i,2},\ldots,\tau_{i,K})^{\top}$, $K\in\mathbb{Z}^{+}$, $\tau_{i,k}\in[2,n^{*}-1]$, $1\leq k\leq K$, with the convention $\tau_{i,0}=0$ and $\tau_{i,K+1}=n^{*}$, denote the change-points of the $i$-th observation, which are labelled in the sequential data sets. Furthermore, define $n\coloneqq\min_{i\in[N^{*}]}\min_{k\in[K+1]}(\tau_{i,k}-\tau_{i,k-1})$. In practice, we require that $n$ is not too small; this can be achieved by controlling the sampling frequency in the experiment, as in the HASC data. We randomly choose $q$ sub-segments of length $n$ from $\boldsymbol{X}_{i}$, as indicated by the grey dashed rectangles in Figure 3 of the main text. By the definition of $n$, there is at most one change-point in each sub-segment. Meanwhile, we assign a label to each sub-segment according to the type and existence of a change-point. After that, we stack all the sub-segments to form a tensor $\mathcal{X}$ with dimensions $(N^{*}q,d,n)$; the label vector is denoted by $\mathcal{Y}$ and has length $N^{*}q$. To guarantee that there is at most one change-point in each segment, we set the segment length to $n=700$. We take $q=15$: since the change-points are labelled, it is easy to draw 15 segments without any change-point, i.e. segments with labels “stay”, “walk”, “jog”, “skip”, “stair up” and “stair down”, and we then randomly draw 15 segments (the red rectangles in Figure 3 of the main text) for each transition point. A sub-segment extraction routine of this form is sketched below.
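The array layout, the placeholder label strings for transition segments and the helper name extract_segments below are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def extract_segments(X, change_points, activity_labels, n=700, q=15, rng=None):
    """X: array of shape (d, n_star); change_points: sorted change locations (0-based);
    activity_labels: one label per activity block. Returns stacked segments of shape
    (num_segments, d, n) and a list of class labels."""
    rng = rng or np.random.default_rng()
    d, n_star = X.shape
    segments, labels = [], []
    bounds = [0] + list(change_points) + [n_star]
    # q segments lying entirely within a single activity block (no change-point);
    # each block has length at least n by the definition of n.
    for _ in range(q):
        k = int(rng.integers(len(bounds) - 1))
        lo, hi = bounds[k], bounds[k + 1]
        start = int(rng.integers(lo, hi - n + 1))
        segments.append(X[:, start:start + n])
        labels.append(activity_labels[k])                 # e.g. "walk", "jog", ...
    # q segments straddling each transition point; valid since every block is >= n long.
    for j, tau in enumerate(change_points):
        for _ in range(q):
            start = int(rng.integers(tau - n + 1, tau))
            segments.append(X[:, start:start + n])
            labels.append(f"transition_{j}")              # placeholder change-type label
    return np.stack(segments), labels
```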
C.2 Transformation
Section 3 in the main text suggests that changes in the mean/signal may be captured by feeding in the raw data directly. For other types of change, we recommend applying appropriate transformations before training the model, depending on the change-type of interest. For instance, if we are interested in changes in the second-order structure, we suggest the square transformation; for a change in autocorrelation of order $p$, we could input the cross-products of the data up to lag $p$. When multiple change-types are of interest, we allow several transformations to be applied in the data preprocessing step, and the combination of the raw data and the transformed data is treated as the training data. We employ the square transformation here. All the segments are mapped onto the scale $[-1,1]$ after the transformation; a sketch of this preprocessing appears below. The frequencies of the training labels are listed in Figure 11. Finally, the shapes of the training and test data sets are $(4875,6,700)$ and $(1035,6,700)$ respectively.
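A minimal sketch of this preprocessing step is given below; the paper does not spell out the exact rescaling used to map segments onto $[-1,1]$, so the per-channel min-max rescaling here is an assumption.

```python
import numpy as np

def preprocess(segment):
    """Apply the square transformation and rescale each channel onto [-1, 1].
    segment: array of shape (d, n), e.g. the 3 accelerometer channels; the output has
    shape (2 * d, n), i.e. 6 rows for the HASC data."""
    aug = np.concatenate([segment, segment ** 2], axis=0)       # raw data and its square
    lo = aug.min(axis=1, keepdims=True)
    hi = aug.max(axis=1, keepdims=True)
    return 2.0 * (aug - lo) / (hi - lo + 1e-12) - 1.0           # per-channel min-max to [-1, 1]
```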
C.3 Network Architecture
We propose a general deep convolutional residual neural network architecture to identify multiple change-types, based on the residual block technique (He et al., 2016); see Figure 9. There are two reasons why we choose the residual block as the skeleton of the network.
- The problem of vanishing gradients (Bengio et al., 1994; Glorot and Bengio, 2010). As the number of convolution layers grows, the gradients of some layers may vanish during back-propagation, which hinders convergence. Residual blocks alleviate this issue through the so-called “shortcut connection”; see the flow chart in Figure 9.
- Degradation. He et al. (2016) pointed out that when the number of convolution layers increases significantly, the accuracy may saturate and then degrade rapidly. This phenomenon was reported and verified in He and Sun (2015) and He et al. (2016).
Figure 9: Architecture of our general-purpose change-point detection neural network. The left column shows the standard layers of the neural network with input size $(d,n)$ , where $d$ may represent the number of transformations or channels; the middle column contains 21 residual blocks and one global average pooling layer; the right column includes 5 dense layers (numbers of nodes in brackets) and the output layer. More details of the neural network architecture appear in the supplement.
There are 21 residual blocks in our deep neural network, and each residual block contains 2 convolution layers. Following the suggestions in Ioffe and Szegedy (2015) and He et al. (2016), each convolution layer is followed by one Batch Normalization (BN) layer and one ReLU layer. In addition, there are 5 fully-connected layers immediately after the residual blocks, see the third column of Figure 9. For example, Dense(50) means that the dense layer has 50 nodes and is connected to a dropout layer with dropout rate 0.3. To further guard against overfitting, we also apply $L_{2}$ regularization in each fully-connected layer (Ng, 2004). As the number of labels in HASC is 28, see Figure 10, we drop the dense layers “Dense(20)” and “Dense(10)” in Figure 9, and the output layer has size $(28,1)$ . We remark on two further issues here. (a) For other problems, the number of residual blocks, the number of dense layers and the hyperparameters may vary depending on the complexity of the problem. In Section 6 of the main text, the neural network architecture for both the synthetic data and the real data has 21 residual blocks, reflecting the trade-off between time complexity and model complexity. Following He et al. (2016), one can also add more residual blocks to the architecture to improve the classification accuracy. (b) In practice, we may not have enough training data, but there are potential ways to overcome this, either by using data augmentation or by increasing $q$ . In some extreme cases where we mainly have data with no change, we can artificially add changes to such data in line with the type of change we want to detect.
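To make this description concrete, a minimal Keras-style sketch of one residual block and of the overall architecture is given below. The strides, pooling sizes, regularization constant and other unspecified details are assumptions on our part, not the exact settings used for the results in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def residual_block(x, filters=16, kernel_size=(3, 25)):
    """One residual block: two Conv2D -> BatchNorm -> ReLU stages plus a
    shortcut connection, following the description of Figure 9."""
    shortcut = x
    y = layers.Conv2D(filters, kernel_size, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, kernel_size, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.add([y, shortcut])            # shortcut connection
    return layers.ReLU()(y)

def build_network(d=6, n=700, n_labels=28, n_blocks=21):
    """Sketch of the general architecture in Figure 9 with the Dense(20)
    and Dense(10) layers dropped, as for the HASC data."""
    inputs = layers.Input(shape=(d, n, 1))
    x = layers.Conv2D(16, (3, 25), padding="same")(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.MaxPooling2D(pool_size=(1, 2))(x)
    for _ in range(n_blocks):
        x = residual_block(x)
    x = layers.GlobalAveragePooling2D()(x)
    for width in (50, 40, 30):               # dense layers with dropout and L2 penalty
        x = layers.Dense(width, activation="relu",
                         kernel_regularizer=regularizers.l2(1e-3))(x)
        x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(n_labels, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```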
C.4 Training and Detection
Figure 10: Label Dictionary
Figure 11: Label Frequency
Figure 12: The Accuracy Curves
Figure 13: Confusion Matrix of Real Test Dataset
Figure 14: Change-point Detection of Real Dataset for Person 7 (2nd sequence). The red line at 4476 is the true change-point, and the blue line to its right is the estimated change-point. The difference between them is caused by the similarity between the “Walk” and “StairUp” activities.
Figure 15: Change-point Detection of Real Dataset for Person 7 (3rd sequence). The red vertical lines represent the underlying change-points, and the blue vertical lines represent the estimated change-points.
There are observations from 7 persons in this dataset. The sequential data from the first 6 persons are treated as the training dataset, and we use the last person’s data to validate the trained classifier. Each person performs each of 6 activities: “stay”, “walk”, “jog”, “skip”, “stair up” and “stair down”, for at least 10 seconds. The transition point between two consecutive activities can be treated as a change-point, so there are 30 possible types of change-point and the total number of labels is 36 (6 activities and 30 possible transitions). However, we only found 28 different types of label in this real dataset, see Figure 10.

The initial learning rate is 0.001 and the number of epochs is 400. The batch size is 16, the dropout rate is 0.3, the filter size is 16 and the kernel size is $(3,25)$ . Furthermore, we use 20% of the training dataset to validate the classifier during the training step. Figure 12 shows the accuracy curves for training and validation; after 150 epochs, both the solid and dashed curves are close to 1. The test accuracy is 0.9623, see the confusion matrix in Figure 13. These results show that our neural network classifier performs well on both the training and test datasets.

Next, we apply the trained classifier to the 3 repeated sequential datasets of Person 7 to detect the change-points. The first sequential dataset has shape $(3,10743)$ . First, we extract the length- $n$ sliding windows with stride 1 as the input dataset; the input size becomes $(9883,6,700)$ . Second, we use Algorithm 1 to detect the change-points, where we relabel the activity labels as “no-change” and the transition labels as “one-change” respectively. Figures 14 and 15 show the results of multiple change-point detection for the other 2 sequential data sets from the 7th person.
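A sketch of the sliding-window step, assuming a Keras-style classifier and the hypothetical helper name `sliding_window_predict`, is given below. The relabelling of the 28 activity/transition classes into “no-change”/“one-change”, as required by Algorithm 1, is then a simple lookup on the returned label indices.

```python
import numpy as np

def sliding_window_predict(model, X_transformed, n=700, stride=1, batch_size=256):
    """Scan a long transformed series of shape (d, T) with length-n windows
    (stride 1) and return the window start positions together with the
    predicted label index for each window."""
    d, T = X_transformed.shape
    starts = np.arange(0, T - n + 1, stride)
    windows = np.stack([X_transformed[:, s:s + n] for s in starts])   # (W, d, n)
    probs = model.predict(windows[..., np.newaxis], batch_size=batch_size)
    return starts, probs.argmax(axis=1)
```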