# Mutual Information Surprise: Rethinking Unexpectedness in Autonomous Systems
**Authors**: Yinsong Wang, Quan Zeng, Xiao Liu, Yu Ding; H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA
## Abstract
Recent breakthroughs in autonomous experimentation have demonstrated remarkable physical capabilities, yet their cognitive control remains limited, often relying on static heuristics or classical optimization. A core limitation is the absence of a principled mechanism to detect and adapt to unexpectedness. While traditional surprise measures, such as Shannon or Bayesian Surprise, offer momentary detection of deviation, they fail to capture whether a system is truly learning and adapting. In this work, we introduce Mutual Information Surprise (MIS), a new framework that redefines surprise not as anomaly detection, but as a signal of epistemic growth. MIS quantifies the impact of new observations on mutual information, enabling autonomous systems to reflect on their learning progression. We develop a statistical test sequence to detect meaningful shifts in estimated mutual information and propose a mutual information surprise reaction policy (MISRP) that dynamically governs system behavior through sampling adjustment and process forking. Empirical evaluations, on both synthetic domains and a dynamic pollution map estimation task, show that MISRP-governed strategies significantly outperform classical surprise-based approaches in stability, responsiveness, and predictive accuracy. By shifting surprise from reactive to reflective, MIS offers a path toward more self-aware and adaptive autonomous systems.
## 1 Introduction
In July 2020, Nature published a cover story (?) about an autonomous robotic chemist, locked in a lab for a week with no external communication, independently conducting experiments to search for improved photocatalysts for hydrogen production from water. In the years that followed, Nature featured three more articles (?, ?, ?) highlighting the transformative role of autonomous systems in materials discovery, experimentation, and even manufacturing, each reporting orders-of-magnitude improvements in efficiency. These reports spotlighted the intensifying global race to advance autonomous technologies beyond the already well-established domain of self-driving cars (?, ?, ?, ?). Nature was not alone; numerous other outlets have documented the surge in autonomous research and innovation (?, ?, ?). This rapid expansion is a natural consequence of recent advances in robotics and artificial intelligence, which continue to push the boundaries of what autonomous systems can accomplish.
The systems featured in the Nature publications demonstrate highly capable bodies that can perform complex tasks. Recall that an autonomous system comprises two fundamental components: a brain and a body, colloquial terms for its control mechanism and its sensing-action capabilities, respectively. Unlike traditional automation systems, which follow predefined instructions to execute simple, repetitive tasks, true autonomy requires a higher level of cognitive capacity: an autonomous system is expected to make decisions with minimal human intervention. However, the brain function of these systems, while more sophisticated than rigid pre-programmed instructions, remains relatively limited.
Surveying the literature over the past decade, we found that (?), (?), and (?) rely on classical Bayesian optimization to guide system decisions, a technique that, although effective, does not constitute full autonomy, i.e., it does not completely eliminate human involvement. More recent works in Nature (?, ?) continue in a similar vein, adopting active learning frameworks akin to Bayesian optimization, without fundamentally enhancing the cognitive capabilities of these systems. The conceptual limitations of their decision-making mechanisms continue to impede progress toward genuine autonomy. (?) argue that a core deficiency of current autonomous systems is the absence of a "surprise" mechanism: the capacity to detect and adapt to unforeseen situations. Without this capability, true autonomy remains out of reach.
What is a "surprise," and how does it differ from existing measures governing automation? Surprise is a fundamental psychological trigger that enables humans to react to unexpected events. Intuitively, it arises when observations deviate from expectations. Traditionally, unexpectedness has been loosely equated with anomalies: quantifying inconsistencies between new observations and historical data. Common approaches to anomaly detection include statistical methods such as z-scores (?) and hypothesis testing (?, ?); distance-based techniques (?), including Euclidean (?) and Mahalanobis distances (?, ?); and machine learning-based models (?, ?), which learn patterns to identify and filter out anomalous data. However, researchers increasingly recognize that simply detecting and discarding unexpected events is insufficient for achieving higher levels of autonomy. In human cognition, unexpectedness is not inherently undesirable; in fact, surprise often signals opportunities for discovery rather than error. Although mathematically similar to anomaly measures, surprise is conceptually distinct: it is not merely a deviation to be rejected, but a valuable learning signal that can enhance adaptation and decision-making.
This shift in perspective aligns with formal definitions of surprise in information theory and computational psychology, such as Shannon surprise (?), Bayesian surprise (?), Bayes Factor surprise (?), and Confidence-Corrected surprise (?). These surprise definitions quantify unexpectedness by modeling deviations from prior beliefs or probability distributions. In the following section, we will delve deeper into these existing measures and evaluate whether they truly serve the intended role of identifying opportunities, as human surprise does, more than merely flagging anomalies. Using current surprise definitions, (?) demonstrated that treating surprising events not as noise to be removed but as catalysts for learning can significantly enhance a system's learning speed. Additional empirical evidence shows that incorporating surprise as a learning mechanism can improve autonomy in domains such as autonomous driving (?, ?, ?) and manufacturing (?, ?).
In our research, we find that existing definitions of surprise require significant improvement. Their close resemblance to anomaly detection measures suggests that they may not effectively support higher levels of autonomy. Specifically, a robust surprise measure should emphasize knowledge acquisition and adaptability, rather than treating unexpectedness merely as a deviation from the norm, an approach that current surprise definitions tend to adopt. We therefore argue that it is essential to develop a novel surprise metric that inherently fosters learning and deepens an autonomous system's understanding of the underlying processes it encounters. To capture this dynamic capability, we introduce the Mutual Information Surprise (MIS), a new framework that redefines how autonomous systems interpret and respond to unexpected events. MIS quantifies the degree of both frustration and enlightenment associated with new observations, measuring their impact on refining the system's internal understanding of its environment. We also demonstrate the differences that arise when applying mutual information surprise, as opposed to relying solely on classical surprise definitions, highlighting MIS's potential to meaningfully enhance autonomous learning and decision-making.
The paper is organized as follows. In Section 2, we revisit the concept of surprise by presenting a taxonomy of existing surprise measures and introducing the intuition, mathematical formulation, and limitations of classical definitions. In Section 3, we formally define the Mutual Information Surprise (MIS) and derive a testing sequence for detecting multiple types of system changes in autonomous systems. We also design an MIS reaction policy (MISRP) that provides high-level guidance to complement existing exploration-exploitation active learning strategies. In Section 4, we compare MIS with classical surprise measures to illustrate its numerical stability and enhanced cognitive capability. We further demonstrate the effectiveness of the MIS reaction policy through a pollution map estimation simulation. In Section 5, we conclude the paper.
## 2 Current Surprise Definitions and Their Limitations
Classical definitions of surprise, such as Shannon and Bayesian Surprise, provide elegant mathematical frameworks for quantifying unexpectedness. However, these approaches often fall short in capturing the core mechanisms driving adaptive behavior: continuous learning and flexible model updating. This section revisits and analyzes existing formulations, elaborating on their conceptual foundations and outlining both their strengths and limitations.
Before proceeding with our discussion, we introduce the notation used throughout this paper. Scalars are denoted by lowercase letters (e.g., $x$ ), vectors by bold lowercase letters (e.g., $\mathbf{x}$ ), and matrices by bold uppercase letters (e.g., $\mathbf{X}$ ). Distributions in the data space are represented by uppercase letters (e.g., $P$ ), probabilities by lowercase letters (e.g., $p$ ), and distributions in the parameter space by the symbol $\pi$ . The $L_{2}$ norm is denoted by $\|\cdot\|_{2}$ , and the absolute value or $L_{1}$ norm is denoted by $|\cdot|$ . We use $\mathbb{E}[\cdot]$ to denote the expectation operator and $\text{sgn}(\cdot)$ for the sign operator. Estimators are denoted with a hat, as in $\hat{\cdot}$ .
### The Family of Shannon Surprises
The family of Shannon Surprise metrics emphasizes the improbability of observed data, typically independent of explicit model parameters. This class broadly aligns with "observation" and "probabilistic-mismatch" surprises as categorized in (?). The central question the Shannon family of surprises tries to answer is straightforward: how unlikely is the observation?
The most widely recognized measure is Shannon Surprise (?), formally defined as:
$$
S_{\text{Shannon}}(\mathbf{x})=-\log p(\mathbf{x}), \tag{1}
$$
interpreting surprise directly through event rarity. Although conceptually clear and mathematically elegant, this definition has a significant limitation: encountering a Shannon Surprise does not inherently imply knowledge acquisition. Consider, for instance, a uniform dartboard: a stochastic yet entirely understood system. Each outcome has an equally low probability and thus appears "surprising" under Shannon's definition, yet humans neither find these outcomes genuinely surprising nor gain any additional knowledge by observing them. In other words, the focus of Shannon Surprise is statistical rarity rather than genuine knowledge gain.
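The dartboard pathology is easy to reproduce in a few lines. Below is a minimal sketch of Eq. (1); the function name is ours, not from the paper:

```python
import math

def shannon_surprise(p_x):
    """Shannon Surprise of Eq. (1): S(x) = -log p(x), in nats."""
    return -math.log(p_x)

# A uniform dartboard with 20 sectors: the system is fully understood,
# yet every outcome carries the same "high" Shannon Surprise.
probs = [1 / 20] * 20
surprises = [shannon_surprise(p) for p in probs]
assert all(abs(s - math.log(20)) < 1e-12 for s in surprises)
```

Every sector scores log 20 ≈ 3 nats of "surprise," illustrating how rarity alone says nothing about knowledge gain.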
To address this limitation, particularly in highly stochastic scenarios, Residual Information Surprise (?) has been introduced, which measures surprise by quantifying the gap between the minimally achievable and observed Shannon Surprises:
$$
S_{\text{Residual}}(\mathbf{x})=|\underset{\mathbf{x}^{\prime}}{\min}\{-\log p(\mathbf{x}^{\prime})\}-(-\log p(\mathbf{x}))|=\underset{\mathbf{x}^{\prime}}{\max}\log p(\mathbf{x}^{\prime})-\log p(\mathbf{x}).
$$
In the dartboard example, Residual Information Surprise becomes zero for all outcomes, as $p(\mathbf{x}^{\prime})$ remains constant for every $\mathbf{x}^{\prime}$ , accurately reflecting an absence of genuine surprise. However, this formulation introduces a conceptual challenge, as determining $\underset{\mathbf{x}^{\prime}}{\max}\log p(\mathbf{x}^{\prime})$ implicitly presumes an omniscient oracle, an assumption typically infeasible in practice.
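Continuing the dartboard example, Residual Information Surprise can be sketched the same way (again with an illustrative function name of our choosing, where `p_max` stands in for the oracle's best-case probability):

```python
import math

def residual_surprise(p_x, p_max):
    """Residual Information Surprise: max_x' log p(x') - log p(x)."""
    return math.log(p_max) - math.log(p_x)

# Uniform dartboard: every sector has probability 1/20, so the residual
# surprise is exactly zero everywhere -- no genuine surprise.
probs = [1 / 20] * 20
assert all(residual_surprise(p, max(probs)) == 0.0 for p in probs)

# A biased distribution: rare outcomes regain a positive residual surprise.
biased = [0.5, 0.3, 0.15, 0.05]
assert residual_surprise(0.05, max(biased)) > 0
```

The catch is visible in the signature: `p_max` must be known, which is exactly the omniscient-oracle assumption criticized above.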
Interestingly, Shannon Surprise serves as a foundation for various anomaly measures. For example, under Gaussian assumptions, Shannon Surprise becomes proportional to squared error:
$$
S_{\text{Shannon}}(\mathbf{x})\propto\|\mathbf{x}-\mu_{\mathbf{x}}\|_{2}^{2},
$$
thus linking surprise with deviation from the mean. Similarly, assuming a Laplace distribution recovers an absolute error interpretation, termed Absolute Error Surprise in (?):
$$
S_{\text{Shannon}}(\mathbf{x})\propto|\mathbf{x}-\mu_{\mathbf{x}}|.
$$
We note that both Squared Error Surprise and Absolute Error Surprise are commonly utilized metrics in anomaly detection literature (?, ?, ?).
### The Family of Bayesian Surprises
Bayesian Surprises, by contrast, explicitly model belief updates. These measures quantify the degree to which a new observation alters the internal model, shifting the focus from event rarity to epistemic impact. This concept parallels the "belief-mismatch" surprise in the taxonomy by (?).
The canonical formulation, introduced in (?), defines Bayesian Surprise as the KullbackâLeibler divergence between the prior and posterior distributions over parameters:
$$
S_{\text{Bayes}}(\mathbf{x})=D_{\text{KL}}\left(\pi(\boldsymbol{\theta}\mid\mathbf{x})\,\|\,\pi(\boldsymbol{\theta})\right).
$$
This measure offers a principled approach to belief revision and naturally aligns with learning mechanisms. In theory, it encourages agents to reduce surprise through model updates, providing a pathway toward adaptive autonomy.
However, Bayesian Surprise is not without limitations. As data accumulates, new observations exert diminishing influence on the posterior, rendering the agent increasingly "stubborn." This behavior can result in Bayesian Surprise overlooking rare but meaningful anomalies. For example, consider the discovery by S. S. Ting of the $J$ particle, characterized by an unusually long lifespan compared to other particles in its class. Under standard Bayesian updating, scientists' beliefs about particle lifespans would barely shift due to this single observation. Consequently, Bayesian Surprise would classify such an event as merely an anomaly, potentially disregarding it.
To mitigate this posterior overconfidence, Confidence-Corrected (CC) Surprise (?) compares the current informed belief against that of a naïve learner with a flat prior:
$$
S_{\text{CC}}(\mathbf{x})=D_{\text{KL}}\left(\pi(\boldsymbol{\theta})\,\|\,\pi^{\prime}(\boldsymbol{\theta}\mid\mathbf{x})\right),
$$
where $\pi^{\prime}(\boldsymbol{\theta}\mid\mathbf{x})$ represents the updated belief assuming a uniform prior. This confidence-corrected formulation remains sensitive to new data irrespective of prior history. In the $J$ particle example, employing Confidence-Corrected Surprise would trigger a genuine surprise, as the posterior remains responsive to the novel observation without the inertia introduced by extensive historical data.
A related idea emerges with Bayes Factor (BF) Surprise (?), which compares likelihoods under naïve and informed beliefs:
$$
S_{\text{BF}}(\mathbf{x})=\frac{p(\mathbf{x}\mid\pi^{0}(\boldsymbol{\theta}))}{p(\mathbf{x}\mid\pi^{t}(\boldsymbol{\theta}))},
$$
where $\pi^{0}(\boldsymbol{\theta})$ represents the naïve (untrained) prior and $\pi^{t}(\boldsymbol{\theta})$ the informed belief based on all prior observations up to time $t$ (before observing $\mathbf{x}$ ). This ratio quantifies how strongly the current observation supports the naïve prior over the informed prior. In practice, the effectiveness of both Confidence-Corrected and Bayes Factor Surprises depends heavily on constructing appropriate priors, a task that is often challenging and subjective.
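The contrast between posterior stubbornness and confidence correction can be made concrete with a conjugate Gaussian mean model (known unit noise variance). This is our own illustrative sketch, not a construction from the paper; all function names are ours:

```python
import math

def gauss_kl(mu1, var1, mu2, var2):
    """KL( N(mu1, var1) || N(mu2, var2) ) in nats."""
    return 0.5 * (math.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def posterior(mu0, var0, x, noise_var=1.0):
    """Conjugate update of a Gaussian mean belief by one observation x."""
    var1 = 1.0 / (1.0 / var0 + 1.0 / noise_var)
    return var1 * (mu0 / var0 + x / noise_var), var1

def bayesian_surprise(mu0, var0, x):
    """S_Bayes: KL(posterior || prior) for one new observation."""
    mu1, var1 = posterior(mu0, var0, x)
    return gauss_kl(mu1, var1, mu0, var0)

def cc_surprise(mu0, var0, x, noise_var=1.0):
    """S_CC: KL(prior || flat-prior posterior); with a flat prior the
    posterior after one observation is simply N(x, noise_var)."""
    return gauss_kl(mu0, var0, x, noise_var)

# Same outlier x = 5, increasingly informed prior (variance 1/n after n
# unit-noise observations): S_Bayes shrinks (stubbornness), S_CC does not.
x = 5.0
s_bayes = [bayesian_surprise(0.0, 1.0 / n, x) for n in (1, 10, 100, 1000)]
s_cc = [cc_surprise(0.0, 1.0 / n, x) for n in (1, 10, 100, 1000)]
assert s_bayes[0] > s_bayes[-1]   # informed learner stops being surprised
assert s_cc[-1] >= s_cc[0]        # confidence-corrected learner stays responsive
```

The outlier's Bayesian Surprise collapses as $n$ grows, mirroring the $J$ particle discussion, while the confidence-corrected score remains large for a confident learner.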
Another variant within the Bayesian Surprise family is Postdictive Surprise (?), which operates in the output space rather than parameter space as in the original Bayesian Surprise:
$$
S_{\text{Postdictive}}(\mathbf{x})=D_{\text{KL}}\left(P(\mathbf{y}\mid\boldsymbol{\theta}^{\prime},\mathbf{x})\,\|\,P(\mathbf{y}\mid\boldsymbol{\theta},\mathbf{x})\right), \tag{2}
$$
where $\boldsymbol{\theta}$ and $\boldsymbol{\theta}^{\prime}$ denote parameters before and after the update, respectively. (?) argue that computing KL divergence in the output space is more computationally tractable for variational models but potentially less expressive when output variance depends on the input (e.g., under heteroskedastic conditions).
### Reflection
We acknowledge the presence of alternative categorizations of surprise definitions, notably the taxonomy in (?), which classifies surprise measures into three groups: observation surprises, probabilistic-mismatch surprises, and belief-mismatch surprises. As discussed previously, the Shannon Surprise family aligns closely with the first two categories, whereas the Bayesian Surprise family corresponds to the last.
These categorizations are not strictly delineated. For instance, Residual Information Surprise incorporates a conceptual element common to the Bayesian Surprise family: it provides a baseline against which the observed data is contrasted. On the other hand, Bayes Factor Surprise, despite being explicitly Bayesian in its formulation, closely resembles a Shannon Surprise conditioned on alternative priors. Furthermore, notwithstanding their philosophical distinctions, Bayesian and Shannon Surprises often behave similarly in practice; we provide further details on this observation in Section 4.
It is understandable that researchers initially explored these two foundational surprise definitions, each possessing inherent limitations: Shannon Surprise conflates probability with knowledge gain, while Bayesian Surprise suffers from increasing posterior stubbornness. Subsequent refinements emerged to address these shortcomings, primarily through adjusting the choice of prior to create more meaningful contrasts. The Residual Information Surprise assumes an oracle-like prior, whereas Confidence-Corrected and Bayes Factor Surprises rely on a non-informative prior. Regardless of the priors chosen, defining a suitable prior remains a challenging and unresolved issue in the research community.
Both surprise families share other critical limitations: they are single-instance and inherently one-sided measures. Being single-instance means that they assess surprise based solely on the marginal impact of individual observations, without explicitly modeling cumulative learning dynamics over time; being one-sided means that they place a decision threshold on only one side, offering limited expressiveness, since human perceptions of surprise can range from positive to negative.
## 3 Mutual Information Surprise
In this section, we introduce the concept of Mutual Information Surprise (MIS). We first explore the intuition and motivation underlying this concept, followed by the development of a novel, theoretically grounded testing sequence. We then discuss the implications when this test sequence is violated and propose a reaction policy contingent on different types of violations. Table 1 summarizes the differences in perspective between Mutual Information Surprise and the Shannon and Bayesian families of surprises.
Table 1: The perspective differences among Shannon family surprises, Bayesian family surprises, and Mutual Information Surprise.
| Surprise | Single Instance Focused | Capture Transient Changes | Aware of Learning Progression | Parametric Predictive Modeling |
| --- | --- | --- | --- | --- |
| Shannon Family | ✓ | ✓ | ✗ | ✗ |
| Bayesian Family | ✓ | ✗ | ✗ | ✓ |
| MIS | ✗ | ✓ | ✓ | ✗ |
### 3.1 What Do We Expect from a Surprise?
In human cognition, surprise often triggers reflection and adaptation. A computational analog should similarly prompt deeper examination and enhanced understanding, transcending mere statistical rarity and indicating an opportunity for learning.
To formalize this perspective, consider a system governed by a functional mapping $f:\mathbf{x}\rightarrow\mathbf{y}$ , with observations drawn from a joint distribution $P(\mathbf{x},\mathbf{y})$ . This system is well-regulated, meaning the input distribution $P(\mathbf{x})$ , output distribution $P(\mathbf{y})$ , and joint distribution $P(\mathbf{x},\mathbf{y})$ are time-invariant. This definition expands the traditional notion of time-invariance by explicitly including consistent exposure $P(\mathbf{x})$ , aligning closely with human trust in persistent patterns across rules and experiences.
To quantify system understanding, we use mutual information (MI) (?), defined as
$$
I(\mathbf{x},\mathbf{y})=\mathbb{E}_{\mathbf{x},\mathbf{y}}\left[\log\frac{p(\mathbf{y}\mid\mathbf{x})}{p(\mathbf{y})}\right]=H(\mathbf{x})+H(\mathbf{y})-H(\mathbf{x},\mathbf{y})=H(\mathbf{y})-H(\mathbf{y}\mid\mathbf{x}), \tag{3}
$$
where $H(\cdot)$ denotes entropy, measuring uncertainty or chaos of a random variable. Mutual information quantifies the reduction in uncertainty about $\mathbf{y}$ given knowledge of $\mathbf{x}$ . A high $I(\mathbf{x},\mathbf{y})$ indicates strong comprehension of $f$ , whereas stagnation or a decrease in $I(\mathbf{x},\mathbf{y})$ signals stalled learning. For the aforementioned well-regulated system, $I(\mathbf{x},\mathbf{y})$ remains constant.
Typically, mutual information $I(\mathbf{x},\mathbf{y})$ is estimated via maximum likelihood estimation (MLE) (?); details of the MLE estimator are provided in the Appendix. Empirical estimation of $I(\mathbf{x},\mathbf{y})$ is, however, downward biased for clean data with a low noise level (?):
$$
\mathbb{E}[\hat{I}(\mathbf{x},\mathbf{y})]\leq I(\mathbf{x},\mathbf{y}).
$$
Interestingly, this bias can serve as an informative feature: As experience accumulates, $\mathbb{E}[\hat{I}(\mathbf{x},\mathbf{y})]$ should increase and approach the true value $I(\mathbf{x},\mathbf{y})$ , determined by $p(\mathbf{x})$ and function $f$ . Thus, a monotonic growth in mutual information estimate signals learning.
Returning to our core questionâwhat do we expect from a surprise? Unlike classical surprise measures (Shannon or Bayesian), which focus narrowly on conditional distributions and rarity, we posit that a surprise measure should reflect whether learning occurred. Noticing the connection between mutual information growth and learning, we define surprise as a deviation from expected mutual information growth. Specifically, we define Mutual Information Surprise (MIS) as the difference in mutual information estimates after incorporating new observations:
$$
\text{MIS}\triangleq\hat{I}_{n+m}-\hat{I}_{n}, \tag{4}
$$
where $\hat{I}_{n}$ is the estimate of the mutual information $I_{n}$ at the time of the first $n$ observations, and $\hat{I}_{n+m}$ is the estimate of $I_{n+m}$ after observing $m$ additional points. From here on, we omit the variables $\mathbf{x}$ and $\mathbf{y}$ in the notation of mutual information and its estimates for simplicity. A large (relative to the sample sizes $m$ and $n$ ) positive MIS signals enlightenment, indicating significant learning, whereas a near-zero or negative MIS indicates frustration, suggesting stalled progress. Hence, MIS provides operational insight into whether a system evolves as expected, transforming it into a practical autonomy test. Significant deviations from the expected MIS trajectory indicate meaningful changes or system stagnation.
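For discrete observations, the MLE (plug-in) estimator makes Eq. (4) directly computable. A minimal sketch, with helper names of our own choosing:

```python
import math
from collections import Counter

def _entropy(counts, n):
    """Plug-in entropy H = -sum (c/n) log (c/n) from empirical counts."""
    return -sum(c / n * math.log(c / n) for c in counts.values())

def mi_plugin(pairs):
    """MLE mutual information via Eq. (3): I = H(x) + H(y) - H(x, y)."""
    n = len(pairs)
    hx = _entropy(Counter(x for x, _ in pairs), n)
    hy = _entropy(Counter(y for _, y in pairs), n)
    hxy = _entropy(Counter(pairs), n)
    return hx + hy - hxy

def mis(old_pairs, new_pairs):
    """Mutual Information Surprise of Eq. (4): I_hat_{n+m} - I_hat_n."""
    return mi_plugin(old_pairs + new_pairs) - mi_plugin(old_pairs)

# n = 100 noisy observations of a binary channel, then m = 10 clean ones:
# the clean batch sharpens p(y | x), so MIS comes out positive (enlightenment).
old = [(0, 0)] * 40 + [(0, 1)] * 10 + [(1, 1)] * 40 + [(1, 0)] * 10
new = [(0, 0), (1, 1)] * 5
assert mis(old, new) > 0
```

A batch that merely repeats the old noisy pattern would instead leave MIS near zero, the frustration regime described above.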
### 3.2 Bounding MIS
Mutual information estimation is inherently challenging: it is high-dimensional, nonlinear, and exhibits complex variance. The standard method, though principled, is a computationally expensive permutation test (?, ?), involving repeatedly shuffling $m+n$ observations into two groups, calculating MI differences, and evaluating rejection probabilities:
$$
p=\frac{1}{B}\sum_{i=1}^{B}\mathbf{1}\left(|\Delta\hat{I}_{i}|\geq|\Delta\hat{I}|\right),
$$
where $\Delta\hat{I}=\hat{I}_{n}-\hat{I}_{m}$ is the observed difference between the two groups' mutual information estimates, $\Delta\hat{I}_{i}$ is the $i$ th permuted difference, and $\mathbf{1}(\cdot)$ is the indicator function. In real-time streaming scenarios, however, permutation tests become impractical due to their computational load. Moreover, when $m\ll n$ , permutation tests lose effectiveness, yielding noisy outcomes.
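The permutation procedure above can be sketched for discrete pairs as follows (a self-contained illustration with names of our choosing; note the $B$ re-estimations of MI that make it costly for streaming use):

```python
import math
import random
from collections import Counter

def mi_plugin(pairs):
    """Plug-in mutual information I = H(x) + H(y) - H(x, y) for discrete pairs."""
    n = len(pairs)
    h = lambda cnt: -sum(c / n * math.log(c / n) for c in cnt.values())
    return (h(Counter(x for x, _ in pairs))
            + h(Counter(y for _, y in pairs))
            - h(Counter(pairs)))

def permutation_pvalue(group_n, group_m, B=200, seed=0):
    """p-value for a shift in MI: the fraction of random re-splits of the
    pooled m + n observations whose |MI difference| reaches the observed one."""
    rng = random.Random(seed)
    observed = abs(mi_plugin(group_n) - mi_plugin(group_m))
    pooled = list(group_n) + list(group_m)
    hits = 0
    for _ in range(B):
        rng.shuffle(pooled)
        delta = abs(mi_plugin(pooled[:len(group_n)])
                    - mi_plugin(pooled[len(group_n):]))
        hits += delta >= observed
    return hits / B
```

When $m\ll n$ , almost every re-split is dominated by the large group, which is exactly the noisy regime the text warns about.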
An alternative is standard deviation-based testing. For the MLE mutual information estimator $\hat{I}_{n}$ , the estimation standard deviation satisfies (?):
$$
\sigma\lesssim\frac{\log n}{\sqrt{n}}, \tag{5}
$$
where $\lesssim$ denotes an inequality that holds up to a constant factor (i.e., in order). Omitting the bias term, this yields an analytical test on the mutual information change (a brief derivation is provided in the Appendix),
$$
\hat{I}_{m+n}-\hat{I}_{n}\in\pm\sqrt{\frac{\log^{2}(m+n)}{m+n}+\frac{\log^{2}n}{n}}\cdot z_{\alpha}\asymp\mathcal{O}\left(\frac{\log n}{\sqrt{n}}\right), \tag{6}
$$
where $z_{\alpha}$ is the standard normal quantile at confidence level $\alpha$ and $\asymp$ denotes equality in order. But this test, too, is unsatisfying: the bound is so loose that it is rarely violated. The root cause is the loose upper bound in Eq. (5); empirical evidence, provided in the Appendix, suggests that the true estimation standard deviation is usually much smaller than the theoretical bound.
We therefore take a new path to bounding MIS. First, we impose several mild assumptions on the observations and the physical process.
**Assumption 1**
*We impose the following assumptions on the sampling process and physical system.*
1. *The existing observations are typical in the sense of the Asymptotic Equipartition Property (?), meaning that empirical statistics computed from the data are representative of their corresponding expected values under the experimental design's intended distribution, i.e., $\hat{I}_{n}\approx\mathbb{E}[\hat{I}_{n}]$ . This holds when we regard the initial observations as true system information.*
2. *The number of existing observations $n$ is much smaller than the cardinalities of the spaces $\mathcal{X}$ and $\mathcal{Y}$ : $n\ll|\mathcal{X}|,|\mathcal{Y}|$ .*
3. *The number of new observations $m$ is much smaller than the number of existing observations: $m\ll n$ .*
**Theorem 1**
*Consider a well-regulated autonomous system defined in Section 3.1, which satisfies the conditions in Assumption 1. With probability at least $1-\rho$ , the change in MLE-based mutual information estimates satisfies:
$$
\hat{I}_{n+m}-\hat{I}_{n}\in\left(\log(m+n)-\log n\right)\pm\frac{\sqrt{2m\log\frac{2}{\rho}}\log(m+n)}{m+n}\triangleq MIS_{\pm}.
$$
$MIS_{\pm}$ denotes the upper and lower bound for the test sequence.*
The proof of Theorem 1 is given in the Appendix. These bounds are both tighter ( $\mathcal{O}(\frac{\log n}{n})$ instead of $\mathcal{O}(\frac{\log n}{\sqrt{n}})$ ) and more efficient (an analytical test sequence) than the previous methods. They offer theoretically grounded thresholds within which we expect MI to evolve. When the bounds $MIS_{\pm}$ are breached, whether from below or from above, we know the system has encountered something unexpected.
Some may argue that for an oversampled system, condition 2 of Assumption 1 does not hold. That is true, and as a result, the expectation term in Theorem 1, $\log(m+n)-\log n$ , needs to be adjusted. For a noise-free system with a limited outcome space and a large number of existing observations, one replaces the expectation term with $(|\mathcal{Y}|-1)(\frac{1}{n}-\frac{1}{m+n})$ , and the bounds in Theorem 1 still hold.
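The claimed tightening can be checked numerically. Below is a sketch comparing the half-width of the Eq. (6) test with the Theorem 1 band; the choices $z_{\alpha}=1.96$ and $\rho=0.05$ are ours, for illustration only:

```python
import math

def width_eq6(n, m, z_alpha=1.96):
    """Half-width of the standard-deviation-based test of Eq. (6)."""
    return z_alpha * math.sqrt(math.log(m + n) ** 2 / (m + n)
                               + math.log(n) ** 2 / n)

def mis_bounds(n, m, rho=0.05):
    """Theorem 1: lower and upper limits MIS_- and MIS_+ of the acceptance band."""
    center = math.log(m + n) - math.log(n)
    half = math.sqrt(2 * m * math.log(2 / rho)) * math.log(m + n) / (m + n)
    return center - half, center + half

# n = 10_000 existing observations, m = 100 new ones.
n, m = 10_000, 100
lo, hi = mis_bounds(n, m)
assert (hi - lo) / 2 < width_eq6(n, m)   # Theorem 1's band is markedly tighter
```

For these sample sizes, the Theorem 1 half-width is roughly an order of magnitude smaller than that of Eq. (6), so violations become detectable in practice.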
### 3.3 What Does MIS Actually Tell Us?
When the quantity $\text{MIS}=\hat{I}_{n+m}-\hat{I}_{n}$ falls outside the established bounds $MIS_{\pm}$ âeither exceeding the upper bound or falling below the lower boundâthe system is considered to be surprised, thereby triggering a Mutual Information Surprise (MIS). Essentially, Theorem 1 functions as a statistical hypothesis test: the null hypothesis posits that the underlying system remains well-regulated, implying $\Delta I=I_{n+m}-I_{n}=0$ , where $I_{n}$ denotes the true mutual information at the time of $n$ observations. Any violation indicates a significant shift, with negative deviations ( $\Delta I<0$ ) and positive deviations ( $\Delta I>0$ ) each carrying distinct implications.
Recall that mutual information can be expressed in terms of entropy, as shown in Eq. (3), so changes in $\Delta I$ may result from variations in $H(\mathbf{x})$ , $H(\mathbf{y})$ , and $H(\mathbf{y}\mid\mathbf{x})$ . In this subsection, we examine the implications of MIS under different driving forces.
#### Violation from Below: Learning Has Stalled or Regressed
If
$$
\text{MIS}<\text{MIS}_{-},
$$
this implies $\Delta I(\mathbf{x},\mathbf{y})<0$ , signifying a downward shift in mutual information. A negative surprise indicates diminished or stalled learning, potentially due to:
1. Stagnation in Exploration: A downward shift driven by a decrease in input entropy $\Delta H(\mathbf{x})<0$ suggests the system repeatedly samples in a limited region, thus gathering redundant data with minimal new information.
1. Increased Noise or Process Drift: A downward shift could also result from increased conditional entropy $\Delta H(\mathbf{y}\mid\mathbf{x})>0$ , indicating greater uncertainty in predicting $\mathbf{y}$ given $\mathbf{x}$ . Practically, this often signifies increased external noise or a fundamental change in the underlying process.
#### Violation from Above: Sudden Growth in Understanding
If
$$
\text{MIS}>\text{MIS}_{+},
$$
this implies $\Delta I(\mathbf{x},\mathbf{y})>0$ , indicating an upward shift in mutual information. This positive surprise can result from:
1. Aggressive Exploration: If the increase is driven by higher input entropy $\Delta H(\mathbf{x})>0$ , the system is likely exploring previously unvisited regions aggressively, potentially inflating knowledge gains without sufficient validation.
1. Reduction in Noise: An increase due to reduced conditional entropy $\Delta H(\mathbf{y}\mid\mathbf{x})<0$ signals a desirable decrease in uncertainty, thus generally representing a beneficial development.
1. Novel Discovery: An increase in output entropy $\Delta H(\mathbf{y})>0$ suggests discovery of novel and previously rare outputsâparticularly valuable in exploratory or scientific contexts.
#### Summary Table
| Violation | Cause | Mechanism |
| --- | --- | --- |
| From below | Stagnation in exploration | $\downarrow H(\mathbf{x})\Rightarrow\downarrow I(\mathbf{x},\mathbf{y})$ |
| From below | Increased noise / process drift | $\uparrow H(\mathbf{y}\mid\mathbf{x})\Rightarrow\downarrow I(\mathbf{x},\mathbf{y})$ |
| From above | Aggressive exploration | $\uparrow H(\mathbf{x})\Rightarrow\uparrow I(\mathbf{x},\mathbf{y})$ |
| From above | Noise reduction | $\downarrow H(\mathbf{y}\mid\mathbf{x})\Rightarrow\uparrow I(\mathbf{x},\mathbf{y})$ |
| From above | Novel discovery | $\uparrow H(\mathbf{y})\Rightarrow\uparrow I(\mathbf{x},\mathbf{y})$ |
The table above summarizes potential causes of MIS violations and their implications. These patterns help the system differentiate between meaningful learning and misleading deviations, expanding beyond the capacity of classical surprise measures and providing a road map to corrective or adaptive responses for higher-level autonomy. We purposely omit the case where a decrease in $H(\mathbf{y})$ causes a violation from below, as this scenario typically lacks independent significance: it generally results from changes in the sampling strategy or the underlying process, which we have already discussed.
### 3.4 Reaction Policy: A Three-Pronged Approach
Following the identification of potential causes behind MIS triggers (Section 3.3), the next question is how the system should respond. Naturally, the systemâs reaction should align with the dominant entropy component contributing to the change. In practice, we identify the dominant entropy change by computing and ranking the ratios
$$
\frac{\text{sgn}(\text{MIS})\Delta\hat{H}(\mathbf{x})}{|\text{MIS}|},\quad\frac{\text{sgn}(\text{MIS})\Delta\hat{H}(\mathbf{y})}{|\text{MIS}|},\quad\text{and}\quad\frac{\text{sgn}(\text{MIS})\Delta\hat{H}(\mathbf{y}\mid\mathbf{x})}{|\text{MIS}|},
$$
where $\Delta\hat{H}(\cdot)=\hat{H}_{m+n}(\cdot)-\hat{H}_{n}(\cdot)$ denotes the estimated entropy change.
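For concreteness, this ranking step can be sketched in Python as follows. The function name `dominant_component` and the returned dictionary are our own notational conveniences, not part of the formal policy:

```python
import math

def dominant_component(mis, dH_x, dH_y, dH_y_given_x):
    """Rank the signed, normalized entropy changes sgn(MIS) * dH / |MIS|
    and return the component whose contribution to the MIS is largest."""
    s = math.copysign(1.0, mis)  # sgn(MIS)
    ratios = {
        "H(x)": s * dH_x / abs(mis),
        "H(y)": s * dH_y / abs(mis),
        "H(y|x)": s * dH_y_given_x / abs(mis),
    }
    return max(ratios, key=ratios.get), ratios
```

For example, with $\text{MIS}=0.2$ and $\Delta\hat{H}(\mathbf{x})=0.15$ dominating the change, the function identifies $H(\mathbf{x})$ as the component to react to.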
We do not prescribe a specific reaction when $\Delta\hat{H}(\mathbf{y})$ dominates the MIS, as an increase in $H(\mathbf{y})$ is typically a passive consequence of changes in $H(\mathbf{x})$ and $H(\mathbf{y}\mid\mathbf{x})$ . When both $H(\mathbf{x})$ and $H(\mathbf{y}\mid\mathbf{x})$ remain relatively stable, a rise in $H(\mathbf{y})$ indicates that the current sampling strategy is effectively uncovering novel information; thus, no change in action is required.
For $\Delta\hat{H}(\mathbf{x})$ and $\Delta\hat{H}(\mathbf{y}\mid\mathbf{x})$ , situations may arise where their contributions are similar, i.e., there is no clearly dominant entropy component, and we need a resolution mechanism to break the tie. To address all these scenarios, we propose a three-pronged reaction policy that serves as a supervisory layer, compatible with existing exploration-exploitation sampling strategies:
1. Sampling Adjustment. The first policy addresses variations in input entropy $H(\mathbf{x})$ . If $\Delta\hat{H}(\mathbf{x})>0$ dominates MIS, indicating overly aggressive exploration, the system should moderate exploration and emphasize exploitation to prevent fitting to noise. Conversely, if $\Delta\hat{H}(\mathbf{x})<0$ , suggesting redundant sampling, the system should enhance exploration to restore sample diversity.
2. Process Forking. The second policy responds to variations in conditional entropy $H(\mathbf{y}\mid\mathbf{x})$ , i.e., changes in the function mapping. Upon a surprise triggered by $\Delta\hat{H}(\mathbf{y}\mid\mathbf{x})$ , the system forks into two subprocesses, consisting of the $n$ existing observations and the $m$ new observations divided at the surprise moment (Theorem 1). The two subprocesses represent the prior process (existing observations) and the likely altered process (new observations), and they continue their sampling separately. The subprocess that first encounters a $\Delta\hat{H}(\mathbf{y}\mid\mathbf{x})$ -triggered surprise is discarded, and the remaining subprocess continues as the main process. In the extremely rare case where both subprocesses trigger a $\Delta\hat{H}(\mathbf{y}\mid\mathbf{x})$ -dominated MIS surprise at the same time, we discard the subprocess with fewer observations and continue with the one that has more.
3. Coin Toss Resolution. There are occasions where the changes $\Delta\hat{H}(\mathbf{x})$ and $\Delta\hat{H}(\mathbf{y}\mid\mathbf{x})$ are comparable, making the selection of a reaction policy challenging. Instead of arbitrarily favoring the slightly larger change, we use a biased coin toss, stochastically selecting which entropy to address based on the magnitudes of the changes:
$$
p_{\text{adjust}}=\frac{|\Delta\hat{H}(\mathbf{x})|}{|\Delta\hat{H}(\mathbf{x})|+|\Delta\hat{H}(\mathbf{y}\mid\mathbf{x})|},\quad p_{\text{fork}}=1-p_{\text{adjust}}.
$$
The decision variable $z$ is sampled as $z\sim\text{Bernoulli}(p_{\text{adjust}})$ , with $z=1$ indicating sampling adjustment and $z=0$ indicating process forking. This mechanism ensures balanced reactions and robustness, and prevents overreaction to marginal signals.
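The tie-breaking coin toss amounts to a single Bernoulli draw; a minimal sketch (the function name `resolve_tie` is ours):

```python
import random

def resolve_tie(dH_x, dH_y_given_x, rng=None):
    """Biased coin toss between sampling adjustment (z=1) and process
    forking (z=0); the bias follows the magnitude of each entropy change."""
    rng = rng or random.Random()
    p_adjust = abs(dH_x) / (abs(dH_x) + abs(dH_y_given_x))
    z = 1 if rng.random() < p_adjust else 0  # z ~ Bernoulli(p_adjust)
    return z, p_adjust
```

With $|\Delta\hat{H}(\mathbf{x})|=0.3$ and $|\Delta\hat{H}(\mathbf{y}\mid\mathbf{x})|=0.1$ , sampling adjustment is chosen with probability $0.75$.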
The description above provides a brief summary of the MIS reaction policy. In the remainder of this subsection, we present the policy as an algorithm. To do so, we first formally define a sampling process and then give the detailed algorithmic implementation in Algorithm 1.
**Definition 1**
*A sampling process $\mathcal{P}(\mathbf{X},g(\cdot))$ consists of two components: existing observations $\mathbf{X}$ and a sampling function $g(\cdot)$ , where the next sample location is determined by
$$
\mathbf{x}_{\text{next}}\sim g(\mathbf{X}),
$$
with $\mathbf{x}_{\text{next}}$ drawn from the stochastic oracle $g(\mathbf{X})$ . If $g(\cdot)$ is deterministic, $\sim$ is replaced by equality ( $=$ ). For clarity, a sampling process with $n$ existing observations is denoted $\mathcal{P}_{n}$ .*
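Definition 1 maps naturally onto a small data structure; the sketch below (class and method names are our own) pairs the observation set with a stochastic oracle:

```python
import random

class SamplingProcess:
    """Definition 1 as a minimal data structure: existing observations X
    and a sampling function g that proposes the next sample location."""
    def __init__(self, X, g):
        self.X = list(X)
        self.g = g

    def next_location(self):
        # x_next ~ g(X); for a deterministic g, this is simply g(X)
        return self.g(self.X)

    def observe(self, x):
        self.X.append(x)

# Example oracle: sample uniformly over [0, 30], ignoring past observations
rng = random.Random(42)
proc = SamplingProcess([], lambda X: rng.uniform(0, 30))
```

A deterministic $g(\cdot)$ (e.g., an acquisition-function maximizer) fits the same interface, with `next_location` returning $g(\mathbf{X})$ exactly.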
Algorithm 1 Mutual Information Surprise Reaction Policy (MISRP)
1: Input: A sampling process $\mathcal{P}(\mathbf{Z},g(\cdot))$ , where $\mathbf{Z}$ consists of $k$ pairs of inputs $\mathbf{X}$ and outputs $\mathbf{Y}$ ; a maximum reflection threshold $T$ ; reflection period $m=2$
2: while $m\leq\min(T,\frac{k}{2})$ do
3: Set $n=k-m$ ; Compute $MIS=\hat{I}_{m+n}-\hat{I}_{n}$ ; Record $\Delta\hat{H}(\mathbf{x})$ , $\Delta\hat{H}(\mathbf{y})$ , and $\Delta\hat{H}(\mathbf{y}\mid\mathbf{x})$
4: if $MIS\not\in MIS_{\pm}$ and $\frac{\text{sgn}(\text{MIS})\Delta\hat{H}(\mathbf{y})}{|\text{MIS}|}\neq\max\big{\{}\frac{\text{sgn}(\text{MIS})\Delta\hat{H}(\mathbf{x})}{|\text{MIS}|},\frac{\text{sgn}(\text{MIS})\Delta\hat{H}(\mathbf{y})}{|\text{MIS}|},\frac{\text{sgn}(\text{MIS})\Delta\hat{H}(\mathbf{y}\mid\mathbf{x})}{|\text{MIS}|}\big{\}}$ then
5: Compute bias: $p\leftarrow\frac{|\Delta\hat{H}(\mathbf{x})|}{|\Delta\hat{H}(\mathbf{x})|+|\Delta\hat{H}(\mathbf{y}\mid\mathbf{x})|}$
6: Sample $z\sim\text{Bernoulli}(p)$
7: if $z=1$ then $\triangleright$ Sampling Adjustment
8: if $MIS>MIS_{+}$ then
9: Modify $g$ to reduce exploration and increase exploitation
10: else
11: Modify $g$ to increase exploration and reduce redundancy
12: end if
13: break while
14: else $\triangleright$ Process Forking
15: if $\mathcal{P}$ is forked and the other process is not requesting Process Forking then
16: Delete $\mathcal{P}$ ; Merge the other process as the main process
17: break while
18: end if
19: if $\mathcal{P}$ is forked and the other process is requesting Process Forking then
20: Delete the subprocess with fewer observations; Merge the other one as the main process
21: break while
22: end if
23: Fork process into two branches: $\mathcal{P}_{n}$ and $\mathcal{P}_{m}$
24: Call $\text{MISRP}(\mathcal{P}_{n},T)$ and $\text{MISRP}(\mathcal{P}_{m},T)$
25: break while
26: end if
27: else
28: No action required (surprise within expected bounds)
29: end if
30: $m=m+1$
31: end while
We offer several remarks on the MIS reaction policy $\text{MISRP}(\mathcal{P},T)$ :
- In the pseudocode, we introduce two additional notations: the maximum reflection threshold $T$ and the total number of observations $k$ . In practice, MIS is computed retroactively: given a sequence of $k$ observations, we partition them into $m$ recent observations and $n=k-m$ older observations to compute the MIS. We term the $m$ recent observations the reflection period and increment $m$ to iterate over different partition points. The reflection period $m$ is constrained to be no greater than $\min(T,\frac{k}{2})$ . This constraint is motivated by the comparative behavior of test statistics derived from Theorem 1 and the variance-based test in Eq. (6). Specifically, when $m=n$ , both our proposed test and the variance-based test yield statistics of order $\mathcal{O}\left(\frac{\log n}{\sqrt{n}}\right)$ . As discussed in Section 3.2, such statistics are typically too loose to be violated in practice, thereby diminishing the sensitivity advantage of our method. Consequently, evaluating MIS beyond $m=\frac{k}{2}$ is unnecessary and computationally inefficient. The reflection threshold $T$ is introduced to ensure computational feasibility, and we recommend selecting $T$ as large as computational resources permit.
- Note that the reflection period $m$ starts at $2$ . This implies that the reaction policy does not respond to a single-instance surprise. Mathematically, this is because the bound in Theorem 1 is ill-defined for $m=1$ . Intuitively, MIS measures the progression of learning in a sampling process, and it is impossible to determine whether a single observation is informative or erroneous without additional verification. Therefore, the MIS policy always takes at least two additional samples before reacting. One may argue that this requirement for an extra sample imposes additional experimental cost. That is true. But recall that one insight from the study in (?) is the long-run benefit of the extra resources spent on deciding the nature of an observation.
- It is important to emphasize that both the sampling adjustment and process forking approaches are rooted in active learning literature and practice. Balancing exploration and exploitation, i.e., sampling adjustment, has long been a key topic in Bayesian optimization and active learning (?), whereas discarding irrelevant observations, as we do in process forking, is a common practice in the dataset drift literature (?, ?, ?, ?, ?). Our Mutual Information Surprise reaction framework provides a principled mechanism for autonomous systems to determine how to balance exploration versus exploitation and when or what to discard (i.e., forget).
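Putting the pieces together, a single reflective step of the policy can be sketched as a decision function. This is a heavy simplification of Algorithm 1 (the string actions, dictionary keys, and function name are our own; bound computation and the actual forking mechanics are omitted):

```python
import random

def misrp_decision(mis_value, bounds, dH, rng=None):
    """One reflective step of MISRP, simplified: given the MIS value, its
    (lower, upper) bound, and estimated entropy changes
    dH = {"x": ..., "y": ..., "y|x": ...}, return the chosen action."""
    rng = rng or random.Random()
    lo, hi = bounds
    if lo <= mis_value <= hi:
        return "no_action"                      # surprise within expected bounds
    s = 1.0 if mis_value > 0 else -1.0          # sgn(MIS)
    ratios = {k: s * v / abs(mis_value) for k, v in dH.items()}
    if max(ratios, key=ratios.get) == "y":
        return "no_action"                      # dH(y) dominates: passive novelty
    # Biased coin toss between sampling adjustment and process forking
    p_adjust = abs(dH["x"]) / (abs(dH["x"]) + abs(dH["y|x"]))
    if rng.random() < p_adjust:                 # z = 1: sampling adjustment
        return "reduce_exploration" if mis_value > hi else "increase_exploration"
    return "fork"                               # z = 0: process forking
```

The direction of the sampling adjustment follows the bound that was violated: a violation from above triggers reduced exploration, a violation from below triggers increased exploration.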
## 4 Numerical Analysis
In this section, we illustrate the merits of Mutual Information Surprise (MIS). Section 4.1 demonstrates the strength of MIS compared to classical surprise measures. Section 4.2 showcases the advantages of the MIS reaction policy in the context of dynamically estimating a pollution map using data generated from a physics-based simulator.
### 4.1 Putting Surprise to the Test
To compare MIS with classical surprise measures, principally Shannon and Bayesian Surprise, we conduct a series of controlled simulations using a simple yet interpretable system, designed to reveal how each measure behaves under varying conditions. The system is governed by the mapping
$$
y=x\mod 10, \tag{7}
$$
chosen for its simplicity, modifiability, and clarity of interpretation. The first four scenarios are fully deterministic, while the final two introduce noise and perturbations, enabling an assessment of whether each surprise measure responds meaningfully to new observations, structural changes, or stochastic disturbances. Each simulation begins with $100$ samples drawn uniformly from $x\in[0,30]$ to establish the system's initial knowledge. We then progressively introduce new data under different conditions, recording the response of each surprise measure. As the magnitudes of MIS, Shannon Surprise, and Bayesian Surprise differ in scale, our analysis focuses on behavioral trends (how each measure changes, spikes, or saturates) rather than on their absolute values.
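The simulation setup is a few lines of code; a minimal sketch (integerizing the real-valued inputs before the modulus is our simplification, and the variable names are ours):

```python
import random

rng = random.Random(0)

def system(x):
    """The toy system of Eq. (7): y = x mod 10, on integerized inputs."""
    return int(x) % 10

# 100 initial samples drawn uniformly from [0, 30] establish initial knowledge
X0 = [rng.uniform(0, 30) for _ in range(100)]
Y0 = [system(x) for x in X0]
```

Each scenario then appends new $(x,y)$ pairs to this base according to its own sampling rule.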
The surprise measures are computed as follows. Shannon Surprise is calculated using its classical definition in Eq. (1), as the negative log-likelihood of the true label under a Gaussian Process predictive model. Bayesian Surprise is computed as Postdictive Surprise, defined in Eq. (2), using the KL divergence between the prior and posterior predictive distributions of $y$ at each input $x$ . The same Gaussian Process predictive model, with a Matérn $\nu=2.5$ kernel and a constant noise level of $0.1$ , is used for both. After each surprise computation, the model is re-trained with all currently available data.
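Since a GP predictive distribution at any input is Gaussian, both baseline measures reduce to closed forms. A sketch of those forms, assuming the GP fitting itself (e.g., via a library such as scikit-learn) is done separately:

```python
import math

def shannon_surprise(y, mu, sigma):
    """Shannon Surprise (Eq. 1) under a Gaussian predictive N(mu, sigma^2):
    the negative log-likelihood of the observed label y."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)

def postdictive_surprise(mu_prior, s_prior, mu_post, s_post):
    """Postdictive (Bayesian) Surprise (Eq. 2) for Gaussian predictives:
    KL( N(mu_post, s_post^2) || N(mu_prior, s_prior^2) )."""
    return (math.log(s_prior / s_post)
            + (s_post ** 2 + (mu_post - mu_prior) ** 2) / (2 * s_prior ** 2)
            - 0.5)
```

As expected, Shannon Surprise grows with the residual $|y-\mu|$, and the KL term vanishes when the posterior predictive equals the prior predictive.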
For MIS, we treat the initial $100$ observations as the initial sample size $n=100$ , as defined in Section 3.1. As sampling continues, the number of new observations $m$ increases (represented in the ticks of the X-axis in the figures). The output space has cardinality $|\mathcal{Y}|=10$ , corresponding to the ten possible outcomes of the modulus function, except in Scenario 6 where $|\mathcal{Y}|=20$ . MIS is calculated as defined in Eq. (4). When the theoretical bound in Theorem 1 is used, the probability level is set to $\rho=0.1$ . The bias term is adjusted as discussed in Section 3.2, since $n\gg|\mathcal{Y}|$ in this setting.
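To make the MIS computation concrete, a plug-in estimator on discretized data can serve as a stand-in for the estimator of Eq. (4); this simplification, and the function names, are ours:

```python
from collections import Counter
import math

def entropy(samples):
    """Plug-in (empirical) entropy, in nats, of a discrete sample."""
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in Counter(samples).values())

def mutual_information(xs, ys):
    """Plug-in estimate of I(x; y) = H(x) + H(y) - H(x, y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def mis(xs, ys, m):
    """MIS = I_hat_{m+n} - I_hat_n: the MI estimate on all observations
    minus the estimate on the first n = len(xs) - m observations."""
    n = len(xs) - m
    return mutual_information(xs, ys) - mutual_information(xs[:n], ys[:n])
```

For the deterministic modulus system, adding consistent observations leaves the estimate essentially unchanged, so the MIS stays near zero, matching the Scenario 1 expectation.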
Scenario 1: Standard Exploration.
New data is randomly sampled from $x\in[30,100]$ , expanding the domain without altering the underlying function or aggressively exploring unfamiliar regions. This represents a system exploring new yet consistent areas of its environment.
Expected behavior: A well-calibrated surprise measure should indicate ongoing learning without abrupt fluctuations. We do not expect MIS to be violated.
As shown in Figure 1, MIS progresses steadily within its expected bounds, reflecting a stable and well-regulated learning process. In contrast, Shannon and Bayesian Surprises fluctuate erratically, often spiking without clear justification.
<details>
<summary>x1.png Details</summary>

Line chart "Shannon and Bayesian Surprises" over the number of explorations $m$ (0 to 100): Shannon Surprise (dashed blue, left axis, 0 to 8) fluctuates erratically with repeated peaks and valleys across the entire range, while Bayesian Surprise (solid red, right axis, 0 to 20) spikes briefly at the start and then remains near zero.
</details>
<details>
<summary>x2.png Details</summary>

Line chart "Mutual Information Surprise" over the number of explorations $m$ (0 to 100): the MIS curve (green) dips slightly below zero early on, rises gradually to about 0.07, and plateaus, remaining inside the gray shaded MIS bound for the entire range.
</details>
Figure 1: Surprise measures during standard exploration.
Scenario 2: Over-Exploitation.
In this scenario, the system repeatedly samples a previously seen point from $x\in[0,30]$ , specifically observing the pair $(x,y)=(7,7)$ one hundred times. This simulates stagnation.
Expected behavior: Surprise should diminish as no new information is gained. This mirrors the stagnation case in Section 3.3, and we expect MIS to violate its lower bound.
Figure 2 shows that MIS falls below its lower bound, signaling a lack of knowledge gain. While Shannon and Bayesian Surprises also trend downward, they lack a defined lower threshold, limiting their reliability for flagging such behavior. Recall that both Shannon and Bayesian Surprises are inherently one-sided, as noted in (?) and (?).
<details>
<summary>x3.png Details</summary>

Line chart "Shannon and Bayesian Surprise" over the number of exploitations $m$ (0 to 100): Shannon Surprise (dashed blue, left axis) declines gradually from about $-2.1$ to $-3.5$, while Bayesian Surprise (solid red, right axis) drops steeply from about 0.011 toward zero within the first few exploitations and stays there.
</details>
<details>
<summary>x4.png Details</summary>

Line chart "Mutual Information Surprise" over the number of exploitations $m$ (0 to 100): the MIS curve (green) starts near zero and decreases steadily as exploitation continues, while the gray shaded MIS bound narrows downward with $m$.
</details>
Figure 2: Surprise measures under over-exploitation.
Scenario 3: Noisy Exploration.
We perform standard exploration over $x\in[30,100]$ but apply random corruption to the outputs $\mathbf{y}$ , replacing each with a uniformly random digit between $0$ and $9$ . This simulates exploration without informative feedback.
Expected behavior: Despite novel inputs, the system should register confusion if understanding fails to improve. This mirrors the noise-increase case in Section 3.3, and we expect MIS to violate its lower bound.
Figure 3 confirms this expectation: MIS drops below its expected range, accurately signaling a lack of learning. In contrast, Shannon and Bayesian Surprise again display erratic behavior without consistent trends.
<details>
<summary>x5.png Details</summary>

Line chart "Shannon and Bayesian Surprises" over the number of explorations $m$ (0 to 100): both Shannon Surprise (dashed blue, left axis) and Bayesian Surprise (solid red, right axis) oscillate wildly across the range, with only a mild downward drift after $m\approx 40$ and no consistent trend.
</details>
<details>
<summary>x6.png Details</summary>

### Visual Description
\n
## Line Chart: Mutual Information Surprise
### Overview
The image presents a line chart illustrating the relationship between the number of explorations (m) and the mutual information surprise. A shaded region represents the MIS Bound. The chart aims to visualize how the mutual information surprise changes as the number of explorations increases.
### Components/Axes
* **X-axis:** Number of Explorations (m), ranging from 0 to 100.
* **Y-axis:** Mutual Information Surprise, ranging from approximately -0.5 to 0.2.
* **Data Series:**
* "Mutual Information Surprise" - represented by a green line.
* **Legend:** Located in the top-right corner.
* Green line: "Mutual Information Surprise"
* Gray shaded area: "MIS Bound"
### Detailed Analysis
The green line representing "Mutual Information Surprise" starts at approximately 0.05 at m=0. The line generally slopes downward as the number of explorations increases.
Here's a breakdown of approximate data points:
* m = 0: Mutual Information Surprise ≈ 0.05
* m = 10: Mutual Information Surprise ≈ -0.05
* m = 20: Mutual Information Surprise ≈ -0.15
* m = 30: Mutual Information Surprise ≈ -0.20
* m = 40: Mutual Information Surprise ≈ -0.25
* m = 50: Mutual Information Surprise ≈ -0.30
* m = 60: Mutual Information Surprise ≈ -0.35
* m = 70: Mutual Information Surprise ≈ -0.40
* m = 80: Mutual Information Surprise ≈ -0.45
* m = 90: Mutual Information Surprise ≈ -0.50
* m = 100: Mutual Information Surprise ≈ -0.52
The "MIS Bound" is represented by a gray shaded region. It starts with a relatively large positive value at m=0 (approximately 0.2) and gradually decreases, becoming more negative as m increases. The lower bound of the shaded region appears to converge towards the "Mutual Information Surprise" line as m approaches 100.
### Key Observations
* The "Mutual Information Surprise" consistently decreases as the number of explorations increases.
* The "MIS Bound" provides an upper and lower limit for the "Mutual Information Surprise".
* The gap between the "Mutual Information Surprise" line and the lower bound of the "MIS Bound" narrows as the number of explorations increases.
* The "Mutual Information Surprise" appears to approach a stable negative value as the number of explorations reaches 100.
### Interpretation
The chart suggests that as the number of explorations increases, the mutual information surprise decreases. This could indicate that with more exploration, the system gains more certainty and reduces its surprise about the environment or data. The "MIS Bound" likely represents a theoretical limit or confidence interval for the mutual information surprise. The convergence of the "Mutual Information Surprise" line towards the lower bound of the "MIS Bound" suggests that the system is approaching a state where the observed surprise is consistent with the theoretical limits. The initial positive surprise at m=0 could represent the initial uncertainty or novelty of the system before any exploration has taken place. The negative values indicate a reduction in surprise, potentially due to learning or adaptation. The chart demonstrates a clear trend of diminishing returns in terms of information gain with increasing exploration.
</details>
Figure 3: Surprise measures under noisy exploration.
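For concreteness, the two classical measures compared in these figures can be sketched for a simple conjugate Gaussian model. This is an illustrative sketch only, not the paper's implementation: Shannon Surprise is taken as the negative log predictive density of the new observation, and Bayesian Surprise as the KL divergence from posterior to prior after the update; all helper names are hypothetical.

```python
import math

def shannon_surprise(y, mu, var):
    """Negative log-likelihood of y under a Gaussian predictive N(mu, var)."""
    return 0.5 * math.log(2 * math.pi * var) + (y - mu) ** 2 / (2 * var)

def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """KL(N(mu_q, var_q) || N(mu_p, var_p)) in closed form."""
    return 0.5 * (math.log(var_p / var_q)
                  + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def bayesian_surprise(y, mu0, var0, noise_var):
    """KL divergence from posterior to prior after a conjugate Gaussian
    update with known noise variance (illustrative assumption)."""
    var_post = 1.0 / (1.0 / var0 + 1.0 / noise_var)
    mu_post = var_post * (mu0 / var0 + y / noise_var)
    return gaussian_kl(mu_post, var_post, mu0, var0)
```

Under this sketch, an observation far from the predictive mean inflates Shannon Surprise immediately, while Bayesian Surprise grows only insofar as the observation actually moves the posterior, which is why the two curves need not peak together.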
Scenario 4: Aggressive Exploration.
This scenario enforces strict exploration over $x\in[30,500]$ , where each new sample is far from all observed points (i.e., outside the $\pm 1$ neighborhood range).
Expected behavior: Aggressive exploration without verification can lead to overconfidence. This mirrors the aggressive exploration case in Section 3.3, and we expect MIS to exceed its upper bound.
Figure 4 shows MIS surpassing its upper bound, consistent with this expectation. Shannon and Bayesian Surprises again fluctuate unpredictably.
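The strict exploration rule of this scenario (every new sample outside the $\pm 1$ neighborhood of all observed points) can be sketched as a simple rejection sampler; the helper name and the uniform proposal distribution are illustrative assumptions.

```python
import random

def aggressive_sample(observed_x, low=30.0, high=500.0, min_gap=1.0, max_tries=10000):
    """Draw x uniformly from [low, high] until it lies outside the
    +/- min_gap neighborhood of every observed point."""
    for _ in range(max_tries):
        x = random.uniform(low, high)
        if all(abs(x - xi) > min_gap for xi in observed_x):
            return x
    raise RuntimeError("no admissible point found; region may be saturated")
```

Because every accepted point is, by construction, never verified by nearby samples, a model trained on such data can become overconfident, which is the failure mode this scenario is designed to expose.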
<details>
<summary>x7.png Details</summary>

### Visual Description
## Line Chart: Shannon and Bayesian Surprises
### Overview
The image presents a line chart comparing Shannon Surprise and Bayesian Surprise over a number of explorations. The chart displays two distinct lines representing each surprise metric, plotted against the number of explorations (m). A dual y-axis is used, with Shannon Surprise on the left and Bayesian Surprise on the right.
### Components/Axes
* **Title:** "Shannon and Bayesian Surprises" - positioned at the top-center of the chart.
* **X-axis:** "Number of Explorations (m)" - ranging from approximately 0 to 100, with tick marks at intervals of 10.
* **Left Y-axis:** "Shannon Surprise" - ranging from 0 to 8, with tick marks at intervals of 1.
* **Right Y-axis:** "Bayesian Surprise" - ranging from 0 to 20, with tick marks at intervals of 2.5.
* **Legend:** Located at the top-center of the chart.
* "Shannon Surprise" - represented by a dashed blue line.
* "Bayesian Surprise" - represented by a solid red line.
* **Gridlines:** Present to aid in reading values.
### Detailed Analysis
**Shannon Surprise (Blue Dashed Line):**
The Shannon Surprise line exhibits a highly fluctuating pattern. It generally stays between 1 and 3 for the first 20 explorations. Around exploration 20, it spikes to approximately 7, then drops back down. There are several other spikes throughout the chart, reaching peaks around 6-7 at explorations 30, 50, 60, 70, 80, and 90. The line ends around a value of 3.5 at exploration 100.
**Bayesian Surprise (Red Solid Line):**
The Bayesian Surprise line also fluctuates, but generally remains lower than the Shannon Surprise. It starts around 1.5 and increases to approximately 5 by exploration 20. It then experiences a significant drop to around 1 at exploration 40. A large peak occurs around exploration 50, reaching approximately 18. After this peak, the line generally decreases, fluctuating between 2 and 8 until exploration 100, where it ends around 5.
**Specific Data Points (Approximate):**
| Exploration (m) | Shannon Surprise | Bayesian Surprise |
|---|---|---|
| 0 | 1.5 | 1.5 |
| 10 | 2.0 | 2.5 |
| 20 | 7.0 | 5.0 |
| 30 | 6.0 | 4.0 |
| 40 | 1.0 | 1.0 |
| 50 | 5.5 | 18.0 |
| 60 | 6.5 | 7.0 |
| 70 | 6.0 | 5.0 |
| 80 | 6.5 | 4.0 |
| 90 | 6.0 | 6.0 |
| 100 | 3.5 | 5.0 |
### Key Observations
* The Bayesian Surprise generally exhibits larger peaks than the Shannon Surprise.
* Both surprise metrics show significant fluctuations, indicating a dynamic exploration process.
* There is no clear correlation between the peaks of the two surprise metrics. Sometimes they peak together (e.g., around exploration 60), while other times they peak independently (e.g., exploration 20 for Shannon, exploration 50 for Bayesian).
* The Bayesian Surprise has a much larger range of values (0-18) compared to the Shannon Surprise (0-7).
### Interpretation
The chart suggests that the exploration process reveals information that is sometimes surprising from both a Shannon and Bayesian perspective. The fluctuations in both surprise metrics indicate that the information gained during exploration is not consistent; some explorations yield more surprising results than others. The lack of correlation between the peaks suggests that the information surprising from a Shannon perspective (information content) is not necessarily the same as the information surprising from a Bayesian perspective (deviation from prior beliefs). The larger range of the Bayesian Surprise suggests that the prior beliefs are being significantly updated during the exploration process. The large peak in Bayesian Surprise at exploration 50 could indicate a particularly significant discovery that drastically changed the understanding of the system being explored. The chart provides insight into the information gain during an exploration process, highlighting the dynamic nature of learning and discovery.
</details>
<details>
<summary>x8.png Details</summary>

### Visual Description
## Line Chart: Mutual Information Surprise
### Overview
The image presents a line chart illustrating the relationship between the number of explorations (m) and the mutual information surprise. A shaded region represents the MIS (Mutual Information Surprise) bound. The chart aims to demonstrate how the mutual information surprise changes as the number of explorations increases, and how it relates to a theoretical bound.
### Components/Axes
* **X-axis:** Number of Explorations (m), ranging from 0 to 100.
* **Y-axis:** Mutual Information Surprise, ranging from 0.0 to 0.6.
* **Line 1:** "Mutual Information Surprise" - a green line representing the observed mutual information surprise.
* **Area:** "MIS Bound" - a gray shaded area representing the theoretical bound for mutual information surprise.
* **Legend:** Located in the top-right corner, identifying the line and shaded area.
### Detailed Analysis
The green line representing "Mutual Information Surprise" starts at approximately 0.05 at m=0. The line generally slopes upward, indicating an increase in mutual information surprise as the number of explorations increases.
Here's a breakdown of approximate data points:
* m = 0: Mutual Information Surprise ≈ 0.05
* m = 10: Mutual Information Surprise ≈ 0.18
* m = 20: Mutual Information Surprise ≈ 0.25
* m = 30: Mutual Information Surprise ≈ 0.33
* m = 40: Mutual Information Surprise ≈ 0.40
* m = 50: Mutual Information Surprise ≈ 0.45
* m = 60: Mutual Information Surprise ≈ 0.50
* m = 70: Mutual Information Surprise ≈ 0.54
* m = 80: Mutual Information Surprise ≈ 0.57
* m = 90: Mutual Information Surprise ≈ 0.59
* m = 100: Mutual Information Surprise ≈ 0.60
The gray shaded area ("MIS Bound") starts at approximately -0.1 at m=0 and expands upwards as m increases, reaching a maximum height of approximately 0.35 at m=100. The line representing "Mutual Information Surprise" remains consistently *above* the "MIS Bound" throughout the entire range of explorations.
### Key Observations
* The mutual information surprise increases with the number of explorations, but the rate of increase diminishes as the number of explorations grows larger.
* The observed mutual information surprise consistently exceeds the theoretical MIS bound.
* The line exhibits some fluctuations, suggesting that the increase in mutual information surprise is not perfectly smooth.
### Interpretation
The chart suggests that increasing the number of explorations leads to a greater understanding or revelation of information (as measured by mutual information surprise). The fact that the observed surprise consistently exceeds the theoretical bound indicates that the exploration process is yielding more information than expected based on the MIS model. This could imply that the model is conservative or that there are factors not accounted for in the MIS calculation that contribute to the observed surprise. The diminishing rate of increase suggests that there may be a point of diminishing returns, where further explorations yield progressively smaller increases in mutual information surprise. The fluctuations in the line could be due to the inherent randomness or complexity of the exploration process. This chart is likely used to evaluate the effectiveness of an exploration strategy or algorithm, and to understand the trade-off between exploration effort and information gain.
</details>
Figure 4: Surprise measures during aggressive exploration.
Scenario 5: Noise Decrease.
To simulate noise reduction, we begin with $100$ initial observations from $x\in[0,30]$ , paired with a randomly assigned output $y\in[0,9]$ . New samples are drawn from the same $x$ range but the new $y$ is produced using the deterministic modulus function in Eq. (7).
Expected behavior: Reduced noise implies a stronger input-output dependency, so we expect MIS to exceed its upper bound.
Figure 5 confirms this: MIS grows beyond its bound. Shannon and Bayesian Surprises continue to spike erratically.
<details>
<summary>x9.png Details</summary>

### Visual Description
## Line Chart: Shannon and Bayesian Surprises
### Overview
The image presents a line chart comparing Shannon Surprise and Bayesian Surprise over a range of sampling points. The chart displays two fluctuating lines, one representing Shannon Surprise (dashed blue line) and the other representing Bayesian Surprise (solid red line). The x-axis represents the number of samplings, and the y-axes represent the respective surprise values.
### Components/Axes
* **Title:** "Shannon and Bayesian Surprises" (centered at the top)
* **X-axis:** "Number of Samplings (m)" - Scale ranges from approximately 0 to 100, with tick marks at intervals of 10.
* **Left Y-axis:** "Shannon Surprise" - Scale ranges from 0 to 8, with tick marks at intervals of 1.
* **Right Y-axis:** "Bayesian Surprise" - Scale ranges from 0 to 20, with tick marks at intervals of 2.5.
* **Legend:** Located in the bottom-left corner.
* "Shannon Surprise" - Represented by a dashed blue line.
* "Bayesian Surprise" - Represented by a solid red line.
* **Gridlines:** Horizontal and vertical gridlines are present to aid in reading values.
### Detailed Analysis
**Shannon Surprise (Dashed Blue Line):**
The Shannon Surprise line exhibits a highly oscillatory pattern. It fluctuates rapidly between approximately 0 and 8.
* At x = 0, y ≈ 0.
* At x = 10, y ≈ 7.
* At x = 20, y ≈ 0.
* At x = 30, y ≈ 6.
* At x = 40, y ≈ 0.
* At x = 50, y ≈ 5.
* At x = 60, y ≈ 0.
* At x = 70, y ≈ 4.
* At x = 80, y ≈ 0.
* At x = 90, y ≈ 2.
* At x = 100, y ≈ 0.
**Bayesian Surprise (Solid Red Line):**
The Bayesian Surprise line also shows a highly oscillatory pattern, but with a larger amplitude and generally higher values. It fluctuates between approximately 0 and 18.
* At x = 0, y ≈ 1.
* At x = 10, y ≈ 17.
* At x = 20, y ≈ 1.
* At x = 30, y ≈ 15.
* At x = 40, y ≈ 1.
* At x = 50, y ≈ 13.
* At x = 60, y ≈ 2.
* At x = 70, y ≈ 10.
* At x = 80, y ≈ 1.
* At x = 90, y ≈ 8.
* At x = 100, y ≈ 2.
### Key Observations
* Both surprise measures exhibit a strong oscillatory behavior, suggesting a periodic or fluctuating underlying process.
* Bayesian Surprise generally has a much larger range of values than Shannon Surprise.
* The peaks and troughs of the two lines do not perfectly align, indicating that the two measures capture different aspects of the "surprise" associated with the sampled data.
* The lines appear to be somewhat correlated, as periods of high surprise in one measure often correspond to periods of relatively high surprise in the other.
### Interpretation
The chart demonstrates the difference between Shannon Surprise and Bayesian Surprise as they vary across a series of samplings. The high degree of fluctuation in both lines suggests that the underlying data source is highly variable or contains significant noise. The larger magnitude of Bayesian Surprise indicates that the Bayesian measure is more sensitive to changes in the data or incorporates stronger prior beliefs. The lack of perfect correlation between the two measures suggests that they are capturing different types of uncertainty. The "m" unit on the x-axis likely represents a distance or time interval, implying that the sampling is occurring over a spatial or temporal domain. The chart could be used to analyze the information content or predictability of a signal or process, and to compare the effectiveness of different surprise measures in capturing its characteristics. The oscillatory nature of the data suggests a cyclical or wave-like pattern, which could be further investigated to understand the underlying dynamics of the system.
</details>
<details>
<summary>x10.png Details</summary>

### Visual Description
## Line Chart: Mutual Information Surprise
### Overview
The image presents a line chart illustrating the relationship between the number of samplings (m) and the Mutual Information Surprise. A shaded region represents the MIS Bound. The chart aims to visualize how the Mutual Information Surprise changes as the number of samples increases, and how it relates to a defined bound.
### Components/Axes
* **X-axis:** Number of Samplings (m), ranging from approximately 0 to 100.
* **Y-axis:** Mutual Information Surprise, ranging from approximately -0.15 to 0.45.
* **Data Series:**
* "Mutual Information Surprise" - Represented by a green line.
* **Legend:** Located in the top-right corner.
* "Mutual Information Surprise" - Green color.
* "MIS Bound" - Gray color.
### Detailed Analysis
The green line representing "Mutual Information Surprise" starts at approximately -0.05 at m=0. The line initially slopes upward, reaching a value of approximately 0.1 at m=10. It then plateaus around 0.25 between m=20 and m=40, with some fluctuations. After m=40, the line resumes an upward trend, reaching approximately 0.42 at m=100.
The "MIS Bound" is represented by a gray shaded region. It starts at approximately -0.1 at m=0, widens as m increases, and reaches a maximum height of approximately 0.3 at m=100. The lower bound of the shaded region remains consistently negative, around -0.15, throughout the entire range of m.
Here's a breakdown of approximate data points:
| Number of Samplings (m) | Mutual Information Surprise |
|---|---|
| 0 | -0.05 |
| 10 | 0.10 |
| 20 | 0.20 |
| 30 | 0.24 |
| 40 | 0.26 |
| 50 | 0.28 |
| 60 | 0.30 |
| 70 | 0.34 |
| 80 | 0.38 |
| 90 | 0.40 |
| 100 | 0.42 |
### Key Observations
* The Mutual Information Surprise generally increases with the number of samplings.
* There's a period of relative stability in the Mutual Information Surprise between m=20 and m=40.
* The Mutual Information Surprise remains within the MIS Bound throughout the observed range of samplings.
* The MIS Bound widens as the number of samplings increases, indicating a greater potential range for the Mutual Information Surprise.
### Interpretation
The chart suggests that as more samples are taken, the Mutual Information Surprise increases, indicating a greater amount of information gained. The MIS Bound provides a theoretical limit or range for the expected Mutual Information Surprise. The fact that the Mutual Information Surprise stays within the bound suggests that the observed behavior is consistent with the theoretical model. The plateau between m=20 and m=40 could indicate a point of diminishing returns, where additional samples provide less incremental information. The widening of the MIS Bound with increasing samples suggests that the uncertainty around the Mutual Information Surprise grows as more data is collected, potentially due to the complexity of the underlying system. This could be a visualization of a convergence property, where the information gain slows down as the system approaches a stable state.
</details>
Figure 5: Surprise measures during noise decrease.
Scenario 6: Discovery of New Output Values.
We modify the function in the unexplored region ( $x>30$ ) to $y = x \bmod 10 - 10$ , introducing a different behavior while keeping the original function unchanged in $[0,30]$ .
Expected behavior: A competent surprise measure should register this new structure as a meaningful discovery. This mirrors the novel discovery case in Section 3.3, and we expect MIS to exceed its upper bound.
Figure 6 shows MIS sharply exceeding its expected trajectory, signaling successful identification of a structural shift. Shannon and Bayesian Surprises again fail to provide consistent or interpretable responses.
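The piecewise response of this scenario can be sketched as follows. This is illustrative only: it assumes the original modulus function of Eq. (7) is $y = x \bmod 10$ (consistent with outputs in $[0,9]$), and the function name is hypothetical.

```python
def scenario6_output(x):
    """Piecewise response for Scenario 6: the original modulus map on
    [0, 30] (assumed here to be y = x mod 10, per Eq. (7)), and the
    shifted variant y = x mod 10 - 10 in the unexplored region x > 30."""
    if x <= 30:
        return x % 10       # original behavior, outputs in [0, 9]
    return x % 10 - 10      # novel behavior, outputs in [-10, -1]
```

The new region produces output values never seen in $[0,30]$, so a competent surprise measure should flag the first samples beyond $x=30$ as a genuine structural discovery rather than noise.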
<details>
<summary>x11.png Details</summary>

### Visual Description
## Line Chart: Shannon and Bayesian Surprises
### Overview
The image presents a line chart comparing Shannon Surprise and Bayesian Surprise over a number of explorations (m). The chart displays the fluctuations of both surprise metrics as the number of explorations increases from 0 to 100. The chart has a grid background for easier readability.
### Components/Axes
* **Title:** "Shannon and Bayesian Surprises" - positioned at the top-center of the chart.
* **X-axis:** "Number of Explorations (m)" - ranging from 0 to 100, with tick marks at intervals of 10.
* **Left Y-axis:** "Shannon Surprise" - ranging from 0 to 8, with tick marks at intervals of 2.
* **Right Y-axis:** "Bayesian Surprise" - ranging from 0 to 14, with tick marks at intervals of 2.
* **Legend:** Located in the top-right corner.
* "Shannon Surprise" - represented by a dashed blue line.
* "Bayesian Surprise" - represented by a solid red line.
### Detailed Analysis
**Shannon Surprise (Blue Dashed Line):**
The Shannon Surprise line exhibits a highly fluctuating pattern. It starts at approximately 0 at m=0, rises sharply to a peak of around 7.5 at m=18, then drops back down to near 0. It continues to oscillate between approximately 0 and 6.5 for the remainder of the exploration range. The line shows multiple peaks and valleys, indicating significant changes in Shannon Surprise with each exploration.
* m=0: ~0
* m=5: ~2.5
* m=10: ~3.5
* m=15: ~6
* m=18: ~7.5 (Peak)
* m=20: ~1.5
* m=25: ~0.5
* m=30: ~1
* m=35: ~2
* m=40: ~3
* m=45: ~4.5
* m=50: ~1.5
* m=55: ~5
* m=60: ~6.5
* m=65: ~2
* m=70: ~3.5
* m=75: ~4
* m=80: ~1
* m=85: ~2.5
* m=90: ~0.5
* m=95: ~1
* m=100: ~0
**Bayesian Surprise (Red Solid Line):**
The Bayesian Surprise line also fluctuates, but generally remains lower than the Shannon Surprise. It starts at approximately 0 at m=0, rises to a peak of around 13 at m=18, then declines. It oscillates between approximately 0 and 5 for the majority of the exploration range, with a few smaller peaks.
* m=0: ~0
* m=5: ~1.5
* m=10: ~2.5
* m=15: ~4.5
* m=18: ~13 (Peak)
* m=20: ~3
* m=25: ~1
* m=30: ~1.5
* m=35: ~2.5
* m=40: ~3.5
* m=45: ~3
* m=50: ~1.5
* m=55: ~3
* m=60: ~3.5
* m=65: ~1.5
* m=70: ~2
* m=75: ~2.5
* m=80: ~1
* m=85: ~1.5
* m=90: ~0.5
* m=95: ~1
* m=100: ~0.5
### Key Observations
* Both Shannon and Bayesian Surprises exhibit significant fluctuations as the number of explorations increases.
* The Shannon Surprise generally has higher values and more pronounced peaks than the Bayesian Surprise.
* Both metrics reach their highest values around m=18, suggesting a significant change or discovery at that point in the exploration process.
* After m=20, both metrics tend to stabilize, oscillating around lower average values.
### Interpretation
The chart suggests that the initial stages of exploration (up to approximately m=20) are characterized by high uncertainty and significant information gain, as reflected in the high surprise values for both metrics. The peak at m=18 indicates a particularly impactful discovery or change in the system being explored. As the exploration progresses, the surprise values decrease, indicating that the system becomes more predictable and less novel information is revealed.
The difference between Shannon and Bayesian Surprise could be interpreted as a difference in how they quantify uncertainty. Shannon Surprise focuses on the probability of an event, while Bayesian Surprise incorporates prior beliefs. The fact that Shannon Surprise is consistently higher suggests that the system is more unpredictable than initially assumed, or that the prior beliefs used in the Bayesian calculation were relatively accurate.
The fluctuations after m=20 suggest that the exploration process continues to reveal new information, but at a diminishing rate. The overall trend indicates a convergence towards a more stable understanding of the system being explored. The chart provides valuable insights into the dynamics of information gain and uncertainty reduction during an exploration process.
</details>
<details>
<summary>x12.png Details</summary>

### Visual Description
## Line Chart: Mutual Information Surprise
### Overview
The image presents a line chart illustrating the relationship between the number of explorations (m) and the mutual information surprise. A shaded region represents the MIS bound. The chart aims to visualize how the mutual information surprise changes as the number of explorations increases, along with a confidence or uncertainty bound.
### Components/Axes
* **Title:** "Mutual Information Surprise" (centered at the top)
* **X-axis:** "Number of Explorations (m)" - ranging from approximately 0 to 100.
* **Y-axis:** "Mutual Information Surprise" - ranging from approximately -0.02 to 0.65.
* **Legend:** Located in the top-right corner.
* "Mutual Information Surprise" - represented by a green line.
* "MIS Bound" - represented by a gray shaded area.
* **Grid:** A light gray grid is present throughout the chart area, aiding in value estimation.
### Detailed Analysis
The chart displays a single green line representing "Mutual Information Surprise" and a gray shaded area representing the "MIS Bound".
* **Mutual Information Surprise (Green Line):**
* The line starts at approximately 0.00 at x=0.
* It increases rapidly between x=0 and x=20, reaching approximately 0.40.
* The rate of increase slows down between x=20 and x=60, reaching approximately 0.55.
* The line plateaus between x=60 and x=100, reaching a maximum value of approximately 0.63.
* **MIS Bound (Gray Shaded Area):**
* The shaded area starts at approximately -0.02 at x=0.
* It widens as the number of explorations increases, reaching a maximum width between x=40 and x=80.
* The upper bound of the shaded area remains relatively constant at approximately 0.30 after x=40.
* The lower bound of the shaded area remains relatively constant at approximately -0.02 after x=20.
### Key Observations
* The Mutual Information Surprise increases with the number of explorations, but the rate of increase diminishes over time.
* The MIS Bound provides a range of possible values for the Mutual Information Surprise, indicating uncertainty.
* The MIS Bound widens as explorations accumulate, indicating growing uncertainty in the theoretical range.
* The Mutual Information Surprise appears to converge towards a stable value as the number of explorations increases.
### Interpretation
The chart suggests that increasing the number of explorations initially leads to a significant increase in mutual information surprise. However, as the number of explorations grows, the marginal benefit of additional explorations decreases. The MIS bound indicates the range of plausible values for the mutual information surprise, reflecting the inherent uncertainty in the process. The convergence of the Mutual Information Surprise line suggests that there is a limit to the amount of information that can be gained through further exploration. This could be due to the inherent limitations of the system being explored or the diminishing returns of continued exploration. The widening of the MIS bound as explorations accumulate reflects a growing range of plausible surprise values, even as the observed surprise itself levels off. This is a common pattern in information-gathering processes, where initial explorations provide the most significant gains in understanding.
</details>
Figure 6: Surprise measures when exploring a new region with novel outputs.
Summary
Across all scenarios, MIS reliably indicates whether the system is genuinely learning, stagnating, or encountering degradation. It responds to the structure and value of observations, more than just novelty. In contrast, Shannon and Bayesian Surprises often react to superficial fluctuations and display numerical instability. Furthermore, the MIS progression bound remains consistent and interpretable across all scenarios, while Shannon and Bayesian Surprises lack a universal scale or threshold, as reflected by their inconsistent magnitudes across Figures 1 through 6. This inconsistency limits their effectiveness as a reliable trigger. Overall, this simulation study demonstrates MIS not only as a novel metric for quantifying surprise, but also as a more trustworthy indicator of learning dynamics, making it a promising tool for autonomous system monitoring.
### 4.2 Pollution Estimation: A Case Study
To demonstrate the practical utility of our proposed MIS reaction policy, we apply it to a real-time pollution map estimation scenario. We evaluate the impact of integrating the MIS reaction policy on system performance in a dynamic, non-stationary environment. Specifically, we compare two approaches: a selection of baseline sampling strategies and the same strategies governed by our MIS reaction policy.
### Dataset: Dynamic Pollution Maps
We utilize a synthetic pollution simulation dataset comprising $450$ time frames, each representing a $50\times 50$ pollution grid. Initially, the environment contains $3$ pollution sources, each emitting high pollution at a fixed level. The rest of the field exhibits moderate and random pollution values. Over time, the pollution levels across the entire field evolve due to natural diffusion, decay, and wind effects. Moreover, every $50$ frames, a new pollution source is added to the field at a random location. These new sources elevate the overall pollution levels and alter the input-output relationship between the spatial coordinates and the pollution intensity. Figure 7 displays snapshots of the pollution map at two intermediate time points. The simulation details for the dynamic pollution map generation are provided in the Appendix.
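The actual simulation details are in the Appendix; purely as an illustration, a generator with the stated ingredients (a $50\times 50$ grid, fixed emitting sources, diffusion, decay, a wind drift, and a new random source every $50$ frames) might look like the sketch below. All parameter values and the specific diffusion/wind mechanics are assumptions, not the paper's settings.

```python
import numpy as np

def simulate_pollution(frames=450, size=50, init_sources=3, source_level=9.0,
                       decay=0.995, mix=0.2, seed=0):
    """Illustrative dynamic pollution-map generator (parameters assumed)."""
    rng = np.random.default_rng(seed)
    field = rng.uniform(3.0, 6.0, (size, size))          # moderate random background
    sources = [tuple(rng.integers(0, size, 2)) for _ in range(init_sources)]
    maps = []
    for t in range(frames):
        if t > 0 and t % 50 == 0:                        # new source every 50 frames
            sources.append(tuple(rng.integers(0, size, 2)))
        for r, c in sources:
            field[r, c] = source_level                   # fixed emission level
        # diffusion: blend each cell with the mean of its 4-neighborhood
        nbr = (np.roll(field, 1, 0) + np.roll(field, -1, 0)
               + np.roll(field, 1, 1) + np.roll(field, -1, 1)) / 4.0
        field = (1 - mix) * field + mix * nbr
        field = decay * field                            # natural decay
        field = np.roll(field, 1, axis=1)                # wind: drift one column
        maps.append(field.copy())
    return maps
```

Each appended source changes the spatial input-output relationship, which is exactly the kind of non-stationarity the MIS reaction policy is meant to detect.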
<details>
<summary>x13.png Details</summary>

### Visual Description
## Heatmaps: Pollution Maps at Time 150 and Time 350
### Overview
The image presents two heatmaps, side-by-side, visualizing pollution levels. The left heatmap represents pollution at Time 150, while the right heatmap shows pollution at Time 350. Both maps use a color gradient to represent pollution levels, with warmer colors (reds and oranges) indicating higher pollution and cooler colors (blues and purples) indicating lower pollution. The maps appear to represent a 2D space, likely a geographical area, discretized into a grid.
### Components/Axes
Both heatmaps share the following characteristics:
* **Title:** "Pollution Map - Time [Value]" (Time 150 for the left, Time 350 for the right). Positioned at the top-center of each map.
* **Axes:** Both axes range from 0 to 49. The x-axis is horizontal, and the y-axis is vertical. They are not explicitly labeled, but represent coordinates within the mapped area.
* **Colorbar:** A vertical colorbar is present on the right side of each heatmap.
* **Left Heatmap (Time 150):** Colorbar ranges from approximately 5.2 (purple) to 6.4 (red). Labeled "Pollution Level".
* **Right Heatmap (Time 350):** Colorbar ranges from approximately 6.2 (purple) to 7.4 (red). Labeled "Pollution Level".
### Detailed Analysis or Content Details
**Left Heatmap (Time 150):**
* **Dominant Feature:** A significant concentration of high pollution (red/orange) is located around coordinates (25, 15).
* **Secondary Feature:** A smaller area of moderate pollution (yellow/orange) is present near coordinates (5, 40).
* **Low Pollution Areas:** The bottom-left corner (around 0,0) and the top-right corner (around 45, 45) exhibit lower pollution levels (blue/purple).
* **Approximate Pollution Levels (based on colorbar):**
* (25, 15): ~6.2 - 6.4
* (5, 40): ~5.8 - 6.0
* (0, 0): ~5.2 - 5.4
* (45, 45): ~5.4 - 5.6
**Right Heatmap (Time 350):**
* **Dominant Features:** Two major concentrations of high pollution (red/orange) are visible. One is around (25, 15), similar to the left heatmap, but slightly more dispersed. The other is around (40, 5).
* **Secondary Feature:** A moderate pollution area (yellow/orange) is present near coordinates (5, 40).
* **Low Pollution Areas:** The bottom-right corner (around 45, 40) exhibits lower pollution levels (blue/purple).
* **Approximate Pollution Levels (based on colorbar):**
* (25, 15): ~6.8 - 7.0
* (40, 5): ~7.0 - 7.2
* (5, 40): ~6.4 - 6.6
* (45, 40): ~6.2 - 6.4
### Key Observations
* **Pollution Increase:** Overall pollution levels appear to have increased from Time 150 to Time 350. The colorbar range shifted upwards (5.2-6.4 to 6.2-7.4).
* **Persistent Hotspot:** The area around (25, 15) remains a hotspot for pollution in both timeframes, although the intensity has increased.
* **New Hotspot:** A new significant pollution concentration emerged around (40, 5) between Time 150 and Time 350.
* **Spatial Distribution:** The pollution distribution is not uniform; it's clustered in specific areas.
### Interpretation
The data suggests a dynamic pollution pattern. The increase in overall pollution levels between Time 150 and Time 350 indicates a worsening environmental situation. The persistence of the hotspot at (25, 15) suggests a consistent pollution source in that area. The emergence of a new hotspot at (40, 5) implies a new pollution source or a shift in pollution patterns.
The heatmaps likely represent a simplified model of pollution distribution. The grid-like structure suggests that the area is being monitored at discrete points. The lack of axis labels makes it difficult to determine the geographical context, but the data provides valuable insights into the spatial and temporal dynamics of pollution. Further investigation would be needed to identify the sources of pollution and understand the factors driving the observed changes. The difference in colorbar ranges between the two maps is significant, indicating a substantial increase in pollution levels over the 200-unit time difference.
</details>
Figure 7: Pollution maps at time $150$ and time $350$ .
### Sampling Strategies
As discussed in Section 3.4, the MIS reaction policy is designed to complement existing exploration–exploitation strategies. To demonstrate the effectiveness of the Mutual Information Surprise Reaction Policy (MISRP), we integrate it with three well-established sampling strategies: the surprise-reactive (SR) sampling method proposed by (?) using either Shannon or Bayesian Surprise, the subtractive clustering/entropy (SC/E) active learning strategy proposed by (?), and the greedy search/query by committee (GS/QBC) active learning strategy used in (?).
1. SR: The surprise-reactive sampling method (?) switches between exploration and exploitation modes based on observed Shannon or Bayesian Surprise. By default, SR operates in an exploration mode guided by the widely used space-filling principle (?), selecting new sampling locations via the min-max objective:
$$
\mathbf{x}^{*}=\underset{\mathbf{x}}{\operatorname{argmax}}\>\underset{\mathbf{x}_{i}\in\mathbf{X}}{\min}\>\|\mathbf{x}-\mathbf{x}_{i}\|_{2},
$$
where $\mathbf{X}$ denotes the set of existing observations. Upon encountering a surprising event (in terms of either Shannon or Bayesian Surprise), SR switches to exploitation mode, performing localized verification sampling within the neighborhood of the surprise-triggering location. This continues either for a fixed number of steps defined by an exploitation limit $t$, or until an unsurprising event occurs. If exploitation confirms that the surprise is consistent (i.e., the surprise persists until the exploitation threshold is reached), all corresponding observations are accepted and incorporated into the pollution map estimation. Conversely, if an unsurprising event arises before the threshold is reached, the surprising observations are deemed anomalous and discarded. For Shannon Surprise, we set the triggering threshold at $1.3$, corresponding to a likelihood of $5\%$. For Bayesian Surprise, we use the Postdictive Surprise and adopt the threshold of $0.5$, following (?).
MISRP: The MISRP modifies SR by dynamically adjusting the exploitation limit $t$ . When increased exploitation is needed, $t$ is incremented by $1$ . For increased exploration, $t$ is decremented by $1$ , with a lower bound of $t=1$ .
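The two SR ingredients above can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the function names and the finite candidate-set representation of the search space are our own assumptions.

```python
import numpy as np

def min_max_select(candidates, observed):
    """Space-filling min-max objective: return the candidate location that
    is farthest from its nearest existing observation."""
    # Pairwise distances, shape (n_candidates, n_observed).
    d = np.linalg.norm(candidates[:, None, :] - observed[None, :, :], axis=-1)
    return candidates[np.argmax(d.min(axis=1))]

def adjust_exploitation_limit(t, need_exploitation):
    """MISRP rule for SR: increment t for more exploitation, decrement it
    for more exploration, with a lower bound of t = 1."""
    return t + 1 if need_exploitation else max(1, t - 1)
```

In exploration mode, `min_max_select` is called once per sample; MISRP then only touches the scalar `t` that bounds how long a verification (exploitation) episode may run.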
1. SC/E: The subtractive clustering/entropy active learning strategy (?) selects the next sampling location by maximizing a custom acquisition function. For an unseen region $\mathcal{X}$ and a probabilistic predictive function $\hat{f}(\mathbf{x})$ trained on the observed data, the acquisition function is defined as:
$$
a(\mathbf{x})=(1-\eta)\mathbb{E}_{\mathbf{x}^{\prime}\in\mathcal{X}}[e^{-\|\mathbf{x}-\mathbf{x}^{\prime}\|_{2}}]+\eta H(\hat{f}(\mathbf{x})),
$$
where $\eta$ is the exploitation parameter, with a default value of $0.5$ , and $H(\hat{f}(\mathbf{x}))$ denotes the entropy of the predictive distribution at $\mathbf{x}$ . A larger value of $\eta$ emphasizes sampling at locations with high predictive uncertainty near previously seen points, promoting exploitation. A smaller value favors sampling at representative locations in the unseen region, promoting exploration (?).
MISRP: The MISRP modifies SC/E by adjusting the exploitation parameter $\eta$ . For increased exploitation, $\eta$ is increased by $0.1$ , up to a maximum of $1$ . For increased exploration, $\eta$ is decreased by $0.1$ , with a minimum of $0$ .
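A minimal sketch of the SC/E acquisition and the MISRP adjustment of $\eta$, assuming a Gaussian predictive distribution so that the entropy term has a closed form; the function names and the finite representation of the unseen region are our own choices.

```python
import numpy as np

def sce_acquisition(x, unseen, pred_std, eta=0.5):
    """(1 - eta) * E_{x' in unseen}[exp(-||x - x'||_2)] + eta * H(f_hat(x))."""
    representativeness = np.mean(np.exp(-np.linalg.norm(unseen - x, axis=1)))
    # Differential entropy of a Gaussian predictive distribution (nats).
    entropy = 0.5 * np.log(2.0 * np.pi * np.e * pred_std**2)
    return (1.0 - eta) * representativeness + eta * entropy

def adjust_eta(eta, need_exploitation, step=0.1):
    """MISRP rule for SC/E and GS/QBC: shift eta by 0.1, clipped to [0, 1]."""
    delta = step if need_exploitation else -step
    return float(np.clip(eta + delta, 0.0, 1.0))
```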
1. GS/QBC: The greedy search/query by committee active learning strategy (?) uses a different acquisition function. Given the set of seen observations $\{\mathbf{X},\mathbf{Y}\}$ and a model committee $\mathcal{F}$ composed of multiple predictive models trained on this data, the acquisition function is defined as:
$$
a(\mathbf{x})=(1-\eta)\underset{\mathbf{x}^{\prime},\mathbf{y}^{\prime}\in\mathbf{X},\mathbf{y}}{\min}\|\mathbf{x}-\mathbf{x}^{\prime}\|_{2}\|\hat{f}(\mathbf{x})-\mathbf{y}^{\prime}\|_{2}+\eta\underset{\hat{f}(\cdot),\hat{f}^{\prime}(\cdot)\in\mathcal{F}}{\max}\|\hat{f}(\mathbf{x})-\hat{f}^{\prime}(\mathbf{x})\|_{2}, \tag{8}
$$
where the first term encourages exploration by selecting points that are distant from existing observations in both input and output space. The second term promotes exploitation by targeting locations with high disagreement among models in the committee.
MISRP: The MISRP regulates the balance between exploration and exploitation in GS/QBC in the same manner as in SC/E, by adjusting the parameter $\eta$ .
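The acquisition in Eq. (8) can be sketched as below, under two assumptions of ours: the minimum runs over the observed input-output pairs $(\mathbf{x}_i, y_i)$, and the committee predictions at the candidate $\mathbf{x}$ are precomputed scalars, with the first entry taken as the main model's prediction.

```python
import numpy as np

def gsqbc_acquisition(x, X_seen, Y_seen, committee_preds, eta=0.5):
    """Eq. (8) sketch: (1 - eta) * min over seen pairs of input-distance
    times output-distance, plus eta * max pairwise committee disagreement."""
    f_x = committee_preds[0]  # main model's prediction at x (our convention)
    pair_terms = (np.linalg.norm(X_seen - x, axis=1)
                  * np.abs(f_x - np.asarray(Y_seen)))
    explore = pair_terms.min()       # far from seen data in input and output
    disagree = max(abs(a - b)        # committee disagreement at x
                   for a in committee_preds for b in committee_preds)
    return (1.0 - eta) * explore + eta * disagree
```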
### Experimental Setup
The estimation process is initialized with $10$ observed locations uniformly sampled across the pollution field. Each time frame collects $10$ new samples according to the chosen sampling strategy, representing the operation of $10$ mobile pollution sensors. The pollution field is estimated using a Gaussian Process Regressor with a Matérn kernel ( $\nu=2.5$ ) and a noise prior of $10^{-2}$ , consistently applied across all strategies. The model predicts pollution levels at specified spatial locations and is updated using both current and historical data, with a maximum of $200$ observations retained to reduce computational cost.
For the GS/QBC strategy, the model committee additionally includes regressors with a Matérn $\nu=1.5$ kernel and a Gaussian kernel with bandwidth $0.1$ , both using a noise prior of $10^{-2}$ . These two additional models are used solely for calculating disagreement in Eq. (8) and are not employed in pollution map estimation.
Shannon and Bayesian Surprise are computed following the procedure described in Section 4.1. For MIS calculations, we discretize the range of pollution values observed in the data into $100$ bins to estimate entropy.
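The binned entropy estimate used in the MIS calculation can be sketched as follows; the helper name is ours, and the plug-in (MLE) form is an assumption consistent with the estimator discussed in the appendix.

```python
import numpy as np

def binned_entropy(values, n_bins=100):
    """Plug-in entropy (nats) after discretizing values into n_bins bins."""
    counts, _ = np.histogram(values, bins=n_bins)
    p = counts[counts > 0] / counts.sum()  # empirical pmf over occupied bins
    return float(-np.sum(p * np.log(p)))
```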
In process forking scenarios, two separate pollution map estimates, $\hat{f}_{m}$ and $\hat{f}_{n}$ , are produced for subprocesses $\mathcal{P}_{m}$ and $\mathcal{P}_{n}$ , respectively. The final pollution map estimate is formed as a weighted combination:
$$
\hat{f}=\frac{\sqrt{m}}{\sqrt{m}+\sqrt{n}}\hat{f}_{m}+\frac{\sqrt{n}}{\sqrt{m}+\sqrt{n}}\hat{f}_{n},
$$
accounting for generalization errors that scale as $\mathcal{O}(\frac{1}{\sqrt{m}})$ and $\mathcal{O}(\frac{1}{\sqrt{n}})$ , respectively (?).
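A minimal sketch of this weighted combination, assuming the two subprocess estimates are evaluated on a common grid of locations:

```python
import numpy as np

def merge_forked_estimates(f_m, f_n, m, n):
    """Combine subprocess predictions with sqrt-sample-size weights,
    matching the O(1/sqrt(m)) and O(1/sqrt(n)) error scaling."""
    w_m, w_n = np.sqrt(m), np.sqrt(n)
    return (w_m * f_m + w_n * f_n) / (w_m + w_n)
```

With $m = n$ this reduces to a simple average; the subprocess with more observations otherwise receives proportionally more weight.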
### Simulation Results
We assess performance using the mean squared error (MSE) between predicted and true pollution maps at each time step. Due to the dynamic nature of the pollution field, estimation errors exhibit substantial fluctuation. To smooth these variations, we compute a 20-frame moving average of the MSE for both vanilla and MISRP-governed strategies. The results are shown in Figure 8.
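The 20-frame smoothing can be sketched as a simple trailing moving average; the exact boundary handling is our assumption, as the text does not specify it.

```python
import numpy as np

def moving_average(errors, window=20):
    """Trailing moving average of per-frame MSE; returns
    len(errors) - window + 1 smoothed points."""
    kernel = np.ones(window) / window
    return np.convolve(errors, kernel, mode="valid")
```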
<details>
<summary>x14.png Details</summary>

### Visual Description
## Line Chart: Mean Error over Time
### Overview
The image presents a line chart illustrating the mean error over time for two different methods: one using Shannon entropy (SR with Shannon) and another using MISRP. The chart displays the error fluctuations across 450 time steps.
### Components/Axes
* **Title:** "Mean Error over Time" (centered at the top)
* **X-axis:** "Time Step" (ranging from approximately 0 to 450, with gridlines at intervals of 50)
* **Y-axis:** "Mean Error" (ranging from 0 to 10, with gridlines at intervals of 2)
* **Legend:** Located in the top-right corner.
* "Mean Error over Time (SR with Shannon)" - represented by a dashed blue line.
* "Mean Error over Time (with MISRP)" - represented by a solid red line.
### Detailed Analysis
**Mean Error over Time (SR with Shannon) - Dashed Blue Line:**
The blue line exhibits significant fluctuations throughout the time series. It generally oscillates between approximately 2 and 9.
* At Time Step 0: Approximately 2.2
* At Time Step 50: Peaks at approximately 9.5
* At Time Step 100: Peaks at approximately 9.8
* At Time Step 150: Drops to approximately 2.0
* At Time Step 200: Rises to approximately 3.0
* At Time Step 250: Drops to approximately 1.5
* At Time Step 300: Rises to approximately 4.5
* At Time Step 350: Drops to approximately 2.0
* At Time Step 400: Rises to approximately 6.5
* At Time Step 450: Approximately 4.0
**Mean Error over Time (with MISRP) - Solid Red Line:**
The red line shows a more stable trend with smaller fluctuations compared to the blue line. It generally oscillates between approximately 1.5 and 5.
* At Time Step 0: Approximately 2.0
* At Time Step 50: Approximately 1.8
* At Time Step 100: Approximately 2.5
* At Time Step 150: Drops to approximately 1.5
* At Time Step 200: Rises to approximately 2.5
* At Time Step 250: Approximately 1.8
* At Time Step 300: Approximately 2.0
* At Time Step 350: Rises to approximately 3.0
* At Time Step 400: Rises to approximately 4.5
* At Time Step 450: Approximately 5.0
### Key Observations
* The "SR with Shannon" method (blue line) exhibits much higher and more frequent error spikes than the "with MISRP" method (red line).
* The MISRP method maintains a relatively low and stable error rate throughout the observed time steps.
* The Shannon method shows a clear pattern of periodic spikes, suggesting potential instability or sensitivity to certain conditions at regular intervals.
* Both methods show an increasing trend in error towards the end of the time series (around Time Step 400-450), but the increase is far more pronounced for the Shannon method.
### Interpretation
The data suggests that the MISRP method is more robust and provides more consistent performance (lower and more stable error) compared to the SR with Shannon method. The Shannon method, while potentially offering benefits in certain scenarios, is prone to significant error spikes, indicating a higher degree of instability. The increasing error trend observed in both methods towards the end of the time series could indicate a degradation in performance over time, potentially due to factors like data drift or model limitations. The periodic spikes in the Shannon method suggest a cyclical pattern in its errors, which could be related to the characteristics of the input data or the algorithm's internal workings. Further investigation would be needed to determine the root cause of these spikes and to explore potential mitigation strategies. The chart demonstrates a clear trade-off between potential performance gains (if the Shannon method works well) and stability (which the MISRP method provides).
</details>
<details>
<summary>x15.png Details</summary>

### Visual Description
## Line Chart: Mean Error over Time
### Overview
The image presents a line chart illustrating the "Mean Error over Time" for two different methods: "SR with Bayesian" and "with MISRP". The chart displays how the mean error fluctuates across time steps, allowing for a comparison of the performance of the two methods.
### Components/Axes
* **Title:** "Mean Error over Time" (centered at the top)
* **X-axis:** "Time Step" (ranging from approximately 0 to 450, with tick marks at intervals of 50)
* **Y-axis:** "Mean Error" (ranging from 0 to 10, with tick marks at intervals of 2)
* **Legend:** Located in the top-left corner.
* "Mean Error over Time (SR with Bayesian)" - represented by a dashed blue line.
* "Mean Error over Time (with MISRP)" - represented by a solid red line.
### Detailed Analysis
**SR with Bayesian (Blue Dashed Line):**
The blue line exhibits significant fluctuations throughout the time steps. It generally starts around a mean error of 1.5 at Time Step 0. The line increases to a peak of approximately 7 at Time Step 80, then decreases to around 2 at Time Step 150. It rises again to a peak of approximately 6 at Time Step 200, followed by a decline to around 2 at Time Step 300. The line then fluctuates between 2 and 4 until Time Step 400, where it spikes to approximately 10, and then decreases to around 5 at Time Step 450.
* Time Step 0: Mean Error ≈ 1.5
* Time Step 50: Mean Error ≈ 2.5
* Time Step 80: Mean Error ≈ 7.0
* Time Step 100: Mean Error ≈ 3.0
* Time Step 150: Mean Error ≈ 2.0
* Time Step 200: Mean Error ≈ 6.0
* Time Step 250: Mean Error ≈ 3.0
* Time Step 300: Mean Error ≈ 2.0
* Time Step 350: Mean Error ≈ 3.0
* Time Step 400: Mean Error ≈ 10.0
* Time Step 450: Mean Error ≈ 5.0
**with MISRP (Red Solid Line):**
The red line shows a much more stable pattern with lower mean error values compared to the blue line. It starts around a mean error of 1 at Time Step 0 and generally fluctuates between 1 and 2.5 throughout the majority of the time steps. There is a slight increase around Time Step 250, reaching approximately 2.5, and another around Time Step 400, reaching approximately 2.7.
* Time Step 0: Mean Error ≈ 1.0
* Time Step 50: Mean Error ≈ 1.2
* Time Step 100: Mean Error ≈ 1.5
* Time Step 150: Mean Error ≈ 1.8
* Time Step 200: Mean Error ≈ 2.0
* Time Step 250: Mean Error ≈ 2.5
* Time Step 300: Mean Error ≈ 1.5
* Time Step 350: Mean Error ≈ 1.7
* Time Step 400: Mean Error ≈ 2.7
* Time Step 450: Mean Error ≈ 2.3
### Key Observations
* The "SR with Bayesian" method exhibits significantly higher and more volatile mean error values compared to the "with MISRP" method.
* The "with MISRP" method maintains a relatively stable and low mean error throughout the observed time steps.
* There is a large spike in the mean error for the "SR with Bayesian" method at Time Step 400, indicating a potential issue or outlier event.
### Interpretation
The chart demonstrates that the "with MISRP" method consistently outperforms the "SR with Bayesian" method in terms of minimizing mean error over time. The "SR with Bayesian" method is prone to larger errors and greater fluctuations, suggesting it may be less robust or sensitive to certain conditions. The spike at Time Step 400 for the "SR with Bayesian" method warrants further investigation to understand the cause of this significant error increase. The consistent low error of the MISRP method suggests it is a more reliable and stable approach for this particular task. The data suggests that the MISRP method is a better choice for minimizing error in this context.
</details>
<details>
<summary>x16.png Details</summary>

### Visual Description
## Line Chart: Mean Error over Time
### Overview
This image presents a line chart illustrating the mean error over time for two different conditions: one without MISRP (denoted as (SC/E)) and one with MISRP. The chart displays how the mean error fluctuates across time steps, allowing for a comparison of the performance of the two conditions.
### Components/Axes
* **Title:** "Mean Error over Time" (centered at the top)
* **X-axis:** "Time Step" (ranging from approximately 0 to 450, with gridlines at intervals of 50)
* **Y-axis:** "Mean Error" (ranging from 0 to 10, with gridlines at intervals of 2)
* **Legend:** Located in the top-left corner.
* "Mean Error over Time (SC/E)" - represented by a dashed blue line.
* "Mean Error over Time (with MISRP)" - represented by a solid red line.
### Detailed Analysis
The chart shows two fluctuating lines representing the mean error over time.
**Line 1: Mean Error over Time (SC/E) - Dashed Blue Line**
This line generally fluctuates between approximately 1 and 5, with several peaks and valleys.
* At Time Step 0, the mean error is approximately 1.2.
* Around Time Step 50, the mean error rises to a peak of approximately 4.5.
* Around Time Step 150, the mean error is approximately 2.5.
* Around Time Step 250, the mean error is approximately 2.0.
* Around Time Step 350, the mean error is approximately 2.2.
* Around Time Step 400, the mean error spikes dramatically to approximately 8.0.
* At Time Step 450, the mean error decreases to approximately 3.5.
**Line 2: Mean Error over Time (with MISRP) - Solid Red Line**
This line also fluctuates, but generally remains lower than the blue line, especially after Time Step 200.
* At Time Step 0, the mean error is approximately 2.0.
* Around Time Step 50, the mean error is approximately 2.2.
* Around Time Step 150, the mean error is approximately 1.8.
* Around Time Step 250, the mean error is approximately 1.5.
* Around Time Step 350, the mean error is approximately 1.7.
* Around Time Step 400, the mean error begins to increase, reaching approximately 4.5 at Time Step 450.
### Key Observations
* The "with MISRP" condition (red line) generally exhibits lower mean error values than the "SC/E" condition (blue line) for most of the time steps.
* Both conditions show significant fluctuations in mean error over time.
* A major outlier occurs around Time Step 400, where the "SC/E" condition experiences a substantial spike in mean error.
* The red line shows a clear upward trend in the final portion of the chart, while the blue line fluctuates more erratically.
### Interpretation
The data suggests that incorporating MISRP generally leads to a reduction in mean error compared to the "SC/E" condition. The significant spike in mean error for the "SC/E" condition around Time Step 400 indicates a potential instability or issue specific to that condition at that point in time. The increasing trend of the red line towards the end of the chart suggests that the benefits of MISRP might diminish or become less effective as time progresses, or that a new factor is influencing the error. The fluctuations in both lines indicate that the system is sensitive to changes over time, and that the mean error is not consistently stable. Further investigation is needed to understand the cause of the spike at Time Step 400 and the increasing trend of the red line. The chart demonstrates the effectiveness of MISRP in reducing error, but also highlights the need for ongoing monitoring and potential adjustments to maintain optimal performance.
</details>
<details>
<summary>x17.png Details</summary>

### Visual Description
## Line Chart: Mean Error over Time
### Overview
This image presents a line chart comparing the mean error over time for two different methods: GS/QBC and a method utilizing MISRP. The chart displays the mean error on the y-axis against the time step on the x-axis.
### Components/Axes
* **Title:** "Mean Error over Time" (centered at the top)
* **X-axis Label:** "Time Step" (bottom-center)
* Scale: 0 to 450, with tick marks at intervals of approximately 50.
* **Y-axis Label:** "Mean Error" (left-center)
* Scale: 0 to 10, with tick marks at intervals of 2.
* **Legend:** Located in the top-right corner.
* "Mean Error over Time (GS/QBC)" - represented by a dashed blue line.
* "Mean Error over Time (with MISRP)" - represented by a solid red line.
* **Gridlines:** Horizontal and vertical gridlines are present to aid in reading values.
### Detailed Analysis
**Data Series 1: Mean Error over Time (GS/QBC) - Dashed Blue Line**
The blue line exhibits significant fluctuations over time.
* At Time Step 0, the mean error is approximately 1.2.
* The line increases to a peak of approximately 6.0 at Time Step 80.
* It then decreases to a low of approximately 1.0 at Time Step 150.
* Another peak is observed around Time Step 220, reaching approximately 4.5.
* The line remains relatively stable between Time Steps 250 and 350, fluctuating between 2.0 and 3.0.
* A final, large peak occurs around Time Step 400, reaching approximately 7.5, before decreasing to approximately 3.5 at Time Step 450.
**Data Series 2: Mean Error over Time (with MISRP) - Solid Red Line**
The red line also fluctuates, but generally remains lower than the blue line.
* At Time Step 0, the mean error is approximately 2.0.
* The line dips to a low of approximately 1.2 at Time Step 50.
* It increases to a peak of approximately 3.0 at Time Step 100.
* The line remains relatively stable between Time Steps 150 and 300, fluctuating between 1.5 and 2.5.
* Around Time Step 350, the line begins to increase, reaching a peak of approximately 4.2 at Time Step 400.
* It then decreases to approximately 3.0 at Time Step 450.
### Key Observations
* The GS/QBC method (blue line) consistently exhibits higher mean errors than the method with MISRP (red line) across most of the time steps.
* Both methods experience peaks in mean error at approximately Time Steps 80, 220, and 400, suggesting potential common issues or events causing increased error.
* The MISRP method appears to dampen the fluctuations in mean error, resulting in a more stable performance.
* The largest difference between the two methods occurs around Time Step 400, where the GS/QBC method's error is significantly higher.
### Interpretation
The data suggests that incorporating MISRP into the method leads to a reduction in mean error and improved stability compared to the GS/QBC method. The consistent lower error values and dampened fluctuations indicate that MISRP effectively mitigates some of the factors contributing to errors in the process. The recurring peaks in both lines suggest that there are specific time steps or events that consistently introduce errors, regardless of the method used. Further investigation into the nature of these events could lead to further improvements in both methods. The large error spike at Time Step 400 for the GS/QBC method warrants particular attention, as it represents a significant performance degradation. This could be due to a specific condition or input encountered at that time step.
</details>
Figure 8: Moving average estimation error over time. Top-Left: SR with Shannon Surprise. Top-Right: SR with Bayesian Surprise. Bottom-Left: SC/E. Bottom-Right: GS/QBC.
Across all comparisons, the baseline strategies display considerable volatility. In contrast, MISRP-governed counterparts produce smoother and consistently lower error curves, highlighting the stabilizing effect of MIS through its ability to facilitate adaptive responses in dynamic environments.
Table 2 presents the average estimation errors and their corresponding standard errors, measured across $10$ Monte Carlo simulations and $450$ frames. Across all sampling strategies, incorporating the MIS reaction policy yields a substantial reduction in both mean estimation error and variability. Improvements in estimation error range from $24\%$ to $76\%$, while reductions in standard error range from $36\%$ to $90\%$.
To further illustrate the advantage of MISRP, we increase the per-frame sampling budget and the initial number of observed locations of the baseline strategies from $10$ to $25$ , and expand the total memory buffer from $200$ to $500$ , in order to assess whether baseline strategies can match the performance of MISRP-governed approaches. Table 3 compares the estimation error of MISRP-governed strategies (maintaining the original sampling budget of $10$ ) against the enhanced baseline strategies. Even with a $2.5\times$ increase in sampling budget, the baseline strategies remain significantly outperformed by their MISRP-governed counterparts.
Table 2: Comparison of pollution map estimation errors: baseline sampling strategies versus MISRP-governed strategies.
| Sampling Strategy | Baseline | MISRP-Governed | Error Reduction | Std. Error Reduction |
| --- | --- | --- | --- | --- |
| SR with Shannon | $6.64\pm 0.436$ | $\mathbf{1.60\pm 0.043}$ | $76\%$ | $90\%$ |
| SR with Bayesian | $2.79\pm 0.096$ | $\mathbf{0.87\pm 0.016}$ | $69\%$ | $83\%$ |
| SC/E | $2.02\pm 0.071$ | $\mathbf{1.53\pm 0.045}$ | $24\%$ | $36\%$ |
| GS/QBC | $2.07\pm 0.071$ | $\mathbf{1.49\pm 0.039}$ | $28\%$ | $45\%$ |
Table 3: Error Comparison under Extended Sampling for Baseline Strategies.
| Sampling Strategy | Estimation Error (MISRP-Governed, Budget $10$) | Estimation Error (Baseline, Budget $25$) |
| --- | --- | --- |
| SR with Shannon | $\mathbf{1.60}$ | $6.23$ |
| SR with Bayesian | $\mathbf{0.87}$ | $2.72$ |
| SC/E | $\mathbf{1.53}$ | $1.89$ |
| GS/QBC | $\mathbf{1.49}$ | $2.00$ |
So far, we have demonstrated that governing basic sampling strategies with MISRP can substantially enhance learning performance in dynamic environments. To provide a clearer view of how MISRP operates over time, we conduct an additional simulation examining its actions throughout the process.
In this experiment, we simulate a two-phase pollution map evolution governed by the same PDE used in the earlier simulations. During the first phase (time $0$–$250$), three pollution sources emit high levels of pollutants, and the map evolves under diffusion, decay, and wind effects. At time step $250$, the emission sources are removed, and the decay factor is reduced to one-twentieth of its original value. The system then continues evolving for an additional $50$ steps.
While the pollution sources exist and are emitting (the dynamic phase, time $0$–$250$), the underlying process is non-stationary, and we expect frequent MIS triggering. Once the pollution sources are gone (the stationary phase, time $251$–$300$), the pollutants gradually diffuse toward a stationary state, during which MIS is expected to stop being triggered.
<details>
<summary>x18.png Details</summary>

### Visual Description
## Line Chart: Error Progression with Action Overlays
### Overview
This chart depicts the progression of mean error over time, comparing two methods (MISRP and SR with Shannon) and overlaying action events (Fork->Merge span and Adjustment event). The x-axis represents time steps, and the y-axis represents the mean error. The chart visualizes how error changes over time and how specific actions correlate with these changes.
### Components/Axes
* **Title:** Error Progression with Action Overlays
* **X-axis:** Time Step (Scale: 0 to 300, increments of approximately 25)
* **Y-axis:** Mean Error (Scale: 0 to 50, increments of approximately 10)
* **Legend:**
* Blue Solid Line: Mean Error Over Time (with MISRP)
* Red Dashed Line: Mean Error Over Time (SR with Shannon)
* Green Vertical Bars: Fork->Merge span
* Red Vertical Bars: Adjustment event
### Detailed Analysis
The chart displays three distinct data series and two types of action overlays.
**1. Mean Error Over Time (with MISRP) - Blue Solid Line:**
This line generally remains low, fluctuating around the 0-5 error range. It exhibits a slight upward trend from time step 0 to approximately 20, then stabilizes. There's a small dip around time step 40 (approximately error value of 2), and another around time step 190 (approximately error value of 1). The line decreases sharply after time step 250, approaching 0.
**2. Mean Error Over Time (SR with Shannon) - Red Dashed Line:**
This line shows significantly higher error values and more pronounced fluctuations. It starts at approximately 15 error at time step 0, rises to a peak of around 45-50 error at time step 80, then drops sharply. It then oscillates between approximately 10 and 40 error for the remainder of the chart, with a final sharp decrease after time step 250, approaching 0.
**3. Fork->Merge span - Green Vertical Bars:**
These bars are consistently present throughout the chart, appearing at regular intervals, approximately every 10-15 time steps. They indicate the frequent occurrence of Fork->Merge spans.
**4. Adjustment event - Red Vertical Bars:**
These bars appear less frequently than the green bars, occurring at time steps approximately 80, 160, and 250. They signify the occurrence of Adjustment events.
**Specific Data Points (Approximate):**
* Time Step 0: MISRP ~ 2, SR ~ 15
* Time Step 40: MISRP ~ 2, SR ~ 10
* Time Step 80: MISRP ~ 3, SR ~ 50
* Time Step 120: MISRP ~ 4, SR ~ 30
* Time Step 160: MISRP ~ 2, SR ~ 40
* Time Step 200: MISRP ~ 3, SR ~ 20
* Time Step 240: MISRP ~ 1, SR ~ 10
* Time Step 280: MISRP ~ 0, SR ~ 0
### Key Observations
* The MISRP method consistently exhibits lower mean error compared to the SR with Shannon method.
* Adjustment events (red bars) often coincide with peaks in the SR with Shannon error curve.
* Fork->Merge spans (green bars) do not appear to have a strong correlation with either error curve.
* Both methods show a significant reduction in error towards the end of the time period (after time step 250).
### Interpretation
The data suggests that the MISRP method is more effective at maintaining low error rates than the SR with Shannon method. The frequent occurrence of Fork->Merge spans doesn't seem to significantly impact error levels, while Adjustment events appear to be associated with increased error in the SR with Shannon method. The sharp decrease in error for both methods towards the end of the time period could indicate a convergence or stabilization of the system.
The correlation between Adjustment events and error spikes in the SR with Shannon method suggests that these events may introduce instability or require further refinement in the SR with Shannon approach. The consistent low error of MISRP suggests it is more robust to these events, or handles them more effectively.
The chart provides valuable insights into the performance of these two methods and highlights potential areas for improvement in the SR with Shannon approach. Further investigation into the nature of Adjustment events and their impact on error could lead to more effective error mitigation strategies.
</details>
Figure 9: A visualization of estimation error progression with MISRP action overlays.
Figure 9 shows the estimation error progression with action overlays under surprise-reactive sampling based on Shannon surprise. Recall from Section 3.4 that there are two actions employed in MISRP governance: sampling adjustments and process forking. These two actions are marked as red vertical lines and green shaded regions in the plot, respectively. For clarity, we present the $20$ -frame moving average of estimation error, whereas the unsmoothed version is provided in the Appendix. Actions are displayed $20$ steps in advance, corresponding to their first observable effect on the smoothed error trajectory.
Several key observations emerge from the figure. First, both sampling adjustments and process forking occur frequently during the dynamic phase as expected, highlighting the effectiveness of MISRP's action design in maintaining low estimation error. Second, sudden spikes in estimation error (circled) under MISRP governance are almost always followed by corrective actions that prevent further error growth, resulting in non-smooth error progressions after intervention. By contrast, the baseline sampling strategy allows estimation error to rise unchecked. Finally, once the system enters the stationary phase, MISRP ceases intervention, aligning with the intuition that a balanced sampling strategy in a well-regulated system should not trigger Mutual Information Surprise.
## 5 Conclusion
In this work, we reimagined the concept of surprise as a mechanism for fostering understanding, rather than merely detecting anomalies. Traditional definitionsâsuch as Shannon and Bayesian Surprisesâfocus on single-instance deviations and belief updates, yet fail to capture whether a system is truly growing in its understanding over time. By introducing Mutual Information Surprise (MIS), we proposed a new framework that reframes surprise as a reflection of learning progression, grounded in mutual information growth.
We developed a formal test sequence to monitor deviations in estimated mutual information, and introduced a reaction policy, MISRP, that transforms surprise into actionable system behavior. Through a synthetic case study and a real-time pollution map estimation task, we demonstrated that MIS governance offers clear advantages over conventional sampling strategies. Our results show improved stability, better responsiveness to environmental drift, and significant reductions in estimation error. These findings affirm MIS as a robust and adaptive supervisory signal for autonomous systems.
Looking forward, this work opens several promising directions for future research. A natural next step is the development of a continuous space formulation of mutual information surprise, enabling its application in large complex systems. Another direction involves designing a specialized reaction policy âone that incorporates a sampling strategy tailored directly to the structure and signals of MIS, rather than relying on existing sampling strategies. This could enhance efficiency and responsiveness in highly dynamic or resource-constrained systems. Moreover, pairing MIS with physical probing capability for specific physical systems could unlock the true potential of MIS, as MIS provides new perspectives in system characterization compared to traditional measures.
## Appendix
The appendix is organized as follows. In the first section, we present empirical evidence supporting our claim in Section 3.2 that standard deviation-based tests are overly permissive. In the second section, we provide the derivation of the standard deviation-based test for mutual information. In the third section, we provide the proof of Theorem 1. The fourth section details the simulation setup for dynamic pollution map generation. In the fifth section, we provide the pseudocode for the surprise-reactive (SR) sampling strategy (?) to facilitate reproducibility.
## MLE Mutual Information Estimator Standard Deviation
In Section 3.2, we discussed the limitations of standard deviation-based tests. Specifically, the tightest currently known distribution-agnostic bound for the standard deviation of a maximum likelihood estimator (MLE) of mutual information with $n$ observations is given by (?)
$$
\sigma\lesssim\frac{\log n}{\sqrt{n}}.
$$
Despite being the best available result, this bound is still too loose.
To empirically verify this statement, we perform a simple simulation as follows. We construct variable pairs $(x,y)$ where $y=x\;\text{mod}\;10$ , in the same manner as the simulation in Section 4.1. The variable $x$ is generated as random integers sampled from randomly generated probability mass functions over the domain $[0,100]$ . We generate $100$ such probability mass functions. For each probability mass function, we generate $3,000$ pairs of $(x,y)$ , repeat the process using $10$ Monte Carlo simulations, and compute the standard deviation of the MLE mutual information estimates over the $10$ simulations for varying numbers of $(x,y)$ pairs $n$ . We then plot the average standard deviation across the $100$ different probability mass functions as a function of $n$ versus the estimation bound shown in Eq. (5). The results are shown in Figure 10.
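A minimal sketch of this simulation is given below. The Dirichlet draw used to generate a random probability mass function, the single pmf shown (the paper uses $100$ of them), and the random seed are illustrative assumptions.

```python
import numpy as np

def plugin_mi(x, y):
    """Plug-in (MLE) mutual information estimate, in nats."""
    def H(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log(p))
    joint = [f"{a},{b}" for a, b in zip(x, y)]   # encode joint symbols
    return H(x) + H(y) - H(joint)

rng = np.random.default_rng(0)
pmf = rng.dirichlet(np.ones(101))                # random pmf over {0,...,100}
for n in [100, 500, 1000, 3000]:
    reps = []
    for _ in range(10):                          # 10 Monte Carlo replicates
        x = rng.choice(101, size=n, p=pmf)
        reps.append(plugin_mi(x, x % 10))        # y = x mod 10
    print(f"n={n:5d}  empirical std={np.std(reps):.4f}  "
          f"bound={np.log(n) / np.sqrt(n):.4f}")
```

Comparing the printed columns reproduces the qualitative gap shown in Figure 10: the empirical standard deviation sits well below the $\log n/\sqrt{n}$ bound.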
Figure 10: Empirical standard deviation of MLE mutual information estimates vs. the current tightest bound.
We observe that the current bound for the standard deviation of the mutual information estimate, computed using Eq. (5), is significantly larger than the empirical average standard deviation. This empirical observation supports our claim in Section 3.2 that the test in Eq. (6) is rarely violated in practice.
## Standard Deviation Test Derivation
First, recall that the estimation standard deviation satisfies
$$
\sigma\lesssim\frac{\log n}{\sqrt{n}}.
$$
Therefore, we treat this worst-case scenario as the baseline when deriving the test of the difference between two maximum likelihood estimates (MLE) of mutual information.
Let:
- $\hat{I}_{n}$ be the MLE estimate from a sample of size $n$ ,
- $\hat{I}_{m+n}$ be the MLE estimate from a larger sample of size $m+n$.
Assume the standard deviation of the MLE estimator is approximately:
$$
\sigma_{n}=\frac{\log n}{\sqrt{n}},\quad\sigma_{m+n}=\frac{\log(m+n)}{\sqrt{m+n}}
$$
We want to test the hypothesis:
$$
H_{0}:\mathbb{E}[\hat{I}_{n}]=\mathbb{E}[\hat{I}_{m+n}]\quad\text{vs.}\quad H_{1}:\mathbb{E}[\hat{I}_{n}]\neq\mathbb{E}[\hat{I}_{m+n}]
$$
Note that we are omitting the estimation bias of MLE mutual information estimators for simplicity.
Under the null hypothesis and assuming the two estimates are independent, the test statistic is:
$$
z=\frac{\hat{I}_{n}-\hat{I}_{m+n}}{\sqrt{\sigma_{n}^{2}+\sigma_{m+n}^{2}}}=\frac{\hat{I}_{n}-\hat{I}_{m+n}}{\sqrt{\left(\frac{\log n}{\sqrt{n}}\right)^{2}+\left(\frac{\log(m+n)}{\sqrt{m+n}}\right)^{2}}}
$$
Moving the denominator to the left-hand side yields the form presented in Eq. (6).
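The test statistic above can be sketched numerically as follows; the mutual information values and sample sizes below are illustrative.

```python
import numpy as np

def mi_std_bound(n):
    # worst-case MLE standard deviation, sigma_n = log(n)/sqrt(n)
    return np.log(n) / np.sqrt(n)

def mi_difference_z(I_n, I_mn, n, m):
    """z statistic for H0: E[I_hat_n] = E[I_hat_{m+n}]."""
    denom = np.sqrt(mi_std_bound(n) ** 2 + mi_std_bound(m + n) ** 2)
    return (I_n - I_mn) / denom

# illustrative values: reject H0 at level alpha when |z| exceeds the
# corresponding standard normal quantile
z = mi_difference_z(I_n=0.9, I_mn=1.4, n=200, m=100)
```

Because the plugged-in standard deviations are worst-case bounds rather than the true standard errors, this test is conservative, which is precisely the permissiveness criticized in Section 3.2.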
## Proof of Theorem 1
First, we formally introduce the maximum likelihood entropy estimator $\hat{H}$ (?) for random variable $\mathbf{x}\in\mathcal{X}$ as follows
$$
\hat{H}(\mathbf{x})=-\sum_{i=1}^{|\mathcal{X}|}\hat{p}_{i}\log\hat{p}_{i},
$$
where $\hat{p}_{i}$ is the empirical probability mass of random variable $\mathbf{x}$ at category $i$ . The MLE mutual information estimator is then defined based on the MLE entropy estimator
$$
\hat{I}(\mathbf{x},\mathbf{y})=\hat{H}(\mathbf{x})+\hat{H}(\mathbf{y})-\hat{H}(\mathbf{x},\mathbf{y}).
$$
MIS test bound (Expectation):
Here, we derive the first part of the MIS test bound, representing the expectation of the MIS statistic, i.e., $\mathbb{E}[\text{MIS}]$ . The derivation involves two cases: $n\ll|\mathcal{X}|,|\mathcal{Y}|$ and $n\gg|\mathcal{X}|,|\mathcal{Y}|$ .
When $n\ll|\mathcal{X}|,|\mathcal{Y}|$ , an MLE entropy estimator $\hat{H}$ with $n$ observations behaves simply as $\log n$ (?), conditioned on the $n$ observations being selected by some space-filling design, which is common practice for choosing the initial set of experimentation locations in the design of experiments literature (?). We therefore have $\mathbb{E}[\hat{H}_{n}(\mathbf{x})]=\log n$ . Hence, the mutual information estimator with $n$ observations admits
$$
\mathbb{E}[\hat{I}_{n}(\mathbf{x},\mathbf{y})]=\mathbb{E}[\hat{H}_{n}(\mathbf{x})+\hat{H}_{n}(\mathbf{y})-\hat{H}_{n}(\mathbf{x},\mathbf{y})]=\log n.
$$
Then for MIS, we have
$$
\mathbb{E}[\text{MIS}]=\mathbb{E}[\hat{I}_{m+n}]-\mathbb{E}[\hat{I}_{n}]=\log(m+n)-\log n.
$$
When $n\gg|\mathcal{X}|,|\mathcal{Y}|$ , we face an oversampled scenario in which the samples have most likely exhausted the input and output spaces. In this case, we first introduce the following lemma.
**Lemma 1**
*(?) For a random variable $\mathbf{x}\in\mathcal{X}$ , the bias of an oversampled ( $n\gg|\mathcal{X}|$ ) MLE entropy estimator $\hat{H}_{n}(\mathbf{x})$ is
$$
\mathbb{E}[\hat{H}_{n}(\mathbf{x})]-H(\mathbf{x})=-\frac{|\mathcal{X}|-1}{n}+o(\frac{1}{n}). \tag{9}
$$*
With the above lemma, we can derive the following Corollary.
**Corollary 1**
*For random variable $\mathbf{x}\in\mathcal{X}$ and $\mathbf{y}\in\mathcal{Y}$ , when the $\mathbf{y}=f(\mathbf{x})$ mapping is noise free, the MLE mutual information estimator $\hat{I}_{n}$ asymptotically satisfies
$$
\mathbb{E}[\hat{I}_{n}]=I-\frac{|\mathcal{Y}|-1}{n}.
$$*
The proof of the above corollary follows immediately by observing that $|\mathcal{X},\mathcal{Y}|=|\mathcal{X}|$ for a noise-free mapping and invoking Lemma 1.
Therefore, for MIS under the case of oversampling, we have
$$
\mathbb{E}[\text{MIS}]=\mathbb{E}[\hat{I}_{m+n}]-\mathbb{E}[\hat{I}_{n}]=\left(I-\frac{|\mathcal{Y}|-1}{m+n}\right)-\left(I-\frac{|\mathcal{Y}|-1}{n}\right)=\frac{m(|\mathcal{Y}|-1)}{n(m+n)}.
$$
MIS test bound (Variation):
In this part, we derive the second term of the MIS test bound, accounting for the variation of the MIS statistic. We first investigate the maximum change in the mutual information estimate $\hat{I}$ when a single observation is changed. Here, we derive the following lemma.
**Lemma 2**
*Let $\mathcal{S}=\{(x_{i},y_{i})\}_{i=1}^{n}$ be an i.i.d. sample from an unknown joint distribution on finite alphabets and denote by
$$
\hat{I}_{n}(\mathbf{x},\mathbf{y})\;=\;\hat{H}_{n}(\mathbf{x})+\hat{H}_{n}(\mathbf{y})-\hat{H}_{n}(\mathbf{x},\mathbf{y})
$$
the MLE estimator, where $\hat{H}_{n}$ is the empirical Shannon entropy (in nats). If $\mathcal{S}^{\prime}$ differs from $\mathcal{S}$ in exactly one observation, then with a mild abuse of notation (denoting mutual information estimator on sample set $\mathcal{S}$ with $\hat{I}_{n}(\mathcal{S})$ ),
$$
\bigl{|}\hat{I}_{n}(\mathcal{S})-\hat{I}_{n}(\mathcal{S}^{\prime})\bigr{|}\;\leq\;\frac{2\,\log n}{n}.
$$*
*Proof of Lemma 2.* We omit the hat $\hat{\cdot}$ on estimators throughout this proof for simplicity. Write $H=-\sum_{i}p_{i}\log p_{i}$ for the Shannon entropy estimator with natural logarithms. Replacing a single observation does two things:
1. in one $X$ -category and one $Y$ -category the counts change by $\pm 1$ (all other marginal counts are unchanged);
2. in one joint cell the count changes by $-1$ and in another joint cell the count changes by $+1$ .
Step 1. How much can one empirical Shannon entropy change?
Assume a single observation is moved from category $A$ to category $B$ . Let the counts before the move be $A=a$ (with $a\geq 1$ ) and $B=b$ (with $b\geq 0$ ). After the move the counts become $a-1$ and $b+1$ . Only these two probabilities change; every other probability is fixed.
The change in entropy is therefore
$$
\Delta H=\Big(\frac{a}{n}\log\frac{a}{n}-\frac{a-1}{n}\log\frac{a-1}{n}\Big)-\Big(\frac{b+1}{n}\log\frac{b+1}{n}-\frac{b}{n}\log\frac{b}{n}\Big).
$$
The difference is largest when $a=n$ and $b=0$ , i.e., when all $n$ observations initially occupy a single category and the move creates a brand-new one. In that worst case,
$$
\Delta H=\frac{n-1}{n}\log\frac{n-1}{n}+\frac{1}{n}\log n\leq\frac{n-1}{n}\log\frac{n}{n}+\frac{1}{n}\log n=\frac{\log n}{n}. \tag{10}
$$
The inequality follows since $\frac{n-1}{n}\log\frac{n-1}{n}\leq 0$ . Conversely, one can show that $-\frac{\log n}{n}\leq\Delta H$ also holds. Therefore, the maximum absolute change of the entropy estimate under the shift of one observation is upper bounded by $\frac{\log n}{n}$ .
Step 2. Sign coupling between the three entropies.
Assume the moved observation leaves joint cell $(i,j)$ and enters cell $(k,\ell)$ . Because $(i,j)$ lies in row $i$ and column $j$ only, we have the key fact (denoting sign operator with $\text{sgn}(\cdot)$ ):
$$
\text{sgn}\bigl{(}\Delta H(\mathbf{x},\mathbf{y})\bigr{)}\in\bigl{\{}\text{sgn}\bigl{(}\Delta H(\mathbf{x})\bigr{)},\text{sgn}\bigl{(}\Delta H(\mathbf{y})\bigr{)}\bigr{\}}.
$$
Hence $-\text{sgn}\bigl{(}\Delta H(\mathbf{x},\mathbf{y})\bigr{)}=\text{sgn}\bigl{(}\Delta H(\mathbf{x})\bigr{)}=\text{sgn}\bigl{(}\Delta H(\mathbf{y})\bigr{)}$ is impossible.
Then, with $\Delta I=\Delta H(\mathbf{x})+\Delta H(\mathbf{y})-\Delta H(\mathbf{x},\mathbf{y})$ , we can see the following fact
$$
|\Delta I|=\bigl{|}\Delta H(\mathbf{x})+\Delta H(\mathbf{y})-\Delta H(\mathbf{x},\mathbf{y})\bigr{|}\leq 2\max\{|\Delta H(\mathbf{x})|,|\Delta H(\mathbf{y})|,|\Delta H(\mathbf{x},\mathbf{y})|\}.
$$
Applying the one-entropy bound (10) to the two marginals,
$$
|\Delta I|\leq\frac{2\log n}{n},
$$
which is the desired inequality. $\blacksquare$
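As a quick numerical probe of this bounded-difference property (the data-generating choices and the replacement observation below are arbitrary), one can perturb each observation in turn and record the resulting change in the plug-in estimate:

```python
import numpy as np

def plugin_mi(pairs):
    """Plug-in MI (nats) computed from a list of (x, y) pairs."""
    def H(labels):
        _, counts = np.unique(labels, return_counts=True, axis=0)
        p = counts / counts.sum()
        return -np.sum(p * np.log(p))
    arr = np.array(pairs)
    return H(arr[:, 0]) + H(arr[:, 1]) - H(arr)

rng = np.random.default_rng(1)
n = 50
pairs = [(int(a), int(a) % 5) for a in rng.integers(0, 20, size=n)]
base = plugin_mi(pairs)
bound = 2 * np.log(n) / n
worst = 0.0
for i in range(n):                  # replace each observation in turn
    alt = list(pairs)
    alt[i] = (3, 3 % 5)             # arbitrary replacement observation
    worst = max(worst, abs(base - plugin_mi(alt)))
print(f"worst observed |Delta I| = {worst:.4f}, bound = {bound:.4f}")
```

The printed worst observed change can then be compared against $2\log n/n$ for the chosen $n$.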
Establishing Lemma 2 allows us to apply McDiarmid's inequality (?), a concentration inequality for functions with bounded differences.
**Lemma 3 (McDiarmid's Inequality)**
*If $\{\mathbf{x}_{i}\in\mathcal{X}_{i}\}_{i=1}^{n}$ are independent random variables (not necessarily identically distributed), and a function $f:\mathcal{X}_{1}\times\mathcal{X}_{2}\times\ldots\times\mathcal{X}_{n}\rightarrow\mathbb{R}$ satisfies the coordinate-wise bounded-difference condition
$$
\sup_{\mathbf{x}^{\prime}_{j}\in\mathcal{X}_{j}}|f(\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{j},\ldots,\mathbf{x}_{n})-f(\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}^{\prime}_{j},\ldots,\mathbf{x}_{n})|<c_{j},
$$
$$
for $1\leq j\leq n$ , then for any $\epsilon\geq 0$ ,
$$
P(|f(\mathbf{x}_{1},\ldots,\mathbf{x}_{n})-\mathbb{E}[f]|>\epsilon)\leq 2e^{-2\epsilon^{2}/\sum c_{j}^{2}}. \tag{11}
$$*
To apply McDiarmid's inequality, we can view the mutual information estimator built from the $n$ old observations and the $m$ new observations, denoted $\hat{I}_{m+n}$ , as a function of the $m$ new observations $\{\mathbf{x}_{i}\in\mathcal{X}\}_{i=1}^{m}$ . Moreover, we have already bounded the maximum difference of the mutual information estimator through Lemma 2, meaning
$$
\sup_{\mathbf{x}^{\prime}_{j}\in\mathcal{X}}|\hat{I}_{m+n}(\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{j},\ldots,\mathbf{x}_{m})-\hat{I}_{m+n}(\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}^{\prime}_{j},\ldots,\mathbf{x}_{m})|<\frac{2\log(m+n)}{m+n}.
$$
Then, plugging this upper bound into Eq. (11), we have
$$
P(|\hat{I}_{m+n}-\mathbb{E}[\hat{I}_{m+n}]|>\epsilon)\leq 2e^{-2\epsilon^{2}/\sum_{j=1}^{m}\left(\frac{2\log(m+n)}{m+n}\right)^{2}}=2e^{-(m+n)^{2}\epsilon^{2}/(2m\log^{2}(m+n))}.
$$
By setting the RHS of the above inequality to $\rho$ , we obtain the following statement, which holds with probability at least $1-\rho$ :
$$
|\hat{I}_{m+n}-\mathbb{E}[\hat{I}_{m+n}]|\leq\frac{\sqrt{2m\log(2/\rho)}\log(m+n)}{m+n}. \tag{12}
$$
$$
Finally, combining the two parts of the derivation, when $n\ll|\mathcal{X}|,|\mathcal{Y}|$ , the following holds with probability at least $1-\rho$ :
$$
\begin{aligned}
\text{MIS}&=\hat{I}_{m+n}-\hat{I}_{n}\\
&=\hat{I}_{m+n}-\mathbb{E}[\hat{I}_{n}]\\
&\leq\mathbb{E}[\hat{I}_{m+n}]-\mathbb{E}[\hat{I}_{n}]+\frac{\sqrt{2m\log(2/\rho)}\log(m+n)}{m+n}\\
&=\log(m+n)-\log n+\frac{\sqrt{2m\log(2/\rho)}\log(m+n)}{m+n}.
\end{aligned}
$$
The second equality follows from the typical-sample assumption in Assumption 1. The proof of Theorem 1 is now complete.
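The resulting test threshold can be sketched numerically; the significance level $\rho$ and the sample sizes below are illustrative choices.

```python
import numpy as np

def mis_test_bound(n, m, rho=0.05):
    """Upper bound on MIS under H0 for the undersampled case
    (n << |X|, |Y|): expectation term plus McDiarmid deviation term,
    holding with probability at least 1 - rho."""
    expectation = np.log(m + n) - np.log(n)
    deviation = np.sqrt(2 * m * np.log(2 / rho)) * np.log(m + n) / (m + n)
    return expectation + deviation

# flag a Mutual Information Surprise when the observed statistic exceeds it
mis_observed = 0.9                       # illustrative value
surprised = mis_observed > mis_test_bound(n=100, m=50)
```

As $n$ grows with $m$ fixed, both terms shrink, so surprises become harder to trigger, matching the intuition that a well-sampled system should rarely be surprised.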
## Pollution Map Dataset
The dynamic pollution map is modeled as $u(\mathbf{x},t)$ , a function of spatial location $\mathbf{x}=(x_{1},x_{2})\in[0,1]^{2}$ and time $t$ . The governing partial differential equation (PDE) for the pollution map is
$$
\frac{\partial u}{\partial t}=-\mathbf{v}\cdot\nabla u+\nabla(\mathbf{D}\nabla u)-\zeta u+S(\mathbf{x}), \tag{13}
$$
where $\mathbf{v}=[1,0]$ is the advection velocity, representing wind that transports pollution horizontally to the right. The matrix $\mathbf{D}=\text{diag}(0.01,2)$ is the diagonal diffusion matrix, indicating that pollution diffuses much more rapidly in the $x_{2}$ direction than in the $x_{1}$ direction. The parameter $\zeta=2$ represents the exponential decay factor, modeling the natural decay of pollution levels over time. The term $S(\mathbf{x})$ models the spatially dependent but temporally constant pollution source at location $\mathbf{x}$ . Additionally, a base level of random pollution with mean $2$ and standard deviation $0.25$ is added to the pollution field. The evolution of the pollution map is computed in the Fourier domain by applying a discrete Fourier transform to the PDE in Eq. (13).
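For intuition, a single explicit finite-difference time step of Eq. (13) can be sketched as follows (the paper integrates the PDE spectrally; the grid size, time step, source field, and periodic boundaries here are illustrative assumptions):

```python
import numpy as np

N, dt = 64, 5e-5                       # illustrative grid size and time step
xs = np.linspace(0.0, 1.0, N, endpoint=False)
dx = xs[1] - xs[0]
v1 = 1.0                               # advection velocity in x1 (wind)
D1, D2 = 0.01, 2.0                     # diagonal diffusion coefficients
zeta = 2.0                             # exponential decay factor
X1, X2 = np.meshgrid(xs, xs, indexing="ij")
S = np.exp(-((X1 - 0.3) ** 2 + (X2 - 0.5) ** 2) / 0.01)  # hypothetical source
u = np.zeros((N, N))

def step(u):
    # centered differences with periodic boundaries via np.roll
    du1 = (np.roll(u, -1, axis=0) - np.roll(u, 1, axis=0)) / (2 * dx)
    d2u1 = (np.roll(u, -1, axis=0) - 2 * u + np.roll(u, 1, axis=0)) / dx ** 2
    d2u2 = (np.roll(u, -1, axis=1) - 2 * u + np.roll(u, 1, axis=1)) / dx ** 2
    return u + dt * (-v1 * du1 + D1 * d2u1 + D2 * d2u2 - zeta * u + S)

for _ in range(200):
    u = step(u)
```

The time step is chosen below the explicit diffusion stability limit $dt \le dx^{2}/(2(D_{1}+D_{2}))$; the spectral scheme used in the paper avoids this restriction.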
In the last simulation experiment with the pollution map, we use the same PDE with modified parameters. Specifically, the pollution source $S(\mathbf{x})$ is removed, and the decay parameter $\zeta$ is reduced to $0.1$ in the second phase.
## Surprise Reactive Sampling Strategy Pseudo Code
In this section, we present the pseudocode for the SR sampling strategy of (?) in Algorithm 2, for reproducibility purposes.
Algorithm 2 Surprise Reactive (SR) Sampling Strategy
1: Observation set $\mathbf{X}:\{\mathbf{x}_{i}\in\mathcal{X}\}_{i=1}^{n}$ ; Total sampling budget $k$ ; Exploitation limit $t$ ; A surprise measure $S(\cdot)$ ; A surprise triggering threshold $s$ ; Exploration mode indicator $\xi=\text{True}$ ; Surprising location $\mathbf{x}_{s}=\text{None}$ ; Surprising location set $\mathbf{X}_{s}=\text{None}$ ; Neighborhood radius $\epsilon$ .
2: while $i<k$ ( $i$ starts from $0$ ) do
3: if $\xi$ then
4: Sample $\mathbf{x}^{*}$ as
$$
\mathbf{x}^{*}=\underset{\mathbf{x}}{\operatorname{argmax}}\>\underset{\mathbf{x}_{i}\in\mathbf{X}}{\min}\>\|\mathbf{x}-\mathbf{x}_{i}\|_{2}.
$$
5: $i=i+1$
6: Compute $S(\mathbf{x}^{*})$
7: if $S(\mathbf{x}^{*})\leq s$ then
8: $\mathbf{X}=[\mathbf{X},\mathbf{x}^{*}]$
9: else
10: $\xi=\text{False}$ , $\mathbf{x}_{s}=\mathbf{x}^{*}$ , $\mathbf{X}_{s}=[\mathbf{x}^{*}]$
11: end if
12: else
13: while $j\leq t$ ( $j$ starts from $0$ ) do
14: Sample $\mathbf{x}^{*}$ randomly in the $\epsilon$ ball centered at $\mathbf{x}_{s}$ .
15: $j=j+1$ , $i=i+1$
16: Compute $S(\mathbf{x}^{*})$
17: if $S(\mathbf{x}^{*})\leq s$ then
18: $\mathbf{X}=[\mathbf{X},\mathbf{x}^{*}]$ , $\xi=\text{True}$ , $\mathbf{X}_{s}=\text{None}$
19: Break While
20: else
21: $\mathbf{X}_{s}=[\mathbf{X}_{s},\mathbf{x}^{*}]$
22: end if
23: if $i\geq k$ then
24: Break While
25: end if
26: end while
27: if $\mathbf{X}_{s}$ is not None then
28: $\mathbf{X}=[\mathbf{X},\mathbf{X}_{s}]$ , $\xi=\text{True}$
29: end if
30: end if
31: end while
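A minimal Python sketch of Algorithm 2 follows. The surprise measure, threshold, candidate grid, and the cube-shaped approximation of the $\epsilon$ ball are illustrative assumptions.

```python
import numpy as np

def sr_sampling(candidates, surprise, k, t=5, s=1.0, eps=0.1, seed=0):
    """Sketch of the SR strategy: maximin (farthest-point) exploration
    until a surprising location triggers local exploitation around it."""
    rng = np.random.default_rng(seed)
    X = [candidates[rng.integers(len(candidates))]]   # initial observation
    i, explore, x_s, X_s = 0, True, None, None
    while i < k:
        if explore:
            # farthest-point exploration step (line 4)
            d = [min(np.linalg.norm(c - xi) for xi in X) for c in candidates]
            x_star = candidates[int(np.argmax(d))]
            i += 1
            if surprise(x_star) <= s:
                X.append(x_star)
            else:                          # enter exploitation mode
                explore, x_s, X_s = False, x_star, [x_star]
        else:
            for _ in range(t):             # local exploitation (lines 13-26)
                x_star = x_s + rng.uniform(-eps, eps, size=x_s.shape)
                i += 1
                if surprise(x_star) <= s:  # surprise resolved
                    X.append(x_star)
                    explore, X_s = True, None
                    break
                X_s.append(x_star)
                if i >= k:
                    break
            if X_s is not None:            # merge and reset (lines 27-29)
                X.extend(X_s)
                explore, X_s = True, None
    return np.array(X)
```

With a surprise measure that never exceeds the threshold, the sketch reduces to pure maximin space-filling sampling, matching the exploration branch of the algorithm.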
## Non-smoothed Error Progression with Action Overlays
Here we present the non-smoothed estimation error progression figure with action overlays.
Figure 11: A non-smoothed visualization of estimation error progression with MISRP action overlays.
## References and Notes
- 1. B. Burger, P. M. Maffettone, V. V. Gusev, C. M. Aitchison, Y. Bai, X. Wang, X. Li, B. M. Alston, B. Li, R. Clowes, N. Rankin, B. Harris, R. S. Sprick, and A. I. Cooper, âA mobile robotic chemist,â Nature, vol. 583, pp. 237â241, 2020.
- 2. A. Merchant, S. Batzner, S. S. Schoenholz, M. Aykol, G. Cheon, and E. D. Cubuk, âScaling deep learning for materials discovery,â Nature, vol. 624, pp. 80â85, 2023.
- 3. N. J. Szymanski, B. Rendy, Y. Fei, R. E. Kumar, T. He, D. Milsted, M. J. McDermott, M. Gallant, E. D. Cubuk, A. Merchant, H. Kim, A. Jain, C. J. Bartel, K. Persson, Y. Zeng, and G. Ceder, âAn autonomous laboratory for the accelerated synthesis of novel materials,â Nature, vol. 624, pp. 86â91, 2023.
- 4. T. Dai, S. Vijayakrishnan, F. T. SzczypiĆski, J.-F. Ayme, E. Simaei, T. Fellowes, R. Clowes, L. Kotopanov, C. E. Shields, Z. Zhou, J. W. Ward, and A. I. Cooper, âAutonomous mobile robots for exploratory synthetic chemistry,â Nature, vol. 635, pp. 890â897, 2024.
- 5. J. Levinson, J. Askeland, J. Becker, J. Dolson, D. Held, S. Kammel, J. Z. Kolter, D. Langer, O. Pink, V. Pratt, M. Sokolsky, G. Stanek, D. Stavens, A. Teichman, M. Werling, and S. Thrun, âTowards fully autonomous driving: Systems and algorithms,â in Proceedings of the 2011 IEEE Intelligent Vehicles Symposium, (Baden-Baden, Germany), June 2011.
- 6. B. P. MacLeod, F. G. Parlane, T. D. Morrissey, F. HĂ€se, L. M. Roch, K. E. Dettelbach, R. Moreira, L. P. Yunker, M. B. Rooney, and J. R. Deeth, âSelf-driving laboratory for accelerated discovery of thin-film materials,â Science Advances, vol. 6, no. 20, p. eaaz8867, 2020.
- 7. E. Yurtsever, J. Lambert, A. Carballo, and K. Takeda, âA survey of autonomous driving: Common practices and emerging technologies,â IEEE Access, vol. 8, pp. 58443â58469, 2020.
- 8. D. Bogdoll, M. Nitsche, and J. M. Zöllner, âAnomaly detection in autonomous driving: A survey,â in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (New Orleans, USA), June 2022.
- 9. H.-S. Park and N.-H. Tran, âAn autonomous manufacturing system based on swarm of cognitive agents,â Journal of Manufacturing Systems, vol. 31, no. 3, pp. 337â348, 2012.
- 10. J. Leng, Y. Zhong, Z. Lin, K. Xu, D. Mourtzis, X. Zhou, P. Zheng, Q. Liu, J. L. Zhao, and W. Shen, âTowards resilience in industry 5.0: A decentralized autonomous manufacturing paradigm,â Journal of Manufacturing Systems, vol. 71, pp. 95â114, 2023.
- 11. J. Reis, Y. Cohen, N. MelĂŁo, J. Costa, and D. Jorge, âHigh-tech defense industries: Developing autonomous intelligent systems,â Applied Sciences, vol. 11, no. 11, p. 4920, 2021.
- 12. P. Nikolaev, D. Hooper, F. Webber, R. Rao, K. Decker, M. Krein, J. Poleski, R. Barto, and B. Maruyama, âAutonomy in materials research: A case study in carbon nanotube growth,â NPJ Computational Materials, vol. 2, p. 16031, 2016.
- 13. J. Chang, P. Nikolaev, J. Carpena-NĂșñez, R. Rao, K. Decker, A. E. Islam, J. Kim, M. A. Pitt, J. I. Myung, and B. Maruyama, âEfficient closed-loop maximization of carbon nanotube growth rate using Bayesian optimization,â Scientific Reports, vol. 10, p. 9040, 2020.
- 14. I. Ahmed, S. T. Bukkapatnam, B. Botcha, and Y. Ding, âToward futuristic autonomous experimentationâa surprise-reacting sequential experiment policy,â IEEE Transactions on Automation Science and Engineering, vol. 22, pp. 7912â7926, 2025.
- 15. Z.-G. Zhou and P. Tang, âContinuous anomaly detection in satellite image time series based on z-scores of season-trend model residuals,â in Proceedings of the 2016 IEEE International Geoscience and Remote Sensing Symposium, (Beijing, China), July 2016.
- 16. K. Cohen and Q. Zhao, âActive hypothesis testing for anomaly detection,â IEEE Transactions on Information Theory, vol. 61, no. 3, pp. 1432â1450, 2015.
- 17. J. F. Kamenik and M. Szewc, âNull hypothesis test for anomaly detection,â Physics Letters B, vol. 840, p. 137836, 2023.
- 18. D. J. Weller-Fahy, B. J. Borghetti, and A. A. Sodemann, âA survey of distance and similarity measures used within network intrusion anomaly detection,â IEEE Communications Surveys & Tutorials, vol. 17, no. 1, pp. 70â91, 2014.
- 19. L. Montechiesi, M. Cocconcelli, and R. Rubini, âArtificial immune system via Euclidean distance minimization for anomaly detection in bearings,â Mechanical Systems and Signal Processing, vol. 76, pp. 380â393, 2016.
- 20. Y. Wang, Q. Miao, E. W. Ma, K.-L. Tsui, and M. G. Pecht, âOnline anomaly detection for hard disk drives based on Mahalanobis distance,â IEEE Transactions on Reliability, vol. 62, no. 1, pp. 136â145, 2013.
- 21. Y. Hou, Z. Chen, M. Wu, C.-S. Foo, X. Li, and R. M. Shubair, âMahalanobis distance based adversarial network for anomaly detection,â in Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, (Virtual), May 2020.
- 22. T. Schlegl, P. Seeböck, S. M. Waldstein, G. Langs, and U. Schmidt-Erfurth, âF-anoGAN: Fast unsupervised anomaly detection with generative adversarial networks,â Medical Image Analysis, vol. 54, pp. 30â44, 2019.
- 23. B. Lian, Y. Kartal, F. L. Lewis, D. G. Mikulski, G. R. Hudas, Y. Wan, and A. Davoudi, âAnomaly detection and correction of optimizing autonomous systems with inverse reinforcement learning,â IEEE Transactions on Cybernetics, vol. 53, no. 7, pp. 4555â4566, 2022.
- 24. A. Barto, M. Mirolli, and G. Baldassarre, âNovelty or surprise?,â Frontiers in Psychology, vol. 4, p. 907, 2013.
- 25. L. Itti and P. Baldi, âBayesian surprise attracts human attention,â Vision Research, vol. 49, no. 10, pp. 1295â1306, 2009.
- 26. V. Liakoni, A. Modirshanechi, W. Gerstner, and J. Brea, âLearning in volatile environments with the Bayes factor surprise,â Neural Computation, vol. 33, no. 2, pp. 269â340, 2021.
- 27. M. Faraji, K. Preuschoff, and W. Gerstner, âBalancing new against old information: The role of puzzlement surprise in learning,â Neural Computation, vol. 30, no. 1, pp. 34â83, 2018.
- 28. O. Ăatal, S. Leroux, C. De Boom, T. Verbelen, and B. Dhoedt, âAnomaly detection for autonomous guided vehicles using Bayesian surprise,â in Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems, (Las Vegas, USA), October 2020.
- 29. Y. Zamiri-Jafarian and K. N. Plataniotis, âA Bayesian surprise approach in designing cognitive radar for autonomous driving,â Entropy, vol. 24, no. 5, p. 672, 2022.
- 30. A. Dinparastdjadid, I. Supeene, and J. Engstrom, âMeasuring surprise in the wild,â arXiv preprint arXiv:2305.07733, 2023.
- 31. A. S. Raihan, H. Khosravi, T. H. Bhuiyan, and I. Ahmed, âAn augmented surprise-guided sequential learning framework for predicting the melt pool geometry,â Journal of Manufacturing Systems, vol. 75, pp. 56â77, 2024.
- 32. S. Jin, J. R. Deneault, B. Maruyama, and Y. Ding, âAutonomous experimentation systems and benefit of surprise-based Bayesian optimization,â in Proceedings of the 2022 International Symposium on Flexible Automation, (Yokohama, Japan), July 2022.
- 33. A. Modirshanechi, J. Brea, and W. Gerstner, âA taxonomy of surprise definitions,â Journal of Mathematical Psychology, vol. 110, p. 102712, 2022.
- 34. P. Baldi, âA computational theory of surprise,â in Information, Coding and Mathematics: Proceedings of Workshop Honoring Prof. Bob Mceliece on his 60th Birthday, pp. 1â25, 2002.
- 35. A. Prat-Carrabin, R. C. Wilson, J. D. Cohen, and R. Azeredo da Silveira, "Human inference in changing environments with temporal structure," Psychological Review, vol. 128, no. 5, pp. 879–912, 2021.
- 36. P. J. Rousseeuw and C. Croux, "Alternatives to the median absolute deviation," Journal of the American Statistical Association, vol. 88, no. 424, pp. 1273–1283, 1993.
- 37. C. Aytekin, X. Ni, F. Cricri, and E. Aksu, "Clustering and unsupervised anomaly detection with l-2 normalized deep auto-encoder representations," in Proceedings of the 2018 International Joint Conference on Neural Networks, (Rio de Janeiro, Brazil), October 2018.
- 38. D. T. Nguyen, Z. Lou, M. Klar, and T. Brox, "Anomaly detection with multiple-hypotheses predictions," in Proceedings of the 36th International Conference on Machine Learning, (Long Beach, USA), June 2019.
- 39. A. Kolossa, B. Kopp, and T. Fingscheidt, "A computational analysis of the neural bases of Bayesian inference," NeuroImage, vol. 106, pp. 222–237, 2015.
- 40. C. E. Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.
- 41. L. Paninski, "Estimation of entropy and mutual information," Neural Computation, vol. 15, no. 6, pp. 1191–1253, 2003.
- 42. D. François, V. Wertz, and M. Verleysen, "The permutation test for feature selection by mutual information," in Proceedings of the 14th European Symposium on Artificial Neural Networks, (Bruges, Belgium), April 2006.
- 43. G. Doquire and M. Verleysen, "Mutual information-based feature selection for multilabel classification," Neurocomputing, vol. 122, pp. 148–155, 2013.
- 44. T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 1999.
- 45. A. Bondu, V. Lemaire, and M. Boullé, "Exploration vs. exploitation in active learning: A Bayesian approach," in Proceedings of the 2010 International Joint Conference on Neural Networks, (Barcelona, Spain), July 2010.
- 46. J. G. Moreno-Torres, T. Raeder, R. Alaiz-Rodríguez, N. V. Chawla, and F. Herrera, "A unifying view on dataset shift in classification," Pattern Recognition, vol. 45, no. 1, pp. 521–530, 2012.
- 47. M. Sugiyama, M. Krauledat, and K.-R. Müller, "Covariate shift adaptation by importance weighted cross validation," Journal of Machine Learning Research, vol. 8, no. 5, pp. 985–1005, 2007.
- 48. S. Bickel, M. Brückner, and T. Scheffer, "Discriminative learning under covariate shift," Journal of Machine Learning Research, vol. 10, no. 9, pp. 2137–2155, 2009.
- 49. I. Žliobaitė, M. Pechenizkiy, and J. Gama, "An overview of concept drift applications," Big Data Analysis: New Algorithms for a New Society, vol. 16, pp. 91–114, 2016.
- 50. K. Zhang, A. T. Bui, and D. W. Apley, "Concept drift monitoring and diagnostics of supervised learning models via score vectors," Technometrics, vol. 65, no. 2, pp. 137–149, 2023.
- 51. N. Cebron and M. R. Berthold, "Active learning for object classification: From exploration to exploitation," Data Mining and Knowledge Discovery, vol. 18, pp. 283–299, 2009.
- 52. U. J. Islam, K. Paynabar, G. Runger, and A. S. Iquebal, "Dynamic exploration–exploitation trade-off in active learning regression with Bayesian hierarchical modeling," IISE Transactions, vol. 57, no. 4, pp. 393–407, 2025.
- 53. V. R. Joseph, "Space-filling designs for computer experiments: A review," Quality Engineering, vol. 28, no. 1, pp. 28–35, 2016.
- 54. K. Chai, "Generalization errors and learning curves for regression with multi-task Gaussian processes," in Proceedings of the 23rd Advances in Neural Information Processing Systems, (Vancouver, Canada), December 2009.
- 55. S. P. Strong, R. Koberle, R. R. D. R. Van Steveninck, and W. Bialek, "Entropy and information in neural spike trains," Physical Review Letters, vol. 80, p. 197, 1998.
- 56. C. McDiarmid, "On the method of bounded differences," Surveys in Combinatorics, vol. 141, no. 1, pp. 148–188, 1989.