# The Sample Complexity of Simple Binary Hypothesis Testing

*This paper was presented in part at COLT 2024 as an extended abstract [PenJL24-colt].*
**Authors**:
- Ankit Pensia (Simons Institute for the Theory of Computing)
- Varun Jog (University of Cambridge)
- Po-Ling Loh (University of Cambridge)
**Abstract**

The sample complexity of simple binary hypothesis testing is the smallest number of i.i.d. samples required to distinguish between two distributions $p$ and $q$ in either: (i) the prior-free setting, with type-I error at most $\alpha$ and type-II error at most $\beta$; or (ii) the Bayesian setting, with Bayes error at most $\delta$ and prior distribution $(\pi,1-\pi)$. This problem has only been studied when $\alpha=\beta$ (prior-free) or $\pi=1/2$ (Bayesian), and the sample complexity is known to be characterized by the Hellinger divergence between $p$ and $q$, up to multiplicative constants. In this paper, we derive a formula that characterizes the sample complexity (up to multiplicative constants that are independent of $p$, $q$, and all error parameters) for: (i) all $0≤\alpha,\beta≤ 1/8$ in the prior-free setting; and (ii) all $\delta≤\pi/4$ in the Bayesian setting. In particular, the formula admits equivalent expressions in terms of certain divergences from the Jensen–Shannon and Hellinger families. The main technical result concerns an $f$-divergence inequality between members of the Jensen–Shannon and Hellinger families, which is proved by a combination of information-theoretic tools and case-by-case analyses. We explore applications of our results to (i) robust hypothesis testing, (ii) distributed (locally-private and communication-constrained) hypothesis testing, (iii) sequential hypothesis testing, and (iv) hypothesis testing with erasures.

**Contents**
- 1 Introduction
- 2 Our results
  - 2.1 Sample complexity of Bayesian simple hypothesis testing
  - 2.2 Sample complexity of prior-free simple hypothesis testing
  - 2.3 Distributed simple binary hypothesis testing
  - 2.4 Robust simple binary hypothesis testing
  - 2.5 Large error probability regime: Weak detection
- 3 Related work
- 4 Preliminaries
  - 4.1 Problem definitions
    - 4.1.1 Bayesian hypothesis testing
    - 4.1.2 Prior-free hypothesis testing
    - 4.1.3 Relation between these two problems
  - 4.2 Probability divergences
- 5 Proof Sketch of Theorem 2.1
  - 5.1 Linear error probability regime
  - 5.2 Sublinear error probability regime
- 6 Reduction: Error amplification and success amplification
  - 6.1 Proof of Proposition 6.1
- 7 Bayesian testing problem: Linear regime
  - 7.1 Proof of Lemma 7.4
  - 7.2 Proofs of Claims 7.5 and 7.7
  - 7.3 Proofs of Theorems 2.1 and 2.2
- 8 Weak detection
- 9 Distributed hypothesis testing
  - 9.1 Hypothesis testing under communication constraints
  - 9.2 Hypothesis testing under local differential privacy
- 10 Sequential hypothesis testing
  - 10.1 Constant error regime
  - 10.2 Bounded likelihood ratio
- 11 Hypothesis testing with erasures
  - 11.1 Background on simple hypothesis testing with erasures
  - 11.2 Prior-free hypothesis testing with erasures
  - 11.3 Bayesian hypothesis testing with erasures
- 12 Discussion
- 13 Auxiliary results
  - 13.1 Proof of Fact 4.9
  - 13.2 Proof of Claim 7.8
- 14 Asymptotic analyses of the errors in hypothesis testing with erasures
  - 14.1 Stein regime in the erasure setting
  - 14.2 Chernoff and Bayes regimes in the erasure setting
## 1 Introduction
Hypothesis testing is a fundamental problem in statistical inference that seeks the most likely hypothesis corresponding to a given set of observations. The simplest formulation of hypothesis testing, which is also the focus of this paper, is simple binary hypothesis testing. Here, the two hypotheses correspond to distributions $p$ and $q$ over a domain $\mathcal{X}$ , and the set of observations comprises $n$ i.i.d. samples $X_{1},...,X_{n}$ from $\theta∈\{p,q\}$ . The goal is to identify which distribution generated the samples; i.e., to produce $\widehat{\theta}:=\widehat{\theta}(X_{1},...,X_{n})$ so that $\widehat{\theta}=\theta$ with high probability.
Simple binary hypothesis testing is a crucial building block in statistical inference procedures. Consequently, much research has analyzed the best algorithms and their performance [LehRC86]. The famous Neyman–Pearson lemma [NeyPea33] provides the optimal procedure for $\widehat{\theta}$ , and subsequent works have completely characterized its error probability in two regimes: the single-sample setting ( $n=1$ ) and the infinite-sample setting ( $n→∞$ ). While the single-sample regime is relatively straightforward, much historical work in statistics and information theory has focused on the infinite-sample (or asymptotic) regime, where the asymptotic error admits particularly neat expressions in terms of information-theoretic divergences between $p$ and $q$ .
Although asymptotic results provide crucial insight into the problem structure, they offer no concrete guarantees for finite samples. In particular, asymptotic bounds cannot satisfactorily answer the question of how many samples are needed (i.e., what is the sample complexity) to solve hypothesis testing with a desired level of accuracy. This limits their applicability in practice, as well as in learning theory research, where sample complexity bounds are paramount.
While some prior work has established non-asymptotic bounds on the error probability in simple binary hypothesis testing, we observe that the sample complexity perspective has remained largely unaddressed. (See Section 3 for details.) Our main contribution is to fill this gap by developing tight results for the sample complexity of simple binary hypothesis testing. In what follows, we precisely define the problem and introduce terminology necessary to describe our results.
In the context of simple binary hypothesis testing, an algorithm can make two types of errors: (i) $\mathbb{P}(\widehat{\theta}=q|\theta=p)$ , termed type-I error, and (ii) $\mathbb{P}(\widehat{\theta}=p|\theta=q)$ , termed type-II error. These two types of errors may have different operational consequences in different situations, e.g., false positives vs. false negatives of a lethal illness, where the cost incurred by the former is significantly smaller than that of the latter. A natural way to combine these metrics is by considering a weighted sum of the errors, which is equivalent to the well-studied Bayesian formulation. We define the Bayesian version of simple binary hypothesis testing and its corresponding sample complexity as follows:
**Definition 1.1 (Bayesian simple binary hypothesis testing)**
*Let $p$ and $q$ be two distributions over a finite domain $\mathcal{X}$ , and let $\pi,\delta∈(0,1)$ . We say that a test $\phi:\cup_{n=0}^{∞}\mathcal{X}^{n}→\{p,q\}$ solves the Bayesian simple binary hypothesis testing problem of distinguishing $p$ versus $q$ , under the prior $(\pi,1-\pi)$ , with sample complexity $n$ and probability of error $\delta$ , if for $X:=(X_{1},...,X_{n})$ , we have
$$
\pi\cdot\mathbb{P}_{X\sim p^{\otimes n}}\left\{\phi(X)\neq p\right\}+(1-\pi)\cdot\mathbb{P}_{X\sim q^{\otimes n}}\left\{\phi(X)\neq q\right\}\leq\delta\,. \tag{1}
$$
We use $\mathcal{B}_{\mathrm{B}}(p,q,\pi,\delta)$ to denote this problem, and define its sample complexity ${n}^{*}_{\mathrm{B}}(p,q,\pi,\delta)$ to be the smallest $n$ such that there exists a test $\phi$ which solves $\mathcal{B}_{\mathrm{B}}(p,q,\pi,\delta)$ .*
Note that the left-hand side of Equation 1 may also be written as $\mathbb{P}(\widehat{\Theta}≠\Theta)$ , where $\Theta$ is supported on $\{p,q\}$ with $\mathbb{P}(\Theta=p)=\pi$ , $\widehat{\Theta}=\phi(X)$ , and for any $n∈\mathbb{N}$ , the conditional distribution is $X|{\Theta=\theta}\sim\theta^{\otimes n}$ .
We also consider a prior-free version, where these two errors are analyzed separately.
**Definition 1.2 (Prior-free simple binary hypothesis testing)**
*Let $p$ , $q$ , and $\phi$ be as in Definition 1.1. We say that $\phi$ solves the simple binary hypothesis testing problem of $p$ versus $q$ with type-I error $\alpha$ , type-II error $\beta$ , and sample complexity $n$ , if
$$
\mathbb{P}_{X\sim p^{\otimes n}}\left\{\phi(X)\neq p\right\}\leq\alpha,\qquad\text{and}\qquad\mathbb{P}_{X\sim q^{\otimes n}}\left\{\phi(X)\neq q\right\}\leq\beta. \tag{2}
$$
We use $\mathcal{B}_{\mathrm{PF}}(p,q,\alpha,\beta)$ to denote this problem, and define its sample complexity ${n}^{*}_{\mathrm{PF}}(p,q,\alpha,\beta)$ to be the smallest $n$ such that there exists a test $\phi$ which solves $\mathcal{B}_{\mathrm{PF}}(p,q,\alpha,\beta)$ .*
In the remainder of this section, we restrict our attention to the Bayesian setting (Definition 1.1) for ease of presentation, but analogous claims hold for the prior-free setting, as well.
It is well-known that the likelihood ratio test (with a suitable threshold depending on the prior $\pi$) minimizes the average error probability [NeyPea33]. A natural question is to understand the relationship between the Bayes error (the left-hand side of Equation 1 when $\phi$ is the optimal test) and the sample size $n$. The Chernoff–Stein lemma gives the precise asymptotic dependence of the Bayes error on $n$, stating that the Bayes error equals $e^{-n\left(\mathrm{CI}(p,q)+o(1)\right)}$, where $\mathrm{CI}(p,q)$ is the Chernoff information between $p$ and $q$ [CovTho06]. Observe that the asymptotic error rate does not depend on the prior $\pi$ at all, summarized as follows in [CovTho06, Chapter 11]: “Essentially, the effect of the prior is washed out for large sample sizes.” Consequently, the error rate is symmetric in $p$ and $q$. Furthermore, the asymptotic rate suggests a guess for the sample complexity needed to obtain error $\delta$:
$$
{n}^{*}_{\mathrm{B}}(p,q,\pi,\delta)\stackrel{{\scriptstyle?}}{{\approx}}\frac{\log(1/\delta)}{\mathrm{CI}(p,q)}\asymp\frac{\log(1/\delta)}{\mathrm{h}^{2}(p,q)}, \tag{3}
$$
where $\mathrm{h}^{2}(p,q)$ is the Hellinger divergence between $p$ and $q$ , which satisfies $\mathrm{h}^{2}(p,q)\asymp\mathrm{CI}(p,q)$ [PolWu23, Chapter 16].
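For concreteness, both quantities in Equation 3 are easy to evaluate numerically for discrete distributions. The following Python sketch is ours, not from the paper; the grid search over $\lambda$ is a simple approximation. It computes $\mathrm{h}^{2}(p,q)=1-\sum_i\sqrt{p_iq_i}$ and $\mathrm{CI}(p,q)=\max_{\lambda\in[0,1]}\,-\log\sum_i p_i^{\lambda}q_i^{1-\lambda}$:

```python
import numpy as np

def hellinger_sq(p, q):
    """Squared Hellinger divergence h^2(p, q) = 1 - sum_i sqrt(p_i * q_i)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(1.0 - np.sum(np.sqrt(p * q)))

def chernoff_information(p, q, grid=10001):
    """Chernoff information CI(p, q) = max over lam in [0, 1] of
    -log sum_i p_i^lam q_i^(1 - lam), approximated by a grid search."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    lams = np.linspace(0.0, 1.0, grid)
    vals = np.array([np.sum(p**lam * q**(1.0 - lam)) for lam in lams])
    return float(np.max(-np.log(vals)))
```

On $p=(0.5,0.5)$ and $q=(0.7,0.3)$, both quantities come out near $0.021$, consistent with $\mathrm{h}^{2}(p,q)\asymp\mathrm{CI}(p,q)$.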
Interestingly, the sample complexity suggested by Equation 3 turns out to be accurate for the special case of the uniform prior ($\pi=1/2$). More precisely, [Ziv02] states that the sample complexity satisfies ${n}^{*}_{\mathrm{B}}(p,q,1/2,\delta)\asymp\frac{\log(1/\delta)}{\mathrm{h}^{2}(p,q)}$ under mild technical assumptions, such as $\mathrm{h}^{2}(p,q)≤ 0.5$ and $\delta≤ 1/4$. We consider this to be a success story for asymptotic results: precise asymptotics yielding accurate predictions for the finite-sample regime.
However, this success is rather brittle, and gaps appear between the asymptotic and non-asymptotic regimes as soon as $\pi→ 0$. Consider the simple example where $p$ and $q$ are Bernoulli distributions with biases $0$ and $\epsilon∈(0,1/4)$, respectively. Suppose that for a given prior $\pi∈(0,1/2)$, we ask for error probability $\delta:=\pi/10$, i.e., a non-trivial guarantee. Direct calculations show that ${n}^{*}_{\mathrm{B}}(p,q,\pi,\pi/10)\asymp\frac{\log(1/\pi)}{\epsilon}$, whereas ${n}^{*}_{\mathrm{B}}(q,p,\pi,\pi/10)\asymp\frac{1}{\epsilon}$. In contrast, the sample complexity suggested by Equation 3 for both problems is $\asymp\frac{\log(1/\pi)}{\epsilon}$, which does not correctly capture the sample complexity of the second problem. This example highlights that the sample complexity may be (i) asymmetric in $p$ and $q$, (ii) prior-dependent, and (iii) much smaller than predicted; in particular, it may lack the $\log(1/\delta)$ factor in Equation 3. All of these contradict the predictions from the asymptotic regime (cf. Equation 3).
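This asymmetry can be verified numerically. For $p=\mathrm{Ber}(0)$ and $q=\mathrm{Ber}(\epsilon)$, only $q$ can emit a $1$, so the Bayes-optimal test declares $q$ upon seeing any $1$ and, on the all-zeros outcome, picks the hypothesis with the larger posterior weight; the Bayes error therefore has a closed form, and the exact sample complexities can be found by a loop. A small sketch (the function name and the specific constants are ours):

```python
import math

def n_star_ber(pi, delta, eps, first_is_zero):
    """Smallest n so the Bayes error of testing Ber(0) vs Ber(eps) is <= delta.

    If first_is_zero, the prior pi sits on Ber(0) (the problem n*(p, q, pi, delta));
    otherwise pi sits on Ber(eps) (the problem n*(q, p, pi, delta)).
    Only Ber(eps) can emit a 1, so the optimal test declares Ber(eps) on seeing
    any 1; on the all-zeros outcome it picks the larger posterior weight.
    """
    weight_on_eps = (1 - pi) if first_is_zero else pi
    weight_on_zero = pi if first_is_zero else (1 - pi)
    n = 0
    while True:
        # Bayes error = best decision on the all-zeros outcome
        err = min(weight_on_zero, weight_on_eps * (1 - eps) ** n)
        if err <= delta:
            return n
        n += 1

pi, eps = 0.01, 0.1
n_pq = n_star_ber(pi, pi / 10, eps, first_is_zero=True)   # grows like log(1/pi)/eps
n_qp = n_star_ber(pi, pi / 10, eps, first_is_zero=False)  # grows like 1/eps
```

With $\pi=0.01$ and $\epsilon=0.1$, the first ordering needs roughly three times as many samples as the second, matching the $\log(1/\pi)$ gap above.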
For general $\pi∈(0,1/2]$ , [Ziv02] implies the following non-asymptotic bounds:
$$
\frac{\log(\pi/\delta)}{\mathrm{h}^{2}(p,q)}\lesssim{n}^{*}_{\mathrm{B}}(p,q,\pi,\delta)\lesssim\frac{\log(1/\delta)}{\mathrm{h}^{2}(p,q)}\,. \tag{4}
$$
The lower and upper bounds are not tight up to universal constants, and they visibly diverge when $\pi→ 0$ while $\delta/\pi$ remains relatively large. For example, when $\delta=\pi/10$ and $\pi→ 0$, the lower bound stays at a constant multiple of $1/\mathrm{h}^{2}(p,q)$, while the upper bound grows as $\log(1/\pi)/\mathrm{h}^{2}(p,q)$. As a sanity check, for any fixed $\pi$, as $\delta→ 0$, the non-asymptotic bounds in Equation 4 converge to the asymptotic prediction of Equation 3; in fact, the asymptotic regime kicks in as soon as $\delta≤\pi^{2}$. The earlier example involving Bernoulli distributions demonstrates that Equation 4 cannot be improved in general; i.e., it is the best possible bound in terms of the Hellinger divergence alone. The small-$\pi$ regime corresponds to the high-stakes situations mentioned earlier. Thus, a testing procedure informed by Equation 4 might collect more samples than needed, leading to the following question:
**Question 1.3**
*What is ${n}^{*}_{\mathrm{B}}(p,q,\pi,\delta)$ as a function of $p$ , $q$ , $\pi$ , and $\delta$ ?*
We remark that we are looking for a “simple” expression for ${n}^{*}_{\mathrm{B}}(p,q,\pi,\delta)$ in terms of $p$ , $q$ , $\pi$ , and $\delta$ , tight up to multiplicative constants. It is not hard to see that the sample complexity is dictated by how fast a certain $f$-divergence between the product distributions $p^{\otimes n}$ and $q^{\otimes n}$ grows with $n$. For the uniform prior, the relevant “divergence” is the total variation distance, which satisfies a two-sided polynomial relationship with the Hellinger divergence (and the Hellinger divergence tensorizes); somewhat surprisingly, this is why the Hellinger divergence neatly captures the sample complexity under a uniform prior. For a non-uniform prior, the relevant “divergence” is the $E_{\gamma}$-divergence (Definition 4.11), which does not satisfy a two-sided inequality with any standard tensorizing $f$-divergence, because $E_{\gamma}$ can be zero for distinct distributions. In particular, the divergence between $p^{\otimes n}$ and $q^{\otimes n}$ need not admit a simple expression in terms of $p$ and $q$ , and the proof technique for the uniform prior breaks down for a general prior.

The kind of simple expression sought in Question 1.3 can be extremely beneficial in algorithm design, allowing the statistician to quickly identify the more sample-efficient test among $\mathcal{B}_{\mathrm{B}}(p,q,\pi,\delta)$ and $\mathcal{B}_{\mathrm{B}}(p^{\prime},q^{\prime},\pi^{\prime},\delta^{\prime})$ by comparing ${n}^{*}_{\mathrm{B}}(p,q,\pi,\delta)$ with ${n}^{*}_{\mathrm{B}}(p^{\prime},q^{\prime},\pi^{\prime},\delta^{\prime})$ . The existing bounds in Equation 4 do not even allow for comparisons between ${n}^{*}_{\mathrm{B}}(p,q,\pi,\delta)$ and ${n}^{*}_{\mathrm{B}}(q,p,\pi,\delta)$ , which can differ by a multiplicative factor as large as $\log(1/\delta)$ . Moreover, such expressions may reveal additional properties of the sample complexity that would otherwise be difficult to establish from its definition.
For example, reexamining the uniform prior setting, the Hellinger divergence satisfies additional properties such as joint convexity in $p$ and $q$ (being an $f$ -divergence). These properties have been crucial in developing efficient algorithms for hypothesis testing under the additional constraints of communication and privacy (with a uniform prior) [PenJL24, PenAJL23]. This leads to our second question:
**Question 1.4**
*Do the techniques developed for the uniform prior setting transfer to the general prior setting?*
We provide affirmative answers to both questions in Section 2.
## 2 Our results
In this section, we present our main results. Sections 2.1 and 2.2 provide tight sample complexity results for the Bayesian and prior-free simple binary hypothesis testing problems (Definitions 1.1 and 1.2). Sections 2.3 and 2.4 focus on applications to distributed and robust hypothesis testing. Section 2.5 considers the weak detection regime.
### 2.1 Sample complexity of Bayesian simple hypothesis testing
We begin with some notation. For a parameter $\lambda∈[0,1]$ , define the $f$ -divergence $\mathcal{H}_{\lambda}(p,q):=1-\sum_{i}p_{i}^{\lambda}q_{i}^{1-\lambda}$ , which is equal to the Hellinger divergence when $\lambda=1/2$ . We use $I(X;Y)$ to denote the mutual information between random variables $X$ and $Y$ . A particularly important quantity will be $I(\Theta;X_{1})$ , with $\Theta$ and $X_{1}$ as in Definition 1.1. Observe that $I(\Theta;X_{1})$ depends on $\pi$ , $p$ , and $q$ .
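Both quantities are straightforward to compute for discrete distributions; the following sketch (ours, for illustration) evaluates $\mathcal{H}_{\lambda}(p,q)$ directly and computes $I(\Theta;X_{1})$ via the identity $I(\Theta;X_{1})=\pi\,\mathrm{KL}(p\,\|\,m)+(1-\pi)\,\mathrm{KL}(q\,\|\,m)$ with mixture $m=\pi p+(1-\pi)q$:

```python
import numpy as np

def h_lambda(p, q, lam):
    """H_lambda(p, q) = 1 - sum_i p_i^lam q_i^(1 - lam); equals h^2 at lam = 1/2."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(1.0 - np.sum(p**lam * q**(1.0 - lam)))

def mutual_info_theta_x1(p, q, pi):
    """I(Theta; X_1) in nats, where P(Theta = p) = pi and X_1 | Theta ~ Theta.
    Uses I(Theta; X_1) = pi * KL(p || m) + (1 - pi) * KL(q || m), m = pi*p + (1-pi)*q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = pi * p + (1.0 - pi) * q
    def kl(a, b):
        mask = a > 0           # 0 * log(0/b) = 0 by convention
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))
    return pi * kl(p, m) + (1.0 - pi) * kl(q, m)
```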
Our main result is Theorem 2.1, which characterizes ${n}^{*}_{\mathrm{B}}(p,q,\pi,\delta)$ in terms of $I(\Theta;X_{1})$ and $\mathcal{H}_{\lambda}(p,q)$ . As mentioned earlier (in Footnote 1), the sample complexity crucially depends on how small $\delta$ is compared to $\pi$ . The following result handles these regimes separately:
**Theorem 2.1 (Bayesian simple hypothesis testing)**
*Let $p$ and $q$ be two arbitrary discrete distributions satisfying $\mathrm{h}^{2}(p,q)≤ 0.125$ . Let $\pi∈(0,1/2]$ be the prior parameter and $\delta∈(0,\pi/4)$ be the desired average error probability.
1. (Linear error probability) If $\delta∈[\pi/100,\pi/4]$ , then for $\lambda=\frac{0.5\log 2}{\log(1/\pi)}$ , we have
$$
{n}^{*}_{\mathrm{B}}(p,q,\pi,\delta)\asymp\frac{\pi\log(1/\pi)}{I(\Theta;X_{1})}\asymp\frac{1}{\mathcal{H}_{1-\lambda}(p,q)}\,. \tag{5}
$$
2. (Sublinear error probability) If $\delta∈\left[\pi^{2},\frac{\pi}{100}\right]$ , then for $T=\big\lfloor\frac{\log\left(\pi/\delta\right)}{\log 8}\big\rfloor$ and $\pi^{\prime}:=\pi^{\frac{1}{T}}$ , we have
$$
{n}^{*}_{\mathrm{B}}(p,q,\pi,\delta)\asymp\log\left(\pi/\delta\right)\cdot{n}^{*}_{\mathrm{B}}(p,q,\pi^{\prime},\pi^{\prime}/4)\,, \tag{6}
$$
where the last expression can be evaluated using Equation 5.
3. (Polynomial error probability) If $\delta≤\pi^{2}$ , then ${n}^{*}_{\mathrm{B}}(p,q,\pi,\delta)\asymp\frac{\log(1/\delta)}{\mathrm{h}^{2}(p,q)}$ .*
Theorem 2.1 gives the sample complexity for all $p$ , $q$ , $\pi$ , and $\delta$ up to constant multiplicative factors, answering Question 1.3. We explain the intuition for each regime below:
1. First suppose $\delta∈[\pi/100,\pi/4]$ . If $\delta≥\pi$ , the problem is trivial (we can simply output “ $q$ ” without observing any samples); thus, this regime requires a probability of error that is a constant fraction better than the trivial test. We consider this to be the most insightful regime. The result Equation 5 states that the sample complexity is $\asymp\frac{h_{\mathrm{bin}}(\pi)}{I(\Theta;X_{1})}$ , where $h_{\mathrm{bin}}(·)$ is the binary entropy function. A loose interpretation is as follows: initially, the uncertainty of $\Theta$ is $h_{\mathrm{bin}}(\pi)$ and each sample provides roughly $I(\Theta;X_{1})$ amount of information. This argument, when made rigorous, is simply Fano’s inequality, leading to a lower bound on the sample complexity in terms of this expression. Proving a matching upper bound is rather sophisticated and constitutes a key technical result of our work.
2. In the sublinear regime of $\delta$ , we follow a reduction-based approach to show that solving $\mathcal{B}_{\mathrm{B}}(p,q,\pi,\delta)$ is roughly equivalent to solving $T$ independent instances of $\mathcal{B}_{\mathrm{B}}(p,q,\pi^{\prime},\delta^{\prime})$ for $\pi^{\prime}=\pi^{1/T}$ and $\delta^{\prime}=\delta^{1/T}$ . The sample complexity upper bound follows from the standard success-boosting procedure, i.e., going from a larger error probability $\delta^{\prime}$ to a smaller error probability $\delta$ by taking the median output of the $T$ tests. Somewhat surprisingly, the lower bound obtained by a straightforward application of Fano’s inequality turns out to be loose. The tight lower bound is established using a new failure-boosting argument, which shows that the median is near-optimal. We can choose $T$ so that the modified instance $\mathcal{B}_{\mathrm{B}}(p,q,\pi^{\prime},\delta^{\prime})$ lies in the regime of Equation 5.
3. The final regime of $\delta$ corresponds to the asymptotic regime, and the result follows from Equation 4.
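To make the case analysis concrete, the three regimes can be transcribed into a small piecewise function. The sketch below is ours: all universal constants hidden by $\asymp$ are set to one, so it returns an order-of-magnitude proxy rather than an exact sample count.

```python
import math

def h_sq(p, q):
    """Squared Hellinger divergence, 1 - sum_i sqrt(p_i * q_i)."""
    return 1.0 - sum(math.sqrt(a * b) for a, b in zip(p, q))

def h_lam(p, q, lam):
    """H_lambda divergence, 1 - sum_i p_i^lam q_i^(1 - lam)."""
    return 1.0 - sum(a**lam * b**(1.0 - lam) for a, b in zip(p, q))

def n_star_formula(p, q, pi, delta):
    """Sample-complexity formula of Theorem 2.1 with all universal constants set
    to 1 (so this is an order-of-magnitude proxy, not an exact count)."""
    if delta >= pi / 100:                     # linear regime: delta in [pi/100, pi/4]
        lam = (0.5 * math.log(2)) / math.log(1 / pi)
        return 1.0 / h_lam(p, q, 1.0 - lam)   # Equation 5
    if delta >= pi ** 2:                      # sublinear regime
        T = max(math.floor(math.log(pi / delta) / math.log(8)), 1)
        pi_prime = pi ** (1.0 / T)
        # Equation 6: recurse into the linear-regime formula
        return math.log(pi / delta) * n_star_formula(p, q, pi_prime, pi_prime / 4)
    return math.log(1 / delta) / h_sq(p, q)   # polynomial regime: delta <= pi^2
```

On the sublinear regime, the function recurses into the linear-regime formula through Equation 6, mirroring the reduction described above.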
### 2.2 Sample complexity of prior-free simple hypothesis testing
Theorem 2.1 can be used to find the sample complexity of the prior-free formulation of hypothesis testing from Definition 1.2, as shown in the following theorem:
**Theorem 2.2 (Prior-free simple hypothesis testing)**
*Let $p$ and $q$ be two arbitrary distributions satisfying $\mathrm{h}^{2}(p,q)≤ 0.125$ . Let $\alpha∈(0,1/8]$ be the type-I error and $\beta∈(0,1/8]$ be the type-II error. Without loss of generality, let $\alpha≤\beta$ . Then ${n}^{*}_{\mathrm{PF}}\left(p,q,\alpha,\beta\right)\asymp{n}^{*}_{\mathrm{B}}\big(p,q,\frac{\alpha}{\alpha+\beta},\frac{\alpha\beta}{\alpha+\beta}\big)$ , where the last expression can be calculated using Theorem 2.1.*
Thus, Theorem 2.1 also implies tight sample complexity results for the prior-free setting.
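The reduction in Theorem 2.2 is just a change of parameters, which can be written down directly (a trivial sketch, with a helper name of our choosing):

```python
def prior_free_to_bayesian(alpha, beta):
    """Map the prior-free errors (alpha, beta) to the equivalent Bayesian
    parameters (pi, delta) of Theorem 2.2, which assumes alpha <= beta."""
    if alpha > beta:                  # enforce alpha <= beta w.l.o.g.
        alpha, beta = beta, alpha
    pi = alpha / (alpha + beta)
    delta = alpha * beta / (alpha + beta)
    return pi, delta
```

For instance, $\alpha=\beta=\delta^{\prime}$ maps to $(\pi,\delta)=(1/2,\delta^{\prime}/2)$, recovering the uniform-prior setting.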
### 2.3 Distributed simple binary hypothesis testing
The sample complexity formula established in Theorem 2.1 (cf. Question 1.4) has numerous consequences. Recall that Theorem 2.1 characterizes the sample complexity in terms of either the mutual information or the Hellinger divergence. Both characterizations have favorable structural properties and lead to algorithmic and statistical results in the distributed setting, as we show below.
Distributed statistical inference is typically studied in the following framework: Samples are collected by $n$ observers, who then communicate information about their samples (such as summary statistics) to a central server, which tries to solve the inference problem. The transmission from an observer to the central server may be subject to different constraints, such as a bandwidth constraint that limits the number of bits the observer may transmit, or a privacy constraint that requires all transmissions to protect the privacy of the observer’s data. We define the generic problem below (for simplicity, we consider the special case where each user employs the same modifier $f$):
**Definition 2.3 (Distributed simple binary hypothesis testing)**
*Let $p$ and $q$ be two distributions over $\mathcal{X}$ and define the prior parameter $\pi∈(0,1/2)$ and the error probability $\delta∈(0,1)$ . Let $\mathcal{Y}$ be a fixed domain and let $\mathcal{C}$ be a fixed set of stochastic maps from $\mathcal{X}→\mathcal{Y}$ , representing the constraints. We say that a test-modifier pair $(\phi,f)$ with test $\phi:\cup_{n=0}^{∞}\mathcal{X}^{n}→\{p,q\}$ and $f∈\mathcal{C}$ solves the Bayesian simple binary hypothesis testing problem of distinguishing $p$ versus $q$ , under the prior $(\pi,1-\pi)$ , with sample complexity $n$ , probability of error $\delta$ , under constraints $\mathcal{C}$ , if for $Y_{1},...,Y_{n}$ , generated independently (conditioned on the $X_{i}$ ’s) as $Y_{i}=f(X_{i})$ , the following holds:
$$
\pi\cdot\mathbb{P}_{X\sim p^{\otimes n},\,Y_{1},\dots,Y_{n}}\left\{\phi\left(Y_{1},\dots,Y_{n}\right)\neq p\right\}+(1-\pi)\cdot\mathbb{P}_{X\sim q^{\otimes n},\,Y_{1},\dots,Y_{n}}\left\{\phi\left(Y_{1},\dots,Y_{n}\right)\neq q\right\}\leq\delta\,.
$$
We use $\mathcal{B}_{\mathrm{B},\mathrm{constrained}}(p,q,\pi,\delta,\mathcal{C})$ to denote this problem, and define its sample complexity to be the smallest $n$ such that there exists a test-modifier pair $(\phi,f)$ which solves $\mathcal{B}_{\mathrm{B},\mathrm{constrained}}(p,q,\pi,\delta,\mathcal{C})$ ; we denote the sample complexity by ${n}^{*}_{\mathrm{B},\mathrm{constrained}}(p,q,\pi,\delta,\mathcal{C})$ .*
We are interested in understanding the “costs” of constraints $\mathcal{C}$ , measured in terms of statistical complexity and runtime. Starting with the statistical cost, we wish to compare ${n}^{*}_{\mathrm{B},\mathrm{constrained}}(p,q,\pi,\delta,\mathcal{C})$ and ${n}^{*}_{\mathrm{B}}(p,q,\pi,\delta)$ . The computational cost refers to the runtime of finding a modifier $f∈\mathcal{C}$ such that the resulting sample complexity with the choice of modifier $f$ is comparable to the optimal choice, ${n}^{*}_{\mathrm{B},\mathrm{constrained}}(p,q,\pi,\delta,\mathcal{C})$ . Once the modifier $f$ is fixed, the optimal test $\phi$ is the likelihood ratio test between the push-forward distribution of $f$ under $p$ and $q$ . The uniform prior setting was studied extensively in [PenJL24] for communication constraints (see also [BhaNOP21]) and [PenAJL23] for local privacy constraints. At a high level, [BhaNOP21, PenJL24, PenAJL23] characterized the contraction of the Hellinger divergence under these constraints and the time complexity of finding a good $f∈\mathcal{C}$ . Due to the reliance on the Hellinger divergence, a naive application of these results to an arbitrary $\pi$ and $\delta$ would incur a superfluous multiplicative $\log(1/\delta)$ factor in the resulting sample complexity, akin to Equation 4. In this section, we show much improved bounds by using the structural properties of the sample complexity established in Theorem 2.1 above.
We begin with communication-efficient simple binary hypothesis testing, also known as “decentralized detection” [Tsitsiklis93]. The following result is obtained by using the characterization of the sample complexity in terms of mutual information in Theorem 2.1, with the result of [BhaNOP21] on the contraction of mutual information under quantization:
**Theorem 2.4 (Statistical and computational costs of communication for hypothesis testing)**
*Let $p$ and $q$ be two arbitrary distributions over $\mathcal{X}$ , and let $\pi∈(0,1/2]$ and $\delta∈(0,\pi/4)$ . Let $D≥ 2$ be an integer denoting the communication budget. Consider the setting in Definition 2.3 with $\mathcal{Y}=[D]$ and $\mathcal{C}$ being the set of all channels from $\mathcal{X}→\mathcal{Y}$ , and denote the resulting sample complexity by ${n}^{*}_{\mathrm{B},\mathrm{comm}}(p,q,\pi,\delta,D)$ . Let ${n}^{*}:={n}^{*}_{\mathrm{B}}(p,q,\pi,\delta)$ . Then we have the following:
1. (Statistical cost) ${n}^{*}\lesssim{n}^{*}_{\mathrm{B},\mathrm{comm}}(p,q,\pi,\delta,D)\lesssim{n}^{*}\max\left(1,\frac{\log({n}^{*}/\pi)}{D}\right)$ .
2. (Computational cost) There exists an algorithm that takes as input $p,q,\pi,\delta$ , and $D$ , runs in time $\mathrm{poly}(|\mathcal{X}|,D,\log(1/\pi),\log(1/\delta))$ , and outputs a quantizer $\widehat{f}∈\mathcal{C}$ such that the resulting sample complexity is at most ${n}^{*}\max\left(1,\frac{\log({n}^{*}/\pi)}{D}\right)$ .*
Observe that for large enough $D$ , the sample complexity is comparable to the unconstrained sample complexity. A naive application of [BhaNOP21, PenJL24] with Equation 4 would instead yield a sample complexity of $n^{\prime}·\max\left(1,\frac{\log(n^{\prime})}{D}\right)·\log(1/\delta)$ , for $n^{\prime}:={n}^{*}_{B}(p,q,1/2,1/8)\asymp 1/\mathrm{h}^{2}(p,q)$ , which can be larger than ${n}^{*}$ by a factor of $\log(1/\delta)$ for all values of $D$ .
It is likely that a more whitebox analysis of [BhaNOP21] can replace $\log({n}^{*}/\pi)$ from Theorem 2.4 with $\log({n}^{*})$ . The proof of Theorem 2.4 is given in Section 9.1 and crucially relies on the mutual information characterization of sample complexity in Theorem 2.1.
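To illustrate what the quantization step is doing (this is a brute-force toy of our own, not the polynomial-time algorithm of Theorem 2.4 or [BhaNOP21]), one can enumerate all deterministic maps $f:\mathcal{X}→[D]$ and keep the one whose push-forward distributions are farthest apart in $\mathrm{h}^{2}$; the enumeration is exponential in $|\mathcal{X}|$, which is precisely the cost the efficient algorithm avoids:

```python
import itertools
import math

def hellinger_sq(p, q):
    """Squared Hellinger divergence, 1 - sum_i sqrt(p_i * q_i)."""
    return 1.0 - sum(math.sqrt(a * b) for a, b in zip(p, q))

def push_forward(dist, f, D):
    """Distribution of f(X) on [D] when X ~ dist and f is a tuple f[x] in [D]."""
    out = [0.0] * D
    for x, mass in enumerate(dist):
        out[f[x]] += mass
    return out

def best_deterministic_quantizer(p, q, D):
    """Brute-force search (exponential in |X|) for the deterministic map
    f: X -> [D] maximizing h^2 between the push-forwards of p and q."""
    best_f, best_val = None, -1.0
    for f in itertools.product(range(D), repeat=len(p)):
        val = hellinger_sq(push_forward(p, f, D), push_forward(q, f, D))
        if val > best_val:
            best_f, best_val = f, val
    return best_f, best_val
```

By the data-processing inequality, the value found can never exceed $\mathrm{h}^{2}(p,q)$ itself.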
We now turn our attention to privacy constraints, starting with a definition:
**Definition 2.5 (Local differential privacy)**
*For two finite domains $\mathcal{X}$ and $\mathcal{Y}$ , we say that a randomized mechanism $\mathcal{M}:\mathcal{X}→\mathcal{P}(\mathcal{Y})$ is $\epsilon$ -LDP for an $\epsilon>0$ if for all $x,x^{\prime}∈\mathcal{X}$ and $y∈\mathcal{Y}$ , we have $\mathbb{P}(\mathcal{M}(x)=y)≤ e^{\epsilon}·\mathbb{P}(\mathcal{M}(x^{\prime})=y)$ .*
Similar to communication constraints, the two main goals are understanding the statistical and computational costs of privacy. Even under a uniform prior, the statistical cost of privacy (i.e., contraction of the Hellinger divergence under LDP constraints) can be rather delicate, as shown in [PenAJL23]. Though one can similarly study the contraction of $\mathcal{H}_{\lambda}(·,·)$ under local privacy, we instead focus on understanding the computational costs of privacy, which effectively asks us to find the (approximate) minimizer of ${n}^{*}(\mathcal{M}(p),\mathcal{M}(q),\pi,\delta)$ over private channels $\mathcal{M}$ . Here, $\mathcal{M}(p)$ denotes the push-forward measure of $p$ under the mechanism $\mathcal{M}$ . While it is unclear how to solve such a generic optimization problem in a computationally-efficient manner, [PenAJL23] gave an efficient algorithm for minimizing $g(\mathcal{M}(p),\mathcal{M}(q))$ over quantized private channels when $g(·,·)$ is a jointly quasi-concave function. Fortunately, Theorem 2.1 states that ${n}^{*}_{\mathrm{B}}$ is also characterized by an $f$ -divergence, so [PenAJL23] is applicable.
**Theorem 2.6 (Computational costs of privacy for hypothesis testing)**
*Let $p$ and $q$ be two arbitrary distributions over $\mathcal{X}$ . Let $\pi∈(0,1/2]$ and $\delta∈(0,\pi/4)$ . Let $\epsilon>0$ be the privacy budget. Consider the setting in Definition 2.3, with $\mathcal{Y}=\mathcal{X}$ and $\mathcal{C}$ equal to the set of all $\epsilon$ -LDP channels from $\mathcal{X}→\mathcal{Y}$ , and denote the resulting sample complexity by ${n}^{*}_{\mathrm{B},\mathrm{priv}}(p,q,\pi,\delta,\epsilon)$ . Let ${n}^{*}_{\mathrm{priv}}:={n}^{*}_{\mathrm{B},\mathrm{priv}}(p,q,\pi,\delta,\epsilon)$ . There exists an algorithm $\mathcal{A}$ that takes as inputs $p,q,\pi,\delta$ , $\epsilon$ , and a communication budget parameter $D$ , runs in time $\mathrm{poly}(|\mathcal{X}|^{D^{2}},2^{\mathrm{poly}(D)},\log(1/\pi),\log(1/\delta))$ , and outputs an $\epsilon$ -LDP mechanism $\mathcal{M}:\mathcal{X}→[D]$ , such that the resulting sample complexity with $\mathcal{M}$ is at most ${n}^{*}_{\mathrm{priv}}·\max\left(1,\frac{\log({n}^{*}_{\mathrm{priv}}/\pi)}{D}\right)$ .*
We remind the reader that the resulting sample complexity has no explicit multiplicative factor of $\log(1/\delta)$ for $D$ moderately large. A naive application of [PenAJL23] would yield an $\epsilon$ -LDP mechanism with sample complexity $\log(1/\delta)· n^{\prime}·\max\left(1,\frac{\log(n^{\prime})}{D}\right)$ , for $n^{\prime}:={n}^{*}_{\mathrm{B},\mathrm{priv}}(p,q,1/2,1/8)$ . We defer the proof of Theorem 2.6 to Section 9.2.
### 2.4 Robust simple binary hypothesis testing
In some scenarios, it may not be possible to know the distributions $p$ and $q$ exactly, but only up to a small error. In this setting, we want to guess “ $p$ ” (similarly, “ $q$ ”) even if the samples are generated by $p^{\prime}∈ D_{1}$ (respectively, $q^{\prime}∈ D_{2}$ ), where $D_{1}$ and $D_{2}$ are the sets of all distributions within total variation distance $\epsilon$ of $p$ and $q$ , respectively. (Other distance metrics over distributions could be considered, e.g., the Hellinger divergence; our results should continue to hold as long as they fit into the least favorable distributions framework of [Huber65, HubStr73].) This setting is termed robust hypothesis testing, and it has a rich history in statistics, with seminal contributions by [Huber65].
**Definition 2.7 (Robust Bayesian simple binary hypothesis testing)**
*Consider the setting in Definition 1.1. Let $\epsilon<\mathrm{TV}(p,q)/2$ , and define $D_{1}:=\{p^{\prime}:\mathrm{TV}(p,p^{\prime})<\epsilon\}$ and $D_{2}:=\{q^{\prime}:\mathrm{TV}(q,q^{\prime})<\epsilon\}$ . We say that a (randomized) test $\phi:\cup_{n=0}^{∞}\mathcal{X}^{n}→\{p,q\}$ solves the Bayesian simple binary hypothesis testing problem of distinguishing $p$ versus $q$ , under the prior $(\pi,1-\pi)$ , with sample complexity $n$ , probability of error $\delta$ , and contamination level $\epsilon$ , if
$$
\pi\cdot\sup_{p^{\prime}\in D_{1}}\mathbb{P}_{X\sim p^{\prime\otimes n}}\left\{\phi(X)\neq p\right\}+(1-\pi)\cdot\sup_{q^{\prime}\in D_{2}}\mathbb{P}_{X\sim q^{\prime\otimes n}}\left\{\phi(X)\neq q\right\}\leq\delta.
$$
We denote the problem by $\mathcal{B}_{\mathrm{B},\mathrm{rob}}(p,q,\pi,\delta,\epsilon)$ and its sample complexity by ${n}^{*}_{\mathrm{B},\mathrm{rob}}(p,q,\pi,\delta,\epsilon)$ .*
[Huber65] established that Definition 2.7 reduces to solving $\mathcal{B}_{\mathrm{B}}(p^{\prime},q^{\prime},\pi,\delta)$ for the least favorable distribution (LFD) pair $p^{\prime}∈ D_{1}$ and $q^{\prime}∈ D_{2}$ . Moreover, given $p$ , $q$ , and $\epsilon$ , the LFD pair can be computed by a computationally-efficient algorithm. Applying our result to the LFD pair yields the following corollary:
**Corollary 2.8 (Corollary of Theorem 2.1 and [Huber65])**
*Let $p$ and $q$ be two arbitrary distributions, and suppose $\pi∈(0,1/2]$ , $\delta∈(0,\pi/4)$ , and $\epsilon∈(0,\mathrm{TV}(p,q)/2)$ . There is a computationally efficient algorithm that takes $p$ , $q$ , and $\epsilon$ as inputs and computes distributions $p^{\prime}$ and $q^{\prime}$ such that ${n}^{*}_{\mathrm{B},\mathrm{rob}}\left(p,q,\pi,\delta,\epsilon\right)\asymp{n}^{*}_{\mathrm{B}}\left(p^{\prime},q^{\prime},\pi,\delta\right)$ , where the last expression can be evaluated using Theorem 2.1.*
2.5 Large error probability regime: Weak detection
Finally, we turn our attention to the large error probability regime, often called the “weak detection regime.” In this regime, we allow the estimators to make an error with probability $\delta=\pi(1-\gamma)$ , for $\gamma∈(0,1)$ close to 0. Recall that $\delta=\pi$ is trivially achievable, so the regime $\delta=\pi(1-\gamma)$ asks the estimator to be only “very slightly” better than trivial. Even for a uniform prior, it is rather surprising that the exact sample complexity is not known; the tightest known results are
$$
\displaystyle\frac{\gamma^{2}}{\mathrm{h}^{2}(p,q)}\lesssim{n}^{*}_{\mathrm{B}}\left(p,q,1/2,\frac{1-\gamma}{2}\right)\lesssim\frac{\gamma}{\mathrm{h}^{2}(p,q)} \tag{7}
$$
(see, e.g., [PolWu23, Eq. (14.19)]). In fact, as we show, these bounds happen to be tight in the worst case, i.e., there exist $(p^{\prime},q^{\prime})$ and $(p^{\prime\prime},q^{\prime\prime})$ with $\mathrm{h}^{2}(p^{\prime},q^{\prime})=\mathrm{h}^{2}(p^{\prime\prime},q^{\prime\prime})$ such that the sample complexity of $(p^{\prime},q^{\prime})$ is given by the lower bound, while that of $(p^{\prime\prime},q^{\prime\prime})$ is given by the upper bound (cf. Example 8.1). Thus, surprisingly, even for a uniform prior, the Hellinger divergence does not exactly characterize the sample complexity in the weak detection regime. Even more surprisingly, for $\pi≠ 1/2$ , the sample complexity in the weak detection regime may not even depend on $\gamma$ for all small values of $\gamma$ . We establish the following result:
**Theorem 2.9 (Surprises in weak detection; Informal version)**
*Consider the setting in Definition 1.1. Let $\delta=\pi(1-\gamma)$ , for some $\gamma∈(0,1/2)$ . Then
$$
\gamma\max\left\{\gamma,|0.5-\pi|\right\}\lesssim\frac{{n}^{*}_{\mathrm{B}}\left(p,q,\pi,\pi(1-\gamma)\right)}{{n}^{*}_{\mathrm{B}}\left(p,q,\pi,\pi/4\right)}\lesssim\max\left\{\gamma,|0.5-\pi|\right\}.
$$
In particular, for $\pi≤ 1/4$ , we have $\gamma\lesssim\frac{{n}^{*}_{\mathrm{B}}\left(p,q,\pi,\pi(1-\gamma)\right)}{{n}^{*}_{\mathrm{B}}\left(p,q,\pi,\pi/4\right)}\lesssim 1$ , and the upper bounds are tight in the worst case (cf. Example 8.3).*
The formal version of this result is presented in Theorem 8.2. The upper and lower bounds differ by a factor of $\gamma$ , with the upper bound not decreasing with $\gamma$ for any non-uniform prior. Perhaps surprisingly, there exist distribution pairs whose sample complexity with error probability $\pi/4$ is almost as large as the sample complexity with error probability $\pi(1-\gamma)$ , for arbitrarily small $\gamma$ . To be precise, for any $n_{0}∈\mathbb{N}$ and any non-uniform prior, there exist distributions, given in Example 8.3, such that the error probability with $n$ samples is $\pi$ (the trivial error) for all $n∈\{0,...,n_{0}\}$ , whereas ${n}^{*}_{\mathrm{B}}(p,q,\pi,\pi/4)\lesssim n_{0}$ . Hence, weak detection might be as hard as strong detection (up to constant factors).
3 Related work
Simple binary hypothesis testing is a classical problem that has been studied for over a century. The famous Neyman–Pearson lemma [NeyPea33] states that the optimal test is obtained by thresholding the likelihood ratio. The lemma allows calculation of the decision regions, but does not provide the resulting errors of the optimal test. It is of great practical and theoretical interest to determine how many i.i.d. samples are necessary to obtain errors that are below some desired threshold, and this problem has accordingly been studied extensively within the statistics and information theory communities. Almost all of this work focuses on asymptotic properties of the errors for large sample sizes; i.e., identifying exponents $E$ such that the error(s) decay as $e^{-n(E+o(1))}$ . These results are included in textbooks such as [CovTho06, PolWu23], and we briefly survey them: the Stein regime considers the exponential rate of decay of the type-II error when the type-I error is fixed; the Chernoff regime considers all possible pairs $(E_{0},E_{1})$ such that the type-I and type-II errors decay exponentially fast with exponents $E_{0}$ and $E_{1}$ ; and the Bayesian setting considers the exponential rate of decay of the Bayes error. In all of these problems, the exponents are neatly characterized in terms of information-theoretic quantities such as the KL divergence and the Chernoff information between $p$ and $q$ . Non-asymptotic bounds on the errors have appeared in prior works such as [Stra62, PolPooVer10], using Berry–Esseen type bounds to analyse the concentration of the likelihood ratio. These works derive upper and lower bounds on the type-II error in terms of the type-I error and the number of samples $n$ . Deriving analytical sample complexity bounds from these results appears difficult.
We may derive sample complexity bounds computationally, but these are unlikely to be tight within constants, since they rely on moments of the likelihood ratio and not on the divergences identified in this paper. The problem of deriving sample complexity bounds that are tight up to multiplicative constants gained traction in the early 2000s with the emergence of property testing [Gold17], and in particular, distribution testing [Can22], within the theoretical computer science community. To our knowledge, the sample complexity of simple binary hypothesis testing was first explicitly stated in [Ziv02], although it was likely well-known before that. There, it was shown that the sample complexity is $\asymp\log(1/\delta)/\mathrm{h}^{2}(p,q)$ , where $\delta$ is the desired type-I and type-II error or the Bayes error. Distributed hypothesis testing under information constraints dates back to the work of [Tsitsiklis93], which studied an asymptotic version of hypothesis testing under communication constraints. A relatively recent resurgence in interest has led to numerous contributions towards obtaining sample complexity bounds for distribution estimation under communication and privacy constraints [AchCT20-I, AchCT20-II, CheKO23]. The problem of simple hypothesis testing (binary and $M$ -ary) under constraints has been studied in [CanKMSU19, BunKSW19, GopKKNWZ20, PenJL24, PenAJL23, AliBS23].
4 Preliminaries
Notation:
Throughout this paper, we assume that the domain $\mathcal{X}$ is discrete, taken to be $\mathbb{N}$ without loss of generality. For distributions $p_{1},...,p_{n}$ over $\mathcal{X}$ , we use $\prod_{i=1}^{n}p_{i}$ to denote the product distribution. When $p_{1},...,p_{n}$ are identically equal to a distribution $p$ , we denote the joint distribution by $p^{\otimes n}$ . When we have a setting with just two distributions $p$ and $q$ , we use $p_{i}$ and $q_{i}$ to denote $p(i)$ and $q(i)$ , respectively. We denote the measure of a set $A\subseteq\mathcal{X}$ under $p$ by $\mathbb{P}_{p}(A)$ . For a bias $a∈[0,1]$ , we denote the Bernoulli distribution with bias $a$ by $\mathrm{Ber}(a)$ .
For two distributions $p$ and $q$ , we use $\mathrm{TV}(p,q)=0.5\mathbb{E}_{q}\left[\left|\frac{dp}{dq}-1\right|\right]$ to denote the total variation distance; $\mathrm{h}^{2}(p,q)=0.5\mathbb{E}_{q}\left[\left|\sqrt{\frac{dp}{dq}}-1\right|^{2}\right]$ to denote the Hellinger divergence; and $\mathrm{KL}(p,q)=\mathbb{E}_{q}\left[\frac{dp}{dq}\log\left(\frac{dp}{dq}\right)\right]$ , whenever finite, to denote the Kullback–Leibler divergence, where $\frac{dp}{dq}$ denotes the Radon–Nikodym derivative. We define additional $f$ -divergences in Section 4.2. For two random variables $X$ and $Y$ , we use $I(X;Y)$ to denote the mutual information between $X$ and $Y$ .
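On a finite domain, these divergences reduce to finite sums over the support. The following sketch (illustrative Python, ours rather than the paper's) computes them directly from the definitions above, with distributions represented as probability vectors:

```python
import math

def tv(p, q):
    # TV(p,q) = (1/2) * sum_i |p_i - q_i|
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def hellinger_sq(p, q):
    # h^2(p,q) = (1/2) * sum_i (sqrt(p_i) - sqrt(q_i))^2
    return 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))

def kl(p, q):
    # KL(p,q) = sum_i p_i * log(p_i / q_i), with the convention 0 * log(0/q) = 0
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

p, q = [0.5, 0.5], [0.25, 0.75]
print(tv(p, q), hellinger_sq(p, q), kl(p, q))
```

Note that `kl` assumes $p$ is absolutely continuous with respect to $q$ (otherwise the divergence is infinite), matching the "whenever finite" caveat above.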
For a scalar $x∈[0,1]$ , we shall use $\overline{x}$ as shorthand for $1-x$ . All logarithms are natural logarithms. We use $\mathrm{poly}(·)$ to denote an expression that is polynomial in its arguments. We also use the standard notation $\lesssim,\gtrsim,\asymp$ to hide absolute constant factors in the relationships.
Let $P_{e}^{(n)}$ denote the Bayesian error probability with $n$ i.i.d. samples, i.e., the minimum error probability over all tests $\phi$ in Equation 1. When $n=1$ , we also use the notation $P_{e}$ .
Finally, we will focus on pairs of distributions $p$ and $q$ that are sufficiently far apart, in the sense of ${n}^{*}_{\mathrm{PF}}\left(p,q,\frac{1}{4},\frac{1}{4}\right)≥ 2$ . This condition implies that one needs to observe at least two samples to ensure that both type-I and type-II errors are at most $1/4$ . We consider this to be a fairly mild regularity condition. If such a condition does not hold, then the sample complexity might display some degenerate behavior, contradicting even Equation 4. For example, if $p$ and $q$ have non-overlapping supports, then ${n}^{*}_{\mathrm{B}}(p,q,\pi,\delta)≤ 1$ and ${n}^{*}_{\mathrm{PF}}(p,q,\alpha,\beta)≤ 1$ for all parameter choices $\alpha,\delta,\beta,\pi$ . We note that a non-trivial upper bound on $\mathrm{h}^{2}(p,q)$ is also assumed in [Ziv02, Theorem 4.7]. The condition ${n}^{*}_{\mathrm{PF}}(p,q,1/4,1/4)≥ 2$ is guaranteed if $\mathrm{h}^{2}(p,q)≤ 0.125$ because ${n}^{*}_{\mathrm{PF}}(p,q,1/4,1/4)≥{n}^{*}_{\mathrm{B}}(p,q,0.5,1/4)$ by Claim 4.6, and ${n}^{*}_{\mathrm{B}}(p,q,0.5,1/4)≥ 2$ if $0.5(1-\mathrm{TV}(p,q))>1/4$ by Proposition 4.13, which is equivalent to $\mathrm{TV}(p,q)<1/2$ . Finally, $\mathrm{TV}(p,q)<\sqrt{2}\mathrm{h}(p,q)$ for distinct $p$ and $q$ (see, for example, [Ziv02, Proposition 2.38]) implies that $\mathrm{h}(p,q)≤ 1/(2\sqrt{2})$ suffices for $\mathrm{TV}(p,q)<1/2$ .
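The bound $\mathrm{TV}(p,q)\leq\sqrt{2}\,\mathrm{h}(p,q)$ invoked above can be sanity-checked numerically; the snippet below (an illustrative sketch, not from the paper) verifies it on random distributions:

```python
import math
import random

def tv(p, q):
    # total variation distance on a finite support
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def hellinger(p, q):
    # (unsquared) Hellinger divergence h(p,q)
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

def rand_dist(k, rng):
    w = [rng.random() for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(0)
for _ in range(1000):
    p, q = rand_dist(5, rng), rand_dist(5, rng)
    # TV <= sqrt(2) * h; hence h <= 1/(2*sqrt(2)) forces TV <= 1/2.
    assert tv(p, q) <= math.sqrt(2) * hellinger(p, q) + 1e-12
```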
4.1 Problem definitions
4.1.1 Bayesian hypothesis testing
Recall the Bayesian hypothesis testing problem:
See 1.1
Observe that the problem is trivial when $\delta≥\min(\pi,1-\pi)$ . The case $\pi=1/2$ corresponds to the typical setting of a uniform prior, which is well-understood; in particular, the sample complexity with a uniform prior is characterized by the Hellinger divergence. We record both of these results below:
**Proposition 4.1 (Standard bounds on ${n}^{*}_{\mathrm{B}}$; folklore)**
*We have the following bounds:
1. (Vacuous Regime) ${n}^{*}_{\mathrm{B}}(p,q,\pi,\delta)=0$ if $\delta≥\min(\pi,1-\pi)$ .
1. Let $p$ and $q$ satisfy $\mathrm{h}^{2}(p,q)≤ 0.125$ . For all $\delta≤\pi/4$ , we have $\frac{\log(\pi/\delta)}{\mathrm{h}^{2}(p,q)}\lesssim{n}^{*}_{\mathrm{B}}(p,q,\pi,\delta)\lesssim\frac{\log(1/\delta)}{\mathrm{h}^{2}(p,q)}$ . Moreover, these bounds are tight in the worst case, as shown in Example 4.2.
1. In particular, for $\pi=1/2$ , any $\delta≤ 1/8$ , and $\mathrm{h}^{2}(p,q)≤ 0.125$ , we have ${n}^{*}_{\mathrm{B}}(p,q,0.5,\delta)\asymp\frac{\log(1/\delta)}{\mathrm{h}^{2}(p,q)}$ [Ziv02, PolWu23].*
The first claim in the proposition follows by always picking the output hypothesis (without even looking at the samples) with prior $1-\pi$ (respectively, $\pi$ ), leading to a test with average error $\pi$ (respectively, $1-\pi$ ). The second claim follows by Proposition 4.5 and Corollary 4.8, proved later.
Proposition 4.1 implies that the “interesting” regime of the parameter $\delta$ is when $\delta$ is not too small: In particular, if $\delta≤\pi^{1+c}$ for a constant $c>0$ , the sample complexity is $\log(1/\delta)/\mathrm{h}^{2}(p,q)$ (up to a multiplicative factor of $1/c$ ), i.e., it matches the uniform-prior rate. This is in line with the classical result that the prior does not play a role in the asymptotic regime [CovTho06]. Our main contribution will be to provide tight bounds when $\delta≥\pi^{1+c}$ . A particularly important case will be the linear regime, i.e., the case when $\delta∈(\pi/100,\pi/4)$ . As we shall see, a generic instance $\mathcal{B}_{\mathrm{B}}(p,q,\pi,\delta)$ with error probability $\delta∈(\pi^{2},\pi/4)$ can be related to that of $\mathcal{B}_{\mathrm{B}}(p,q,\pi^{\prime},\delta^{\prime})$ , where $\delta^{\prime}∈(\pi^{\prime}/16,\pi^{\prime}/4)$ (cf. Proposition 6.1).
Moreover, the sample complexity is not symmetric in the first two arguments when $\pi≠ 0.5$ , i.e., in general, ${n}^{*}_{\mathrm{B}}(p,q,\pi,\delta)≠{n}^{*}_{\mathrm{B}}(q,p,\pi,\delta)$ . For example, for every $\epsilon∈(0,1)$ , there exist $p$ and $q$ with $\mathrm{h}^{2}(p,q)=\epsilon$ such that for all $\pi≤ 1/2$ , we have ${n}^{*}_{\mathrm{B}}\left(p,q,\pi,\frac{\pi}{10}\right)\asymp\frac{\log(1/\pi)}{\epsilon}$ and ${n}^{*}_{\mathrm{B}}\left(q,p,\pi,\frac{\pi}{10}\right)\asymp\frac{1}{\epsilon}$ . Formally, we have the following:
**Example 4.2**
*Let $\epsilon∈(0,\epsilon_{0})$ for a small absolute constant $\epsilon_{0}$ , let $\pi∈(0,1/2)$ , and let $\delta∈(0,\pi/4)$ . Let $p=\mathrm{Ber}(0)$ and $q=\mathrm{Ber}(\epsilon)$ be the Bernoulli distributions with biases $0$ and $\epsilon$ , respectively. Then $\mathrm{h}^{2}(p,q)\asymp\epsilon$ and
$$
{n}^{*}_{\mathrm{B}}(p,q,\pi,\delta)\asymp\frac{\log(1/\pi)}{\mathrm{h}^{2}(p,q)}\quad\text{and}\quad{n}^{*}_{\mathrm{B}}(q,p,\pi,\delta)\asymp\frac{\log(\pi/\delta)}{\mathrm{h}^{2}(p,q)}.
$$*
* Proof*
Let the support of $p$ and $q$ be $\{A,B\}$ , such that $\mathbb{P}(A)=0$ and $\mathbb{P}(B)=1$ under $p$ . We first show that ${n}^{*}_{\mathrm{B}}(q,p,\pi,\delta)\asymp\frac{\log(\pi/\delta)}{\epsilon}$ . The lower bound follows from Proposition 4.1, so we focus on establishing the upper bound. Consider the test $\phi$ that picks $q$ if the data contains a nonzero number of $A$ ’s, and otherwise picks $p$ . Then the probability of error under $p$ is $0$ , whereas the probability of error under $q$ is $(1-\epsilon)^{n}$ . Thus, the expected error is $\pi(1-\epsilon)^{n}$ , which is at most $\delta$ if $n\gtrsim\log(\pi/\delta)/\epsilon$ . We now consider ${n}^{*}_{\mathrm{B}}(p,q,\pi,\delta)$ , where we establish the lower bound; the upper bound follows from Proposition 4.1. Since the probability of observing “A” is zero under $p$ , the minimum probability of error has a simple expression. Proposition 4.13, proved later, implies the following expression for the minimum expected error:
$$
\displaystyle P_{e}^{(n)}=\sum_{x_{1},...,x_{n}∈\{A,B\}}\min\left(\pi p^{\otimes n}(x_{1},...,x_{n}),(1-\pi)q^{\otimes n}(x_{1},...,x_{n})\right)
$$
which is less than $\delta$ if and only if $(1-\pi)(1-\epsilon)^{n}≤\delta$ , or equivalently, $n≥\frac{\log((1-\pi)/\delta)}{-\log(1-\epsilon)}$ . Since $\pi≤ 1/2$ and $\epsilon≤ 1/2$ , the aforementioned condition implies that $n\gtrsim\frac{\log(1/\delta)}{\epsilon}$ . ∎
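For concreteness, both sample complexities in Example 4.2 can be computed exactly from the closed-form error expressions in the proof above (an illustrative sketch; the function names are ours):

```python
import math

def n_star_q_vs_p(pi, delta, eps):
    # B_B(q, p, pi, delta) with q = Ber(eps), p = Ber(0): the test that picks q
    # upon seeing an "A" errs with probability pi * (1 - eps)^n.
    return math.ceil(math.log(pi / delta) / -math.log(1 - eps))

def n_star_p_vs_q(pi, delta, eps):
    # B_B(p, q, pi, delta): P_e^(n) = min(pi, (1 - pi) * (1 - eps)^n), so we
    # need (1 - pi) * (1 - eps)^n <= delta.
    return math.ceil(math.log((1 - pi) / delta) / -math.log(1 - eps))

pi, eps = 0.01, 0.05
n_qp = n_star_q_vs_p(pi, pi / 10, eps)  # scales like log(pi/delta)/eps
n_pq = n_star_p_vs_q(pi, pi / 10, eps)  # scales like log(1/delta)/eps, larger
assert n_qp < n_pq
assert pi * (1 - eps) ** n_qp <= pi / 10
```

The gap between `n_qp` and `n_pq` is the asymmetry in the first two arguments discussed before Example 4.2.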
In terms of the optimal test for $\mathcal{B}_{\mathrm{B}}(p,q,\pi,\delta)$ , the classical result of Neyman and Pearson implies that the likelihood ratio test (with a particular choice of threshold) is optimal.
**Definition 4.3 (Likelihood ratio test)**
*Given a threshold $T∈[0,∞]$ and two distributions $p$ and $q$ , the likelihood ratio test takes in $n$ samples $x_{1},...,x_{n}$ and outputs $p$ if the likelihood ratio, $\prod_{i=1}^{n}\frac{p(x_{i})}{q(x_{i})}$ , is at least $T$ and $q$ otherwise.*
**Fact 4.4 (Optimal test[NeyPea33])**
*For any $n$ , the likelihood ratio test with the threshold $\frac{1-\pi}{\pi}$ minimizes the probability of error in Equation 1.*
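A minimal sketch of the likelihood ratio test with the Bayes-optimal threshold $(1-\pi)/\pi$ (illustrative code, assuming distributions are given as dictionaries over a discrete alphabet):

```python
import math

def likelihood_ratio_test(samples, p, q, T):
    # Output "p" iff prod_i p(x_i)/q(x_i) >= T (Definition 4.3).
    # Computed in log-space for numerical stability.
    log_lr = sum(math.log(p[x]) - math.log(q[x]) for x in samples)
    return "p" if log_lr >= math.log(T) else "q"

pi = 0.25                      # prior weight on p
p = {0: 0.8, 1: 0.2}
q = {0: 0.2, 1: 0.8}
T = (1 - pi) / pi              # Bayes-optimal threshold (Fact 4.4)

samples = [0] * 8 + [1] * 2    # a sample set typical of p
print(likelihood_ratio_test(samples, p, q, T))
```

Note that $T=(1-\pi)/\pi=3>1$ here, so the test is biased toward $q$ , reflecting the larger prior weight $1-\pi$ on $q$ .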
4.1.2 Prior-free hypothesis testing
We now turn our attention to the prior-free setting, recalled below.
See 1.2
If both $\alpha$ and $\beta$ are larger than $1/2$ , the problem is trivial. The case $\alpha=\beta$ is roughly equivalent to the uniform-prior setting with error probability $\alpha$ (which is characterized by the Hellinger divergence).
**Proposition 4.5 (Standard bounds on ${n}^{*}_{\mathrm{PF}}$)**
*We have the following bounds:
1. (Vacuous Regime) ${n}^{*}_{\mathrm{PF}}(p,q,\alpha,\beta)=0$ if $\alpha+\beta≥ 1$ .
1. For all $\alpha,\beta,p$ , and $q$ such that $\max(\alpha,\beta)<1/4$ and $\mathrm{h}^{2}(p,q)≤ 0.125$ , we have
$$
\displaystyle\frac{\log(1/\max(\alpha,\beta))}{\mathrm{h}^{2}(p,q)}\lesssim{n}^{*}_{\mathrm{PF}}(p,q,\alpha,\beta)\lesssim\frac{\log(1/\min(\alpha,\beta))}{\mathrm{h}^{2}(p,q)}\,. \tag{8}
$$*
* Proof*
The test that chooses the distribution $p$ with probability $\overline{\alpha}$ and $q$ with probability $\alpha$ has type-I error $\alpha$ and type-II error $1-\alpha$ , which is at most $\beta$ if $\alpha+\beta≥ 1$ . This test does not use any samples. The second claim in Equation 8 follows from the folklore result [Ziv02, PolWu23] that ${n}^{*}_{\mathrm{PF}}(p,q,\delta,\delta)\asymp\log(1/\delta)/\mathrm{h}^{2}(p,q)$ for any $\delta≤ 1/4$ and $\mathrm{h}^{2}(p,q)≤ 0.125$ , combined with the following monotonicity property:
$$
{n}^{*}_{\mathrm{PF}}(p,q,\max\left(\alpha,\beta\right),\max\left(\alpha,\beta\right))≤{n}^{*}_{\mathrm{PF}}(p,q,\alpha,\beta)≤{n}^{*}_{\mathrm{PF}}(p,q,\min\left(\alpha,\beta\right),\min\left(\alpha,\beta\right))\,.
$$
∎
Proposition 4.5 implies that when $\max(\alpha,\beta)$ is polynomially related to $\min(\alpha,\beta)$ , the sample complexity is characterized by $\log(1/\alpha)/\mathrm{h}^{2}(p,q)$ . Thus, the “interesting” regime of the parameters is when $\alpha$ and $\beta$ are of widely different orders. Again, the sample complexity ${n}^{*}_{\mathrm{PF}}(p,q,\alpha,\beta)$ is not symmetric in the first two arguments.
4.1.3 Relation between these two problems
It is easy to see that the sample complexities of the Bayesian and prior-free hypothesis testing problems are closely related to each other.
**Claim 4.6 (Relation between Bayesian and prior-free sample complexities)**
*For any $\alpha,\beta∈(0,1)$ , we have
$$
\displaystyle{n}^{*}_{\mathrm{B}}\left(p,q,\frac{\beta}{\alpha+\beta},\frac{2\alpha\beta}{\alpha+\beta}\right)\leq{n}^{*}_{\mathrm{PF}}(p,q,\alpha,\beta)\leq{n}^{*}_{\mathrm{B}}\left(p,q,\frac{\beta}{\alpha+\beta},\frac{\alpha\beta}{\alpha+\beta}\right), \tag{9}
$$
and for all $\pi,\delta∈(0,1)$ , we have
$$
\displaystyle{n}^{*}_{\mathrm{PF}}\left(p,q,\frac{\delta}{\pi},\frac{\delta}{1-\pi}\right)\leq{n}^{*}_{\mathrm{B}}\left(p,q,\pi,\delta\right)\leq{n}^{*}_{\mathrm{PF}}\left(p,q,\frac{\delta}{2\pi},\frac{\delta}{2(1-\pi)}\right). \tag{10}
$$*
* Proof*
Proof of Equation 9:
The first inequality in Equation 9 can be checked by noting that the average error (with $p$ chosen with probability $\beta/(\alpha+\beta)$ ) of the test achieving the guarantee for ${n}^{*}_{\mathrm{PF}}(p,q,\alpha,\beta)$ is at most $2\alpha\beta/(\alpha+\beta)$ , thus solving $\mathcal{B}_{\mathrm{B}}\left(p,q,\frac{\beta}{\alpha+\beta},\frac{2\alpha\beta}{\alpha+\beta}\right)$ .
For the second inequality, consider a test that solves $\mathcal{B}_{\mathrm{B}}\left(p,q,\frac{\beta}{\alpha+\beta},\frac{\alpha\beta}{\alpha+\beta}\right)$ . Observe that the type-I error cannot be larger than $\alpha$ and the type-II error cannot be larger than $\beta$ , or else the weighted average error would exceed $\alpha\beta/(\alpha+\beta)$ .
Proof of Equation 10:
For the first inequality in Equation 10, consider a test $\phi$ that solves $\mathcal{B}_{\mathrm{B}}\left(p,q,\pi,\delta\right)$ . Then the type-I error of the test $\phi$ cannot be more than $\delta/\pi$ ; otherwise, the average error would be larger than $\delta$ . Similarly, the type-II error cannot be larger than $\delta/(1-\pi)$ .
For the second inequality, consider a test that solves $\mathcal{B}_{\mathrm{PF}}\left(p,q,\frac{\delta}{2\pi},\frac{\delta}{2(1-\pi)}\right)$ . Then the average error, where $p$ is chosen with probability $\pi$ , is at most $\pi(\delta/(2\pi))+(1-\pi)\left(\delta/(2(1-\pi))\right)=\delta$ .
∎
Finally, the following proposition shows that the sample complexity is stable under mild changes in the problem parameters, implying that both series of inequalities in Claim 4.6 can be reversed, up to constants:
**Proposition 4.7 (Mild changes in prior and error probability)**
*Let $\pi_{1}∈(0,1/2]$ , $\pi_{2}∈(0,1/2]$ , $\delta_{1}∈(0,\pi_{1}/4)$ , and $\delta_{2}∈(0,\pi_{2}/4)$ . Then
$$
\displaystyle\frac{{n}^{*}_{\mathrm{B}}(p,q,\pi_{1},\delta_{1})}{{n}^{*}_{\mathrm{B}}(p,q,\pi_{2},\delta_{2})}\lesssim\max\left(1,\frac{\log\left(\frac{\pi_{1}}{\delta_{1}}\right)}{\log\left(\frac{\pi_{2}}{\delta_{2}}\right)},\frac{\log\left(\frac{1}{\delta_{1}}\right)}{\log\left(\frac{1}{\delta_{2}}\right)}\right). \tag{11}
$$
In particular, if $\delta_{1}=\mathrm{poly}(\delta_{2})$ and $\delta_{1}/\pi_{1}=\mathrm{poly}(\delta_{2}/\pi_{2})$ , the sample complexities are related by multiplicative constants. Similarly, let $\alpha_{1},\beta_{1},\alpha_{2},\beta_{2}∈(0,1/4]$ . Then
$$
\displaystyle\frac{{n}^{*}_{\mathrm{PF}}(p,q,\alpha_{1},\beta_{1})}{{n}^{*}_{\mathrm{PF}}(p,q,\alpha_{2},\beta_{2})}\lesssim\max\left(1,\frac{\log\left(\frac{1}{\alpha_{1}}\right)}{\log\left(\frac{1}{\alpha_{2}}\right)},\frac{\log\left(\frac{1}{\beta_{1}}\right)}{\log\left(\frac{1}{\beta_{2}}\right)}\right)\,. \tag{12}
$$
In particular, if $\alpha_{1}=\mathrm{poly}(\alpha_{2})$ and $\beta_{1}=\mathrm{poly}(\beta_{2})$ , the sample complexities are related by multiplicative constants.*
We state the following corollary:
**Corollary 4.8**
*Let $\pi∈(0,1/2]$ and $\delta≤\pi/4$ . Then ${n}^{*}_{\mathrm{B}}(p,q,\pi,\delta)\asymp{n}^{*}_{\mathrm{PF}}\left(p,q,\frac{\delta}{\pi},\delta\right)$ . Similarly, let $\alpha,\beta∈(0,1/8)$ be such that $\beta≤\alpha$ . Then ${n}^{*}_{\mathrm{PF}}\left(p,q,\alpha,\beta\right)\asymp{n}^{*}_{\mathrm{B}}\left(p,q,\frac{\beta}{\alpha+\beta},\frac{\alpha\beta}{\alpha+\beta}\right)$ .*
* Proof of Corollary 4.8*
By Claim 4.6, the Bayesian sample complexity is sandwiched between the two following prior-free sample complexities, so we obtain a bound on the ratio using Proposition 4.7:
$$
\frac{{n}^{*}_{\mathrm{PF}}\left(p,q,\frac{\delta}{2\pi},\frac{\delta}{2(1-\pi)}\right)}{{n}^{*}_{\mathrm{PF}}\left(p,q,\frac{\delta}{\pi},\frac{\delta}{1-\pi}\right)}\lesssim\max\left(1,\frac{\log(2\pi/\delta)}{\log(\pi/\delta)},\frac{\log(2(1-\pi)/\delta)}{\log((1-\pi)/\delta)}\right)\,.
$$
Both ratios above are at most a constant, since $\log(2/x)/\log(1/x)\lesssim 1$ when $x∈(0,1/4)$ , applied with $x=\delta/\pi∈(0,1/4)$ and $x=\delta/(1-\pi)≤\delta/\pi≤ 1/4$ (using $\pi≤ 1/2$ ). We now consider the second claim: By Claim 4.6, it suffices to upper-bound the ratio of the following Bayesian sample complexities, which can be controlled using Proposition 4.7 (since the priors satisfy $\frac{\beta}{\alpha+\beta}≤ 1/2$ and the ratio of the error probability to the prior is at most $1/4$ ):
$$
\frac{{n}^{*}_{\mathrm{B}}\left(p,q,\frac{\beta}{\alpha+\beta},\frac{\alpha\beta}{\alpha+\beta}\right)}{{n}^{*}_{\mathrm{B}}\left(p,q,\frac{\beta}{\alpha+\beta},\frac{2\alpha\beta}{\alpha+\beta}\right)}\lesssim\max\left(1,\frac{\log(1/\alpha)}{\log(1/2\alpha)},\frac{\log((\alpha+\beta)/\alpha\beta)}{\log((\alpha+\beta)/2\alpha\beta)}\right)\,.
$$
Both ratios above are at most a constant, since $\alpha∈(0,1/4)$ and $\alpha\beta/(\alpha+\beta)≤\min(\alpha,\beta)≤ 1/4$ . ∎
The following folklore result about boosting the probability of success using repetition, proved in Section 13.1 for completeness, will be crucial:
**Fact 4.9 (Repetition to boost success probability)**
*Consider a function $\phi:\mathcal{X}^{n}→\{0,1\}$ such that $\mathbb{P}(\phi(X_{1},...,X_{n})≠ 0)≤\tau$ for $\tau≤ 1/4$ when $X_{1},...,X_{n}$ are sampled i.i.d. from $P$ . (The constant $1/4$ can be improved to any fixed constant less than $1/2$ , at the cost of a bigger constant $2^{5}$ below.) Consider the modified function $\phi^{\prime}:\mathcal{X}^{nT}→\{0,1\}$ that is defined to be the majority of $\phi(X_{1},...,X_{n})$ , $...$ , $\phi(X_{(T-1)n+1},...,X_{nT})$ . Then $\mathbb{P}(\phi^{\prime}(X_{1},...,X_{nT})≠ 0)$ when $X_{1},...,X_{nT}$ are sampled i.i.d. from $P$ is at most $\exp\left(-2^{-5}T\log(1/\tau)\right)=\tau^{T/32}$ . Alternatively, for any $\tau^{\prime}≤\tau$ and any $T≥\frac{2^{5}\log(1/\tau^{\prime})}{\log(1/\tau)}$ , the probability of error (when the true distribution is $P$ ) of the boosted procedure is at most $\tau^{\prime}$ .*
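The guarantee in Fact 4.9 can be checked exactly for small parameters: for odd $T$ , the majority errs only if more than $T/2$ of the independent runs err, and the resulting binomial tail can be compared with the $\tau^{T/32}$ bound (an illustrative sketch, not from the paper):

```python
import math

def majority_error(tau, T):
    # Probability that a majority of T independent runs errs, when each run
    # errs independently with probability tau (take T odd for a clean majority).
    k0 = T // 2 + 1
    return sum(math.comb(T, k) * tau**k * (1 - tau) ** (T - k) for k in range(k0, T + 1))

tau = 0.25
for T in [5, 15, 45]:
    # Compare the exact majority error with Fact 4.9's bound tau^(T/32).
    assert majority_error(tau, T) <= tau ** (T / 32)
```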
* Proof of Proposition 4.7*
We begin with the proof of Equation 12. Consider a test $\phi$ that solves $\mathcal{B}_{\mathrm{PF}}(p,q,\alpha_{2},\beta_{2})$ with $n:={n}^{*}_{\mathrm{PF}}(p,q,\alpha_{2},\beta_{2})$ samples. We want to boost the test $k$ times in the sense of Fact 4.9, using $nk$ samples in total, so that the resulting type-I error is at most $\alpha_{1}$ and the type-II error is at most $\beta_{1}$ . Fact 4.9 then implies the result. Similar arguments work for Equation 11. Using Claim 4.6, it suffices to solve $\mathcal{B}_{\mathrm{PF}}(p,q,(\delta_{1}/2\pi_{1}),\delta_{1}/2)$ to solve $\mathcal{B}_{\mathrm{B}}(p,q,\pi_{1},\delta_{1})$ . Consider a test $\phi$ that solves $\mathcal{B}_{\mathrm{B}}(p,q,\pi_{2},\delta_{2})$ with $n:={n}^{*}_{\mathrm{B}}(p,q,\pi_{2},\delta_{2})$ samples. By Claim 4.6 again, the type-I error of $\phi$ on $n$ samples is at most $\delta_{2}/\pi_{2}$ and the type-II error is at most $\delta_{2}/(1-\pi_{2})≤ 2\delta_{2}$ , both of which are less than $1/4$ . We boost the test $k$ times so that the resulting type-I error is at most $(\delta_{1}/2\pi_{1})$ and the type-II error is at most $\delta_{1}/2$ . Fact 4.9 then implies the result. ∎
4.2 Probability divergences
Our results crucially rely on $f$ -divergences, which we formally define below:
**Definition 4.10 ($f$-divergence)**
*For a convex function $f:\mathbb{R}_{+}→\mathbb{R}$ with $f(1)=0$ , we use $I_{f}(p,q)$ to denote the $f$ -divergence between $p$ and $q$ , defined as $I_{f}(p,q):=\sum_{i}q_{i}f\left(p_{i}/q_{i}\right)$ . We use the following conventions [Sason18]: $f(0)=\lim_{t→ 0^{+}}f(t)$ , $0f(0/0)=0$ , and for $a>0$ , $0f(a/0)=a\lim_{u→∞}f(u)/u$ .*
We will repeatedly use the following $f$ -divergences and their properties:
**Definition 4.11 ($E_{\gamma}$-divergence)**
*For $\gamma≥ 1$ , the $E_{\gamma}$ -divergence, also known as the hockey-stick divergence, is defined by $E_{\gamma}(p,q):=\sum_{i}(p_{i}-\gamma q_{i})_{+}$ .*
Observe that the $E_{\gamma}$ -divergence is an $f$ -divergence with $f(x)=\max(0,x-\gamma)$ .
**Proposition 4.12 (Variational characterization of $E_{\gamma}$; see, e.g., [SasVer16, Section VII])**
*Let $p$ and $q$ be two distributions over $\mathcal{X}$ with $\sigma$ -algebra $\mathcal{F}$ . Then $E_{\gamma}(p,q)=\sup_{A∈\mathcal{F}}\left(\mathbb{P}_{p}(A)-\gamma\mathbb{P}_{q}(A)\right)$ .*
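On a small finite domain, Proposition 4.12 can be verified by brute force over all events (illustrative code, not from the paper):

```python
from itertools import chain, combinations

def e_gamma(p, q, gamma):
    # E_gamma(p,q) = sum_i (p_i - gamma * q_i)_+  (Definition 4.11)
    return sum(max(0.0, a - gamma * b) for a, b in zip(p, q))

def e_gamma_variational(p, q, gamma):
    # sup_A ( P_p(A) - gamma * P_q(A) ), brute-forced over all subsets A
    idx = range(len(p))
    events = chain.from_iterable(combinations(idx, r) for r in range(len(p) + 1))
    return max(sum(p[i] for i in A) - gamma * sum(q[i] for i in A) for A in events)

p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
for gamma in [1.0, 1.5, 2.0]:
    assert abs(e_gamma(p, q, gamma) - e_gamma_variational(p, q, gamma)) < 1e-12
```

For $\gamma=1$ , `e_gamma` reduces to the total variation distance; the two definitions agree because the supremum is attained at $A=\{i:p_{i}>\gamma q_{i}\}$ .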
As an application, we obtain the following exact characterization of the Bayesian error probability:
**Proposition 4.13 (Exact characterization of error probability)**
*Let $p$ and $q$ be two distributions over $\mathcal{X}$ , and let $\pi∈(0,0.5]$ . Then
$$
\displaystyle P_{e}^{(1)}=\min_{\phi:\mathcal{X}\to\{p,q\}}\Big{(}\pi\mathbb{P}_{x\sim p}\left\{\phi(x)\neq p\right\}+(1-\pi)\mathbb{P}_{x\sim q}\left\{\phi(x)\neq q\right\}\Big{)}=\sum_{i}\min\left(\pi p_{i},(1-\pi)q_{i}\right)=\pi-\pi E_{\frac{1-\pi}{\pi}}\left(p,q\right)\,. \tag{1}
$$
In particular, for any $n$ , the minimum possible error probability for the Bayesian simple binary hypothesis testing problem with prior $\pi$ , $P_{e}^{(n)}$ , is $\pi\left(1-E_{\frac{1-\pi}{\pi}}\left(p^{\otimes n},q^{\otimes n}\right)\right)$ .*
Thus, the sample complexity for $\mathcal{B}_{\mathrm{B}}(p,q,\pi,\delta)$ is the minimum $n$ such that $\pi\left(1-E_{\frac{1-\pi}{\pi}}\left(p^{\otimes n},q^{\otimes n}\right)\right)$ is at most $\delta$ . In particular, for $\pi=1/2$ , the $E_{1}$ -divergence becomes the total variation distance, and the tight characterization of the sample complexity for ${n}^{*}_{\mathrm{B}}(p,q,1/2,\delta)$ (for small $\delta≤ 1/4$ ) in terms of the Hellinger divergence is obtained by upper- and lower-bounding the total variation distance in terms of the Hellinger divergence, coupled with the tensorization property of the Hellinger divergence [Ziv02]. Such an approach cannot be replicated for the $E_{\gamma}$ -divergence. Unlike most divergences such as the total variation distance, Hellinger divergence, $\chi^{2}$ -divergence, or Kullback–Leibler divergence, the $E_{\gamma}$ -divergence between $p$ and $q$ may be 0 even when $p≠ q$ . This makes it impossible to upper-bound any such divergence by the $E_{\gamma}$ -divergence, necessitating a new approach.
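The identity $\sum_{i}\min(\pi p_{i},(1-\pi)q_{i})=\pi(1-E_{\frac{1-\pi}{\pi}}(p,q))$ underlying Proposition 4.13 is easy to confirm numerically (an illustrative sketch, not from the paper):

```python
def e_gamma(p, q, gamma):
    # E_gamma(p,q) = sum_i (p_i - gamma * q_i)_+
    return sum(max(0.0, a - gamma * b) for a, b in zip(p, q))

def bayes_error(p, q, pi):
    # P_e^(1) = sum_i min(pi * p_i, (1 - pi) * q_i)  (Proposition 4.13)
    return sum(min(pi * a, (1 - pi) * b) for a, b in zip(p, q))

p = [0.6, 0.3, 0.1]
q = [0.1, 0.2, 0.7]
for pi in [0.1, 0.25, 0.5]:
    assert abs(bayes_error(p, q, pi) - pi * (1 - e_gamma(p, q, (1 - pi) / pi))) < 1e-12
```

For $\pi=1/2$ this specializes to $P_{e}^{(1)}=0.5(1-\mathrm{TV}(p,q))$ , the uniform-prior case discussed above.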
Our results will use the following generalization of the Hellinger divergence (which reduces to the usual definition when $\lambda=1/2$ ):
**Definition 4.14 ($\mathcal{H}_{\lambda}$-divergence)**
*For $\lambda∈[0,1]$ , define the $\mathcal{H}_{\lambda}$ -divergence to be
$$
\mathcal{H}_{\lambda}(p,q):=1-\sum_{i}p_{i}^{\lambda}q_{i}^{1-\lambda}=\sum_{i}q_{i}\left(1-\left(\frac{p_{i}}{q_{i}}\right)^{\lambda}\right)\,.
$$*
The $\mathcal{H}_{\lambda}$ -divergence is an $f$ -divergence with $f(x)=1-x^{\lambda}$ , and is jointly convex in $(p,q)$ for $\lambda∈[0,1]$ . The $\mathcal{H}_{\lambda}$ -divergence generalizes the Hellinger divergence, which is obtained by setting $\lambda=1/2$ . Moreover, the $\mathcal{H}_{\lambda}$ -divergence is closely related to the Rényi divergence of order $\lambda$ , via $D_{\lambda}(p,q)=\frac{1}{\lambda-1}\log\left(1-\mathcal{H}_{\lambda}(p,q)\right)$ .
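A quick numerical check of Definition 4.14 against the Rényi divergence $D_{\lambda}(p,q)=\frac{1}{\lambda-1}\log\sum_{i}p_{i}^{\lambda}q_{i}^{1-\lambda}$ , and of the Hellinger special case $\lambda=1/2$ (illustrative code, not from the paper):

```python
import math

def h_lam(p, q, lam):
    # H_lambda(p,q) = 1 - sum_i p_i^lam * q_i^(1-lam)  (Definition 4.14)
    return 1.0 - sum(a**lam * b ** (1 - lam) for a, b in zip(p, q))

def renyi(p, q, lam):
    # Renyi divergence of order lam (lam != 1), discrete case
    return math.log(sum(a**lam * b ** (1 - lam) for a, b in zip(p, q))) / (lam - 1)

p, q = [0.7, 0.3], [0.4, 0.6]
for lam in [0.25, 0.5, 0.75]:
    assert abs(renyi(p, q, lam) - math.log(1 - h_lam(p, q, lam)) / (lam - 1)) < 1e-12

# lam = 1/2 recovers the (squared) Hellinger divergence of Section 4:
h2 = 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
assert abs(h_lam(p, q, 0.5) - h2) < 1e-12
```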
**Proposition 4.15 (Tensorization of $\mathcal{H}_{\lambda}$)**
*We have $1-\mathcal{H}_{\lambda}(p^{\otimes n},q^{\otimes n})=\left(1-\mathcal{H}_{\lambda}(p,q)\right)^{n}$ .*
* Proof*
By expanding the definition of $\mathcal{H}_{\lambda}(p^{\otimes n},q^{\otimes n})$ , we obtain
$$
1-\mathcal{H}_{\lambda}(p^{\otimes n},q^{\otimes n})=\sum_{x_{1},...,x_{n}}\left(p_{x_{1}}\cdots p_{x_{n}}\right)^{\lambda}\left(q_{x_{1}}\cdots q_{x_{n}}\right)^{1-\lambda}=\left(\sum_{x}p_{x}^{\lambda}q_{x}^{1-\lambda}\right)^{n}=\left(1-\mathcal{H}_{\lambda}(p,q)\right)^{n}\,.
$$
∎
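Proposition 4.15 can also be verified by brute-force enumeration of the product domain (an illustrative sketch, not from the paper):

```python
import math
from itertools import product

def h_lam(p, q, lam):
    # H_lambda(p,q) = 1 - sum_i p_i^lam * q_i^(1-lam)
    return 1.0 - sum(a**lam * b ** (1 - lam) for a, b in zip(p, q))

def product_dist(p, n):
    # p^{\otimes n}, enumerated explicitly over the n-fold product domain
    return [math.prod(t) for t in product(p, repeat=n)]

p, q = [0.6, 0.4], [0.3, 0.7]
lam, n = 0.3, 4
lhs = 1 - h_lam(product_dist(p, n), product_dist(q, n), lam)
rhs = (1 - h_lam(p, q, lam)) ** n
assert abs(lhs - rhs) < 1e-12
```

The enumeration relies on `product(p, repeat=n)` and `product(q, repeat=n)` iterating over index tuples in the same order, so corresponding entries refer to the same point of the product domain.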
We also define the following generalization of the Jensen–Shannon divergence:
**Definition 4.16 (Skewed $\mathrm{JS}_{\pi}$-divergence)**
*For $\pi∈[0,1]$ , we define the skewed Jensen–Shannon divergence by
$$
\displaystyle\mathrm{JS}_{\pi}(p,q):=\pi\,\mathrm{KL}(p,\pi p+(1-\pi)q)+(1-\pi)\,\mathrm{KL}(q,\pi p+(1-\pi)q)\,. \tag{13}
$$*
This divergence was called the asymmetric $\pi$ -skew Jensen–Shannon divergence in [Nie20]. Similar quantities have appeared previously in the literature [Lin91, Ziv02, Nie11]; for example, the $\pi$ -Jensen–Shannon divergence was defined to be $K_{\pi}(p,q)+K_{\pi}(q,p)$ , where $K_{\pi}(p,q):=\mathrm{KL}(p,\pi p+(1-\pi)q)$ . In comparison, $\mathrm{JS}_{\pi}(p,q)=\pi K_{\pi}(p,q)+(1-\pi)K_{(1-\pi)}(q,p)$ , and is thus asymmetric. Combining the facts that $K_{\pi}(·,·)$ is an $f$ -divergence [Lin91, Nie11], that the sum of two $f$ -divergences is an $f$ -divergence, and that the reversal of an $f$ -divergence is another $f$ -divergence, we conclude that $\mathrm{JS}_{\pi}(p,q)$ is also an $f$ -divergence.
**Proposition 4.17 (Information-theoretic interpretation of skewed $\mathrm{JS}_{\pi}$-divergence)**
*Let $\Theta$ be a $\mathrm{Ber}(\overline{\pi})$ random variable. Let $X\sim p$ if $\Theta=0$ and $X\sim q$ if $\Theta=1$ . Then $I(\Theta;X)=\mathrm{JS}_{\pi}(p,q)$ .*
* Proof*
Let $P$ denote the joint distribution of $\Theta$ and $X$. Let $P_{X}$ denote the distribution of $X$ and let $P_{\Theta}$ denote the distribution of $\Theta$. Then
$$
I(X;\Theta)=\mathrm{KL}(P,P_{X}× P_{\Theta})=\mathbb{E}_{\Theta}\left[\mathrm{KL}(P_{X|\Theta},P_{X})\right]=\pi\,\mathrm{KL}(p,P_{X})+(1-\pi)\,\mathrm{KL}(q,P_{X})=\mathrm{JS}_{\pi}(p,q),
$$
where the final equality follows by noting that $P_{X}=\pi p+(1-\pi)q$. ∎
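The identity can be confirmed numerically by computing the mutual information directly from the joint distribution of $(\Theta,X)$; the snippet below is an illustrative check, not code from the paper.

```python
import math

p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]
pi = 0.3

# Joint distribution: Theta = 0 (prob. pi) draws X ~ p; Theta = 1 draws X ~ q.
joint = [[pi * x for x in p], [(1 - pi) * y for y in q]]
px = [joint[0][i] + joint[1][i] for i in range(3)]
ptheta = [pi, 1 - pi]
mi = sum(joint[t][i] * math.log(joint[t][i] / (ptheta[t] * px[i]))
         for t in range(2) for i in range(3))

# JS_pi(p, q) from Equation 13; note the mixture m equals the marginal px.
m = [pi * x + (1 - pi) * y for x, y in zip(p, q)]
js = pi * sum(x * math.log(x / z) for x, z in zip(p, m)) \
     + (1 - pi) * sum(y * math.log(y / z) for y, z in zip(q, m))

assert abs(mi - js) < 1e-12
```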
For a random variable $Y\sim p_{Y}$, let $H_{\mathrm{ent}}(Y)$ (also written as $H(p_{Y})$) be the Shannon entropy of $Y$, defined as $H_{\mathrm{ent}}(Y)=-\sum_{y}p_{Y}(y)\log p_{Y}(y)$. For $x∈[0,1]$, define $h_{\mathrm{bin}}(x)$ to be the Shannon entropy of a $\mathrm{Ber}(x)$ random variable.
**Fact 4.18 (Fano’s inequality [CovTho06, PolWu23])**
*Let $V→ X→\widehat{V}$ be any Markov chain, where $V$ is a binary random variable. Define $p_{e}:=\mathbb{P}(V≠\widehat{V})$. Then
$$
h_{\mathrm{bin}}(p_{e})≥ H_{\mathrm{ent}}(V|X).
$$*
**Lemma 4.19 (Method of joint range [HarVaj11])**
*Let $D_{f_{1}}(·,·)$ and $D_{f_{2}}(·,·)$ be two $f$-divergences. Suppose that for a constant $c>0$, we have
$$
D_{f_{1}}(\mathrm{Ber}(x),\mathrm{Ber}(y))\leq cD_{f_{2}}(\mathrm{Ber}(x),\mathrm{Ber}(y))
$$
for all $x,y∈[0,1]$. Then $D_{f_{1}}(p,q)≤ cD_{f_{2}}(p,q)$ for all probability distributions $p$ and $q$.*
* Proof*
The result from [HarVaj11] states that $\mathcal{R}$ is the convex hull of $\mathcal{R}_{2}$, where
$$
\mathcal{R}:=\{(u,v):u=D_{f_{1}}(p,q),v=D_{f_{2}}(p,q)\text{ for any }p\text{ and }q\text{ over any }\mathcal{X}\}\text{ and }
$$
$$
\mathcal{R}_{2}:=\{(u,v):u=D_{f_{1}}(p,q),v=D_{f_{2}}(p,q)\text{ for Bernoulli distributions }p\text{ and }q\}.
$$
Since the $f$-divergence inequality is given to hold for Bernoulli distributions, we know that $\mathcal{R}_{2}⊆\{(u,v):u≤ cv\}$. Since the latter is a convex subset of $\mathbb{R}_{+}^{2}$, we conclude that the convex hull of $\mathcal{R}_{2}$, which is the same as $\mathcal{R}$, also lies within the same set. This concludes the proof. ∎
5 Proof Sketch of Theorem 2.1
In this section, we present a high-level sketch of the proof of Theorem 2.1.
5.1 Linear error probability regime
The formal statements of all results in this section and their complete proofs are in Section 7.
We begin by focusing on the regime where $\delta$ is a small constant fraction of $\pi$: suppose $\delta∈(\pi/100,\pi/4)$. We first present sample complexity lower and upper bounds, respectively, using classical techniques. Finally, we present our main technical result, which establishes their equivalence.
A sample complexity lower bound is obtained via Fano’s inequality:
**Proposition 5.1 (Lower bound, formally stated in Proposition 7.2)**
*${n}^{*}_{\mathrm{B}}(p,q,\pi,\pi/4)\gtrsim\frac{\pi\log(1/\pi)}{I(\Theta;X_{1})}$ .*
* Proof Sketch*
For $x∈(0,1)$, we use $h_{\mathrm{bin}}(x)$ to denote the binary entropy of $x$ and $H_{\mathrm{ent}}(·)$ to denote the entropy (and conditional entropy) of a random variable or a distribution. Observe that $\Theta→(X_{1},\dots,X_{n})→\widehat{\Theta}$ is a Markov chain. Let $P_{e}^{(n)}$ be the probability of error, i.e., $\mathbb{P}(\Theta≠\widehat{\Theta})$. By Fano’s inequality (Fact 4.18), we have
$$
h_{\mathrm{bin}}(P_{e}^{(n)})\geq H_{\mathrm{ent}}(\Theta|X_{1},\dots,X_{n})=H_{\mathrm{ent}}(\Theta)-I(\Theta;X_{1},\dots,X_{n})=h_{\mathrm{bin}}(\pi)-I(\Theta;X_{1},\dots,X_{n}). \tag{14}
$$
The mutual information term is upper-bounded using the conditionally i.i.d. distribution of the $X_{i}$’s given $\Theta$, as $I(\Theta;X_{1},\dots,X_{n})≤\sum_{i=1}^{n}I(\Theta;X_{i})=nI(\Theta;X_{1})$. Combining this inequality with Equation 14, we obtain $n≥\frac{h_{\mathrm{bin}}(\pi)-h_{\mathrm{bin}}(P_{e}^{(n)})}{I(\Theta;X_{1})}$. As $P_{e}^{(n)}≤\pi/4$, the desired inequality follows by noting that $h_{\mathrm{bin}}(\pi)-h_{\mathrm{bin}}(\pi/4)\asymp h_{\mathrm{bin}}(\pi)$. ∎
We note that a similar strategy for establishing a lower bound for the uniform prior was used in [Ziv02, Theorem 4.27]. For a uniform prior and small constant error probability, this gives an alternate proof of the $\frac{1}{\mathrm{h}^{2}(p,q)}$ sample complexity lower bound, because $I(\Theta;X_{1})\asymp\mathrm{h}^{2}(p,q)$ when $\Theta$ is uniform over $p$ and $q$ [Ziv02]. Turning our attention to an upper bound, we use the classical idea of tensorization for the Hellinger family of $f$ -divergences. This tensorization leads directly to the following classical upper bound:
**Proposition 5.2 (Upper bound, formally stated in Proposition 7.3)**
*Set $\lambda=\frac{0.5\log 2}{\log(1/\pi)}$, which lies in $(0,0.5]$, and define $\overline{\lambda}:=1-\lambda$. Then ${n}^{*}_{\mathrm{B}}(p,q,\pi,\pi/4)\lesssim\left\lceil\frac{2}{\mathcal{H}_{\overline{\lambda}}(p,q)}\right\rceil$.*
* Proof Sketch*
Let $P_{e}^{(n)}$ be the minimum Bayesian error between $p$ and $q$, given $n$ samples. By Proposition 4.13, we have $P_{e}^{(n)}=\sum_{x_{1},\dots,x_{n}}\min\left(\pi p^{\otimes n}(x_{1},\dots,x_{n}),(1-\pi)q^{\otimes n}(x_{1},\dots,x_{n})\right)$. Since $\min(x,y)≤ x^{\lambda}y^{1-\lambda}$ for any nonnegative $x$, $y$ and $\lambda∈(0,1)$, we obtain
$$
P_{e}^{(n)}\leq\sum_{x_{1},\dots,x_{n}}\left(\pi p^{\otimes n}(x_{1},\dots,x_{n})\right)^{\overline{\lambda}}\left((1-\pi)q^{\otimes n}(x_{1},\dots,x_{n})\right)^{\lambda}=\pi^{\overline{\lambda}}(1-\pi)^{\lambda}\left(1-\mathcal{H}_{\overline{\lambda}}(p^{\otimes n},q^{\otimes n})\right)\leq\pi^{\overline{\lambda}}\left(1-\mathcal{H}_{\overline{\lambda}}(p,q)\right)^{n}\leq\pi^{\overline{\lambda}}\left(e^{-\mathcal{H}_{\overline{\lambda}}(p,q)}\right)^{n}\,.
$$
By the definition of $\lambda$, we have $\pi^{\overline{\lambda}}\lesssim\pi$. Thus, if $e^{-\mathcal{H}_{\overline{\lambda}}(p,q)n}$ is sufficiently small, the whole error probability will be less than $\pi/4$, yielding the desired upper bound on the sample complexity. ∎
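The chain of inequalities in the sketch can be tested end-to-end on a small alphabet by computing $P_{e}^{(n)}$ exactly; the following standalone check (illustrative parameters, not from the paper) verifies $P_{e}^{(n)}≤\pi^{\overline{\lambda}}e^{-n\mathcal{H}_{\overline{\lambda}}(p,q)}$.

```python
import itertools
import math

p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]
pi, n = 0.1, 4

lam = 0.5 * math.log(2) / math.log(1 / pi)   # lambda as in Proposition 5.2
lbar = 1 - lam
H = 1 - sum(a**lbar * b**(1 - lbar) for a, b in zip(p, q))   # H_{lbar}(p, q)

# Exact Bayes error with n i.i.d. samples, by enumerating the product alphabet.
pe = sum(min(pi * math.prod(p[i] for i in idx),
             (1 - pi) * math.prod(q[i] for i in idx))
         for idx in itertools.product(range(3), repeat=n))

assert pe <= pi**lbar * math.exp(-H * n)
```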
We remark that similar arguments have appeared classically in [HelRav70].
Main technical result:
We now arrive at the main technical contribution of our work. Following the standard tools in our toolkit, we have obtained lower and upper bounds that depend on two different quantities: $I(\Theta;X_{1})$ and $\mathcal{H}_{\overline{\lambda}}(p,q)$, for $\lambda:=\frac{0.5\log 2}{\log(1/\pi)}$. Our main contribution is to show that the two resulting bounds are equivalent up to constant factors for all $p$, $q$, and $\pi$; in particular, the inequality $\frac{\pi\log(1/\pi)}{I(\Theta;X_{1})}\lesssim\frac{1}{\mathcal{H}_{\overline{\lambda}}(p,q)}$ follows directly from these results. Our proof technique relies on noting that $I(\Theta;X_{1})$ is equal to the $f$-divergence $\mathrm{JS}_{\pi}(p,q)$ defined in Definition 4.16; see Proposition 4.17.
Thus, our task reduces to showing a tight relationship between the $f$-divergences $\mathrm{JS}_{\pi}(p,q)$ and $\mathcal{H}_{\overline{\lambda}}(p,q)$. Framing this inequality in terms of $f$-divergences permits us to use the powerful joint range method of [HarVaj11] from Lemma 4.19. Thus, it remains to show that $\frac{\pi\log(1/\pi)}{\mathrm{JS}_{\pi}(p,q)}\gtrsim\frac{1}{\mathcal{H}_{\overline{\lambda}}(p,q)}$ for Bernoulli distributions $p$ and $q$.
**Lemma 5.3 (Relationship between $\mathrm{JS}$ and $\mathcal{H}$, formally stated in Lemma 7.4)**
*Let $p$ and $q$ be two Bernoulli distributions. Let $\pi∈(0,1/2]$ and $\lambda=\frac{0.5\log 2}{\log(1/\pi)}$. Then $\frac{1}{\mathcal{H}_{\overline{\lambda}}(p,q)}\lesssim\frac{\pi\log(1/\pi)}{\mathrm{JS}_{\pi}(p,q)}$.*
The proof of the above lemma is rather technical and follows by careful case-by-case analyses.
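While the full proof requires case analysis, the inequality is easy to probe numerically over a grid of Bernoulli pairs and priors. The sketch below is a sanity check, not part of the proof; the bound of $10$ on the ratio is an illustrative constant chosen for the check, not the lemma's constant.

```python
import math

def kl_ber(a, b):
    # KL between Ber(a) and Ber(b), with the 0 log 0 = 0 convention.
    def term(u, v):
        return 0.0 if u == 0 else u * math.log(u / v)
    return term(a, b) + term(1 - a, 1 - b)

def js_ber(x, y, pi):
    m = pi * x + (1 - pi) * y
    return pi * kl_ber(x, m) + (1 - pi) * kl_ber(y, m)

def h_lam_ber(x, y, lam):
    # H_lambda between Ber(x) and Ber(y).
    return 1 - (x**lam * y**(1 - lam) + (1 - x)**lam * (1 - y)**(1 - lam))

worst = 0.0
for pi in (0.5, 0.1, 0.01, 1e-3):
    lbar = 1 - 0.5 * math.log(2) / math.log(1 / pi)
    for i in range(1, 40):
        for j in range(1, 40):
            x, y = i / 40, j / 40
            if x == y:
                continue
            ratio = js_ber(x, y, pi) / (pi * math.log(1 / pi) * h_lam_ber(x, y, lbar))
            worst = max(worst, ratio)

# The worst observed ratio stays bounded, consistent with Lemma 5.3.
assert worst < 10
```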
5.2 Sublinear error probability regime
The formal statements of all results in this section and their complete proofs are in Section 6.
In this section, we sketch the argument for the proof of Theorem 2.1 in the regime $\delta∈(\pi^{2},\pi/100)$ . Instead of characterizing the sample complexity in terms of an $f$ -divergence, we adopt a reduction approach and establish a tight relationship between ${n}^{*}(p,q,\pi,\delta)$ and ${n}^{*}(p,q,\pi^{\prime},\delta^{\prime})$ .
Our starting point is the standard boosting lemma, which states that if a decision procedure fails with probability $\tau≤ 1/4$ with $n$ samples, then taking the majority vote of $T$ (independent) runs of the same decision procedure fails with probability $\tau^{\prime}\ll\tau$ if $T\asymp\log(1/\tau^{\prime})/\log(1/\tau)$ ; see Fact 4.9. As a consequence, we immediately obtain the following result:
**Lemma 5.4 (Success amplification, formally stated in Lemma 6.4)**
*For all positive integers $T$ such that $\max\left((\delta/\pi)^{1/T},\pi^{1/T}\right)≤ 1/8$, we have ${n}^{*}_{\mathrm{B}}\left(p,q,\pi,\delta\right)\lesssim T·{n}^{*}_{\mathrm{B}}\left(p,q,\pi^{1/T},\delta^{1/T}\right)$.*
However, the above boosting procedure might be lossy; recall that the optimal procedure remains the likelihood ratio test. The following result establishes that the likelihood ratio test also suffers the same sample complexity (up to constant factors):
**Lemma 5.5 (Error amplification, formally stated in Lemma 6.3)**
*Under the same setting as in Lemma 5.4, we have ${n}^{*}_{\mathrm{B}}\left(p,q,\pi,\delta\right)\gtrsim T·{n}^{*}_{\mathrm{B}}\left(p,q,\pi^{1/T},\delta^{1/T}\right)$.*
* Sketch*
Consider the likelihood ratio test with $n:={n}^{*}(p,q,\pi,\delta)$ samples. Divide the $n$ samples into $T$ buckets, each of size $m:=n/T$; for simplicity of presentation, we assume that $n$ is divisible by $T$ for this proof sketch. Let the $i^{\text{th}}$ bucket be $B_{i}$ and denote the log-likelihood ratio on $B_{i}$ by $L_{i}:=\log\left(\prod_{x∈ B_{i}}\frac{p_{x}}{q_{x}}\right)$. The optimal likelihood ratio test $\phi^{*}$ takes $n$ samples $x_{1},\dots,x_{n}$ and returns $p$ if the joint log-likelihood ratio $L:=\sum_{i∈[T]}L_{i}$ is at least $\log((1-\pi)/\pi)$, and $q$ otherwise. Consider the event $\mathcal{E}_{i}$ over the points in the bucket $B_{i}$ (which has $m$ points), defined to be the domain where $L_{i}≤\log((1-\pi)/\pi)/T$. Define $\mathcal{E}:=\bigcap_{i=1}^{T}\mathcal{E}_{i}$ and $\mathcal{E}^{\prime}:=\bigcap_{i=1}^{T}\mathcal{E}_{i}^{\complement}$. On the event $\mathcal{E}$, we have $L<\log((1-\pi)/\pi)$, and on the event $\mathcal{E}^{\prime}$, we have $L≥\log((1-\pi)/\pi)$. By the definition of the test, the (average) probability of error of $\phi^{*}$ is
$$
\mathbb{P}(\text{error})=\pi\mathbb{P}(L<\log((1-\pi)/\pi)|p)+(1-\pi)\mathbb{P}(L≥\log((1-\pi)/\pi)|q)\geq\pi\mathbb{P}(\mathcal{E}|\Theta=p)+(1-\pi)\mathbb{P}(\mathcal{E}^{\prime}|\Theta=q)=\pi\left(\mathbb{P}(\mathcal{E}_{1}|\Theta=p)\right)^{T}+(1-\pi)\left(\mathbb{P}(\mathcal{E}_{1}^{\complement}|\Theta=q)\right)^{T}\,, \tag{15}
$$
where the last step uses the fact that the events are conditionally i.i.d. Consider the test $\phi$ on $m$ samples that outputs $p$ on $\mathcal{E}_{1}^{\complement}$ and $q$ on $\mathcal{E}_{1}$ . Let $e_{1}$ and $e_{2}$ be its marginal failure probabilities, defined as $\mathbb{P}(\mathcal{E}_{1}|\Theta=p)$ and $\mathbb{P}(\mathcal{E}_{1}^{\complement}|\Theta=q)$ , respectively. By Equation 15, we have $\pi e_{1}^{T}+(1-\pi)e_{2}^{T}≤\delta$ . Thus, $e_{1}≤(\delta/\pi)^{1/T}$ and $e_{2}≤(2\delta)^{1/T}$ . Suppose we use $\phi$ to solve $\mathcal{B}(p,q,\pi^{\prime},\delta^{\prime})$ for $\pi^{\prime}=\pi^{1/T}$ and $\delta^{\prime}=3\delta^{1/T}$ . Its average error probability will be at most $\pi^{\prime}e_{1}+(1-\pi^{\prime})e_{2}≤ 3\delta^{1/T}$ . By the definition of sample complexity, we have $m≥{n}^{*}(p,q,\pi^{\prime},\delta^{\prime})$ , so ${n}^{*}≥ T{n}^{*}(p,q,\pi^{\prime},\delta^{\prime})$ . For simplicity of presentation, we ignore the distinction between the failure probabilities $\delta^{1/T}$ and $3\delta^{1/T}$ . ∎
6 Reduction: Error amplification and success amplification
In this section, we first show by a series of simple reductions that we can reduce the problem of computing ${n}^{*}_{\mathrm{B}}(p,q,\pi,\delta)$ for any $\delta≤\pi/4$ to computing ${n}^{*}_{\mathrm{B}}(p,q,\pi^{\prime},\pi^{\prime}/10)$ for some $\pi^{\prime}$ appropriately defined, formalizing the discussion from Section 5.2. Later, our main result will be to give an expression for ${n}^{*}_{\mathrm{B}}(p,q,\pi,\pi/10)$ .
**Proposition 6.1 (Self-reduction)**
*Let $p$ and $q$ be two arbitrary distributions such that ${n}^{*}_{\mathrm{PF}}(p,q,1/4,1/4)≥ 2$. The condition ${n}^{*}_{\mathrm{PF}}(p,q,1/4,1/4)≥ 2$ is guaranteed by $\mathrm{h}^{2}(p,q)≤ 0.125$.
1. (Prior-free testing problem) Let $\alpha∈(0,1/8)$ and $\beta∈(0,1/4)$. Then for all integers $T$ such that $\max\left((2\alpha)^{1/T},(2\beta)^{1/T}\right)≤ 1/4$, we have ${n}^{*}_{\mathrm{PF}}\left(p,q,\alpha,\beta\right)\asymp T·{n}^{*}_{\mathrm{PF}}\left(p,q,\alpha^{\frac{1}{T}},\beta^{\frac{1}{T}}\right)$.
In particular, if $\beta≤\alpha<1/8$, let $T=\left\lfloor\frac{\log(1/2\alpha)}{\log 4}\right\rfloor$, so that $\alpha^{\frac{1}{T}}∈\left[\frac{1}{16},\frac{1}{4}\right]$. Defining $\beta^{\prime}:=\beta^{1/T}$, we obtain
$$
{n}^{*}_{\mathrm{PF}}\left(p,q,\alpha,\beta\right)\asymp\log\left(\frac{1}{\alpha}\right){n}^{*}_{\mathrm{PF}}\left(p,q,1/4,\beta^{\prime}\right)\,.
$$
2. (Bayesian testing problem) Let $\pi∈(0,1/2]$ and $\delta∈(0,\pi/8)$. For any integer $T$ such that $(\delta/\pi)^{\frac{1}{T}}≤ 1/8$ and $\pi^{\frac{1}{T}}≤ 1/2$, we have ${n}^{*}_{\mathrm{B}}\left(p,q,\pi,\delta\right)\asymp T·{n}^{*}_{\mathrm{B}}\left(p,q,\pi^{\frac{1}{T}},\delta^{\frac{1}{T}}\right)$.
In particular, if $\delta∈(\pi^{2},\pi/8)$, we can set $T=\left\lfloor\frac{\log(\pi/\delta)}{\log 8}\right\rfloor$ and $\pi^{\prime}:=\pi^{\frac{1}{T}}$ (so that $\delta^{\frac{1}{T}}∈\left[\frac{\pi^{\prime}}{64},\frac{\pi^{\prime}}{8}\right]$ and $\pi^{\prime}∈(0,1/2]$) to obtain
$$
{n}^{*}_{\mathrm{B}}\left(p,q,\pi,\delta\right)\asymp\log\left(\frac{\pi}{\delta}\right){n}^{*}_{\mathrm{B}}\left(p,q,\pi^{\prime},\pi^{\prime}/4\right)\,.
$$*
The proof of the proposition is contained in Section 6.1.
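The particular choices of $T$ in the two parts can be sanity-checked by direct arithmetic; the snippet below (with illustrative values of $\alpha$, $\pi$, and $\delta$ satisfying the stated conditions) verifies the claimed ranges.

```python
import math

# Part 1: T = floor(log(1/(2*alpha)) / log 4) puts (2*alpha)^(1/T) in [1/16, 1/4].
for alpha in (0.124, 0.01, 1e-4, 1e-8):
    T = math.floor(math.log(1 / (2 * alpha)) / math.log(4))
    v = (2 * alpha)**(1 / T)
    assert T >= 1 and 1 / 16 <= v <= 1 / 4

# Part 2: T = floor(log(pi/delta) / log 8) puts (delta/pi)^(1/T) in [1/64, 1/8],
# for delta in (pi^2, pi/8).
for pi_ in (0.1, 0.01, 1e-3):
    for delta in (pi_**2 * 1.01, pi_ / 8.01):
        T = math.floor(math.log(pi_ / delta) / math.log(8))
        r = (delta / pi_)**(1 / T)
        assert T >= 1 and 1 / 64 <= r <= 1 / 8
```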
Error amplification:
We begin by deriving a lower bound on the prior-free sample complexity in terms of an easier problem: a prior-free testing problem on the same distributions, but with larger error probabilities.
**Lemma 6.2 (Error amplification)**
*For any positive integer $T$, we have
$$
{n}^{*}_{\mathrm{PF}}(p,q,\alpha,\beta)\geq T\cdot\left({n}^{*}_{\mathrm{PF}}\left(p,q,(2\alpha)^{1/T},(2\beta)^{1/T}\right)-1\right). \tag{16}
$$*
* Proof of Lemma 6.2*
Exactly characterizing the sample complexity in the prior-free setting is difficult, because it is not clear what the optimal threshold should be for the likelihood ratio test. Thus, we first reduce to the Bayesian setting, using Equation 9 in Claim 4.6:
$$
{n}^{*}_{\mathrm{PF}}(p,q,\alpha,\beta)\geq{n}^{*}_{\mathrm{B}}\left(p,q,\frac{\beta}{\alpha+\beta},\frac{2\alpha\beta}{\alpha+\beta}\right)\,. \tag{17}
$$
Using the exact characterization of the optimal test for the Bayesian setting (Fact 4.4), we establish the following error amplification result:
**Lemma 6.3 (Error amplification)**
*For any positive integer $T$, we have
$$
{n}^{*}_{\mathrm{B}}\left(p,q,\frac{\beta}{\alpha+\beta},\frac{2\alpha\beta}{\alpha+\beta}\right)≥ T·\left({n}^{*}_{\mathrm{PF}}(p,q,(2\alpha)^{1/T},(2\beta)^{1/T})-1\right)\,.
$$*
The above lemma, combined with Equation 17, completes the proof of Lemma 6.2. ∎
We now provide the proof of Lemma 6.3.
* Proof of Lemma 6.3*
Consider the problem $\mathcal{B}_{\mathrm{B}}\left(p,q,\frac{\beta}{\alpha+\beta},\frac{2\alpha\beta}{\alpha+\beta}\right)$ with sample complexity ${n}^{*}:={n}^{*}_{\mathrm{B}}\left(p,q,\frac{\beta}{\alpha+\beta},\frac{2\alpha\beta}{\alpha+\beta}\right)$. Given any positive integer $T$, let $n$ be the smallest integer, at least ${n}^{*}$, which is divisible by $T$. Observe that $T\lceil{n}^{*}/T\rceil≥ n≥{n}^{*}$. By Fact 4.4, an optimal test $\phi^{*}$ takes $n$ samples $x_{1},\dots,x_{n}$ and returns $p$ if $L:=\log(\prod_{i=1}^{n}p_{x_{i}}/q_{x_{i}})$ is at least $\log(\alpha/\beta)$, and $q$ otherwise. Since $n≥{n}^{*}$, the average error probability must be at most $2\alpha\beta/(\alpha+\beta)$. Let $\alpha^{\prime}=(2\alpha)^{1/T}$ and $\beta^{\prime}=(2\beta)^{1/T}$. We will now show that unless $n≥ T·{n}^{*}_{\mathrm{PF}}(p,q,\alpha^{\prime},\beta^{\prime})$, the (average) probability of error of the test $\phi^{*}$ is strictly larger than $2\alpha\beta/(\alpha+\beta)$, which gives the desired claim.

Divide the entire dataset into $T$ buckets $B_{1},\dots,B_{T}$, each of size $n/T$ (such that the partition is independent of the data). Define $L_{i}$ to be the log-likelihood ratio on the bucket $B_{i}$, i.e., $L_{i}=\log(\prod_{x∈ B_{i}}\left(p_{x}/q_{x}\right))$. We have $L=\sum_{i=1}^{T}L_{i}$. Consider the event $\mathcal{E}_{i}$ over the points in the bucket $B_{i}$ (which has at most $m:=n/T$ points), defined to be the domain where $L_{i}≤\log(\alpha/\beta)/T$. Define $\mathcal{E}:=\bigcap_{i=1}^{T}\mathcal{E}_{i}$ and $\mathcal{E}^{\prime}:=\bigcap_{i=1}^{T}\mathcal{E}_{i}^{\complement}$. On the event $\mathcal{E}$, we have $L<\log(\alpha/\beta)$, and on the event $\mathcal{E}^{\prime}$, we have $L≥\log(\alpha/\beta)$. By the definition of the test $\phi^{*}$, the (average) probability of error of $\phi^{*}$ satisfies
$$
\mathbb{P}(\text{error})=\frac{\beta}{\alpha+\beta}\mathbb{P}(L<\log(\alpha/\beta)|p)+\frac{\alpha}{\alpha+\beta}\mathbb{P}(L≥\log(\alpha/\beta)|q)\geq\frac{\beta}{\alpha+\beta}\left(\mathbb{P}(\mathcal{E}_{1}|p)\right)^{T}+\frac{\alpha}{\alpha+\beta}\left(\mathbb{P}(\mathcal{E}_{1}^{\complement}|q)\right)^{T}\,,
$$
where in the last step, we use the fact that the events are i.i.d., conditioned on the underlying measure. To show that the probability of error is strictly larger than $2\alpha\beta/(\alpha+\beta)$, it suffices to show that either $\left(\mathbb{P}(\mathcal{E}_{1}|p)\right)^{T}>2\alpha$ or $\left(\mathbb{P}(\mathcal{E}_{1}^{\complement}|q)\right)^{T}>2\beta$, unless $n$ is as large as claimed. Consider the test $\phi_{\mathcal{E}_{1}}$ on $m=n/T$ points that outputs $q$ on the event $\mathcal{E}_{1}$ and outputs $p$ on the event $\mathcal{E}_{1}^{\complement}$. Thus, if $\mathbb{P}(\mathcal{E}_{1}|p)≤(2\alpha)^{1/T}=\alpha^{\prime}$ and $\mathbb{P}(\mathcal{E}_{1}^{\complement}|q)≤(2\beta)^{1/T}=\beta^{\prime}$, the test $\phi_{\mathcal{E}_{1}}$ solves the hypothesis testing problem $\mathcal{B}_{\mathrm{PF}}(p,q,\alpha^{\prime},\beta^{\prime})$ with $m$ samples. By definition of the sample complexity, we have $m≥{n}^{*}_{\mathrm{PF}}(p,q,\alpha^{\prime},\beta^{\prime})$. Combined with the definition of $n$ and $m$, we obtain $T\lceil{n}^{*}/T\rceil≥ n≥ T{n}^{*}_{\mathrm{PF}}(p,q,\alpha^{\prime},\beta^{\prime})$, so $({n}^{*}/T)+1≥\lceil{n}^{*}/T\rceil≥{n}^{*}_{\mathrm{PF}}(p,q,\alpha^{\prime},\beta^{\prime})$. ∎
Success amplification:
We now turn our attention to the upper bound. To mirror the style of our lower bound, which was error amplification, we will establish the upper bound using success amplification.
**Lemma 6.4 (Success amplification)**
*For all positive integers $T$ such that $\max\left(\alpha^{1/T},\beta^{1/T}\right)≤ 1/4$, we have
$$
{n}^{*}_{\mathrm{PF}}\left(p,q,\alpha,\beta\right)\lesssim T\cdot{n}^{*}_{\mathrm{PF}}\left(p,q,\alpha^{1/T},\beta^{1/T}\right)\,. \tag{18}
$$*
* Proof of Lemma 6.4*
Let $\alpha^{\prime}=\alpha^{1/(cT)}$ and $\beta^{\prime}=\beta^{1/(cT)}$, where $c=1/32$ is from Fact 4.9; both are less than $1/4$ by assumption. Consider the test $\phi^{*}$ achieving the sample complexity $n:={n}^{*}_{\mathrm{PF}}(p,q,\alpha^{\prime},\beta^{\prime})$. Design the test $\phi^{\prime}$ that randomly divides the data of $nT$ samples into $T$ buckets, runs $\phi^{*}$ on each bucket, and outputs the majority of the $T$ individual outputs. By Fact 4.9, the probability of error of the boosted procedure when the distribution is $p$ (respectively, $q$) is at most $(\alpha^{\prime})^{T/32}=(\alpha^{32/T})^{T/32}=\alpha$ (respectively, $\beta$). Thus, the sample complexity is at most $T·{n}^{*}_{\mathrm{PF}}(p,q,\alpha^{1/(cT)},\beta^{1/(cT)})$. The desired result will follow if we show that ${n}^{*}_{\mathrm{PF}}(p,q,\alpha^{1/(cT)},\beta^{1/(cT)})$ is at most a constant times ${n}^{*}_{\mathrm{PF}}(p,q,\alpha^{1/T},\beta^{1/T})$, which follows from Proposition 4.7, since $\alpha^{1/T}≤ 1/4$ and $\beta^{1/T}≤ 1/4$. This completes the proof. ∎
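The majority-vote boosting underlying this proof can be checked directly: with per-run error $a≤ 1/4$, the exact binomial tail for the majority of $T$ runs falls below $a^{T/32}$, matching the exponent $c=1/32$ quoted from Fact 4.9. The snippet below is a standalone check with illustrative parameters, not code from the paper.

```python
import math

def majority_error(a, T):
    # Exact P(majority of T i.i.d. runs is wrong) when each run errs w.p. a
    # (ties for even T would count as errors; we use odd T here).
    return sum(math.comb(T, k) * a**k * (1 - a)**(T - k)
               for k in range((T + 1) // 2, T + 1))

for T in (5, 17, 33):
    for a in (0.05, 0.15, 0.25):
        assert majority_error(a, T) <= a**(T / 32)
```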
6.1 Proof of Proposition 6.1
* Proof ofProposition6.1*
We first start with the prior-free setting. Combining Lemmata 6.4 and 6.2, we obtain the following (for all $T∈\mathbb{N}$ such that $\max\left(\alpha^{1/T},\beta^{1/T}\right)≤ 1/4$, which holds under the assumptions on $T$ from Proposition 6.1):
$$
T\cdot\left({n}^{*}_{\mathrm{PF}}\left(p,q,(2\alpha)^{1/T},(2\beta)^{1/T}\right)-1\right)\leq{n}^{*}_{\mathrm{PF}}(p,q,\alpha,\beta)\lesssim T\cdot{n}^{*}_{\mathrm{PF}}\left(p,q,\alpha^{1/T},\beta^{1/T}\right)\,. \tag{19}
$$
Since $\max\left((2\alpha)^{1/T},(2\beta)^{1/T}\right)≤ 1/4$, monotonicity of the sample complexity implies that
$$
{n}^{*}_{\mathrm{PF}}\left(p,q,(2\alpha)^{1/T},(2\beta)^{1/T}\right)\geq{n}^{*}_{\mathrm{PF}}\left(p,q,1/4,1/4\right)\geq 2,
$$
which further implies that the left-hand expression in Equation 19 is further lower-bounded by $0.5T\,{n}^{*}_{\mathrm{PF}}\left(p,q,(2\alpha)^{1/T},(2\beta)^{1/T}\right)$. Thus, we obtain
$$
0.5T·{n}^{*}_{\mathrm{PF}}\left(p,q,(2\alpha)^{1/T},(2\beta)^{1/T}\right)≤{n}^{*}_{\mathrm{PF}}(p,q,\alpha,\beta)\lesssim T·{n}^{*}_{\mathrm{PF}}\left(p,q,\alpha^{1/T},\beta^{1/T}\right)\,.
$$
Note that $\max\left((2\alpha)^{1/T},(2\beta)^{1/T},\alpha^{1/T},\beta^{1/T}\right)≤ 1/4$ and $\log(1/x)/\log(1/2x)\lesssim 1$ for $x≤ 1/4$ . Combining these facts with the application of Proposition 4.7 to the display equation above, we conclude that the upper and lower bounds are within constants of each other, so
$$
{n}^{*}_{\mathrm{PF}}(p,q,\alpha,\beta)\asymp T·{n}^{*}_{\mathrm{PF}}\left(p,q,\alpha^{1/T},\beta^{1/T}\right)\,.
$$
We now explain the special choice of $T$: we set $T=\left\lfloor\frac{\log(1/2\alpha)}{\log 4}\right\rfloor$, which is at least $1$, since $\alpha≤ 1/8$. We will now show that $\max((2\alpha)^{1/T},(2\beta)^{1/T})≤ 1/4$, and additionally that $(2\alpha)^{1/T}∈[1/16,1/4]$. Indeed, we have
$$
(2\beta)^{1/T}\leq(2\alpha)^{1/T}=e^{\frac{\log(2\alpha)}{T}}\leq e^{\frac{\log(2\alpha)}{\log(1/2\alpha)/\log 4}}=e^{-\log 4}=1/4.
$$
Similarly, it can be shown that
$$
(2\alpha)^{1/T}=e^{\frac{\log(2\alpha)}{T}}\geq e^{\frac{\log(2\alpha)}{\log(1/2\alpha)/(2\log 4)}}=e^{-2\log 4}=1/16,
$$
where we use the fact that $T≥ 1$, so $T≥\frac{\log(1/2\alpha)}{2\log 4}$. Thus, the result established above specializes to ${n}^{*}_{\mathrm{PF}}(p,q,\alpha,\beta)\asymp T{n}^{*}_{\mathrm{PF}}(p,q,\alpha^{\frac{1}{T}},\beta^{1/T})$, and the latter is equivalent (up to constants) to ${n}^{*}_{\mathrm{PF}}(p,q,1/4,\beta^{1/T})$ by Proposition 4.7. We conclude that
$$
{n}^{*}_{\mathrm{PF}}\left(p,q,\alpha,\beta\right)\asymp\log\left(\frac{1}{\alpha}\right){n}^{*}_{\mathrm{PF}}\left(p,q,1/4,\beta^{\prime}\right)\,,
$$
proving the first part of Proposition 6.1. We now prove the second part of the proposition by translating the results from the prior-free setting to the Bayesian setting using Corollary 4.8. For any positive integer $T$ such that $\left(\delta/\pi\right)^{\frac{1}{T}}≤ 1/8$, we have
$$
\begin{aligned}
{n}^{*}_{\mathrm{B}}\left(p,q,\pi,\delta\right)&\asymp{n}^{*}_{\mathrm{PF}}\left(p,q,\frac{\delta}{\pi},\delta\right)\asymp T\cdot{n}^{*}_{\mathrm{PF}}\left(p,q,\left(\frac{\delta}{\pi}\right)^{\frac{1}{T}},\delta^{\frac{1}{T}}\right)\\
&\asymp T\cdot{n}^{*}_{\mathrm{B}}\left(p,q,\frac{\delta^{\frac{1}{T}}}{\delta^{\frac{1}{T}}+\left(\delta/\pi\right)^{\frac{1}{T}}},\frac{\delta^{\frac{1}{T}}}{\delta^{\frac{1}{T}}+\left(\delta/\pi\right)^{\frac{1}{T}}}\cdot\left(\frac{\delta}{\pi}\right)^{\frac{1}{T}}\right)\\
&=T\cdot{n}^{*}_{\mathrm{B}}\left(p,q,\frac{\pi^{\frac{1}{T}}}{\pi^{\frac{1}{T}}+1},\frac{\delta^{\frac{1}{T}}}{\pi^{\frac{1}{T}}+1}\right)\asymp T\cdot{n}^{*}_{\mathrm{B}}\left(p,q,\pi^{\frac{1}{T}},\delta^{\frac{1}{T}}\right),
\end{aligned}
$$
where the last equivalence follows from Proposition 4.7, as follows: (i) both priors are less than $1/2$; (ii) the ratio of the error probability to the prior, $(\delta/\pi)^{\frac{1}{T}}$, is at most $1/4$; and (iii) $\log(1/\delta_{1})/\log(1/\delta_{2})\asymp 1$ for $\delta_{1}=\delta^{\frac{1}{T}}$ and $\delta_{2}=\delta_{1}/(1+\pi^{\frac{1}{T}})$, since $\delta_{1}/2≤\delta_{2}≤\delta_{1}≤ 1/4$. We now explain the special choice of $T$: we set $T=\left\lfloor\frac{\log(\pi/\delta)}{\log 8}\right\rfloor$, which is at least $1$, and we also took $\pi^{\prime}=\pi^{\frac{1}{T}}$ and $\delta^{\prime}=\delta^{1/T}$. In particular, $\frac{\delta^{\prime}}{\pi^{\prime}}=\left(\frac{\delta}{\pi}\right)^{\frac{1}{T}}≤ e^{-\log 8}=1/8$. Similarly, we see that $\frac{\delta^{\prime}}{\pi^{\prime}}≥\frac{1}{64}$. Moreover, we have $\pi^{\prime}∈(0,1/2)$, since $T≤\frac{\log(1/\pi)}{\log 8}$ because $\delta≥\pi^{2}$, implying that $\pi^{\frac{1}{T}}≤ e^{\frac{\log(\pi)\log 8}{\log(1/\pi)}}=1/8<1/2$. Specializing the result above to this choice of $T$, we obtain
$$
{n}^{*}_{\mathrm{B}}(p,q,\pi,\delta)\asymp T{n}^{*}_{\mathrm{B}}\left(p,q,\pi^{\prime},\delta^{\prime}\right)\asymp T{n}^{*}_{\mathrm{B}}\left(p,q,\pi^{\prime},\pi^{\prime}/4\right),
$$
where the last equivalence uses Proposition 4.7, since $\delta^{\prime}∈(\pi^{\prime}/64,\pi^{\prime}/8)$ . ∎
7 Bayesian testing problem: Linear regime
In this section, we establish upper and lower bounds for the Bayesian simple binary hypothesis testing problem $\mathcal{B}_{\mathrm{B}}(p,q,\pi,\delta)$ when $\delta$ is a small constant fraction of $\pi$ . Observe that the following result recovers the case $\pi=1/2$ by noting that $\mathrm{JS}_{1/2}(p,q)\asymp\mathrm{h}^{2}(p,q)$ :
**Theorem 7.1 (Characterization of sample complexity of ${n}^{*}_{\mathrm{B}}$ in the linear error probability regime)**
*For any $\pi∈(0,0.5]$, we have
$$
\left\lceil\frac{3}{16}\frac{\pi\log(1/\pi)}{\mathrm{JS}_{\pi}(p,q)}\right\rceil≤{n}^{*}_{\mathrm{B}}(p,q,\pi,\pi/4)\lesssim\left\lceil\frac{380\pi\log(1/\pi)}{\mathrm{JS}_{\pi}(p,q)}\right\rceil\,.
$$
In particular, if the condition ${n}^{*}_{\mathrm{B}}(p,q,\pi,\pi/4)>1$ holds (this mild regularity condition necessarily holds when either (i) ${n}^{*}_{\mathrm{PF}}(p,q,1/4,1/4)>1$, (ii) $\mathrm{h}^{2}(p,q)≤ 0.125$, (iii) $\mathrm{JS}_{\pi}(p,q)<(3/16)\pi\log(1/\pi)$, or (iv) $\mathcal{H}_{1-\frac{0.5\log 2}{\log(1/\pi)}}(p,q)≤\frac{1}{1024}$; see also the discussion in Section 4), we have
$$
{n}^{*}(p,q,\pi,\pi/4)\asymp\frac{\pi\log(1/\pi)}{\mathrm{JS}_{\pi}(p,q)}\asymp\frac{1}{\mathcal{H}_{1-\frac{0.5\log 2}{\log(1/\pi)}}(p,q)}\,.
$$*
In fact, our techniques give upper and lower bounds on ${n}^{*}_{\mathrm{B}}(p,q,\pi,\pi(1-\gamma))$ for all sufficiently small $\gamma$ (not just a large enough constant), which we will state in Section 8.
* Proof*
We establish this result in three steps: First, we show a lower bound in terms of $\mathrm{JS}_{\pi}(p,q)$ (cf. Proposition 7.2); second, we show an upper bound in terms of $\mathcal{H}_{\lambda}(p,q)$ for a particular value of $\lambda$ (cf. Proposition 7.3); and finally, we show that the upper and lower bounds are within constants of each other (cf. Lemma 7.4). We start with the lower bound for general $\pi$ :
**Proposition 7.2 (Lower bound for ${n}^{*}_{\mathrm{B}}$)**
*For any $\pi∈(0,0.5]$ and $\gamma∈(0,1)$, we have
$$
{n}^{*}_{\mathrm{B}}(p,q,\pi,\pi(1-\gamma))≥\left\lceil\frac{\pi\gamma\log\left(\frac{1-\pi}{\pi}\right)+\pi^{2}\gamma^{2}}{\mathrm{JS}_{\pi}(p,q)}\right\rceil\,.
$$
In particular, for $\gamma=3/4$, we have ${n}^{*}_{\mathrm{B}}(p,q,\pi,\pi/4)≥\left\lceil\frac{3}{16}\frac{\pi\log(1/\pi)}{\mathrm{JS}_{\pi}(p,q)}\right\rceil$.*
* Proof of Proposition 7.2*
Let $\Theta,\widehat{\Theta},X_{1},\dots,X_{n}$ be as in Definition 1.1. Since $\widehat{\Theta}$ is a function of $X_{1},\dots,X_{n}$, it follows that $\Theta→(X_{1},\dots,X_{n})→\widehat{\Theta}$ is a Markov chain. By Fano’s inequality (Fact 4.18), we have
$$
h_{\mathrm{bin}}(P_{e}^{(n)})\geq H_{\mathrm{ent}}(\Theta|X_{1},\dots,X_{n})\,. \tag{20}
$$
We now calculate upper and lower bounds on $I(\Theta;X_{1},\dots,X_{n})$ using two different expressions. First, using the i.i.d. distribution of the $X_{i}$’s conditioned on $\Theta$, we obtain
$$
\begin{aligned}
I(\Theta;X_{1},\dots,X_{n})&=H_{\mathrm{ent}}(X_{1},\dots,X_{n})-H_{\mathrm{ent}}(X_{1},\dots,X_{n}|\Theta)\\
&\leq\left(\sum_{i}H_{\mathrm{ent}}(X_{i})\right)-H_{\mathrm{ent}}(X_{1},\dots,X_{n}|\Theta)\\
&=nH_{\mathrm{ent}}(X_{1})-H_{\mathrm{ent}}(X_{1},\dots,X_{n}|\Theta)\\
&=nH_{\mathrm{ent}}(\pi p+(1-\pi)q)-\left(\pi H_{\mathrm{ent}}(X_{1},\dots,X_{n}|\Theta=p)+(1-\pi)H_{\mathrm{ent}}(X_{1},\dots,X_{n}|\Theta=q)\right)\\
&=nH_{\mathrm{ent}}(\pi p+(1-\pi)q)-\left(\pi nH_{\mathrm{ent}}(p)+(1-\pi)nH_{\mathrm{ent}}(q)\right)\\
&=n\left(H_{\mathrm{ent}}(\pi p+(1-\pi)q)-\pi H_{\mathrm{ent}}(p)-(1-\pi)H_{\mathrm{ent}}(q)\right)\\
&=nI(X_{1};\Theta)=n\,\mathrm{JS}_{\pi}(p,q), \tag{21}
\end{aligned}
$$
where the last equality is by Proposition 4.17. Combining this inequality with the fact that
$$
I(\Theta;X_{1},\dots,X_{n})=H_{\mathrm{ent}}(\Theta)-H_{\mathrm{ent}}(\Theta|X_{1},\dots,X_{n}),
$$
we obtain
$$
H_{\mathrm{ent}}(\Theta|X_{1},\dots,X_{n})\geq H_{\mathrm{ent}}(\Theta)-n\,\mathrm{JS}_{\pi}(p,q)\,. \tag{22}
$$
Combining this with Equation 20, we obtain
$$
h_{\mathrm{bin}}(P_{e}^{(n)})≥ H_{\mathrm{ent}}(\Theta)-n\,\mathrm{JS}_{\pi}(p,q),
$$
which implies that $n≥\frac{h_{\mathrm{bin}}(\pi)-h_{\mathrm{bin}}(P_{e}^{(n)})}{\mathrm{JS}_{%
\pi}(p,q)}$ , using the fact that $H_{\mathrm{ent}}(\Theta)=h_{\mathrm{bin}}(\pi)$ . Finally, $P_{e}^{(n)}≤\pi(1-\gamma)$ implies $h_{\mathrm{bin}}(P_{e}^{(n)})≤ h_{\mathrm{bin}}\left(\pi\left(1-\gamma%
\right)\right)$ by monotonicity of the binary entropy on $[0,1/2]$ . The strong concavity of the binary entropy function $h_{\mathrm{bin}}(·)$ Indeed, the first and second derivates of $h_{\mathrm{bin}}(x)$ are $\log\left(\frac{1-x}{x}\right)$ and $\frac{1}{x(1-x)}$ , respectively. implies that
$$
h_{\mathrm{bin}}(\pi)-h_{\mathrm{bin}}(\pi(1-\gamma))\geq\pi\gamma\nabla h_{%
\mathrm{bin}}(\pi)+\pi^{2}\gamma^{2}=\pi\gamma\log\left(\frac{1-\pi}{\pi}%
\right)+\pi^{2}\gamma^{2}.
$$
Thus, we obtain the desired lower bound of $n\geq\frac{\pi\gamma\log\left(\frac{1-\pi}{\pi}\right)+\pi^{2}\gamma^{2}}{\mathrm{JS}_{\pi}(p,q)}$. Since $n\in\mathbb{N}$, we also obtain the lower bound with the ceiling operation. We now simplify the expression for $\gamma=3/4$. We first observe that $\pi\log(\overline{\alpha}/\pi)\geq 0.5\pi\log(1/\pi)$ for $\pi\leq 1/3$. Thus, we have $\pi\gamma\log\left(\frac{1-\pi}{\pi}\right)+\pi^{2}\gamma^{2}\geq(3/8)\pi\log(1/\pi)$ for $\pi\leq 1/3$. For larger $\pi\in(1/3,1/2]$, we note that $\pi^{2}\gamma^{2}\geq\pi\log(1/\pi)\cdot\frac{\pi\gamma^{2}}{\log 2}\geq\pi\log(1/\pi)\cdot(3/16)$, since $\pi\geq 1/3$, $\gamma=3/4$, and $\log 2\leq 1$. ∎
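The strong-concavity step in the proof above can be sanity-checked numerically. The following sketch (natural logarithms throughout; the grid values are illustrative, not part of the proof) verifies $h_{\mathrm{bin}}(\pi)-h_{\mathrm{bin}}(\pi(1-\gamma))\geq\pi\gamma\log\left(\frac{1-\pi}{\pi}\right)+\pi^{2}\gamma^{2}$ over a grid of $(\pi,\gamma)$:

```python
import math

def h_bin(x):
    """Binary entropy in nats: h(x) = -x log x - (1 - x) log(1 - x)."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log(x) - (1 - x) * math.log(1 - x)

worst = float("inf")
for i in range(1, 51):
    pi = 0.5 * i / 50                  # pi in (0, 0.5]
    for j in range(1, 50):
        gamma = j / 50                 # gamma in (0, 1)
        lhs = h_bin(pi) - h_bin(pi * (1 - gamma))
        rhs = pi * gamma * math.log((1 - pi) / pi) + (pi * gamma) ** 2
        worst = min(worst, lhs - rhs)
assert worst >= -1e-12
print(f"min(lhs - rhs) on the grid: {worst:.6f}")
```

Taylor expansion with $h_{\mathrm{bin}}''\leq-4$ in fact gives a margin of at least $(\pi\gamma)^{2}$, which is what the check observes.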
We now establish a family of upper bounds in terms of $\mathcal{H}_{\lambda}$ .
**Proposition 7.3 (Upper bounds)**
*For any $\lambda∈(0,1)$ and $\pi∈(0,1/2)$ , we have
$$
\displaystyle{n}^{*}_{\mathrm{B}}(p,q,\pi,\pi(1-\gamma))\leq\left\lceil\frac{\lambda\log\left(\frac{\overline{\alpha}}{\pi}\right)+\log\left(\frac{1}{\overline{\gamma}}\right)}{\mathcal{H}_{\overline{\lambda}}(p,q)}\right\rceil.
$$
In particular, for $\gamma=3/4$ , setting $\lambda=\frac{0.5\log 2}{\log(1/\pi)}$ , which lies in $(0,0.5]$ , we have
$$
\displaystyle{n}^{*}_{\mathrm{B}}(p,q,\pi,\pi/4)\leq\left\lceil\frac{2}{\mathcal{H}_{\overline{\lambda}}(p,q)}\,\right\rceil, \tag{23}
$$
recovering the case of uniform prior ( $\pi=0.5$ ) with constant error probability (since $\lambda=0.5$ ).*
* Proof*
Recall that $P_{e}^{(n)}$ denotes the minimum Bayesian error between $p^{\otimes n}$ and $q^{\otimes n}$ , which can be evaluated using Proposition 4.13. Since $\min(x,y)≤ x^{\lambda}y^{1-\lambda}$ for any $x,y≥ 0$ and $\lambda∈(0,1)$ , we have
$$
\displaystyle P_{e}^{(n)}=\sum_{x_{1},\dots,x_{n}}\min\left(\pi p^{\otimes n}(x_{1},\dots,x_{n}),(1-\pi)q^{\otimes n}(x_{1},\dots,x_{n})\right)
\displaystyle\leq\sum_{x_{1},\dots,x_{n}}\left(\pi p^{\otimes n}(x_{1},\dots,x_{n})\right)^{\overline{\lambda}}\left((1-\pi)q^{\otimes n}(x_{1},\dots,x_{n})\right)^{\lambda}
\displaystyle=\pi^{\overline{\lambda}}(1-\pi)^{\lambda}\left(1-\mathcal{H}_{\overline{\lambda}}(p^{\otimes n},q^{\otimes n})\right)
\displaystyle=\pi^{\overline{\lambda}}\overline{\alpha}^{\lambda}\left(1-\mathcal{H}_{\overline{\lambda}}(p,q)\right)^{n}
\displaystyle\leq\pi^{\overline{\lambda}}\overline{\alpha}^{\lambda}\left(e^{-\mathcal{H}_{\overline{\lambda}}(p,q)}\right)^{n}\,.
$$
For $n\geq\left\lceil\frac{\log(\pi^{\overline{\lambda}}\overline{\alpha}^{\lambda}/\delta)}{\mathcal{H}_{\overline{\lambda}}(p,q)}\right\rceil$, the right-hand side is at most $\delta$. Finally, taking $\delta=\pi(1-\gamma)=\pi\overline{\gamma}$, the previous upper bound specializes to
$$
\displaystyle\left\lceil\frac{\log\left(\pi^{\overline{\lambda}}\overline{\alpha}^{\lambda}/(\pi\overline{\gamma})\right)}{\mathcal{H}_{\overline{\lambda}}(p,q)}\right\rceil=\left\lceil\frac{\log\left(\left(\frac{\overline{\alpha}}{\pi}\right)^{\lambda}\frac{1}{\overline{\gamma}}\right)}{\mathcal{H}_{\overline{\lambda}}(p,q)}\right\rceil\leq\left\lceil\frac{\lambda\log\left(\frac{1}{\pi}\right)+\log\left(\frac{1}{\overline{\gamma}}\right)}{\mathcal{H}_{\overline{\lambda}}(p,q)}\right\rceil\,.
$$
The specialization to $\gamma=3/4$ follows by noting that if we set $\lambda=\frac{0.5\log 2}{\log(1/\pi)}$ , the numerator is upper-bounded by $0.5\log(2)+\log 4≤ 2$ . ∎
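The Chernoff-style bound in the proof above can be checked exactly for Bernoulli distributions, where the product measure depends only on the number of ones. A minimal sketch (the specific values of $p,q,\pi,\lambda$ are illustrative):

```python
import math

def bayes_error(p, q, pi, n):
    """Exact Bayes error between Ber(p)^n and Ber(q)^n with prior (pi, 1 - pi).
    The product measures depend only on the number of ones k."""
    return sum(
        math.comb(n, k) * min(pi * p**k * (1 - p) ** (n - k),
                              (1 - pi) * q**k * (1 - q) ** (n - k))
        for k in range(n + 1)
    )

def hellinger_family(p, q, lam_bar):
    """H_{lam_bar}(Ber(p), Ber(q)) = 1 - p^{lam_bar} q^{lam} - pbar^{lam_bar} qbar^{lam}."""
    lam = 1 - lam_bar
    return 1 - p**lam_bar * q**lam - (1 - p) ** lam_bar * (1 - q) ** lam

p, q, pi, lam = 0.3, 0.6, 0.2, 0.25
H = hellinger_family(p, q, 1 - lam)
for n in range(1, 40):
    # Bound from the proof of Proposition 7.3 (using 1 - H <= exp(-H)).
    bound = pi ** (1 - lam) * (1 - pi) ** lam * math.exp(-H * n)
    assert bayes_error(p, q, pi, n) <= bound + 1e-12
print(f"H = {H:.4f}; bound verified for n = 1..39")
```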
Combining Propositions 7.2 and 7.3, we obtain, for all $\pi∈(0,0.5]$ , that
$$
\displaystyle\left\lceil\frac{3}{16}\frac{\pi\log\left(\frac{1}{\pi}\right)}{\mathrm{JS}_{\pi}(p,q)}\right\rceil\leq{n}^{*}_{\mathrm{B}}\left(p,q,\pi,\frac{\pi}{4}\right)\leq\left\lceil\frac{2}{\mathcal{H}_{\overline{\lambda}}(p,q)}\right\rceil\,, \tag{24}
$$
whenever $\lambda=\frac{0.5\log 2}{\log(1/\pi)}$ . To show matching bounds on the sample complexity, we need to establish the reverse inequality. The following result shows that this is in fact true, and is the main technical result of the paper:
**Lemma 7.4 (Relationship between $\mathrm{JS}$ and $\mathcal{H}$)**
*Let $p$ and $q$ be two distributions. Let $\pi∈(0,1/2]$ and $\lambda∈(0,1/2]$ . Then
$$
\displaystyle\mathrm{JS}_{\pi}(p,q)\leq\frac{32e^{2\lambda\log(1/\pi)}\pi}{\lambda}\mathcal{H}_{\overline{\lambda}}(p,q). \tag{25}
$$
In particular, for $\lambda=\frac{0.5\log 2}{\log(1/\pi)}$ , we have
$$
\displaystyle\left\lceil\frac{3}{16}\frac{\pi\log\left(\frac{1}{\pi}\right)}{\mathrm{JS}_{\pi}(p,q)}\right\rceil\leq\left\lceil\frac{2}{\mathcal{H}_{\overline{\lambda}}(p,q)}\right\rceil\leq\left\lceil\frac{256}{\log 2}\frac{\pi\log(1/\pi)}{\mathrm{JS}_{\pi}(p,q)}\right\rceil.
$$*
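Equation 25 can be spot-checked numerically for Bernoulli pairs, which by the joint-range reduction used in Section 7.1 is the essential case. A minimal sketch (grid values are illustrative; natural logarithms throughout):

```python
import math

def kl_bern(a, b):
    """KL(Ber(a) || Ber(b)) in nats, for a, b in (0, 1)."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def js_pi(p, q, pi):
    """JS_pi(Ber(p), Ber(q)) = pi*KL(p||m) + (1-pi)*KL(q||m), m = pi*p + (1-pi)*q."""
    m = pi * p + (1 - pi) * q
    return pi * kl_bern(p, m) + (1 - pi) * kl_bern(q, m)

def hellinger_family(p, q, lam_bar):
    """H_{lam_bar}(Ber(p), Ber(q)) = 1 - p^{lam_bar} q^{lam} - pbar^{lam_bar} qbar^{lam}."""
    lam = 1 - lam_bar
    return 1 - p**lam_bar * q**lam - (1 - p) ** lam_bar * (1 - q) ** lam

for pi in (0.05, 0.1, 0.25, 0.5):
    for lam in (0.05, 0.2, 0.5):
        # Constant from Equation 25: 32 * exp(2*lam*log(1/pi)) * pi / lam.
        c = 32 * math.exp(2 * lam * math.log(1 / pi)) * pi / lam
        for i in range(1, 20):
            for j in range(1, 20):
                p, q = i / 20, j / 20
                assert js_pi(p, q, pi) <= c * hellinger_family(p, q, 1 - lam) + 1e-12
print("Equation 25 holds on the grid")
```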
We defer the proof of Lemma 7.4 to the end of this section and first focus on showing that Propositions 7.2 and 7.3, together with Lemma 7.4, suffice to prove Theorem 7.1. To this end, we obtain the following lower bound from Proposition 7.2:
$$
\displaystyle{n}^{*}_{\mathrm{B}}(p,q,\pi,\pi/4)\geq\left\lceil\frac{3}{16}\frac{\pi\log(1/\pi)}{\mathrm{JS}_{\pi}(p,q)}\right\rceil\,. \tag{26}
$$
We now establish the upper bound using Proposition 7.3: for $\lambda=0.5\frac{\log 2}{\log(1/\pi)}∈(0,0.5]$ , we have
$$
\displaystyle{n}^{*}_{\mathrm{B}}(p,q,\pi,\pi/4)\leq\left\lceil\frac{2}{\mathcal{H}_{\overline{\lambda}}(p,q)}\right\rceil\leq\left\lceil\frac{256}{\log 2}\frac{\pi\log(1/\pi)}{\mathrm{JS}_{\pi}(p,q)}\right\rceil, \tag{27}
$$
where the second inequality follows from Lemma 7.4. Combining this with Equation 26 gives the desired result. The second statement in Theorem 7.1 follows by noting that if a positive integer $x\geq 2$ satisfies $c_{1}y\leq x\leq\lceil c_{2}y\rceil$ for some $y,c_{1},c_{2}>0$, then $\lceil c_{1}y\rceil\leq x\leq 2c_{2}y$. This is because if $x\geq 2$, then $c_{2}y>1$, so $\lceil c_{2}y\rceil\leq 2c_{2}y$. ∎
7.1 Proof of Lemma 7.4
* Proof of Lemma 7.4*
Our starting point is to note that both $\mathrm{JS}_{\pi}$ and $\mathcal{H}_{\lambda}$ are $f$-divergences. By the method of joint range [HarVaj11], it suffices to establish Equation 25 when $p$ and $q$ are Bernoulli distributions (cf. Lemma 4.19). Abusing notation in this proof, we use $p$ and $q$ to represent the biases of these Bernoulli distributions. We set $\lambda=\frac{r}{\log(1/\pi)}\in(0,1/2]$ for some $r\in(0,0.5\log(1/\pi)]$ in this proof. Without loss of generality, we will assume that $p\leq 0.5$. Going forward, we fix $p$ and $\pi$ to be arbitrary and take $q$ to be a variable. Define $j(q):=\mathrm{JS}_{\pi}(\mathrm{Ber}(p),\mathrm{Ber}(q))$ and $h(q):=(\pi e^{2r}/r)\log(1/\pi)\mathcal{H}_{\overline{\lambda}}(\mathrm{Ber}(p),\mathrm{Ber}(q))$. Thus, our goal is to show that $j(q)\lesssim h(q)$. In fact, we will establish a stronger result: there exists an absolute constant $C_{*}$, independent of $p$, $q$, and $\pi$, such that $C_{*}h(q)-j(q)$ is a nonnegative convex function; $C_{*}=32$ suffices. We argue in three steps:
1. $ch(p)-j(p)$ is equal to $0$, for all constants $c$.
1. $p$ is a stationary point of $ch(q)-j(q)$, for all constants $c$.
1. There is an absolute constant $C_{*}$ such that $\frac{d^{2}}{dq^{2}}(C_{*}h(q)-j(q))\geq 0$, for all $q$. The first claim is easy to verify, since both expressions are $f$-divergences and hence vanish at $q=p$. The second claim follows from an immediate calculation of the first derivatives, shown below, with the details deferred to Section 7.2.
**Claim 7.5 (First derivatives of $h(\cdot)$ and $j(\cdot)$)**
*$$
\displaystyle\frac{d}{dq}\mathrm{JS}_{\pi}(\mathrm{Ber}(p),\mathrm{Ber}(q))=\overline{\alpha}\log\left(\frac{q}{\overline{q}}\,\frac{\pi\overline{p}+\overline{\alpha}\,\overline{q}}{\pi p+\overline{\alpha}q}\right)\qquad\text{and}\qquad\frac{d}{dq}\mathcal{H}_{\overline{\lambda}}(\mathrm{Ber}(p),\mathrm{Ber}(q))=-\lambda p^{\overline{\lambda}}q^{-\overline{\lambda}}+\lambda\overline{p}^{\overline{\lambda}}\overline{q}^{-\overline{\lambda}}.
$$*
Thus, the remaining proof is dedicated to the last claim about the second derivatives, stating that the second derivative of $j(q)$ is at most a multiple of the second derivative of $h(q)$. We now state the main technical part of this proof, with its proof deferred below.
**Lemma 7.6**
*For any $p∈[0,0.5]$ , $q∈(0,1)$ , $\pi∈[0,1/2]$ , and $\lambda=\frac{r}{\log(1/\pi)}$ , for some $r≥ 0$ such that $\lambda∈(0,0.5]$ , we have
$$
\displaystyle\frac{d^{2}}{dq^{2}}\mathrm{JS}_{\pi}(\mathrm{Ber}(p),\mathrm{Ber}(q))\leq\frac{32e^{2r}\pi}{r}\cdot\frac{d^{2}}{dq^{2}}\left(\log(1/\pi)\mathcal{H}_{\overline{\lambda}}(\mathrm{Ber}(p),\mathrm{Ber}(q))\right)\,. \tag{28}
$$*
As argued above, Lemma 7.6 completes the proof of Equation 25; the conclusion for $q$ being exactly $0$ or $1$ follows from the continuity of $j(\cdot)$ and $h(\cdot)$. ∎
We now provide the proof of Lemma 7.6:
* Proof of Lemma 7.6*
We begin by stating Claim 7.7, which calculates the second derivatives of these functions. Its proof is deferred to Section 7.2.
**Claim 7.7 (Second derivatives of $h(\cdot)$ and $j(\cdot)$)**
*Let $\delta=p-q$ . Then
$$
\displaystyle\frac{d^{2}}{dq^{2}}\mathrm{JS}_{\pi}(\mathrm{Ber}(p),\mathrm{Ber}(q))=\frac{\pi\overline{\alpha}\left(\pi p\overline{p}+\overline{\alpha}p\overline{q}+\overline{\alpha}q\overline{p}-\overline{\alpha}q\overline{q}\right)}{q\overline{q}(\pi p+\overline{\alpha}q)\left(\pi\overline{p}+\overline{\alpha}\,\overline{q}\right)}=\frac{\pi\overline{\alpha}}{q\overline{q}}\left(1+\frac{\overline{\alpha}\delta\left(\overline{q}-q-\pi\delta\right)}{\left(q+\pi\delta\right)\left(\overline{q}-\pi\delta\right)}\right), \tag{29}
$$
and if $\lambda=r/\log(1/\pi)$ and $\lambda∈(0,1/2]$ , we have
$$
\displaystyle\frac{d^{2}}{dq^{2}}\left(\log(1/\pi)\mathcal{H}_{\overline{\lambda}}(\mathrm{Ber}(p),\mathrm{Ber}(q))\right)=\lambda\overline{\lambda}\log(1/\pi)\left(\frac{1}{q}\left(\frac{p}{q}\right)^{\overline{\lambda}}+\frac{1}{\overline{q}}\left(\frac{\overline{p}}{\overline{q}}\right)^{\overline{\lambda}}\right)\geq\frac{r}{2}\left(\frac{1}{q}\left(\frac{p}{q}\right)^{\overline{\lambda}}+\frac{1}{\overline{q}}\left(\frac{\overline{p}}{\overline{q}}\right)^{\overline{\lambda}}\right)\,. \tag{30}
$$*
In the sequel, the following approximation, proved in Section 13.2, will be especially useful:
**Claim 7.8**
*For all $x≥ 0$ , $\pi∈(0,1/2]$ , and $\lambda_{*}∈\left[0,\frac{r}{\log(1/\pi)}\right]$ , for some $r≥ 0$ such that $\lambda_{*}∈(0,0.5]$ , we have $1+\frac{x}{1+\pi x}≤ e^{2r}\left(1+x\right)^{\overline{\lambda}_{*}}$ .*
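Claim 7.8 can likewise be spot-checked numerically; a minimal sketch (the grid of $\pi$, $r$, and $x$ values is illustrative):

```python
import math

# Claim 7.8: 1 + x/(1 + pi*x) <= exp(2r) * (1 + x)^(1 - lam_star)
# whenever lam_star <= r / log(1/pi) and lam_star is in (0, 1/2].
for pi in (0.02, 0.1, 0.3, 0.5):
    for r in (0.05, 0.35, 0.5 * math.log(1 / pi)):
        lam_star = min(r / math.log(1 / pi), 0.5)
        for k in range(200):
            x = 0.1 * k                      # x in [0, 19.9]
            lhs = 1 + x / (1 + pi * x)
            rhs = math.exp(2 * r) * (1 + x) ** (1 - lam_star)
            assert lhs <= rhs + 1e-12
print("Claim 7.8 holds on the grid")
```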
We will establish Equation 28 using a case analysis.
Case 1: $q≤ p≤ 1/2$ .
In this case, $\delta:=p-q$ is nonnegative, and we use the last expression in Equation 29. Additionally, we use the following bounds that can be easily verified: (i) $\overline{q}-\pi\delta≥\overline{q}-0.5(0.5-q)≥ 0.5$ , (ii) $|\overline{q}-q-\pi\delta|≤ 1$ , and (iii) $\overline{\alpha}\delta≤\delta$ . Starting with Equation 29 in Claim 7.7, we obtain
$$
\displaystyle\frac{d^{2}}{dq^{2}}\mathrm{JS}_{\pi}(\mathrm{Ber}(p),\mathrm{Ber}(q))=\frac{\pi\overline{\alpha}}{q\overline{q}}\left(1+\frac{\overline{\alpha}\delta\left(\overline{q}-q-\pi\delta\right)}{\left(q+\pi\delta\right)\left(\overline{q}-\pi\delta\right)}\right)
\displaystyle\leq\frac{\pi\overline{\alpha}}{q\overline{q}}\left(1+\frac{\overline{\alpha}\delta\left|\overline{q}-q-\pi\delta\right|}{\left(q+\pi\delta\right)\left(\overline{q}-\pi\delta\right)}\right)
\displaystyle\leq 2\frac{\pi}{q}\left(1+\frac{2\delta}{q+\pi\delta}\right)
\displaystyle\leq 4\frac{\pi}{q}\left(1+\frac{\delta/q}{1+\pi(\delta/q)}\right)
\displaystyle\leq 4e^{2r}\frac{\pi}{q}\left(1+\frac{\delta}{q}\right)^{\overline{\lambda}}
\displaystyle=4e^{2r}\pi\frac{1}{q}\left(\frac{p}{q}\right)^{\overline{\lambda}}
\displaystyle\leq\frac{8e^{2r}\pi}{r}\frac{d^{2}}{dq^{2}}\left(\log(1/\pi)\mathcal{H}_{\overline{\lambda}}(\mathrm{Ber}(p),\mathrm{Ber}(q))\right),
$$
where the last line uses Equation 30 in Claim 7.7.
Case 2: $q≥ 1/2$ .
Let $\tau=q-p=-\delta$, which is nonnegative (since $p\leq 1/2\leq q$). Using the final expression in Equation 29 in Claim 7.7, we start with the following expression for the second derivative:
$$
\displaystyle\frac{d^{2}}{dq^{2}}\mathrm{JS}_{\pi}(\mathrm{Ber}(p),\mathrm{Ber}(q))=\frac{\pi\overline{\alpha}}{q\overline{q}}\left(1+\frac{\overline{\alpha}\tau\left(q-\pi\tau-\overline{q}\right)}{\left(q-\pi\tau\right)\left(\overline{q}+\pi\tau\right)}\right)
\displaystyle\leq\frac{\pi\overline{\alpha}}{q\overline{q}}\left(1+\frac{\overline{\alpha}\tau\left|\overline{q}+\pi\tau-q\right|}{\left(q-\pi\tau\right)\left(\overline{q}+\pi\tau\right)}\right)
\displaystyle\leq\frac{2\pi}{\overline{q}}\left(1+\frac{4\tau}{\overline{q}+\pi\tau}\right)
\displaystyle\leq\frac{8\pi}{\overline{q}}\left(1+\frac{\tau/\overline{q}}{1+\pi\tau/\overline{q}}\right)
\displaystyle\leq\frac{8e^{2r}\pi}{\overline{q}}\left(1+\frac{\tau}{\overline{q}}\right)^{\overline{\lambda}}
\displaystyle=8e^{2r}\pi\frac{1}{\overline{q}}\left(\frac{\overline{p}}{\overline{q}}\right)^{\overline{\lambda}}
\displaystyle\leq\frac{16e^{2r}\pi}{r}\frac{d^{2}}{dq^{2}}\left(\log(1/\pi)\mathcal{H}_{\overline{\lambda}}(\mathrm{Ber}(p),\mathrm{Ber}(q))\right),
$$
where the last line uses Equation 30 in Claim 7.7.
Case 3: $p≤ q≤ 1/2$ .
In this case, we start with the first equality in Equation 29 in Claim 7.7:
$$
\displaystyle\frac{d^{2}}{dq^{2}}\mathrm{JS}_{\pi}(\mathrm{Ber}(p),\mathrm{Ber}(q))=\frac{\pi\overline{\alpha}\left(\pi p\overline{p}+\overline{\alpha}p\overline{q}+\overline{\alpha}q\overline{p}-\overline{\alpha}q\overline{q}\right)}{q\overline{q}(\pi p+\overline{\alpha}q)\left(\pi\overline{p}+\overline{\alpha}\,\overline{q}\right)}
\displaystyle\leq\frac{\pi\overline{\alpha}\left(\pi p\overline{p}+\overline{\alpha}p\overline{q}+\overline{\alpha}q^{2}\right)}{q\overline{q}(\pi p+\overline{\alpha}q)\left(\pi\overline{p}+\overline{\alpha}\,\overline{q}\right)}
\displaystyle\leq\frac{\pi\left(\pi p\overline{p}+\overline{\alpha}p\overline{q}+\overline{\alpha}q^{2}\right)}{q(1/2)(q/2)\left(1/2\right)}
\displaystyle\leq\frac{8\pi\left(\pi p\overline{p}+\overline{\alpha}p\overline{q}+q^{2}\right)}{q^{2}}
\displaystyle\leq 8\pi+\frac{16\pi p}{q^{2}}
\displaystyle\leq 8\pi\left(\frac{\overline{p}}{\overline{q}}\right)^{\overline{\lambda}}+\frac{16\pi p}{q^{2}}\qquad\left(\text{using $\overline{p}\geq\overline{q}$ and $\overline{\lambda}\geq 0$}\right)
\displaystyle\leq 8\pi\frac{1}{\overline{q}}\left(\frac{\overline{p}}{\overline{q}}\right)^{\overline{\lambda}}+16\frac{\pi}{q}\left(\frac{p}{q}\right)^{\overline{\lambda}}\qquad\left(\text{using $p\leq q$ and $\overline{\lambda}\in(0,1]$}\right)
\displaystyle\leq 16\pi\left(\frac{1}{\overline{q}}\left(\frac{\overline{p}}{\overline{q}}\right)^{\overline{\lambda}}+\frac{1}{q}\left(\frac{p}{q}\right)^{\overline{\lambda}}\right)
\displaystyle\leq\frac{32\pi}{r}\frac{d^{2}}{dq^{2}}\left(\log(1/\pi)\mathcal{H}_{\overline{\lambda}}(\mathrm{Ber}(p),\mathrm{Ber}(q))\right)\,.
$$
Thus, for each value of $q$, the desired inequality holds, completing the proof of Lemma 7.6. ∎
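Using the closed-form second derivatives from Claim 7.7, Equation 28 can be spot-checked numerically; a minimal sketch (the grid is illustrative):

```python
import math

def d2_js(p, q, pi):
    """Second derivative in q of JS_pi(Ber(p), Ber(q)); first form in Equation 29."""
    a = 1 - pi  # the section's alpha-bar
    num = pi * a * (pi * p * (1 - p) + a * p * (1 - q) + a * q * (1 - p) - a * q * (1 - q))
    den = q * (1 - q) * (pi * p + a * q) * (pi * (1 - p) + a * (1 - q))
    return num / den

def d2_logpi_h(p, q, lam, pi):
    """Second derivative in q of log(1/pi) * H_{1-lam}(Ber(p), Ber(q)); Equation 30."""
    lb = 1 - lam
    return math.log(1 / pi) * lam * lb * (
        p**lb * q ** (-lb - 1) + (1 - p) ** lb * (1 - q) ** (-lb - 1))

for pi in (0.05, 0.2, 0.5):
    for r in (0.1, 0.3, 0.5 * math.log(1 / pi)):
        lam = r / math.log(1 / pi)          # lam in (0, 1/2] for these choices
        c = 32 * math.exp(2 * r) * pi / r   # constant from Equation 28
        for i in range(1, 10):
            for j in range(1, 20):
                p, q = i / 20, j / 20       # p <= 0.45 <= 1/2, q in (0, 1)
                assert d2_js(p, q, pi) <= c * d2_logpi_h(p, q, lam, pi) + 1e-12
print("Equation 28 holds on the grid")
```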
7.2 Proofs of Claims 7.5 and 7.7
See 7.5
* Proof*
Since
$$
\mathrm{JS}_{\pi}\left(\mathrm{Ber}(p),\mathrm{Ber}(q)\right)=\pi\,\mathrm{KL}\left(\mathrm{Ber}(p),\mathrm{Ber}(\pi p+\overline{\alpha}q)\right)+\overline{\alpha}\,\mathrm{KL}\left(\mathrm{Ber}(q),\mathrm{Ber}(\pi p+\overline{\alpha}q)\right),
$$
we obtain
$$
\displaystyle\mathrm{JS}_{\pi}\left(\mathrm{Ber}(p),\mathrm{Ber}(q)\right)=\pi p\log\left(\frac{p}{\pi p+\overline{\alpha}q}\right)+\pi\overline{p}\log\left(\frac{\overline{p}}{\pi\overline{p}+\overline{\alpha}\,\overline{q}}\right)+\overline{\alpha}q\log\left(\frac{q}{\pi p+\overline{\alpha}q}\right)+\overline{\alpha}\,\overline{q}\log\left(\frac{\overline{q}}{\pi\overline{p}+\overline{\alpha}\,\overline{q}}\right),
$$
implying that
$$
\displaystyle\frac{d}{dq}\mathrm{JS}_{\pi}\left(\mathrm{Ber}(p),\mathrm{Ber}(q)\right)=-\frac{\pi p\,\overline{\alpha}}{\pi p+\overline{\alpha}q}+\frac{\pi\overline{p}\,\overline{\alpha}}{\pi\overline{p}+\overline{\alpha}\,\overline{q}}+\overline{\alpha}\log\left(\frac{q}{\pi p+\overline{\alpha}q}\right)+\overline{\alpha}-\frac{\overline{\alpha}^{2}q}{\pi p+\overline{\alpha}q}-\overline{\alpha}\log\left(\frac{\overline{q}}{\pi\overline{p}+\overline{\alpha}\,\overline{q}}\right)-\overline{\alpha}+\frac{\overline{\alpha}^{2}\overline{q}}{\pi\overline{p}+\overline{\alpha}\,\overline{q}}=\overline{\alpha}\log\left(\frac{q}{\overline{q}}\,\frac{\pi\overline{p}+\overline{\alpha}\,\overline{q}}{\pi p+\overline{\alpha}q}\right),
$$
where the non-logarithmic terms cancel.
Next, observe that $\mathcal{H}_{\overline{\lambda}}(\mathrm{Ber}(p),\mathrm{Ber}(q))=1-p^{\overline{\lambda}}q^{\lambda}-\overline{p}^{\overline{\lambda}}\overline{q}^{\lambda}$. Thus, the first derivative is
$$
\displaystyle\frac{d}{dq}\mathcal{H}_{\overline{\lambda}}(\mathrm{Ber}(p),\mathrm{Ber}(q))=-\lambda p^{\overline{\lambda}}q^{-\overline{\lambda}}+\lambda\overline{p}^{\overline{\lambda}}\overline{q}^{-\overline{\lambda}}.
$$
∎
See 7.7
* Proof*
Starting with $\mathrm{JS}_{\pi}$ , we obtain the following, using the expression for the first derivative in Claim 7.5:
$$
\displaystyle\frac{d^{2}}{dq^{2}}\mathrm{JS}_{\pi}(\mathrm{Ber}(p),\mathrm{Ber}(q))=\overline{\alpha}\frac{d}{dq}\left(\log q-\log\overline{q}+\log(\pi\overline{p}+\overline{\alpha}\,\overline{q})-\log(\pi p+\overline{\alpha}q)\right)=\overline{\alpha}\left(\frac{1}{q}+\frac{1}{\overline{q}}-\frac{\overline{\alpha}}{\pi\overline{p}+\overline{\alpha}\,\overline{q}}-\frac{\overline{\alpha}}{\pi p+\overline{\alpha}q}\right),
$$
and combining the terms over a common denominator (using $\pi p+\overline{\alpha}q+\pi\overline{p}+\overline{\alpha}\,\overline{q}=1$) yields the first expression in Equation 29.
We now show the derivation for the alternate expression given in the claim. First, the numerator $\pi p\overline{p}+\overline{\alpha}p\overline{q}+\overline{\alpha}q\overline{p}-\overline{\alpha}q\overline{q}$ simplifies to $p-\pi p^{2}-2\overline{\alpha}pq+\overline{\alpha}q^{2}$. Similarly, the expression $(q+\pi\delta)(\overline{q}-\pi\delta)+\overline{\alpha}\delta(\overline{q}-q-\pi\delta)$ simplifies to $p-\pi p^{2}-2\overline{\alpha}pq+\overline{\alpha}q^{2}$. Calculating the second derivative using the expression in Claim 7.5, we obtain
$$
\displaystyle\frac{d^{2}}{dq^{2}}\mathcal{H}_{\overline{\lambda}}(\mathrm{Ber}(p),\mathrm{Ber}(q))=\lambda\overline{\lambda}p^{\overline{\lambda}}q^{-\overline{\lambda}-1}+\lambda\overline{\lambda}\,\overline{p}^{\overline{\lambda}}\overline{q}^{-\overline{\lambda}-1}=\lambda\overline{\lambda}\left(\frac{1}{q}\left(\frac{p}{q}\right)^{\overline{\lambda}}+\frac{1}{\overline{q}}\left(\frac{\overline{p}}{\overline{q}}\right)^{\overline{\lambda}}\right)\,,
$$
which is always positive. Since $\lambda=\frac{r}{\log(1/\pi)}\in(0,1/2]$, we have $\overline{\lambda}\geq 1/2$ and $\lambda\log(1/\pi)=r$, so
$$
\displaystyle\frac{d^{2}}{dq^{2}}\left(\log(1/\pi)\mathcal{H}_{\overline{\lambda}}(\mathrm{Ber}(p),\mathrm{Ber}(q))\right)=\log(1/\pi)\lambda\overline{\lambda}\left(\frac{1}{q}\left(\frac{p}{q}\right)^{\overline{\lambda}}+\frac{1}{\overline{q}}\left(\frac{\overline{p}}{\overline{q}}\right)^{\overline{\lambda}}\right)\geq\frac{r}{2}\left(\frac{1}{q}\left(\frac{p}{q}\right)^{\overline{\lambda}}+\frac{1}{\overline{q}}\left(\frac{\overline{p}}{\overline{q}}\right)^{\overline{\lambda}}\right).
$$
∎
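The closed-form second derivative in Claim 7.7 can be validated against a central finite difference of $\mathrm{JS}_{\pi}$ in $q$; a minimal sketch (the test points are illustrative):

```python
import math

def js_pi(p, q, pi):
    """JS_pi(Ber(p), Ber(q)) via the mixture m = pi*p + (1 - pi)*q."""
    m = pi * p + (1 - pi) * q
    kl = lambda a, b: a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))
    return pi * kl(p, m) + (1 - pi) * kl(q, m)

def d2_js_closed(p, q, pi):
    """Closed-form second derivative of JS_pi in q, as in Equation 29."""
    a = 1 - pi
    num = pi * a * (pi * p * (1 - p) + a * p * (1 - q) + a * q * (1 - p) - a * q * (1 - q))
    den = q * (1 - q) * (pi * p + a * q) * (pi * (1 - p) + a * (1 - q))
    return num / den

h = 1e-4
for (p, q, pi) in [(0.3, 0.6, 0.2), (0.45, 0.25, 0.5), (0.1, 0.8, 0.35)]:
    # Central second difference: (f(q+h) - 2 f(q) + f(q-h)) / h^2.
    fd = (js_pi(p, q + h, pi) - 2 * js_pi(p, q, pi) + js_pi(p, q - h, pi)) / h**2
    cf = d2_js_closed(p, q, pi)
    assert abs(fd - cf) < 1e-3 * max(1.0, abs(cf))
print("closed form matches finite differences")
```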
7.3 Proofs of Theorems 2.1 and 2.2
We restate Theorem 2.1 below: See 2.1
* Proof*
The first regime follows from Theorem 7.1, the second regime follows from Proposition 6.1, and the final regime follows from Equation 4. ∎
We now restate Theorem 2.2: See 2.2 The proof of Theorem 2.2 follows from Corollary 4.8.
8 Weak detection
In this section, we consider the weak detection regime of the Bayesian hypothesis testing problem $\mathcal{B}_{\mathrm{B}}(p,q,\pi,\delta)$. Recall that $\delta\geq\pi$ is the vacuous regime, and our results in Theorem 2.1 give tight sample complexity bounds for $\delta\leq\pi/4$. Here, we discuss the remaining regime of $\delta\in(\pi/4,\pi)$, with a special focus on $\delta\to\pi$ from below. In particular, we parametrize $\delta$ as $\delta=(1-\gamma)\pi$, for $\gamma$ small enough.
As a starting point, we consider the uniform prior regime $\pi=1/2$ . Recall that in this vanilla setting, the sample complexity of strong detection $(\delta/\pi≤ 1/4)$ is characterized by the Hellinger divergence. In the weak detection regime, however, the following example suggests that the Hellinger divergence does not characterize the sample complexity: it could be either $\frac{\gamma^{2}}{\mathrm{h}^{2}(p,q)}$ or $\frac{\gamma}{\mathrm{h}^{2}(p,q)}$ .
**Example 8.1**
*Let $c$ be a small enough absolute constant. For any $\gamma\in(0,1/4)$ and $\epsilon\in(0,c\gamma^{2})$, there exist distributions $p$, $q$, $p^{\prime}$, and $q^{\prime}$ with $\mathrm{h}^{2}(p,q)\asymp\mathrm{h}^{2}(p^{\prime},q^{\prime})\asymp\epsilon$, such that ${n}^{*}_{\mathrm{B}}\left(p,q,0.5,0.5(1-\gamma)\right)\asymp\frac{\gamma^{2}}{\mathrm{h}^{2}(p,q)}$ but ${n}^{*}_{\mathrm{B}}\left(p^{\prime},q^{\prime},0.5,0.5(1-\gamma)\right)\asymp\frac{\gamma}{\mathrm{h}^{2}(p^{\prime},q^{\prime})}$. That is, the dependence on $\gamma$ is different.*
* Proof*
We will use Proposition 4.13 with $\pi=1/2$, in which case the $E_{\gamma}$ divergence simply becomes the total variation distance. For a parameter $\delta>0$ to be chosen later, let $p=\mathcal{N}(\delta,1)$ and $q=\mathcal{N}(-\delta,1)$, where $\mathcal{N}(\mu,\sigma^{2})$ denotes the univariate Gaussian distribution with mean $\mu$ and variance $\sigma^{2}$. Let $u\in\mathbb{R}^{n}$ denote the all-ones vector. Using spherical symmetry of isotropic Gaussians, we observe that
$$
\displaystyle\mathrm{TV}(p^{\otimes n},q^{\otimes n})=\mathrm{TV}(\mathcal{N}(\delta u,I),\mathcal{N}(-\delta u,I))\asymp\min\left(\sqrt{n}\,\delta,1\right),
$$
where the last claim follows from [DevMR18, Theorem 1.3]. In the context of Proposition 4.13, setting $(1-\mathrm{TV}(p^{\otimes n},q^{\otimes n}))/2=(1-\gamma)/2$, we see that ${n}^{*}_{\mathrm{B}}(p,q,1/2,(1-\gamma)/2)=:{n}^{*}$ has to be the smallest $n$ such that $\mathrm{TV}(p^{\otimes n},q^{\otimes n})\asymp\gamma$. Therefore, the sample complexity ${n}^{*}$ satisfies $\min(\sqrt{{n}^{*}}\delta,1)\asymp\gamma$, and thus ${n}^{*}\asymp\gamma^{2}/\mathrm{h}^{2}(p,q)$, where $\delta^{2}\asymp\mathrm{TV}^{2}(p,q)\asymp\mathrm{h}^{2}(p,q)\asymp\epsilon$. Let $p^{\prime}=\mathrm{Ber}(1)$ and $q^{\prime}=\mathrm{Ber}(1-\epsilon)$. Then
$$
\displaystyle\mathrm{TV}({p^{\prime}}^{\otimes n},{q^{\prime}}^{\otimes n})=1-(1-\epsilon)^{n}\,.
$$
Standard calculations show that for $n\leq 1/\epsilon$, the above expression is of the order $n\epsilon$ (up to constants). Setting $(1-\mathrm{TV}({p^{\prime}}^{\otimes n},{q^{\prime}}^{\otimes n}))/2=(1-\gamma)/2$, we see that the sample complexity is the smallest $n$ such that $\mathrm{TV}({p^{\prime}}^{\otimes n},{q^{\prime}}^{\otimes n})\asymp\gamma$, which is equivalent to $n\epsilon\asymp\gamma$, and hence ${n}^{*}_{\mathrm{B}}\left(p^{\prime},q^{\prime},1/2,(1-\gamma)/2\right)\asymp\gamma/\epsilon\asymp\gamma/\mathrm{h}^{2}(p^{\prime},q^{\prime})$, which also satisfies the constraint $n\leq 1/\epsilon$ used earlier in the approximation. ∎
Thus, even the uniform prior setting exhibits qualitatively different behavior in the weak detection regime. For a general prior, our bounds on the sample complexity from the previous section specialize as follows:
**Theorem 8.2**
*For any $\pi∈(0,0.5]$ , $\gamma∈(0,1)$ , and $\lambda∈(0,0.5]$ such that $\lambda\log(\overline{\alpha}/\pi)≤\log(1/\overline{\gamma})$ , we have
$$
\displaystyle\left\lceil\frac{\pi\gamma\log\left(\frac{1-\pi}{\pi}\right)+\pi^{2}\gamma^{2}}{\mathrm{JS}_{\pi}(p,q)}\right\rceil\lesssim{n}^{*}_{\mathrm{B}}(p,q,\pi,\pi(1-\gamma))\lesssim\left\lceil\frac{64\pi\log(1/\overline{\gamma})e^{2\lambda\log(1/\pi)}}{\lambda\,\mathrm{JS}_{\pi}(p,q)}\right\rceil.
$$
Let $\gamma∈(0,c)$ for a small enough absolute constant $c$ . Then the result above implies the following simplified expressions:
1. (Almost uniform prior) For $\pi=1/2-\eta$ with $\eta\leq 1/4$, we can take $\lambda\asymp\min(1,\gamma/\eta)$ and use $\mathrm{JS}_{\pi}(p,q)\asymp\mathrm{h}^{2}(p,q)$ for $\pi\in[1/4,1/2]$ to obtain
$$
\displaystyle\left\lceil\frac{\gamma\max(\gamma,\eta)}{\mathrm{h}^{2}(p,q)}\right\rceil\lesssim{n}^{*}_{\mathrm{B}}(p,q,\pi,\pi(1-\gamma))\lesssim\left\lceil\frac{\max(\gamma,\eta)}{\mathrm{h}^{2}(p,q)}\right\rceil\,. \tag{31}
$$
1. (Prior going to 0) When $\pi≤ 1/4$ , by taking $\lambda\asymp\gamma/\log(1/\pi)$ , we obtain
$$
\displaystyle\left\lceil\frac{\pi\gamma\log(1/\pi)}{\mathrm{JS}_{\pi}(p,q)}\right\rceil\lesssim{n}^{*}_{\mathrm{B}}(p,q,\pi,\pi(1-\gamma))\lesssim\left\lceil\frac{\pi\log(1/\pi)}{\mathrm{JS}_{\pi}(p,q)}\right\rceil\,. \tag{32}
$$
Moreover, there exist distributions $p$ and $q$ such that the upper bounds in Equations 31 and 32 are achieved (cf. Example 8.3).*
In both cases, we see that the upper and lower bounds differ by a factor of $\gamma$ . Perhaps surprisingly, when $\pi$ is far from uniform (much further than $\gamma$ ), the upper bound on the sample complexity is independent of $\gamma$ . That is, extremely weak detection (when $\gamma→ 0$ ) seems to be as hard as strong detection (when $\gamma=9/10$ ). We give explicit examples where this indeed happens:
**Example 8.3 (Upper bounds for sample complexity are tight in the worst case)**
*For every $\epsilon\in(0,1/4)$, there exist distributions $p$ and $q$ such that $\mathrm{JS}_{\pi}(p,q)\asymp\pi\epsilon$. Moreover, for every $\pi\in(0,1/2)$ and $\gamma\in(0,1/4)$ such that $\frac{\log\left(\frac{\overline{\alpha}}{\pi\overline{\gamma}}\right)}{\epsilon}\geq 1$, the sample complexity satisfies ${n}^{*}_{\mathrm{B}}\left(p,q,\pi,\pi(1-\gamma)\right)\asymp\frac{\log\left(\frac{\overline{\alpha}}{\pi\overline{\gamma}}\right)}{\epsilon}$. In particular, the following holds:
1. If $\pi=1/2-\eta$, then ${n}^{*}_{\mathrm{B}}(p,q,\pi,\pi(1-\gamma))\asymp\frac{\max(\gamma,\eta)}{\mathrm{JS}_{\pi}(p,q)}\asymp\frac{\max(\gamma,\eta)}{\mathrm{h}^{2}(p,q)}$. Thus, the upper bound in Equation 31 is tight.
1. If $\pi\leq 1/4$, then ${n}^{*}_{\mathrm{B}}(p,q,\pi,\pi(1-\gamma))\asymp\frac{\pi\log(1/\pi)}{\mathrm{JS}_{\pi}(p,q)}$. Thus, the upper bound in Equation 32 is tight.*
* Proof*
Let $p=\mathrm{Ber}(0)$ and $q=\mathrm{Ber}(\epsilon)$. Then a quick calculation shows that $\mathrm{JS}_{\pi}(p,q)\asymp\pi\epsilon$:
$$
\displaystyle\mathrm{JS}_{\pi}(p,q)=\pi\,\mathrm{KL}\left(\mathrm{Ber}(0),\mathrm{Ber}(\overline{\alpha}\epsilon)\right)+\overline{\alpha}\,\mathrm{KL}\left(\mathrm{Ber}(\epsilon),\mathrm{Ber}(\overline{\alpha}\epsilon)\right)=\pi\log\left(\frac{1}{1-\overline{\alpha}\epsilon}\right)+\overline{\alpha}\,\overline{\epsilon}\log\left(\frac{\overline{\epsilon}}{1-\overline{\alpha}\epsilon}\right)+\overline{\alpha}\epsilon\log\left(\frac{1}{\overline{\alpha}}\right)\,.
$$
Since $\pi\leq 1/2$ and $\epsilon\leq 1/2$, the first term is of the order $\pi\epsilon$ (note that $\log\left(\frac{1}{1-\overline{\alpha}\epsilon}\right)=\log\left(1+\frac{\overline{\alpha}\epsilon}{1-\overline{\alpha}\epsilon}\right)\asymp\frac{\overline{\alpha}\epsilon}{1-\overline{\alpha}\epsilon}\asymp\epsilon$), which implies the desired lower bound on $\mathrm{JS}_{\pi}(p,q)$. For the upper bound, the second term is negative, while the third term is at most $\pi\epsilon$, since $\overline{\alpha}\epsilon\log(1/\overline{\alpha})=\overline{\alpha}\epsilon\log\left(1+\pi/\overline{\alpha}\right)\leq\epsilon\pi$. Thus, we have $\mathrm{JS}_{\pi}(p,q)\asymp\pi\epsilon$. Moreover, the average error of hypothesis testing with $n$ samples is equal to (cf. Proposition 4.13)
$$
\displaystyle P_{e}=\sum_{x_{1},\dots,x_{n}}\min\left(\pi p^{\otimes n}(x_{1},\dots,x_{n}),(1-\pi)q^{\otimes n}(x_{1},\dots,x_{n})\right)=\min\left(\pi,(1-\pi)(1-\epsilon)^{n}\right)
$$
(only the all-zeros sequence has positive probability under $p^{\otimes n}$),
which is less than $\pi(1-\gamma)$ if and only if $(1-\pi)(1-\epsilon)^{n}\leq\pi(1-\gamma)$, which happens only if $n\asymp\frac{\log\left(\frac{\overline{\alpha}}{\pi\overline{\gamma}}\right)}{\log(1/\overline{\epsilon})}$. Since $\epsilon\leq 1/2$, the denominator $\log(1/\overline{\epsilon})$ is equivalent to $\epsilon$, up to constants. We evaluate the resulting expression $n\asymp\frac{\log\left(\frac{\overline{\alpha}}{\pi\overline{\gamma}}\right)}{\epsilon}$ in the different regimes of $\pi$. For $\pi\leq 1/4$, the numerator is equivalent to $\log(1/\pi)$, so $n\asymp\frac{\log(1/\pi)}{\epsilon}\asymp\frac{\pi\log\left(\frac{1}{\pi}\right)}{\mathrm{JS}_{\pi}(p,q)}$. In the other regime of $\pi=1/2-\eta$ for $\eta\in(0,1/4)$, the numerator $\log\left(\frac{\overline{\alpha}}{\pi\overline{\gamma}}\right)$ satisfies
$$
\log\left(\frac{\overline{\alpha}}{\pi\overline{\gamma}}\right)=\log\left(1+\frac{4\eta}{1-2\eta}\right)+\log\left(1+\frac{\gamma}{\overline{\gamma}}\right)\asymp\eta+\gamma\asymp\max(\eta,\gamma).
$$
Thus, the sample complexity is $\frac{\max(\eta,\gamma)}{\mathrm{JS}_{\pi}(p,q)}\asymp\frac{\max(\eta,\gamma)}{\mathrm{h}^{2}(p,q)}$. ∎
9 Distributed hypothesis testing
In this section, we provide the proofs of our results in Section 2.3.
9.1 Hypothesis testing under communication constraints
We provide the proof of Theorem 2.4, recalled below:
See 2.4
We shall use the following result from [BhaNOP21]:
**Theorem 9.1 ([BhaNOP21])**
*Let $p$ and $q$ be two distributions over a domain $\mathcal{X}$ of size $k$. Let $\Theta$ be a binary random variable over $\{p,q\}$ with $\mathbb{P}(\Theta=p)=\pi$. Let $X|\Theta=\theta$ be distributed as $\theta$. Let $D\in\mathbb{N}$ be a communication budget greater than $1$. Then there exists an algorithm $\mathcal{A}$ that takes as input $p$, $q$, $\pi$, and $D$ and outputs a function $f:\mathcal{X}\to[D]$, in time $\mathrm{poly}(k,D)$, such that
$$
\displaystyle I\left(\Theta;f(X)\right)\gtrsim I(\Theta;X)\cdot\min\left(1,\frac{D}{\log\left(1/I(\Theta;X)\right)}\right).
$$*
* Proof of Theorem 2.4*
Let $X$ be distributed as $X_{1}$ in Definition 1.1. We will choose $f:\mathcal{X}\to[D]$ shortly. Let $p^{\prime}$ and $q^{\prime}$ be the distributions of $f(Y)$ for $Y$ distributed as $p$ and $q$, respectively. The server solves $\mathcal{B}_{\mathrm{B}}(p^{\prime},q^{\prime},\pi,\delta)$, and as a result, requires ${n}^{*}_{\mathrm{B}}(p^{\prime},q^{\prime},\pi,\delta)$ samples. Since ${n}^{*}_{\mathrm{B},\mathrm{comm}}(p,q,\pi,\delta,D)\leq{n}^{*}_{\mathrm{B}}(p^{\prime},q^{\prime},\pi,\delta)$, in the remainder of the proof, we will upper-bound ${n}^{*}_{\mathrm{B}}(p^{\prime},q^{\prime},\pi,\delta)$. First consider the regime $\delta\in(\pi/100,\pi/4]$. Let $f:\mathcal{X}\to[D]$ be the quantizer from Theorem 9.1 with prior $\pi$, which can be computed in polynomial time. Then Theorem 2.1 implies that
$$
\displaystyle{n}^{*}_{\mathrm{B}}(p^{\prime},q^{\prime},\pi,\delta)\asymp\frac{\pi\log(1/\pi)}{I(\Theta;f(X_{1}))}\lesssim\frac{\pi\log(1/\pi)}{I(\Theta;X)}\max\left(1,\frac{\log(1/I(\Theta;X))}{D}\right)\asymp{n}^{*}_{\mathrm{B}}\left(p,q,\pi,\delta\right)\max\left(1,\frac{\log\left(\frac{{n}^{*}_{\mathrm{B}}\left(p,q,\pi,\delta\right)}{\pi\log(1/\pi)}\right)}{D}\right).
$$
Now consider the sublinear regime $\delta\in(\pi^{2},\pi/100)$. Following the notation of Theorem 2.1, let $f:\mathcal{X}\to[D]$ be the quantizer from Theorem 9.1 with prior $\pi^{\prime}$. Then Theorem 2.1 implies that
$$
\displaystyle{n}^{*}_{\mathrm{B}}(p^{\prime},q^{\prime},\pi,\delta)\asymp\log\left(\pi/\delta\right){n}^{*}_{\mathrm{B}}\left(p^{\prime},q^{\prime},\pi^{\prime},\pi^{\prime}/4\right).
$$
The final regime $\delta\leq\pi^{2}$ follows analogously by noting that $\mathrm{h}^{2}(p,q)\asymp I(\Theta^{\prime};X)$ when $\Theta^{\prime}$ is uniform over $\{p,q\}$. Choosing the $f$ from Theorem 9.1 for the uniform prior, we obtain
$$
\displaystyle{n}^{*}_{\mathrm{B}}(p^{\prime},q^{\prime},\pi,\delta)\lesssim\frac{\log(1/\delta)}{I(\Theta^{\prime};f(X))}\lesssim\frac{\log(1/\delta)}{I(\Theta^{\prime};X)}\max\left(1,\frac{\log(1/I(\Theta^{\prime};X))}{D}\right).
$$
∎
9.2 Hypothesis testing under local differential privacy
In this section, we prove Theorem 2.6, restated below:
See 2.6
For an $\epsilon$ -LDP mechanism $\mathcal{M}:\mathcal{X}→\mathcal{Y}$ and a distribution $p$ over $\mathcal{X}$ , we use $\mathcal{M}(p)$ to denote the distribution of $\mathcal{M}(X)$ for $X\sim p$ . We shall use the following result from [PenAJL23]:
**Theorem 9.2 ([PenAJL23, Corollary 1.18])**
*Let $p$ and $q$ be two distributions on $[k]$. For any $\ell\in\mathbb{N}$, let $\mathcal{C}_{\ell}$ be the set of all $\epsilon$-LDP mechanisms that map $\mathcal{X}$ to $[\ell]$. Let $\mathcal{A}=\{\left(\mathcal{M}(p),\mathcal{M}(q)\right):\mathcal{M}\in\mathcal{C}_{\ell}\}$. Let $g(\cdot,\cdot):\mathcal{A}\to\mathbb{R}$ be a jointly quasi-convex function. Then, there is an algorithm that takes as input $p,q,\epsilon,\ell$, and returns an $\epsilon$-LDP mechanism $\mathcal{M}$ that maximizes $g(\mathcal{M}(p),\mathcal{M}(q))$ over $\mathcal{M}\in\mathcal{C}_{\ell}$. Moreover, the algorithm runs in time polynomial in $k^{\ell^{2}}$ and $2^{\ell^{3}\log\ell}$.*
*Proof of Theorem 2.6*
Observe that the optimal $\epsilon$ -LDP mechanism $\mathcal{M}^{\prime}$ is the one that minimizes ${n}^{*}_{\mathrm{B}}\left(\mathcal{M}^{\prime}(p),\mathcal{M}^{\prime}(q),\pi,\delta\right)$ . Let ${n}^{*}_{\mathrm{B},\mathrm{priv}}:={n}^{*}_{\mathrm{B},\mathrm{priv}}\left(p,q,\pi,\delta,\epsilon\right)={n}^{*}_{\mathrm{B}}\left(\mathcal{M}^{\prime}(p),\mathcal{M}^{\prime}(q),\pi,\delta\right)$ be the private sample complexity. Letting $\mathcal{Y}^{\prime}$ be the range of $\mathcal{M}^{\prime}$ , we will further quantize it using a (deterministic) function $f:\mathcal{Y}^{\prime}→[\ell]$ . If we choose the best quantizer, Theorem 2.4 implies that the resulting sample complexity $n^{\prime}$ is at most
$$
n^{\prime}\lesssim{n}^{*}_{\mathrm{B},\mathrm{priv}}\cdot\max\left(1,\frac{\log\left({n}^{*}_{\mathrm{B},\mathrm{priv}}/\pi\right)}{\ell}\right),
$$
yielding the desired bound on the sample complexity. We now show the runtime guarantees using Theorem 9.2. Let $\mathcal{C}_{\ell}$ be the set of all $\epsilon$ -LDP mechanisms that map $\mathcal{X}→[\ell]$ , which in particular includes the composition $f\circ\mathcal{M}^{\prime}$ analyzed above. Observe that for all the regimes of $\delta$ , minimizing ${n}^{*}_{\mathrm{B}}(\mathcal{M}(p),\mathcal{M}(q),\pi,\delta)$ over $\mathcal{M}∈\mathcal{C}_{\ell}$ is equivalent (up to constant factors) to maximizing a jointly convex function $a\mathcal{H}_{\lambda}(\mathcal{M}(p),\mathcal{M}(q))$ for some $a>0$ and $\lambda∈(0,1)$ depending on the parameters $\pi$ and $\delta$ ; the joint convexity follows because $\mathcal{H}_{\lambda}$ is a jointly convex function, and joint convexity implies the joint quasi-convexity required by Theorem 9.2. Let $\mathcal{M}^{\prime\prime}$ be the $\epsilon$ -LDP mechanism returned by Theorem 9.2 for maximizing this function. Since $\mathcal{M}^{\prime\prime}$ is at least as good as $f\circ\mathcal{M}^{\prime}$ , the resulting sample complexity is at most a constant multiple of $n^{\prime}$ . ∎
10 Sequential hypothesis testing
The sequential hypothesis testing problem, investigated by Wald [Wald45], concerns a setting where the statistician observes samples sequentially and makes a decision once they have gathered enough evidence. In contrast to the prior-free hypothesis testing of Definition 1.2, the number of samples is not fixed in advance (and indeed, depends stochastically on the observed samples), which makes the expected sample complexity the metric of interest. We formally define the problem below:
**Definition 10.1**
*Let $p$ and $q$ be two distributions over a discrete domain $\mathcal{X}$ . A sequential hypothesis test $\phi$ consists of a stopping rule $T$ that is a stopping time (i.e., the event $\{T=t\}$ depends only on the samples observed up to time $t$ ) and a decision $\phi(X_{1},...,X_{T})∈\{p,q\}$ . We say that $\phi$ solves the simple sequential hypothesis testing problem with type-I error $\alpha$ , type-II error $\beta$ , and expected sample complexity $n$ , if
$$
\mathbb{P}_{p}(\phi(X_{1},...,X_{T})≠ p)≤\alpha,\quad\mathbb{P}_{q}(\phi(X_{1},...,X_{T})≠ q)≤\beta,\quad\text{and}\quad\max\{\mathbb{E}_{p}T,\mathbb{E}_{q}T\}≤ n.
$$
We use $\mathcal{B}_{\mathrm{SEQ}}(p,q,\alpha,\beta)$ to denote this problem, and define its sample complexity ${n}^{*}_{\mathrm{SEQ}}(p,q,\alpha,\beta)$ to be the smallest $n$ such that there exists a test $\phi$ which solves $\mathcal{B}_{\mathrm{SEQ}}(p,q,\alpha,\beta)$ .*
The counterpart to the Neyman–Pearson test in the sequential setting is Wald’s sequential probability ratio test (SPRT). This test updates the log-likelihood ratio $\Lambda_{t}$ according to the rule $\Lambda_{t}=\sum_{i=1}^{t}\log\frac{p(x_{i})}{q(x_{i})}$ , where $x_{i}$ is the $i$ -th observation. The test stops and declares $p$ when $\Lambda_{t}$ hits an upper threshold $A$ , or stops and declares $q$ when it hits a lower threshold $B$ . The thresholds $A$ and $B$ are computed based on the desired type-I and type-II errors; for example, setting $A=\log\frac{1-\beta}{\alpha}$ and $B=\log\frac{\beta}{1-\alpha}$ suffices to achieve the desired errors. Wald and Wolfowitz [WaldWolf48] showed that the SPRT (with optimal thresholds) is optimal, in the following sense. Let $\phi$ be the SPRT with stopping rule $T$ and errors $\alpha$ and $\beta$ , and let $\phi^{\prime}$ be any test with stopping rule $T^{\prime}$ and errors $\alpha^{\prime}≤\alpha$ and $\beta^{\prime}≤\beta$ . Then we must necessarily have $\mathbb{E}_{p}[T^{\prime}]≥\mathbb{E}_{p}[T]$ and $\mathbb{E}_{q}[T^{\prime}]≥\mathbb{E}_{q}[T]$ .
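The SPRT update and stopping rules above are easy to simulate. The following minimal Python sketch is our own illustration (not from the paper; the function name `sprt` and the finite-alphabet representation are our choices), using the thresholds $A=\log\frac{1-\beta}{\alpha}$ and $B=\log\frac{\beta}{1-\alpha}$ :

```python
import math

def sprt(p, q, alpha, beta, sample):
    """Wald's SPRT for p vs q over a finite alphabet.

    p, q: probability vectors (full support assumed); sample() draws one
    symbol index. Returns (decision, number of samples used).
    """
    A = math.log((1 - beta) / alpha)   # upper threshold: declare p
    B = math.log(beta / (1 - alpha))   # lower threshold: declare q
    llr, t = 0.0, 0
    while B < llr < A:                 # continue while evidence is inconclusive
        x = sample()
        t += 1
        llr += math.log(p[x] / q[x])   # log-likelihood-ratio update
    return ("p" if llr >= A else "q"), t
```

Under $p$ , the walk drifts upward at rate $\mathrm{KL}(p,q)$ per sample, which is the intuition behind the expected-sample-size bounds below.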
If the type-I error is at most $\alpha$ and the type-II error is at most $\beta$ , Wald [Wald45] showed the following lower bounds on the expected sample size:
$$
\mathbb{E}_{p}T≥\frac{\mathrm{KL}(\alpha,\overline{\beta})}{\mathrm{KL}(p,q)}\quad\text{and}\quad\mathbb{E}_{q}T≥\frac{\mathrm{KL}(\beta,\overline{\alpha})}{\mathrm{KL}(q,p)},
$$
where we used $\mathrm{KL}(x,y)$ to denote $\mathrm{KL}(\mathrm{Ber}(x),\mathrm{Ber}(y))$ . This gives the sample complexity lower bound
$$
{n}^{*}_{\mathrm{SEQ}}(p,q,\alpha,\beta)\geq\max\left\{\frac{\mathrm{KL}(\alpha,\overline{\beta})}{\mathrm{KL}(p,q)},\frac{\mathrm{KL}(\beta,\overline{\alpha})}{\mathrm{KL}(q,p)}\right\}. \tag{33}
$$
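For concreteness, the right-hand side of (33) can be evaluated numerically. The sketch below is our own illustration (the helper names `kl_bern` and `wald_lower_bound` are not from the paper); it takes $\mathrm{KL}(p,q)$ and $\mathrm{KL}(q,p)$ in nats as inputs:

```python
import math

def kl_bern(a, b):
    """KL(Ber(a), Ber(b)) in nats, for a, b in (0, 1)."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def wald_lower_bound(kl_pq, kl_qp, alpha, beta):
    """Right-hand side of (33): a lower bound on the expected sample size."""
    return max(kl_bern(alpha, 1 - beta) / kl_pq,
               kl_bern(beta, 1 - alpha) / kl_qp)
```

Note that the bound scales inversely with the KL divergences, and degenerates to $0$ when they are infinite, as in Example 10.2 below.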
Since any fixed sample-size algorithm is also a valid sequential algorithm, we also have the simple upper bound
$$
{n}^{*}_{\mathrm{SEQ}}(p,q,\alpha,\beta)\leq{n}^{*}_{\mathrm{PF}}(p,q,\alpha,\beta). \tag{34}
$$
The upper and lower bounds need not match. A simple example is when the two distributions have mismatched supports, leading to unbounded KL divergences.
**Example 10.2**
*Consider two distributions $p=(0,1/2,1/2)$ and $q=(1/2,1/2,0)$ . The fixed-size sample complexity to get errors of both types at most $\delta$ is $\Theta(\log(1/\delta)/\mathrm{h}^{2}(p,q))\asymp\log(1/\delta)$ . The lower bound on the sequential sample complexity from Equation 33 does not depend on $\delta$ and equals 0, since the KL divergence is infinite. More interestingly, the exact sequential sample complexity also does not depend on $\delta$ . In fact, it is possible to get 0 errors of both types by taking $2$ samples in expectation: sample until the first observation lies in the support of exactly one distribution (the first symbol, which is impossible under $p$ , or the third, which is impossible under $q$ ); this waiting time is geometric with mean $2$ , and the resulting decision is always correct.*
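The zero-error sequential test in Example 10.2 is simple enough to simulate. In this Python sketch (our own illustration; symbols are 0-indexed, so $p=(0,1/2,1/2)$ places no mass on symbol $0$ and $q$ none on symbol $2$ ), the test waits for a symbol outside the common support:

```python
def zero_error_test(sample):
    """Sequential test for p = (0, 1/2, 1/2) vs q = (1/2, 1/2, 0).

    Symbol 0 is impossible under p and symbol 2 is impossible under q, so
    the first such observation identifies the distribution with certainty;
    symbol 1 carries no information. sample() draws one symbol index.
    """
    t = 0
    while True:
        x = sample()
        t += 1
        if x == 0:
            return "q", t   # cannot have come from p
        if x == 2:
            return "p", t   # cannot have come from q
```

Under either hypothesis the informative symbol arrives with probability $1/2$ per draw, so the stopping time is geometric with mean $2$ .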
This example shows that the sample complexity behaviour between the fixed sample-size case and the sequential case can be quite different. It is unclear if the expected sample size for the optimal sequential test is characterized by any $f$ -divergence in general. In what follows, we address some special cases of interest.
10.1 Constant error regime
When $\alpha$ and $\beta$ are constants (say $1/8$ ), the sample complexity of sequential hypothesis testing turns out to be identical to that of the prior-free hypothesis testing problem. This is summarized in the following theorem:
**Theorem 10.3**
*Let $p$ and $q$ be as in Definition 10.1. Then ${n}^{*}_{\mathrm{SEQ}}(p,q,1/8,1/8)\asymp\frac{1}{\mathrm{h}^{2}(p,q)}$ .*
* Proof*
Clearly, ${n}^{*}_{\mathrm{SEQ}}(p,q,1/8,1/8)≤{n}^{*}_{\mathrm{PF}}(p,q,1/8,1/8)\asymp\frac{1}{\mathrm{h}^{2}(p,q)}$ , since a fixed-size test is also a valid sequential test. It remains to show the reverse inequality. Suppose there exists a sequential test with sample complexity $N$ that attains errors of both types at most $1/8$ . By Markov’s inequality, the probability that the sample size of this test exceeds $8N$ is at most $1/8$ . Thus, we may construct a fixed-size test as follows: take $8N$ samples and declare the result of the sequential test if it has stopped, and otherwise declare a fixed hypothesis, say $p$ . By the union bound, the probabilities of type-I and type-II errors of this test are at most $1/4$ . This means $8N≥{n}^{*}_{\mathrm{PF}}(p,q,1/4,1/4)\asymp\frac{1}{\mathrm{h}^{2}(p,q)}$ , and so $N\gtrsim\frac{1}{\mathrm{h}^{2}(p,q)}$ . Since this holds for any sequential test with errors bounded by $1/8$ , we conclude that ${n}^{*}_{\mathrm{SEQ}}(p,q,1/8,1/8)\gtrsim\frac{1}{\mathrm{h}^{2}(p,q)}$ . This concludes the proof. ∎
This proof technique fails when the desired errors are much smaller than a constant, say $\delta$ , because the application of Markov’s inequality leads to a lower bound that is $\asymp\delta{n}^{*}_{\mathrm{PF}}(p,q,\delta,\delta)$ , which is far away from the upper bound of ${n}^{*}_{\mathrm{PF}}(p,q,\delta,\delta)$ .
10.2 Bounded likelihood ratio
One way to avoid the situation in Example 10.2 is to consider distribution pairs that have bounded likelihood ratios. This ensures that the KL divergence is finite, and the lower bound (33) is non-trivial. The bounded likelihood-ratio assumption also simplifies the relationships between various $f$ -divergences, as noted in the following lemma:
**Lemma 10.4**
*Let $p$ and $q$ be two distributions supported on a discrete space $\mathcal{X}$ . Suppose there exists a constant $C>1$ such that
$$
\frac{1}{C}\leq\frac{p(x)}{q(x)}\leq C,
$$
for all $x∈\mathcal{X}$ . Then the following hold:
1. $\mathrm{h}^{2}(p,q)\asymp\mathrm{KL}(p\|q)$ ;
1. For $\alpha≤ 1/2$ , $\mathrm{JS}_{\alpha}(p,q)\asymp\alpha\mathrm{h}^{2}(p,q)$ .*
*Proof of Lemma 10.4*
The proof of the first part follows from Lemma 5 in [BirMas98], where it is shown that if $p(x)/q(x)≤ C$ then
$$
\mathrm{h}^{2}(p,q)\leq\mathrm{KL}(p,q)\leq\mathrm{h}^{2}(p,q)(2+\log C).
$$
Note that the lower bound holds even without the bounded likelihood-ratio assumption. The second part follows by expressing $\mathrm{JS}_{\alpha}(p,q)$ in terms of the KL divergence, and using part 1 above:
$$
\mathrm{JS}_{\alpha}(p,q)=\alpha\mathrm{KL}(p,\alpha p+\overline{\alpha}q)+\overline{\alpha}\mathrm{KL}(q,\alpha p+\overline{\alpha}q)\overset{(a)}{\lesssim}\alpha\mathrm{h}^{2}(p,\alpha p+\overline{\alpha}q)+\overline{\alpha}\mathrm{h}^{2}(q,\alpha p+\overline{\alpha}q)\overset{(b)}{≤}\alpha\overline{\alpha}\,\mathrm{h}^{2}(p,q)+\overline{\alpha}\alpha\,\mathrm{h}^{2}(p,q)\lesssim\alpha\,\mathrm{h}^{2}(p,q).
$$
Here, $(a)$ follows from part 1, and $(b)$ follows from the convexity of $\mathrm{h}^{2}$ in its arguments. To show the other direction, we write
$$
\mathrm{JS}_{\alpha}(p,q)=\alpha\mathrm{KL}(p,\alpha p+\overline{\alpha}q)+\overline{\alpha}\mathrm{KL}(q,\alpha p+\overline{\alpha}q)≥\alpha\mathrm{KL}(p,\alpha p+\overline{\alpha}q)≥\alpha\,\mathrm{h}^{2}(p,\alpha p+\overline{\alpha}q).
$$
Now observe that $\mathrm{h}^{2}(p,\alpha p+\overline{\alpha}q)$ is a monotonically decreasing function of $\alpha$ for fixed $p$ and $q$ . Since $\alpha≤ 1/2$ , we have
$$
\mathrm{h}^{2}(p,\alpha p+\overline{\alpha}q)≥\mathrm{h}^{2}(p,(p+q)/2)\gtrsim\mathrm{h}^{2}(p,q),
$$
where the last inequality follows from Lemma 5 in [BirMas98]. This leads us to conclude
$$
\mathrm{JS}_{\alpha}(p,q)\gtrsim\alpha\mathrm{h}^{2}(p,q),
$$
which completes the proof of part 2. ∎
Let $\alpha,\beta<1/4$ . Using Lemma 10.4, it is easy to see that the lower bound from (33) simplifies to
$$
{n}^{*}_{\mathrm{SEQ}}(p,q,\alpha,\beta)\gtrsim\frac{\log(1/\min\{\alpha,\beta\})}{\mathrm{h}^{2}(p,q)}.
$$
The upper bound (34), combined with our characterization of the prior-free sample complexity and Lemma 10.4, gives
$$
{n}^{*}_{\mathrm{SEQ}}(p,q,\alpha,\beta)≤{n}^{*}_{\mathrm{PF}}(p,q,\alpha,\beta)\lesssim\frac{\log(1/\min\{\alpha,\beta\})}{\mathrm{h}^{2}(p,q)}.
$$
These two bounds match, giving a tight characterization of the worst-case expected sample size under the bounded-likelihood assumption, summarized in the following theorem:
**Theorem 10.5 (Sample complexity for sequential hypothesis testing under the bounded likelihood-ratio condition)**
*Let $p$ and $q$ be two distributions on a discrete set $\mathcal{X}$ that satisfy the bounded likelihood-ratio assumption. Then ${n}^{*}_{\mathrm{SEQ}}(p,q,\alpha,\beta)\asymp\frac{\log(1/\min\{\alpha,\beta\})}{\mathrm{h}^{2}(p,q)}.$*
**Remark 10.6**
*Our main result on Bayesian sample complexity identifies the sample complexity ${n}^{*}_{\mathrm{B}}(p,q,\pi,\delta)$ for different regimes of $\delta≤\pi/4$ for $\pi∈(0,1/2]$ . Using Lemma 10.4, it can be easily verified that under the bounded likelihood-ratio assumption, the sample complexity is given by
$$
{n}^{*}_{\mathrm{B}}(p,q,\pi,\delta)\asymp\frac{\log(1/\delta)}{\mathrm{h}^{2}(p,q)},
$$
that is, the trivial upper bound is tight. A similar conclusion holds for the prior-free setting, where we obtain that the sample complexity for $\alpha,\beta∈(0,1/8]$ is
$$
{n}^{*}_{\mathrm{PF}}(p,q,\alpha,\beta)\asymp\frac{\log(1/\min\{\alpha,\beta\})}{\mathrm{h}^{2}(p,q)}.
$$*
11 Hypothesis testing with erasures
Hypothesis testing with erasures (also called abstention or rejection) allows the statistician to declare an output in the range $\{p,q,\star\}$ , where the $\star$ indicates insufficient evidence to make a decision. A decision of $\star$ leads to an error event called the erasure error, whereas an incorrect decision (that is not an erasure) is called an undetected error. When erasures are not allowed, the usual type-I and type-II errors fall under the undetected error category. The rationale behind the $\star$ option is that in some settings, the cost of making an undetected error is significantly higher than that of making an erasure error. Thus, the statistician is willing to incur some probability of erasure error to reduce the probability of undetected errors. This added flexibility can allow for tests with small undetected errors using potentially far fewer samples than tests without the $\star$ option.
To give a concrete example, consider the prior-free hypothesis testing problem with distributions $p$ and $q$ , and suppose we require the type-I and type-II errors to be at most $\beta$ . Without erasures, the sample complexity is $\asymp\frac{\log(1/\beta)}{\mathrm{h}^{2}(p,q)}$ . Now suppose the statistician is allowed to declare $\star$ with probability at most a small constant, say $1/10$ , under either hypothesis, while keeping probabilities of undetected errors at most $\beta$ under both hypotheses. Will the sample complexity of this new statistical problem be meaningfully smaller than $\frac{\log(1/\beta)}{\mathrm{h}^{2}(p,q)}$ ? We answer this question in the affirmative, showing that the new sample complexity is $\asymp\max\{{n}^{*}_{\mathrm{PF}}(p,q,1/10,\beta),{n}^{*}_{\mathrm{PF}}(p,q,\beta,1/10)\}$ . This quantity can be arbitrarily smaller than $\frac{\log(1/\beta)}{\mathrm{h}^{2}(p,q)}$ , as shown in Example 11.6. We generalise this special case and identify the sample complexity in the prior-free and Bayesian settings with erasure (to be defined later) in almost all error regimes.
11.1 Background on simple hypothesis testing with erasures
In what follows, we denote the simple binary hypothesis testing problem without erasures as the standard setting and with erasures as the erasure setting. We briefly describe known results in the erasure setting. The problem of trading off erasures versus undetected errors was studied by Forney [For68] for binary (as well as $M$ -ary) hypothesis testing. Consider a (Bayesian) hypothesis $\theta$ taking values in $\{p,q\}$ and a single observation $x$ that comes from $p$ or $q$ . Let $\mathbb{P}(x,\theta)$ be the joint probability of choosing hypothesis $\theta$ and observing $x$ under $\theta$ . For simplicity, let $\overline{\theta}$ denote the second hypothesis. There are three error events to consider: an error, denoted by $\mathcal{E}_{\mathrm{total}}$ , an undetected error, denoted by $\mathcal{E}_{\mathrm{undetected}}$ , and an erasure error, denoted by $\mathcal{E}_{\mathrm{erasure}}$ . An error is when the estimate is not the correct hypothesis; i.e., it is either $\overline{\theta}$ or $\star$ . An undetected error is when the estimate is the incorrect hypothesis $\overline{\theta}$ . An erasure error is when the estimate is the erasure $\star$ . Observe that if the test declares $p$ in $R_{p}\subseteq\mathcal{X}$ , $q$ in $R_{q}\subseteq\mathcal{X}$ , and $\star$ in $(R_{p}\cup R_{q})^{c}$ , then the probabilities of these errors are given by
$$
\mathbb{P}(\mathcal{E}_{\mathrm{total}})=\sum_{\theta∈\{p,q\}}\sum_{x∈ R_{\theta}^{c}}\mathbb{P}(x,\theta),\quad\mathbb{P}(\mathcal{E}_{\mathrm{undetected}})=\sum_{\theta∈\{p,q\}}\sum_{x∈ R_{\overline{\theta}}}\mathbb{P}(x,\theta),\quad\mathbb{P}(\mathcal{E}_{\mathrm{erasure}})=\sum_{\theta∈\{p,q\}}\sum_{x∈(R_{p}\cup R_{q})^{c}}\mathbb{P}(x,\theta).
$$
Forney [For68] considers the problem of optimally trading off $\mathbb{P}(\mathcal{E}_{\mathrm{total}})$ versus $\mathbb{P}(\mathcal{E}_{\mathrm{undetected}})$ , as this captures the tradeoff between declaring erasures and undetected errors. Using the Neyman–Pearson lemma, Forney [For68] established the optimality of a certain family of tests, where optimality means that no other test can simultaneously improve both $\mathbb{P}(\mathcal{E}_{\mathrm{total}})$ and $\mathbb{P}(\mathcal{E}_{\mathrm{undetected}})$ . This family of tests is characterised by thresholds $\eta>1$ as follows:
$$
\displaystyle\phi(x)=\begin{cases}p&\text{ if }\frac{p(x)}{q(x)}\geq\eta,\\
q&\text{ if }\frac{q(x)}{p(x)}\geq\eta,\\
\star&\text{ otherwise.}\end{cases} \tag{35}
$$
Since $\eta>1$ , the decision regions $R_{p}$ and $R_{q}$ are disjoint. We were unable to find a complete analysis of the trade-offs between $\mathbb{P}(\mathcal{E}_{\mathrm{total}})$ and $\mathbb{P}(\mathcal{E}_{\mathrm{undetected}})$ in the asymptotic regime (similar to the Stein and Chernoff regimes in the standard setting). For completeness, we have included this in Section 14. This analysis may be of independent interest.
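For a single observation over a finite alphabet, Forney's test (35) can be sketched as follows (our own Python illustration; the function name and the handling of zero-probability symbols are our choices, and we assume the observed symbol has positive probability under at least one hypothesis):

```python
def forney_test(p, q, eta, x):
    """Erasure test (35): declare p or q only when the likelihood ratio is
    conclusive at level eta > 1, and erase otherwise.

    p, q: probability vectors; x: observed symbol index. A symbol with zero
    probability under one hypothesis is treated as an infinite ratio.
    """
    assert eta > 1   # ensures the decision regions R_p and R_q are disjoint
    if q[x] == 0 or p[x] / q[x] >= eta:
        return "p"
    if p[x] == 0 or q[x] / p[x] >= eta:
        return "q"
    return "*"
```

Since `eta > 1`, at most one of the two threshold conditions can hold, mirroring the disjointness of $R_{p}$ and $R_{q}$ noted above.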
We now formally define the simple binary hypothesis testing problem with erasures.
**Definition 11.1 (Bayesian simple binary hypothesis testing with erasures)**
*Let $p$ and $q$ be two distributions over a discrete domain $\mathcal{X}$ , and let $\pi,\delta∈(0,1)$ and $\lambda>0$ . We say that a test $\phi:\cup_{n=0}^{∞}\mathcal{X}^{n}→\{p,q,\star\}$ solves the Bayesian simple binary hypothesis testing with erasures problem of distinguishing $p$ versus $q$ , under the prior $(\pi,1-\pi)$ , with sample complexity $n$ and $\lambda$ -weighted error $\delta$ , if for $X:=(X_{1},...,X_{n})$ , we have
$$
\pi\cdot\mathbb{P}_{p}\left\{\phi(X)=q\right\}+(1-\pi)\cdot\mathbb{P}_{q}\left\{\phi(X)=p\right\}+\lambda\,\mathbb{P}(\phi(X)=\star)\leq\delta. \tag{36}
$$
Here, we use $\mathbb{P}_{p}$ and $\mathbb{P}_{q}$ to denote $\mathbb{P}_{X\sim p^{\otimes n}}$ and $\mathbb{P}_{X\sim q^{\otimes n}}$ , respectively. Let $\mathcal{B}_{\mathrm{B\text{-}E}}(p,q,\pi,\lambda,\delta)$ denote this problem and define its sample complexity ${n}^{*}_{\mathrm{B\text{-}E}}(p,q,\pi,\lambda,\delta)$ to be the smallest $n$ such that there exists a test $\phi$ which solves $\mathcal{B}_{\mathrm{B\text{-}E}}(p,q,\pi,\lambda,\delta)$ .*
**Remark 11.2 (Remark on the notation)**
*When $p$ and $q$ are clear from context, we shall also use the notation $\mathcal{B}_{\mathrm{B\text{-}E}}(\pi,\lambda,\delta)$ and ${n}^{*}_{\mathrm{B\text{-}E}}(\pi,\lambda,\delta)$ . For a given test $\phi$ that uses $n$ samples, we shall use the notation $\mathcal{E}_{\mathrm{total}}^{n}$ , $\mathcal{E}_{\mathrm{undetected}}^{n}$ , and $\mathcal{E}_{\mathrm{erasure}}^{n}$ to denote the error event, the undetected error event, and the erasure error event, respectively. For a test $\phi$ , we shall also use the shorthand $\mathbb{P}^{\phi}(·)$ , $\mathbb{P}_{p}^{\phi}(·)$ and $\mathbb{P}_{q}^{\phi}(·)$ to denote the probabilities of the error events for $\phi$ . With this notation, we can rewrite (36) as
$$
\lambda\mathbb{P}^{\phi}(\mathcal{E}_{\mathrm{total}}^{n})+\overline{\lambda}\mathbb{P}^{\phi}(\mathcal{E}_{\mathrm{undetected}}^{n})≤\delta,\quad\text{ or equivalently }\quad\mathbb{P}^{\phi}(\mathcal{E}_{\mathrm{undetected}}^{n})+\lambda\mathbb{P}^{\phi}(\mathcal{E}_{\mathrm{erasure}}^{n})≤\delta.
$$
Note that $\lambda$ could be arbitrarily large; however, we show in Lemma 11.9 that when $\lambda≥ 1/2$ , the optimal decision rule can disregard erasures altogether. Intuitively, this is because the cost of declaring an erasure is too high. Thus, for all intents and purposes, we may restrict to $\lambda<1/2$ , which is also the more practical regime.*
We also consider a prior-free version, where all possible errors are analysed separately.
**Definition 11.3 (Prior-free simple binary hypothesis testing with erasures)**
*Let $p$ , $q$ , and $\phi$ be as in Definition 11.1. We say that $\phi$ solves the simple binary hypothesis testing with erasures problem of $p$ versus $q$ with type-I error $\alpha$ , type-II error $\beta$ , erasure errors $\delta_{0}$ and $\delta_{1}$ , and sample complexity $n$ , if
$$
\mathbb{P}_{p}^{\phi}(\phi(X)=q)\leq\alpha,\quad\mathbb{P}_{q}^{\phi}(\phi(X)=p)\leq\beta,\quad\mathbb{P}_{p}^{\phi}(\phi(X)=\star)\leq\delta_{0},\quad\text{and}\quad\mathbb{P}_{q}^{\phi}(\phi(X)=\star)\leq\delta_{1}. \tag{37}
$$
We use $\mathcal{B}_{\mathrm{PF\text{-}E}}(p,q,\alpha,\beta,\delta_{0},\delta_{1})$ to denote this problem, and define its sample complexity ${n}^{*}_{\mathrm{PF\text{-}E}}(p,q,\alpha,\beta,\delta_{0},\delta_{1})$ to be the smallest $n$ such that there exists a test $\phi$ which solves $\mathcal{B}_{\mathrm{PF\text{-}E}}(p,q,\alpha,\beta,\delta_{0},\delta_{1})$ . When the context is clear, we shall drop $p$ and $q$ and use the notation $\mathcal{B}_{\mathrm{PF\text{-}E}}(\alpha,\beta,\delta_{0},\delta_{1})$ and ${n}^{*}_{\mathrm{PF\text{-}E}}(\alpha,\beta,\delta_{0},\delta_{1})$ .*
11.2 Prior-free hypothesis testing with erasures
Our main contribution in this section is to find a formula for ${n}^{*}_{\mathrm{PF\text{-}E}}(\alpha,\beta,\delta_{0},\delta_{1})$ when the parameters lie in $(0,1/8]$ .
**Theorem 11.4 (Sample complexity of simple binary hypothesis testing with erasures)**
*Let $p$ and $q$ be distributions on a finite alphabet $\mathcal{X}$ . Let $\alpha$ , $\beta$ , $\delta_{0}$ , and $\delta_{1}$ all lie in $(0,1/8]$ . Then the prior-free sample complexity of the binary hypothesis testing problem with erasures is given by
$$
{n}^{*}_{\mathrm{PF\text{-}E}}(\alpha,\beta,\delta_{0},\delta_{1})\asymp\max\{{n}^{*}_{\mathrm{PF}}(\alpha,\beta+\delta_{1}),{n}^{*}_{\mathrm{PF}}(\alpha+\delta_{0},\beta)\},
$$
where we used the shorthand ${n}^{*}_{\mathrm{PF}}(\alpha,\beta):={n}^{*}_{\mathrm{PF}}(p,q,\alpha,\beta)$ to denote the sample complexity of the standard binary hypothesis testing problem (without erasures) with type-I error at most $\alpha$ and type-II error at most $\beta$ .*
**Remark 11.5 (Benefit of erasure)**
*We noted earlier that the sample complexity of simple binary hypothesis testing when the type-I and type-II errors are at most $\beta$ is $\asymp\frac{\log(1/\beta)}{\mathrm{h}^{2}(p,q)}$ . If erasure probabilities of $\delta_{0}=\delta_{1}=1/10$ are allowed, the above result shows that the new sample complexity is roughly $\max({n}^{*}_{\mathrm{PF}}(0.1,\beta),{n}^{*}_{\mathrm{PF}}(\beta,0.1))\asymp\frac{\beta\log(1/\beta)}{\min\{\mathrm{JS}_{\beta}(p,q),\mathrm{JS}_{\beta}(q,p)\}}$ . This can be much lower than the sample complexity without erasures.
**Example 11.6**
*Let $p=(1-\epsilon,\epsilon,0)$ and $q=(1-\epsilon,0,\epsilon)$ . Observe that $\mathrm{h}^{2}(p,q)\asymp\epsilon$ , and so ${n}^{*}_{\mathrm{PF}}(p,q,\beta,\beta)\asymp\frac{\log(1/\beta)}{\mathrm{h}^{2}(p,q)}$ . Now suppose we are allowed to declare an erasure with some fixed probability like $1/10$ . Using the above result, the sample complexity for getting type-I and type-II errors at most $\beta$ will be characterized by ${\min\{\mathrm{JS}_{\beta}(p,q),\mathrm{JS}_{\beta}(q,p)\}}$ . Observe that
$$
\mathrm{JS}_{\beta}(p,q)=H(\beta p+\bar{\beta}q)-\beta H(p)-\bar{\beta}H(q)=\epsilon\left(\beta\log\frac{1}{\beta}+\bar{\beta}\log\frac{1}{\bar{\beta}}\right)\asymp\epsilon\beta\log\frac{1}{\beta}.
$$
By symmetry, $\mathrm{JS}_{\beta}(q,p)=\mathrm{JS}_{\beta}(p,q)$ . Thus, the sample complexity with erasures is
$$
{n}^{*}_{\mathrm{PF\text{-}E}}(p,q,\beta,\beta,0.1,0.1)\asymp\frac{\beta\log(1/\beta)}{\epsilon\beta\log\frac{1}{\beta}}=\frac{1}{\epsilon}.
$$
Observe that this is much more favourable than $\log(1/\beta)/\epsilon$ , which is the sample complexity without erasures.**
*Proof of Theorem 11.4*
We start with the lower bound. Suppose a test $\phi$ solves $\mathcal{B}_{\mathrm{PF\text{-}E}}(\alpha,\beta,\delta_{0},\delta_{1})$ with sample complexity ${n}^{*}_{\mathrm{PF\text{-}E}}(\alpha,\beta,\delta_{0},\delta_{1})$ . We can modify this test to a new test $\hat{\phi}$ that outputs a decision in the set $\{p,q\}$ by always declaring $p$ when the original decision is $\phi(X)=\star$ . This new test solves the prior-free hypothesis testing problem without erasures, with type-I and type-II errors controlled by
$$
\mathbb{P}_{p}(\hat{\phi}(X)=q)≤\alpha,\quad\text{ and }\quad\mathbb{P}_{q}(\hat{\phi}(X)=p)≤\beta+\delta_{1}.
$$
Thus, we must have the lower bound
$$
{n}^{*}_{\mathrm{PF\text{-}E}}(\alpha,\beta,\delta_{0},\delta_{1})≥{n}^{*}_{\mathrm{PF}}(\alpha,\beta+\delta_{1}).
$$
Using a similar argument but declaring $\hat{\phi}(X)=q$ when $\phi(X)=\star$ , we conclude the lower bound
$$
{n}^{*}_{\mathrm{PF\text{-}E}}(\alpha,\beta,\delta_{0},\delta_{1})≥\max\{{n}^{*}_{\mathrm{PF}}(\alpha,\beta+\delta_{1}),{n}^{*}_{\mathrm{PF}}(\alpha+\delta_{0},\beta)\}.
$$
We now turn our attention to the upper bound. Observe that any test in the standard setting that controls type-I and type-II errors by $\alpha$ and $\beta$ , respectively, is also a valid test for the erasure setting. This gives the upper bound
$$
{n}^{*}_{\mathrm{PF\text{-}E}}(\alpha,\beta,\delta_{0},\delta_{1})≤{n}^{*}_{\mathrm{PF}}(\alpha,\beta).
$$
Now if either $\delta_{0}≤\alpha$ or $\delta_{1}≤\beta$ , the lower and upper bounds are tight. To see this, suppose $\delta_{0}≤\alpha$ , then
$$
{n}^{*}_{\mathrm{PF}}(2\alpha,\beta)\leq{n}^{*}_{\mathrm{PF}}(\alpha+\delta_{0},\beta)\leq{n}^{*}_{\mathrm{PF}}(\alpha,\beta).
$$
Since $\max\{2\alpha,\alpha,\beta\}≤ 1/4$ , we may apply Proposition 4.7 to conclude that the sample complexity is $\asymp{n}^{*}_{\mathrm{PF}}(\alpha,\beta)$ . A similar argument also works for $\delta_{1}≤\beta$ . Thus, we may restrict our attention to the case with $\delta_{0}>\alpha$ and $\delta_{1}>\beta$ . (This is also the more relevant regime in practice.) To show a matching upper bound, we will construct a test with sample complexity ${n}^{*}_{\mathrm{PF}}(\alpha/3,(\beta+\delta_{1})/3)+{n}^{*}_{\mathrm{PF}}((\alpha+\delta_{0})/3,\beta/3)$ that achieves the necessary bounds on the prior-free problem with erasures. Our test is the following:
1. Let $\phi_{1}$ be a binary hypothesis test without erasures that attains type-I and type-II errors of at most $\alpha/3$ and $(\beta+\delta_{1})/3$ , respectively, using $N_{1}:={n}^{*}_{\mathrm{PF}}(\alpha/3,(\beta+\delta_{1})/3)$ samples. Sample $N_{1}$ points and let the first decision $D_{1}=\phi_{1}(X_{1},...,X_{N_{1}})$ .
1. Let $\phi_{2}$ be a binary hypothesis test without erasures that attains type-I and type-II errors of at most $(\alpha+\delta_{0})/3$ and $\beta/3$ , respectively, using $N_{2}:={n}^{*}_{\mathrm{PF}}((\alpha+\delta_{0})/3,\beta/3)$ samples. Sample $N_{2}$ points and let the second decision $D_{2}=\phi_{2}(X_{N_{1}+1},...,X_{N_{1}+N_{2}})$ .
1. Our final output $D:=\phi(X_{1},...,X_{N_{1}+N_{2}})$ is given as follows. If $D_{1}=D_{2}=p$ , declare $D=p$ . If $D_{1}=D_{2}=q$ , declare $D=q$ . Otherwise, declare $\star$ .
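The three-step construction above composes two erasure-free tests and erases exactly when they disagree. A minimal Python sketch (our own; `phi1` and `phi2` stand in for the abstract tests $\phi_{1}$ and $\phi_{2}$ ):

```python
def erasure_test(phi1, phi2, xs, n1, n2):
    """Two-stage test from the proof of Theorem 11.4.

    phi1, phi2: erasure-free tests mapping a batch of samples to "p" or "q";
    xs: at least n1 + n2 samples. Agreement is declared, disagreement erased.
    """
    d1 = phi1(xs[:n1])            # decision D1 on the first N1 samples
    d2 = phi2(xs[n1:n1 + n2])     # decision D2 on the next N2 samples
    return d1 if d1 == d2 else "*"
```

Because an undetected error now requires both stages to err, the error probabilities multiply, which is what drives the bounds computed next.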
We now evaluate all the errors for this test. The undetected error under $p$ is bounded as
$$
\mathbb{P}_{p}(D=q)=\mathbb{P}_{p}(D_{1}=q)\mathbb{P}_{p}(D_{2}=q)≤\frac{\alpha}{3}\cdot\frac{\alpha+\delta_{0}}{3}≤\frac{\alpha}{3}≤\alpha.
$$
The undetected error under $q$ is bounded as
$$
\mathbb{P}_{q}(D=p)=\mathbb{P}_{q}(D_{1}=p)\mathbb{P}_{q}(D_{2}=p)≤\frac{\beta+\delta_{1}}{3}\cdot\frac{\beta}{3}≤\frac{\beta}{3}≤\beta.
$$
The erasure error under $p$ is bounded as
$$
\mathbb{P}_{p}(D=\star)=\mathbb{P}_{p}(D_{1}=p)\mathbb{P}_{p}(D_{2}=q)+\mathbb{P}_{p}(D_{1}=q)\mathbb{P}_{p}(D_{2}=p)≤\frac{\alpha+\delta_{0}}{3}+\frac{\alpha}{3}≤\delta_{0},
$$
where the last step uses $\alpha<\delta_{0}$ .
Similarly, the erasure error under $q$ is bounded as
$$
\mathbb{P}_{q}(D=\star)=\mathbb{P}_{q}(D_{1}=p)\mathbb{P}_{q}(D_{2}=q)+\mathbb{P}_{q}(D_{1}=q)\mathbb{P}_{q}(D_{2}=p)≤\frac{\beta+\delta_{1}}{3}+\frac{\beta}{3}≤\delta_{1},
$$
using $\beta<\delta_{1}$ . Thus, our test achieves the desired errors using $N_{1}+N_{2}$ samples, which is easily seen to be
$$
N_{1}+N_{2}={n}^{*}_{\mathrm{PF}}(\alpha/3,(\beta+\delta_{1})/3)+{n}^{*}_{\mathrm{PF}}((\alpha+\delta_{0})/3,\beta/3)\asymp{n}^{*}_{\mathrm{PF}}(\alpha,\beta+\delta_{1})+{n}^{*}_{\mathrm{PF}}(\alpha+\delta_{0},\beta)\asymp\max\{{n}^{*}_{\mathrm{PF}}(\alpha,\beta+\delta_{1}),{n}^{*}_{\mathrm{PF}}(\alpha+\delta_{0},\beta)\}.
$$
This concludes the proof. ∎
11.3 Bayesian hypothesis testing with erasures
The sample complexities of the Bayesian and prior-free hypothesis testing problems in the erasure setting satisfy a relation similar to that in the standard setting noted in Claim 4.6. Throughout this section, we shall assume without loss of generality that $\pi≤ 1/2$ .
**Claim 11.7 (Relation between Bayesian and prior-free sample complexities in the erasure setting)**
*For any $\pi,\delta∈(0,1)$ and $\lambda>0$ , we have
$$
{n}^{*}_{\mathrm{PF\text{-}E}}\left(\frac{\delta}{\pi},\frac{\delta}{\overline{\pi}},\frac{\delta}{\pi\lambda},\frac{\delta}{\lambda\overline{\pi}}\right)\leq{n}^{*}_{\mathrm{B\text{-}E}}\left(\pi,\lambda,\delta\right)\leq{n}^{*}_{\mathrm{PF\text{-}E}}\left(\frac{\delta}{4\pi},\frac{\delta}{4\overline{\pi}},\frac{\delta}{4\pi\lambda},\frac{\delta}{4\lambda\overline{\pi}}\right). \tag{38}
$$*
* Proof*
Note that the $\lambda$ -weighted Bayes error of a test $\phi$ being at most $\delta$ means
$$
\pi\mathbb{P}_{p}(\phi(X)=q)+\overline{\pi}\mathbb{P}_{q}(\phi(X)=p)+\lambda\pi\mathbb{P}_{p}(\phi(X)=\star)+\lambda\overline{\pi}\mathbb{P}_{q}(\phi(X)=\star)≤\delta.
$$
| --- | --- | --- |
The sample complexity lower bound follows by noting that any test in the Bayesian setting whose $\lambda$ -weighted error is at most $\delta$ must have each of the four terms on the left hand side bounded by $\delta$ , as well. The sample complexity upper bound follows by noting that any prior-free test that has its undetected and erasure error probabilities bounded by the parameters on the right has each of the four terms at most $\delta/4$ , and so its $\lambda$ -weighted Bayes error is at most $\delta$ . ∎
This result, combined with Theorem 11.4 and Proposition 4.7 immediately yields the following corollary:
**Corollary 11.8 (Bayes sample complexity in a limited regime)**
*Without loss of generality, assume $\pi≤ 1/2$ . Let $\lambda>0$ and let $\delta≤\lambda\pi/8$ . Then ${n}^{*}_{\mathrm{B\text{-}E}}(\pi,\lambda,\delta)\asymp{n}^{*}_{\mathrm{PF\text{-}E}}\left(\frac{\delta}{\pi},\frac{\delta}{\overline{\pi}},\frac{\delta}{\pi\lambda},\frac{\delta}{\lambda\overline{\pi}}\right)$ .*
* Proof*
We may apply Proposition 4.7 and Theorem 11.4 to conclude that the lower and upper bounds in expression (38) are within constants of each other if
$$
\max\left\{\frac{\delta}{\pi},\frac{\delta}{\overline{\pi}},\frac{\delta}{\pi\lambda},\frac{\delta}{\lambda\overline{\pi}}\right\}\leq 1/8,
$$
which holds when $\delta≤\lambda\pi/8$ . ∎
This result, though useful, is not completely satisfactory because of the limited range of $\delta$ . Note that the Bayesian problem is only worth solving when $\delta≤\min\{\pi,\lambda\}$ , since the constant predictors $\phi(X)=q$ and $\phi(X)=\star$ achieve $\lambda$ -weighted errors of $\pi$ and $\lambda$ , respectively. Analogous to the standard setting, the regime of interest in the erasure setting should therefore be $\delta∈(0,\min\{\pi,\lambda\}/4]$ . This regime may be much larger than $(0,\pi\lambda/8]$ , for example, when $\lambda\asymp\pi$ .
In what follows, we develop an alternate approach to solve the sample complexity problem in the larger regime of interest. Our next lemma identifies the optimal Bayes test.
**Lemma 11.9 (Optimal Bayes test)**
*Let $p$ and $q$ be two distributions on a finite set $\mathcal{X}$ , and let the prior distribution on the hypotheses be $(\pi,\overline{\pi})$ for some $\pi≤ 1/2$ . For any single-sample test $\phi:\mathcal{X}→\{p,q,\star\}$ , consider the $\lambda$ -weighted Bayes error $\lambda\mathbb{P}^{\phi}(\mathcal{E}_{\mathrm{total}})+\overline{\lambda}\mathbb{P}^{\phi}(\mathcal{E}_{\mathrm{undetected}})$ , which equals $\lambda\mathbb{P}^{\phi}(\mathcal{E}_{\mathrm{erasure}})+\mathbb{P}^{\phi}(\mathcal{E}_{\mathrm{undetected}})$ . Then the following tests minimise the $\lambda$ -weighted Bayes error:
1. When $\lambda≥ 1/2$ , the optimal test $\phi$ declares no erasures and is the same as the optimal Bayes test in the standard setting. To be precise,
| | $\displaystyle\phi(x)=\begin{cases}p\quad&\text{ if }\frac{\pi p(x)}{\overline{\pi}q(x)}≥ 1\\ q\quad&\text{ if }\frac{\pi p(x)}{\overline{\pi}q(x)}<1.\end{cases}$ | |
| --- | --- | --- |
1. When $\lambda<1/2$ , the following test $\phi$ is optimal:
| | $\displaystyle\phi(x)=\begin{cases}p\quad&\text{ if }\frac{\pi p(x)}{\overline{\pi}q(x)}≥\frac{\overline{\lambda}}{\lambda}\\ q\quad&\text{ if }\frac{\pi p(x)}{\overline{\pi}q(x)}≤\frac{\lambda}{\overline{\lambda}}\\ \star&\text{ otherwise.}\end{cases}$ | |
| --- | --- | --- |*
* Proof*
Since the Bayes error is linear in the (possibly randomised) test, it is enough to consider deterministic hypothesis tests.
Proof of (i):
Consider any test $\phi:\mathcal{X}→\{p,q,\star\}$ with decision regions $R_{p}$ , $R_{q}$ , and $\mathcal{E}_{\mathrm{erasure}}$ , where the test declares $p$ , $q$ , and $\star$ respectively. We can write the Bayes error as
$$
\displaystyle\lambda\pi\mathbb{P}_{p}(\mathcal{E}_{\mathrm{erasure}})+\lambda\overline{\pi}\mathbb{P}_{q}(\mathcal{E}_{\mathrm{erasure}})+\pi\mathbb{P}_{p}(R_{q})+\overline{\pi}\mathbb{P}_{q}(R_{p})\geq\frac{\pi\mathbb{P}_{p}(\mathcal{E}_{\mathrm{erasure}})+\overline{\pi}\mathbb{P}_{q}(\mathcal{E}_{\mathrm{erasure}})}{2}+\pi\mathbb{P}_{p}(R_{q})+\overline{\pi}\mathbb{P}_{q}(R_{p})\geq\min\{\pi\mathbb{P}_{p}(\mathcal{E}_{\mathrm{erasure}}),\overline{\pi}\mathbb{P}_{q}(\mathcal{E}_{\mathrm{erasure}})\}+\pi\mathbb{P}_{p}(R_{q})+\overline{\pi}\mathbb{P}_{q}(R_{p}).
$$
Suppose $\pi\mathbb{P}_{p}(\mathcal{E}_{\mathrm{erasure}})≤\overline{\pi}\mathbb{P}_{q}(\mathcal{E}_{\mathrm{erasure}})$ . We show that there is a test without erasure that achieves the lower bound above. Consider the test $\tilde{\phi}:\mathcal{X}→\{p,q\}$ that declares $p$ on $R_{p}$ and $q$ on $R_{q}\cup\mathcal{E}_{\mathrm{erasure}}$ . The $\lambda$ -weighted Bayes error for this test is easily seen to be
| | $\displaystyle\pi\mathbb{P}_{p}(R_{q})+\overline{\pi}\mathbb{P}_{q}(R_{p})+\pi\mathbb{P}_{p}(\mathcal{E}_{\mathrm{erasure}}).$ | |
| --- | --- | --- |
A similar argument can be made for when $\overline{\pi}\mathbb{P}_{q}(\mathcal{E}_{\mathrm{erasure}})≤\pi\mathbb{P}_{p}(\mathcal{E}_{\mathrm{erasure}})$ . This shows that for every test with erasures, there exists a test without erasures that is as good or better. Thus, the optimal test can be taken to be the best test without erasures, which is simply the maximum-a-posteriori rule. This completes the proof of part (i).
Proof of (ii):
Suppose a test declares $p$ on a region $R_{p}⊆\mathcal{X}$ , $q$ on a region $R_{q}⊆\mathcal{X}$ , and $\star$ elsewhere. Its weighted error term can be expressed as
$$
\displaystyle\lambda\mathbb{P}(\mathcal{E}_{\mathrm{total}})+\overline{\lambda}\mathbb{P}(\mathcal{E}_{\mathrm{undetected}})=\left(\lambda\sum_{x\in R_{p}^{c}}\pi p(x)+\overline{\lambda}\sum_{x\in R_{p}}\overline{\pi}q(x)\right)+\left(\lambda\sum_{x\in R_{q}^{c}}\overline{\pi}q(x)+\overline{\lambda}\sum_{x\in R_{q}}\pi p(x)\right). \tag{39}
$$
We can minimise the two terms separately, the first over $R_{p}$ and the second over $R_{q}$ . If the resulting minimisers are disjoint, then we have identified an optimal test. To minimise the first bracketed expression, it is clear that one should set $x∈ R_{p}$ if $\overline{\lambda}\overline{\pi}q(x)≤\lambda\pi p(x)$ . A similar argument works for the second bracketed expression, yielding the rule $x∈ R_{q}$ if $\lambda\overline{\pi}q(x)≥\overline{\lambda}\pi p(x)$ . Since $\lambda<1/2$ , the regions $R_{p}$ and $R_{q}$ are disjoint, and hence $\phi$ is a valid test with erasures. This concludes the proof of part (ii). ∎
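To see the threshold structure of part (ii) concretely, here is a minimal numerical sketch (the distributions, prior, and helper names are illustrative, not from the paper) that brute-forces all $3^{|\mathcal{X}|}$ deterministic tests on a three-element alphabet and checks that the likelihood-ratio test with erasures attains the minimal $\lambda$ -weighted Bayes error:

```python
import itertools

def weighted_error(phi, p, q, pi, lam):
    """lambda-weighted Bayes error: lam*P(total) + (1-lam)*P(undetected)."""
    err = 0.0
    for x, d in enumerate(phi):
        if d != 'p':  # total error under hypothesis p
            err += lam * pi * p[x]
        if d != 'q':  # total error under hypothesis q
            err += lam * (1 - pi) * q[x]
        if d == 'q':  # undetected error under p
            err += (1 - lam) * pi * p[x]
        if d == 'p':  # undetected error under q
            err += (1 - lam) * (1 - pi) * q[x]
    return err

def threshold_test(p, q, pi, lam):
    """Likelihood-ratio test with erasures from Lemma 11.9 (lam < 1/2)."""
    phi = []
    for x in range(len(p)):
        ratio = pi * p[x] / ((1 - pi) * q[x])
        if ratio >= (1 - lam) / lam:
            phi.append('p')
        elif ratio <= lam / (1 - lam):
            phi.append('q')
        else:
            phi.append('*')
    return phi

p = [0.6, 0.3, 0.1]
q = [0.1, 0.3, 0.6]
pi, lam = 0.4, 0.3
best = min(weighted_error(phi, p, q, pi, lam)
           for phi in itertools.product('pq*', repeat=3))
opt = weighted_error(threshold_test(p, q, pi, lam), p, q, pi, lam)
assert abs(opt - best) < 1e-12
```

For the values above the minimiser declares $p$ , erases, and declares $q$ on the three symbols, respectively, matching the two-threshold rule of the lemma.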
Lemma 11.9 shows that we can focus on $\lambda<1/2$ , since the sample complexity in the erasure setting when $\lambda≥ 1/2$ will be identical to that in the standard setting. To tackle the $\lambda<1/2$ case, we first show an alternate characterisation of the $\lambda$ -weighted Bayes error.
**Lemma 11.10**
*Let $p$ and $q$ be supported on a finite set $\mathcal{X}$ . Let the prior distribution on the hypotheses be $(\pi,1-\pi)$ with $\pi≤ 1/2$ , and let $\lambda∈(0,1/2)$ . Then the following equality holds:
| | $\displaystyle\min_{\phi}\mathbb{P}^{\phi}(\mathcal{E}_{\mathrm{undetected}})+\lambda\mathbb{P}^{\phi}(\mathcal{E}_{\mathrm{erasure}})=\sum_{x∈\mathcal{X}}\min\{\lambda\pi p(x),\overline{\lambda}\overline{\pi}q(x)\}+\sum_{x∈\mathcal{X}}\min\{\overline{\lambda}\pi p(x),\lambda\overline{\pi}q(x)\},$ | |
| --- | --- | --- |
where the minimum is taken over all tests $\phi:\mathcal{X}→\{p,q,\star\}$ .*
* Proof*
Consider the equality in equation (39). As $\lambda<1/2$ , the minimum of the left-hand side over all tests $\phi$ equals the sum of the minima of the two bracketed terms on the right-hand side, taken separately over the decision sets $R_{p}$ and $R_{q}$ (the resulting minimisers are disjoint, as in the proof of Lemma 11.9). It is easy to check that
| | $\displaystyle\min_{R_{p}}\left(\lambda\sum_{x∈ R_{p}^{c}}\pi p(x)+\overline{\lambda}\sum_{x∈ R_{p}}\overline{\pi}q(x)\right)=\sum_{x∈\mathcal{X}}\min\{\lambda\pi p(x),\overline{\lambda}\overline{\pi}q(x)\},$ | |
| --- | --- | --- |
and
| | $\displaystyle\min_{R_{q}}\left(\lambda\sum_{x∈ R_{q}^{c}}\overline{\pi}q(x)+\overline{\lambda}\sum_{x∈ R_{q}}\pi p(x)\right)=\sum_{x∈\mathcal{X}}\min\{\overline{\lambda}\pi p(x),\lambda\overline{\pi}q(x)\}.$ | |
| --- | --- | --- |
∎
**Remark 11.11**
*Suppose $P_{e}(\pi,p,q)$ is the optimal Bayes error for a simple binary hypothesis testing problem without erasures, where the prior is $(\pi,\overline{\pi})$ and the two distributions are $p$ and $q$ from Lemma 11.9. Then the right hand side in the above lemma may be thought of as the weighted error probabilities of two separate binary hypothesis testing problems without erasures:
$$
\displaystyle\min_{\phi}\mathbb{P}^{\phi}(\mathcal{E}_{\mathrm{undetected}})+\lambda\mathbb{P}^{\phi}(\mathcal{E}_{\mathrm{erasure}})=(\lambda\pi+\overline{\lambda}\overline{\pi})P_{e}\left(\frac{\lambda\pi}{\lambda\pi+\overline{\lambda}\overline{\pi}},p,q\right)+(\overline{\lambda}\pi+\lambda\overline{\pi})P_{e}\left(\frac{\overline{\lambda}\pi}{\overline{\lambda}\pi+\lambda\overline{\pi}},p,q\right). \tag{40}
$$*
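Equation (40) can be sanity-checked numerically. The sketch below (illustrative values and hypothetical helper names) compares a brute-force minimisation over all tests with erasures against the weighted sum of two erasure-free Bayes errors, using the closed form $P_{e}(\pi,p,q)=\sum_{x}\min\{\pi p(x),\overline{\pi}q(x)\}$ :

```python
import itertools

def lhs(p, q, pi, lam):
    """Brute-force min over tests phi: X -> {p, q, *} of
    P(undetected) + lam * P(erasure)."""
    best = float('inf')
    for phi in itertools.product('pq*', repeat=len(p)):
        und = sum(pi * p[x] for x in range(len(p)) if phi[x] == 'q') \
            + sum((1 - pi) * q[x] for x in range(len(p)) if phi[x] == 'p')
        era = sum(pi * p[x] + (1 - pi) * q[x]
                  for x in range(len(p)) if phi[x] == '*')
        best = min(best, und + lam * era)
    return best

def P_e(prior, p, q):
    """Optimal Bayes error without erasures: sum_x min{prior*p, (1-prior)*q}."""
    return sum(min(prior * px, (1 - prior) * qx) for px, qx in zip(p, q))

p = [0.5, 0.3, 0.2]
q = [0.1, 0.2, 0.7]
pi, lam = 0.3, 0.2
lam_bar, pi_bar = 1 - lam, 1 - pi
w1 = lam * pi + lam_bar * pi_bar          # normaliser of first problem
w2 = lam_bar * pi + lam * pi_bar          # normaliser of second problem
rhs = w1 * P_e(lam * pi / w1, p, q) + w2 * P_e(lam_bar * pi / w2, p, q)
assert abs(lhs(p, q, pi, lam) - rhs) < 1e-12
```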
**Theorem 11.12**
*Let $p$ and $q$ be distributions on a finite alphabet $\mathcal{X}$ . Let the prior distribution on the hypotheses be $(\pi,1-\pi)$ where $\pi≤ 1/2$ . Let $\lambda>0$ and $\delta∈(0,1)$ . The sample complexity of the simple binary hypothesis testing with erasures problem $\mathcal{B}_{\mathrm{B\text{-}E}}(p,q,\pi,\lambda,\delta)$ is given by:
1. When $\lambda≥ 1/2$ , the sample complexity satisfies ${n}^{*}_{\mathrm{B\text{-}E}}(p,q,\pi,\lambda,\delta)\asymp{n}^{*}_{\mathrm{B}}(p,q,\pi,\delta)$ .
1. Let $\pi<1/16$ and $\lambda<1/16$ . Define $I_{1}=(0,\pi\lambda/4]$ and $I_{2}=[2\pi\lambda,\min\{\pi,\lambda\}/8]$ .
1. When $\pi≤\lambda$ , the sample complexity is given by
| | $\displaystyle{n}^{*}_{\mathrm{B\text{-}E}}(p,q,\pi,\lambda,\delta)\asymp\max\{{n}^{*}_{\mathrm{B}}(p,q,\pi^{\prime},\delta^{\prime}),{n}^{*}_{\mathrm{B}}(p,q,\pi^{\prime\prime},\delta^{\prime\prime})\}\quad\text{ if }\delta∈ I_{1}\cup I_{2},$ | |
| --- | --- | --- |
where $\pi^{\prime}=\frac{\lambda\pi}{\lambda\pi+\overline{\lambda}\overline{\pi}}$ , $\delta^{\prime}=\frac{\delta}{\lambda\pi+\overline{\lambda}\overline{\pi}}$ , $\pi^{\prime\prime}=\frac{\overline{\lambda}\pi}{\overline{\lambda}\pi+\lambda\overline{\pi}}$ , and $\delta^{\prime\prime}=\frac{\delta}{\overline{\lambda}\pi+\lambda\overline{\pi}}$ .
1. When $\lambda≤\pi$ , the sample complexity is given by
| | $\displaystyle{n}^{*}_{\mathrm{B\text{-}E}}(p,q,\pi,\lambda,\delta)\asymp\max\{{n}^{*}_{\mathrm{B}}(p,q,\pi^{\prime},\delta^{\prime}),{n}^{*}_{\mathrm{B}}(q,p,\pi^{\prime\prime},\delta^{\prime\prime})\}\quad\text{ if }\delta∈ I_{1}\cup I_{2},$ | |
| --- | --- | --- |
where $\pi^{\prime}=\frac{\lambda\pi}{\lambda\pi+\overline{\lambda}\overline{\pi}}$ , $\delta^{\prime}=\frac{\delta}{\lambda\pi+\overline{\lambda}\overline{\pi}}$ , $\pi^{\prime\prime}=\frac{\lambda\overline{\pi}}{\overline{\lambda}\pi+\lambda\overline{\pi}}$ , and $\delta^{\prime\prime}=\frac{\delta}{\overline{\lambda}\pi+\lambda\overline{\pi}}$ . (Note that the $\pi^{\prime\prime}$ here is different from above, to ensure that $\pi^{\prime\prime}≤ 1/2$ . Note also that the second term within the maximisation has the roles of $p$ and $q$ switched.)*
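The parameter reductions in case 2(a) are elementary arithmetic. The sketch below (hypothetical helper name, illustrative parameters) computes them for one instance with $\delta∈ I_{1}$ and checks that both reduced problems land in the regime $\delta^{\prime}≤\pi^{\prime}/4$ and $\delta^{\prime\prime}≤\pi^{\prime\prime}/4$ where Proposition 4.7 applies:

```python
def reduced_params(pi, lam, delta):
    """Hypothetical helper: reduced priors and error levels from the
    case pi <= lam of Theorem 11.12."""
    lam_bar, pi_bar = 1 - lam, 1 - pi
    w1 = lam * pi + lam_bar * pi_bar   # normaliser of the first problem
    w2 = lam_bar * pi + lam * pi_bar   # normaliser of the second problem
    return (lam * pi / w1, delta / w1), (lam_bar * pi / w2, delta / w2)

# Illustrative instance with delta in I_1 = (0, pi*lam/4].
(pi1, d1), (pi2, d2) = reduced_params(pi=0.02, lam=0.05, delta=1e-4)
assert pi1 <= 0.5 and pi2 <= 0.5        # both reduced priors are valid
assert d1 <= pi1 / 4 and d2 <= pi2 / 4  # regime where Prop. 4.7 applies
```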
**Remark 11.13**
*A peculiar point in result $2$ above is that we can identify the sample complexity for all $\delta≤\min\{\pi,\lambda\}/8$ , except those in a small range around $\pi\lambda$ , specifically, $(\pi\lambda/4,2\pi\lambda)$ . This may not be an artefact of the analysis. The analysis relies on applying Proposition 4.7 which is only valid when $\delta≤\pi/4$ in the $\mathcal{B}_{\mathrm{B}}(p,q,\pi,\delta)$ problem. We believe the proposition is not always valid outside this regime, since the sample complexity ${n}^{*}_{\mathrm{B}}(p,q,\pi,\delta)$ exhibits unexpected behaviour when $\delta$ is close to $\pi$ . This may lead to some discontinuity in the sample complexity when $\delta∈(\pi\lambda/4,2\pi\lambda)$ .*
* Proof*
The $\lambda≥ 1/2$ case is immediate using part 1 of Lemma 11.9. We consider the two cases:
1. $\pi≤\lambda<1/16$ .
1. $\lambda<\pi<1/16$ .
Case I ( $\pi≤\lambda$ ):
First note that when $\pi≤\lambda<1/2$ , we have
| | $\displaystyle\lambda\pi<\overline{\lambda}\overline{\pi},\quad\text{ and }\quad\overline{\lambda}\pi<\lambda\overline{\pi}.$ | |
| --- | --- | --- |
Using the error probability interpretation in equation (40), we see that if ${n}^{*}_{\mathrm{B\text{-}E}}(\pi,\lambda,\delta)=n$ , then
$$
\displaystyle(\lambda\pi+\overline{\lambda}\overline{\pi})P_{e}^{(n)}\left(\frac{\lambda\pi}{\lambda\pi+\overline{\lambda}\overline{\pi}},p,q\right)+(\overline{\lambda}\pi+\lambda\overline{\pi})P_{e}^{(n)}\left(\frac{\overline{\lambda}\pi}{\overline{\lambda}\pi+\lambda\overline{\pi}},p,q\right)\leq\delta, \tag{41}
$$
and so
| | $\displaystyle P_{e}^{(n)}\left(\frac{\lambda\pi}{\lambda\pi+\overline{\lambda}\overline{\pi}},p,q\right)≤\frac{\delta}{\lambda\pi+\overline{\lambda}\overline{\pi}}$ | $\displaystyle\quad\text{ and }\quad P_{e}^{(n)}\left(\frac{\overline{\lambda}\pi}{\overline{\lambda}\pi+\lambda\overline{\pi}},p,q\right)≤\frac{\delta}{\overline{\lambda}\pi+\lambda\overline{\pi}}.$ | |
| --- | --- | --- | --- |
This gives us a lower bound on $n$ , specifically,
$$
\displaystyle n\geq\max\left\{{n}^{*}_{\mathrm{B}}\left(p,q,\frac{\lambda\pi}{\lambda\pi+\overline{\lambda}\overline{\pi}},\frac{\delta}{\lambda\pi+\overline{\lambda}\overline{\pi}}\right),{n}^{*}_{\mathrm{B}}\left(p,q,\frac{\overline{\lambda}\pi}{\overline{\lambda}\pi+\lambda\overline{\pi}},\frac{\delta}{\overline{\lambda}\pi+\lambda\overline{\pi}}\right)\right\}. \tag{42}
$$
Furthermore, we can also get an upper bound on $n$ by making each of the two terms on the left hand side of (41) at most $\delta/2$ , giving the bound
$$
\displaystyle n\leq\max\left\{{n}^{*}_{\mathrm{B}}\left(p,q,\frac{\lambda\pi}{\lambda\pi+\overline{\lambda}\overline{\pi}},\frac{\delta}{2(\lambda\pi+\overline{\lambda}\overline{\pi})}\right),{n}^{*}_{\mathrm{B}}\left(p,q,\frac{\overline{\lambda}\pi}{\overline{\lambda}\pi+\lambda\overline{\pi}},\frac{\delta}{2(\overline{\lambda}\pi+\lambda\overline{\pi})}\right)\right\}. \tag{43}
$$
If we can show that the lower bound (42) matches the upper bound (43) up to constants, we will be done.
Consider the two intervals: $I_{1}=(0,\pi\lambda/4]$ and $I_{2}=[2\pi\lambda,\pi/8]$ . (Note that $I_{2}$ is a valid interval since $\lambda<1/16$ .) We claim that the sample complexity $n$ is determined within constants for $\delta∈ I_{1}\cup I_{2}$ . This resolves the sample complexity question for almost the entire range of interest, i.e. $(0,\pi/8]$ , except for a small range when $\delta∈(\pi\lambda/4,2\pi\lambda)$ .
Observe that Proposition 4.7 states that ${n}^{*}_{\mathrm{B}}(p,q,\pi,\delta)\asymp{n}^{*}_{\mathrm{B}}(p,q,\pi,\delta/2)$ as long as $\delta≤\pi/4$ . Let $\pi^{\prime}=\frac{\lambda\pi}{\lambda\pi+\overline{\lambda}\overline{\pi}}$ , $\delta^{\prime}=\frac{\delta}{\lambda\pi+\overline{\lambda}\overline{\pi}}$ , $\pi^{\prime\prime}=\frac{\overline{\lambda}\pi}{\overline{\lambda}\pi+\lambda\overline{\pi}}$ , and $\delta^{\prime\prime}=\frac{\delta}{\overline{\lambda}\pi+\lambda\overline{\pi}}$ . If $\delta∈ I_{1}$ , it is easy to check that $\delta^{\prime}≤\pi^{\prime}/4$ and $\delta^{\prime\prime}≤\pi^{\prime\prime}/4$ . Hence, we conclude that
| | $\displaystyle{n}^{*}_{\mathrm{B}}\left(p,q,\pi^{\prime},\delta^{\prime}\right)\asymp{n}^{*}_{\mathrm{B}}\left(p,q,\pi^{\prime},\delta^{\prime}/2\right)$ | $\displaystyle\quad\text{ and }\quad{n}^{*}_{\mathrm{B}}\left(p,q,\pi^{\prime\prime},\delta^{\prime\prime}\right)\asymp{n}^{*}_{\mathrm{B}}\left(p,q,\pi^{\prime\prime},\delta^{\prime\prime}/2\right).$ | |
| --- | --- | --- | --- |
Thus, the sample complexity when $\delta∈ I_{1}$ is simply
| | $\displaystyle{n}^{*}_{\mathrm{B\text{-}E}}(p,q,\pi,\lambda,\delta)\asymp\max\{{n}^{*}_{\mathrm{B}}(p,q,\pi^{\prime},\delta^{\prime}),{n}^{*}_{\mathrm{B}}(p,q,\pi^{\prime\prime},\delta^{\prime\prime})\},$ | |
| --- | --- | --- |
and both terms on the right hand side can be computed (up to constants) using Theorem 2.1.
If $\delta∈ I_{2}$ , observe that ${n}^{*}_{\mathrm{B}}(p,q,\pi^{\prime},\delta^{\prime})={n}^{*}_{\mathrm{B}}(p,q,\pi^{\prime},\delta^{\prime}/2)=0$ , because $\delta^{\prime}≥ 2\pi^{\prime}$ . Furthermore, we have
| | $\displaystyle{n}^{*}_{\mathrm{B}}(p,q,\pi^{\prime\prime},\delta^{\prime\prime})\asymp{n}^{*}_{\mathrm{B}}(p,q,\pi^{\prime\prime},\delta^{\prime\prime}/2)$ | |
| --- | --- | --- |
as long as $\delta^{\prime\prime}≤\pi^{\prime\prime}/4$ , which happens when
| | $\displaystyle\delta≤\overline{\lambda}\pi/4.$ | |
| --- | --- | --- |
This is easily satisfied since $\delta≤\pi/8$ and $\overline{\lambda}>1/2$ . This shows that for $\delta∈ I_{2}$ , the sample complexity is
| | $\displaystyle{n}^{*}_{\mathrm{B\text{-}E}}(p,q,\pi,\lambda,\delta)\asymp{n}^{*}_{\mathrm{B}}(p,q,\pi^{\prime\prime},\delta^{\prime\prime}).$ | |
| --- | --- | --- |
Case II ( $\lambda<\pi$ ):
First note that when $\lambda≤\pi<1/2$ , we have
| | $\displaystyle\lambda\pi<\overline{\lambda}\overline{\pi},\quad\text{ and }\quad\lambda\overline{\pi}<\overline{\lambda}\pi.$ | |
| --- | --- | --- |
Using the error probability interpretation in equation (40), we see that if ${n}^{*}_{\mathrm{B\text{-}E}}(\pi,\lambda,\delta)=n$ , then
$$
\displaystyle(\lambda\pi+\overline{\lambda}\overline{\pi})P_{e}^{(n)}\left(\frac{\lambda\pi}{\lambda\pi+\overline{\lambda}\overline{\pi}},p,q\right)+(\overline{\lambda}\pi+\lambda\overline{\pi})P_{e}^{(n)}\left(\frac{\lambda\overline{\pi}}{\overline{\lambda}\pi+\lambda\overline{\pi}},q,p\right)\leq\delta. \tag{44}
$$
Note that the second term has $p$ and $q$ exchanged to ensure that the argument denoting the prior probability in $P_{e}^{(n)}(\cdot)$ is at most one half. This means
| | $\displaystyle P_{e}^{(n)}\left(\frac{\lambda\pi}{\lambda\pi+\overline{\lambda}\overline{\pi}},p,q\right)≤\frac{\delta}{\lambda\pi+\overline{\lambda}\overline{\pi}}$ | $\displaystyle\quad\text{ and }\quad P_{e}^{(n)}\left(\frac{\lambda\overline{\pi}}{\overline{\lambda}\pi+\lambda\overline{\pi}},q,p\right)≤\frac{\delta}{\overline{\lambda}\pi+\lambda\overline{\pi}}.$ | |
| --- | --- | --- | --- |
Using a similar argument to that in Case I, we get a lower bound
$$
\displaystyle n\geq\max\left\{{n}^{*}_{\mathrm{B}}\left(p,q,\frac{\lambda\pi}{\lambda\pi+\overline{\lambda}\overline{\pi}},\frac{\delta}{\lambda\pi+\overline{\lambda}\overline{\pi}}\right),{n}^{*}_{\mathrm{B}}\left(q,p,\frac{\lambda\overline{\pi}}{\overline{\lambda}\pi+\lambda\overline{\pi}},\frac{\delta}{\overline{\lambda}\pi+\lambda\overline{\pi}}\right)\right\}=\max\{{n}^{*}_{\mathrm{B}}(p,q,\pi^{\prime},\delta^{\prime}),{n}^{*}_{\mathrm{B}}(q,p,\pi^{\prime\prime},\delta^{\prime\prime})\}, \tag{45}
$$
and an upper bound
$$
\displaystyle n\leq\max\left\{{n}^{*}_{\mathrm{B}}\left(p,q,\frac{\lambda\pi}{\lambda\pi+\overline{\lambda}\overline{\pi}},\frac{\delta}{2(\lambda\pi+\overline{\lambda}\overline{\pi})}\right),{n}^{*}_{\mathrm{B}}\left(q,p,\frac{\lambda\overline{\pi}}{\overline{\lambda}\pi+\lambda\overline{\pi}},\frac{\delta}{2(\overline{\lambda}\pi+\lambda\overline{\pi})}\right)\right\}=\max\{{n}^{*}_{\mathrm{B}}(p,q,\pi^{\prime},\delta^{\prime}/2),{n}^{*}_{\mathrm{B}}(q,p,\pi^{\prime\prime},\delta^{\prime\prime}/2)\}. \tag{46}
$$
We now consider the intervals $I_{1}=(0,\pi\lambda/4]$ and $I_{2}=[2\pi\lambda,\lambda/8]$ . Using essentially identical arguments to those in Case I, we note that when $\delta∈(0,\pi\lambda/4]$ , the sample complexity is
| | $\displaystyle{n}^{*}_{\mathrm{B\text{-}E}}(p,q,\pi,\lambda,\delta)\asymp\max\{{n}^{*}_{\mathrm{B}}(p,q,\pi^{\prime},\delta^{\prime}),{n}^{*}_{\mathrm{B}}(q,p,\pi^{\prime\prime},\delta^{\prime\prime})\}.$ | |
| --- | --- | --- |
For the second interval, ${n}^{*}_{\mathrm{B}}(p,q,\pi^{\prime},\delta^{\prime})={n}^{*}_{\mathrm{B}}(p,q,\pi^{\prime},\delta^{\prime}/2)=0$ , and so only the second term matters. Here, we are able to apply Proposition 4.7 because
| | $\displaystyle\delta≤\lambda\overline{\pi}/4,$ | |
| --- | --- | --- |
which is true since $\delta≤\lambda/8≤\lambda\overline{\pi}/4.$ Thus, when $\delta∈ I_{2}$ , we get the result
| | $\displaystyle{n}^{*}_{\mathrm{B\text{-}E}}(p,q,\pi,\lambda,\delta)\asymp{n}^{*}_{\mathrm{B}}(q,p,\pi^{\prime\prime},\delta^{\prime\prime}).$ | |
| --- | --- | --- |
This concludes the proof of all cases. ∎
12 Discussion
We studied simple binary hypothesis testing from the sample complexity perspective, a non-asymptotic regime that had received scant attention in prior literature. We identified the sample complexity (up to universal multiplicative constants) in the Bayesian and prior-free formulations of the problem, in terms of $f$ -divergences from the Hellinger and Jensen–Shannon families. We also addressed the sample complexity of the Bayesian and prior-free settings when erasures are permitted, and we made some observations about the sample complexity in the sequential setting.
There are many interesting research directions that emerge from our work. Characterising the sample complexity in the weak detection regime (cf. Section 2.5) remains an open problem. Indeed, even in the simple setting of $p=\mathcal{N}(-1,1)$ , $q=\mathcal{N}(1,1)$ , and $\pi<1/2$ , the numerically computed sample complexity demonstrates unusual behaviour with respect to $\gamma$ . In particular, it does depend on $\gamma$ (i.e., the upper bound in Theorem 8.2 is loose), but the dependence on $\gamma$ is super-linear (i.e., the lower bound in Theorem 8.2 is also loose).
The Hellinger characterisation of the sample complexity of simple binary hypothesis testing appears in Le Cam’s bound in minimax optimal estimation [Lecam73], in rates of convergence for locally minimax optimal estimation [DonL91], and in communication complexity lower bounds in theoretical computer science [BravGMNW16]. It would be interesting to see whether the skewed $\mathrm{JS}_{\pi}$ characterisation of the sample complexity presented in this work has similar applications. Finding the sample complexity in the sequential setting is another interesting direction to pursue.
Acknowledgements
Part of this work was done when VJ and PL were at semester-long programs at the Simons Institute for the Theory of Computing and the Simons–Laufer Mathematical Sciences Institute. VJ and PL thank these institutions for their hospitality.
\printbibliography
13 Auxiliary results
13.1 Proof of Fact 4.9
We will use the following form of Bennett’s inequality:
**Fact 13.1 (Concentration of binomial: Bennett’s inequality [Vershynin18, Theorem 2.9.2])**
*Let $X_{1},...,X_{m}$ be $m$ i.i.d. Bernoulli random variables with bias $p>0$ . For $u≥ 0$ , we have
| | $\displaystyle\mathbb{P}\left\{\left(\frac{1}{m}\sum_{i=1}^{m}X_{i}\right)≥ p(1+u)\right\}≤\exp\left(-mp(1-p)f\left(\frac{u}{1-p}\right)\right),$ | |
| --- | --- | --- |
where $f(u)=(1+u)\log(1+u)-u$ . In particular, if the bias $p$ satisfies $p≤ 0.5$ and $t≥ 1$ , we have the following bound. (This follows by noting that $(1-p)≥ 0.5$ , that $f$ is non-decreasing, and that $(1+u)\log(1+u)-u≥ 0.5u\log(1+u)$ ; indeed, the derivative of the difference of the two sides is $0.5\left(\log(1+u)-\frac{u}{u+1}\right)$ , which is nonnegative.)
$$
\displaystyle\mathbb{P}\left\{\left(\frac{1}{m}\sum_{i=1}^{m}X_{i}\right)\geq tp\right\}\leq\exp\left(-(1/4)mp(t-1)\log t\right)\,. \tag{47}
$$*
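As a quick numerical sanity check of the tail bound (47) (the parameters below are illustrative), one can compare the exact binomial tail against the right-hand side:

```python
import math

def binom_tail(m, p, k):
    """Exact P(S >= k) for S ~ Binomial(m, p)."""
    return sum(math.comb(m, i) * p**i * (1 - p)**(m - i)
               for i in range(k, m + 1))

def bennett_bound(m, p, t):
    """Right-hand side of (47): exp(-(1/4) m p (t-1) log t)."""
    return math.exp(-0.25 * m * p * (t - 1) * math.log(t))

m, p = 200, 0.05  # bias p <= 0.5 as required
for t in [2, 3, 5]:
    # P(sample mean >= t*p) = P(S >= t*m*p)
    tail = binom_tail(m, p, math.ceil(t * m * p))
    assert tail <= bennett_bound(m, p, t)
```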
We provide the proof of Fact 4.9 below for completeness:
* Proof of Fact 4.9*
Since the buckets are independent and $\tau≤ 1/4$ , the probability that the fraction of the buckets reporting wrong outcomes is at least $1/2$ can be upper-bounded using equation (47) (by taking $m=T$ , $p=\tau$ , $t=\frac{0.5}{\tau}$ ) by
| | $\displaystyle\exp\left(-\frac{1}{4}\,T\tau\left(\frac{1}{2\tau}-1\right)\log\left(\frac{1}{2\tau}\right)\right)≤\exp\left(-\frac{1}{4}\,T\cdot\frac{1}{4}\cdot\frac{1}{2}\log\left(\frac{1}{\tau}\right)\right),$ | |
| --- | --- | --- |
where we used that $\tau\left(\frac{1}{2\tau}-1\right)=\frac{1}{2}-\tau≥\frac{1}{4}$ and $\log\left(\frac{1}{2\tau}\right)≥\frac{1}{2}\log\left(\frac{1}{\tau}\right)$ for $\tau≤ 1/4$ .
∎
13.2 Proof of Claim 7.8
See 7.8
* Proof*
We consider two cases. For all $x$ such that $\pi x≤ 1$ , we have
| | $\displaystyle 1+\frac{x}{1+\pi x}≤ 1+x=\left(1+x\right)^{\overline{\lambda}_{*}}\left(1+x\right)^{\lambda_{*}}≤\left(1+x\right)^{\overline{\lambda}_{*}}\left(1+\frac{1}{\pi}\right)^{\lambda_{*}}.$ | |
| --- | --- | --- |
Thus, it remains to control the overhead term $\left(1+\frac{1}{\pi}\right)^{\lambda_{*}}$ , which we will show is at most $e^{2r}$ . We first rewrite it as $\exp\left(\lambda_{*}\log\left(1+\frac{1}{\pi}\right)\right)$ . For the exponent, we have
$$
\lambda_{*}\log(1+1/\pi)\leq\frac{r}{\log(1/\pi)}\cdot 2\log(1/\pi)\leq 2r\,,
$$
where we use that $\log((1+\pi)/\pi)≤ 2\log(1/\pi)$ . Thus, we have $1+\frac{x}{1+\pi x}≤ e^{2r}(1+x)^{\overline{\lambda}_{*}}$ . We now consider the case when $x>1/\pi$ . Following similar steps, we arrive at the desired conclusion:
| | $\displaystyle 1+\frac{x}{1+\pi x}≤ 1+\frac{x}{\pi x}=1+\frac{1}{\pi}=\left(1+\frac{1}{\pi}\right)^{\overline{\lambda}_{*}}\left(1+\frac{1}{\pi}\right)^{\lambda_{*}}≤ e^{2r}\left(1+\frac{1}{\pi}\right)^{\overline{\lambda}_{*}}≤ e^{2r}(1+x)^{\overline{\lambda}_{*}}.$ | |
| --- | --- | --- |
| --- | --- | --- |
∎
14 Asymptotic analyses of the errors in hypothesis testing with erasures
In this section, we analyse how the probabilities of undetected error and erasure error behave in the asymptotic regime of sample size $n$ tending to infinity. We could not find these results in the literature and they may be of independent interest.
There are three different asymptotic regimes we consider: the Stein regime, the Chernoff regime, and the Bayes regime. The regime names and their analyses closely follow the proof techniques in [PolWu23] for the simple binary hypothesis testing setting without erasures, which we shall refer to as the standard setting. The simple binary hypothesis testing with erasures will be referred to as the erasure setting.
We recap known results in the standard setting:
1. Stein regime: When the type-I error is at most $\epsilon$ , the type-II error decays exponentially fast with exponent $\mathrm{KL}(p,q)$ . When the type-II error is at most $\epsilon$ , the type-I error decays exponentially fast with exponent $\mathrm{KL}(q,p)$ .
1. Chernoff regime: Let $A$ and $B$ be the exponential decay rates of the type-I and the type-II errors respectively. Then the set of all achievable $(A,B)$ is
| | $\displaystyle\{(A_{\lambda},B_{\lambda}):A_{\lambda}=\mathrm{KL}(p_{\lambda},p),B_{\lambda}=\mathrm{KL}(p_{\lambda},q),\lambda∈[0,1]\},$ | |
| --- | --- | --- |
where $p_{\lambda}$ is the tilted distribution $p_{\lambda}(x)\propto p(x)^{\overline{\lambda}}q(x)^{\lambda}$ .
1. Bayes regime: When the prior distribution on the hypothesis is $(\pi,1-\pi)$ for $\pi∈(0,1)$ , the exponential rate of decay of the Bayes error (which is a weighted sum of the type-I and type-II errors) is given by the Chernoff information between $p$ and $q$ , which equals
| | $\displaystyle\mathrm{CI}(p,q)=-\inf_{\lambda∈[0,1]}\log\sum_{x∈\mathcal{X}}p(x)^{1-\lambda}q(x)^{\lambda}.$ | |
| --- | --- | --- |
Chernoff information also equals $A_{\lambda^{*}}$ from the Chernoff regime where $\lambda^{*}$ is such that $A_{\lambda^{*}}=B_{\lambda^{*}}$ .
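The quantities in this recap are easy to compute for small alphabets. The sketch below (illustrative distributions; the grid search is a simple stand-in for the one-dimensional convex minimisation) computes the KL divergences and the Chernoff information, and checks the inequality $\mathrm{CI}(p,q)≤\min\{\mathrm{KL}(p,q),\mathrm{KL}(q,p)\}$ used later in this section:

```python
import math

def kl(p, q):
    """KL divergence sum_x p(x) log(p(x)/q(x)), in nats."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def chernoff_information(p, q, grid=10**4):
    """CI(p,q) = -inf over lambda in [0,1] of log sum_x p^{1-l} q^l,
    approximated here by a grid search over lambda."""
    def g(l):
        return math.log(sum(px ** (1 - l) * qx ** l for px, qx in zip(p, q)))
    return -min(g(i / grid) for i in range(grid + 1))

p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
ci = chernoff_information(p, q)
assert 0 < ci <= min(kl(p, q), kl(q, p))  # CI is dominated by both KLs
```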
In the erasure setting, if the prior is $(\pi,1-\pi)$ , the two errors of interest are:
| | $\displaystyle\mathbb{P}(\mathcal{E}_{\mathrm{erasure}})$ | $\displaystyle=\mathbb{P}(\phi(X_{1},...,X_{n})=\star)=\pi\mathbb{P}_{p}(\phi(X_{1},...,X_{n})=\star)+\overline{\pi}\mathbb{P}_{q}(\phi(X_{1},...,X_{n})=\star),$ | |
| --- | --- | --- | --- |
and
| | $\displaystyle\mathbb{P}(\mathcal{E}_{\mathrm{undetected}})$ | $\displaystyle=\pi\mathbb{P}_{p}(\phi(X_{1},...,X_{n})=q)+\overline{\pi}\mathbb{P}_{q}(\phi(X_{1},...,X_{n})=p).$ | |
| --- | --- | --- | --- |
The questions to be addressed in the three asymptotic regimes are as follows:
1. Stein regime: When $\mathbb{P}(\mathcal{E}_{\mathrm{erasure}})$ is at most $\epsilon$ , what is the exponential rate of decay of $\mathbb{P}(\mathcal{E}_{\mathrm{undetected}})$ ?
1. Chernoff regime: When both $\mathbb{P}(\mathcal{E}_{\mathrm{erasure}})$ and $\mathbb{P}(\mathcal{E}_{\mathrm{undetected}})$ are decaying exponentially fast, say with rates $A$ and $B$ , what are all pairs of achievable rates $(A,B)$ ?
1. Bayes regime: What is the exponential rate of decay of $\lambda\mathbb{P}(\mathcal{E}_{\mathrm{erasure}})+\mathbb{P}(\mathcal{E}_{\mathrm{undetected}})$ , which corresponds to the weighted sum $\lambda\mathbb{P}(\mathcal{E}_{\mathrm{total}})+\overline{\lambda}\mathbb{P}(\mathcal{E}_{\mathrm{undetected}})$ ?
Thus, the main difference between the standard and the erasure settings is that the role of type-I and type-II errors is taken up by the erasure and undetected errors. The latter two errors are not symmetric, leading to a few differences between the two settings. First, it is necessary to specify the prior distribution on the hypotheses to define the errors in the erasure setting, and so the Stein and Chernoff regimes are not “prior-free” as in the standard setting. And second, in the Stein regime, it only makes sense to bound the erasure error probability and ask for the decay rate of the undetected error probability; the counterpart of bounding the undetected error probability and studying the decay rate of the erasure error probability does not make sense. This is because we can simply refrain from declaring erasures, thus ensuring an erasure error probability of 0, while making the undetected error probability arbitrarily small for all large $n$ (note that it decays like $e^{-n(\mathrm{CI}(p,q)+o(1))}$ ).
14.1 Stein regime in the erasure setting
**Theorem 14.1**
*Let $p$ and $q$ be two distributions on a finite domain $\mathcal{X}$ . Let the prior distribution on the hypothesis $\theta$ be $(\pi,\overline{\pi})$ for $\pi∈(0,1)$ . Consider a (possibly randomised) hypothesis test $\phi$ that takes as input $n$ i.i.d. samples $X_{1},...,X_{n}$ distributed according to $p^{\otimes n}$ if $\theta=p$ and $q^{\otimes n}$ if $\theta=q$ , and produces an estimate $\phi(X_{1},...,X_{n})∈\{p,q,\star\}$ . Let $\mathcal{E}_{\mathrm{undetected}}^{n}$ be the event of an undetected error and $\mathcal{E}_{\mathrm{erasure}}^{n}$ be the event of an erasure error when using $n$ observations. The $\epsilon$ -optimal exponent in Stein’s regime is defined as
| | $\displaystyle V_{\epsilon}:=\sup\{E~{}:~{}∃ n_{0},∀ n>n_{0},∃\phi\text{ such that }\mathbb{P}(\mathcal{E}_{\mathrm{erasure}}^{n})<\epsilon,\ \mathbb{P}(\mathcal{E}_{\mathrm{undetected}}^{n})<e^{-nE}\}.$ | |
| --- | --- | --- |
Then the Stein exponent is given by
| | $\displaystyle V_{\epsilon}=\begin{cases}\min\{\mathrm{KL}(p,q),\mathrm{KL}(q,p)\}\quad&\text{ if }\epsilon∈(0,\pi),\\ \max\{\mathrm{KL}(p,q),\mathrm{KL}(q,p)\}\quad&\text{ if }\epsilon∈(\pi,1).\end{cases}$ | |
| --- | --- | --- |*
Observe that in the Stein regime, if erasures are not allowed then the probability of undetected error decays like $e^{-n(\mathrm{CI}(p,q)+o(1))}$ . Since $\mathrm{CI}(p,q)≤\min\{\mathrm{KL}(p,q),\mathrm{KL}(q,p)\}$ , it is evident that the flexibility of declaring erasures boosts the error exponent even for very small $\epsilon$ . As $\epsilon$ increases beyond $\pi$ , we observe a threshold phenomenon wherein the exponent jumps from the minimum of the two KL divergences to their maximum. It is interesting to note that this phenomenon has no counterpart in the standard setting.
We begin by collecting structural results for the erasure setting.
**Lemma 14.2**
*Let $p$ and $q$ be two distributions on a finite set $\mathcal{X}$ . Let $0<\lambda_{1}<\lambda_{2}$ and consider the weighted error term $\lambda_{1}\mathbb{P}(\mathcal{E}_{\mathrm{total}})+\lambda_{2}\mathbb{P}(\mathcal{E}_{\mathrm{undetected}})$ , which equals $\lambda_{1}\mathbb{P}(\mathcal{E}_{\mathrm{erasure}})+(\lambda_{1}+\lambda_{2})\mathbb{P}(\mathcal{E}_{\mathrm{undetected}})$ . Then the following test $\phi$ minimises this weighted error:
| | $\displaystyle\phi(x)=\begin{cases}p\quad&\text{ if }\frac{\pi p(x)}{\overline{\pi}q(x)}≥\frac{\lambda_{2}}{\lambda_{1}}\\ q\quad&\text{ if }\frac{\pi p(x)}{\overline{\pi}q(x)}≤\frac{\lambda_{1}}{\lambda_{2}}\\ \star&\text{ otherwise.}\end{cases}$ | |
| --- | --- | --- |*
* Proof*
The proof is essentially identical to that of Lemma 11.9. ∎
We now show one-shot achievability and converse results for the erasure setting. To facilitate the discussion, define the function $T:\mathcal{X}→\mathbb{R}$ as
| | $\displaystyle T(x):=\log\frac{\pi p(x)}{\overline{\pi}q(x)}.$ | |
| --- | --- | --- |
In what follows, we shall always choose $\lambda_{1}=1$ and $\lambda:=\lambda_{2}>1$ . For this choice, the optimal test in Lemma 14.2 gives the regions $R_{p}=\{x:T(x)≥\log\lambda\}$ and $R_{q}=\{x:T(x)≤-\log\lambda\}$ .
**Lemma 14.3 (Achievability)**
*Let $\tau>0$ . There exists a hypothesis test with erasures such that
| | $\displaystyle\mathbb{P}(\mathcal{E}_{\mathrm{erasure}})=\mathbb{P}(T∈(-\tau,\tau)),\quad\text{ and }\quad\mathbb{P}(\mathcal{E}_{\mathrm{undetected}})≤ e^{-\tau}.$ | |
| --- | --- | --- |*
* Proof*
Consider the optimal test in Lemma 14.2 with $\lambda_{1}=1$ and $\lambda_{2}=e^{\tau}$ . This test has an erasure probability exactly as stated in Lemma 14.3, and its undetected error probability may be upper-bounded as follows:
| | $\displaystyle\mathbb{P}(\mathcal{E}_{\mathrm{undetected}})$ | $\displaystyle=\pi\mathbb{P}_{p}(T<-\tau)+\overline{\pi}\mathbb{P}_{q}(T>\tau)≤ e^{-\tau}\overline{\pi}\mathbb{P}_{q}(T<-\tau)+e^{-\tau}\pi\mathbb{P}_{p}(T>\tau)≤ e^{-\tau},$ | |
| --- | --- | --- | --- |
where the first inequality uses $\pi p(x)≤ e^{-\tau}\overline{\pi}q(x)$ on $\{T<-\tau\}$ and $\overline{\pi}q(x)≤ e^{-\tau}\pi p(x)$ on $\{T>\tau\}$ .
∎
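For a single observation, the erasure and undetected error probabilities of this test can be computed exactly. The sketch below (illustrative distributions and a hypothetical helper name) implements the test with regions $R_{p}=\{T≥\tau\}$ and $R_{q}=\{T≤-\tau\}$ and checks the guarantee $\mathbb{P}(\mathcal{E}_{\mathrm{undetected}})≤ e^{-\tau}$ from Lemma 14.3:

```python
import math

def erasure_test_errors(p, q, pi, tau):
    """Single-observation test: declare p if T >= tau, q if T <= -tau,
    erase otherwise, where T(x) = log(pi p(x) / ((1-pi) q(x)))."""
    P_erasure = P_undetected = 0.0
    for px, qx in zip(p, q):
        T = math.log(pi * px / ((1 - pi) * qx))
        if T >= tau:        # declare p: undetected error under q
            P_undetected += (1 - pi) * qx
        elif T <= -tau:     # declare q: undetected error under p
            P_undetected += pi * px
        else:               # erase
            P_erasure += pi * px + (1 - pi) * qx
    return P_erasure, P_undetected

p = [0.7, 0.2, 0.1]
q = [0.1, 0.2, 0.7]
for tau in [0.5, 1.0, 2.0]:
    _, und = erasure_test_errors(p, q, pi=0.4, tau=tau)
    assert und <= math.exp(-tau)  # Lemma 14.3 guarantee
```

Raising $\tau$ drives the undetected error down exponentially at the price of a larger erasure region, which is exactly the trade-off analysed in the Stein regime below.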
**Lemma 14.4 (Converse)**
*Consider any hypothesis test with erasures $\phi$ . Set $\lambda_{1}=1$ and $\lambda_{2}=\lambda>1$ . Consider the weighted error term $\mathbb{P}(\mathcal{E}_{\mathrm{total}})+\lambda\mathbb{P}(\mathcal{E}_{\mathrm{undetected}})$ , which equals $\mathbb{P}(\mathcal{E}_{\mathrm{erasure}})+(1+\lambda)\mathbb{P}(\mathcal{E}_{\mathrm{undetected}})$ . The following lower bound holds for this weighted error:
| | $\displaystyle\mathbb{P}(\mathcal{E}_{\mathrm{erasure}})+(1+\lambda)\mathbb{P}(\mathcal{E}_{\mathrm{undetected}})≥\mathbb{P}\left(T∈\left(-\log\lambda,\log\lambda\right)\right).$ | |
| --- | --- | --- |*
* Proof*
We know that the weighted error of $\phi$ is at least as large as that of the optimal test $\phi^{*}$ in Lemma 14.2. Let the erasure and undetected error probabilities under the optimal test be $\mathbb{P}^{\phi^{*}}(\mathcal{E}_{\mathrm{erasure}})$ and $\mathbb{P}^{\phi^{*}}(\mathcal{E}_{\mathrm{undetected}})$ , respectively. Then
| | $\displaystyle\mathbb{P}(\mathcal{E}_{\mathrm{erasure}})+(1+\lambda)\mathbb{P}(%
\mathcal{E}_{\mathrm{undetected}})$ | $\displaystyle≥\mathbb{P}^{\phi^{*}}(\mathcal{E}_{\mathrm{erasure}})+(1+%
\lambda)\mathbb{P}^{\phi^{*}}(\mathcal{E}_{\mathrm{undetected}})$ | |
| --- | --- | --- | --- |
∎
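Because the lemma quantifies over all tests, it can be verified exhaustively in the smallest case. The sketch below, on a hypothetical single-observation Bernoulli instance (not from the paper), enumerates all nine deterministic tests $\phi:\{0,1\}\to\{p,q,\star\}$ and confirms that each weighted error dominates $\mathbb{P}(T\in(-\log\lambda,\log\lambda))$ .

```python
import math
import itertools

# Hypothetical single-observation instance: p = Bern(0.8), q = Bern(0.3), prior pi = 1/2.
a, b, pi = 0.8, 0.3, 0.5
lam = 5.0

def T(x):
    # T(x) = log( pi p(x) / (pibar q(x)) ).
    px = a if x == 1 else 1 - a
    qx = b if x == 1 else 1 - b
    return math.log(pi * px / ((1 - pi) * qx))

# Right-hand side of Lemma 14.4: P(T in (-log lam, log lam)).
rhs = sum(pi * (a if x else 1 - a) + (1 - pi) * (b if x else 1 - b)
          for x in (0, 1) if -math.log(lam) < T(x) < math.log(lam))

# Weighted error P(erasure) + (1 + lam) P(undetected) of every deterministic test.
best = float('inf')
for phi in itertools.product(('p', 'q', 'star'), repeat=2):
    erasure = undetected = 0.0
    for x in (0, 1):
        wp = pi * (a if x else 1 - a)        # joint weight of (truth p, observe x)
        wq = (1 - pi) * (b if x else 1 - b)  # joint weight of (truth q, observe x)
        if phi[x] == 'star':
            erasure += wp + wq
        elif phi[x] == 'p':
            undetected += wq                 # declared p, truth q
        else:
            undetected += wp                 # declared q, truth p
    weighted = erasure + (1 + lam) * undetected
    best = min(best, weighted)
    assert weighted >= rhs - 1e-12           # Lemma 14.4's lower bound

print(best, rhs)
```

Here $\log\lambda\approx 1.61$ exceeds both $|T(0)|\approx 1.25$ and $T(1)\approx 0.98$ , so the right-hand side equals $1$ , attained by the test that always erases. Since the weighted error is linear in the (possibly randomized) test, checking the deterministic tests suffices.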
We now provide the proof of Theorem 14.1.
* Proof of Theorem 14.1*
Without loss of generality, assume $\mathrm{KL}(p,q)<\mathrm{KL}(q,p)$ .
Case I: $\epsilon<\pi$ .
Let $\delta>0$ . We first show that $V_{\epsilon}≥\mathrm{KL}(p,q)-\delta$ . Consider the test $\phi$ with $\tau=n(\mathrm{KL}(p,q)-\delta)$ in Lemma 14.3. Note that $T(X_{1},...,X_{n})=\log\frac{\pi}{\bar{\pi}}+\sum_{i=1}^{n}\log\frac{p(X_{i})}{q(X_{i})}$ . Thus, by the weak law of large numbers,
| | $\displaystyle\mathbb{P}_{p}(\phi(X_{1},...,X_{n})=\star)$ | $\displaystyle≤\mathbb{P}_{p}\left(T(X_{1},...,X_{n})<n(\mathrm{KL}(p,q)-\delta)\right)→ 0,\quad\text{ and }$ | |
| | $\displaystyle\mathbb{P}_{q}(\phi(X_{1},...,X_{n})=\star)$ | $\displaystyle≤\mathbb{P}_{q}\left(T(X_{1},...,X_{n})>-n(\mathrm{KL}(p,q)-\delta)\right)→ 0,$ | |
| --- | --- | --- | --- |
where the second limit holds because, under $q$ , $T(X_{1},...,X_{n})/n$ converges in probability to $-\mathrm{KL}(q,p)<-(\mathrm{KL}(p,q)-\delta)$ .
Hence, for all large enough $n$ the erasure probability $\mathbb{P}(\mathcal{E}_{\mathrm{erasure}}^{n})$ is at most $\epsilon$ . By Lemma 14.3, the probability of undetected error is at most
| | $\displaystyle\mathbb{P}(\mathcal{E}_{\mathrm{undetected}}^{n})≤ e^{-\tau}=e^{-n(\mathrm{KL}(p,q)-\delta)}.$ | |
| --- | --- | --- |
This shows that $V_{\epsilon}≥\mathrm{KL}(p,q)-\delta$ . As this holds for any choice of $\delta$ , we conclude that $V_{\epsilon}≥\mathrm{KL}(p,q).$
To show the converse, suppose $E<V_{\epsilon}$ is any achievable exponent. This means for all large enough $n$ , there exist tests that guarantee $\mathbb{P}(\mathcal{E}_{\mathrm{erasure}}^{n})<\epsilon$ and $\mathbb{P}(\mathcal{E}_{\mathrm{undetected}}^{n})<e^{-nE}$ . For $0<\delta<\mathrm{KL}(q,p)-\mathrm{KL}(p,q)$ , apply the converse result in Lemma 14.4 with $\log\lambda=n(\mathrm{KL}(p,q)+\delta)$ and conclude
| | $\displaystyle\epsilon+(1+e^{n(\mathrm{KL}(p,q)+\delta)})e^{-nE}$ | $\displaystyle≥\mathbb{P}(\mathcal{E}_{\mathrm{erasure}}^{n})+(1+e^{n(\mathrm{KL}(p,q)+\delta)})\mathbb{P}(\mathcal{E}_{\mathrm{undetected}}^{n})$ | |
| | | $\displaystyle≥\mathbb{P}\left(T∈\left(-n(\mathrm{KL}(p,q)+\delta),n(\mathrm{KL}(p,q)+\delta)\right)\right).$ | |
| --- | --- | --- | --- |
We claim that as $n$ tends to infinity, the right hand side tends to $\pi$ . This is because
$$
\displaystyle\lim_{n\to\infty}\mathbb{P}\left(T\in\left(-n(\mathrm{KL}(p,q)+\delta),n(\mathrm{KL}(p,q)+\delta)\right)\right)=\lim_{n\to\infty}\pi\mathbb{P}_{p}\left(T\in\left(-n(\mathrm{KL}(p,q)+\delta),n(\mathrm{KL}(p,q)+\delta)\right)\right)+\lim_{n\to\infty}\overline{\pi}\mathbb{P}_{q}\left(T\in\left(-n(\mathrm{KL}(p,q)+\delta),n(\mathrm{KL}(p,q)+\delta)\right)\right)=\pi.
$$
Indeed, under $p$ , $T/n$ converges in probability to $\mathrm{KL}(p,q)$ , which lies inside the interval, so the first limit equals $\pi$ ; under $q$ , $T/n$ converges to $-\mathrm{KL}(q,p)$ , which lies outside the interval because $\delta<\mathrm{KL}(q,p)-\mathrm{KL}(p,q)$ , so the second limit equals $0$ .
Thus,
| | $\displaystyle\epsilon+\lim_{n→∞}e^{n(\mathrm{KL}(p,q)+\delta-E)}≥\pi.$ | |
| --- | --- | --- |
Since $\epsilon<\pi$ , the only way the above inequality can hold is if $E≤\mathrm{KL}(p,q)+\delta$ . As this holds for every achievable exponent $E$ , we have $V_{\epsilon}≤\mathrm{KL}(p,q)+\delta$ ; taking $\delta→ 0$ gives $V_{\epsilon}≤\mathrm{KL}(p,q)$ . Combining the upper and lower bounds, we conclude $V_{\epsilon}=\mathrm{KL}(p,q)$ .
Case II: $\epsilon>\pi$ .
Let $\delta>0$ . We first show that $V_{\epsilon}≥\mathrm{KL}(q,p)-\delta$ . Consider the test $\phi$ with $\tau=n(\mathrm{KL}(q,p)-\delta)$ in Lemma 14.3. Trivially, $\mathbb{P}_{p}(\phi(X_{1},...,X_{n})=\star)≤ 1$ , while by the weak law of large numbers,
| | $\displaystyle\mathbb{P}_{q}(\phi(X_{1},...,X_{n})=\star)$ | $\displaystyle≤\mathbb{P}_{q}\left(T(X_{1},...,X_{n})>-n(\mathrm{KL}(q,p)-\delta)\right)→ 0,$ | |
| --- | --- | --- | --- |
since under $q$ , $T(X_{1},...,X_{n})/n$ converges in probability to $-\mathrm{KL}(q,p)$ . Hence $\mathbb{P}(\mathcal{E}_{\mathrm{erasure}}^{n})≤\pi+\overline{\pi}\mathbb{P}_{q}(\phi(X_{1},...,X_{n})=\star)→\pi$ .
Since $\pi<\epsilon$ , we conclude that for all large enough $n$ the erasure probability $\mathbb{P}(\mathcal{E}_{\mathrm{erasure}}^{n})$ is at most $\epsilon$ . By Lemma 14.3, the probability of undetected error is at most
| | $\displaystyle\mathbb{P}(\mathcal{E}_{\mathrm{undetected}}^{n})≤ e^{-\tau}=e^{-n(\mathrm{KL}(q,p)-\delta)}.$ | |
| --- | --- | --- |
This shows that $V_{\epsilon}≥\mathrm{KL}(q,p)-\delta$ . As this holds for any choice of $\delta$ , we conclude that $V_{\epsilon}≥\mathrm{KL}(q,p).$
To show the converse, suppose $E<V_{\epsilon}$ is any achievable exponent. This means for all large enough $n$ , there exist tests that guarantee $\mathbb{P}(\mathcal{E}_{\mathrm{erasure}}^{n})<\epsilon$ and $\mathbb{P}(\mathcal{E}_{\mathrm{undetected}}^{n})<e^{-nE}$ . For any $\delta>0$ , apply the converse result in Lemma 14.4 with $\log\lambda=n(\mathrm{KL}(q,p)+\delta)$ and conclude
| | $\displaystyle\epsilon+(1+e^{n(\mathrm{KL}(q,p)+\delta)})e^{-nE}$ | $\displaystyle≥\mathbb{P}(\mathcal{E}_{\mathrm{erasure}}^{n})+(1+e^{n(\mathrm{KL}(q,p)+\delta)})\mathbb{P}(\mathcal{E}_{\mathrm{undetected}}^{n})$ | |
| | | $\displaystyle≥\mathbb{P}\left(T∈\left(-n(\mathrm{KL}(q,p)+\delta),n(\mathrm{KL}(q,p)+\delta)\right)\right).$ | |
| --- | --- | --- | --- |
We claim that as $n$ tends to infinity, the right hand side tends to $1$ . This is because
$$
\displaystyle\lim_{n\to\infty}\mathbb{P}\left(T\in\left(-n(\mathrm{KL}(q,p)+\delta),n(\mathrm{KL}(q,p)+\delta)\right)\right)=\lim_{n\to\infty}\pi\mathbb{P}_{p}\left(T\in\left(-n(\mathrm{KL}(q,p)+\delta),n(\mathrm{KL}(q,p)+\delta)\right)\right)+\lim_{n\to\infty}\overline{\pi}\mathbb{P}_{q}\left(T\in\left(-n(\mathrm{KL}(q,p)+\delta),n(\mathrm{KL}(q,p)+\delta)\right)\right)=\pi+\overline{\pi}=1.
$$
Indeed, under $p$ , $T/n$ converges in probability to $\mathrm{KL}(p,q)$ , and under $q$ it converges to $-\mathrm{KL}(q,p)$ ; both limits lie inside the interval, so the two probabilities tend to $1$ .
Thus,
| | $\displaystyle\epsilon+\lim_{n→∞}e^{n(\mathrm{KL}(q,p)+\delta-E)}≥ 1.$ | |
| --- | --- | --- |
Since $\epsilon<1$ , the only way the above inequality can hold is if $E≤\mathrm{KL}(q,p)+\delta$ . As this holds for every achievable exponent $E$ , we have $V_{\epsilon}≤\mathrm{KL}(q,p)+\delta$ ; taking $\delta→ 0$ gives $V_{\epsilon}≤\mathrm{KL}(q,p)$ . Combining the upper and lower bounds, we conclude $V_{\epsilon}=\mathrm{KL}(q,p)$ . ∎
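To see Case I concretely, the sketch below evaluates the test of Lemma 14.3 exactly (via the binomial law) on a hypothetical Bernoulli pair with $\mathrm{KL}(p,q)<\mathrm{KL}(q,p)$ : with $\tau=n(\mathrm{KL}(p,q)-\delta)$ , the erasure probability vanishes as $n$ grows while the undetected error respects the $e^{-\tau}$ bound. All parameter choices are illustrative.

```python
import math

# Hypothetical instance: p = Bern(0.1), q = Bern(0.5), prior pi = 1/2.
# Then KL(p, q) ~= 0.368 < KL(q, p) ~= 0.511, matching Case I's assumption.
a, b, pi = 0.1, 0.5, 0.5
kl_pq = a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))
delta = 0.1

def log_binom_pmf(n, k, mu):
    # log P(Binomial(n, mu) = k), computed in log space to avoid underflow.
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(mu) + (n - k) * math.log(1 - mu))

for n in (50, 200, 800):
    tau = n * (kl_pq - delta)
    erasure = undetected = 0.0
    for k in range(n + 1):  # T depends on the sample only through k = number of ones
        T = (math.log(pi / (1 - pi))
             + k * math.log(a / b) + (n - k) * math.log((1 - a) / (1 - b)))
        wp = pi * math.exp(log_binom_pmf(n, k, a))        # weight under truth p
        wq = (1 - pi) * math.exp(log_binom_pmf(n, k, b))  # weight under truth q
        if -tau < T < tau:
            erasure += wp + wq
        elif T >= tau:
            undetected += wq      # declared p, truth q
        else:
            undetected += wp      # declared q, truth p
    print(n, erasure, undetected, math.exp(-tau))
```

The printed rows show the erasure probability shrinking toward zero while the undetected error stays below the Lemma 14.3 bound $e^{-\tau}$ at every $n$ .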
14.2 Chernoff and Bayes regimes in the erasure setting
The results in these two regimes rely on the achievability and converse results in Section 14.1 and Cramér’s theorem from large-deviations theory. Throughout this section, we shall assume that $p$ and $q$ are absolutely continuous with respect to each other. Since they are assumed to have discrete support, the moment generating function of $L(X)=\log p(X)/q(X)$ is finite for all $t∈\mathbb{R}$ , whether $X\sim p$ or $X\sim q$ . We first state Cramér’s theorem:
**Theorem 14.5 (Cramér’s theorem)**
*Let $X_{1},X_{2},...$ be a sequence of i.i.d. random variables whose logarithmic moment generating function
$$
\Lambda(t)=\log\mathbb{E}\left[\exp(tX_{1})\right]
$$
is finite for all $t∈\mathbb{R}$ . Then the Legendre transform of $\Lambda$ , defined by
$$
\Lambda^{*}(x):=\sup_{t\in\mathbb{R}}\left(tx-\Lambda(t)\right),
$$
satisfies
$$
\lim_{n\to\infty}\frac{1}{n}\log\left(\mathbb{P}\left(\sum_{i=1}^{n}X_{i}\geq nx\right)\right)=-\Lambda^{*}(x)
$$
for all $x>\mathbb{E}[X_{1}]$ .*
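As a concrete instance (hypothetical, for illustration only): for Bernoulli variables, $\Lambda(t)=\log(1-\mu+\mu e^{t})$ and $\Lambda^{*}(x)$ evaluates to the binary KL divergence $\mathrm{KL}(\mathrm{Bern}(x),\mathrm{Bern}(\mu))$ , so the theorem can be checked against exact binomial tail probabilities:

```python
import math

# X_i ~ Bern(mu); compare -(1/n) log P(sum X_i >= n x) with Lambda*(x).
mu, x = 0.3, 0.5

def rate(x, mu):
    # Lambda*(x) = KL(Bern(x), Bern(mu)) for x in (mu, 1).
    return x * math.log(x / mu) + (1 - x) * math.log((1 - x) / (1 - mu))

def log_tail(n, k, mu):
    # log P(Binomial(n, mu) >= k), computed stably in log space.
    logs = [math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
            + j * math.log(mu) + (n - j) * math.log(1 - mu)
            for j in range(k, n + 1)]
    m = max(logs)
    return m + math.log(sum(math.exp(l - m) for l in logs))

vals = []
for n in (100, 400, 1600):
    emp = -log_tail(n, math.ceil(n * x), mu) / n
    vals.append(emp)
    print(n, emp, rate(x, mu))  # empirical rate approaches Lambda*(x) ~= 0.0872
```

The Chernoff bound gives $\mathbb{P}(\sum_{i}X_{i}\geq nx)\leq e^{-n\Lambda^{*}(x)}$ for every $n$ , so the empirical rate approaches the limit from above.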
Our goal is to identify the pairs of possible exponential decay rates of $\mathbb{P}(\mathcal{E}_{\mathrm{erasure}}^{n})$ and $\mathbb{P}(\mathcal{E}_{\mathrm{undetected}}^{n})$ , denoted by $A$ and $B$ . Note that the decay rate of $\mathbb{P}(\mathcal{E}_{\mathrm{total}}^{n})$ is the minimum of these two. Define
$$
B^{*}(A)\triangleq\sup\left\{B:\exists n_{0},\forall n\geq n_{0},\exists\phi\text{ such that }\mathbb{P}(\mathcal{E}_{\mathrm{erasure}}^{n})<\exp(-nA),\ \mathbb{P}(\mathcal{E}_{\mathrm{undetected}}^{n})<\exp(-nB)\right\}.
$$
**Theorem 14.6 (Chernoff regime)**
*Let $p$ and $q$ be supported on a finite set $\mathcal{X}$ , and assume they are absolutely continuous with respect to each other. For $t∈\mathbb{R}$ , define
| | $\displaystyle\psi(t)$ | $\displaystyle=\log\sum_{x∈\mathcal{X}}p(x)^{1-t}q(x)^{t}.$ | |
| --- | --- | --- | --- |
Let $\psi^{*}(·)$ denote the Legendre transform of $\psi$ . Then the upper boundary of the set of all achievable pairs $(A,B)$ is characterized by the curve $(A(\theta),B(\theta))$ , where
| | $\displaystyle A(\theta)$ | $\displaystyle=\min\{\psi^{*}(-\theta),\psi^{*}(\theta)-\theta\},\quad\text{ and }\quad B(\theta)=A(\theta)+\theta,$ | |
| --- | --- | --- | --- |
for $\theta∈[0,\min\{\mathrm{KL}(q,p),\mathrm{KL}(p,q)\}]$ .*
The following corollary is an immediate consequence of Theorem 14.6:
**Corollary 14.7**
*Let the optimal Bayes error exponent be
| | $\displaystyle C_{\lambda}:=\sup\{E~{}:~{}∃ n_{0},∀ n>n_{0},∃\phi\text{ such that }\mathbb{P}(\mathcal{E}_{\mathrm{erasure}}^{n})+\lambda\mathbb{P}(\mathcal{E}_{\mathrm{undetected}}^{n})<e^{-nE}\}.$ | |
| --- | --- | --- |
Then for all $\lambda>0$ ,
$$
C_{\lambda}=\max_{\theta\in[0,\min\{\mathrm{KL}(p,q),\mathrm{KL}(q,p)\}]}\min\{A(\theta),B(\theta)\}.
$$*
We note that the expressions for $A(\theta)$ and $B(\theta)$ were derived in Sason [Sas11], also via Cramér’s theorem, but only in the achievability direction. The converse, which states that the pairs $(A(\theta),B(\theta))$ form the upper boundary of the achievable region, was not established in prior work.
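To make the boundary tangible, the sketch below numerically computes $\psi$ , a grid-based Legendre transform $\psi^{*}$ , the pairs $(A(\theta),B(\theta))$ with $B(\theta)=A(\theta)+\theta$ as in the proof below, and the corollary's expression for $C_{\lambda}$ . The three-letter distributions are hypothetical choices for illustration.

```python
import math

# Hypothetical distributions on a three-letter alphabet.
p = [0.6, 0.3, 0.1]
q = [0.1, 0.3, 0.6]

def psi(t):
    # psi(t) = log sum_x p(x)^{1-t} q(x)^t.
    return math.log(sum(px ** (1 - t) * qx ** t for px, qx in zip(p, q)))

T_GRID = [i / 1000 for i in range(-2000, 3001)]  # crude grid for the sup over t

def psi_star(theta):
    # Numerical Legendre transform: psi*(theta) = sup_t (t * theta - psi(t)).
    return max(t * theta - psi(t) for t in T_GRID)

kl_pq = sum(px * math.log(px / qx) for px, qx in zip(p, q))
kl_qp = sum(qx * math.log(qx / px) for px, qx in zip(p, q))
theta_max = min(kl_pq, kl_qp)

thetas = [theta_max * i / 20 for i in range(21)]
A = [min(psi_star(-th), psi_star(th) - th) for th in thetas]
B = [a + th for a, th in zip(A, thetas)]

# Corollary's expression: C = max_theta min{A(theta), B(theta)}.
C = max(min(a, b) for a, b in zip(A, B))
print(A[0], C)  # for this instance the max is attained at theta = 0, since A decreases
```

Since $B(\theta)=A(\theta)+\theta\geq A(\theta)$ on the boundary, the inner minimum is $A(\theta)$ , and for this instance the maximum is attained at $\theta=0$ .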
* Proof of Theorem 14.6*
Let $\theta∈[0,\min\{\mathrm{KL}(p,q),\mathrm{KL}(q,p)\}]$ . Consider the following test $\phi$ :
| | $\displaystyle\phi(x_{1},...,x_{n})$ | $\displaystyle=\begin{cases}p&\text{ if }\sum_{i=1}^{n}\log p(x_{i})/q(x_{i})≥ n\theta,\\ q&\text{ if }\sum_{i=1}^{n}\log p(x_{i})/q(x_{i})≤-n\theta,\\ \star&\text{ if }\sum_{i=1}^{n}\log p(x_{i})/q(x_{i})∈(-n\theta,n\theta).\end{cases}$ | |
| --- | --- | --- | --- |
The probabilities of errors are given by
| | $\displaystyle\mathbb{P}(\mathcal{E}_{\mathrm{erasure}}^{n})=\pi\mathbb{P}_{p}\left(\sum_{i=1}^{n}\log p(X_{i})/q(X_{i})∈(-n\theta,n\theta)\right)+\overline{\pi}\mathbb{P}_{q}\left(\sum_{i=1}^{n}\log p(X_{i})/q(X_{i})∈(-n\theta,n\theta)\right),$ | |
| --- | --- | --- |
and
| | $\displaystyle\mathbb{P}(\mathcal{E}_{\mathrm{undetected}}^{n})=\pi\mathbb{P}_{p}\left(\sum_{i=1}^{n}\log p(X_{i})/q(X_{i})≤-n\theta\right)+\overline{\pi}\mathbb{P}_{q}\left(\sum_{i=1}^{n}\log p(X_{i})/q(X_{i})≥ n\theta\right).$ | |
| --- | --- | --- |
A straightforward application of Cramér’s theorem yields
| | $\displaystyle\lim_{n→∞}-\frac{1}{n}\log\mathbb{P}(\mathcal{E}_{\mathrm{erasure}}^{n})$ | $\displaystyle=A(\theta),\quad\text{ and }$ | |
| | $\displaystyle\lim_{n→∞}-\frac{1}{n}\log\mathbb{P}(\mathcal{E}_{\mathrm{undetected}}^{n})$ | $\displaystyle=B(\theta)=A(\theta)+\theta.$ | |
| --- | --- | --- | --- |
This concludes the achievability part of the proof. To show the converse, suppose $\mathbb{P}(\mathcal{E}_{\mathrm{erasure}}^{n})<e^{-nA^{\prime}}$ and $\mathbb{P}(\mathcal{E}_{\mathrm{undetected}}^{n})<e^{-nB^{\prime}}$ for some $A^{\prime}$ and $B^{\prime}$ , for all large enough $n$ . Applying Lemma 14.4,
| | $\displaystyle e^{-nA^{\prime}}+(1+\lambda)e^{-nB^{\prime}}$ | $\displaystyle>\mathbb{P}(\mathcal{E}_{\mathrm{erasure}}^{n})+(1+\lambda)\mathbb{P}(\mathcal{E}_{\mathrm{undetected}}^{n})≥\mathbb{P}\left(T∈\left(-\log\lambda,\log\lambda\right)\right).$ | |
| --- | --- | --- | --- |
Choosing $\lambda=e^{n\theta}$ and using Cramér’s theorem again, we derive the lower bound
$$
\displaystyle e^{-nA^{\prime}}+e^{-n(B^{\prime}-\theta)}>\pi\mathbb{P}_{p}\left(\sum_{i=1}^{n}\log\frac{p(X_{i})}{q(X_{i})}\in\left(-n\theta-\log\pi/\overline{\pi},n\theta-\log\pi/\overline{\pi}\right)\right)+\overline{\pi}\mathbb{P}_{q}\left(\sum_{i=1}^{n}\log\frac{p(X_{i})}{q(X_{i})}\in\left(-n\theta-\log\pi/\overline{\pi},n\theta-\log\pi/\overline{\pi}\right)\right)=e^{-n(A(\theta)+o(1))}. \tag{1}
$$
The only way this inequality can hold for all large $n$ is when either $A^{\prime}≤ A(\theta)$ or $B^{\prime}≤ A(\theta)+\theta=B(\theta)$ , which concludes the proof of the converse. ∎