We are given $n$ observations from a Gaussian sequence model, $Y_i = \mu_i + \xi_i$, where the $\xi_i$ are independent and $\xi_i \sim N(0, \sigma^2)$. Suppose the vector $\mu$ belongs to a bounded convex set $K$. We want to obtain the minimax rate of this problem under the $\ell_2$ loss, i.e., we want matching upper and lower bounds on $\inf_{\hat{\nu}}\sup_{\mu \in K}\mathbb{E}\lVert\hat{\nu}(Y) - \mu\rVert_2^2$, where the infimum runs over all estimators $\hat{\nu}$ of $\mu$ based on the data.
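To make the setup concrete, here is a minimal simulation sketch (my own illustration, not from the paper; the choice of $K$ as a Euclidean ball of radius 5 is an arbitrary assumption for the example):

```python
import numpy as np

def sample_gsm(mu, sigma, rng):
    """Draw one observation Y = mu + xi with xi ~ N(0, sigma^2 I)."""
    return mu + rng.normal(scale=sigma, size=mu.shape)

rng = np.random.default_rng(0)
n, sigma = 50, 1.0

# Hypothetical convex constraint set K: the Euclidean ball of radius 5.
mu = rng.normal(size=n)
mu *= min(1.0, 5.0 / np.linalg.norm(mu))  # pull mu into K if needed

Y = sample_gsm(mu, sigma, rng)
print(np.linalg.norm(Y - mu))  # noise magnitude, typically about sigma * sqrt(n)
```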
Prerequisites
There are two important concepts with which I assume familiarity. The first is the packing of a set in a metric space, for which Chapter 5 of (Wainwright, 2019) is a good resource. The second is measurability of a function, which is used in the subsection Measurability to prove the validity of the proposed estimator; that subsection can be skipped if the reader is not familiar with measure theory.
The paper’s main contribution is in showing that the minimax rate is $(\varepsilon^*)^2 \wedge \operatorname{diam}(K)^2$, where $\varepsilon^*$ is a parameter based on the local entropy of $K$, which is defined as follows.
The local entropy, denoted $M^{loc}(\varepsilon)$, is the largest cardinality of an $\frac{\varepsilon}{c}$-packing ($c > 0$) of the intersection of $K$ with a ball $B(\theta, \varepsilon)$, maximized over centers $\theta \in K$:
\begin{align*}
\boxed{M^{loc}(\varepsilon) := \sup_{\theta \in K} M \left(\frac{\varepsilon}{c}, B(\theta, \varepsilon) \cap K \right)},
\end{align*}
where $M(\delta, T)$ is the $\delta$-packing number of a set $T$.
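Packing numbers are rarely available in closed form, but a greedy pass over sampled points of $K \cap B(\theta, \varepsilon)$ produces a valid $\frac{\varepsilon}{c}$-packing and hence a lower bound on $M^{loc}(\varepsilon)$. A minimal Monte Carlo sketch (my own illustration; the sampler for $K$ and the choice of $K$ as a ball are assumptions):

```python
import numpy as np

def greedy_packing(points, delta):
    """Greedily keep points that are >= delta away from all kept points;
    the result is a maximal delta-packing of the sampled set."""
    selected = []
    for p in points:
        if all(np.linalg.norm(p - q) >= delta for q in selected):
            selected.append(p)
    return selected

def local_entropy_lower_bound(theta, eps, c, sample_K, n_samples=20000):
    """Estimate log M(eps/c, B(theta, eps) ∩ K) from samples of K."""
    pts = sample_K(n_samples)
    pts = pts[np.linalg.norm(pts - theta, axis=1) <= eps]  # restrict to the ball
    return np.log(len(greedy_packing(pts, eps / c)))

# Example: K is the unit ball in R^3, sampled uniformly.
rng = np.random.default_rng(1)
def sample_ball(m, dim=3):
    x = rng.normal(size=(m, dim))
    r = rng.random(m) ** (1 / dim)  # radius transform for uniformity in the ball
    return x * (r / np.linalg.norm(x, axis=1))[:, None]

print(local_entropy_lower_bound(np.zeros(3), eps=0.5, c=4.0, sample_K=sample_ball))
```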
List of Sections
In Section 1, we will see that it is straightforward to get a lower bound on the minimax rate using Fano's inequality, provided the local entropy is sufficiently large.
Section 2 contains the main results on the minimax upper bound, where we describe the algorithm that produces the estimator $\nu^*$. We then check the validity of this estimator by showing that $\nu^*$ is a measurable function.
In the subsequent sections, we will see that if $\nu^*$ is selected by the algorithm defined in Section 2, then with high probability $\nu^*(Y)$ is close to $\mu$. This is done by showing that the binary hypothesis test that picks an estimate based on its distance to $Y$ has low Type 1 and Type 2 errors. Finally, we show an upper bound on $\mathbb{E}\lVert\nu^*(Y) - \mu\rVert_2^2$. This result, together with geometric properties of the local entropy, gives the minimax rate.
1. The Minimax Lower Bound
(1) The minimax risk has a lower bound of $\frac{\varepsilon^2}{8c^2}$ for any $\varepsilon > 0$ such that $\log M^{loc}(\varepsilon) > 4\left(\frac{\varepsilon^2}{2\sigma^2} \vee \log 2\right)$.
Let $\mu^1, \mu^2, \ldots, \mu^m$ be the points of an $\frac{\varepsilon}{c}$-packing of $K \cap B(\theta, \varepsilon)$, so that $m = M^{loc}(\varepsilon)$ for the worst-case $\theta$. Suppose $J$ is uniformly distributed over the index set $[m]$, and the conditional distribution of $Y$ given $J = j$ equals the distribution of $\mu^j + \xi$, where $\xi \sim N(0, \sigma^2 I)$.
To apply Fano's inequality, we need an upper bound on the mutual information between $Y$ and $J$. Note that $I(Y;J) \le \frac{1}{m}\sum_{j=1}^m D_{KL}(P_{\mu^j}\,\|\,P_\nu)$ for any $\nu \in \mathbb{R}^n$, where $P_x$ is the Gaussian measure centered at $x$ with covariance $\sigma^2 I$. For any pair $(\mu^j, \nu)$,
$D_{KL}(P_{\mu^j}\,\|\,P_\nu) = \frac{\lVert\mu^j - \nu\rVert_2^2}{2\sigma^2}$.
And thus,
\begin{align*}
I(Y;J) \le \frac{1}{m}\sum_{j=1}^m \frac{\lVert\mu^j - \nu\rVert_2^2}{2\sigma^2} &\le \max_{j} \frac{\lVert\mu^j - \nu\rVert_2^2}{2\sigma^2}.
\end{align*}
Now, choosing $\nu = \theta$, the maximum distance between $\mu^j$ and $\nu$ is at most $\varepsilon$, since every $\mu^j$ lies in $B(\theta, \varepsilon)$. The points $\mu^j$ are $\frac{\varepsilon}{c}$-separated, so applying Fano's inequality we get the following:
\begin{align*}
\inf_{\hat{\nu}}\sup_{\mu}\mathbb{E}\lVert\hat{\nu}(Y) - \mu\rVert_2^2 &\ge \frac{\varepsilon^2}{4c^2}\left(1-\frac{\frac{\varepsilon^2}{2\sigma^2} + \log 2}{\log M^{loc}(\varepsilon)}\right).
\end{align*}
To conclude, if $\log M^{loc}(\varepsilon) > 2\left(\frac{\varepsilon^2}{2\sigma^2} + \log 2\right)$ (which the condition of the lemma implies, since $a + b \le 2(a \vee b)$), we get a lower bound of $\frac{\varepsilon^2}{4c^2}\cdot\left(1-\frac{1}{2}\right) = \frac{\varepsilon^2}{8c^2}$.
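For reference, the version of Fano's inequality applied above (see, e.g., Chapter 15 of Wainwright, 2019): for $m$ hypotheses that are pairwise $\frac{\varepsilon}{c}$-separated,
\begin{align*}
\inf_{\hat{\nu}}\sup_{\mu}\mathbb{E}\lVert\hat{\nu}(Y) - \mu\rVert_2^2 &\ge \left(\frac{\varepsilon}{2c}\right)^2\left(1-\frac{I(Y;J) + \log 2}{\log m}\right),
\end{align*}
so the separation of the packing supplies the $\frac{\varepsilon^2}{4c^2}$ factor, and the mutual information term is controlled by the KL bound computed above.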
In the next sections, we will see how the condition of lemma 1 will be satisfied.
2. The Estimation Algorithm
Suppose $d$ is the diameter of the set $K$. The estimator is returned at the end of the following process:
Begin with an arbitrary $\nu^* \in K$ and set $c = 2(C+1)$
Repeat for all $k \in \mathbb{N}$:
Construct a maximal $\frac{d}{2^k(C+1)}$-packing, $M_k$, of the set $K \cap B\left(\nu^*, \frac{d}{2^{k-1}}\right)$
Find the closest point to $Y$ in $M_k$ and reassign it as $\nu^*$
This algorithm picks an arbitrary point $\nu^* \in K$ and constructs a $\frac{d}{2(C+1)}$-packing of the intersection of $K$ with a ball of radius $d$, which at the start is all of $K$. Then it finds the point in this packing closest to $Y$ in the $\lVert\cdot\rVert_2$ norm.
Now the entire process is repeated with the ball centered at the previous closest point, with the radius of the ball and the packing scale halved.
There are a couple of points to note in this algorithm.
First, such a process is not implementable as stated, because the loop runs over all $k \in \mathbb{N}$ and never terminates.
Second, any point $Y \in K$ can be reached through this algorithm. If the point of $K$ appears in the packing at some level, we are done. Otherwise, $Y$ can be reached in the limit, since the union of all such packings is a countable dense subset of $K$: countable because it is a union of countably many finite sets, and dense because for any $\delta > 0$ and $Y \in K$, the ball $B(Y, \delta)$ contains at least one point of a maximal packing at scale $\frac{\delta}{2}$ (a maximal packing is also a covering).
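To make the procedure concrete, here is a truncated sketch (my own illustration, not the paper's implementation: the packings are built greedily from Monte Carlo samples of $K$, and the loop over all $k \in \mathbb{N}$ is cut off at a finite number of levels):

```python
import numpy as np

def greedy_packing(points, delta):
    """Greedy maximal delta-packing of a finite sample."""
    selected = []
    for p in points:
        if all(np.linalg.norm(p - q) >= delta for q in selected):
            selected.append(p)
    return selected

def iterative_packing_estimator(Y, sample_K, d, C=3.0, levels=20, n_samples=5000):
    """Truncated version of the iterative packing algorithm."""
    nu = sample_K(1)[0]                    # arbitrary starting point of K
    for k in range(1, levels + 1):
        radius = d / 2 ** (k - 1)          # ball radius at level k
        scale = d / (2 ** k * (C + 1))     # packing scale d / (2^k (C + 1))
        pts = sample_K(n_samples)
        pts = pts[np.linalg.norm(pts - nu, axis=1) <= radius]
        Mk = greedy_packing(pts, scale)
        if Mk:                             # reassign nu to the packing point closest to Y
            nu = min(Mk, key=lambda p: np.linalg.norm(Y - p))
    return nu

# Usage with K the unit ball in R^2 (diameter d = 2).
rng = np.random.default_rng(2)
def sample_ball(m, dim=2):
    x = rng.normal(size=(m, dim))
    r = rng.random(m) ** (1 / dim)
    return x * (r / np.linalg.norm(x, axis=1))[:, None]

mu = np.array([0.3, -0.4])
Y = mu + 0.1 * rng.normal(size=2)
print(iterative_packing_estimator(Y, sample_ball, d=2.0))
```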
Measurability
We argue that $\nu^*$ is a valid statistic by showing that it is a measurable function of the data. Let $\Upsilon_k : \mathbb{R}^n \to \mathbb{R}^n$ return the estimator at level $k$; that is, $\Upsilon_k(y)$ is the point of $M_k$ closest to $y$, where $M_k$ is the packing built around the level-$(k-1)$ point $\Upsilon_{k-1}(y)$.
(2) The function $\nu^* : \mathbb{R}^n \to \mathbb{R}^n$ is measurable with respect to the Borel $\sigma$-algebra.
First, consider the event that $\Upsilon_k(y)$ belongs to a packing set $M$. For any point $m \in M$, the set $\{y : \Upsilon_k(y) = m\}$ is the set of all $y$ such that $\Upsilon_k(y) \in M$ and $y$ belongs to the cell of $m$ in the Voronoi tessellation generated by $M$.
This cell is a convex polytope, and a convex polytope is a countable combination of closed sets, making the inverse image of a singleton, $\Upsilon_k^{-1}(\{m\})$, a Borel set.
Now, since $M_k$ is a packing of a bounded set, it is a finite set. This implies that $\Upsilon_k$ takes finitely many values, and since the $\sigma$-algebra on its range is generated by singletons, $\Upsilon_k$ is a measurable function.
Note that for each $y$ the sequence $\{\Upsilon_k(y)\}$ converges because it is Cauchy: for any $k_1 < k_2$,
\begin{align*}
\lVert\Upsilon_{k_1} - \Upsilon_{k_2}\rVert_2 &= \lVert\Upsilon_{k_1} - \Upsilon_{k_1 + 1} + \Upsilon_{k_1+1} - \ldots - \Upsilon_{k_2}\rVert_2\\
&\le \sum_{i=k_1}^{k_2-1}\lVert\Upsilon_{i} - \Upsilon_{i+1}\rVert_2\\
&\le \sum_{i=k_1}^{k_2 -1}\frac{d}{2^{i}} \le \frac{d}{2^{k_1 - 1}}
\end{align*}
The inequalities above follow from the triangle inequality and the fact that $\Upsilon_{i+1}$ lies in the ball $B\left(\Upsilon_i, \frac{d}{2^{i}}\right)$ at level $i+1$.
Note that our estimator is attained in the limit, i.e.,
\begin{align*}
\nu^* &= \lim_{k \to \infty }\Upsilon_k = \limsup_{k \to \infty} \Upsilon_k.
\end{align*}
So, to see that $\nu^*$ is measurable, it suffices to show that for any closed box $B$, $(\nu^*)^{-1}(B)$ is a Borel set.
If $B_j^L$ and $B_j^U$ are the lower and upper bounds of $B$ in the $j$th coordinate, then our problem reduces to proving that $\bigcap_{j=1}^n \left\{y : \limsup_{k \to \infty}\Upsilon_k^j(y) \in [B_j^L, B_j^U]\right\}$ is Borel.
Each set in the above intersection can be written as follows.
\begin{align*}
\{y: \limsup_{k \to \infty} \Upsilon_k^j (y) \in [B_j^L, B_j^U]\} &= \{y : \limsup_{k \to \infty} \Upsilon_k^j(y) \ge B_j^L\} \cap \{y:\limsup_{k \to \infty} \Upsilon_k^j(y) \le B_j^U\} \\
&=: P \cap Q
\end{align*}
From the characterization of the limit superior, $P = \bigcap_{l=1}^\infty\bigcap_{k=1}^\infty\bigcup_{i=k}^\infty\left\{y : \Upsilon_i^j(y) > B_j^L - \frac{1}{l}\right\}$, which is measurable since each $\Upsilon_i$ is measurable and this is a countable combination of measurable sets.
For $Q$, we use a property of the limsup: if $a > \limsup_{k \to \infty}\Upsilon_k^j(y)$, then there exists an $N \in \mathbb{N}$ such that $\Upsilon_k^j(y) < a$ for all $k \ge N$. Applying this with $a = B_j^U + \frac{1}{l}$ for every $l$ translates in set notation to $Q = \bigcap_{l=1}^\infty\bigcup_{k=1}^\infty\bigcap_{i=k}^\infty\left\{y : \Upsilon_i^j(y) < B_j^U + \frac{1}{l}\right\}$, which is again measurable.
Mapping to the Binary Hypothesis Testing Problem
We know from the previous section that $\nu^*$ is close to $Y$. To understand whether it is also close to $\mu$, we pose this as a binary hypothesis testing problem and show that, within a $\delta$-packing, the Type 1 and Type 2 errors are small.
(3) Consider the hypotheses $H_0: \mu = \nu_1$ and $H_1: \mu = \nu_2$ for $\lVert\nu_1 - \nu_2\rVert_2 \ge C\delta$, where $C > 2$. Then the test $\psi(Y) = \mathbb{1}_{\lVert Y - \nu_1\rVert_2 > \lVert Y - \nu_2\rVert_2}$ satisfies the following:
\begin{align*}
\max \left(\sup_{\mu:\lVert\mu - \nu_1 \rVert_2 \le \delta} \mathbb{P}_\mu(\psi=1) ,\sup_{\mu:\lVert \mu -\nu_2 \rVert_2 \le \delta}\mathbb{P}_\mu(\psi=0)\right)\le \exp \left(- \frac{(C-2)^2\delta^2}{8\sigma^2}\right).
\end{align*}
We demonstrate the result for Type 1 error, and the same argument holds for Type 2.
A Type 1 error occurs when $\lVert\mu - \nu_1\rVert_2 \le \delta$ but $\psi(Y) = 1$, which holds if $\lVert Y - \nu_1\rVert_2^2 - \lVert Y - \nu_2\rVert_2^2 \ge 0$, i.e., $Y$ is closer to $\nu_2$.
Next, note that the random variable $Z := \lVert Y - \nu_1\rVert_2^2 - \lVert Y - \nu_2\rVert_2^2$ is normally distributed, so the error can be bounded by the tail probability of a Gaussian random variable. Expanding the squares with $Y = \mu + \xi$ gives
$Z = \lVert\nu_1\rVert_2^2 - \lVert\nu_2\rVert_2^2 + 2(\mu + \xi)^T(\nu_2 - \nu_1)$. Letting $\mu = \nu_1 + \eta$ with $\lVert\eta\rVert_2 \le \delta$ gives $Z = \lVert\nu_1\rVert_2^2 - \lVert\nu_2\rVert_2^2 + 2(\nu_1 + \eta + \xi)^T(\nu_2 - \nu_1)$, which equals $-\lVert\nu_1 - \nu_2\rVert_2^2 + 2(\eta + \xi)^T(\nu_2 - \nu_1)$.
Now, by the Cauchy-Schwarz inequality, $\eta^T(\nu_2 - \nu_1) \le \lVert\eta\rVert_2\lVert\nu_2 - \nu_1\rVert_2 \le \delta\lVert\nu_2 - \nu_1\rVert_2$.
From the separation assumption, $\delta \le \lVert\nu_1 - \nu_2\rVert_2/C$, giving that
$\eta^T(\nu_2 - \nu_1) \le \frac{\lVert\nu_1 - \nu_2\rVert_2^2}{C}$. Since $\xi \sim N(0, \sigma^2 I)$, by the stability of Gaussians, $Z \sim N(\theta, 4\sigma^2\lVert\nu_1 - \nu_2\rVert_2^2)$, where $\theta \le \left(\frac{2}{C} - 1\right)\lVert\nu_1 - \nu_2\rVert_2^2$.
Finally, applying the tail probability bound on Gaussians concludes the proof.
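A quick Monte Carlo sanity check of the lemma (my own sketch; the dimension and points are arbitrary, and $\mu$ is placed at the worst-case position, pushed from $\nu_1$ toward $\nu_2$):

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma, delta, C = 20, 1.0, 0.5, 4.0

nu1 = np.zeros(n)
nu2 = np.zeros(n); nu2[0] = C * delta  # so that ||nu1 - nu2||_2 = C * delta

eta = np.zeros(n); eta[0] = delta      # worst case: mu shifted toward nu2
mu = nu1 + eta                         # mu lies in the null ball {||mu - nu1|| <= delta}

Y = mu + sigma * rng.normal(size=(100000, n))
psi = np.linalg.norm(Y - nu1, axis=1) > np.linalg.norm(Y - nu2, axis=1)

print("empirical Type 1 error:", psi.mean())
print("bound:", np.exp(-(C - 2) ** 2 * delta ** 2 / (8 * sigma ** 2)))
```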
We now show that the above result extends to all points of a maximal $\delta$-packing $M$.
(4) Let $i^* = \operatorname{argmin}_{i \in M}\lVert Y - \nu_i\rVert_2$. Then, with probability at least $1 - \lvert M\rvert \cdot \exp\left(-\frac{(C-2)^2\delta^2}{8\sigma^2}\right)$, we have
$\lVert\nu_{i^*} - \mu\rVert_2 \le (C+1)\delta$.
The ingredients of this proof are the triangle inequality, characterizing the event as the Type 1 error of the test in lemma 3, and a union bound.
First, the author defines an intermediate random variable $T_i$: the maximum distance between $\nu_i$ and another point $\nu_j$ of the packing set such that $\nu_j$ is closer to $Y$ than $\nu_i$, i.e., $T_i = \max_j \lVert\nu_i - \nu_j\rVert_2$ over all $j$ with $\lVert Y - \nu_i\rVert_2 \ge \lVert Y - \nu_j\rVert_2$ and $\lVert\nu_i - \nu_j\rVert_2 \ge C\delta$. If no such $j$ exists, then $T_i = 0$.
Analysing the probability in the statement, we can see that
\begin{align*}
\mathbb{P}\{\lVert \nu_{i^*} - \mu \rVert_2\ge(C+1)\delta \} &\le \mathbb{P}\{\lVert\nu_{i^*} - \nu_i\rVert_2 + \lVert \nu_i - \mu\rVert_2 \ge (C+1)\delta\}.
\end{align*}
Since $M$ is a maximal $\delta$-packing, it is also a $\delta$-covering, and thus $\lVert\mu - \nu_i\rVert_2 \le \delta$ for some $i$. So the above probability is at most $\mathbb{P}\{\lVert\nu_{i^*} - \nu_i\rVert_2 + \delta \ge (C+1)\delta\} = \mathbb{P}\{i^* \in \{j : \lVert\nu_i - \nu_j\rVert_2 \ge C\delta\}\}$. Now, if $i^*$ belongs to the set of all $j$ with $\nu_j$ at least $C\delta$ away from $\nu_i$, then $T_i \ge C\delta > 0$, since $\nu_{i^*}$ is by definition at least as close to $Y$ as $\nu_i$. The probability that $T_i > 0$ is the probability that there exists a $j$ satisfying the conditions in the definition of $T_i$, and for each such $j$ this is exactly the probability of a Type 1 error in lemma 3. A union bound over all $j \in M$ gives the desired result.
Before proving the rate, let us briefly state a result that will be used heavily in the next two theorems.
(5) The function $\varepsilon \to M^{loc}(\varepsilon)$ is monotone non-increasing.
The above can be proved by showing that, for any $\varepsilon' < \varepsilon$, every $\frac{\varepsilon}{c}$-packing of $K \cap B(\theta, \varepsilon)$ can be mapped, via an affine contraction toward $\theta$, to an $\frac{\varepsilon'}{c}$-packing of $K \cap B(\theta, \varepsilon')$ of the same cardinality; convexity of $K$ ensures the contracted points stay in $K$.
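Explicitly, if $\{x_1, \ldots, x_m\}$ is an $\frac{\varepsilon}{c}$-packing of $K \cap B(\theta, \varepsilon)$, the contraction $T(x) = \theta + \frac{\varepsilon'}{\varepsilon}(x - \theta)$ maps each $x_i$ into $K \cap B(\theta, \varepsilon')$ (convexity of $K$ guarantees $T(x_i) \in K$), and
\begin{align*}
\lVert T(x_i) - T(x_j)\rVert_2 &= \frac{\varepsilon'}{\varepsilon}\lVert x_i - x_j\rVert_2 \ge \frac{\varepsilon'}{c},
\end{align*}
so the images form an $\frac{\varepsilon'}{c}$-packing of $K \cap B(\theta, \varepsilon')$ and $M^{loc}(\varepsilon') \ge M^{loc}(\varepsilon)$.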
The next theorem is important to get the $(\varepsilon^*)^2$ part of the rate.
(6) If $\nu^*$ is returned by the previously defined algorithm, then $\mathbb{E}\lVert\mu - \nu^*(Y)\rVert_2^2 \le \bar{C}(\varepsilon^*)^2$,
where $\varepsilon^* = \varepsilon_J$ for the maximal $J \in \mathbb{N}$ such that $\varepsilon_J = \frac{d}{2^{J-2}}\cdot\frac{c/2-3}{c}$ satisfies $\frac{\varepsilon_J^2}{\sigma^2} > 16\log M^{loc}\left(\frac{\varepsilon_J\, c}{c/2-3}\right) \vee 16\log 2$, and $\varepsilon^* = \varepsilon_1$ if no such $J$ exists.
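The definition is easy to operationalize once an upper bound on the local entropy is available. A minimal sketch (my own, not the paper's code; `log_mloc` is a hypothetical user-supplied function):

```python
import numpy as np

def eps_star(d, sigma, c, log_mloc, J_max=60):
    """Return eps_J for the maximal J satisfying the entropy condition,
    or eps_1 if no such J exists; log_mloc(r) should bound log M^loc(r)."""
    eps = lambda J: d / 2 ** (J - 2) * (c / 2 - 3) / c
    best = None
    for J in range(1, J_max + 1):
        threshold = 16 * max(log_mloc(eps(J) * c / (c / 2 - 3)), np.log(2))
        if eps(J) ** 2 / sigma ** 2 > threshold:
            best = eps(J)  # keep the eps_J of the largest such J
    return best if best is not None else eps(1)

# Example with a constant local entropy bound (e.g., a ball has log M^loc ~ n).
print(eps_star(d=10.0, sigma=1.0, c=16.0, log_mloc=lambda r: 1.0))
```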
Throughout the proof, we attempt to simplify the numerous variables used to define $\varepsilon^*$. To apply lemma 4, it is important to see that $M_j$ is a maximal packing at every level $j$ up to $J$. Applying the lemma at each level and adjusting the constant $c$ in $M^{loc}$ gives us the following:
\begin{align*}
\mathbb{P}\{\lVert\mu - \Upsilon_J\rVert_2 \ge \frac{d}{2^{J-1}}\} &\le \sum_{j=1}^{J-1} \lvert M_j \rvert \exp\left(-\frac{(C-2)^2d^2}{2^{2j}(C+1)^2 8\sigma^2}\right).
\end{align*}
Since $\varepsilon \to M^{loc}(\varepsilon)$ is monotone non-increasing by lemma 5, this probability is bounded from above by $M^{loc}\left(\frac{d}{2^{J-2}}\right)\sum_{j=1}^{J-1}\exp\left(-\frac{(C-2)^2d^2}{2^{2j}(C+1)^2\, 8\sigma^2}\right)$. For $J > 1$, this is further bounded by $M^{loc}\left(\frac{d}{2^{J-2}}\right)\frac{a}{1-a}$, where $a = \exp\left(-\frac{(C-2)^2d^2}{2^{2(J-1)}(C+1)^2\, 8\sigma^2}\right)$ is the last (and largest) term of the summation.
Let $\varepsilon_J = \frac{d}{2^{J-1}}\cdot\frac{C-2}{C+1}$, so that the exponent of $a$ is $\frac{\varepsilon_J^2}{8\sigma^2}$; setting $c = 2(C+1)$ then matches the definition of $\varepsilon_J$ in the theorem. Now, assuming $a < 1/2$ and that $2\log M^{loc}\left(\frac{d}{2^{J-1}}\right)$ is bounded by the exponent of $a$, i.e., $2\log M^{loc}\left(\frac{d}{2^{J-1}}\right) \le \frac{\varepsilon_J^2}{8\sigma^2}$, we get:
\begin{align*}
\mathbb{P}\{\lVert\mu - \Upsilon_J\rVert_2 \ge \frac{d}{2^{J-2}}\} \le 2\exp\left(-\frac{\varepsilon_J^2}{16\sigma^2}\right)
\end{align*}
To show that $\Upsilon_J$ is not too far from $\nu^*$, we apply the triangle inequality and the fact that the sequence $\{\Upsilon_k\}$ is Cauchy, which implies $\lVert\Upsilon_J - \nu^*\rVert_2 = \lVert\Upsilon_J - \lim_{k \to \infty}\Upsilon_k\rVert_2 \le \frac{d}{2^{J-1}}$. Therefore, we get the following with probability $\ge 1 - 2\exp\left(-\frac{\varepsilon_J^2}{16\sigma^2}\right)$:
\begin{align*}
\lVert\nu^* - \mu\rVert_2 &\le \lVert\nu^* - \Upsilon_J\rVert_2 + \lVert \Upsilon_J - \mu\rVert_2 \le 3\frac{d}{2^{J-1}} = 3 \varepsilon_J \frac{C+1}{C-2}
\end{align*}
Let $J^*$ be the maximum $J > 1$ such that the assumption on $2\log M^{loc}\left(\frac{d}{2^{J-1}}\right)$ holds, and set $J^* = 1$ otherwise; this in turn implies the condition on $\frac{\varepsilon_J^2}{\sigma^2}$ in the definition of $\varepsilon^*$. The author then shows that for any $x \ge \varepsilon^*$, $\mathbb{P}\{\lVert\mu - \nu^*\rVert_2 \ge C'x\} \le C\exp(-C''x^2/\sigma^2)\mathbb{1}_{J^* > 1}$, using the fact that $x \to C\exp(-C''x^2/\sigma^2)$ is monotone decreasing.
Finally, we get the upper bound on the expectation by integrating the tail probability: $\mathbb{E}\lVert\mu - \nu^*\rVert_2^2 = \int_0^\infty 2x\,\mathbb{P}\{\lVert\mu - \nu^*\rVert_2 \ge x\}\,dx$. From the above analysis, $\lVert\mu - \nu^*\rVert_2$ is bounded by $C'\varepsilon^*$ with high probability. Thus, we get the following:
\begin{align*}
\mathbb{E}\lVert\mu - \nu^*\rVert_2^2 &\le C_1(\varepsilon^*)^2 + \int_{C'\varepsilon^*}^\infty 2x \exp(-C''x^2/\sigma^2) \mathbb{1}_{J^* > 1}\, dx \\
&= C_1(\varepsilon^*)^2 + C_2\sigma^2\exp(-C_3(\varepsilon^*)^2/\sigma^2)\mathbb{1}_{J^* > 1}
\end{align*}
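For completeness, the tail integral above evaluates in closed form, since the integrand is (up to constants) the derivative of the exponential:
\begin{align*}
\int_{C'\varepsilon^*}^\infty 2x\exp(-C''x^2/\sigma^2)\,dx &= \frac{\sigma^2}{C''}\exp\left(-C''(C'\varepsilon^*)^2/\sigma^2\right),
\end{align*}
which is where the $C_2\sigma^2\exp(-C_3(\varepsilon^*)^2/\sigma^2)$ term comes from.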
Since $(\varepsilon^*)^2/\sigma^2$ is bounded below by a constant when $J^* > 1$ (from the defining condition of $\varepsilon_J$), the second term is dominated by a constant multiple of $(\varepsilon^*)^2$, giving the desired result.
3. Proof of the Minimax Rate
In the concluding section, we show how the minimax rate of $(\varepsilon^*)^2 \wedge \operatorname{diam}(K)^2$ is obtained.
(7) If $\varepsilon^*$ is now defined as $\sup\left\{\varepsilon : \frac{\varepsilon^2}{\sigma^2} \le \log M^{loc}(\varepsilon)\right\}$, then the minimax rate is $(\varepsilon^*)^2 \wedge \operatorname{diam}(K)^2$.
Consider the following two cases:
$\frac{(\varepsilon^*)^2}{\sigma^2} > 16\log 2$: The lower and upper bounds in this case are both of order $(\varepsilon^*)^2$.
Lower Bound: If $\varepsilon^*$ is the supremum of the above set, then $\varepsilon^*/4$ lies in the set, and the monotonicity of $M^{loc}$ gives $\log M^{loc}(\varepsilon^*/4) \ge \frac{(\varepsilon^*)^2}{4\sigma^2} = \frac{(\varepsilon^*)^2}{8\sigma^2} + \frac{(\varepsilon^*)^2}{8\sigma^2} \ge 2\log 2 + \frac{2(\varepsilon^*/4)^2}{\sigma^2}$, where the last step uses the case assumption. Thus $\varepsilon^*/4$ satisfies the sufficient condition of lemma 1, giving a lower bound of order $(\varepsilon^*)^2$.
Upper Bound: Since $\varepsilon^*$ is the supremum of the set, $2\varepsilon^*$ does not lie in it, so $\frac{(2\varepsilon^*)^2}{\sigma^2} \ge \log M^{loc}(2\varepsilon^*)$ and, for any $C > 1$, $\frac{C(2\varepsilon^*)^2}{\sigma^2} \ge C\log M^{loc}(2\varepsilon^*)$. This in turn is bounded below by $C\log M^{loc}(2C\varepsilon^*)$ from the monotonicity of $M^{loc}$. Placing $C = 16$ shows that $2C\varepsilon^*$ satisfies the sufficient condition of theorem 6, giving an upper bound of order $(\varepsilon^*)^2$.
$\frac{(\varepsilon^*)^2}{\sigma^2} \le 16\log 2$: The rate in this case is proportional to $\operatorname{diam}(K)^2 := d^2$.
Again, $2\varepsilon^*$ is not in the set, and hence $\log M^{loc}(2\varepsilon^*) \le \frac{4(\varepsilon^*)^2}{\sigma^2}$, which from the case assumption is upper bounded by $64\log 2$. This means that $\exp(64\log 2)$ points are enough to pack $K$ locally at scale $2\varepsilon^*$. Further, note that if $c$ is large enough, i.e., the packing radius is small enough, then a diameter much longer than $2\varepsilon^*$ would allow placing more than $\exp(64\log 2)$ equidistant points along it, a contradiction. This implies that $K$ lies entirely in a ball of radius of order $2\varepsilon^* \le \sqrt{64\log 2}\,\sigma$. The author notes that it is not necessary to take the supremum of the set; instead, choosing $\varepsilon$ proportional to $d$ suffices. This ensures that $\frac{\varepsilon^2}{\sigma^2} = \frac{d^2}{\sigma^2} \le \frac{(4\varepsilon^*)^2}{\sigma^2}$ is upper bounded by a constant (from the case assumption), while $M^{loc}(\varepsilon)$ can be made larger than a constant by placing
$\theta$ at the center of a diameter of $K$ and constructing a packing set from equidistant points along the diameter. This method gives matching upper and lower bounds of $d^2$ up to constant factors.
Combining the previous two cases gives the desired minimax rate.
4. Discussion and Open Directions
The assumption of convexity is ubiquitous and occurs in applications like isotonic regression, where $K$ is a convex cone, and linear prediction with correlated designs, where $K$ is an ellipsoid (Guntuboyina and Sen, 2018).
The author thus gives a minimax optimal estimator for the general problem rather than for a $K$ of a specific shape. In the remaining part of the paper, he also shows how $\varepsilon^*$ can be calculated for specific convex bodies like hyperrectangles and ellipsoids.
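As a sanity check in the same spirit (my own example, not reproduced from the paper): if $K = B(0, R) \subset \mathbb{R}^n$, a standard volume argument gives $\log M^{loc}(\varepsilon) \asymp n$ for $\varepsilon \le R$, since packing $B(\theta, \varepsilon) \cap K$ at scale $\frac{\varepsilon}{c}$ costs roughly $c^n$ points. Solving $\frac{\varepsilon^2}{\sigma^2} \asymp n$ gives $\varepsilon^* \asymp \sigma\sqrt{n}$, so
\begin{align*}
(\varepsilon^*)^2 \wedge \operatorname{diam}(K)^2 &\asymp \sigma^2 n \wedge R^2,
\end{align*}
recovering the classical rate for estimating a mean vector constrained to a ball.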
There are two major concepts in his proof: the boundedness of the local entropy and convexity. Convexity of $K$ plays an important role in showing that the local entropy is a non-increasing function of $\varepsilon$, which in turn is applied in bounding $\mathbb{E}\lVert\nu^* - \mu\rVert_2^2$ from above. A question to think about is how to extend this to other constraints on $K$.
Finally, another open question is to understand if it is possible to get a computationally tractable estimator that achieves this minimax rate.
Wainwright, Martin J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.
Guntuboyina, Adityanand and Sen, Bodhisattva (2018). Nonparametric shape-restricted regression. Statistical Science.