Notes on Minimax Rate of the Gaussian Location Model with Bounded Convex Constraints

This is a (detailed) summary of the proof techniques used by Matey Neykov in his recent paper.

In Spring 2022, I took the course Information Theoretic Methods in Statistics, taught by Professor Jonathan Niles-Weed, to learn techniques from the source and channel coding problems of Information Theory that can also be applied to Machine Learning and Statistics. As part of the course project, I wrote this summary of On the minimax rate of the Gaussian sequence model under bounded convex constraints by Matey Neykov.

Let’s begin by formally defining the problem.

We are given $n$ observations from a Gaussian sequence model, $Y_i = \mu_i + \xi_i$, where the $\xi_i$ are independent and $\xi_i \sim \mathcal{N}(0,\sigma^2)$. Suppose the vector $\mu$ belongs to a bounded convex set $K$. We want to obtain the minimax rate of this problem under the $\ell_2$ loss, i.e., we want matching upper and lower bounds on
\begin{align*}
\boxed{\inf_{\hat{\nu}}\left(\sup_{\mu \in K}\mathbb{E}\lVert\hat{\nu}(Y)-\mu\rVert_2^2\right)},
\end{align*}
where the infimum is over all estimators $\hat{\nu}$ of $\mu$ based on the data $Y$.
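To make the setup concrete, here is a minimal simulation sketch; the choice of $K$ as a hyperrectangle, the noise level, and the coordinatewise projection estimator are illustrative choices of mine, not the paper's estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

n, sigma = 50, 1.0
lo, hi = -1.0, 1.0                      # K = [-1, 1]^n, a bounded convex set
mu = rng.uniform(lo, hi, size=n)        # true mean, inside K

Y = mu + sigma * rng.normal(size=n)     # Gaussian sequence model Y_i = mu_i + xi_i

# A simple (not minimax-optimal in general) estimator: project Y onto K coordinatewise.
nu_hat = np.clip(Y, lo, hi)

print("l2 loss of projection estimator:", np.sum((nu_hat - mu) ** 2))
print("l2 loss of naive estimator Y:   ", np.sum((Y - mu) ** 2))
```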

Prerequisites

There are two important concepts with which I assume familiarity. The first is the packing of a set in a metric space, for which Chapter 5 of (Wainwright, 2019) is a good resource. The second is measurability of a function, which is used in the subsection Measurability to prove the validity of the proposed estimator and can be skipped if the reader is not familiar with measure theory.

The paper's main contribution is showing that the minimax rate is $(\varepsilon^*)^2 \wedge \text{diam}(K)^2$, where $\varepsilon^*$ is a parameter based on the local entropy of $K$, which is defined as follows.

Local entropy, denoted $M^{loc}(\varepsilon)$, is the largest cardinality of an $\tfrac{\varepsilon}{c}$-packing set ($c > 0$) of the intersection of $K$ with $B(\theta,\varepsilon)$, where $\theta \in K$:
\begin{align*}
\boxed{M^{loc}(\varepsilon) := \sup_{\theta \in K} M \left(\frac{\varepsilon}{c}, B(\theta, \varepsilon) \cap K \right)},
\end{align*}
where $M(\delta,\mathbb{T})$ is the $\delta$-packing number of a set $\mathbb{T}$.
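For intuition about this quantity, the sketch below estimates a packing number by greedy selection over random samples from $K \cap B(\theta,\varepsilon)$; the toy set $K$ (a Euclidean ball), the center, and the sampler are placeholders I chose, and greedy selection only yields a lower bound on the true packing number.

```python
import numpy as np

def greedy_packing(points, delta):
    """Greedily select a delta-separated subset of the given points.

    The size of the returned list is a lower bound on the delta-packing
    number of the sampled set.
    """
    selected = []
    for x in points:
        if all(np.linalg.norm(x - y) > delta for y in selected):
            selected.append(x)
    return selected

def estimate_local_entropy(sample_K_cap_ball, eps, c, n_samples=5000):
    """Estimate log M(eps/c, B(theta, eps) ∩ K) for one fixed center theta.

    `sample_K_cap_ball(n)` is assumed to return n points from K ∩ B(theta, eps).
    """
    pts = sample_K_cap_ball(n_samples)
    packing = greedy_packing(pts, eps / c)
    return np.log(len(packing))

# Toy example: K = unit ball in R^2, theta = 0, so K ∩ B(0, eps) = B(0, eps) for eps <= 1.
rng = np.random.default_rng(0)
def sample_ball(n, eps=0.5, dim=2):
    u = rng.normal(size=(n, dim))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    r = eps * rng.uniform(size=(n, 1)) ** (1 / dim)
    return u * r

print(estimate_local_entropy(sample_ball, eps=0.5, c=4))
```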

List of Sections

In Section 1 we will see that it is straightforward to get a lower bound of the minimax rate using Fano’s inequality if the local entropy is sufficiently large.

Section 2 contains the main results involving the minimax upper bound, where we will learn about the algorithm used to obtain the estimator $\nu^*$. Then, we check the validity of this estimator by showing that $\nu^*$ is a measurable function.

In the subsequent sections, we will see that if $\nu^*$ is selected by the algorithm defined in Section 2, then with high probability, $\nu^*(Y)$ is close to $\mu$. This is done by showing that the binary hypothesis testing problem that picks the estimator based on its distance to $Y$ has low type 1 and type 2 errors. Finally, we show an upper bound on $\mathbb{E}\lVert\nu^* (Y) - \mu\rVert_2^2$. This result, along with geometric properties of the local entropy, gives the minimax rate.

1. Lower Bound

(1) The minimax risk has a lower bound of $\frac{\varepsilon^2}{8c^2}$ for any $\varepsilon>0$ such that $\log M^{loc} (\varepsilon) > 4 \left(\frac{\varepsilon^2}{2\sigma^2} \vee \log 2\right)$.
Let $\mu^1,\mu^2, \ldots, \mu^m$ be $\frac{\varepsilon}{c}$-separated points in $B(\theta,\varepsilon) \cap K$ for some $\theta \in K$, with $m = M^{loc}(\varepsilon)$. Suppose $J$ is uniformly distributed over the index set $[m]$ and the conditional distribution of $Y$ given $J = j$ is equal to the distribution of $\mu^j + \xi$, where $\xi \sim \mathcal{N}(0,\sigma^2 I)$.

To apply Fano's inequality, we need an upper bound on the mutual information between $Y$ and $J$. Note that $I(Y;J)\le \frac{1}{m} \sum_{j=1}^m D_{KL}(\mathbb{P}_{\mu^j}\|\mathbb{P}_\nu)$ for any $\nu \in \mathbb{R}^n$, where $\mathbb{P}_x$ is the Gaussian measure centered at $x$ with covariance $\sigma^2 I$. For any pair $(\mu^j,\nu)$,
\begin{align*}
D_{KL}(\mathbb{P}_{\mu^j}\|\mathbb{P}_{\nu}) = \frac{\lVert\mu^j - \nu\rVert_2^2}{2\sigma^2}.
\end{align*}
Thus,
\begin{align*}
I(Y;J) \le \frac{1}{m}\sum_{j=1}^m \frac{\lVert\mu^j - \nu\rVert_2^2}{2\sigma^2} &\le \max_{j} \frac{\lVert\mu^j - \nu\rVert_2^2}{2\sigma^2}.
\end{align*}
Choosing $\nu = \theta$, the distance between any $\mu^j$ and $\nu$ is at most $\varepsilon$, so $I(Y;J) \le \frac{\varepsilon^2}{2\sigma^2}$. Since the points are $\frac{\varepsilon}{c}$-separated, applying Fano's inequality gives
\begin{align*}
\inf_{\hat{\nu}}\sup_{\mu}\mathbb{E}\lVert\hat{\nu}(Y) - \mu\rVert_2^2 &\ge \frac{\varepsilon^2}{4c^2}\left(1-\frac{\frac{\varepsilon^2}{2\sigma^2} + \log 2}{\log M^{loc}(\varepsilon)}\right).
\end{align*}
To conclude, if $\log M^{loc}(\varepsilon) > 2\left(\frac{\varepsilon^2}{2\sigma^2} + \log 2\right)$ (which is implied by the condition in the statement), we get a lower bound of $\frac{\varepsilon^2}{4c^2}\cdot \left(1-\frac{1}{2}\right) = \boxed{\frac{\varepsilon^2}{8c^2}}$.
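For completeness, the KL divergence formula used above follows from a short computation with Gaussian densities of equal covariance:
\begin{align*}
D_{KL}(\mathbb{P}_{\mu^j}\|\mathbb{P}_{\nu})
&= \mathbb{E}_{Y\sim\mathbb{P}_{\mu^j}}\left[\log\frac{d\mathbb{P}_{\mu^j}}{d\mathbb{P}_{\nu}}(Y)\right]
= \mathbb{E}_{Y\sim\mathbb{P}_{\mu^j}}\left[\frac{\lVert Y-\nu\rVert_2^2 - \lVert Y-\mu^j\rVert_2^2}{2\sigma^2}\right] \\
&= \mathbb{E}_{Y\sim\mathbb{P}_{\mu^j}}\left[\frac{2\,Y^{T}(\mu^j-\nu) + \lVert\nu\rVert_2^2 - \lVert\mu^j\rVert_2^2}{2\sigma^2}\right]
= \frac{2\,(\mu^j)^{T}(\mu^j-\nu) + \lVert\nu\rVert_2^2 - \lVert\mu^j\rVert_2^2}{2\sigma^2}
= \frac{\lVert\mu^j-\nu\rVert_2^2}{2\sigma^2}.
\end{align*}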

In the next sections, we will see how the condition of lemma 1 will be satisfied.

2. Upper Bound

The Algorithm to get the estimator

Suppose $d$ is the diameter of the set $K$. The estimator is returned at the end of the following process:

The algorithm starts from an arbitrary point in $K$ and constructs a $\frac{d}{2}$-packing of the intersection of $K$ with a ball of radius $d$ around that point, which in the beginning is the whole set $K$. Then, it finds the point in this packing closest to $Y$ in the $\lVert \cdot \rVert_2$ norm.

Now the entire process is repeated with the ball centered at the previous closest point, and with the radius of both the ball and the packing halved. A minimal sketch of this procedure is given below.
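Here is the sketch referred to above, written for a toy case where $K$ is the Euclidean unit ball in $\mathbb{R}^2$ (so projection onto $K$ and greedy packing are easy); the randomized packing construction and the finite number of levels are my simplifications of the idealized, infinite process.

```python
import numpy as np

rng = np.random.default_rng(1)

def project_to_K(x):
    """Projection onto K; here K is assumed to be the unit Euclidean ball."""
    n = np.linalg.norm(x)
    return x if n <= 1 else x / n

def greedy_packing(center, radius, delta, n_candidates=2000):
    """Greedy delta-packing of K ∩ B(center, radius), built from random candidates."""
    pts = [center]
    for _ in range(n_candidates):
        cand = project_to_K(center + radius * rng.uniform(-1, 1, size=center.shape))
        if np.linalg.norm(cand - center) <= radius and all(
            np.linalg.norm(cand - p) > delta for p in pts
        ):
            pts.append(cand)
    return pts

def estimate(Y, n_levels=20):
    """Iteratively refine the estimate by halving the ball radius and packing scale."""
    d = 2.0                      # diameter of the unit ball
    center = np.zeros_like(Y)    # arbitrary starting point in K
    radius = d
    for _ in range(n_levels):
        packing = greedy_packing(center, radius, radius / 2)
        center = min(packing, key=lambda p: np.linalg.norm(Y - p))
        radius /= 2
    return center

mu = np.array([0.6, 0.3])        # true mean, inside K
Y = mu + 0.2 * rng.normal(size=2)
print(estimate(Y))
```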

There are a couple of points to note in this algorithm.

First, such a process is not achievable in practice, because the level $k$ ranges over all of $\mathbb{N}$: the estimator is only obtained in the limit.

Second, any point $x \in K$ can be reached through this algorithm. This is because if $x$ lies in one of the packings at some level, we are done. Otherwise, $x$ can be reached in the limit, since the union of all such packings is a countable dense subset of $K$. It is countable because it is a union of countably many finite sets, and dense because for any $\delta > 0$ and $x \in K$, the ball $B(x, \delta)$ contains at least one point that belongs to a $2\delta$-packing.

Measurability

We argue that $\nu^*$ is a valid statistic by showing that it is a measurable function of the data. Let $\Upsilon_k: \mathbb{R}^n \to \mathbb{R}^n$ return the point selected at level $k$. That is, $\Upsilon_k(Y)$ is the point of the level-$k$ packing $M_k$ (constructed around the point $\Upsilon_{k-1}(Y) \in M_{k-1}$ that was closest to $Y$ at level $k-1$) that is closest to $Y$.

(2) The function $\nu^*:\mathbb{R}^n \to \mathbb{R}^n$ is measurable with respect to the Borel $\sigma$-algebra.
First, consider the event that $\Upsilon_k(y)$ equals a given point of a packing set $M$. For any point $m \in M$, the set $\{y: \Upsilon_k(y) = m \}$ is the set of all $y$ such that $\Upsilon_k(y) \in M$ and $y$ belongs to a polytope (the cell of $m$ in the Voronoi tessellation induced by $M$). Since $K$ is convex, this is a set of $y$ belonging to a convex polytope, which is a countable combination of closed sets, making the inverse image of a singleton, $\Upsilon_k^{-1}(\{m\})$, a Borel set. Now, since $M_k$ is a packing of a bounded set, it is finite. This implies that $\Upsilon_k$ takes finitely many values, so its relevant $\sigma$-algebra is generated by singletons, and $\Upsilon_k$ is a measurable function.

Note that the sequence $\{\Upsilon_k\}$ is convergent because it is Cauchy: for any $k_1< k_2$,
\begin{align*}
\lVert\Upsilon_{k_1} - \Upsilon_{k_2}\rVert_2 &= \lVert\Upsilon_{k_1} - \Upsilon_{k_1 + 1} + \Upsilon_{k_1+1} - \ldots - \Upsilon_{k_2}\rVert_2\\
&\le \sum_{i=k_1}^{k_2-1}\lVert\Upsilon_{i} - \Upsilon_{i+1}\rVert_2\\
&\le \sum_{i=k_1}^{k_2 -1 }\frac{d}{2^{i}} \le \frac{d}{2^{k_1 - 1}}.
\end{align*}
The first inequality is the triangle inequality, and the second uses the fact that $\Upsilon_{i+1}$ lies in a ball of radius $\frac{d}{2^{i}}$ around $\Upsilon_i$.

Our estimator is obtained in the limit, i.e.,
\begin{align*}
\nu^* &= \lim_{k \to \infty }\Upsilon_k = \limsup_{k \to \infty} \Upsilon_k \quad\text{(coordinatewise)}.
\end{align*}
So, to see that $\nu^*$ is measurable, it suffices to show that for any closed box $B$, $(\nu^*)^{-1}(B)$ is a Borel set.

If $B_j^L$ and $B_j^U$ are the lower and upper bounds of $B$ in the $j$th coordinate, then our problem reduces to proving that
\begin{align*}
\boxed{\bigcap_{j=1}^n\left\{y: \limsup_{k \to \infty} \Upsilon_k^j (y) \in [B_j^L, B_j^U]\right\}}
\end{align*}
is Borel.

Each set in the above intersection can be written as follows:
\begin{align*}
\left\{y: \limsup_{k \to \infty} \Upsilon_k^j (y) \in [B_j^L, B_j^U]\right\} &= \left\{y : \limsup_{k \to \infty} \Upsilon_k^j(y) \ge B_j^L\right\} \cap \left\{y:\limsup_{k \to \infty} \Upsilon_k^j(y) \le B_j^U\right\} \\
&:= P \cap Q.
\end{align*}
For $P$: $\limsup_{k} \Upsilon_k^j(y) \ge B_j^L$ means that for every $l \in \mathbb{N}$ and every $k$, there exists an $i \ge k$ with $\Upsilon_i^j(y) > B_j^L - \frac{1}{l}$, i.e.,
\begin{align*}
P = \bigcap_{l=1}^\infty \bigcap_{k=1}^\infty \bigcup_{i=k}^\infty\left\{y:\Upsilon_i^j(y) > B_j^L - \tfrac{1}{l}\right\},
\end{align*}
which is measurable since each $\Upsilon_i$ is measurable and this is a countable combination of measurable sets. For $Q$: if $a > \limsup_k \Upsilon_k^j(y)$, then there exists an $N \in \mathbb{N}$ such that $\Upsilon_i^j(y) < a$ for all $i \ge N$. Applying this with $a = B_j^U + \frac{1}{l}$ for every $l$ gives
\begin{align*}
Q = \bigcap_{l=1}^\infty \bigcup_{k=1}^\infty \bigcap_{i=k}^\infty\left\{y: \Upsilon_i^j(y) < B_j^U + \tfrac{1}{l}\right\},
\end{align*}
which is again measurable.

Mapping to the Binary Hypothesis Testing Problem

We know from the construction of the algorithm that $\nu^*$ is close to $Y$. To understand whether it is also close to $\mu$, we pose this as a binary hypothesis testing problem and show that within a $\delta$-packing, the type $1$ and type $2$ errors are small.

(3) Consider the hypotheses $H_0: \mu = \nu_1$ and $H_1: \mu= \nu_2$ with $\lVert \nu_1 - \nu_2 \rVert_2 \ge C\delta$, where $C> 2$. Then the test $\psi(Y) = \mathbb{1}_{\lVert Y - \nu_1\rVert_2 > \lVert Y - \nu_2 \rVert_2}$ satisfies the following:
\begin{align*}
\min \left(\sup_{\mu:\lVert\mu - \nu_1 \rVert_2 \le \delta} \mathbb{P}_\mu(\psi=1) ,\sup_{\mu:\lVert \mu -\nu_2 \rVert_2 \le \delta}\mathbb{P}_\mu(\psi=0)\right)\le \exp \left(- (C-2 )^2 \frac{\delta^2}{8\sigma^2}\right).
\end{align*}
We demonstrate the result for the type 1 error; the same argument holds for type 2.

A type 1 error means that $\lVert \mu - \nu_1 \rVert_2 \le \delta$ and $\psi(Y) = 1$, which holds if $\lVert Y - \nu_1\rVert_2 ^2 - \lVert Y - \nu_2\rVert_2 ^2 \ge 0$, i.e., $Y$ is closer to $\nu_2$.

Next, note that the random variable $Z:=\lVert Y - \nu_1\rVert_2^2 - \lVert Y - \nu_2\rVert_2^2$ is normally distributed, so the error is bounded by a Gaussian tail probability. Expanding, $Z = \lVert \nu_1\rVert_2^2 - \lVert\nu_2\rVert_2^2 + 2(\mu + \xi)^T(\nu_2 - \nu_1)$. Letting $\mu = \nu_1 + \eta$ with $\lVert\eta\rVert_2\le \delta$ gives
\begin{align*}
Z = \lVert\nu_1\rVert_2^2 - \lVert\nu_2\rVert_2^2 + 2(\nu_1 + \eta + \xi)^T(\nu_2 - \nu_1) = -\lVert\nu_1 - \nu_2\rVert_2^2 + 2(\eta+\xi)^T(\nu_2 - \nu_1).
\end{align*}
Now, note that $\eta^T(\nu_2 - \nu_1)\le \lVert\eta\rVert_2 \lVert\nu_2 - \nu_1\rVert_2 \le \delta \lVert \nu_2 - \nu_1\rVert_2$. From the separation assumption, $\delta \le \lVert\nu_1 - \nu_2\rVert_2/C$, so $\eta^T(\nu_2 - \nu_1) \le \frac{\lVert\nu_1 - \nu_2\rVert_2^2}{C}$. Since $\xi \sim \mathcal{N}(0, \sigma^2 I)$, by the stability of Gaussians, $Z \sim \mathcal{N}(\theta, 4\sigma^2 \lVert\nu_1 - \nu_2\rVert_2^2)$, where $\theta \le (2/C - 1)\lVert\nu_1 - \nu_2\rVert_2^2 < 0$.

Finally, applying the Gaussian tail bound,
\begin{align*}
\mathbb{P}(Z \ge 0) \le \exp\left(-\frac{\theta^2}{8\sigma^2\lVert\nu_1-\nu_2\rVert_2^2}\right) \le \exp\left(-\frac{(1-2/C)^2\lVert\nu_1-\nu_2\rVert_2^2}{8\sigma^2}\right) \le \exp\left(-\frac{(C-2)^2\delta^2}{8\sigma^2}\right),
\end{align*}
where the last step uses $\lVert\nu_1-\nu_2\rVert_2 \ge C\delta$; this concludes the proof.
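As a sanity check, here is a small Monte Carlo sketch comparing the empirical type 1 error of the test $\psi$ with the bound $\exp(-(C-2)^2\delta^2/(8\sigma^2))$; the values of $\nu_1$, $\nu_2$, $\delta$, $C$, and $\sigma$ are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(2)

n, sigma, delta, C = 10, 1.0, 1.0, 4.0
nu1 = np.zeros(n)
nu2 = np.zeros(n); nu2[0] = C * delta          # ||nu1 - nu2|| = C * delta
mu = nu1 + delta * np.eye(n)[1]                # a mean within delta of nu1

n_trials = 200_000
xi = sigma * rng.normal(size=(n_trials, n))
Y = mu + xi
psi = np.linalg.norm(Y - nu1, axis=1) > np.linalg.norm(Y - nu2, axis=1)

empirical_type1 = psi.mean()
bound = np.exp(-(C - 2) ** 2 * delta ** 2 / (8 * sigma ** 2))
print(f"empirical type 1 error: {empirical_type1:.4f}, bound: {bound:.4f}")
```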

We now show that the above result extends to all points in $M$.

(4) Let $i^* = \arg \min_{i\in M}\lVert Y - \nu_i\rVert_2$. Then with probability at least $1-\lvert M\rvert \cdot \exp\left(- \frac{(C-2)^2\delta^2}{8\sigma^2}\right)$, $\lVert\nu_{i^*} - \mu\rVert_2\le (C+1)\delta$.
The ingredients of this proof are the triangle inequality, characterising the relevant event as a type 1 error of the test in lemma 3, and a union bound.

First, the author defines an intermediate random variable $T_i$: the maximum distance between $\nu_i$ and another point $\nu_j$ in the packing set such that $\nu_j$ is closer to $Y$ than $\nu_i$, i.e., $\max_{j}\lVert \nu_i - \nu_j \rVert_2$ over $j$ with $\lVert Y - \nu_i\rVert_2 \ge \lVert Y - \nu_j\rVert_2$ and $\lVert \nu_i - \nu_j\rVert_2\ge C\delta$. If no such $j$ exists, then $T_i = 0$.

Since $M$ is a maximal $\delta$-packing, it is also a $\delta$-covering, and thus $\lVert\mu - \nu_i \rVert_2\le \delta$ for some $i$. Analysing the probability in the statement and using the triangle inequality,
\begin{align*}
\mathbb{P}\{\lVert \nu_{i^*} - \mu \rVert_2\ge(C+1)\delta \} &\le \mathbb{P}\{\lVert\nu_{i^*} - \nu_i\rVert_2 + \lVert \nu_i - \mu\rVert_2 \ge (C+1)\delta\} \\
&\le \mathbb{P}\{ \lVert\nu_{i^*} - \nu_i \rVert_2 + \delta \ge C\delta + \delta\} = \mathbb{P}\{i^* \in \{j:\lVert\nu_i - \nu_j\rVert_2\ge C\delta\}\}.
\end{align*}
Now if $i^*$ belongs to the set of all $j$ with $\lVert\nu_i - \nu_j\rVert_2 \ge C\delta$, then $T_i \ge C\delta > 0$, since $\nu_{i^*}$ is by definition at least as close to $Y$ as $\nu_i$. The probability that $T_i > 0$ is the probability that there exists a $j$ satisfying the conditions in the definition of $T_i$; for each such $j$, this is exactly a type 1 error of the test in lemma 3. A union bound over all $j \in M$ together with lemma 3 gives the desired result.

Before proving the rate, let us briefly state a result that will be heavily used in the next two theorems.

(5) The function $\varepsilon \mapsto M^{loc}(\varepsilon)$ is monotone non-increasing.
The above can be proved by showing that for any $\varepsilon'<\varepsilon$, any $\frac{\varepsilon}{c}$-packing of $K \cap B(\theta, \varepsilon)$ can be mapped to an $\frac{\varepsilon'}{c}$-packing of $K \cap B(\theta, \varepsilon')$ of the same cardinality. This is done by applying the affine contraction toward $\theta$ with ratio $\frac{\varepsilon'}{\varepsilon}$: by convexity of $K$, the contracted points stay in $K$, they land inside $B(\theta,\varepsilon')$, and pairwise distances shrink by exactly the factor $\frac{\varepsilon'}{\varepsilon}$, so $\frac{\varepsilon}{c}$-separation becomes $\frac{\varepsilon'}{c}$-separation. Hence $M^{loc}(\varepsilon') \ge M^{loc}(\varepsilon)$.
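Explicitly, writing the contraction as a map (my rendering of the argument, not the paper's notation): for $\frac{\varepsilon}{c}$-separated points $x, y \in K \cap B(\theta,\varepsilon)$,
\begin{align*}
T(x) &:= \theta + \frac{\varepsilon'}{\varepsilon}(x - \theta) \in K \quad \text{(a convex combination of } x \in K \text{ and } \theta \in K\text{)},\\
\lVert T(x) - \theta\rVert_2 &= \frac{\varepsilon'}{\varepsilon}\lVert x - \theta\rVert_2 \le \varepsilon',\\
\lVert T(x) - T(y)\rVert_2 &= \frac{\varepsilon'}{\varepsilon}\lVert x - y\rVert_2 \ge \frac{\varepsilon'}{\varepsilon}\cdot\frac{\varepsilon}{c} = \frac{\varepsilon'}{c}.
\end{align*}
So $T$ maps an $\frac{\varepsilon}{c}$-packing of $K \cap B(\theta,\varepsilon)$ to an $\frac{\varepsilon'}{c}$-packing of $K \cap B(\theta,\varepsilon')$ of the same size.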

The next theorem gives the $(\varepsilon^*)^2$ part of the rate.

(6) If $\nu^*$ is returned by the previously defined algorithm, then
\begin{align*}
\boxed{\mathbb{E}\lVert\mu - \nu^*(Y)\rVert_2^2 \le \bar{C}(\varepsilon^*)^2},
\end{align*}
where $\varepsilon^* = \varepsilon_J$ for the maximal $J\in \mathbb{N}$ such that $\varepsilon_J := \frac{d}{2^{J-2}}\cdot \frac{c/2-3}{c}$ satisfies $\frac{\varepsilon_J^2}{\sigma^2}> 16 \log M^{loc}\left(\frac{\varepsilon_J c}{c/2 - 3}\right) \vee 16\log 2$, or $\varepsilon^* = \varepsilon_1$ if no such $J$ exists.
Throughout the proof, we attempt to simplify the numerous variables used to define $\varepsilon^*$. To apply the result of lemma 4, it is important to see that $M_j$ is the packing set at level $j$ for all levels up to $J$. Applying the lemma at each level and adjusting the constant $c$ in $M^{loc}$ gives the following:
\begin{align*}
\mathbb{P}\left\{\lVert\mu - \Upsilon_J\rVert_2 \ge \frac{d}{2^{J-1}}\right\} &\le \sum_{j=1}^{J-1} \lvert M_j \rvert \exp\left(-\frac{(C-2)^2d^2}{2^{2j}(C+1)^2 8\sigma^2}\right).
\end{align*}
Since $\varepsilon \mapsto M^{loc}(\varepsilon)$ is monotone non-increasing by lemma 5, this probability is bounded from above by $M^{loc}(\frac{d}{2^{J-2}}) \sum_{j=1}^{J-1}\exp\left(-\frac{(C-2)^2d^2}{2^{2j}(C+1)^2 8\sigma^2}\right)$. For $J > 1$, this is further bounded by $M^{loc}(\frac{d}{2^{J-2}})\frac{a}{1-a}$, where $a$ is the last term of the summation, $\exp\left(-\frac{(C-2)^2d^2}{2^{2(J-1)}(C+1)^2 8\sigma^2}\right)$.

Let $\varepsilon_J = \frac{d}{2^{J-1}}\cdot \frac{C-2}{C+1}$, so that $a = \exp(-\frac{\varepsilon_J^2}{8\sigma^2})$; setting $c = 2(C+1)$ matches the definition of $\varepsilon_J$ in the statement. Now, assuming $a < 1/2$ and that $2\log(M^{loc}(\frac{d}{2^{J-1}}))$ is bounded by the exponent of $a$, i.e., $2\log(M^{loc}(\frac{d}{2^{J-1}})) \le \frac{\varepsilon_J^2}{8\sigma^2}$, we get:
\begin{align*}
\mathbb{P}\left\{\lVert\mu - \Upsilon_J\rVert_2 \ge \frac{d}{2^{J-2}}\right\} \le 2\exp\left(-\frac{\varepsilon_J^2}{16\sigma^2}\right).
\end{align*}
To show that $\Upsilon_J$ is not too far from $\nu^*$, we use the fact that the sequence $\{\Upsilon_J\}$ is Cauchy, which implies that $\lVert\Upsilon_J - \nu^*\rVert_2 = \lVert\Upsilon_J - \lim_{k \to \infty} \Upsilon_k\rVert_2 \le \frac{d}{2^{J-1}}$. Therefore, by the triangle inequality, with probability at least $1-2 \exp(-\frac{\varepsilon_J^2}{16\sigma^2})$,
\begin{align*}
\lVert\nu^* - \mu\rVert_2 &\le \lVert\nu^* - \Upsilon_J\rVert_2 + \lVert \Upsilon_J - \mu\rVert_2 \le 3\cdot\frac{d}{2^{J-1}} = 3 \varepsilon_J \cdot\frac{C+1}{C-2}.
\end{align*}
Let $J^*$ be the maximum $J > 1$ such that the assumption on $2 \log(M^{loc}(\frac{d}{2^{J-1}}))$ holds (which in turn implies the condition on $\frac{\varepsilon_J^2}{\sigma^2}$ in the statement); otherwise we set $J^* = 1$. The author then shows that for any $x \ge \varepsilon^*$, $\mathbb{P}\{\lVert\mu - \nu^*\rVert_2 \ge C' x\} \le \underline{C}\exp(-C''x^2/\sigma^2)\mathbb{1}_{J^* > 1}$, using the fact that $x \mapsto \underline{C}\exp(-C''x^2/\sigma^2)$ is monotone decreasing.

Finally, we get the upper bound on the expectation by integrating the tail probability, $\mathbb{E}\lVert\mu - \nu^*\rVert_2^2 = \int_{0}^\infty 2x\, \mathbb{P}\{\lVert\mu - \nu^*\rVert_2\ge x\}\,\mathrm{d} x$. From the above analysis, with high probability, $\lVert\mu - \nu^*\rVert_2$ is bounded by $C'\varepsilon^*$. Thus, we get the following:
\begin{align*}
\mathbb{E}\lVert\mu - \nu^*\rVert_2^2 &\le C_1(\varepsilon^*)^2 + \int_{C'\varepsilon^*}^\infty 2x \exp(-C''x^2/\sigma^2)\,\mathrm{d}x \; \mathbb{1}_{J^* > 1} \\
&= C_1(\varepsilon^*)^2 + C_2\sigma^2\exp(-C_3(\varepsilon^*)^2/\sigma^2)\,\mathbb{1}_{J^* > 1}.
\end{align*}
Since $\frac{(\varepsilon^*)^2}{\sigma^2}$ is greater than some constant by the assumption, this gives the desired result.
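The final integral can be evaluated by direct integration (an elementary computation, included here for completeness):
\begin{align*}
\int_{C'\varepsilon^*}^\infty 2x \exp\left(-\frac{C''x^2}{\sigma^2}\right) \mathrm{d}x
= \left[-\frac{\sigma^2}{C''}\exp\left(-\frac{C''x^2}{\sigma^2}\right)\right]_{C'\varepsilon^*}^{\infty}
= \frac{\sigma^2}{C''}\exp\left(-\frac{C''(C')^2(\varepsilon^*)^2}{\sigma^2}\right),
\end{align*}
which gives the $C_2\sigma^2\exp(-C_3(\varepsilon^*)^2/\sigma^2)$ term with $C_2 = 1/C''$ and $C_3 = C''(C')^2$.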

3. Proof of the Minimax Rate

In the concluding section, we will show how the minimax rate of $(\varepsilon^*)^2 \wedge \text{diam}(K)^2$ is achieved.

(7) If $\varepsilon^*$ is now defined as $\sup\left\{\varepsilon: \frac{\varepsilon^2}{\sigma^2}\le \log(M^{loc}(\varepsilon))\right\}$, then the minimax rate is $\boxed{(\varepsilon^*)^2 \wedge \text{diam}(K)^2}$.
Consider the following two cases:
  1. $\frac{(\varepsilon^*)^2}{\sigma^2} > 16\log 2$: The lower and upper bounds in this case are both of order $(\varepsilon^*)^2$.
    • Lower bound: Since $\varepsilon^*$ is the supremum of the above set, $\varepsilon^*/2$ lies in the set, and by the monotonicity of $M^{loc}$, $\log(M^{loc}(\varepsilon^*/4))\ge \log(M^{loc}(\varepsilon^*/2)) \ge \frac{(\varepsilon^*)^2}{4\sigma^2} = \frac{(\varepsilon^*)^2}{8\sigma^2} + \frac{(\varepsilon^*)^2}{8\sigma^2} \ge 2\log 2 + \frac{2(\varepsilon^*/4)^2}{\sigma^2}$. Thus $\varepsilon = \varepsilon^*/4$ satisfies the sufficient condition of lemma 1, giving a lower bound of $\underline{C}(\varepsilon^*)^2$.
    • Upper bound: Since $2\varepsilon^* > \varepsilon^*$, it does not lie in the set, so $\frac{(2\varepsilon^*)^2}{\sigma^2} > \log(M^{loc}(2\varepsilon^*))$ and, for any $C > 1$, $\frac{C(2\varepsilon^*)^2}{\sigma^2} > C\log(M^{loc}(2\varepsilon^*))$. This in turn is bounded below by $C\log (M^{loc}(2\sqrt{C}\varepsilon^*))$ by the monotonicity of $M^{loc}$. Taking $C = 16$ shows that $2\sqrt{C}\varepsilon^* = 8\varepsilon^*$ satisfies the sufficient condition of theorem 6, giving an upper bound of $\overline{C}(\varepsilon^*)^2$.
  2. $\frac{(\varepsilon^*)^2}{\sigma^2} \le 16\log 2$: The rate in this case is proportional to $\text{diam}(K)^2:=d^2$. Again, $2\varepsilon^*$ is not in the set and hence $\log (M^{loc}(2\varepsilon^*))\le \frac{4(\varepsilon^*)^2}{\sigma^2}$, which from the assumption is upper bounded by $64\log 2$. This means that $\exp(64\log 2)$ points are enough to pack $K \cap B(\theta, 2\varepsilon^*)$. Further note that if $c$ is large enough, i.e., the packing radius is small enough, and the diameter of $K$ exceeded a constant multiple of $\varepsilon^*$, then placing more than $\exp(64 \log 2)$ equally spaced points along a diameter inside a ball of radius $2\varepsilon^*$ would give a packing set, contradicting the bound above. This implies that $K$ lies entirely in a ball of radius of order $2\varepsilon^*\le \sqrt{64\log 2}\,\sigma$. The author claims that it is not necessary to take the supremum of the set; instead, choosing $\varepsilon$ proportional to $d$ suffices. This ensures that $\varepsilon^2/\sigma^2 = d^2 /\sigma^2 \le (4\varepsilon^*)^2/\sigma^2$ is upper bounded by a constant (from the assumption). And $M^{loc}(\varepsilon)$ can be made larger than a constant by placing $\theta$ at the center of a diameter of $K$ and constructing a packing set from equidistant points along the diameter. This method gives matching upper and lower bounds of $d^2$ up to constant factors.
Combining the previous two cases gives the desired minimax rate.

4. Discussion and Open Directions

The assumption of convexity is common and occurs in applications like isotonic regression, where $K$ is a convex cone, and linear prediction with correlated designs, where $K$ is an ellipsoid (Guntuboyina and Sen, 2018).

The author thus gives a minimax optimal estimator for a general problem, rather than for a $K$ of a specific shape. In the remaining part of the paper, he also shows how $\varepsilon^*$ can be calculated for certain convex bodies like hyperrectangles and ellipsoids.

There are two major concepts in his proof: the boundedness of the local entropy and convexity. Convexity of $K$ plays an important role in showing that the local entropy is a non-increasing function of $\varepsilon$, which in turn is applied in bounding $\mathbb{E}\lVert\nu^* - \mu\rVert_2^2$ from above. A question to think about is how to extend this to other constraints on $K$.

Finally, another open question is to understand if it is possible to get a computationally tractable estimator that achieves this minimax rate.

  1. Wainwright, Martin J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press.
  2. Guntuboyina, Adityanand and Sen, Bodhisattva. (2018). Nonparametric shape-restricted regression. Statistical Science.