We are given $n$ observations from a Gaussian sequence model, $Y_i = \mu_i + \xi_i$, where the $\xi_i$ are independent and $\xi_i \sim N(0, \sigma^2)$. Suppose the vector $\mu$ belongs to a bounded convex set $K$. We want to obtain the minimax rate of this problem under the $\ell_2$ loss, i.e., we want matching upper and lower bounds on $\inf_{\hat{\nu}}\sup_{\mu \in K}\mathbb{E}\lVert\hat{\nu}(Y) - \mu\rVert_2^2$, where the infimum runs over all estimators $\hat{\nu}$ of $\mu$ based on the data.
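To make the setup concrete, here is a minimal simulation sketch (my own illustration, not from the paper; the choice of $K$ as a Euclidean ball of radius 5 is an arbitrary assumption for the example):

```python
import numpy as np

def sample_gsm(mu, sigma, rng):
    """Draw one observation Y = mu + xi with xi ~ N(0, sigma^2 I)."""
    return mu + rng.normal(scale=sigma, size=mu.shape)

rng = np.random.default_rng(0)
n, sigma = 50, 1.0

# Hypothetical convex constraint set K: the Euclidean ball of radius 5.
mu = rng.normal(size=n)
mu *= min(1.0, 5.0 / np.linalg.norm(mu))  # pull mu into K if needed

Y = sample_gsm(mu, sigma, rng)
print(np.linalg.norm(Y - mu))  # noise magnitude, typically about sigma * sqrt(n)
```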
Prerequisites
There are two important concepts with which I assume familiarity. The first is the packing of a set in a metric space, for which Chapter 5 of (Wainwright, 2019) is a good resource. The second is measurability of a function, which is used in the subsection Measurability to prove the validity of the proposed estimator; that subsection can be skipped if the reader is not familiar with measure theory.
The paper’s main contribution is in showing that the minimax rate is $(\varepsilon^*)^2 \wedge \operatorname{diam}(K)^2$, where $\varepsilon^*$ is a parameter based on the local entropy of $K$, which is defined as follows.
The local entropy, denoted $M^{loc}(\varepsilon)$, is the largest cardinality of an $\frac{\varepsilon}{c}$-packing ($c > 0$) of the intersection of $K$ with a ball $B(\theta, \varepsilon)$, maximized over centers $\theta \in K$:
\begin{align*}
\boxed{M^{loc}(\varepsilon) := \sup_{\theta \in K} M \left(\frac{\varepsilon}{c}, B(\theta, \varepsilon) \cap K \right)},
\end{align*}
where $M(\delta, T)$ is the $\delta$-packing number of a set $T$.
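Packing numbers are rarely available in closed form, but a greedy pass over sampled points of $K \cap B(\theta, \varepsilon)$ produces a valid $\frac{\varepsilon}{c}$-packing and hence a lower bound on $M^{loc}(\varepsilon)$. A minimal Monte Carlo sketch (my own illustration; the sampler for $K$ and the choice of $K$ as a ball are assumptions):

```python
import numpy as np

def greedy_packing(points, delta):
    """Greedily keep points that are >= delta away from all kept points;
    the result is a maximal delta-packing of the sampled set."""
    selected = []
    for p in points:
        if all(np.linalg.norm(p - q) >= delta for q in selected):
            selected.append(p)
    return selected

def local_entropy_lower_bound(theta, eps, c, sample_K, n_samples=20000):
    """Estimate log M(eps/c, B(theta, eps) ∩ K) from samples of K."""
    pts = sample_K(n_samples)
    pts = pts[np.linalg.norm(pts - theta, axis=1) <= eps]  # restrict to the ball
    return np.log(len(greedy_packing(pts, eps / c)))

# Example: K is the unit ball in R^3, sampled uniformly.
rng = np.random.default_rng(1)
def sample_ball(m, dim=3):
    x = rng.normal(size=(m, dim))
    r = rng.random(m) ** (1 / dim)  # radius transform for uniformity in the ball
    return x * (r / np.linalg.norm(x, axis=1))[:, None]

print(local_entropy_lower_bound(np.zeros(3), eps=0.5, c=4.0, sample_K=sample_ball))
```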
List of Sections
In Section 1, we will see that it is straightforward to get a lower bound on the minimax rate using Fano's inequality, provided the local entropy is sufficiently large.
Section 2 contains the main results on the minimax upper bound, where we describe the algorithm that produces the estimator $\nu^*$. We then check the validity of this estimator by showing that $\nu^*$ is a measurable function.
In the subsequent sections, we will see that if $\nu^*$ is selected by the algorithm defined in Section 2, then with high probability $\nu^*(Y)$ is close to $\mu$. This is done by showing that the binary hypothesis test that picks an estimate based on its distance to $Y$ has low Type 1 and Type 2 errors. Finally, we show an upper bound on $\mathbb{E}\lVert\nu^*(Y) - \mu\rVert_2^2$. This result, together with geometric properties of the local entropy, gives the minimax rate.
1. The Minimax Lower Bound
(1) The minimax risk has a lower bound of $\frac{\varepsilon^2}{8c^2}$ for any $\varepsilon > 0$ such that $\log M^{loc}(\varepsilon) > 4\left(\frac{\varepsilon^2}{2\sigma^2} \vee \log 2\right)$.
Let $\mu^1, \mu^2, \ldots, \mu^m$ be the points of an $\frac{\varepsilon}{c}$-packing of $K \cap B(\theta, \varepsilon)$, so that $m = M^{loc}(\varepsilon)$ for the worst-case $\theta$. Suppose $J$ is uniformly distributed over the index set $[m]$, and the conditional distribution of $Y$ given $J = j$ equals the distribution of $\mu^j + \xi$, where $\xi \sim N(0, \sigma^2 I)$.
To apply Fano's inequality, we need an upper bound on the mutual information between $Y$ and $J$. Note that $I(Y;J) \le \frac{1}{m}\sum_{j=1}^m D_{KL}(P_{\mu^j}\,\|\,P_\nu)$ for any $\nu \in \mathbb{R}^n$, where $P_x$ is the Gaussian measure centered at $x$ with covariance $\sigma^2 I$. For any pair $(\mu^j, \nu)$,
$D_{KL}(P_{\mu^j}\,\|\,P_\nu) = \frac{\lVert\mu^j - \nu\rVert_2^2}{2\sigma^2}$.
And thus,
\begin{align*}
I(Y;J) \le \frac{1}{m}\sum_{j=1}^m \frac{\lVert\mu^j - \nu\rVert_2^2}{2\sigma^2} &\le \max_{j} \frac{\lVert\mu^j - \nu\rVert_2^2}{2\sigma^2}.
\end{align*}
Now, choosing $\nu = \theta$, the maximum distance between $\mu^j$ and $\nu$ is at most $\varepsilon$, since every $\mu^j$ lies in $B(\theta, \varepsilon)$. The points $\mu^j$ are $\frac{\varepsilon}{c}$-separated, so applying Fano's inequality we get the following:
\begin{align*}
\inf_{\hat{\nu}}\sup_{\mu}\mathbb{E}\lVert\hat{\nu}(Y) - \mu\rVert_2^2 &\ge \frac{\varepsilon^2}{4c^2}\left(1-\frac{\frac{\varepsilon^2}{2\sigma^2} + \log 2}{\log M^{loc}(\varepsilon)}\right).
\end{align*}
To conclude, if $\log M^{loc}(\varepsilon) > 2\left(\frac{\varepsilon^2}{2\sigma^2} + \log 2\right)$ (which the condition of the lemma implies, since $a + b \le 2(a \vee b)$), we get a lower bound of $\frac{\varepsilon^2}{4c^2}\cdot\left(1-\frac{1}{2}\right) = \frac{\varepsilon^2}{8c^2}$.
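For reference, the version of Fano's inequality applied above (see, e.g., Chapter 15 of Wainwright, 2019): for $m$ hypotheses that are pairwise $\frac{\varepsilon}{c}$-separated,
\begin{align*}
\inf_{\hat{\nu}}\sup_{\mu}\mathbb{E}\lVert\hat{\nu}(Y) - \mu\rVert_2^2 &\ge \left(\frac{\varepsilon}{2c}\right)^2\left(1-\frac{I(Y;J) + \log 2}{\log m}\right),
\end{align*}
so the separation of the packing supplies the $\frac{\varepsilon^2}{4c^2}$ factor, and the mutual information term is controlled by the KL bound computed above.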
In the next sections, we will see how the condition of lemma 1 will be satisfied.
2. The Estimation Algorithm
Suppose $d$ is the diameter of the set $K$. The estimator is returned at the end of the following process:
Begin with an arbitrary $\nu^* \in K$ and set $c = 2(C+1)$
Repeat for all $k \in \mathbb{N}$:
Construct a maximal $\frac{d}{2^k(C+1)}$-packing, $M_k$, of the set $K \cap B\left(\nu^*, \frac{d}{2^{k-1}}\right)$
Find the closest point to $Y$ in $M_k$ and reassign it as $\nu^*$
This algorithm picks an arbitrary point $\nu^* \in K$ and constructs a $\frac{d}{2(C+1)}$-packing of the intersection of $K$ with a ball of radius $d$, which at the start is all of $K$. Then it finds the point in this packing closest to $Y$ in the $\lVert\cdot\rVert_2$ norm.
Now the entire process is repeated with the ball centered at the previous closest point, with the radius of the ball and the packing scale halved.
There are a couple of points to note in this algorithm.
First, such a process is not implementable as stated, because the loop runs over all $k \in \mathbb{N}$ and never terminates.
Second, any point $Y \in K$ can be reached through this algorithm. If the point of $K$ appears in the packing at some level, we are done. Otherwise, $Y$ can be reached in the limit, since the union of all such packings is a countable dense subset of $K$: countable because it is a union of countably many finite sets, and dense because for any $\delta > 0$ and $Y \in K$, the ball $B(Y, \delta)$ contains at least one point of a maximal packing at scale $\frac{\delta}{2}$ (a maximal packing is also a covering).
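To make the procedure concrete, here is a truncated sketch (my own illustration, not the paper's implementation: the packings are built greedily from Monte Carlo samples of $K$, and the loop over all $k \in \mathbb{N}$ is cut off at a finite number of levels):

```python
import numpy as np

def greedy_packing(points, delta):
    """Greedy maximal delta-packing of a finite sample."""
    selected = []
    for p in points:
        if all(np.linalg.norm(p - q) >= delta for q in selected):
            selected.append(p)
    return selected

def iterative_packing_estimator(Y, sample_K, d, C=3.0, levels=20, n_samples=5000):
    """Truncated version of the iterative packing algorithm."""
    nu = sample_K(1)[0]                    # arbitrary starting point of K
    for k in range(1, levels + 1):
        radius = d / 2 ** (k - 1)          # ball radius at level k
        scale = d / (2 ** k * (C + 1))     # packing scale d / (2^k (C + 1))
        pts = sample_K(n_samples)
        pts = pts[np.linalg.norm(pts - nu, axis=1) <= radius]
        Mk = greedy_packing(pts, scale)
        if Mk:                             # reassign nu to the packing point closest to Y
            nu = min(Mk, key=lambda p: np.linalg.norm(Y - p))
    return nu

# Usage with K the unit ball in R^2 (diameter d = 2).
rng = np.random.default_rng(2)
def sample_ball(m, dim=2):
    x = rng.normal(size=(m, dim))
    r = rng.random(m) ** (1 / dim)
    return x * (r / np.linalg.norm(x, axis=1))[:, None]

mu = np.array([0.3, -0.4])
Y = mu + 0.1 * rng.normal(size=2)
print(iterative_packing_estimator(Y, sample_ball, d=2.0))
```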
Measurability
We argue that $\nu^*$ is a valid statistic by showing that it is a measurable function of the data. Let $\Upsilon_k : \mathbb{R}^n \to \mathbb{R}^n$ return the estimator at level $k$; that is, $\Upsilon_k(y)$ is the point of $M_k$ closest to $y$, where $M_k$ is the packing built around the level-$(k-1)$ point $\Upsilon_{k-1}(y)$.
(2) The function $\nu^* : \mathbb{R}^n \to \mathbb{R}^n$ is measurable with respect to the Borel $\sigma$-algebra.
First, consider the event that $\Upsilon_k(y)$ belongs to a packing set $M$. For any point $m \in M$, the set $\{y : \Upsilon_k(y) = m\}$ is the set of all $y$ such that $\Upsilon_k(y) \in M$ and $y$ belongs to the cell of $m$ in the Voronoi tessellation generated by $M$.
This cell is a convex polytope, and a convex polytope is a countable combination of closed sets, making the inverse image of a singleton, $\Upsilon_k^{-1}(\{m\})$, a Borel set.
Now, since $M_k$ is a packing of a bounded set, it is a finite set. This implies that $\Upsilon_k$ takes finitely many values, and since the $\sigma$-algebra on its range is generated by singletons, $\Upsilon_k$ is a measurable function.
Note that for each $y$ the sequence $\{\Upsilon_k(y)\}$ converges because it is Cauchy: for any $k_1 < k_2$,
\begin{align*}
\lVert\Upsilon_{k_1} - \Upsilon_{k_2}\rVert_2 &= \lVert\Upsilon_{k_1} - \Upsilon_{k_1 + 1} + \Upsilon_{k_1+1} - \ldots - \Upsilon_{k_2}\rVert_2\\
&\le \sum_{i=k_1}^{k_2-1}\lVert\Upsilon_{i} - \Upsilon_{i+1}\rVert_2\\
&\le \sum_{i=k_1}^{k_2 -1}\frac{d}{2^{i}} \le \frac{d}{2^{k_1 - 1}}
\end{align*}
The inequalities above follow from the triangle inequality and the fact that $\Upsilon_{i+1}$ lies in the ball $B\left(\Upsilon_i, \frac{d}{2^{i}}\right)$ at level $i+1$.
Note that our estimator is attained in the limit, i.e.,
\begin{align*}
\nu^* &= \lim_{k \to \infty }\Upsilon_k = \limsup_{k \to \infty} \Upsilon_k.
\end{align*}
So, to see that $\nu^*$ is measurable, it suffices to show that for any closed box $B$, $(\nu^*)^{-1}(B)$ is a Borel set.
If $B_j^L$ and $B_j^U$ are the lower and upper bounds of $B$ in the $j$th coordinate, then our problem reduces to proving that $\bigcap_{j=1}^n \left\{y : \limsup_{k \to \infty}\Upsilon_k^j(y) \in [B_j^L, B_j^U]\right\}$ is Borel.
Each set in the above intersection can be written as follows.
\begin{align*}
\{y: \limsup_{k \to \infty} \Upsilon_k^j (y) \in [B_j^L, B_j^U]\} &= \{y : \limsup_{k \to \infty} \Upsilon_k^j(y) \ge B_j^L\} \cap \{y:\limsup_{k \to \infty} \Upsilon_k^j(y) \le B_j^U\} \\
&=: P \cap Q
\end{align*}
From the characterization of the limit superior, $P = \bigcap_{l=1}^\infty\bigcap_{k=1}^\infty\bigcup_{i=k}^\infty\left\{y : \Upsilon_i^j(y) > B_j^L - \frac{1}{l}\right\}$, which is measurable since each $\Upsilon_i$ is measurable and this is a countable combination of measurable sets.
For $Q$, we use a property of the limsup: if $a > \limsup_{k \to \infty}\Upsilon_k^j(y)$, then there exists an $N \in \mathbb{N}$ such that $\Upsilon_k^j(y) < a$ for all $k \ge N$. Applying this with $a = B_j^U + \frac{1}{l}$ for every $l$ translates in set notation to $Q = \bigcap_{l=1}^\infty\bigcup_{k=1}^\infty\bigcap_{i=k}^\infty\left\{y : \Upsilon_i^j(y) < B_j^U + \frac{1}{l}\right\}$, which is again measurable.
Mapping to the Binary Hypothesis Testing Problem
We know from the previous section that $\nu^*$ is close to $Y$. To understand whether it is also close to $\mu$, we pose this as a binary hypothesis testing problem and show that, within a $\delta$-packing, the Type 1 and Type 2 errors are small.
(3) Consider the hypotheses $H_0: \mu = \nu_1$ and $H_1: \mu = \nu_2$ for $\lVert\nu_1 - \nu_2\rVert_2 \ge C\delta$, where $C > 2$. Then the test $\psi(Y) = \mathbb{1}_{\lVert Y - \nu_1\rVert_2 > \lVert Y - \nu_2\rVert_2}$ satisfies the following:
\begin{align*}
\max \left(\sup_{\mu:\lVert\mu - \nu_1 \rVert_2 \le \delta} \mathbb{P}_\mu(\psi=1) ,\sup_{\mu:\lVert \mu -\nu_2 \rVert_2 \le \delta}\mathbb{P}_\mu(\psi=0)\right)\le \exp \left(- \frac{(C-2)^2\delta^2}{8\sigma^2}\right).
\end{align*}
We demonstrate the result for Type 1 error, and the same argument holds for Type 2.
A Type 1 error occurs when $\lVert\mu - \nu_1\rVert_2 \le \delta$ but $\psi(Y) = 1$, which holds if $\lVert Y - \nu_1\rVert_2^2 - \lVert Y - \nu_2\rVert_2^2 \ge 0$, i.e., $Y$ is closer to $\nu_2$.
Next, note that the random variable $Z := \lVert Y - \nu_1\rVert_2^2 - \lVert Y - \nu_2\rVert_2^2$ is normally distributed, so the error can be bounded by the tail probability of a Gaussian random variable. Expanding the squares with $Y = \mu + \xi$ gives
$Z = \lVert\nu_1\rVert_2^2 - \lVert\nu_2\rVert_2^2 + 2(\mu + \xi)^T(\nu_2 - \nu_1)$. Letting $\mu = \nu_1 + \eta$ with $\lVert\eta\rVert_2 \le \delta$ gives $Z = \lVert\nu_1\rVert_2^2 - \lVert\nu_2\rVert_2^2 + 2(\nu_1 + \eta + \xi)^T(\nu_2 - \nu_1)$, which equals $-\lVert\nu_1 - \nu_2\rVert_2^2 + 2(\eta + \xi)^T(\nu_2 - \nu_1)$.
Now, by the Cauchy-Schwarz inequality, $\eta^T(\nu_2 - \nu_1) \le \lVert\eta\rVert_2\lVert\nu_2 - \nu_1\rVert_2 \le \delta\lVert\nu_2 - \nu_1\rVert_2$.
From the separation assumption, $\delta \le \lVert\nu_1 - \nu_2\rVert_2/C$, giving that
$\eta^T(\nu_2 - \nu_1) \le \frac{\lVert\nu_1 - \nu_2\rVert_2^2}{C}$. Since $\xi \sim N(0, \sigma^2 I)$, by the stability of Gaussians, $Z \sim N(\theta, 4\sigma^2\lVert\nu_1 - \nu_2\rVert_2^2)$, where $\theta \le \left(\frac{2}{C} - 1\right)\lVert\nu_1 - \nu_2\rVert_2^2$.
Finally, applying the tail probability bound on Gaussians concludes the proof.
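A quick Monte Carlo sanity check of the lemma (my own sketch; the dimension and points are arbitrary, and $\mu$ is placed at the worst-case position, pushed from $\nu_1$ toward $\nu_2$):

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma, delta, C = 20, 1.0, 0.5, 4.0

nu1 = np.zeros(n)
nu2 = np.zeros(n); nu2[0] = C * delta  # so that ||nu1 - nu2||_2 = C * delta

eta = np.zeros(n); eta[0] = delta      # worst case: mu shifted toward nu2
mu = nu1 + eta                         # mu lies in the null ball {||mu - nu1|| <= delta}

Y = mu + sigma * rng.normal(size=(100000, n))
psi = np.linalg.norm(Y - nu1, axis=1) > np.linalg.norm(Y - nu2, axis=1)

print("empirical Type 1 error:", psi.mean())
print("bound:", np.exp(-(C - 2) ** 2 * delta ** 2 / (8 * sigma ** 2)))
```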
We now show that the above result extends to all points of a maximal $\delta$-packing $M$.
(4) Let $i^* = \operatorname{argmin}_{i \in M}\lVert Y - \nu_i\rVert_2$. Then, with probability at least $1 - \lvert M\rvert \cdot \exp\left(-\frac{(C-2)^2\delta^2}{8\sigma^2}\right)$, we have
$\lVert\nu_{i^*} - \mu\rVert_2 \le (C+1)\delta$.
The ingredients of this proof are the triangle inequality, characterizing the event as the Type 1 error of the test in lemma 3, and a union bound.
First, the author defines an intermediate random variable $T_i$: the maximum distance between $\nu_i$ and another point $\nu_j$ of the packing set such that $\nu_j$ is closer to $Y$ than $\nu_i$, i.e., $T_i = \max_j \lVert\nu_i - \nu_j\rVert_2$ over all $j$ with $\lVert Y - \nu_i\rVert_2 \ge \lVert Y - \nu_j\rVert_2$ and $\lVert\nu_i - \nu_j\rVert_2 \ge C\delta$. If no such $j$ exists, then $T_i = 0$.
Analysing the probability in the statement, we can see that
\begin{align*}
\mathbb{P}\{\lVert \nu_{i^*} - \mu \rVert_2\ge(C+1)\delta \} &\le \mathbb{P}\{\lVert\nu_{i^*} - \nu_i\rVert_2 + \lVert \nu_i - \mu\rVert_2 \ge (C+1)\delta\}.
\end{align*}
Since $M$ is a maximal $\delta$-packing, it is also a $\delta$-covering, and thus $\lVert\mu - \nu_i\rVert_2 \le \delta$ for some $i$. So the above probability is at most $\mathbb{P}\{\lVert\nu_{i^*} - \nu_i\rVert_2 + \delta \ge (C+1)\delta\} = \mathbb{P}\{i^* \in \{j : \lVert\nu_i - \nu_j\rVert_2 \ge C\delta\}\}$. Now, if $i^*$ belongs to the set of all $j$ with $\nu_j$ at least $C\delta$ away from $\nu_i$, then $T_i \ge C\delta > 0$, since $\nu_{i^*}$ is by definition at least as close to $Y$ as $\nu_i$. The probability that $T_i > 0$ is the probability that there exists a $j$ satisfying the conditions in the definition of $T_i$, and for each such $j$ this is exactly the probability of a Type 1 error in lemma 3. A union bound over all $j \in M$ gives the desired result.
Before proving the rate, let us briefly state a result that will be used heavily in the next two theorems.
(5) The function $\varepsilon \to M^{loc}(\varepsilon)$ is monotone non-increasing.
The above can be proved by showing that, for any $\varepsilon' < \varepsilon$, every $\frac{\varepsilon}{c}$-packing of $K \cap B(\theta, \varepsilon)$ can be mapped, via an affine contraction toward $\theta$, to an $\frac{\varepsilon'}{c}$-packing of $K \cap B(\theta, \varepsilon')$ of the same cardinality; convexity of $K$ ensures the contracted points stay in $K$.
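Explicitly, if $\{x_1, \ldots, x_m\}$ is an $\frac{\varepsilon}{c}$-packing of $K \cap B(\theta, \varepsilon)$, the contraction $T(x) = \theta + \frac{\varepsilon'}{\varepsilon}(x - \theta)$ maps each $x_i$ into $K \cap B(\theta, \varepsilon')$ (convexity of $K$ guarantees $T(x_i) \in K$), and
\begin{align*}
\lVert T(x_i) - T(x_j)\rVert_2 &= \frac{\varepsilon'}{\varepsilon}\lVert x_i - x_j\rVert_2 \ge \frac{\varepsilon'}{c},
\end{align*}
so the images form an $\frac{\varepsilon'}{c}$-packing of $K \cap B(\theta, \varepsilon')$ and $M^{loc}(\varepsilon') \ge M^{loc}(\varepsilon)$.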
The next theorem is important to get the $(\varepsilon^*)^2$ part of the rate.
(6) If $\nu^*$ is returned by the previously defined algorithm, then $\mathbb{E}\lVert\mu - \nu^*(Y)\rVert_2^2 \le \bar{C}(\varepsilon^*)^2$,
where $\varepsilon^* = \varepsilon_J$ for the maximal $J \in \mathbb{N}$ such that $\varepsilon_J = \frac{d}{2^{J-2}}\cdot\frac{c/2-3}{c}$ satisfies $\frac{\varepsilon_J^2}{\sigma^2} > 16\log M^{loc}\left(\frac{\varepsilon_J\, c}{c/2-3}\right) \vee 16\log 2$, and $\varepsilon^* = \varepsilon_1$ if no such $J$ exists.
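The definition is easy to operationalize once an upper bound on the local entropy is available. A minimal sketch (my own, not the paper's code; `log_mloc` is a hypothetical user-supplied function):

```python
import numpy as np

def eps_star(d, sigma, c, log_mloc, J_max=60):
    """Return eps_J for the maximal J satisfying the entropy condition,
    or eps_1 if no such J exists; log_mloc(r) should bound log M^loc(r)."""
    eps = lambda J: d / 2 ** (J - 2) * (c / 2 - 3) / c
    best = None
    for J in range(1, J_max + 1):
        threshold = 16 * max(log_mloc(eps(J) * c / (c / 2 - 3)), np.log(2))
        if eps(J) ** 2 / sigma ** 2 > threshold:
            best = eps(J)  # keep the eps_J of the largest such J
    return best if best is not None else eps(1)

# Example with a constant local entropy bound (e.g., a ball has log M^loc ~ n).
print(eps_star(d=10.0, sigma=1.0, c=16.0, log_mloc=lambda r: 1.0))
```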
Throughout the proof, we attempt to simplify the numerous variables used to define $\varepsilon^*$. To apply lemma 4, it is important to see that $M_j$ is a maximal packing at every level $j$ up to $J$. Applying the lemma at each level and adjusting the constant $c$ in $M^{loc}$ gives us the following:
\begin{align*}
\mathbb{P}\{\lVert\mu - \Upsilon_J\rVert_2 \ge \frac{d}{2^{J-1}}\} &\le \sum_{j=1}^{J-1} \lvert M_j \rvert \exp\left(-\frac{(C-2)^2d^2}{2^{2j}(C+1)^2 8\sigma^2}\right).
\end{align*}
Since $\varepsilon \to M^{loc}(\varepsilon)$ is monotone non-increasing by lemma 5, this probability is bounded from above by $M^{loc}\left(\frac{d}{2^{J-2}}\right)\sum_{j=1}^{J-1}\exp\left(-\frac{(C-2)^2d^2}{2^{2j}(C+1)^2\, 8\sigma^2}\right)$. For $J > 1$, this is further bounded by $M^{loc}\left(\frac{d}{2^{J-2}}\right)\frac{a}{1-a}$, where $a = \exp\left(-\frac{(C-2)^2d^2}{2^{2(J-1)}(C+1)^2\, 8\sigma^2}\right)$ is the last (and largest) term of the summation.
Let $\varepsilon_J = \frac{d}{2^{J-1}}\cdot\frac{C-2}{C+1}$, so that the exponent of $a$ is $\frac{\varepsilon_J^2}{8\sigma^2}$; setting $c = 2(C+1)$ then matches the definition of $\varepsilon_J$ in the theorem. Now, assuming $a < 1/2$ and that $2\log M^{loc}\left(\frac{d}{2^{J-1}}\right)$ is bounded by the exponent of $a$, i.e., $2\log M^{loc}\left(\frac{d}{2^{J-1}}\right) \le \frac{\varepsilon_J^2}{8\sigma^2}$, we get:
\begin{align*}
\mathbb{P}\{\lVert\mu - \Upsilon_J\rVert_2 \ge \frac{d}{2^{J-2}}\} \le 2\exp\left(-\frac{\varepsilon_J^2}{16\sigma^2}\right)
\end{align*}
To show that $\Upsilon_J$ is not too far from $\nu^*$, we apply the triangle inequality and the fact that the sequence $\{\Upsilon_k\}$ is Cauchy, which implies $\lVert\Upsilon_J - \nu^*\rVert_2 = \lVert\Upsilon_J - \lim_{k \to \infty}\Upsilon_k\rVert_2 \le \frac{d}{2^{J-1}}$. Therefore, we get the following with probability $\ge 1 - 2\exp\left(-\frac{\varepsilon_J^2}{16\sigma^2}\right)$:
\begin{align*}
\lVert\nu^* - \mu\rVert_2 &\le \lVert\nu^* - \Upsilon_J\rVert_2 + \lVert \Upsilon_J - \mu\rVert_2 \le 3\frac{d}{2^{J-1}} = 3 \varepsilon_J \frac{C+1}{C-2}
\end{align*}
Let $J^*$ be the maximum $J > 1$ such that the assumption on $2\log M^{loc}\left(\frac{d}{2^{J-1}}\right)$ holds, and set $J^* = 1$ otherwise; this in turn implies the condition on $\frac{\varepsilon_J^2}{\sigma^2}$ in the definition of $\varepsilon^*$. The author then shows that for any $x \ge \varepsilon^*$, $\mathbb{P}\{\lVert\mu - \nu^*\rVert_2 \ge C'x\} \le C\exp(-C''x^2/\sigma^2)\mathbb{1}_{J^* > 1}$, using the fact that $x \to C\exp(-C''x^2/\sigma^2)$ is monotone decreasing.
Finally, we get the upper bound on the expectation by integrating the tail probability: $\mathbb{E}\lVert\mu - \nu^*\rVert_2^2 = \int_0^\infty 2x\,\mathbb{P}\{\lVert\mu - \nu^*\rVert_2 \ge x\}\,dx$. From the above analysis, $\lVert\mu - \nu^*\rVert_2$ is bounded by $C'\varepsilon^*$ with high probability. Thus, we get the following:
\begin{align*}
\mathbb{E}\lVert\mu - \nu^*\rVert_2^2 &\le C_1(\varepsilon^*)^2 + \int_{C'\varepsilon^*}^\infty 2x \exp(-C''x^2/\sigma^2) \mathbb{1}_{J^* > 1}\, dx \\
&= C_1(\varepsilon^*)^2 + C_2\sigma^2\exp(-C_3(\varepsilon^*)^2/\sigma^2)\mathbb{1}_{J^* > 1}
\end{align*}
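For completeness, the tail integral above evaluates in closed form, since the integrand is (up to constants) the derivative of the exponential:
\begin{align*}
\int_{C'\varepsilon^*}^\infty 2x\exp(-C''x^2/\sigma^2)\,dx &= \frac{\sigma^2}{C''}\exp\left(-C''(C'\varepsilon^*)^2/\sigma^2\right),
\end{align*}
which is where the $C_2\sigma^2\exp(-C_3(\varepsilon^*)^2/\sigma^2)$ term comes from.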
Since $(\varepsilon^*)^2/\sigma^2$ is bounded below by a constant when $J^* > 1$ (from the defining condition of $\varepsilon_J$), the second term is dominated by a constant multiple of $(\varepsilon^*)^2$, giving the desired result.
3. Proof of the Minimax Rate
In the concluding section, we show how the minimax rate of $(\varepsilon^*)^2 \wedge \operatorname{diam}(K)^2$ is obtained.
(7) If $\varepsilon^*$ is now defined as $\sup\left\{\varepsilon : \frac{\varepsilon^2}{\sigma^2} \le \log M^{loc}(\varepsilon)\right\}$, then the minimax rate is $(\varepsilon^*)^2 \wedge \operatorname{diam}(K)^2$.
Consider the following two cases:
$\frac{(\varepsilon^*)^2}{\sigma^2} > 16\log 2$: The lower and upper bounds in this case are both of order $(\varepsilon^*)^2$.
Lower Bound: If $\varepsilon^*$ is the supremum of the above set, then $\varepsilon^*/4$ lies in the set, and the monotonicity of $M^{loc}$ gives $\log M^{loc}(\varepsilon^*/4) \ge \frac{(\varepsilon^*)^2}{4\sigma^2} = \frac{(\varepsilon^*)^2}{8\sigma^2} + \frac{(\varepsilon^*)^2}{8\sigma^2} \ge 2\log 2 + \frac{2(\varepsilon^*/4)^2}{\sigma^2}$, where the last step uses the case assumption. Thus $\varepsilon^*/4$ satisfies the sufficient condition of lemma 1, giving a lower bound of order $(\varepsilon^*)^2$.
Upper Bound: Since $\varepsilon^*$ is the supremum of the set, $2\varepsilon^*$ does not lie in it, so $\frac{(2\varepsilon^*)^2}{\sigma^2} \ge \log M^{loc}(2\varepsilon^*)$ and, for any $C > 1$, $\frac{C(2\varepsilon^*)^2}{\sigma^2} \ge C\log M^{loc}(2\varepsilon^*)$. This in turn is bounded below by $C\log M^{loc}(2C\varepsilon^*)$ from the monotonicity of $M^{loc}$. Placing $C = 16$ shows that $2C\varepsilon^*$ satisfies the sufficient condition of theorem 6, giving an upper bound of order $(\varepsilon^*)^2$.
$\frac{(\varepsilon^*)^2}{\sigma^2} \le 16\log 2$: The rate in this case is proportional to $\operatorname{diam}(K)^2 := d^2$.
Again, $2\varepsilon^*$ is not in the set, and hence $\log M^{loc}(2\varepsilon^*) \le \frac{4(\varepsilon^*)^2}{\sigma^2}$, which from the case assumption is upper bounded by $64\log 2$. This means that $\exp(64\log 2)$ points are enough to pack $K$ locally at scale $2\varepsilon^*$. Further, note that if $c$ is large enough, i.e., the packing radius is small enough, then a diameter much longer than $2\varepsilon^*$ would allow placing more than $\exp(64\log 2)$ equidistant points along it, a contradiction. This implies that $K$ lies entirely in a ball of radius of order $2\varepsilon^* \le \sqrt{64\log 2}\,\sigma$. The author notes that it is not necessary to take the supremum of the set; instead, choosing $\varepsilon$ proportional to $d$ suffices. This ensures that $\frac{\varepsilon^2}{\sigma^2} = \frac{d^2}{\sigma^2} \le \frac{(4\varepsilon^*)^2}{\sigma^2}$ is upper bounded by a constant (from the case assumption), while $M^{loc}(\varepsilon)$ can be made larger than a constant by placing
$\theta$ at the center of a diameter of $K$ and constructing a packing set from equidistant points along the diameter. This method gives matching upper and lower bounds of $d^2$ up to constant factors.
Combining the previous two cases gives the desired minimax rate.
4. Discussion and Open Directions
The assumption of convexity is ubiquitous and occurs in applications like isotonic regression, where $K$ is a convex cone, and linear prediction with correlated designs, where $K$ is an ellipsoid (Guntuboyina and Sen, 2018).
The author thus gives a minimax optimal estimator for the general problem rather than for a $K$ of a specific shape. In the remaining part of the paper, he also shows how $\varepsilon^*$ can be calculated for specific convex bodies like hyperrectangles and ellipsoids.
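As a sanity check in the same spirit (my own example, not reproduced from the paper): if $K = B(0, R) \subset \mathbb{R}^n$, a standard volume argument gives $\log M^{loc}(\varepsilon) \asymp n$ for $\varepsilon \le R$, since packing $B(\theta, \varepsilon) \cap K$ at scale $\frac{\varepsilon}{c}$ costs roughly $c^n$ points. Solving $\frac{\varepsilon^2}{\sigma^2} \asymp n$ gives $\varepsilon^* \asymp \sigma\sqrt{n}$, so
\begin{align*}
(\varepsilon^*)^2 \wedge \operatorname{diam}(K)^2 &\asymp \sigma^2 n \wedge R^2,
\end{align*}
recovering the classical rate for estimating a mean vector constrained to a ball.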
There are two major concepts in his proof: the boundedness of the local entropy and convexity. Convexity of $K$ plays an important role in showing that the local entropy is a non-increasing function of $\varepsilon$, which in turn is applied in bounding $\mathbb{E}\lVert\nu^* - \mu\rVert_2^2$ from above. A question to think about is how to extend this to other constraints on $K$.
Finally, another open question is to understand if it is possible to get a computationally tractable estimator that achieves this minimax rate.
Wainwright, Martin J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.
Guntuboyina, Adityanand and Sen, Bodhisattva (2018). Nonparametric shape-restricted regression. Statistical Science.