BTS III: Some loose ends

This is the last installment of the series sparked off by the SMO Open Q4 problem, which led us to introduce some basic learning theory.

If you're deciding whether to read this, I've split this post up into roughly two parts:

  • the first part is about dealing with a certain flavor of inequalities, where you are averaging a maximum over all $2^n$ combinations of $\pm$ signs;
  • the second part is some conventional ML theory content tying generalization error to the Rademacher complexity of loss functions.

I would suggest reading just the first part unless you are really dying to see where Rademacher complexity comes from.

\newcommand\lrangle[1]{\left\langle #1 \right\rangle} \newcommand \E {\mathbb E} \newcommand \cH {\mathcal H} \newcommand \Var {\mathrm{Var}} \newcommand \MSE {\mathrm{MSE}} \newcommand \cR {\mathcal R} \newcommand \cS {\mathcal S} \newcommand \cF {\mathcal F}

Part 1: Dealing with $\pm$ inequalities

Recap: Rademacher complexity

From Part I, we met the mysterious quantity termed the Rademacher complexity:

$$\cR(\cS) := \E_\sigma \max_{y\in \cS} \lrangle{\sigma,y}, \quad \sigma\sim \{\pm 1\}^n$$

and the corresponding version for functions: $$\cR(\cF) := \E_\sigma \max_{f\in \cF} \sum_{i=1}^n \sigma_i f(t_i), \quad \sigma\sim \{\pm 1\}^n.$$

In words, we want to pick the best element of $\cS$ (or $\cF$) given a (uniformly) random choice of signs (which is expressed by $\sigma\sim \{\pm 1\}^n$).
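If you prefer to see the definition in code, here is a minimal Monte Carlo sketch of my own (the set is represented as a matrix whose rows are its elements; the function name and the example set are arbitrary choices for illustration):

```python
import numpy as np

def rademacher_complexity(S, num_trials=10_000, seed=0):
    """Monte Carlo estimate of E_sigma max_{y in S} <sigma, y>.

    S is an (m, n) array whose m rows are the elements of the finite set.
    """
    rng = np.random.default_rng(seed)
    n = S.shape[1]
    # Draw uniformly random sign vectors sigma in {-1, +1}^n, one per trial.
    sigma = rng.choice([-1.0, 1.0], size=(num_trials, n))
    # For each sigma, pick the best element of S, then average over sigma.
    return (sigma @ S.T).max(axis=1).mean()

# Example: 5 random unit vectors in R^20.
rng = np.random.default_rng(1)
S = rng.standard_normal((5, 20))
S /= np.linalg.norm(S, axis=1, keepdims=True)
print(rademacher_complexity(S))
```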

In general, these aren't easy to compute. You can try your hand at showing the following, which corresponds to the special case $\cS = \{-x, x\}$:

(Khinchin's inequality) Show that $C_1\|x\|_2 \le \E_{\sigma\sim \{\pm 1\}^n} |\lrangle{\sigma, x}| \le C_2 \|x\|_2$ for some fixed $C_2\ge C_1 > 0$ across all $n$ (the number of dimensions). Here, $\|x\|_2$ is the 2-norm $\sqrt{x_1^2 + ... + x_n^2}$.

In fact, $C_2 = 1$ works by taking $\E_\sigma |\lrangle{\sigma, x}| \le \left(\E_\sigma [\lrangle{\sigma, x}^2]\right)^{1/2} = \|x\|_2$. The hard part is the left inequality.
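Here is a quick sanity check you can run (a sketch of mine, not part of the problem): for small $n$ we can enumerate all $2^n$ sign vectors exactly and watch the ratio $\E_\sigma|\lrangle{\sigma,x}|/\|x\|_2$ stay pinned between two constants.

```python
import itertools
import numpy as np

def exact_mean_abs(x):
    """E over all 2^n sign vectors of |<sigma, x>|, computed by brute force."""
    n = len(x)
    total = sum(abs(np.dot(signs, x))
                for signs in itertools.product([-1.0, 1.0], repeat=n))
    return total / 2**n

rng = np.random.default_rng(2)
for n in [2, 5, 10, 15]:
    x = rng.standard_normal(n)
    ratio = exact_mean_abs(x) / np.linalg.norm(x)
    print(n, round(ratio, 4))  # stays bounded; typically drifts towards sqrt(2/pi) ~ 0.80
```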

(Fun fact: multiple X-men failed this problem when I gave it to them, but my thesis professor eyepowered it in 5 minutes. Turns out theory and experience do help sometimes!)

I hope this convinces you that the straggling fact from part I might be tricky to prove. If you forgot what it was, here it is again:

(Massart's lemma) For finite $\cS'$, one has $$\cR(\cS') \le C\cdot \left(\max_{z\in \cS'} \|z\|_2\right) \cdot \sqrt{\log |\cS'|}.$$

But first, let me explain why the square root of the log is there. (Maybe it's not surprising because this also showed up in mean estimation!)

The Gaussian heuristic

A very nice reason to suspect such inequalities are true comes from the following idea:

$\sum_{i=1}^n \sigma_i x_i$ behaves roughly like a Gaussian/normal random variable with variance $\sum x_i^2$.

The heuristic is based on the guess that the central limit theorem will kick in even if the random variables being added up are not identically distributed, as long as they are independent. (This is only partly true - for example, if a single $x_i$ dominates in size then clearly this isn't the case!)

For Khinchin's inequality, applying this heuristic tells us that we expect $\E_\sigma |\lrangle{\sigma, x}| \approx \|x\|_2 \cdot \E|Z| = \sqrt{\tfrac{2}{\pi}}\cdot \|x\|_2$, where $Z$ is a standard normal random variable.

What about Massart's lemma? Well, we extend our heuristic a little, and assume that

For each $x$, $\sum_{i=1}^n \sigma_i x_i$ behaves like an independent normal with variance $\sum x_i^2$.

Here, we're banking on the fact that in high dimensions, most $x$'s are orthogonal to each other. When two such $x$'s are actually orthogonal, one can check that the covariance of the corresponding sums is 0, so in the limit this heuristic makes sense.

Thus a very related problem is to estimate the expected value of the maximum of $n$ independent standard normals $Z_1,...,Z_n$. This turns out to be roughly $\sqrt{2\log n}$, so we do have a good reason to suspect that the inequality is true!
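A quick simulation (my own sketch, with arbitrary trial counts) shows how close the expected maximum of $n$ independent standard normals gets to $\sqrt{2\log n}$; the agreement is only asymptotic, so expect lower-order corrections at these sizes.

```python
import numpy as np

rng = np.random.default_rng(3)
trials = 1_000
for n in [10, 100, 1_000, 10_000]:
    # Average the maximum of n independent standard normals over many trials.
    empirical = rng.standard_normal((trials, n)).max(axis=1).mean()
    print(n, round(empirical, 3), "vs", round(np.sqrt(2 * np.log(n)), 3))
```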

Even if you don't buy this heuristic, you can still get pretty close using a union bound: let's say all we know is that each $Z_i$ is normal with variance $1$, but we know nothing about whether they are independent. Then the probability that the maximum is at least $t$ is at most $n\cdot \Pr(|Z_i| \ge t) \le n\cdot \exp(-t^2/2)$, so once $t$ is a bit larger than $\sqrt{2\log n}$, the probability that the maximum is at least $t$ is exponentially small.

Remark. The union bound gives the right answer here because we're measuring rare events (a single normal deviating from its mean by a lot), so the higher-order terms (e.g. intersections of two or more events) don't matter as much.
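To see the union bound in action, here's a small simulation sketch of mine (independent normals, arbitrary choices of $n$ and $t$): the empirical tail probability sits below $n\cdot\Pr(|Z|\ge t)$, which in turn sits below the cruder $n\cdot e^{-t^2/2}$.

```python
import math
import numpy as np

rng = np.random.default_rng(4)
n, t, trials = 100, 3.5, 100_000

# Empirical probability that the max of n standard normals exceeds t (they are
# independent here, though the union bound needs no independence), vs. the bounds.
maxima = rng.standard_normal((trials, n)).max(axis=1)
print("empirical:        ", (maxima >= t).mean())
print("n * P(|Z| >= t):  ", n * math.erfc(t / math.sqrt(2)))
print("n * exp(-t^2 / 2):", n * math.exp(-t**2 / 2))
```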

The secret sauce

The secret sauce is the following function, sometimes called the moment generating function: $$M_X(\lambda) = \E[e^{\lambda X}].$$

This plays really well with the sum of independent random variables, since by independence we have $$M_{X_1+X_2}(\lambda) = \E[e^{\lambda (X_1+X_2)}] = \E[e^{\lambda X_1}]\cdot \E[e^{\lambda X_2}] = M_{X_1}(\lambda)\cdot M_{X_2}(\lambda).$$

What's more, under certain conditions the moment generating function determines the underlying random variable, simply because all moments are captured by this function - if we Taylor expand it, we get $$M_X(\lambda) = 1 + \E[X]\cdot \lambda + \frac{\E[X^2]}{2} \cdot \lambda^2 + \cdots.$$

For instance, the CLT is really the fact that if $X_i$ has mean 0, then $\frac{1}{\sqrt n} (X_1+...+X_n)$ approaches the normal distribution because $$\left(\E[e^{\lambda X / \sqrt{n}}]\right)^n \approx \left(1 + \frac{\lambda^2}{2n} \Var(X)\right)^n \to e^{\lambda^2\Var(X) / 2}$$

where $e^{\lambda^2/2}$ is precisely the moment generating function of a standard normal.

So, to do Massart's lemma for real, we just have to think about controlling $$\E_\sigma\left[e^{\lambda \max_{z\in \cS'}\lrangle{\sigma, z}}\right].$$ A really silly trick that makes our life way easier is that $$e^{\lambda \max_{z\in \cS'}\lrangle{\sigma, z}} \le \sum_{z\in \cS'} e^{\lambda \lrangle{\sigma,z}}.$$

Another silly trick is that $\frac{e^x + e^{-x}}{2} \le e^{x^2/2}$, which you could check by Taylor expanding and then comparing term-by-term. Applying this coordinate-wise (using the independence of the $\sigma_i$) gives $\E_\sigma e^{\lambda\lrangle{z, \sigma}} \le \exp(\lambda^2\|z\|^2/2)$. Putting it all together, $$\E_\sigma\left[e^{\lambda\max_{z\in \cS'}\lrangle{\sigma, z}}\right] \le |\cS'|\cdot \exp\left(\lambda^2 \max_{z\in\cS'}\|z\|^2/2\right).$$

Now, we need a third trick to extract out the "first moment", which is slightly more serious. By Jensen's inequality, we have $$\exp\left(\lambda\E_\sigma \max_{z\in \cS'}\lrangle{\sigma, z}\right) \le \E_\sigma\left[e^{\lambda\max_{z\in \cS'}\lrangle{\sigma, z}}\right]$$ which we can rewrite as $$\E_\sigma\max_{z\in \cS'}\lrangle{\sigma, z} \le \frac{1}{\lambda}\log\E_\sigma\left[e^{\lambda\max_{z\in \cS'}\lrangle{\sigma, z}}\right]$$ and noticing that the LHS has no $\lambda$ term, we may slap on an $\inf_{\lambda > 0}$ on the right for free. Putting it all together, $$\begin{align*} \E_\sigma\max_{z\in \cS'}\lrangle{\sigma, z} &\le \inf_{\lambda > 0} \frac{1}{\lambda} \left( \log|\cS'| + \lambda^2 \max_{z\in\cS'}\|z\|^2/2\right)\\ &= \left(\max_{z\in\cS'}\|z\|\right) \cdot \sqrt{2\log |\cS'|}. \end{align*}$$
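As a sanity check on the constant, here is a small Monte Carlo sketch of mine (the random set $\cS'$ and its dimensions are arbitrary): the estimated Rademacher complexity should sit comfortably below $\left(\max_{z}\|z\|\right)\sqrt{2\log|\cS'|}$.

```python
import numpy as np

def rademacher_complexity(S, num_trials=20_000, seed=5):
    """Monte Carlo estimate of E_sigma max_{z in S'} <sigma, z> for a finite set S'."""
    rng = np.random.default_rng(seed)
    sigma = rng.choice([-1.0, 1.0], size=(num_trials, S.shape[1]))
    return (sigma @ S.T).max(axis=1).mean()

rng = np.random.default_rng(6)
n, m = 50, 40                        # dimension and |S'|
S = rng.standard_normal((m, n))      # rows are the elements of S'
lhs = rademacher_complexity(S)
rhs = np.linalg.norm(S, axis=1).max() * np.sqrt(2 * np.log(m))
print(round(lhs, 2), "<=", round(rhs, 2))
```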

Remarks.

(a) Trick #2 can be interpreted as saying that the random variable $\lrangle{\sigma, x}$ is subgaussian - one definition is that its moment generating function is at most that of a corresponding normal distribution, but in general this means that the tail behavior and moments are "better behaved" than those of the corresponding normal distribution.

(b) Trick #3 has its roots in large deviations theory. See Cramér's theorem for more. As practice, you could try to prove Hoeffding's inequality.

Part 2: But why Rademacher complexity?

Recap: Learning theory

Now we go back to learning theory. Recall from Part II that if we're predicting $Y$ based on $X$, we can measure the quality of the prediction function $h$ using the mean squared error $$\MSE(h) := \E[(Y-h(X))^2].$$

Under this definition, the best prediction is the conditional mean $h^{opt}(x)=\E[Y|X=x]$. However, there are two problems:

  • maybe $h^{opt}$ is a very strange function
  • there's no way to evaluate $\E[(Y-h(X))^2]$!

Machine learning makes the following reductions: we will assume that (1) $h$ belongs to some given function class $\cH$, and (2) we optimize $h$ given data $(X^{(i)}, Y^{(i)})$ using the empirical mean squared error $$\hat \MSE(h) := \frac 1 n \sum_{i=1}^n (Y^{(i)}-h(X^{(i)}))^2.$$

One could interpret this as a stochastic estimate of the actual mean squared error $\MSE(h)$, but subtleties emerge once you optimize $h$ using this value itself - in some sense, the optimization process messes up the guarantee that $\hat\MSE(h) \approx \MSE(h)$.
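To make that subtlety concrete, here is a toy sketch (everything in it - the sine target, the noise level, the degree-10 polynomial class - is an arbitrary choice of mine): after minimizing the empirical MSE, the empirical value comes out noticeably smaller than the true MSE estimated on a large fresh sample.

```python
import numpy as np

rng = np.random.default_rng(7)

def sample(size):
    X = rng.uniform(-1, 1, size=size)
    Y = np.sin(3 * X) + 0.3 * rng.standard_normal(size)
    return X, Y

# Minimize the empirical MSE over degree-10 polynomials, using only 30 datapoints.
X_train, Y_train = sample(30)
coeffs = np.polyfit(X_train, Y_train, deg=10)

def mse(X, Y):
    return np.mean((Y - np.polyval(coeffs, X)) ** 2)

X_fresh, Y_fresh = sample(100_000)   # a huge fresh sample approximates the true MSE
print("empirical MSE (optimized on):", round(mse(X_train, Y_train), 3))
print("approximate true MSE:        ", round(mse(X_fresh, Y_fresh), 3))
```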

Machine learning works if eventually the learned function $h$ has $\MSE(h) \to 0$ as the number of datapoints goes to infinity (that is, we eventually learn the correct function).

From last time, we saw that one approach was to have a single upper bound for all $h\in \cH$ simultaneously - that is, we should bound $$\sup_{h\in \cH} \left\{ \hat \MSE(h) - \MSE(h) \right\};$$ then the empirical mean going to zero is as good as the true mean going to zero. (In ML language, this means that the training error going to 0 is good enough!)

However, we saw that the approach used is a somewhat blanket treatment of things - it doesn't take into account the family $\cH$ of functions being optimized across, nor does it care about the actual distribution of the data.

A better bound

Secretly, $\hat\MSE(h)$ is really also a function of the data $D = (X^{(i)}, Y^{(i)})_i$. So maybe we can try instead to bound

$$\E_D \sup_{h\in \cH} \left\{ \hat \MSE(h; D) - \MSE(h) \right\}.$$

The hard part, once again, is dealing with the idealized quantity $\MSE(h)$. But what if we didn't have to? Since $\E_D \hat \MSE(h;D) = \MSE(h)$, we can use a symmetrization trick - by Jensen's inequality, we can upper bound this by switching out the average term for an independently generated random term (using an independent but identically distributed dataset $D'$): $$\E_D \sup_{h\in \cH} \left\{ \hat \MSE(h; D) - \MSE(h) \right\} \le \E_{D,D'} \sup_{h\in \cH} \left\{\hat \MSE(h; D) - \hat \MSE(h; D') \right\}.$$

This is okay, because in general $\sup_{a\in A} (a-x)$ is a convex function in $x$. (Imagine drawing a bunch of linear graphs, and taking the upper envelope.)

Now that this is symmetric (and that we've eliminated the unknown MSE(h)\MSE(h)), we will apply the following (which is pretty easy to check):

(Symmetrization lemma.) If $Z_i(h)$ and $Z_i'(h)$ are independent and identically distributed random variables, then $$\E_{Z,Z'} \sup_{h\in \cH} \sum_{i=1}^n (Z_i(h) - Z_i'(h)) \le 2 \E_{Z,\sigma} \sup_{h\in \cH} \sum_{i=1}^n \sigma_i Z_i(h)$$ where $\sigma \sim \{\pm 1\}^n$ is a random choice of signs independent of $Z$.

If we summarize $(X,Y)$ as a single variable $T$, and we write $\ell(T; h) = (Y-h(X))^2$, the result is that as long as $\cH$ is also symmetric, we must have

$$\E_D \sup_{h\in \cH} \left\{\hat \MSE(h; D) - \MSE(h) \right\} \le 2 \E_{D, \sigma} \sup_{h\in \cH} \frac{1}{n}\sum_{i=1}^n \sigma_i \ell(T^{(i)}; h).$$

In other words, the error from using $\hat\MSE$ instead of $\MSE$ is controlled by (twice) the Rademacher complexity of the class of loss functions. And that is where it shows up!
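Here's a toy numerical check of that inequality, with everything (the data distribution, the five-element hypothesis class of linear predictors, the sample size) made up by me for illustration: both sides are estimated by Monte Carlo, and the left side should come out below the right.

```python
import numpy as np

rng = np.random.default_rng(8)

W = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])   # tiny hypothesis class: h_w(x) = w * x
n = 20                                      # size of each dataset D

def sample(size):
    X = rng.uniform(-1, 1, size=size)
    Y = 0.5 * X + 0.3 * rng.standard_normal(size)
    return X, Y

def losses(X, Y):
    """Squared losses, one row per hypothesis w, one column per datapoint."""
    return (Y[None, :] - W[:, None] * X[None, :]) ** 2

# Approximate the true MSE of each hypothesis with one huge sample.
X_big, Y_big = sample(1_000_000)
true_mse = losses(X_big, Y_big).mean(axis=1)

gap, rad = [], []
for _ in range(2_000):
    X, Y = sample(n)
    L = losses(X, Y)                               # shape (|H|, n)
    gap.append((L.mean(axis=1) - true_mse).max())  # sup_h (empirical MSE - true MSE)
    sigma = rng.choice([-1.0, 1.0], size=n)
    rad.append((L @ sigma / n).max())              # sup_h (1/n) sum_i sigma_i * loss_i(h)

print("E_D sup_h (MSE_hat - MSE):", round(float(np.mean(gap)), 3))
print("2 x Rademacher term:      ", round(2 * float(np.mean(rad)), 3))
```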

There are two sanity checks that we can do here:

  • this bound is indeed sensitive to the distribution of $(X,Y)$ and the function class $\cH$. (Of course, it is still tricky to bound.)
  • remember how we saw that 1-variable functions that are convex have surprisingly low Rademacher complexity? You would be right to guess that in general, convex $n$-variable functions have low complexity, which roughly corresponds to the fact that convex optimization is easy.

Conclusion

I hope you've enjoyed this series! This is a topic that I'm still very eagerly learning about, and it gives a good basis of understanding for machine learning in general. It has been a long-time dream of mine to attempt a "nutshell tutorial" on the core ideas of learning theory, and I'm glad that proposing this problem gave me the opportunity to do it.

If you're interested in learning more, I found a really good book that covers many ML theory topics.
