BTS III: Some loose ends
This is the last installment of the series sparked off by the SMO Open Q4 problem, which led us to introduce some basic learning theory.
If you're deciding whether to read this, here's roughly how the post splits into two parts:
- the first part is about dealing with a certain flavor of inequalities, where you are averaging a maximum over all $2^n$ combinations of $\pm$ signs.
- the second part is some conventional ML theory content, tying the generalization error to the Rademacher complexity of the loss functions.
I would suggest reading just the first part unless you are really dying to see where Rademacher complexity comes from.
$\newcommand\lrangle[1]{\left\langle #1 \right\rangle} \newcommand \E {\mathbb E} \newcommand \cH {\mathcal H} \newcommand \Var {\mathrm{Var}} \newcommand \MSE {\mathrm{MSE}} \newcommand \cR {\mathcal R} \newcommand \cS {\mathcal S} \newcommand \cF {\mathcal F}$
Part 1: Dealing with $\pm$ inequalities
Recap: Rademacher complexity
In Part I, we met a mysterious quantity termed the Rademacher complexity:
$$\cR(\cS) := \E_\sigma \max_{y\in \cS} \lrangle{\sigma,y}, \quad \sigma\sim \{\pm 1\}^n$$
and the corresponding version for functions $$\cR(\cF) := \E_\sigma \max_{f\in \cF} \sum_{i=1}^n \sigma_i f(t_i), \quad \sigma\sim \{\pm 1\}^n.$$
In words, we want to pick the best element of $\cS$ (or $\cF$) given a (uniformly) random choice of signs (which is expressed by $\sigma\sim \{\pm 1\}^n$).
In general, these aren't easy to compute. You can try your hand at showing the following, which corresponds to the special case $\cS = \{-x, x\}$:
(Khinchin's inequality) Show that $$C_1\| x\|_2 \le \E_{\sigma\sim \{\pm 1\}^n} |\lrangle {\sigma, x}| \le C_2 \|x\|_2$$ for some fixed $C_2\ge C_1 > 0$ across all $n$ (the number of dimensions). Here, $\|x\|_2$ is the 2-norm $\sqrt{x_1^2 + ... + x_n^2}$.
In fact, $C_2 = 1$ works by taking $\E_\sigma |\lrangle {\sigma, x}| \le \left(\E_\sigma [\lrangle{\sigma, x}^2]\right)^{1/2} = \|x\|_2$. The hard part is the left inequality.
(Fun fact: multiple X-men failed this problem when I gave it to them, but my thesis professor eyepowered it in 5 minutes. Turns out theory and experience do help sometimes!)
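If you want a feel for the constants before attempting the proof, here is a quick Monte Carlo sanity check (a minimal sketch using numpy; the particular test vectors are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def khinchin_ratio(x, trials=100_000):
    """Estimate E_sigma |<sigma, x>| / ||x||_2 by sampling random sign vectors."""
    sigma = rng.choice([-1.0, 1.0], size=(trials, len(x)))
    return np.abs(sigma @ x).mean() / np.linalg.norm(x)

for x in [np.ones(50), rng.standard_normal(50), np.array([1.0] + [0.01] * 49)]:
    print(round(khinchin_ratio(x), 3))
# All three ratios land between roughly 0.8 and 1, as Khinchin's inequality predicts.
```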
I hope this convinces you that the straggling fact from part I might be tricky to prove. If you forgot what it was, here it is again:
(Massart's lemma) For finite $\cS'$, one has $$\cR(\cS') \le C\cdot \left(\max_{z\in \cS'} \|z\|_2\right) \cdot \sqrt{\log |\cS'|}.$$
But first, let me explain why the square root of the log is there. (Maybe it's not surprising because this also showed up in mean estimation!)
The Gaussian heuristic
A very nice reason to suspect such inequalities are true comes from the following idea:
$\sum_{i=1}^n \sigma_i x_i$ behaves roughly like a Gaussian/normal random variable with variance $\sum x_i^2$.
The heuristic is based on the guess that the central limit theorem will kick in even if the random variables being added are not identically distributed, as long as they are independent. (This is only partly true - for example, if a single $x_i$ dominates, the sum clearly isn't close to Gaussian!)
For Khinchin's inequality, applying this heuristic tells us that we expect $$\E_\sigma |\lrangle {\sigma, x}| \approx \|x\|_2 \cdot \E|Z| = \sqrt{\frac{2}{\pi}}\cdot \|x\|_2,$$ where $Z$ is a standard normal random variable.
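For a concrete instance of that caveat: if $x = (1, 0, \dots, 0)$, then $\lrangle{\sigma, x} = \sigma_1$ is a single sign, so $\E_\sigma|\lrangle{\sigma, x}| = 1 = \|x\|_2$ rather than the heuristic's $\sqrt{2/\pi}\cdot\|x\|_2 \approx 0.80\,\|x\|_2$ - still within the Khinchin constants, but not Gaussian at all.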
What about Massart's lemma? Well, we extend our heuristic a little, and assume that
For each $x$, $\sum_{i=1}^n \sigma_ix_i$ behaves like an independent normal with variance $\sum x_i^2$.
Here, we're banking on the fact that in high dimensions, most $x$'s are nearly orthogonal to each other. When two such $x$'s are exactly orthogonal, one can check that the covariance of the corresponding sums is 0, so in the limit this heuristic makes sense.
Thus a very related problem is to estimate the expected value of the maximum of $n$ independent standard normals $Z_1,\dots,Z_n$. This turns out to be roughly $\sqrt{2\log n}$, so we do have a good reason to suspect that the inequality is true!
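Here's a quick simulation comparing the average maximum of $n$ independent standard normals against $\sqrt{2\log n}$ (a rough numerical aside, not part of any proof):

```python
import numpy as np

rng = np.random.default_rng(1)

for n in [10, 100, 1000, 10000]:
    # Average the maximum of n independent standard normals over many trials.
    Z = rng.standard_normal((500, n))
    print(n, round(Z.max(axis=1).mean(), 2), round(np.sqrt(2 * np.log(n)), 2))
# The empirical maximum grows like sqrt(2 log n), approaching it slowly from below.
```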
Even if you don't buy this heuristic, you can still get pretty close using a union bound: let's say all we know is that each $Z_i$ is normal with variance $V$, but we know nothing about whether they are independent. Then the probability that the maximum is at least $t\sqrt V$ is at most $n\cdot \Pr(|Z_i| \ge t\sqrt V) \le n\cdot \exp(-t^2/2)$, so once $t$ is a bit larger than $\sqrt{2\log n}$, the probability that the maximum exceeds $t\sqrt V$ is tiny.
Remark. The union bound gives the right answer here because we're measuring rare events (a single normal deviating from its mean by a lot), so higher-order terms (e.g. intersections of two or more events) don't matter as much.
The secret sauce
The secret sauce is the following function, sometimes called the moment generating function: $$M_X(\lambda) = \E[e^{\lambda X}].$$
This plays really well with the sum of independent random variables, since by independence we have $$M_{X_1+X_2} (\lambda) = \E[e^{\lambda (X_1+X_2)}] = \E[e^{\lambda X_1}]\cdot \E[e^{\lambda X_2}] = M_{X_1}(\lambda)\cdot M_{X_2} (\lambda).$$
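For the random signs we care about, this factorization is completely explicit: each $\sigma_i x_i$ is $\pm x_i$ with probability $\tfrac12$ each, so $$\E_\sigma\left[e^{\lambda\lrangle{\sigma, x}}\right] = \prod_{i=1}^n \E_{\sigma_i}\left[e^{\lambda\sigma_i x_i}\right] = \prod_{i=1}^n \frac{e^{\lambda x_i}+e^{-\lambda x_i}}{2}.$$ Keep this in mind - it's exactly the quantity we'll bound in a moment.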
What's more, under certain conditions the moment generating function determines the underlying random variable, simply because all moments are captured by this function - if we Taylor expand it, we get $$M_X(\lambda) = 1 + \E[X]\cdot \lambda + \frac{ \E[X^2]}{2} \cdot \lambda^2 + \cdots.$$
For instance, the CLT is really the fact that if $X_i$ has mean 0, then $\frac{1}{\sqrt n} (X_1+\cdots+X_n)$ approaches a normal distribution, because $$\left(\E[e^{\lambda X / \sqrt{n}}]\right)^n \approx \left(1 + \frac{\lambda^2}{2n} \Var(X)\right)^n \to e^{\lambda^2\Var(X) / 2},$$
and $e^{\lambda^2\Var(X)/2}$ is precisely the moment generating function of a normal with mean 0 and variance $\Var(X)$.
So, to do Massart's lemma for real, we just have to think about controlling $$\E_\sigma[e^{\lambda \max_{z\in \cS'}\lrangle{\sigma, z}}]$$ A really silly trick that makes our life way easier is that $$e^{\lambda \max_{z\in \cS'}\lrangle{\sigma, z}} \le \sum_{z\in \cS'} e^{\lambda \lrangle{\sigma,z}}$$
Another silly trick is that $$\frac{e^x + e^{-x}}{2} \le e^{x^2/2},$$ which you can check by Taylor expanding and comparing term-by-term, so $\E_\sigma e^{\lambda\lrangle{z, \sigma}} \le \exp(\lambda^2\|z\|_2^2/2)$. Writing $M := \max_{z\in \cS'}\|z\|_2$ and putting it all together, $$\E_\sigma[e^{\lambda\max_{z\in \cS'}\lrangle{\sigma, z}}] \le |\cS'|\cdot \exp(\lambda^2 M^2/2).$$
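(If you want the term-by-term comparison spelled out: $$\frac{e^x + e^{-x}}{2} = \sum_{k\ge 0}\frac{x^{2k}}{(2k)!}, \qquad e^{x^2/2} = \sum_{k\ge 0}\frac{x^{2k}}{2^k\, k!},$$ and $(2k)! = (2k)(2k-1)\cdots(k+1)\cdot k! \ge 2^k\, k!$ since each of the $k$ extra factors is at least $2$.)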
Now, we need a third trick to extract the "first moment", which is slightly more serious. By Jensen's inequality, we have $$ \exp \left(\lambda\,\E_\sigma \max_{z\in \cS'}\lrangle{\sigma, z}\right) \le \E_\sigma[e^{\lambda\max_{z\in \cS'}\lrangle{\sigma, z}}], $$ which we can rewrite as $$ \E_\sigma\max_{z\in \cS'}\lrangle{\sigma, z} \le \frac{1}{\lambda}\log\E_\sigma[e^{\lambda\max_{z\in \cS'}\lrangle{\sigma, z}}], $$ and noticing that the LHS has no $\lambda$ in it, we may slap an $\inf_{\lambda > 0}$ onto the right for free. Putting it all together, $$ \begin{align*} \E_\sigma\max_{z\in \cS'}\lrangle{\sigma, z} &\le \inf_{\lambda > 0} \frac{1}{\lambda} \left( \log|\cS'| + \lambda^2 M^2/2\right)\\ &= M \cdot \sqrt{2\log |\cS'|}, \end{align*} $$ where the infimum is attained at $\lambda = \sqrt{2\log|\cS'|}/M$. This proves Massart's lemma with $C = \sqrt 2$.
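Since everything here is finite, we can also sanity-check the constant numerically by enumerating all $2^n$ sign vectors for a small random $\cS'$ (an illustrative sketch; the particular dimensions are arbitrary):

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n, m = 12, 30                       # dimension, number of points in S'
S = rng.standard_normal((m, n))     # rows are the points z in S'

# Exact Rademacher complexity: average max_z <sigma, z> over all 2^n sign vectors.
total = sum((S @ np.array(signs)).max()
            for signs in itertools.product([-1.0, 1.0], repeat=n))
rad = total / 2 ** n

M = np.linalg.norm(S, axis=1).max()             # max_{z in S'} ||z||_2
print(round(rad, 2), "<=", round(M * np.sqrt(2 * np.log(m)), 2))
```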
Remarks.
(a) Trick #2 can be interpreted as saying that the random variable $\lrangle{\sigma, x}$ is subgaussian - one definition is that its moment generating function is at most that of a corresponding normal distribution, but in general this means that its tail behavior and moments are "better behaved" than those of the corresponding normal distribution.
(b) Trick #3 has its roots in large deviations theory. See Cramér's theorem for more. As practice, you could try to prove Hoeffding's inequality.
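(For reference, the statement to aim for: if $X_1, \dots, X_n$ are independent with $X_i \in [a_i, b_i]$, then $$\Pr\left(\sum_{i=1}^n (X_i - \E X_i) \ge t\right) \le \exp\left(-\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right);$$ the proof runs the same Chernoff/MGF recipe, with a bound on $\E[e^{\lambda(X_i - \E X_i)}]$ playing the role of trick #2.)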
Part 2: But why Rademacher complexity?
Recap: Learning theory
Now we go back to learning theory. Recall from Part II that if we're predicting $Y$ based on $X$, we can measure the quality of the prediction function $h$ using the mean squared error $$\MSE(h) := \E[(Y-h(X))^2]$$
Under this definition, the best prediction is the conditional mean $h^{opt}(x)=\E[Y|X=x]$. However, there are two problems:
- maybe $h^{opt}$ is a very strange function
- there's no way to evaluate $\E[(Y-h(X))^2]$!
Machine learning makes the following reductions: we will assume that (1) $h$ belongs to some given function class $\cH$, and (2) we optimize $h$ given data $(X^{(i)}, Y^{(i)})$ using the empirical mean squared error $$\hat \MSE(h) := \frac 1 n \sum_{i=1}^n \left(Y^{(i)}-h(X^{(i)})\right)^2.$$
One could interpret this as a stochastic estimate of the actual mean squared error $\MSE(h)$, but subtleties emerge once you optimize $h$ using this value itself - in some sense, the optimization process messes up the guarantee that $\hat\MSE(h) \approx \MSE(h)$.
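Here's a tiny illustration of that subtlety (a setup made up just for this post: a degree-8 polynomial fit by least squares to 30 noisy points) - once $h$ is chosen by optimizing on the data, $\hat\MSE(h)$ becomes an optimistic estimate of $\MSE(h)$:

```python
import numpy as np

rng = np.random.default_rng(3)

def sample(n):
    X = rng.uniform(-1, 1, n)
    Y = np.sin(3 * X) + 0.3 * rng.standard_normal(n)    # true signal plus noise
    return X, Y

X, Y = sample(30)
coeffs = np.polyfit(X, Y, deg=8)          # h = least-squares degree-8 polynomial

train_mse = np.mean((Y - np.polyval(coeffs, X)) ** 2)
X2, Y2 = sample(100_000)                  # fresh data approximates the true MSE
test_mse = np.mean((Y2 - np.polyval(coeffs, X2)) ** 2)
print(round(train_mse, 3), round(test_mse, 3))   # the training MSE is noticeably smaller
```

The gap between those two numbers is exactly the kind of quantity the bound below is meant to control.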
Machine learning works if eventually the learned function $h$ has $\MSE(h) \to 0$ as the number of datapoints goes to infinity (that is, we eventually learn the correct function).
From last time, we saw that one approach was to have a single upper bound for all $h\in \cH$ simultaneously - that is, we should bound $$\sup_{h\in \cH} \left\{ \MSE(h) - \hat \MSE(h) \right\}, $$ since then making the empirical error small is as good as making the true error small. (In ML language, this means that the training error going to 0 is good enough!)
However, we saw that the approach used is a somewhat blanket treatment of things - it doesn't take into account the family $\cH$ of functions being optimized across, nor does it care about the actual distribution of the data.
A better bound
Secretly, $\hat\MSE(h)$ is really also a function of the data $D = (X^{(i)}, Y^{(i)})_i$. So maybe we can try instead to bound
$$\E_D \sup_{h\in \cH} \left\{ \MSE(h) - \hat \MSE(h; D) \right\}$$
The hard part, once again, is dealing with the idealized quantity $\MSE(h)$. But what if we didn't have to? Since $\E_{D'} \hat \MSE(h;D') = \MSE(h)$ for an independent but identically distributed dataset $D'$, we can use a symmetrization trick - by Jensen's inequality, we can upper bound this by switching out the averaged term for the independently generated random term: $$\E_D \sup_{h\in \cH} \left\{ \MSE(h) - \hat \MSE(h; D) \right\} \le \E_{D,D'} \sup_{h\in \cH} \left\{\hat \MSE(h; D') - \hat \MSE(h; D) \right\}.$$
This is okay because, in general, $\sup_{a\in A} (x-a)$ is a convex function of $x$. (Imagine drawing a bunch of linear graphs, and taking the upper envelope.)
Now that this is symmetric (and that we've eliminated the unknown $\MSE(h)$), we will apply the following (which is pretty easy to check):
(Symmetrization lemma.) If $Z_i(h)$ and $Z_i'(h)$ are independent and identically distributed random variables, then $$\E_{Z,Z'} \sup_{h\in \cH} \sum_{i=1}^n (Z_i(h) - Z_i'(h)) \le 2 \E_{Z,\sigma} \sup_{h\in \cH} \sum_{i=1}^n \sigma_i Z_i(h), $$ where $\sigma \sim \{\pm 1\}^n$ is a random choice of signs independent of $Z$.
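The lemma is also easy to check numerically. Here's a toy instance, made up purely for illustration, where $Z_i(h) = (h - W_i)^2$ for standard normal $W_i$ and a four-element class:

```python
import numpy as np

rng = np.random.default_rng(4)
H = np.array([-1.0, 0.0, 1.0, 2.0])      # a toy finite "hypothesis class"
n, trials = 20, 20_000

def losses(W):
    # Z_i(h) = (h - W_i)^2: one row per hypothesis h, one column per sample i.
    return (H[:, None] - W[None, :]) ** 2

lhs = rhs = 0.0
for _ in range(trials):
    W, Wp = rng.standard_normal(n), rng.standard_normal(n)
    sigma = rng.choice([-1.0, 1.0], size=n)
    lhs += (losses(W) - losses(Wp)).sum(axis=1).max()
    rhs += (losses(W) * sigma).sum(axis=1).max()
print(round(lhs / trials, 2), "<=", round(2 * rhs / trials, 2))
```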
If we summarize $(X,Y)$ as a single variable $T$, and we write $\ell(T; h) = (Y-h(X))^2$, applying the lemma (with $Z_i(h) = \frac1n \ell(T^{(i)}; h)$) gives
$$\E_D \sup_{h\in \cH} \left\{\MSE(h) - \hat \MSE(h; D) \right\} \le 2\, \E_{D, \sigma} \sup_{h\in \cH} \frac{1}{n}\sum_{i=1}^n \sigma_i\, \ell(T^{(i)}; h)$$
In other words, the error from using $\hat\MSE$ instead of $\MSE$ is controlled by (twice) the Rademacher complexity of the loss class. And that is where it shows up!
There are two sanity checks that we can do here:
- this bound is indeed sensitive to the distribution of $(X,Y)$ and the function class $\cH$. (Of course, it is still tricky to bound.)
- remember how we saw that 1-variable functions that are convex have surprisingly low Rademacher complexity? You would be right to guess that in general, convex $n$-variable functions have low complexity, which roughly corresponds to the fact that convex optimization is easy.
Conclusion
I hope you've enjoyed this series! This is a topic that I'm still very eagerly learning about, and it gives a good basis of understanding for machine learning in general. It has been a long-time dream of mine to attempt a "nutshell tutorial" on the core ideas of learning theory, and I'm glad that proposing this problem gave me the opportunity to do it.
If you're interested in learning more, I found a really good book that covers many ML theory topics.