Behind the Scenes of SMO Q4, Part 1: the Proposal Backstory

(David here.)

Recently, a problem of mine came out as Q4 on SMO Open 2024. I'll go over how I proposed the problem and the theory behind it, which involves some machine learning ideas (!).

This will take a small sequence of posts:

  • Part 1: a complete backstory for how this problem came about

  • Part 1.5: trying to understand Rademacher complexity through examples

  • Part 2: an introduction to learning theory

  • (maybe?) Part 3: where Rademacher theory comes into the picture

$$\newcommand\lrangle[1]{\left\langle #1 \right\rangle}$$ $$\newcommand \E {\mathbb E}$$ $$\newcommand \cH {\mathcal H}$$ $$\newcommand \Var {\mathrm{Var}}$$ $$\newcommand \MSE {\mathrm{MSE}}$$

The "official" solution

This was the original phrasing of what I submitted:

Alice and Bob play a game. Bob starts by picking a finite set $S$ of length $n$ sequences of reals in $[0,1]$. Alice picks a monotone sequence $y_1\le y_2\le \cdots\le y_n$ from the interval $[0,1]$, and a choice of real numbers $x_1,\cdots, x_n\in \mathbb R$. If Bob can pick $(z_1,z_2,\cdots, z_n)\in S$ such that $$\sum_{i=1}^n x_i y_i \le \sum_{i=1}^n x_iz_i$$ then Bob wins, and otherwise Alice wins. What is the minimum $|S|$ required so that Bob can guarantee a win?

And this was the solution I submitted:

Solution. We claim that $|S| = n+1$.

First, $|S|\ge n+1$: if Alice selects $$ \begin{align*} (y_1,\cdots, y_n) &= (\underbrace{0,0,\cdots, 0}_{k\text{ 0's}}, 1,1, \cdots, 1)\\ (x_1,\cdots, x_n) &= (\underbrace{-1,-1,\cdots, -1}_{k\text{ (-1)'s}}, 1,1,\cdots, 1) \end{align*} $$ then Bob can only win if $(z_1,\cdots, z_n) = (y_1,\cdots, y_n)$, so $S$ must contain $(\underbrace{0,0,\cdots, 0}_{k\text{ 0's}}, 1,1, \cdots, 1)$. Since this holds for every $k = 0, 1, \ldots, n$, the set $S$ must contain all $n+1$ sequences of this form.

Conversely, take $S$ to be the set of all $n+1$ sequences of that form; it remains to show that Bob can win. Using Abel summation, rewrite the inequality as $$\sum_{i=1}^n (y_i-y_{i-1})(x_i+\cdots + x_n) \le \sum_{i=1}^n (z_i-z_{i-1})(x_i+\cdots + x_n)$$ where $y_0=z_0=0$ for convenience; Bob's sequences are exactly those with $z_i-z_{i-1} = 1$ for at most one index $i$ and $0$ for the other indices. If $\max_{1\le i \le n} (x_i+\cdots+x_n) < 0$, then every term on the left is nonpositive, so Bob wins by playing the all-zeros sequence. Otherwise, $$ \begin{align*} \sum_{i=1}^n (y_i-y_{i-1}) (x_i+\cdots + x_n) &\le \left( \sum_{i=1}^n (y_i-y_{i-1})\right) \max_{1\le i \le n} (x_i+\cdots + x_n)\\ &\le \max_{1\le i \le n} (x_i+\cdots + x_n) \end{align*} $$ so Bob picks the sequence whose single jump is at the index $i$ attaining that maximum.
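Not part of the submitted solution, but here's a quick numerical sanity check of Bob's strategy - a minimal Python sketch (everything in it is just illustrative):

```python
# Sanity check: Bob plays the n+1 "step" sequences (0,...,0,1,...,1); for random
# monotone y in [0,1]^n and arbitrary real x, some z in S should satisfy <x,y> <= <x,z>.
import numpy as np

rng = np.random.default_rng(0)
n = 8
S = [np.concatenate([np.zeros(k), np.ones(n - k)]) for k in range(n + 1)]

for _ in range(10_000):
    y = np.sort(rng.uniform(0, 1, size=n))  # Alice's monotone sequence in [0,1]
    x = rng.normal(size=n)                  # Alice's arbitrary real numbers
    assert x @ y <= max(x @ z for z in S) + 1e-12
print("Bob wins every sampled round with |S| = n + 1.")
```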

The "easier" solution

I think the hard part of this problem is figuring out what the set $S$ should look like. There is a perspective that makes this really easy (found by AYS during testsolving, and also by Glen, who eyepowered it in 5 minutes post-SMO).

The key is to think geometrically; treat $x = (x_1,x_2,...,x_n)$ (and $y$ and $z$ analogously) as points in $\mathbb R^n$. If we want $$\lrangle {x,y} \le \max_{z\in S}\lrangle{x,z}$$ then really what we mean is that some point $z \in S$ is at least as "far" as $y$ along the direction $x$.

Thinking geometrically, all this means is that $y$ is "contained" inside the shape formed by $S$! (We can make this rigorous: if $y$ were not in the convex hull of $S$, there would exist a direction along which $y$ goes strictly further than every point of $S$ - this is the hyperplane separation theorem.)

So if $y$ could vary over all monotone sequences, we'd want $S$ to be a set whose convex hull contains all of them. But the set of monotone $y$'s is really just a simplex whose vertices are the $n+1$ step sequences $(0,...,0,1,...,1)$ - which is exactly the $S$ we'd want. The end!
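Here's a small sketch of that convex-hull picture (my own illustration, with the weights written out explicitly): any monotone $y$ is a convex combination of the step sequences.

```python
# Write a monotone y in [0,1]^n as an explicit convex combination of the step
# vectors v_k = (0,...,0,1,...,1) (k leading zeros): weight y_{k+1} - y_k on v_k
# (taking y_0 = 0), and weight 1 - y_n on the all-zeros vector.
import numpy as np

rng = np.random.default_rng(1)
n = 6
y = np.sort(rng.uniform(0, 1, size=n))

V = [np.concatenate([np.zeros(k), np.ones(n - k)]) for k in range(n + 1)]
lam = np.append(np.diff(np.concatenate([[0.0], y])), 1.0 - y[-1])

assert np.all(lam >= -1e-12) and abs(lam.sum() - 1.0) < 1e-12
assert np.allclose(sum(l * v for l, v in zip(lam, V)), y)
print("y is in the convex hull of the n + 1 step sequences.")
```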

Commentary. Actually, when I first learnt about this approach, I was slightly disappointed, because it meant that the problem was easier than I thought. But in retrospect, having a geometric perspective that makes the problem way easier is actually really cool! I don't think many problems would have this pedagogical effect - it really turned out to be a feature, not a bug.

Behind the scenes

I'll give some teasers about what went behind the scenes in the proposal of this problem. The full story is extremely long and deserves its own blogpost, but hopefully this gets you interested to read the follow-up!

The original problem

Actually, my real original submission looked like this:

[Image: The original problem submission]

Can you spot the difference?

...

Well, it's this:

[Image: Update email to Dr Wong]

The reason that I had $x_i\in \{\pm 1\}$ initially will be apparent soon.

A generalization

Actually, a generalization is also true. Say a sequence has (at most) $k$ monotone runs if the difference sequence $\{y_i-y_{i-1}\}$ changes sign at most $k-1$ times. For instance, a sequence like $$ y_1 \le y_2 \le \cdots \le y_j \ge y_{j+1} \ge \cdots \ge y_n$$ has $2$ monotone runs.

With this definition, the following also holds:

Alice and Bob play a game. Bob starts by picking a finite set $S$ of length $n$ sequences of reals in $[0,1]$. Alice picks a sequence $\{y_i\}$ of reals in $[0,1]$ with at most $k$ monotone runs, and a choice of signs $x_i \in \{\pm 1\}$. If Bob can pick $(z_1,z_2,\cdots, z_n)\in S$ such that $$\sum_{i=1}^n x_i y_i \le \sum_{i=1}^n x_iz_i$$ then Bob wins, and otherwise Alice wins. Then Bob can guarantee a win with $|S| = \binom {n+1} {k}$.

Actually, this was the true original version of the problem I'd come up with, but it starts to get clunky. For aesthetic reasons I went for $k=1$.

$\newcommand \cF {\mathcal F}$ $\newcommand \cG {\mathcal G}$ $\newcommand \cR {\mathcal R}$ $\newcommand \cS {\mathcal S}$

Rewriting the game as a maximum

We take a quick detour into game theory.

A general principle is that games can often be written as maxima and/or minima. For example, in a two-player zero-sum game, suppose that if the first player makes the move $a_1$ and the second player makes the move $a_2$, then the first player gets payout $u(a_1,a_2)$ (and the second gets payout $-u(a_1,a_2)$ - that's what zero-sum means).

The second player's best move is to play the $a_2$ that minimizes $u(a_1,a_2)$, so the first player's best move is to play the $a_1$ that maximizes $\min_{a_2} u(a_1,a_2)$, leading to the optimal payout

$$\max_{a_1} \min_{a_2} u(a_1,a_2)$$

If the second player moved first instead, the value would be

$$\min_{a_2} \max_{a_1} u(a_1,a_2)$$

Clearly, playing second is advantageous, since you get to react to your opponent's strategy. However, the famous minimax theorem says that if both players are allowed to randomize over their (finitely many) possible actions, then these two values are in fact equal.
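Here's a tiny sketch (matching pennies) of the gap between the two orders of play, and of how randomization closes it; purely illustrative:

```python
# Matching pennies: payoff matrix to player 1 (rows) against player 2 (columns).
import numpy as np

U = np.array([[1, -1],
              [-1, 1]])

print(U.min(axis=1).max())  # player 1 commits first: max_{a1} min_{a2} u = -1
print(U.max(axis=0).min())  # player 2 commits first: min_{a2} max_{a1} u = +1

# If both players mix uniformly (which is optimal here), the value is 0 from
# either side, as the minimax theorem promises.
p = q = np.array([0.5, 0.5])
print(p @ U @ q)  # 0.0
```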

Coming back to the problem: we can "abstract" away Alice's and Bob's choices of sequences $(y_i)$ and $(z_i)$ by writing them as a maximum over respective sets of sequences:

Let $\cF$ be the set of all length $n$ sequences $(y_1,y_2,...,y_n)$ satisfying $0 \le y_1\le y_2\le ... \le y_n \le 1$. Show that there exists a set of $n+1$ sequences $\cG \subseteq \mathbb R^n$ such that $$\max_{y\in \cF} \lrangle{x,y} \le \max_{z\in \cG} \lrangle{x,z}$$ for any choice of $x\in\{\pm 1\}^n$.

The problem on my problem set

Now I'm ready to show you the problem on my problem set that inspired this problem. It reads:

(Problem on ML theory problem set) Let $\cF$ be the class of all continuous functions $f:[0,1]\to [0,1]$ with at most $k$ local maxima. Prove that the empirical Rademacher complexity of $\cF$ is $O(\sqrt{kn\log n})$.

Expanding the definition of Rademacher complexity, we're asked to show that for any distinct $t_1,t_2,..., t_n\in [0,1]$ and some absolute constant $C>0$, we have

$$ \cR(\cF) := \E_{x\sim \{\pm 1\}^n} \max_{f\in \cF} \sum_{i=1}^n x_if(t_i) \le C \sqrt{kn\log n}.$$

(Here, $x\sim \{\pm 1\}^n$ just means that each $x_i$ is an independent uniformly random $\pm 1$ sign.)
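To make the expectation concrete, here's a toy Monte Carlo estimate of this kind of quantity - using a small stand-in class of threshold functions, not the $\cF$ from the problem:

```python
# Estimate E_x [ max_f sum_i x_i f(t_i) ] by sampling random sign vectors x,
# for the toy finite class f_a(t) = 1[t >= a] evaluated at n sample points.
import numpy as np

rng = np.random.default_rng(2)
n = 20
t = np.sort(rng.uniform(0, 1, size=n))

thresholds = np.linspace(0, 1, 11)
F = np.array([(t >= a).astype(float) for a in thresholds])  # rows: (f_a(t_1), ..., f_a(t_n))

trials = 20_000
x = rng.choice([-1.0, 1.0], size=(trials, n))
estimate = np.mean(np.max(x @ F.T, axis=1))
print(f"estimated Rademacher complexity of the toy class: {estimate:.2f}")
```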

The first observation is that we can really replace the class of functions with a subset of $\mathbb R^n$. By reordering the indices so that $t_1\le ...\le t_n$, it isn't hard to check that the set

$$ \cS = \{(f(t_1),f(t_2),...,f(t_n)): f\in \cF\}$$

can be characterized by its monotone runs (something like: we can only have $k$ "ups" and "downs"). So it suffices to show that

$$\cR(\cS) := \E_x \max_{y\in \cS} \lrangle{x,y} \le C\sqrt{kn\log n}, \quad x\sim \{\pm 1\}^n.$$

This might still seem hopeless to show, but thankfully, we were given the following fact in our notes:

(Massart's lemma) For finite $\cS'$, one has $$\cR(\cS') \le C\cdot \left(\max_{z\in \cS'} \|z\|_2\right) \cdot \sqrt{\log |\cS'|}.$$

But how does this help? Well, if only we had a finite set $\cS'$ with $\cR(\cS) \le \cR(\cS')$... or, if we were a little more ambitious, we could ask for this inequality to hold pointwise (instead of averaged over $x$). That is, we're looking for a set $\cS'$ such that

$$\max_{y\in \cS} \lrangle{x,y} \le \max_{z\in \cS'} \lrangle{x,z} \quad \text{for every } x\in\{\pm 1\}^n.$$

Doesn't this look familiar now?
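To spell out the $k=1$ case (my own filling-in, using the set from the SMO problem): take $\cS'$ to be the $n+1$ step sequences $(0,\ldots,0,1,\ldots,1)$. Then the pointwise inequality above holds, each $z\in\cS'$ has $\|z\|_2\le\sqrt n$, and Massart's lemma gives $$\cR(\cS) \le \cR(\cS') \le C\sqrt{n}\cdot\sqrt{\log (n+1)} = O\!\left(\sqrt{n\log n}\right),$$ which is exactly the desired bound when $k=1$. For general $k$ we need a larger finite set adapted to $k$ monotone runs, which is where the next section's smoothing argument comes in.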

How do you figure out the set without knowing what it is?

If you don't know what the set is, you could think about how to "smooth" a point $y$ into a more "extreme" point $y'$ with $\lrangle{x,y'} \ge \lrangle{x,y}$. This reduces the number of possibilities for $y$ that we have to take a maximum over. (What I'm about to describe would be yet another solution to the original problem!)

Consider the following algorithm:

  • Start from $z = y$.
  • We will adjust the $z_i$ without decreasing $\sum_{i=1}^n x_i z_i$ and without forming "new maxima" (more generally, while keeping the monotone structure of the original sequence).
  • Depending on the sign of the current partial sum $\sum_{i=1}^k x_i$, we smooth all of $z_1,\ldots, z_k$ one way or the other (until we would break the monotone structure).
  • Spelt out in full, for $k=1,2,...,n$:
    • If $\sum_{i\le k} x_i \ge 0$, simultaneously increase $z_1,\ldots, z_k$ (where possible) until $z_k = 1$ or $z_k = z_{k+1}$.
    • If $\sum_{i\le k} x_i < 0$, simultaneously decrease $z_1,\ldots, z_k$ (where possible) until $z_k = 0$ or $z_k = z_{k+1}$.

Let the end state be $z_1', ..., z_n'$. Then it's easy to check that (a) each $z_i'$ is 0 or 1, (b) $y_i\le y_{i+1}$ implies $z_i' \le z_{i+1}'$, and (c) $y_i \ge y_{i+1}$ implies $z_i'\ge z_{i+1}'$.

It's not hard to check that if the original $f$ had at most $k$ local maxima, then the end state has at most $2k+2$ value changes, so there are at most $2\sum_{j\le 2k+2}\binom{n-1}{j} = n^{O(k)}$ such sequences $(z_1', z_2', ..., z_n')$. This solves the problem!
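I won't code up the smoothing itself, but here's a brute-force check of its conclusion for small $n$ (my own sketch, with $m$ denoting the allowed number of monotone runs): every $y$ with at most $m$ monotone runs is dominated, against every sign vector, by some 0/1 sequence with at most $m$ monotone runs.

```python
# Brute-force check: 0/1 sequences with at most m monotone runs dominate every
# y in [0,1]^n with at most m monotone runs, for every sign vector x.
import itertools
import numpy as np

def monotone_runs(v):
    """1 + number of sign changes among the nonzero differences of v."""
    signs = [np.sign(d) for d in np.diff(v) if d != 0]
    return 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)

rng = np.random.default_rng(3)
n, m = 6, 2
S = [np.array(z) for z in itertools.product([0, 1], repeat=n) if monotone_runs(z) <= m]

for _ in range(2_000):
    peak = rng.integers(1, n + 1)        # sample an "up then down" y: at most 2 runs
    y = np.concatenate([np.sort(rng.uniform(0, 1, peak)),
                        -np.sort(-rng.uniform(0, 1, n - peak))])
    x = rng.choice([-1, 1], size=n)
    assert x @ y <= max(x @ z for z in S) + 1e-12
print(f"|S| = {len(S)} zero-one sequences suffice for all sampled y.")
```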

What's next?

In the next part, I will try to explain why we care about Rademacher complexity at all. This will lead us into the story of statistics and machine learning.
