Behind the Scenes of SMO Q4, Part 1: the Proposal Backstory

(David here.)

Recently, a problem of mine came out as Q4 on SMO Open 2024. I'll go over how I proposed the problem and the theory behind it, which involves some machine learning ideas (!).

This will take a small sequence of posts:

  • Part 1: a complete backstory for how this problem came about

  • Part 1.5: trying to understand Rademacher complexity through examples

  • Part 2: an introduction to learning theory

  • (maybe?) Part 3: where Rademacher theory comes into the picture

\newcommand\lrangle[1]{\left\langle #1 \right\rangle} \newcommand \E {\mathbb E} \newcommand \cH {\mathcal H} \newcommand \Var {\mathrm{Var}} \newcommand \MSE {\mathrm{MSE}}

The "official" solution

This was the original phrasing of what I submitted:

Alice and Bob play a game. Bob starts by picking a finite set $S$ of length-$n$ sequences of reals in $[0,1]$. Alice picks a monotone sequence $y_1 \le y_2 \le \cdots \le y_n$ from the interval $[0,1]$, and a choice of real numbers $x_1, \cdots, x_n \in \mathbb R$. If Bob can pick $(z_1, z_2, \cdots, z_n) \in S$ such that $\sum_{i=1}^n x_i y_i \le \sum_{i=1}^n x_i z_i$ then Bob wins, and otherwise Alice wins. What is the minimum $|S|$ required so that Bob can guarantee a win?

And this was the solution I submitted:

Solution. We claim that $|S| = n+1$.

Firstly, $|S| \ge n+1$: for each $k = 0, 1, \ldots, n$, if Alice selects \begin{align*} (y_1,\cdots, y_n) &= (\underbrace{0,0,\cdots, 0}_{k\text{ 0's}}, 1,1,\cdots, 1)\\ (x_1,\cdots, x_n) &= (\underbrace{-1,-1,\cdots, -1}_{k\text{ $(-1)$'s}}, 1,1,\cdots, 1) \end{align*} then Bob can only win if $(z_1,\cdots, z_n) = (y_1,\cdots, y_n)$, so $S$ must contain each of the $n+1$ sequences $(\underbrace{0,0,\cdots, 0}_{k\text{ 0's}}, 1,1,\cdots, 1).$

Now, suppose $S$ consists of exactly those $n+1$ sequences. It remains to show that Bob can win. Rewrite the inequality as \[\sum_{i=1}^n (y_i-y_{i-1})(x_i+\cdots + x_n) \le \sum_{i=1}^n (z_i-z_{i-1})(x_i+\cdots + x_n)\] where $y_0 = z_0 = 0$ for convenience, and we would like to set $z_i - z_{i-1} = 1$ for at most one index $i$, and $0$ for the other indices. However, \begin{align*} \sum_{i=1}^n (y_i-y_{i-1}) (x_i+\cdots + x_n) &\le \left( \sum_{i=1}^n (y_i-y_{i-1})\right) \max_{1\le i \le n} (x_i+\cdots + x_n)\\ &\le \max_{1\le i \le n} (x_i+\cdots + x_n) \end{align*} so we can pick precisely the $i$ that attains that maximum. (The last step uses $\sum_i (y_i - y_{i-1}) = y_n \le 1$ and is valid when the maximum is nonnegative; if the maximum is negative, Bob instead picks the all-zeros sequence, and the left-hand side is at most $0$.)
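As a quick numerical sanity check of this argument (a sketch in Python; the function and variable names are my own), we can play Bob's claimed set of step sequences against many random choices by Alice:

```python
import random

def bob_wins(x, y):
    """Check that some step sequence z in S satisfies <x, z> >= <x, y>."""
    n = len(x)
    # Bob's claimed set: k zeros followed by (n - k) ones, for k = 0..n
    S = [[0] * k + [1] * (n - k) for k in range(n + 1)]
    lhs = sum(xi * yi for xi, yi in zip(x, y))
    return max(sum(xi * zi for xi, zi in zip(x, z)) for z in S) >= lhs

random.seed(0)
for _ in range(1000):
    n = 6
    y = sorted(random.random() for _ in range(n))   # Alice's monotone sequence
    x = [random.uniform(-3, 3) for _ in range(n)]   # Alice's arbitrary reals
    assert bob_wins(x, y)
print("Bob wins every round")
```

Of course this only tests finitely many rounds, but it is a cheap way to catch a wrong guess for $S$ before trying to prove anything.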

The "easier" solution

I think the hard part of this problem is figuring out what this set should look like. There is a perspective that makes this really easy (found by AYS during testsolving, and also by Glen, who eyepowered it in 5 minutes post-SMO).

The key is to think geometrically: treat $x = (x_1, x_2, \ldots, x_n)$ (and $y$ and $z$ analogously) as points in $\mathbb R^n$. If we want \[\lrangle{x,y} \le \max_{z\in S}\lrangle{x,z}\] then really what we mean is that some point $z \in S$ is "further" than $y$ along direction $x$.

Thinking geometrically, all this means is that $y$ is "contained" inside the shape formed by $S$! (We can make this rigorous: if $y$ were not in the convex hull of $S$, then there must exist a direction along which $y$ is larger than every point of $S$ - this is the hyperplane separation theorem.)

So if $y$ could vary along all monotone sequences, we'd want $S$ to be some set whose convex hull contained them. But the set of monotone $y$'s is really just a simplex with vertices $(0, \ldots, 0, 1, \ldots, 1)$, which is exactly the $S$ that we'd want. The end!
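To make the simplex claim concrete, here is a small sketch (my own code, not from the post) that writes an arbitrary monotone $y$ as an explicit convex combination of the step vectors:

```python
import random

def hull_weights(y):
    """Weights w[k] >= 0 summing to 1 with y = sum_k w[k] * v_k,
    where v_k = (0,...,0, 1,...,1) has k leading zeros."""
    n = len(y)
    deltas = [y[0]] + [y[i] - y[i - 1] for i in range(1, n)]
    return deltas + [1 - y[-1]]   # leftover mass on the all-zeros vector v_n

random.seed(1)
n = 5
y = sorted(random.random() for _ in range(n))
w = hull_weights(y)
assert all(wk >= 0 for wk in w) and abs(sum(w) - 1) < 1e-9
# coordinate j of v_k is 1 exactly when j >= k, so reconstruct y:
recon = [sum(w[k] for k in range(j + 1)) for j in range(n)]
assert max(abs(a - b) for a, b in zip(recon, y)) < 1e-9
print("y is a convex combination of the step vectors")
```

The weights are just the consecutive differences $y_i - y_{i-1}$, with the leftover mass $1 - y_n$ placed on the all-zeros vertex - the same differences that drive the Abel-summation solution above.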

Commentary. Actually, when I first learnt about this approach, I was slightly disappointed, because it meant that the problem was easier than I thought. But in retrospect, having a geometric perspective that makes the problem way easier is actually really cool! I don't think many problems would have this pedagogical effect - it really turned out to be a feature, not a bug.

Behind the scenes

I'll give some teasers about what went on behind the scenes in the proposal of this problem. The full story is extremely long and deserves its own blogpost, but hopefully this gets you interested to read the follow-up!

The original problem

Actually, my real original submission looked like this:

[Image: The original problem submission]

Can you spot the difference?

...

Well, it's this:

[Image: Update email to Dr Wong]

The reason that I had $x_i \in \{\pm 1\}$ initially will be apparent soon.

A generalization

Actually, the following generalization is also true: we say a sequence has $k$ monotone runs if $\{y_i - y_{i-1}\}$ changes sign at most $(k-1)$ times. For instance, a sequence like \[y_1 \le y_2 \le \cdots \le y_m \ge y_{m+1} \ge \cdots \ge y_n\] has $2$ monotone runs.
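As a sanity check on the definition, here is a small helper of my own (under the convention that a constant sequence counts as one run) that counts monotone runs via sign changes of consecutive differences:

```python
def monotone_runs(y):
    """1 + number of sign changes among the nonzero consecutive differences."""
    signs = [d > 0 for d in (b - a for a, b in zip(y, y[1:])) if d != 0]
    return 1 + sum(1 for s, t in zip(signs, signs[1:]) if s != t)

assert monotone_runs([0.1, 0.5, 0.9, 0.4, 0.2]) == 2   # up, then down
assert monotone_runs([0.0, 0.3, 0.7, 1.0]) == 1        # fully monotone
assert monotone_runs([0.5, 0.1, 0.6, 0.2]) == 3        # down, up, down
```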

This is also true:

Alice and Bob play a game. Bob starts by picking a finite set $S$ of length-$n$ sequences of reals in $[0,1]$. Alice picks a sequence $\{y_i\}$ consisting of $k$ monotone runs, and a choice of signs $x_i \in \{\pm 1\}$. If Bob can pick $(z_1, z_2, \cdots, z_n) \in S$ such that $\sum_{i=1}^n x_i y_i \le \sum_{i=1}^n x_i z_i$ then Bob wins, and otherwise Alice wins. Then Bob can guarantee a win with $|S| = \binom{n+1}{k}$.

Actually, this was the true original version of the problem I'd come up with, but it starts to get clunky. For aesthetic reasons I went with $k=1$.

\newcommand \cF {\mathcal F} \newcommand \cG {\mathcal G} \newcommand \cR {\mathcal R} \newcommand \cS {\mathcal S}

Rewriting the game as a maximum

We take a quick detour into game theory.

A general principle is that games can often be written as maxima and/or minima. For example, in the classic two-player zero-sum game, suppose that if the first player makes the move $a_1$ and the second player makes the move $a_2$, then the first player gets payout $u(a_1, a_2)$ (and the second gets payout $-u(a_1, a_2)$ - that's what zero-sum means).

The second player's best move is to play the $a_2$ that minimizes $u(a_1, a_2)$, so the first player's best move is to play the $a_1$ that maximizes $\min_{a_2} u(a_1, a_2)$, leading to the optimal payout

\[\max_{a_1} \min_{a_2} u(a_1, a_2).\]

If the second player started first instead, it would instead be

\[\min_{a_2} \max_{a_1} u(a_1, a_2).\]

Clearly, playing second is advantageous, since you'd be able to react to the opponent's strategy. However, the famous minimax theorem says that if both players are allowed to randomize amongst their finitely many possible actions, then these two values are in fact equal.
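A tiny illustration with matching pennies (my own example, not from the post): with pure strategies the order of moves matters, but uniform randomization equalizes the game at value $0$:

```python
# Payoff matrix for matching pennies: u[i][j] is the first player's payout.
u = [[1, -1],
     [-1, 1]]

maximin = max(min(row) for row in u)                             # first player commits first
minimax = min(max(u[i][j] for i in range(2)) for j in range(2))  # second player commits first
assert (maximin, minimax) == (-1, 1)   # moving second is strictly better here

# If the first player mixes 50/50, every response gives expected payout 0,
# so with mixed strategies both orders of play have the same value.
p = [0.5, 0.5]
expected = [sum(p[i] * u[i][j] for i in range(2)) for j in range(2)]
assert expected == [0.0, 0.0]
print("pure: maximin -1 < minimax 1; mixed value: 0")
```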

Coming back to the problem: we can "abstract" away Alice's and Bob's choices of sequences $(y_i)$ and $(z_i)$ by writing them as maxima over respective sets of sequences:

Let $\cF$ be the set of all length-$n$ sequences $(y_1, y_2, \ldots, y_n)$ satisfying $0 \le y_1 \le y_2 \le \cdots \le y_n \le 1$. Show that there exists a set of $n+1$ sequences $\cG \subseteq \mathbb R^n$ such that \[\max_{y\in \cF} \lrangle{x,y} \le \max_{z\in \cG} \lrangle{x,z}\] for any choice of $x \in \{\pm 1\}^n$.

The problem on my problem set

Now I'm ready to show you the problem on my problem set that inspired this problem. It reads:

(Problem on ML theory problem set) Let $\cF$ be the class of all continuous functions $f : [0,1] \to [0,1]$ with at most $k$ local maxima. Prove that the empirical Rademacher complexity of $\cF$ is $O(\sqrt{kn\log n})$.

Expanding the definition of Rademacher complexity, we're asked to show that for any distinct $t_1, t_2, \ldots, t_n \in [0,1]$ and some absolute constant $C > 0$, we have

\[\cR(\cF) := \E_{x\sim \{\pm 1\}^n} \max_{f\in \cF} \sum_{i=1}^n x_i f(t_i) \le C \sqrt{kn\log n}.\]

(Here, $x \sim \{\pm 1\}^n$ just means that each $x_i$ is an independent random $\pm$ sign.)

The first observation is that we can really replace the class of functions with a subset of $\mathbb R^n$. By reordering our indices so that $t_1 \le \cdots \le t_n$, it isn't hard to check that the set

\[\cS = \{(f(t_1), f(t_2), \ldots, f(t_n)) : f\in \cF\}\]

can be characterized by its monotone runs (something like: we can only have $k$ "ups" and "downs"). So it suffices to show

\[\cR(\cS) := \E_{x \sim \{\pm 1\}^n} \max_{y\in \cS} \lrangle{x,y} \le C\sqrt{kn\log n}.\]

This might still seem hopeless to show, but thankfully, we were given the following fact in our notes:

(Massart's lemma) For a finite set $\cS'$, one has \[\cR(\cS') \le C \cdot \left(\max_{z\in \cS'} \|z\|_2\right) \cdot \sqrt{\log |\cS'|}.\]
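To see what the lemma buys us, here is a Monte Carlo sketch of my own (using the constant $C = \sqrt 2$, which is how the lemma is commonly stated): estimate the Rademacher complexity of the step-sequence set from the original problem and compare it against the bound.

```python
import math
import random

def rademacher_estimate(S, trials=4000, seed=0):
    """Monte Carlo estimate of E_x max_{z in S} <x, z>, with x uniform in {±1}^n."""
    rng = random.Random(seed)
    n = len(S[0])
    total = 0.0
    for _ in range(trials):
        x = [rng.choice((-1, 1)) for _ in range(n)]
        total += max(sum(xi * zi for xi, zi in zip(x, z)) for z in S)
    return total / trials

n = 8
S = [[0] * k + [1] * (n - k) for k in range(n + 1)]   # the n+1 step sequences
est = rademacher_estimate(S)
# Massart: R(S) <= (max_z ||z||_2) * sqrt(2 log |S|)
bound = max(math.sqrt(sum(zi ** 2 for zi in z)) for z in S) * math.sqrt(2 * math.log(len(S)))
assert 0 <= est <= bound
print(f"estimate {est:.2f} <= Massart bound {bound:.2f}")
```

The estimate sits comfortably below the bound, which is all the problem needs: the bound is loose but of the right order.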

But how does this help? Well, if only we had a finite set $\cS'$ with $\cR(\cS) \le \cR(\cS')$... or, if we were a little more ambitious, we could want this inequality to hold pointwise (instead of averaged over $x$). That is, we're looking for a set $\cS'$ such that

\[\max_{y\in \cS} \lrangle{x,y} \le \max_{z\in \cS'} \lrangle{x,z}\] for every $x \in \{\pm 1\}^n$.

Doesn't this look familiar now?

How do you figure out the set without knowing what it is?

If you don't know what the set is, you could think about how to "smooth" a point $y$ into a more extreme point $y'$ without decreasing the desired quantity $\lrangle{x, y}$. This reduces the number of possibilities we have to take a maximum over. (What I'm about to describe would be yet another solution to the original problem!)

Consider the following algorithm:

  • Start from $z = y$.
  • We will adjust the $z_i$ without decreasing $\sum_{i=1}^n x_i z_i$ and without forming "new maxima" (or more generally, while keeping the monotone structure of the original problem).
  • Depending on the current partial sum $\sum_{i=1}^k x_i$, we smooth all of $z_1, \ldots, z_k$ one way or the other (until we would break the monotone structure).
  • Spelt out in full, for $k = 1, 2, \ldots, n$:
    • If $\sum_{i\le k} x_i \ge 0$, simultaneously increase $z_1, \ldots, z_k$ (where possible) until $z_k = 1$ or $z_k = z_{k+1}$.
    • If $\sum_{i\le k} x_i < 0$, simultaneously decrease $z_1, \ldots, z_k$ (where possible) until $z_k = 0$ or $z_k = z_{k+1}$.

Let the end state be $z_1', \ldots, z_n'$. Then it's easy to check that (a) each $z_i'$ is $0$ or $1$, (b) $z_i \le z_{i+1}$ implies $z_i' \le z_{i+1}'$, and (c) $z_i \ge z_{i+1}$ implies $z_i' \ge z_{i+1}'$.

It's not hard to check that if $f$ had at most $k$ local maxima, then the end state has at most $2k+2$ value changes, so there are at most $\binom{n-1}{2k+2}$ such sequences $(z_1', z_2', \ldots, z_n')$. Combined with Massart's lemma, this solves the problem!
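Rather than implement the smoothing pass itself, here is a brute-force check of my own of the endpoint claim for $k = 1$ local maximum: for every sign pattern $x$, a unimodal $y$ is dominated by some $0/1$ sequence with at most $2k+2 = 4$ value changes.

```python
import itertools
import random

def value_changes(z):
    """Number of indices where consecutive entries differ."""
    return sum(1 for a, b in zip(z, z[1:]) if a != b)

random.seed(3)
n, budget = 7, 4   # budget = 2k+2 value changes for k = 1 local maximum
Z = [z for z in itertools.product((0, 1), repeat=n) if value_changes(z) <= budget]
for _ in range(2000):
    m = random.randrange(1, n + 1)
    up = sorted(random.random() for _ in range(m))
    down = sorted((random.random() for _ in range(n - m)), reverse=True)
    y = up + down                                   # unimodal: one peak
    x = [random.choice((-1, 1)) for _ in range(n)]
    best = max(sum(xi * zi for xi, zi in zip(x, z)) for z in Z)
    assert best >= sum(xi * yi for xi, yi in zip(x, y)) - 1e-9
print("extreme 0/1 sequences with few value changes dominate")
```

This confirms the pointwise domination that Massart's lemma is then applied to; the smoothing algorithm above is one way to prove it in general.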

What's next?

In the next part, I will try to explain why we care about Rademacher complexity at all. This will lead us into the story of statistics and machine learning.
