Too lazy to integrate

(David here.) A math teacher from Primary school once told me that "lazy people make good mathematicians". This is one story of laziness-inspired mathematics.

$\newcommand \Ber {\mathrm{Ber}}$ $\newcommand \Bin {\mathrm{Bin}}$ $\newcommand \Geo {\mathrm{Geo}}$ $\newcommand \Exp {\mathrm{Exp}}$ $\newcommand \cN {\mathcal N}$ $\newcommand \Poi {\mathrm{Poi}}$ $\newcommand \dto {\overset d \to}$ $\newcommand \E {\mathbb E}$ $\newcommand \Var {\mathrm{Var}}$

A sad story

Imagine you're in probability class one day, trying not to fall asleep. You've just learnt about random variables, independence and binomials, and now the professor's covering Poissons and Exponentials. You don't catch much, except that exponentials are defined by $$ P(X > t) = e^{-\lambda t}$$ and Poissons by $$P(X = n) = \frac{e^{-\lambda} \lambda^n}{n!}.$$

(You barely have a clue of what $\lambda$ is supposed to mean.)

You blink, and it's 2am and you're staring at a problem set that looks like this.

The problem set

  1. $Y\sim \Exp(1)$. Find $P(Y>2 | Y>1)$.

  2. Let $Y\sim \Exp(\lambda)$. Find $\E[Y|Y\ge a]$.

  3. Let $Y\sim \Exp(\lambda)$. Find $\E[Y^2|Y\ge a]$.

  4. $Y_1,Y_2\sim \Exp(\lambda)$ independent. Find $P(Y_1+Y_2\le 1)$.

  5. $Y_1,Y_2, Y_3 \sim \Exp(\lambda)$ independent. Find $P(Y_1 + Y_2 \le 1, Y_1+Y_2+Y_3\ge 1)$.

  6. $Y_1 \sim \Exp(\lambda_1)$ and $Y_2\sim \Exp(\lambda_2)$ are independent. Find $P(Y_1 \le Y_2)$.

  7. $X\sim \Exp(\lambda)$ and $Y\sim \Poi(X)$. Find $P(Y = 1)$.

  8. $Y_1,Y_2,Y_3\sim \Exp(\lambda)$ independent. Find $P(Y_1+Y_2 \le \frac 2 3 | Y_1+Y_2+Y_3 = 1)$.

  9. $Y_1,...,Y_n\sim \Exp(1)$ independent. Let $M= \max\{Y_1,...,Y_n\}$. What is $\E M$?

  10. $X_1\sim \Exp(1), X_2 \sim \Exp(1/2), X_3\sim \Exp(1/3)$ independent. Find $P(X_1 < X_2 < X_3)$.

Why is my problem set full of gross integrals?

Yep. It slowly dawns on you what you have to contend with.

Take for instance #4. It's not hard, but it certainly is gross: using the formula for the pdf $\lambda e^{-\lambda x}$,

$$ P(Y_1+Y_2\le 1) = \int_{y_1+y_2 \le 1} \lambda e^{-\lambda y_1}\cdot \lambda e^{-\lambda y_2} dy_1 dy_2 $$

and then you have the annoying job of integrating this. It's not hard by any measure, but it is very uninspiring. And you have another 9 of these to get through before the night ends. You curse your professor's name, and wonder why you had the misfortune of having to take a course in probability.

Maybe it didn't have to be this way.

Context: Random variables and limits

I was a Teaching Assistant for the introductory probability course in the Computer Science department at Stanford for an entire year (which is three iterations). There were a handful of topics that I thought were interesting in the "big picture" beyond the course, and this was perhaps the first one.

So the most basic random variable you could possibly have is the biased coin flip, or Bernoulli distribution. That is, $$ X = \begin{cases} 1 & \text{with probability }p \\ 0 & \text{with probability }1-p \end{cases}$$ and we denote this $X \sim \Ber(p)$.

Things start to get interesting when we're allowed to have many independent coin flips. The number of heads we get when we flip $n$ coins for a coin that lands on heads with probability $p$ gets a special name: the binomial distribution $\Bin(n,p)$. We could also say that it's the sum of $n$ independent $\Ber(p)$'s.

This is also the point where we can start doing basic statistics: let's say that we didn't know exactly how biased our coin was, but we observe $k$ out of $n$ coinflips. A very reasonable guess is that $p\approx k/n$, and it's also optimal in various senses.

Another thing we can do, as $n\to \infty$, is understand the "macro" behaviour using the central limit theorem, which tells us that the sum behaves like a normal distribution with the same mean and variance. (If you forgot what variance is, it's a measure of the typical deviation from the mean: $\Var X = \E[(X-\E X)^2] = \E[X^2] - \E[X]^2$.) In particular, the variance of a $\Ber(p)$ is $p(1-p)$, so the variance of a $\Bin(n,p)$ is $np(1-p)$ (because when summing independent variables, the variances add up). So we have the following "limit"

$$\frac{ \Bin(n,p) - np}{\sqrt{np(1-p)}} \to \mathcal N(0,1)$$

What in the world does this mean? It means that for a fixed real number $t$, $\Bin(n,p)$ exceeds the mean by $t$ standard deviations with a probability that tends to the equivalent probability for a standard normal distribution, i.e. $$P(X - np > t\sqrt{np(1-p)}) \overset{n\to\infty}{\longrightarrow} P(Z > t)\qquad X\sim \Bin(n,p), Z\sim \cN(0,1)$$

One also calls this property distributional convergence, and this is denoted sometimes with $\overset d \to$.
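If you'd rather not take this convergence on faith, here is a minimal simulation sketch (using numpy, with $n$, $p$ and $t$ chosen arbitrarily) comparing the two tail probabilities:

```python
import numpy as np

# Compare P(X - np > t * sqrt(np(1-p))) for X ~ Bin(n, p)
# with P(Z > t) for Z ~ N(0, 1).
rng = np.random.default_rng(0)
n, p, t = 10_000, 0.3, 1.0           # arbitrary choices
trials = 200_000

x = rng.binomial(n, p, size=trials)
lhs = np.mean(x - n * p > t * np.sqrt(n * p * (1 - p)))

z = rng.standard_normal(trials)
rhs = np.mean(z > t)                 # ~ 0.159 for t = 1

print(lhs, rhs)                      # the two should be close
```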

Context: Poissons and Exponentials

So earlier we saw that a Poisson with mean $\lambda$, denoted $\Poi(\lambda)$, is the distribution of $X$ where $$P(X=k) = e^{-\lambda} \frac{\lambda^k}{k!}$$

which, if you're looking at this for the first time, should make absolutely no sense whatsoever. But what's usually glossed over is that this actually comes from a very natural looking limit:

$$\Bin(n, \lambda/n) \dto \Poi(\lambda)$$

Here, the $\lambda$ can be interpreted as a rate. If you were waiting at a bus stop, and $\lambda$ buses came every hour, a reasonable way to model this is to assume that the rate is "uniform", so maybe we picture that every minute has a $\lambda / 60$ probability of a bus arriving, and we can do this time subdivision ad infinitum.

To prove that the above are actually the same, we can use a probability generating function. One thinks of these as racks to put probabilities on: $$M_X(t) = \E[t^X] = P(X=0) + P(X=1)\cdot t + P(X = 2) \cdot t^2 + ...$$ and the neat thing is that for independent $X,Y$, $M_{X+Y}(t) = M_X(t)M_Y(t)$. The generating function for a $\Ber(\lambda/n)$ is $1 + \frac{\lambda(t-1)} {n}$, so that of a $\Bin(n, \lambda/n)$ is its $n$-th power, and the binomial convergence to a Poisson can be written as $$\left(1+\frac{\lambda (t-1)}{n}\right)^n \to e^{\lambda (t-1)}.$$

The only final thing you have to believe is that convergence of generating functions implies convergence of the individual coefficients (i.e. the probabilities), but let's not sweat the details at this point.
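If you'd rather see the conclusion than the generating functions, here's a quick sketch (with an arbitrarily chosen $\lambda$ and $n$) comparing the two pmfs directly:

```python
import math

# Compare the pmf of Bin(n, lam/n) with the pmf of Poi(lam) for small k.
lam, n = 2.5, 1_000                  # arbitrary choices

for k in range(6):
    binom_pmf = math.comb(n, k) * (lam / n) ** k * (1 - lam / n) ** (n - k)
    poi_pmf = math.exp(-lam) * lam ** k / math.factorial(k)
    print(k, round(binom_pmf, 5), round(poi_pmf, 5))
```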

What about an exponential? How would we obtain some $Y$ that satisfies $$P(Y > t) = e^{-\lambda t}?$$ If you squint a little, maybe you're thinking that $$\left(1 - \frac{\lambda} {n}\right)^{tn} \to e^{-\lambda t}$$ so we can spin up the following story: if we toss a coin with head probability $\lambda / n$, then the left-hand side is the probability that the first $tn$ tosses are all tails, i.e. that we see no heads until after the $(tn)$-th toss. If we introduce the geometric distribution $\Geo(p)$ as the number of coin flips until we see a head, then we have $$\frac{1}{n} \Geo(\lambda /n) \dto \Exp(\lambda)$$
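Again, here's a minimal sanity check of that rescaled-geometric limit (with $\lambda$, $n$ and $t$ chosen arbitrarily):

```python
import numpy as np

# Compare P(Geo(lam/n) / n > t) with the Exp(lam) tail e^{-lam * t}.
rng = np.random.default_rng(0)
lam, n, t = 1.5, 1_000, 0.7          # arbitrary choices
trials = 200_000

geo = rng.geometric(lam / n, size=trials)   # flips until the first head
print(np.mean(geo / n > t))                 # empirical tail of the rescaled geometric
print(np.exp(-lam * t))                     # exponential tail; should be close
```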

Meta. So, this is roughly as far as we got telling folks about the "origin" of the Poisson and Exponential random variables. But to me, this is not enough. It is a cute idea that these occur as limits of simpler distributions, and they somewhat give us an intuition, but the level that I demand from ideas is that they must show us something that we didn't know before.

Putting these together: the Poisson process

If you got the sense from those definitions that the Poisson and Exponential must be intricately related, you were right! Here I'll show you a conceptual framework to "couple" the definitions together.

Imagine that I have the positive real line chopped up into intervals of length $1/n$, and that on each interval I toss a coin independently with head probability $\lambda / n$. Now, I take the limit $n\to\infty$. In this setup, both the Poisson and Exponential show up:

  • The number of heads within the first unit length is $\Poi(\lambda)$. More generally, the number of heads within the first $x$ length is $\Poi(\lambda x)$.
  • The position of the first head is $\Exp(\lambda)$.

It's perhaps not too hard to imagine such an object using our physical intuitions - and let's call this the Poisson process (with rate $\lambda$). The heads are sometimes also called arrivals (and we speak of arrivals happening at a certain position, along the positive real line). This is a fundamental object when we want to understand queues and to model real-world arrivals (fun fact: Disney used this in the 1980s to come up with their expedited queueing system at Disneyland), but right now we'll concern ourselves with exploiting the existence of Poisson processes to skip over computations.

(For the truly mathematically skeptical, the existence of the Poisson process seems like something that should not be assumed. How would you give a concrete definition to something like this, and how might you show a convergence of distributions?)
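Without answering that properly here, a minimal simulation sketch of the discretized construction (rate, resolution and horizon chosen arbitrarily) at least lets you check both bullet points above empirically:

```python
import numpy as np

# Discretized construction: chop [0, horizon] into intervals of length 1/n and
# toss an independent coin with head probability lam/n on each interval.
rng = np.random.default_rng(0)
lam, n, horizon = 2.0, 1_000, 5.0    # arbitrary choices
trials = 10_000

counts_first_unit = np.empty(trials)
first_head_pos = np.empty(trials)
for i in range(trials):
    heads = rng.random(int(horizon * n)) < lam / n
    counts_first_unit[i] = heads[:n].sum()   # heads in the first unit length
    first_idx = np.argmax(heads)             # first head (assumes one exists)
    first_head_pos[i] = (first_idx + 1) / n

print(counts_first_unit.mean(), counts_first_unit.var())   # both ~ lam, as for Poi(lam)
print(first_head_pos.mean())                                # ~ 1/lam, as for Exp(lam)
```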

We'll start with an easy example. Perhaps you've noticed that for $X \sim \Poi(\lambda)$ and $Y\sim \Exp(\lambda)$, that $$P(X = 0) = P(Y > 1) = e^{-\lambda}.$$ Coincidence? I think not - both $X=0$ and $Y>1$ describe the same event of no arrivals of a Poisson process (with rate $\lambda$) happening in the first unit interval.

Another thing one could easily show now is that the sum of an independent $\Poi(\alpha)$ and $\Poi(\beta)$ is distributed like $\Poi(\alpha + \beta)$.
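For instance, with the generating functions from earlier (in the limit we found that a Poisson's generating function is $e^{\lambda(t-1)}$), this is a one-liner: $$M_{X+Y}(t) = M_X(t)\,M_Y(t) = e^{\alpha(t-1)}\, e^{\beta(t-1)} = e^{(\alpha+\beta)(t-1)},$$ which is exactly the generating function of a $\Poi(\alpha+\beta)$. The next section gives a second, process-level reason.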

Exercise. Now go back and try the list of problems. How many of them can you immediately solve now?

Meta. This is perhaps the first level of understanding we can extract from a limiting object, which allows us to connect various aspects together. You should feel that you could probably tackle any problems about Poissons and Exponentials that arise from a single process. We'll see that this can be pushed further.

Process Splitting

Here's a very counterintuitive property: let's say that we split a single Poisson process of rate $\lambda$ by tossing an independent coin with head probability $p$ for each arrival (heads go to one process, tails to the other). If you tried thinking by analogy to doing the same thing to $n$ objects and splitting them into two groups using a coin, it's evident that the sizes of the two groups are not independent. But here's the headliner:

If we split a Poisson process of rate $\lambda$ using a coin, we get two independent Poisson processes! In particular, if $\lambda = \alpha + \beta$ and we set the head probability to $p = \frac {\alpha} {\alpha + \beta}$, then out pop two independent Poisson processes $A$ and $B$ with rates $\alpha$ and $\beta$ respectively.

How do we know that this works? Let's start with the reverse statement: how do we know that putting two independent processes with rates $\alpha$ and $\beta$ together gives us a single process with rate $\alpha + \beta$? Well, we appeal to the limit argument again. If each interval of length $1/n$ had independent probabilities of $\alpha/n$ and $\beta/n$ of getting an arrival, then the probability of getting at least one arrival is $$\frac{\alpha}{n} + \frac{\beta}{n} - \frac{\alpha \beta}{n^2}$$ and when we take $n\to\infty$ we can ignore the $1/n^2$ term and say that the result is a Poisson process with rate $\alpha + \beta$.

If we squint a little, there's nothing stopping us from running this argument backwards: if the probability of "at least one arrival" is as given above, then we can split this event into three events of respective probabilities:

  • with probability $\frac{\alpha}{n} - \frac{\alpha \beta}{n^2}$, only $A$ arrives but $B$ doesn't
  • with probability $\frac{\beta}{n} - \frac{\alpha \beta}{n^2}$, only $B$ arrives but $A$ doesn't
  • with probability $\frac{\alpha \beta}{n^2}$ both arrive.

It's easy to check now that we've undone the merge: per interval, $A$ gets an arrival with probability $\left(\frac{\alpha}{n} - \frac{\alpha \beta}{n^2}\right) + \frac{\alpha \beta}{n^2} = \frac{\alpha}{n}$, similarly $B$ with probability $\frac{\beta}{n}$, and both arrive with probability $\frac{\alpha \beta}{n^2} = \frac{\alpha}{n}\cdot\frac{\beta}{n}$, which is exactly the product. So within each bin the two arrival indicators are independent, and hence so are the processes $A$ and $B$.
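Here's a minimal simulation sketch of the splitting claim (rates chosen arbitrarily), checking that the two per-unit-interval counts look Poisson and uncorrelated:

```python
import numpy as np

# Split the arrivals of a rate-(alpha + beta) process in [0, 1] with a
# p = alpha / (alpha + beta) coin, and inspect the two resulting counts.
rng = np.random.default_rng(0)
alpha, beta = 2.0, 3.0               # arbitrary rates
trials = 200_000

total = rng.poisson(alpha + beta, size=trials)         # arrivals in [0, 1]
a = rng.binomial(total, alpha / (alpha + beta))        # heads
b = total - a                                          # tails

print(a.mean(), a.var())             # both ~ alpha, as for Poi(alpha)
print(b.mean(), b.var())             # both ~ beta, as for Poi(beta)
print(np.corrcoef(a, b)[0, 1])       # ~ 0, consistent with independence
```

(Uncorrelated isn't the same as independent, of course, but even zero correlation would already be impossible if the total were a fixed $n$.)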

Meta. What exactly did we do here? In essence, the characterization of the Poisson process as a limit allowed us to realize that Poisson processes can be arbitrarily merged and split. This means, for instance, that we have a good interpretation of coin tosses on top of Poisson processes.

Interlude: Poissonization

I want to show a cool application of this that I learnt from the Randomized Algorithms (CS 265) class at Stanford. It left a really long-lasting impression on me about the level of creativity that was possible in the analysis of algorithms.

In the context of hashing and hash collisions, one can model the assignment of data to hashes by using the toy model of tossing $n$ balls (the "data") independently and uniformly into $n$ bins (the "hashes"). One would like to limit the number of hash collisions (which have to be resolved by a separate mechanism), and the extent of that can be phrased by the following question:

If I toss $n$ balls into $n$ bins, what is the typical largest number of balls in a bin?

Here is one argument: instead of tossing $n$ balls, we toss $N\sim \Poi(n)$ balls, and we know $N$ is roughly $n \pm \sqrt{n}$ (because it is the sum of $n$ independent copies of $\Poi(1)$). In particular, if we model the balls as Poisson arrivals and split them uniformly into $n$ bins, the number of balls landing in each bin is an independent $\Poi(1)$! So the question becomes: for what function $m(n)$ do we have $$P(\max(B_1, B_2,..., B_n) \le m(n)) \to 1,$$ where $B_1,B_2,..,B_n\sim \Poi(1)$ are the independent numbers of balls in the bins? Well,

$$P(\max(B_1, B_2,..., B_n) \le m(n)) = P(B_1 \le m(n))^n = \left(1 - e^{-1}\sum_{k > m(n)} \frac{1}{k!}\right)^n.$$

At this point, the computations get a little more complicated, but conceptually it is just about establishing the rate of decay of the upper tail of the Poisson distribution (i.e. what $P(B_1>t)$ is in terms of $t$). If we squint very hard, for this quantity to go to 1, we need $e^{-1}\sum_{k > m(n)} \frac{1}{k!} = o(1/n)$, and this sum is mostly determined by its leading term $\frac{1}{(m(n)+1)!}$ (up to e.g. a fixed multiplicative factor), so doing a Stirling approximation we roughly need $$m(n)^{m(n)} \approx n,$$ i.e. the threshold for $m(n)$ is roughly $\frac{\log n}{\log \log n}$. There are some additional tricks to make this argument rigorous (e.g. reverting from $\Poi(n)$ to $n$ balls) but that's basically the gist of it.
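If you'd like to see this threshold in the wild, here's a minimal balls-into-bins sketch (keeping in mind that the agreement with $\log n/\log\log n$ is only up to lower-order terms at any simulable scale):

```python
import math
import numpy as np

# Throw n balls into n bins uniformly at random and look at the max load.
rng = np.random.default_rng(0)
n = 1_000_000

bins = rng.integers(0, n, size=n)            # bin index of each ball
loads = np.bincount(bins, minlength=n)
print(loads.max())                           # typical max load
print(math.log(n) / math.log(math.log(n)))   # ~ 5.3 for n = 10^6
```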

Problem set speedrun

So if you know all of this stuff really well, you can do the entire problem set without a single integral. (Promise!)

(Last chance to try it yourself!)

Q1. $Y\sim \Exp(1)$. Find $P(Y>2 | Y>1)$.

Consider a rate 1 process. Given that there are no arrivals until time 1, the next arrival is just distributed like $1 + \Exp(1)$, so the answer is $P(Y>1) = \boxed{1/e}$. (We sometimes call this the memorylessness of exponentials.)

Q2. Let $Y\sim \Exp(\lambda)$. Find $\E[Y|Y\ge a]$.

Same as above, $\E[Y|Y\ge a] = a+\E[Y] = \boxed{a+ \lambda^{-1}}$.

Q3. Let $Y\sim \Exp(\lambda)$. Find $\E[Y^2|Y\ge a]$.

Again, $Y|Y\ge a$ has the same distribution as $a+Y$, so the answer is $a^2 + 2a\E[Y] + \E[Y^2] = \boxed{a^2+2a\lambda^{-1} + 2\lambda^{-2}}$.

Q4. $Y_1,Y_2\sim \Exp(\lambda)$ independent. Find $P(Y_1+Y_2\le 1)$.

Consider a Poisson process with rate $\lambda$. Then this is asking whether two events have occurred by time 1. The number of events occurring by time 1 is $X\sim \Poi(\lambda)$, so the answer is $P(X\ge 2) = \boxed{1-e^{-\lambda}(1 + \lambda)}$.
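If you want to double-check an answer without integrating either, here's a quick Monte Carlo sketch of this one (arbitrary $\lambda$; note numpy parametrizes exponentials by the mean $1/\lambda$):

```python
import numpy as np

rng = np.random.default_rng(0)
lam, trials = 1.7, 500_000

y = rng.exponential(1 / lam, size=(trials, 2))   # Y1, Y2 ~ Exp(lam)
print(np.mean(y.sum(axis=1) <= 1))               # empirical P(Y1 + Y2 <= 1)
print(1 - np.exp(-lam) * (1 + lam))              # the boxed answer
```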

Q5. $Y_1,Y_2, Y_3 \sim \Exp(\lambda)$ independent. Find $P(Y_1 + Y_2 \le 1, Y_1+Y_2+Y_3\ge 1)$.

Consider a Poisson process with rate $\lambda$. Then this is exactly the probability that exactly 2 events occur before time 1, so the answer is $\boxed{e^{-\lambda}\lambda^2/2}$.

Q6. $Y_1 \sim \Exp(\lambda_1)$ and $Y_2\sim \Exp(\lambda_2)$ are independent. Find $P(Y_1 \le Y_2)$.

In other words, if we put two Poisson processes of rates $\lambda_1$ and $\lambda_2$ together, what is the probability that the first arrival comes from the first process? But we can also achieve this by splitting a single rate $\lambda_1+\lambda_2$ process into two, so the answer is $\boxed{\frac{\lambda_1}{\lambda_1+\lambda_2}}$.

Q7. $X\sim \Exp(\lambda)$ and $Y\sim \Poi(X)$. Find $P(Y = 1)$.

Consider two processes $A$ (with rate $\lambda$) and $B$ (with rate $1$). Then, $X$ is the first event on $A$, while $Y$ is the number of events between $0$ and $X$ on $B$. But by splitting a rate $1+\lambda$ process, each arrival independently comes from $B$ with probability $\frac{1}{1+\lambda}$, and $Y = 1$ means the first arrival comes from $B$ and the second from $A$, so the answer is $\frac{1}{1+\lambda}\cdot\frac{\lambda}{1+\lambda} = \boxed{\frac{\lambda}{(1+\lambda)^2}}$.
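Here's a quick Monte Carlo check of that one too (arbitrary $\lambda$, again with numpy's mean parametrization):

```python
import numpy as np

rng = np.random.default_rng(0)
lam, trials = 0.8, 500_000

x = rng.exponential(1 / lam, size=trials)   # X ~ Exp(lam)
y = rng.poisson(x)                          # Y ~ Poi(X)
print(np.mean(y == 1))                      # empirical P(Y = 1)
print(lam / (1 + lam) ** 2)                 # the boxed answer
```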

Q8. $Y_1,Y_2,Y_3\sim \Exp(\lambda)$ independent. Find $P(Y_1+Y_2 \le \frac 2 3 | Y_1+Y_2+Y_3 = 1)$.

Consider a Poisson process with rate $\lambda$. Conditioned on the third event happening at time 1, the first two arrival times are distributed like two i.i.d. uniform points on $[0,1]$ (use a limiting argument), so both of them land before $2/3$ with probability $(2/3)^2 = \boxed{4/9}$.

Q9. $Y_1,...,Y_n\sim \Exp(1)$ independent. Let $M= \max\{Y_1,...,Y_n\}$. What is $\E M$?

Let $S_1 \lt S_2 \lt ... \lt S_n$ be the sorted version of $Y_i$'s. Then

  • $S_1\sim \Exp(n)$, because it is the first arrival among $n$ rate 1 processes
  • $S_2-S_1\sim \Exp(n-1)$, because it is the next arrival among $n-1$ rate 1 processes
  • and so on...

Thus, $\E S_n = \boxed{1+\frac 1 2 + \frac 1 3 + ... + \frac 1 n}$.
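A quick numeric check of the harmonic-sum answer (with an arbitrary $n$):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10, 200_000

m = rng.exponential(1.0, size=(trials, n)).max(axis=1)
print(m.mean())                              # empirical E[M]
print(sum(1 / k for k in range(1, n + 1)))   # harmonic number, ~ 2.929 for n = 10
```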

Q10. $X_1\sim \Exp(1), X_2 \sim \Exp(1/2), X_3\sim \Exp(1/3)$ independent. Find $P(X_1 < X_2 < X_3)$.

This is pretty similar to the above. The probability that the first arrival is $X_1$ is $\frac{1}{1+\frac 1 2 + \frac 1 3} = \frac 6 {11}$, and (by memorylessness) the probability that the next arrival is $X_2$ (between $X_2$ and $X_3$) is $\frac{\frac 1 2}{\frac 1 2 + \frac 1 3} = \frac 3 5$, so the answer is $\boxed{18/55}$.
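Here's a Monte Carlo check of this one (rates $1, 1/2, 1/3$ correspond to means $1, 2, 3$ in numpy's parametrization):

```python
import numpy as np

rng = np.random.default_rng(0)
trials = 1_000_000

x1 = rng.exponential(1.0, size=trials)   # Exp(1)
x2 = rng.exponential(2.0, size=trials)   # Exp(1/2)
x3 = rng.exponential(3.0, size=trials)   # Exp(1/3)
print(np.mean((x1 < x2) & (x2 < x3)))    # empirical probability
print(18 / 55)                           # ~ 0.327
```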

Epilogue, and a challenge

I feel like this article came at an apropos time, as the IMO committee is considering including probability in the official syllabus - hopefully this gives you a taste of the level of creativity one could have in this topic.

But let me also tell you about the inspiration for me to write this article.

Around three years ago, I was helping my girlfriend out with her statistics problem set, and there was this incredible problem statement:

Let $X, Y, Z$ be independent real numbers uniformly selected from $[0,1]$. Show that $(XY)^Z$ is also uniformly distributed on $[0,1]$.

This is one of those problems where you just stare at it in disbelief, and get a rude shock when the computation actually works out. I thought to myself at that point: "Surely there has to be a reason for this." But there was none at that point. (I think the cleanest way I had for that was to use an mgf, but it was still not convincing.)

And then roughly two years ago, I TA'd for the aforementioned intro to probability class, and while teaching I refined my understanding of the whole Poisson process structure.

A month ago, I overheard my colleagues talking about the following algorithm for streaming sampling of data. Let's say you have a huge number of datapoints $N$, each with a weight $w_i$. You would like to sample $k$ of them such that each datapoint makes it into your sample with probability proportional to $w_i$, but you are allowed to store at most $k$ datapoints. How would you do this?

My other colleague pointed out that there was a paper that suggests sampling $U^{1/w_i}$ for each datapoint (for uniform $U$ on $[0,1]$), and then keeping the $k$ largest entries, which was absolutely nuts to me. To convince yourself, you could try the two-datapoint case, where you should obtain that $$P(U_i^{1/w_i} > U_j^{1/w_j}) = \frac{w_i}{w_i+w_j}.$$ In my head, I was just thinking that "hmm, this looks like Poisson splitting". And indeed it was: $-\log U$ is actually distributed like $\Exp(1)$, so $-\log U^{1/w_i}$ is really just an $\Exp(w_i)$, i.e. the first arrival of a rate $w_i$ process, and keeping the largest $U^{1/w_i}$'s means keeping the earliest arrivals. And then all these facts were immediately obvious.
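Here's a minimal two-datapoint sanity check of that pairwise claim (weights chosen arbitrarily):

```python
import numpy as np

# Check P(U_i^(1/w_i) > U_j^(1/w_j)) = w_i / (w_i + w_j) for two datapoints.
rng = np.random.default_rng(0)
wi, wj = 3.0, 1.0                     # arbitrary weights
trials = 1_000_000

ui = rng.random(trials) ** (1 / wi)
uj = rng.random(trials) ** (1 / wj)
print(np.mean(ui > uj))               # empirical
print(wi / (wi + wj))                 # 0.75
```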

And then I thought - maybe that fact about the uniforms was actually, secretly, also a fact about the Poisson process! Taking the log, we would like to show that $$U\cdot (E_1 + E_2) \sim \Exp(1)$$ for independent $E_1,E_2\sim \Exp(1)$. So the remaining mystery is how a uniform random variable shows up in the Poisson process world.

I'll leave this last discovery to the reader, and maybe you can feel the same excitement as I felt when I solved this problem.
