Lattices for prime factorization

(David here.) In this post I want to share a really interesting factoring algorithm that I learnt while taking an advanced cryptography class, and it will serve as a good springboard into talking about the mathematics and algorithms of lattices.

$$\newcommand{\abs}[1]{\left | #1\right |}$$ $$\newcommand{\vol}{\mathrm{vol}}$$ $$\newcommand{\norm}[1]{\left \| #1 \right \|}$$ $$\newcommand{\normn}[1]{\| #1 \|}$$ $$\newcommand\cL{\mathcal L}$$ $$\newcommand{\lrangle}[1]{\left\langle #1 \right\rangle}$$ $$\newcommand{\lranglen}[1]{\langle #1 \rangle}$$

An integer factorization problem

Factorizing a number is hard, especially if your number only has large prime divisors. There is no known efficient way to factor a number - that is, no algorithm that factors an $n$-digit number in time polynomial in $n$. (For scale, trial division factors a number $N$ using $\sqrt N = \exp((\log N) / 2)$ operations, which is exponential in the number of digits $\log N$.)

The fact that prime factorization is difficult forms the basis of the RSA protocol, widely used today to encrypt data over the internet. If $N=pq$ for two large primes $p, q$, then $N$ can be released to the public while $\varphi(N) = (p-1)(q-1)$ remains unknown. If we wanted to receive a message $m < N$, we could do the following:

  • generate a pair $(d,e)$ such that $de\equiv 1 \pmod {\varphi(N)}$
  • release $e$ (the "encryption key") to the public
  • anyone in the public can send us the encrypted message $m^e \pmod {N}$, but the only effective way to decrypt it is to know $d$, which allows the computation $(m^e)^d \equiv m \pmod N$.
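To make this concrete, here is a toy sketch of the scheme in Python - the primes are made up and artificially tiny, and there is no padding, so this is nothing like production RSA:

```python
# Toy RSA sketch: artificially small primes, no padding -- illustration only.
from math import gcd

p, q = 1009, 1013                  # real keys use primes hundreds of digits long
N, phi = p * q, (p - 1) * (q - 1)

e = 65537                          # a common choice of public exponent
assert gcd(e, phi) == 1
d = pow(e, -1, phi)                # private key: d*e = 1 (mod phi(N))

m = 271828                         # the message, which must satisfy m < N
c = pow(m, e, N)                   # anyone can encrypt...
assert pow(c, d, N) == m           # ...but decrypting requires d
```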

One might be tempted to attribute this hardness to "the randomness of primes". This intuition is misleading for at least two reasons:

  • In 2002, the AKS algorithm was discovered, which lets us test the primality of $N$ in $O((\log N)^6)$ time.
  • The problem I'm about to show you can be solved in time polynomial in $\log N$.

So what is this problem?

If the prime factors of a semiprime share half of their leading digits, then we can factor it efficiently. That is, given $N = pq$ with $|p-q| \le O(N^{1/4})$, we can factor $N$ in time polynomial in $\log N$.

The initial attempt

Fix $a = \lfloor N^{1/2} \rfloor$. Assuming that $p$ is the larger prime, this amounts to assuming that $p= x_0 + a$ for some small but unknown $x_0 = O(N^{1/4})$.

Here's the idea: we will define a bunch of polynomials $\{h_j\}$ where a large power of $p$ divides $h_j(x_0)$. Then, we will take an integer linear combination of these polynomials to produce a polynomial $g$ with small coefficients, but we will still have a large power of $p$ dividing $g(x_0)$. If the coefficients are small enough, we can force $g(x_0) = 0$, at which point $x_0$ can be recovered because we can efficiently find all roots of a polynomial. (This can be done analytically! Perhaps using winding number methods.)

Let's make things concrete. Define $$h_j(x) = N^{m-j} (x+a)^j$$ so that $p^m$ divides each of $\{h_j(x_0)\}_j$ (the factor $N^{m-j}$ contributes $p^{m-j}$, and $(x_0+a)^j = p^j$ contributes the rest). Thus, $p^m$ must also divide $g(x_0)$ for any polynomial of the form $$g = \sum_{j=0}^{m-1} c_j h_j$$ with integer coefficients $c_j$, and each such $g$ has degree at most $m-1$. So:

  • on one hand, $p^m$ divides $g(x_0)$;
  • on the other hand, the hope is that we can pick the coefficients $c_j$ to make the coefficients of $g$ small enough to force $|g(x_0)| < p^m$. (It's sufficient to enforce $|g(x_0)| < N^{m/2}$, since $p \ge \sqrt N$.)

If these are both true, then we must have $g(x_0) = 0$, so $x_0$ can be obtained by finding all roots of $g$.
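Before moving on, here is a quick numerical sanity check of the divisibility claim, with made-up primes:

```python
# Check: p^m divides h_j(x0) = N^(m-j) * (x0 + a)^j for every j < m.
from math import isqrt

p, q = 1000033, 1000003            # two close primes (made-up example)
N = p * q
a = isqrt(N)
x0 = p - a                         # the small unknown offset (here x0 = 16)
m = 3

for j in range(m):
    assert (N ** (m - j) * (x0 + a) ** j) % p ** m == 0
```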

The second point is made trickier because we don't know what $x_0$ is. Let's first specify a bound $B$ on $|x_0|$ (which is $O(N^{1/4})$ in our problem). Then our end goal is to show that

$$\max_{|x| \le B} \abs{g(x)} \le N^{m/2} < p^{m}.$$

For starters, we bound $|g(x)|$ using the coefficient values of $g(Bx)$:

$$ \abs{g(x)} \le m \cdot \max_{0 \le k \le m-1}\abs{[x^k]\, g(Bx)}$$

for all $|x| \le B$. So, interpreting $g(Bx)$ as an $m$-dimensional vector (using its coefficients as coordinates), it suffices to find $g$ with a small $\infty$-norm (which is the size of the largest coefficient).

Under the vector space interpretation, the set $\{\sum_j c_j h_j (Bx)\}_{c\in \mathbb Z^m}$ forms a lattice (which we call $\mathcal L$). Now, by Minkowski's theorem, since the box $[-r, r]^m$ is convex and centrally symmetric, if $(2r)^m \ge 2^m\mathrm{vol}(\cL)$, where $\vol (\cL)$ is the volume of the fundamental parallelepiped, then the box contains a point of the lattice other than the origin. This lets us go one step further and write down $$\max_{x\in [-B, B]} \abs{g(x)} \le m \cdot \mathrm{vol}(\cL)^{1/m}.$$

It also turns out that the volume can be computed surprisingly exactly. One can think of the volume of $\cL$ as the "upscale factor" of the linear map $c \mapsto \sum c_j h_j(Bx)$, which can be thought of as the composition of three maps:

  • the prescaling $c \mapsto (c_j N^{m-j})$
  • the invertible map $d\mapsto \sum_j d_j (x+a)^j$ (which is triangular with unit diagonal, so it has no scale-up)
  • and finally the post scaling $g(x) \mapsto g(Bx)$.

In the first and last steps, the volume scales independently along each dimension, so we can just take the stretch factor along each dimension and multiply them together. The result is precisely $$\mathrm{vol}(\cL) = N^{m(m+1)/2} B^{m(m-1)/2},$$ so our final bound is $$\max_{x\in [-B, B]}|g(x)| \le m \cdot N^{(m+1)/2} B^{(m-1)/2},$$ which is a little too large - remember that if we want to zero this out using the fact that $p^m$ divides $g(x_0)$, then we need this bound to be less than $p^m$ (or at most $N^{m/2}$, since $p \ge \sqrt N$).
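Since the matrix expressing the $h_j(Bx)$ in the coefficient basis is triangular, this volume formula is easy to double-check - a throwaway verification with made-up parameters:

```python
# Verify vol(L) = N^(m(m+1)/2) * B^(m(m-1)/2). Row j of the basis matrix
# holds the coefficients of h_j(Bx) = N^(m-j) * (Bx + a)^j.
from math import comb

N, a, B, m = 10007 * 10009, 10008, 10, 4    # made-up small parameters

rows = [[N ** (m - j) * comb(j, k) * a ** (j - k) * B ** k if k <= j else 0
         for k in range(m)]
        for j in range(m)]

det = 1
for j in range(m):
    det *= rows[j][j]             # lower-triangular: det = product of diagonal
assert det == N ** (m * (m + 1) // 2) * B ** (m * (m - 1) // 2)
```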

Even though this failed, we see that there are many tweaks possible, and the computations so far are fairly clean and simple.

Meta-commentary. The application of Minkowski's theorem is really just an application of pigeonhole, where we chop up the cube $[-r, r]^m$ and split it over copies of the fundamental parallelepiped. In some sense this is dual to finding two points of a lattice that are close together mod 1, e.g. for the simultaneous Diophantine approximation trick. I'm also reminded of IMO 2021/3.

Adding more polynomials

To improve this, we toss another set of polynomials into the mix whose values at $x_0$ are also divisible by $p^m$. Consider $$(x+a)^m, (x+a)^{m+1}, (x+a)^{m+2},..., (x+a)^{m+t-1}.$$

This expands the dimension of our lattice to $n = m+t$. Now we try again:

$$\max_{x\in [-B, B]} \abs{g(x)} \le n \cdot \vol(\cL)^{1/n} = n\cdot N^{\frac{m(m+1)}{2n}}\cdot B^{(n-1)/2} \overset{?}{\le} N^{m/2}$$

which is definitely looking much better! Let's work in log-space and drop all the little details we don't care about: we (essentially) need $$n \log B \le (m - m^2 / n) \log N.$$ Simplify by taking $n = \gamma m$: $$\gamma \log B \le (1 - 1/\gamma)\log N,$$ so $B \le N^{1/\gamma - 1/\gamma^2}$ is sufficient, and we could take $\gamma = 2$, for instance, which handles $B$ up to about $N^{1/4}$.
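As a quick sanity check on the arithmetic (not a proof - the constants are exactly what we dropped), we can compare the two sides of the original inequality in log-space for concrete made-up parameters:

```python
# Check n * N^(m(m+1)/(2n)) * B^((n-1)/2) <= N^(m/2) in log-space.
from math import log

N = 10 ** 60                      # made-up size
m, t = 10, 10                     # so gamma = n/m = 2
n = m + t
B = N ** 0.25 / 1000              # a bit inside the B <= N^(1/4) threshold

lhs = log(n) + m * (m + 1) / (2 * n) * log(N) + (n - 1) / 2 * log(B)
rhs = m / 2 * log(N)
print(lhs < rhs)                  # True for these parameters
```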

Meta-commentary. Improvements are often possible when your overall approach is simple. Adding a free variable to a simple approach can make the final conclusion very nontrivial (cf. the crossing number inequality).

One caveat

Is this algorithm actually efficient? For a factorization algorithm to be considered efficient, we require the runtime to be bounded by some power of $\log N$. This means that the numbers we deal with must be of size at most $\exp((\log N)^{O(1)}) = N^{(\log N)^{O(1)}}$.

One immediate implication is that the largest $m$ (or $n$) that we can use should be on the order of $(\log N)^c$ for some constant $c$.

The next issue, however, is that we cannot efficiently determine the lattice point of minimum $\infty$-norm - this is known to be an NP-hard problem. Fortunately, there is an efficient algorithm that produces some vector which is (loosely) approximately optimal:

(LLL basis reduction.) There is an efficient algorithm that computes a "reduced" basis with some properties. Furthermore, the shortest basis vector $\ell$ satisfies $$\norm{\ell}_\infty \le \norm{\ell}_2 \le 2^{(n-1)/4}\cdot \vol(\mathcal L)^{1/n}.$$

This result is actually sufficient for us, because it just translates to shrinking $B$ by a constant factor (which can be patched in by chopping the initial interval up and running the algorithm on each interval). So with some modification, we can in fact get an efficient algorithm!
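To make the whole pipeline concrete, here is a rough end-to-end sketch. It leans on two third-party libraries - fpylll for LLL reduction and sympy for rational root finding - and the parameters $m, t, B$ below are small and untuned, so treat it as an illustration of the method rather than a robust implementation:

```python
# Sketch of the close-prime factoring algorithm; assumes fpylll and sympy.
from math import comb, isqrt
from fpylll import IntegerMatrix, LLL
from sympy import Poly, Rational, symbols

def factor_close_primes(N, m=4, t=4):
    """Try to factor N = p*q with p - q = O(N^(1/4)). May fail; raise m, t."""
    a = isqrt(N)
    n = m + t
    B = isqrt(isqrt(N)) // 16 + 1      # a bit inside N^(1/4); see the caveat
    # Row j: coefficients of N^max(m-j, 0) * (Bx + a)^j. Evaluated at the
    # unknown x0, every row is divisible by p^m.
    M = IntegerMatrix(n, n)
    for j in range(n):
        s = N ** max(m - j, 0)
        for k in range(j + 1):
            M[j, k] = s * comb(j, k) * a ** (j - k) * B ** k
    LLL.reduction(M)                   # the first row is now a short vector
    x = symbols('x')
    # The first row holds the coefficients of g(Bx); undo the scaling.
    g = Poly(sum(Rational(M[0, k], B ** k) * x ** k for k in range(n)), x)
    for r in g.ground_roots():         # rational roots of g
        if r.is_integer and N % (a + int(r)) == 0:
            return a + int(r), N // (a + int(r))
    return None

print(factor_close_primes((10 ** 9 + 7) * (10 ** 9 + 9)))
```

(Here $B$ is simply shrunk by a constant factor rather than covering $[-N^{1/4}, N^{1/4}]$ with shifted subintervals as described above, so primes near the edge of the bound would need that extra patch.)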

Meta-commentary 1. It's easy to lose track of what we were trying to do, isn't it?

Meta-commentary 2. If this were a college lecture, you would be learning all about LLL basis reduction before any of the interesting applications. But if I got to design it, I would present the most interesting application before diving into the theory itself.

How this showed up in real life

The method we showed actually works much more generally, as long as the entropy in the primes' digits is reduced in some fashion. A popular library for RSA used a protocol to generate primes of the form $$p = kM + (65537^a \bmod M)$$ where $M$ is the product of the first $n$ primes (e.g. $n=39$) and $k, a$ are random. Researchers were able to adapt this method to factor moduli generated using this library - a serious security vulnerability, now known as ROCA.

If you're interested, you can check out Section 4 of these notes, which gives a ton of examples - this algorithm is useful practically any time you need to invoke Minkowski's theorem, or any time you want a small integer combination of anything at all.

A deeper dive into the shortest vector problem

I could just state the algorithm right now and call it a day - but let's have a stab at inventing it.

I.

Here was my first idea (which is based on "rounding" constructions): we start off with some linear combination which we know to be small - e.g. the projection of some basis vector onto the subspace spanned by the rest:

$$\tilde b_n = b_n - c_1b_1 - c_2b_2 - ... -c_{n-1} b_{n-1}, \qquad \lranglen{\tilde b_n, b_i} = 0$$

Exactly how small? We know that the product of the lengths of the $\{\tilde b_i\}$ is at most $\vol(\cL)$, so the smallest one has length at most $\vol(\cL)^{1/n}$ - definitely the right size.

Of course, the issue here is that we're looking for an integer combination of the $b_i$'s. So here's a silly proposal: why not just round each $c_i$ to the nearest integer?

Well, this is bad, because maybe all the $b_i$'s are large but fairly aligned, so that the actual fundamental region is really small - think about what happens if $b_i = x + s_i$ where the $s_i$'s are small and $x$ is large.

II.

Here's a slight improvement: we take the step-by-step projections $$ \begin{align*} g_1 &= b_1 \\ g_2 &= b_2 - \mu_{2,1} g_1 \\ g_3 &= b_3 - \mu_{3,1} g_1 - \mu_{3, 2} g_2 \\ \vdots \end{align*} $$

(This is precisely the Gram-Schmidt orthogonalization - hence $g$ for "Gram-Schmidt".)

Firstly note that now we must have (exactly!)

$$\normn{g_1} \cdot \normn{g_2} \cdots \normn{g_n} = \vol(\cL)$$

In this instance, we could do the following rounding procedure:

$$ \begin{align*} g_n &= b_n - c_1 g_{n-1} - (\text{linear combination of }g_{n-2},..., g_1) \\ &= b_n - \lfloor c_1 \rfloor b_{n-1} - \{c_1\} g_{n-1} - (\text{linear combination of }g_{n-2},..., g_1)\\ &= b_n - \lfloor c_1 \rfloor b_{n-1} - \{c_1\} g_{n-1} - c_2 g_{n-2} - (\text{linear combination of }g_{n-3},..., g_1) \\ &= b_n - \lfloor c_1 \rfloor b_{n-1} -\lfloor c_2\rfloor b_{n-2} - \{c_1\} g_{n-1} - \{c_2\} g_{n-2} - (\text{linear combination of }g_{n-3},..., g_1) \\ &= ... \end{align*} $$

The result: we find some integer combination of $\{b_i\}$ that has size at most $\sum_i \normn{g_i}$. This is the wrong side of the AM-GM inequality, but it nonetheless tells us that the worrying case is not when the $\normn{g_i}$'s are close together - it's when they're spread out. For this rounding approach to fail, we should expect one of the $\normn{g_i}$'s to be large.
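Here's a sketch of this rounding procedure with numpy (floating point throughout; real lattice code would be more careful with precision):

```python
import numpy as np

def round_off(basis):
    """The rounding from (II): peel integer multiples of b_{n-1}, ..., b_1
    off b_n. The output is a lattice vector of length <= sum_i ||g_i||."""
    B = np.array(basis, dtype=float)
    n = len(B)
    # Gram-Schmidt: G[i] is the part of B[i] orthogonal to B[0], ..., B[i-1].
    G = B.copy()
    for i in range(n):
        for j in range(i):
            G[i] -= (B[i] @ G[j]) / (G[j] @ G[j]) * G[j]
    # Subtracting a multiple of B[i] only disturbs the coefficients on
    # g_1, ..., g_i, so rounding from the top index down never undoes work.
    v = B[-1].copy()
    for i in range(n - 2, -1, -1):
        c = (v @ G[i]) / (G[i] @ G[i])   # current coefficient of v on g_i
        v -= round(c) * B[i]             # subtract an integer multiple of b_i
    return v
```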

What can we do if it turns out that one of the $\normn{g_i}$'s was indeed large? Two possibilities:

  • We could reorder our basis vectors to have $b_i$ projected onto other vectors first.
  • We could just exclude $b_i$.

Thinking about it a little, I was convinced that the "bad case" looks something like the $\norm{g_i}$'s being spaced apart with a common geometric ratio. But in those instances we could take the exclusion strategy to the extreme and just pick $b_1$.

III.

It is perhaps a mistake to have this be the third thing I tried instead of the first, but why not try small $n$? A spoiler here is that the shortest vector problem is exactly solvable in the plane (due to Gauss). Let's see what that looks like.

Suppose we have two vectors $b_1, b_2$, and WLOG $b_2$ is longer. We could do a "Euclidean algorithm"-like move: we can try to subtract a multiple of $b_1$ to make $b_2-kb_1$ as short as possible (and we note that $b_1, b_2-kb_1$ still span the same lattice).

So when we have the shortest possible $b_2$, we must have $$\norm{b_2 + b_1}^2 \ge \norm{b_2}^2$$ $$\norm{b_2 - b_1}^2 \ge \norm{b_2}^2$$ Expanding, these say $\pm 2\lrangle{b_1, b_2} + \norm{b_1}^2 \ge 0$, or equivalently, $|\lrangle{b_1, b_2}| / \norm{b_1}^2 \le 1/2$. (By some shocking coincidence, $\mu_{2,1}$ from (II) is exactly $\lrangle{b_1, b_2} / \norm{b_1}^2$.)

And of course, if the resulting $b_2$ is shorter than $b_1$, then we can flip them around and repeat.

Let's say that we end up with a pair of vectors where $\norm{b_1}\le \norm{b_2}$ and $|\lrangle{b_1, b_2}| / \norm{b_1}^2 \le 1/2$. Then, any other lattice vector is of the form $k_1b_1 + k_2b_2$, so $$ \begin{align*} \norm{k_1b_1 + k_2b_2}^2 &= k_1^2 \norm{b_1}^2 + 2k_1k_2 \lrangle{b_1,b_2} + k_2^2 \norm{b_2}^2\\ &\ge (k_1^2 - \abs{k_1k_2} + k_2^2) \norm{b_1}^2 \ge \norm{b_1}^2 \end{align*} $$ since the coefficient $k_1^2 - \abs{k_1k_2} + k_2^2$ is a positive integer unless $k_1=k_2=0$.
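In code, the whole two-dimensional procedure is only a few lines (a small sketch, assuming integer vectors):

```python
def gauss_reduce(b1, b2):
    """Gauss's 2D reduction: returns a basis whose first vector is shortest."""
    norm2 = lambda v: v[0] ** 2 + v[1] ** 2
    dot = lambda u, v: u[0] * v[0] + u[1] * v[1]
    if norm2(b1) > norm2(b2):
        b1, b2 = b2, b1                     # keep b1 as the shorter vector
    while True:
        k = round(dot(b1, b2) / norm2(b1))  # best multiple of b1 to subtract
        b2 = (b2[0] - k * b1[0], b2[1] - k * b1[1])
        if norm2(b2) >= norm2(b1):
            return b1, b2                   # now |<b1, b2>| <= ||b1||^2 / 2
        b1, b2 = b2, b1                     # b2 became shorter: swap and repeat

# The "aligned" bad case from (I): large basis vectors, tiny lattice minimum.
print(gauss_reduce((1000, 1001), (1001, 1000)))   # b1 comes out as (1, -1)
```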

We now get a hint about the name - even though we want a shortest vector, we might need to keep track of the entire basis. And perhaps we might want to modify the basis incrementally - a "basis reduction" algorithm!

IV.

Chasing down this idea of "keeping track of the whole basis", we find that there is actually a symmetric version of the theorem:

(LLL Basis reduction, basis version) There exists an efficiently computable basis of $\cL$ where the product of lengths is at most $2^{n(n-1)/4} \cdot \vol (\cL)$.

In other words, there exists an "almost orthogonal" basis of the lattice, for some very loose definition of "almost orthogonal". If we were working with vector spaces, we would just do Gram-Schmidt to get an exact orthogonal basis (that is, the constant there would be 1). But due to the imprecision/chunkiness of the lattice, we can only be this accurate.

OK, so let's do (III) but just with the goal of getting the volume to be within some factor of $\norm{b_1}\cdot \norm{b_2}$. In fact, we have $$\vol(b_1,b_2)^2 + \langle b_1, b_2\rangle^2 = \norm{b_1}^2 \norm{b_2}^2$$ thus all we have to do is to guarantee that $|\lrangle{b_1, b_2}| \le \frac{1}{\sqrt 2} \norm{b_1} \norm{b_2}$ - that is, that $b_1$ and $b_2$ have an angle of more than 45 degrees between them. We saw earlier that this can be done (and in fact that we can force the angle to be at least 60 degrees).

What is the general relationship between the volume and the angles between basis vectors? Well, consider the matrix $$B = \begin{pmatrix}\vdots & \vdots & & \vdots\\ b_1 & b_2 & \cdots & b_n\\ \vdots & \vdots & & \vdots\end{pmatrix}$$ then we could take the so-called Gram matrix: $$B^\top B = (\lrangle{b_i, b_j})_{ij}$$ and $\det B^\top B = \vol(\cL)^2$ since the determinant is multiplicative. One could further rescale the rows and columns by $\norm{b_i}$ to get a normalized Gram matrix:

$$\det \left(\frac{\lrangle{b_i, b_j}} {\norm{b_i}\cdot \norm{b_j}}\right)_{ij} = \frac{\vol(\cL)^2}{\prod_i \norm{b_i}^2}.$$

So we know the diagonal entries of this matrix are 1s, and we can arrange for the off-diagonal entries to be at most $1/\sqrt{2}$ in absolute value. Is that sufficient to conclude that the determinant is at least some constant?

The answer is, unfortunately, no. In the special case where all off-diagonal terms are equal to $\alpha$, the determinant is exactly $(1-\alpha)^{n-1}(1+(n-1)\alpha)$. So if $\alpha = -1/(n-1)$, the determinant is 0...
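(A throwaway numerical check of this degenerate case, using numpy:)

```python
import numpy as np

n = 6
alpha = -1 / (n - 1)                       # off-diagonal value
A = (1 - alpha) * np.eye(n) + alpha * np.ones((n, n))  # 1s on the diagonal
print(np.linalg.det(A))                    # ~1e-16, i.e. zero
print((1 - alpha) ** (n - 1) * (1 + (n - 1) * alpha))  # the formula: exactly 0
```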

What's missing here? In some sense, we got a worse problem from doing the symmetrization - comparing with (III), we'll see that the missing ingredient is the ordering we enforced there between the sizes of the basis vectors.

V.

This time, we do the more straightforward thing - just generalize the approach of (III) to $n$ dimensions. We will simply (1) subtract copies of one vector from another whenever we can decrease their lengths, and (2) swap vectors to preserve the length order, and we study the object we get upon termination. And we pick up one clue from earlier: keep track of the Gram-Schmidt vectors.

Let's write down the Gram-Schmidt again:

$$ \begin{align*} b_1 &= g_1 \\ b_2 &= g_2 + \mu_{2,1} g_1 \\ b_3 &= g_3 + \mu_{3,1} g_1 + \mu_{3, 2} g_2 \\ \vdots \end{align*} $$

If we ever do $b_j \mapsto b_j - kb_i$, the only change to these equations would be that we subtracted $k$ copies of the $i$-th equation from the $j$-th equation, i.e. $$b_j - kb_i = g_j + ... + \mu_{j, i+1} g_{i+1} + (\mu_{j,i} - k) g_i + (\mu_{j,i-1} - k\mu_{i,i-1}) g_{i-1} + ...$$ and with a handful of such operations we can arrange that $|\mu_{j,i}| \le 1/2$ for all $i < j$.

What about (2)? Well, let's say we swapped the order of $b_i$ and $b_{i+1}$. Then we get (writing $c:= \mu_{i+1, i}$ and $D := c^2 \norm{g_i}^2 + \norm{g_{i+1}}^2$)

$$b_{i+1} = (g_{i+1} + c g_i) + \mu_{i+1, i-1} g_{i-1} + ...$$ $$b_i = \frac{\norm{g_{i+1}}^2 g_i - c\norm{g_i}^2 g_{i+1}}{D} + \frac {c\norm{g_i}^2} D (g_{i+1} + c g_i) + \mu_{i, i-1} g_{i-1} + ...$$

i.e. the transformation on the $g$'s is simply $$(g_i, g_{i+1}) \mapsto \left(g_{i+1} + cg_i,\ \frac{\norm{g_{i+1}}^2 g_i - c\norm{g_i}^2 g_{i+1}}{D}\right).$$

(One sanity check to do at this point is that the product $\norm{g_i} \cdot \norm{g_{i+1}}$ stays the same after applying the transformation, and that the two resulting vectors are still orthogonal.)

When is this transformation a "good deal"? You could imagine that in the "stable" state, we have $$\norm{g_1} \ge \norm{g_2} \ge ... \ge \norm{g_n}$$ since more and more of the space vanishes. From (II), we would like $\norm{g_i}$'s to be as balanced as possible, so perhaps we want to enforce that we only do this if $\norm{g_i'} < \norm{g_i}$, or equivalently,

$$\norm{g_{i+1} + \mu_{i+1, i} g_i}^2 =\norm{g_{i+1}}^2 + \mu_{i+1,i}^2 \norm{g_i}^2 < \norm{g_i}^2$$

so upon termination we can assume that $\norm{g_{i+1}} \ge \sqrt{1-\mu_{i+1,i}^2} \norm{g_i} \ge \frac{\sqrt{3}} 2 \norm{g_i}$. So, each successive $\norm{g_i}$ can shrink by a factor of at most $2/\sqrt{3}$. Thus,

$$\vol(\cL)^{1/n} = \left(\prod_i \norm{g_i}\right)^{1/n} \ge \prod_i \left(\norm{g_1} \cdot(\sqrt{3}/2)^{i-1} \right)^{1/n} = (\sqrt{3}/2)^{(n-1)/2} \norm{g_1},$$ i.e. $\norm{g_1} \le (2/\sqrt 3)^{(n-1)/2} \vol(\cL)^{1/n}$; but $g_1 = b_1$ is itself a vector of the lattice, so we're done.

VI.

Are we really done? We forgot about algorithmic efficiency again. How do we know that the sequence of (1) and (2) steps terminate (quickly)?

Well, the (1)'s terminate because there are only $n(n-1)/2$ $\mu$'s anyway. By rounding them in the correct order (from larger indices to smaller ones), we need to round each one at most once.

What about (2)? Well, let's suppose our criterion for performing (2) is stricter: we want a constant factor of improvement in the length. Concretely, let's suppose we require $\norm{g_i'}^2 \le \frac 3 4 \norm{g_i}^2$. Then, the same arguments work out with different constants: $$\norm{g_{i+1}}\ge \tfrac 1 {\sqrt 2} \norm{g_i}, \qquad \vol(\cL)^{1/n} \ge 2^{-(n-1)/4} \norm{g_1}.$$

In this case, we can show that the algorithm terminates by constructing a monovariant. When performing (2), we only affect $g_i$ and $g_{i+1}$: the product $\norm{g_i}\cdot \norm{g_{i+1}}$ stays the same, but $\norm{g_i}^2$ shrinks by a factor of $\frac 3 4$. So the natural candidate is $$M = \prod_{i=1}^n \norm{g_i}^{2(n-i)}$$ and this number decreases by a factor of $3/4$ each time. Conveniently, this number is an integer, because each partial product $\prod_{i=1}^k \norm{g_i}^2$ is an integer - it is the determinant of the Gram matrix of $b_1, ..., b_k$, i.e. the squared $k$-dimensional volume of the fundamental region spanned by the first $k$ vectors. Thus, we can only perform (2) at most $O(n^2 \log B)$ times, where $B = \max_i \norm{b_i}$.
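Putting (1), (2) and the $3/4$ criterion together gives the full algorithm. Here is a compact floating-point sketch (a real implementation would use exact or carefully-managed arithmetic, and would update the Gram-Schmidt data incrementally instead of recomputing it):

```python
import numpy as np

def lll_reduce(basis, delta=0.75):
    """Sketch of LLL reduction; the rows of `basis` are the input vectors."""
    B = np.array(basis, dtype=float)
    n = len(B)

    def gram_schmidt():
        G, mu = B.copy(), np.zeros((n, n))
        for i in range(n):
            for j in range(i):
                mu[i, j] = (B[i] @ G[j]) / (G[j] @ G[j])
                G[i] -= mu[i, j] * G[j]
        return G, mu

    G, mu = gram_schmidt()
    i = 1
    while i < n:
        for j in range(i - 1, -1, -1):        # step (1): size-reduce b_i
            if abs(mu[i, j]) > 0.5:
                B[i] -= round(mu[i, j]) * B[j]
                G, mu = gram_schmidt()
        # step (2): swap if the Gram-Schmidt vector at position i-1 would
        # shrink by a factor better than delta = 3/4.
        if G[i] @ G[i] < (delta - mu[i, i - 1] ** 2) * (G[i - 1] @ G[i - 1]):
            B[[i - 1, i]] = B[[i, i - 1]]
            G, mu = gram_schmidt()
            i = max(i - 1, 1)
        else:
            i += 1
    return B

print(lll_reduce([[201, 37], [1648, 297]]))   # a small 2D example
```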

Notes.

  • Part (VI) holds with $3/4$ switched out for any $\delta \in (1/4,1)$. The final reduced basis is called an LLL-reduced basis with parameter $\delta$.
  • In practice, the shortest vector we obtain from basis reduction is within a factor of optimal that is much better than the exponential-in-$n$ guarantee. There is not yet an exact explanation for this.

Meta-commentary. The definition of LLL reduced basis is kind of terrible if you just look at it in isolation. But with all this context it (hopefully) starts to make sense.

Meta-commentary 2. I actually spent a lot of time stuck on (III), and tried to make versions of the other parts using the formulation in (III). If you were able to directly construct an "almost orthogonal" basis you would also be done - let me know if you were able to do this.

Meta-commentary 3. I think this is what math feels like sometimes - you build intuition through various partial attempts, and at each step you try to invoke any learnings from previous steps or intuition from elsewhere (e.g. linear algebra).

Meta-commentary 4. In the purely mathematical world, one never really cares about efficiency of computation - it's good enough that something exists. In this case, Minkowski's theorem does it for us. However, in the broader world, efficiency is really important. Entire fields (e.g. theoretical cryptography, complexity theory) do not exist if you assume all computation is efficient.

Epilogue

If you've reached this point, thank you for having enough patience to wade through this whole post! I think this post captures a small slice of the type of mathematics I really enjoy: something that is (1) mathematically and theoretically interesting, (2) accessible, and (3) rich in varied applications (real-world or in other areas).

During my last year in college, I jumped head-first into the topics course having no background in cryptography at all, and this was just one instance among many where looking at a different area inspired me to learn more mathematics. Had I not done this, I would never have heard of Coppersmith's theorem or LLL basis reduction. However, it is also true that the course glossed over the mathematical details, and it remained a bucket list item of mine (until now!) to flesh out the details and make sense of the underlying theory of lattice basis reduction.
