Mathematicians love a good puzzle. Even something as abstract as multiplying matrices (two-dimensional tables of numbers) can feel like a game when you try to find the most efficient way to do it. It’s a little like trying to solve a Rubik’s Cube in as few moves as possible — challenging, but alluring. Except that for a Rubik’s Cube, the number of possible moves at each step is 18; for matrix multiplication, even in relatively simple cases, every step can present more than 10^12 options.
Over the past 50 years, researchers have approached this problem in many ways, all based on computer searches aided by human intuition. Last month, a team at the artificial intelligence company DeepMind showed how to tackle the problem from a new direction, reporting in a paper in Nature that they’d successfully trained a neural network to discover new fast algorithms for matrix multiplication. It was as if the AI had found an unknown strategy for solving a monstrously complex Rubik’s Cube.
“It’s a very neat result,” said Josh Alman, a computer scientist at Columbia University. But he and other matrix multiplication specialists also emphasized that such AI assistance will complement rather than replace existing methods — at least in the near term. “It’s like a proof of concept for something that could become a breakthrough,” Alman said. The result will simply help researchers on their quest.
As if to prove the point, three days after the Nature paper came out, a pair of Austrian researchers illustrated how new and old methods might complement each other. They used a conventional computer-aided search to further improve one of the algorithms that the neural network had discovered.
The results suggest that, like the process of solving a Rubik’s Cube, the path to better algorithms will be full of twists and turns.
Matrix multiplication is one of the most fundamental and ubiquitous operations in all of mathematics. To multiply a pair of n-by-n matrices, each with n^2 elements, you multiply and add these elements together in particular combinations to generate the product, a third n-by-n matrix. The standard recipe for multiplying two n-by-n matrices requires n^3 multiplication operations, so multiplying a pair of 2-by-2 matrices, for example, takes eight multiplications.
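The standard recipe is just a triple loop. This small Python sketch (an illustration of the textbook method, not code from the paper) makes the n^3 count of scalar multiplications explicit:

```python
def matmul(A, B):
    """Standard n-by-n matrix multiplication: n**3 scalar multiplications."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]  # one scalar multiplication
    return C

# Multiplying a pair of 2-by-2 matrices uses 2**3 = 8 multiplications.
print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```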
For larger matrices, with thousands of rows and columns, this process quickly becomes cumbersome. But in 1969, the mathematician Volker Strassen discovered a procedure for multiplying a pair of 2-by-2 matrices using seven rather than eight multiplication steps, at the cost of introducing more addition steps.
Strassen’s algorithm is needlessly convoluted if you just want to multiply a pair of 2-by-2 matrices. But he realized it would also work for bigger matrices. That’s because the elements of a matrix can themselves be matrices. For example, a matrix with 20,000 rows and 20,000 columns can be reimagined as a 2-by-2 matrix whose four elements are each 10,000-by-10,000 matrices. Each of these matrices can then be subdivided into four 5,000-by-5,000 blocks, and so on. Strassen could apply his method to multiply 2-by-2 matrices at each level of this nested hierarchy. As the matrix size increases, the savings from fewer multiplications grows.
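Strassen’s seven products for the 2-by-2 case can be written out directly; the sketch below uses the standard published formulas. Since the scheme only adds, subtracts and multiplies the four entries, the same seven-product pattern is what gets applied recursively when those entries are themselves matrix blocks:

```python
def strassen_2x2(A, B):
    """Multiply two 2-by-2 matrices with Strassen's 7 products
    (instead of the standard 8), at the cost of extra additions."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    return ((m1 + m4 - m5 + m7, m3 + m5),
            (m2 + m4, m1 - m2 + m3 + m6))

print(strassen_2x2(((1, 2), (3, 4)), ((5, 6), (7, 8))))
# ((19, 22), (43, 50)) — same product, one fewer multiplication
```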
Strassen’s discovery led to a search for efficient algorithms for matrix multiplication, which has since inspired two distinct subfields. One focuses on a question of principle: If you imagine multiplying two n-by-n matrices and let n run toward infinity, how does the number of multiplication steps in the fastest possible algorithm scale with n? The current record for the best scaling, n^2.3728596, belongs to Alman and Virginia Vassilevska Williams, a computer scientist at the Massachusetts Institute of Technology. (A recent unpublished preprint reported a tiny improvement using a new technique.) But these algorithms are of purely theoretical interest, winning out over methods like Strassen’s only for absurdly large matrices.
The second subfield thinks on a smaller scale. Soon after Strassen’s work, the Israeli American computer scientist Shmuel Winograd showed that Strassen had reached a theoretical limit: It’s not possible to multiply 2-by-2 matrices with fewer than seven multiplication steps. But for all other matrix sizes, the minimum number of required multiplications remains an open question. And fast algorithms for small matrices could have an outsize impact, since repeated iterations of such an algorithm might beat Strassen’s algorithm when reasonably sized matrices are being multiplied.
Unfortunately, the sheer number of possibilities is huge. Even for 3-by-3 matrices, “the number of possible algorithms exceeds the number of atoms in the universe,” said Alhussein Fawzi, a DeepMind researcher and one of the leaders of the new work.
Faced with this dizzying menu of options, researchers have made progress by transforming matrix multiplication into what seems like a totally different math problem — one that is easier for computers to handle. It’s possible to represent the abstract task of multiplying two matrices as a specific kind of mathematical object: a three-dimensional array of numbers called a tensor. Researchers can then break this tensor up into a sum of elementary components, called “rank-1” tensors; each of these will represent a different step in the corresponding matrix multiplication algorithm. That means that finding an efficient multiplication algorithm amounts to minimizing the number of terms in a tensor decomposition — the fewer the terms, the fewer the steps involved.
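To make the tensor picture concrete, here is a small sketch using NumPy (the row-major flattening of matrix entries is just one convenient convention, not anything prescribed by the paper). It builds the 2-by-2 multiplication tensor and checks that the standard eight-multiplication algorithm corresponds to a decomposition into eight rank-1 terms:

```python
import numpy as np

n = 2
# The 2x2 matrix multiplication tensor T: T[p, q, r] = 1 exactly when
# entry p of A times entry q of B contributes to entry r of the product C
# (entries flattened row-major, so entry (i, j) becomes index n*i + j).
T = np.zeros((n*n, n*n, n*n), dtype=int)
for i in range(n):
    for j in range(n):
        for k in range(n):
            T[n*i + k, n*k + j, n*i + j] = 1

# The standard algorithm is a sum of 8 rank-1 tensors u ⊗ v ⊗ w,
# one per scalar multiplication a[i,k] * b[k,j].
I = np.eye(n*n, dtype=int)
terms = []
for i in range(n):
    for j in range(n):
        for k in range(n):
            u, v, w = I[n*i + k], I[n*k + j], I[n*i + j]
            terms.append(np.einsum('p,q,r->pqr', u, v, w))  # outer product

print(len(terms))                     # 8 rank-1 terms = 8 multiplications
print(np.array_equal(sum(terms), T))  # True: a valid decomposition
```

A decomposition with fewer terms, like Strassen’s seven, would sum to the same tensor T with one fewer rank-1 piece.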
In this way, researchers have discovered new algorithms that multiply n-by-n matrices using fewer than the standard n^3 multiplication steps for many small matrix sizes. But algorithms that outperform not only the standard but also Strassen’s algorithm for small matrices have remained out of reach — until now.
The DeepMind team approached the problem by turning tensor decomposition into a single-player game. They started with a deep learning algorithm descended from AlphaGo — another DeepMind AI that in 2016 learned to play the board game Go well enough to beat the top human players.
All deep learning algorithms are built around neural networks: webs of artificial neurons sorted into layers, with connections that can vary in strength representing how much each neuron influences those in the next layer. The strength of these connections is tweaked over many iterations of a training procedure, during which the neural network learns to transform each input it receives into an output that helps the algorithm accomplish its overall goal.
In DeepMind’s new algorithm, dubbed AlphaTensor, the inputs represent steps along the way to a valid matrix multiplication scheme. The first input to the neural network is the original matrix multiplication tensor, and its output is the rank-1 tensor that AlphaTensor has chosen for its first move. The algorithm subtracts this rank-1 tensor from the initial input, yielding an updated tensor that is fed back into the network as a new input. The process repeats until eventually every element in the starting tensor has been reduced to zero, meaning there are no more rank-1 tensors to take out.
At that point, the neural network has discovered a valid tensor decomposition, since it’s mathematically guaranteed that the sum of all the rank-1 tensors is exactly equal to the starting tensor. And the steps it took to get there can be translated back into steps of the corresponding matrix multiplication algorithm.
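The subtract-until-zero loop can be sketched in toy form. Here the “moves” are fixed in advance — the eight rank-1 terms of the standard 2-by-2 algorithm — rather than chosen by a trained network, so this illustrates only the mechanics of the game, not AlphaTensor itself:

```python
import numpy as np

def basis(i, size=4):
    """Unit vector with a 1 in position i."""
    e = np.zeros(size, dtype=int)
    e[i] = 1
    return e

# The 2x2 matrix multiplication tensor, entries flattened row-major.
T = np.zeros((4, 4, 4), dtype=int)
for i in range(2):
    for j in range(2):
        for k in range(2):
            T[2*i + k, 2*k + j, 2*i + j] = 1

# Each move subtracts one rank-1 tensor from the residual. The game
# ends when the residual is all zeros; fewer moves = a faster algorithm.
residual = T.copy()
moves = 0
for i in range(2):
    for j in range(2):
        for k in range(2):
            u, v, w = basis(2*i + k), basis(2*k + j), basis(2*i + j)
            residual = residual - np.einsum('p,q,r->pqr', u, v, w)
            moves += 1

print(moves)               # 8 moves with these naive choices
print(not residual.any())  # True: the tensor has been reduced to zero
```

A cleverer sequence of moves — Strassen’s, or one found by AlphaTensor — reaches the all-zeros state in fewer subtractions.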
Here’s the game: AlphaTensor repeatedly decomposes a tensor into a set of rank-1 components. Each time, AlphaTensor gets rewarded if it finds a way to reduce the number of steps. But shortcuts to victory are not at all intuitive, just as you sometimes must scramble a perfectly ordered face on a Rubik’s Cube before you can solve the whole thing.
The team now had an algorithm that could, theoretically, solve their problem. They just had to train it first.
Like all neural networks, AlphaTensor needs a lot of data to train on, but tensor decomposition is a notoriously hard problem. There were few examples of efficient decompositions that the researchers could feed the network. Instead, they helped the algorithm get started by training it on the much easier inverse problem: adding up a bunch of randomly generated rank-1 tensors.
“They’re using the easy problem to produce more data for the hard problem,” said Michael Littman, a computer scientist at Brown University. Combining this backward training procedure with reinforcement learning, wherein AlphaTensor generated its own training data as it blundered around looking for efficient decompositions, worked much better than either training method on its own.
The DeepMind team trained AlphaTensor to decompose tensors representing the multiplication of matrices up to 12-by-12. It sought fast algorithms for multiplying matrices of ordinary real numbers and also algorithms specific to a more constrained setting known as modulo 2 arithmetic. (This is math based on only two numbers, so matrix elements can only be 0 or 1, and 1 + 1 = 0.) Researchers often start with this more restricted but still vast space, in hopes that algorithms discovered here can be adapted to work on matrices of real numbers.
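Modulo 2 matrix multiplication is easy to state in code. This minimal sketch (an illustration of the arithmetic, not code from the paper) shows the 1 + 1 = 0 behavior:

```python
def matmul_mod2(A, B):
    """Multiply matrices in modulo 2 arithmetic: entries are 0 or 1,
    and addition wraps around, so 1 + 1 = 0 (the same as XOR)."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) % 2
             for j in range(n)] for i in range(n)]

# In ordinary arithmetic the top-left entry would be 1*1 + 1*1 = 2;
# modulo 2 it wraps to 0.
print(matmul_mod2([[1, 1], [0, 1]], [[1, 0], [1, 1]]))  # [[0, 1], [1, 1]]
```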
After training, AlphaTensor rediscovered Strassen’s algorithm within minutes. It then discovered up to thousands of new fast algorithms for each matrix size. These were different from the best-known algorithms but had the same number of multiplication steps.
In a few cases, AlphaTensor even beat existing records. Its most surprising discoveries happened in modulo 2 arithmetic, where it found a new algorithm for multiplying 4-by-4 matrices in 47 multiplication steps, an improvement over the 49 steps required for two iterations of Strassen’s algorithm. It also beat the best-known algorithm for 5-by-5 modulo 2 matrices, reducing the number of required multiplications from the previous record of 98 to 96. (But this new record still lags behind the 91 steps that would be required to beat Strassen’s algorithm using 5-by-5 matrices.)
The new high-profile result created a lot of excitement, with some researchers heaping praise on the AI-based improvement on the status quo. But not everyone in the matrix multiplication community was so impressed. “I felt like it was a little overhyped,” said Vassilevska Williams. “It’s another tool. It’s not like, ‘Oh, the computers beat the humans,’ you know?”
Researchers also emphasized that immediate applications of the record-breaking 4-by-4 algorithm would be limited: Not only is it valid only in modulo 2 arithmetic, but in real life there are important considerations besides speed.
Fawzi agreed that this is really just the beginning. “There is a lot of room for improvement and research, and that’s a good thing,” he said.
AlphaTensor’s greatest strength relative to well-established computer search methods is also its greatest weakness: It’s not constrained by human intuition about what good algorithms look like, so it can’t explain its choices. That makes it difficult for researchers to learn from its achievements.
But this may not be as big a drawback as it seems. A few days after the AlphaTensor result, the mathematician Manuel Kauers and his graduate student Jakob Moosbauer, both of Johannes Kepler University Linz in Austria, reported another step forward.