Neural networks — the technique underlying nearly all systems we describe as “artificial intelligence” today — are revolutionizing the way we communicate, get around, and express ourselves. In one of the most prominent recent examples, the GPT-3 model from OpenAI has learned to automatically generate web page layouts, search queries, and entire blog posts.
Although these new applications are unquestionably extraordinary, the neural networks that make them possible have become enormous. Consider GPT-3: it needs to run data through 175 billion individual parameters every time the model is used. Parameter counts have been increasing dramatically in recent years. GPT-3 is more than 100 times bigger than its predecessor GPT-2, which was released just a year earlier. This trend shows no signs of stopping, especially since increased size seems to unlock increased capabilities.
However, these enormous models are also enormously expensive to build and use. Creating GPT-3 was estimated to cost $4.6M for the computational power alone, and using it requires dozens of neural network chips from NVIDIA costing thousands of dollars apiece. The electricity necessary for this computation generates enormous quantities of CO2 along the way. By one estimate, merely training a simpler network for natural language processing (one that is 1,000 times smaller than GPT-3) would emit nearly a ton of CO2. That’s about the amount of CO2 emitted by flying a single person from New York to San Francisco.
In short, when it comes to neural networks, we face a dilemma. By making our models larger, we are unlocking incredible new capabilities. Yet the costs — financial and environmental — of using these models pose a severe limit on our ability to keep this up.
Researchers are working hard to develop ways that we can enjoy the benefits of neural networks while minimizing financial and environmental tradeoffs. One such approach involves asking a simple question: once the model has finished learning, are all 175 billion of those parameters really necessary? Maybe some of them aren’t doing much, and we can delete (prune) them without hurting the model. Perhaps we could make these models much smaller — and thereby reduce the cost of using them — while maintaining their full capabilities.
Researchers have been asking this question for decades, with the earliest work on pruning dating back to the late 1980s. More recent work has shown that we can often prune 90% or more of the parameters from a trained neural network without hurting its ability to succeed at the tasks for which we trained it. The takeaway from this work is that only a small fraction of a neural network’s parameters is actually necessary for it to operate, meaning seemingly massive models aren’t really so large once we prune them.
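To make the idea of pruning concrete, here is a minimal sketch in Python. It assumes one common heuristic, magnitude pruning, in which the parameters with the smallest absolute values are deleted (set to zero); the article itself does not commit to a particular criterion. The function name and the ten example weights are made up for illustration.

```python
def magnitude_prune(weights, fraction):
    """Zero out the smallest-magnitude `fraction` of the weights.

    Returns (pruned_weights, mask), where mask[i] is 1 for a kept
    parameter and 0 for a pruned one.
    """
    n_prune = round(len(weights) * fraction)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:n_prune])
    mask = [0 if i in pruned_idx else 1 for i in range(len(weights))]
    pruned = [w if m else 0.0 for w, m in zip(weights, mask)]
    return pruned, mask


# Hypothetical weights of a tiny "trained" layer.
trained = [0.02, -1.30, 0.005, 0.90, -0.01, 0.03, -0.04, 2.10, 0.001, -0.06]

# Prune 90% of the parameters: only the largest-magnitude weight survives.
pruned, mask = magnitude_prune(trained, 0.9)
print(mask)
print(pruned)
```

In a real network the same idea is applied to millions or billions of weights, and the pruned model is then evaluated to confirm that its accuracy has not dropped.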
One shortcoming of pruning is that it only reduces the cost of using a model after it is trained. In the case of GPT-3, it can’t eliminate the $4.6M bill (and the accompanying CO2 emissions) from training the model in the first place.
Still, the pruning results should give us some hope that we can also reduce the cost of training: if these parameters weren’t necessary after training, maybe they were never necessary at all. Maybe we could have pruned them before any training took place and trained a smaller model from the start. This is no simple feat, though, and for many years the research community was pessimistic that doing so would be possible.
Consider a metaphor. Imagine that you are learning an intellectually demanding topic, for example calculus. Intuitively, learning the subject for the first time takes much more work than remembering a synthesized understanding of it afterwards. Similarly, one might intuitively believe that we can prune many more parameters after a neural network has finished learning something than while it is still in the process of learning.
It therefore came as a big surprise to the research community when, in 2019, my adviser (Prof. Michael Carbin) and I showed that parameters pruned after training could have been pruned before or early in training without any effect on the network’s ability to learn successfully. In short, if we can figure out the right parameters to prune, it is possible to train much smaller networks than the ones we currently use.
We refer to this finding as the Lottery Ticket Hypothesis. To explain this metaphor, consider one additional detail about neural network training. When we create a network for the first time, we don’t yet know good values for any of the parameters. As such, we set each of the parameters to random values. The process of training the network involves gradually adjusting those initially random values until the network reaches good performance.
It turns out that the specific initial value each parameter receives is crucial for determining whether it is important. Parameters that we keep after pruning received lucky initial values; in fact, those values are so lucky that the network can train successfully with only those parameters present (and all others deleted). However, if you give each parameter new random values, the network can no longer train successfully. In other words, those parameters won the “initialization lottery,” hence the lottery ticket metaphor.
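The bookkeeping behind this comparison can be sketched in a few lines of Python. This is only an illustration under stated assumptions: "training" is replaced by a stand-in that nudges each weight by a small random amount, the pruning criterion is assumed to be weight magnitude, and all names are hypothetical. A real experiment would train both resulting subnetworks and compare their accuracy.

```python
import random


def magnitude_mask(weights, fraction):
    """Return a 0/1 mask that drops the smallest-magnitude `fraction` of weights."""
    n_prune = round(len(weights) * fraction)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:n_prune])
    return [0 if i in pruned_idx else 1 for i in range(len(weights))]


def apply_mask(weights, mask):
    """Keep the weights where mask is 1 and delete (zero) the rest."""
    return [w if m else 0.0 for w, m in zip(weights, mask)]


rng = random.Random(0)
n = 20

# Step 1: every parameter starts at a random value.
initial = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Step 2: stand-in for training (an assumption, not a real optimizer).
trained = [w + rng.gauss(0.0, 0.1) for w in initial]

# Step 3: decide which parameters to keep by looking at the trained network.
mask = magnitude_mask(trained, 0.8)

# Step 4a: the "winning ticket" keeps the surviving parameters
# at their ORIGINAL initial values.
winning_ticket = apply_mask(initial, mask)

# Step 4b: the control keeps the same parameters but gives them
# NEW random values, discarding the lucky initialization.
control = apply_mask([rng.gauss(0.0, 1.0) for _ in range(n)], mask)

print(sum(mask), "of", n, "parameters kept")
```

The two subnetworks have exactly the same structure and differ only in their starting values, which is what lets the experiment isolate the role of the initialization.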
These innovations, pruning and the Lottery Ticket Hypothesis, should make us optimistic: there are significant opportunities to reduce the cost of training and using neural networks. However, two major challenges remain before this research is ready for production.
Finding lottery tickets efficiently. Work on the Lottery Ticket Hypothesis shows that it would have been possible to prune before or early in training. However, it only does so retroactively: it prunes after training and looks at what would have happened if we had pruned those same parameters earlier. In other words, we have to train the entire network first before we can figure out what we should have pruned. For this work to be successful, we will need to develop new ways of pruning that work before or early in training. This is a problem that the research community is actively pursuing, a promising sign that we may develop such pruning strategies soon.
Accelerating pruned networks. A second major challenge is that pruning 90% of parameters doesn’t necessarily reduce the cost of training or using a neural network by 90%. The cost of running a neural network is dependent on the kind of hardware we use to run it. We structure neural networks in a way that is designed to take advantage of the strengths of modern graphics processing units (GPUs). When we prune individual parameters, the structures that we end up with are often a less-than-ideal fit for this hardware, meaning we may not fully realize these cost reductions in practice. Thankfully, researchers are hard at work on this problem as well, developing new tricks for accelerating pruned networks and designing entire chips with pruned networks in mind.
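A toy example in Python shows why deleting parameters does not automatically delete work. It is a simplified sketch, not a model of how any real GPU behaves: it counts multiplications for a small, hypothetical layer in which 90% of the weights have been pruned, once using an ordinary dense routine and once using a routine that stores and visits only the surviving weights.

```python
def dense_matvec(matrix, x):
    """Multiply matrix by vector, touching every entry (zeros included)."""
    mults = 0
    out = []
    for row in matrix:
        total = 0.0
        for w, xi in zip(row, x):
            total += w * xi
            mults += 1
        out.append(total)
    return out, mults


def to_sparse(matrix):
    """Keep only the nonzero entries of each row as (column, value) pairs."""
    return [[(j, w) for j, w in enumerate(row) if w != 0.0] for row in matrix]


def sparse_matvec(sparse_rows, x):
    """Multiply using only the stored nonzero entries."""
    mults = 0
    out = []
    for row in sparse_rows:
        total = 0.0
        for j, w in row:
            total += w * x[j]  # irregular lookup into x
            mults += 1
        out.append(total)
    return out, mults


# A hypothetical 4x5 layer with 18 of its 20 weights pruned (90%).
pruned_layer = [
    [0.0, 0.0, 1.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, -2.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
x = [1.0, 2.0, 3.0, 4.0, 5.0]

dense_out, dense_mults = dense_matvec(pruned_layer, x)
sparse_rows = to_sparse(pruned_layer)
sparse_out, sparse_mults = sparse_matvec(sparse_rows, x)

# Each surviving weight is stored together with its column index.
stored_numbers = sum(2 * len(row) for row in sparse_rows)

print(dense_mults, sparse_mults, stored_numbers)
```

Both routines produce the same output, but the dense one still performs all 20 multiplications. The savings appear only when the software and hardware are built to skip the deleted weights, and even then the sparse version must store an index alongside each value and access the input in an irregular order, which is the kind of pattern that hardware designed for dense computation tends to handle less efficiently.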
For all of their extraordinary accomplishments over the past few years, neural networks are exceedingly costly to train and use, and these costs are only growing as the networks become larger. Pruning unnecessary parameters (after training, or even beforehand) presents an opportunity for dramatically reducing those costs, and researchers are hard at work to make this promising research ready for practical use.
My research on the Lottery Ticket Hypothesis was made possible by the generous support of IBM through the MIT-IBM Watson AI Lab. We conducted the original experiments on IBM’s Deep Learning as a Service (DLaaS) infrastructure; today, we prototype on the Satori cluster that IBM donated to MIT in 2019.