This article is part of our coverage of the latest in AI research.
Amid its wave of layoffs and a tumbling stock price, Meta (Facebook) faced yet another crisis after unveiling its latest artificial intelligence project: Galactica.
Galactica is “a large language model that can store, combine and reason about scientific knowledge,” according to a paper published by Meta AI. It is a transformer model that has been trained on a carefully curated dataset of 48 million papers, textbooks and lecture notes, millions of compounds and proteins, scientific websites, encyclopedias, and more.
Galactica was supposed to help scientists navigate the vast and ever-growing body of published scientific information. Its developers presented it as a tool that could find citations, summarize academic literature, solve math problems, and perform other tasks that assist scientists in researching and writing papers.
In collaboration with Papers with Code, Meta AI open-sourced Galactica and launched a website that allowed visitors to interact with the model.
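For readers who want to see what querying the model looks like in practice, here is a minimal sketch that loads one of the open-sourced checkpoints through Hugging Face Transformers. The checkpoint name and prompt are illustrative; the smaller facebook/galactica-1.3b variant is used so the example runs on modest hardware, and the larger variants need considerably more memory.

```python
# Minimal sketch: querying an open-sourced Galactica checkpoint.
# Assumes the publicly released weights on Hugging Face (e.g., facebook/galactica-1.3b).
from transformers import AutoTokenizer, OPTForCausalLM

tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-1.3b")
model = OPTForCausalLM.from_pretrained("facebook/galactica-1.3b")

# An illustrative prompt asking the model to continue a scientific statement.
prompt = "The Transformer architecture was introduced in the paper"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Greedy decoding keeps the example deterministic; sampling settings are left at defaults.
outputs = model.generate(input_ids, max_new_tokens=60)
print(tokenizer.decode(outputs[0]))
```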
However, three days after Galactica’s release, Meta had to shut down the online demo following a deluge of criticism from scientists and the tech media about the model’s incorrect and biased output.
While Galactica was obviously not a success, I believe that its short history provides us with some useful lessons about LLMs and the future of AI research.
Large language models represent impressive advances in artificial intelligence and have even become the basis for several commercial products. Over the past couple of years, LLMs have continued to push the limits of what is possible with deep neural networks. Galactica is no exception. If you read the paper, there is a great deal to learn about curating data, designing tokenization schemes, and adjusting the model’s architecture to do more with less.
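As a rough illustration of that token design, the paper describes wrapping different modalities and behaviors in special markers so they share a single token stream with ordinary text. The snippet below sketches what such prompts look like; the prompt wording is made up for this example, and exact token usage may differ between the released checkpoints.

```python
# Illustrative prompt formats based on the special tokens described in the Galactica paper.
# The surrounding prompt text is hypothetical and only meant to show the markers in context.

# Citation prediction: [START_REF] cues the model to complete a reference.
citation_prompt = "The Transformer architecture was introduced in [START_REF]"

# Step-by-step reasoning: <work> cues the model to use its "working memory"
# and reason through a problem before producing an answer.
math_prompt = "What is the derivative of x^3 + 2x?\n\n<work>"

# Molecules are wrapped in SMILES markers so they can be mixed with natural language.
smiles_prompt = "[START_SMILES]C(C(=O)O)N[END_SMILES]\n\nWhat is the name of this molecule?"
```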
However, LLMs are also a controversial subject. When it comes to topics such as understanding, reasoning, planning, and common sense, scientists are divided about how to assess LLMs. Some dismiss LLMs as stochastic parrots while others go as far as considering them sentient.
This, unfortunately, is where I think Meta AI erred. In their paper, they used some of these contested terms, such as “reason about scientific knowledge.” And on Twitter, the model was presented in a way that created the impression that it could write its own scientific papers.
To their credit, Meta and Papers with Code explicitly state on Galactica’s website that “There are no guarantees for truthful or reliable output from language models, even large ones trained on high-quality data like Galactica.” They also acknowledge that Galactica performs best when used to generate content about well-cited concepts. And they warn that in some cases, Galactica might generate text that appears authentic but is inaccurate.
But the use of vague terms in the paper, website, and tweets was enough to overshadow those warnings and trigger a backlash by scientists and researchers who are (rightly) exhausted by the unwarranted hype surrounding large language models. (I will not go into those criticisms here because they have been comprehensively reported by tech media.)
Benchmarks are one of the thorniest problems of AI research. On the one hand, researchers need a way to evaluate and compare their models. On the other hand, some concepts are really hard to measure.
Galactica makes impressive progress on several of the benchmarks used to measure reasoning, planning, and problem-solving capabilities in AI systems. At a maximum size of 120 billion parameters, Galactica is considerably smaller than other state-of-the-art language models such as GPT-3, BLOOM, and PaLM. Yet according to Meta AI’s experiments, Galactica outperforms these SOTA models by a comfortable margin on benchmarks such as MMLU and MATH.
However, the problem with these benchmarks is that we usually view them from a human intelligence perspective. As a simplified example, take chess, which was long thought of as the ultimate challenge of AI. We consider chess a complicated intelligence challenge because, on their way to mastering it, humans must acquire a set of cognitive skills through hard work and talent. This is why we expect chess masters to be able to make smart decisions on a larger set of tasks that require long-term planning but are not directly related to chess. But from a computational perspective, you can shortcut your way to finding good chess moves through sheer computation, a good algorithm, and the right inductive biases. You don’t need any of the general intelligence skills that human chess masters have.
Scientists try their best to create benchmarks that can’t be “cheated” with computational shortcuts. But it’s a very difficult feat. Computer scientist Melanie Mitchell has thoroughly studied the shortcomings of benchmarks used to evaluate reasoning in deep learning models. And according to her findings, even some of the most carefully crafted benchmarks are prone to computational shortcuts.
What this means is that, while benchmarks are a good tool for comparing machine learning models against one another, they should not be treated as measures of humanlike cognitive skills in machines.
One of the big challenges of large language models is that they can create output that is convincingly human but not based on human cognition. Models like Galactica can be extremely powerful but also dangerously misleading.
As some researchers have pointed out, Galactica’s output can feel authentic without being grounded in real facts. This does not happen all the time, but it happens frequently enough that you should double-check the suggestions the model provides instead of accepting them blindly. The same applies not only to Galactica but to other LLMs used for reasoning and problem-solving tasks, such as source code generation.
But does this mean Galactica should be dismissed as useless for math, science, and programming? Absolutely not. In fact, there is plenty of evidence that LLMs, with all their shortcomings, can be very effective tools. Take GitHub Copilot, a programming assistant powered by OpenAI’s Codex model. Multiple studies show that Copilot makes programmers’ work more pleasant and productive.
That said, I’m a bit disappointed with the scientists and media outlets who jumped on Galactica’s failures as an opportunity to bash deep learning, large language models, and the work done by researchers at Meta. With the right interface and guardrails, a model like Galactica can be a good complement to scientific search tools like Google Scholar.
Put another way, we should look at Galactica’s initial failure as another scientific experiment. And as the history of scientific discovery has proven time and again, every failed experiment brings us one step closer to success.