Drawing photorealistic images is a major accomplishment for AI, but is it really a step towards general intelligence? Since DALL-E 2 came out, many people have hinted at that conclusion; when the system was announced, Sam Altman tweeted that “AGI is going to be wild”; for Kevin Roose at The New York Times, such systems constitute clear evidence that “We’re in a golden age of progress in artificial intelligence”. (Earlier this week, Scott Alexander seems to have taken apparent progress in these systems as evidence for progress towards general intelligence; I expressed reservations here.)
In assessing progress towards general intelligence, the critical question should be: how much do systems like DALL-E, Imagen, Midjourney, and Stable Diffusion really understand the world, such that they can reason about and act on that knowledge? When thinking about how they fit into AI, both narrow and broad, here are three questions you could ask:

1. Can they produce images of high artistic and photorealistic quality?
2. Can they reliably translate the text of a prompt into a corresponding image?
3. Do the images they produce reflect a genuine understanding of the world, of the sort that general intelligence would require?
On #1, the answer is a clear yes; only highly trained human artists could do better.
On #2, the answer is mixed. They do well on some inputs (like “astronaut rides horse”) but more poorly on others (like “horse rides astronaut”, which I discussed in an earlier post). (Below I will show some more examples of failure; there are many examples on the internet of impressive success, as well.)
Crucially, DALL-E and co’s potential contribution to general intelligence (“AGI”) ultimately rests on #3; if all the systems can do is convert sentences into images in a hit-or-miss yet spectacular way, they may revolutionize the practice of art, but still not really speak to general intelligence, or even represent progress towards it.
Until this morning, I despaired of assessing what these systems understand about the world at all.
The single clearest hint I had seen thus far that they might have trouble came from the graphic designer Irina Blok:
As my 8-year-old said, reading this draft, “how does the coffee not fall out of the cup?”
The trouble, though, with asking a system like Imagen to draw impossible things is that there is no fact of the matter about what the picture should look like, so the discussion about results cycles endlessly. Maybe the system just “wanted” to draw a surrealistic image. And for that matter, maybe a person would do the same, as Michael Bronstein pointed out.
So here is a different way to go after the same question, inspired by a chat I had yesterday with the philosopher Dave Chalmers.
What if we tried to get at what the systems knew about (a) parts and wholes, and (b) function, in a task with a clearer notion of correct performance, using prompts like “Sketch a bicycle and label the parts that roll on the ground” or “Sketch a ladder and label one of the parts you stand on”?
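For readers who want to run this kind of probe themselves, here is a minimal sketch against an open text-to-image checkpoint via Hugging Face’s diffusers library. The model ID, sampling settings, and file naming are illustrative assumptions on my part, not anything these vendors prescribe.

```python
# A probe battery for part-whole and functional understanding in
# text-to-image models, runnable against an open checkpoint via the
# `diffusers` library (pip install diffusers torch).

PROBE_PROMPTS = [
    "Sketch a bicycle and label the parts that roll on the ground",
    "Sketch a ladder and label one of the parts you stand on",
    "Sketch a person and make the parts that hold things purple",
    "Draw a white bicycle with no wheels",
    "Draw a white bicycle with green wheels",
]

def generate_probe_images(model_id="runwayml/stable-diffusion-v1-5",
                          samples_per_prompt=9):
    """Generate a batch of candidate images for each probe prompt
    and save them to disk for manual inspection."""
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(model_id)
    pipe = pipe.to("cuda" if torch.cuda.is_available() else "cpu")
    for prompt in PROBE_PROMPTS:
        result = pipe(prompt, num_images_per_prompt=samples_per_prompt)
        slug = prompt[:40].lower().replace(" ", "_")
        for i, img in enumerate(result.images):
            img.save(f"{slug}_{i:02d}.png")

# Usage (downloads model weights; requires a GPU for reasonable speed):
# generate_probe_images()
```

Since success here is graded by a human looking at the saved grids, the point of the script is only to make the probe cheap to repeat across models and prompts.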
From what I can tell, Craiyon (formerly known as DALL-E mini) is completely at sea on this sort of thing:
Might this be a problem specific to DALL-E Mini?
I found the same kinds of results with Stable Diffusion, currently the most popular text-to-image synthesizer, the crown jewel of a new company that is purportedly in the midst of raising $100 million on a billion-dollar valuation. Here, for example, is “sketch a person and make the parts that hold things purple”:
Nine more tries, and only one very marginal success (top right corner):
Here’s “sketch a white bike and make the parts that you push with your feet orange”.
“Sketch a bicycle and label the parts that roll on the ground”
Negation is, as ever, a problem. “Draw a white bicycle with no wheels”:
Even “draw a white bicycle with green wheels”, which focuses purely on part-whole relationships without function or complex syntax, is problematic:
Can we really say that a system that doesn’t understand what wheels are—or what they are for—is a major advance towards general intelligence?
Coda: While I was writing this essay, I posted a poll:
Moments later, the CEO of Stability.AI (creator of Stable Diffusion), Emad Mostaque, offered wise counsel: