See Through AI Hype with Arvind Narayanan - Initiative for Digital Public Infrastructure

Arvind Narayanan is a Princeton computer science professor who wants to make it easy for you to cut through the AI hype. In a fascinating and plain old helpful interview, Arvind runs through all the big claims made about AI today and makes them very simple to understand.

Arvind Narayanan writes the Substack AI Snake Oil with his colleague Sayash Kapoor, and they’re currently writing a book together too.

Hey everybody. Welcome back to Reimagining the Internet. I am your host, Ethan Zuckerman. I’m here today with Arvind Narayanan. He’s professor of computer science at Princeton University. He is affiliated with Princeton’s Center for Information Technology Policy. Arvind studies the societal impact of digital technologies, particularly AI. He’s done really important and influential work on de-anonymization, showing how sensitive information can sometimes be inferred from apparently innocuous data. He’s writing a textbook on algorithmic bias and fairness. He’s worked on projects to mitigate some of the harms of cryptocurrencies, all of which is hugely important work and not why we have him here today. We brought him on today because he’s been finishing up a book with Sayash Kapoor called AI Snake Oil, which is already getting a well-deserved amount of good attention. Arvind, it’s so great to have you here.

Thank you Ethan, for that wonderful introduction. It’s lovely to be here. I’ve really enjoyed a lot of the past episodes, so I’m really looking forward to this. It’s an honor. Thank you.

Oh, well thank you. So we wanted to get you here to talk about the new book, and what I’m so excited about with this new book is that it’s looking at ways that both computer scientists and people who really understand this field deeply, as well as less experienced readers, can evaluate some of the claims that researchers and particularly corporations are making about the future of AI. Why do we need this book?

We are in a time of a lot of technical advances in AI, and at the same time a lot of hype about AI. And I think the hype is coming partly from the companies for sure. Partly it’s coming from researchers themselves who might be over excited about their new developments. A lot of it is coming from journalists who are often amplifying some of the hype that’s coming out of companies. And so for the ordinary person, it’s incredibly hard to separate the real progress from AI that doesn’t work.

But it’s important to do that. The reason it’s important is that I think most people today are going to be in a position where they’ll need to evaluate an AI technology, whether it’s part of their work, if they’re considering buying some AI related software, or in their own personal lives, or even just thinking about how they interact with AI driven systems like online personalization systems. So people really need this skill, at least to be able to call bullshit on certain things, even if not to deeply understand the technology. And that’s hard to do based on the information that’s out there. Of course, you can try to go read computer science papers, but that’s not really the most efficient way to come to a high level understanding of the promise and limits of AI.

I think another thing that’s going on, besides the hype and the journalism, is a number of products that have put AI into people’s hands in areas where AI seems to be showing pretty impressive results. For me, perhaps the most straightforward version of this is language translation. For many years we had very constrained AI systems that could do things like beat Garry Kasparov at chess, but had a really hard time translating a Spanish book into English. There was this set of problems that researchers took on back in the 1950s, and there’s this wonderful paper where programmers predict that they’ll be able to translate Russian into English with 300 grammar rules and a dictionary. And of course that turns out not to be quite so simple. But now we’ve had a wave of these systems based around so-called neural nets that actually seem to have a great deal of success at some of these very human problems. They’re quite good in some cases at language translation. Many of them are quite good at object detection. They’ve gotten quite good at speech to text, which was another very difficult problem to solve.

Then there’s problems that seem easier than either of the problems that we’ve talked about, like deciding whether someone should be offered bail when they’re facing an arraignment in a court. Can you talk about the criminal justice prediction problem and why that’s been such a thorny and problematic area for AI systems?

For sure, but if you don’t mind, I want to say one quick thing about speech to text, which you brought up. The classic example that researchers used to give, back when speech to text was really crappy, for why it’s a hard problem is the phrase “recognize speech,” which to computers can sound a lot like “wreck a nice beach.” They sound very similar when you’re speaking quickly. And the wisdom back then was that to be able to tell apart these two phrases, which sound very similar, you really have to have some kind of understanding of the world, that wrecking a beach is not something that someone is going to do in a certain context, because that’s the way a person would figure out which of those two was probably meant. It turned out we got to really good speech to text systems without actually necessarily having that understanding, simply because if you have a corpus of trillions of utterances, “recognize speech” is going to be much more common than “wreck a nice beach.” And so purely based on these statistical patterns, we have gotten to systems that simulate a degree of understanding without necessarily doing it the same way that a person might.
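To make that corpus-frequency idea concrete, here is a minimal sketch, with entirely made-up counts, of how a system might choose between two acoustically similar transcripts just by asking which phrase shows up more often in a large body of text:

```python
# A minimal sketch of the idea: pick between acoustically similar transcripts
# by comparing how often each phrase appears in a large text corpus.
# The corpus counts below are invented for illustration.

hypothetical_corpus_counts = {
    "recognize speech": 1_200_000,   # assumption: far more common in real text
    "wreck a nice beach": 40,        # assumption: vanishingly rare
}

def pick_transcript(candidates, counts):
    """Return the candidate phrase that is most frequent in the corpus."""
    return max(candidates, key=lambda phrase: counts.get(phrase, 0))

print(pick_transcript(["recognize speech", "wreck a nice beach"],
                      hypothetical_corpus_counts))
# -> "recognize speech"
```

Real systems use far richer statistical language models than a raw phrase count, but the underlying bet is the same: frequency in the data stands in for understanding.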

This probably means that we trained the systems on computer scientists and not on property developers; had it been the other way around, it might have turned out differently.

Yeah, that’s a great point about what the training data is, and we should probably come back to that in this conversation. But yeah, criminal justice, that’s a completely different type of problem from any of the ones we’ve been talking about, because the way that these systems purport to work, the way that they supposedly measure criminal risk, is to train on a data set of past defendants who have then gone on to commit or not commit another crime, or appear in court or not appear in court, or whatever outcome it is that the system is trying to predict. So that’s a prediction problem. It’s a true prediction problem. You’re trying to predict a future event. So given everything we know about this defendant, are they going to do something bad in the future? The problem with that is that the future is not determined.

That’s something that in the excitement about AI, people very often tend to forget, but it’s not even so much a limitation of AI, but it’s just a limitation of the world that we live in. The future is not determined, it could go in multiple ways. Even if someone has supposedly a propensity to commit a crime, it might be a crime of passion, it might be a crime in the moment, it might be something that’s opportunistic. And so even if you can know something about that person and perhaps their circumstances and perhaps their proclivities, it’s not possible to predict the circumstances in which that person might find themselves, which might lead them to do certain things. So for all of these reasons, no matter how much data you throw at these systems, they do a little bit better than random, but pretty much their performance plateaus there.

The accuracy numbers you get with these systems (technically the measure is called AUC, for the technologists in the audience) are around 0.7, or 70%. That’s a bit better than random, which is 50%, but it’s nowhere near what you might expect from a speech recognition system, for example, which gets most of the words or most of the sentences correct.

And I think that 70% number is not really going to go up because it’s not a limitation of how clever the algorithms are. It’s not a limitation of how much data we have, and it’s certainly not a limitation of how much computing power we have, because the algorithms that are getting to 70% are really incredibly simple two variable linear regression models that look at the person’s age and the number of prior arrests and things like that. And that pretty much extracts all of the signal there is in the data.

These models are basically saying the more times someone’s been arrested in the past, the more likely they are to be arrested in the future. That’s really all you can glean from the data, the rest of it is just randomness, it’s unpredictable.
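As a rough illustration of what a model like that looks like, and of how AUC is measured, here is a toy sketch on synthetic data. It is not any real risk-assessment tool; the features, the coefficients, and the noise level are all invented to mimic a setting where the outcome is mostly unpredictable:

```python
# A toy sketch (not any real risk tool): a two-variable model of the kind
# described above, fit on synthetic data, evaluated with AUC.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 10_000

# Synthetic features: age and number of prior arrests (both made up).
age = rng.integers(18, 70, size=n)
priors = rng.poisson(1.5, size=n)

# Synthetic outcome: weakly related to the features plus a lot of noise,
# mimicking a world where the future is mostly not determined by the data.
logits = 0.25 * priors - 0.02 * (age - 18) - 0.5
p = 1 / (1 + np.exp(-logits))
rearrested = rng.random(n) < p

X = np.column_stack([age, priors])
model = LogisticRegression().fit(X, rearrested)

scores = model.predict_proba(X)[:, 1]
print("AUC:", round(roc_auc_score(rearrested, scores), 2))
# With this much irreducible noise, the AUC lands well below 1.0 no matter
# how the model is tuned, which is the plateau described above.
```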

So [inaudible 00:09:25] that being the case, if we assume that’s the case, and some people will debate the claim I’m making, they’ll say that if you have bigger and better data sets you can do a better job of prediction. Maybe, but if we assume that for a second, that leaves us with a very difficult moral question. It might be true that even a 70% predictor is a big improvement from the perspective of a certain decision maker in the criminal justice system, compared to a 50%, random system, because it means that if you detain the people who are considered highly risky, then the thought is that you might be able to improve public safety without increasing jail populations.

That’s a good thought, but I think there are incredibly difficult normative questions there. Imagine the perspective of a defendant who has every intention of appearing at their appointed court date, and has no intention of being a flight risk, but because of the behavior of people like them in the past, they’re now considered high risk for failure to appear. That’s not a prediction problem. That’s a moral question that I don’t think we should be delegating to computers. And so that’s where the gap comes in. So part of the problem is that these are highly imperfect predictors. Part of the problem is, is prediction the right paradigm for making these highly consequential decisions?

As it happens, I interviewed Julia Angwin last night, a person who did some of the really groundbreaking reporting on one of these systems, COMPAS, which was being used to determine risk in Florida, and she was able to demonstrate that these algorithms failed rather badly in some contexts. The COMPAS work came up because one of the students in the audience asked how Julia picked problems, and she said she looked for human impact, tried to figure out how many humans were impacted and how severely they were impacted, and concluded that there was nothing more impactful than something like a criminal risk assessment. I’ve written about some of this COMPAS work and argued that prediction, or at least fair prediction, may not be possible because we’re dealing with an unfair justice system. We know that people of color are more likely to be rearrested than white people. And one of the ways to de-bias the system would be: step one, de-bias the justice system; step two, collect data; step three, run the algorithms. You seem to be suggesting something else as well, which is that even in a de-biased justice system, we have free will, and that there may be a limit to predictability. Is that an implication to take away from what you’re saying? That maybe whether or not someone reoffends just isn’t ever predictable at more than 70% accuracy?

Or some number. I’m not committed to that 70% number, but I think that number is always going to remain low enough that the moral question is going to be the predominant one, not the technical question. And yes, exactly. I think Julia’s bias work is amazing, super important. At the same time, I’m trying to push the conversation beyond the bias question, and I think in some domains that’s particularly important. When you look at these algorithms used for hiring, to be honest, I think those are pure random number generators. Those are not even 70% algorithms, those are 50% algorithms. Unfortunately, not one of them has ever had peer reviewed evidence of effectiveness. So I can only allege this, I can’t say this for sure. I think the burden of proof should be on the companies and that proof has not been forthcoming. So these are algorithms that might screen resumes or even more problematically look at the video of a person speaking, and supposedly based on their body language and speech patterns and other things, try to predict their fit for a job or are they going to be a good employee and things like that.

So as you might imagine, bias concerns have come up here, and the companies have responded by saying, look, we’ve made technical tweaks to these systems and we’ve de-biased them. I believe them, because you can easily de-bias a random number generator. And I think that is what’s going on here. And de-biasing does not in any way address the more fundamental concern that making job offers based on reading tea leaves is demeaning to the candidates, and puts them in a position where they have no idea what to improve if they were denied one time and want to have a better prospect the next time. And many, many other reasons. But all of those go back to the fact that these systems aren’t really good at predicting, and all of the normative concerns that come out of that, even more so than the bias concerns.
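The quip about de-biasing a random number generator can be made concrete with a small sketch. The "screener" below is a purely hypothetical stand-in, not any vendor's product: it ignores the candidate entirely, so in expectation its selection rates are already equal across groups, even though it carries no predictive signal at all.

```python
# A sketch of the quip above: a "screening model" that is pure chance is,
# in expectation, already demographically balanced. De-biasing it is trivial
# because there is no signal in it to be biased in the first place.
import random

random.seed(0)

def random_screener(candidate):
    """Ignore the candidate entirely and flip a weighted coin."""
    return random.random() < 0.3   # "advance" 30% of applicants, arbitrarily

candidates = [{"group": "A"} for _ in range(5000)] + \
             [{"group": "B"} for _ in range(5000)]

rates = {}
for group in ("A", "B"):
    members = [c for c in candidates if c["group"] == group]
    rates[group] = sum(random_screener(c) for c in members) / len(members)

print(rates)   # selection rates for A and B come out nearly identical
```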

And is there a computational reason why prediction is so difficult? If I had a massive data set of every resume that I looked at for Google, everyone who I hired, everyone who I didn’t hire, I had perfect information on what they ended up doing at other companies, I had perfect information on what they ended up doing at Google. Could I then build a system with a better chance of prediction? Or is prediction always going to top out at some number, based on the fact that we’re humans and we’re inherently unpredictable?

I think that’s right. If you have bigger and more data, you could push the accuracy up a little bit, but it’s still going to top out at a level that’s going to be pretty unsatisfying. Let me give you one reason for this. You brought up Google. Google actually did a great study probably about a decade ago. I’ll try to pull the link up after this conversation. They looked at the factors behind success in the workplace, and one of the things they found is that the manager has a huge impact on someone’s success. Based on our experience, that very much checks out, not really a surprise. But what it tells us is that if you’re doing this prediction based on someone’s resume or CV or whatever, you don’t know which team they’re going to be assigned to, whether their manager is going to be creating a toxic work environment, whether they’re going to have difficulties at home that affect their job performance, and all of those other factors that have not yet been determined but ultimately are going to be so critical to that person’s success, all of these things that are not in their CV in any manner whatsoever.

So I think those are the things that ultimately cause the predictive abilities to top out or flat line.

Let’s talk about a place where there’s enormous real world complexity, but where we’ve got to get much better than 70 or 80%, which is self-driving cars. I think it was either you or Sayash who made the point that getting a car to handle 90% of situations is not 90% of the effort; it might actually be closer to 10% of the effort, and those 10% of edge cases might require 90% of the time. Do we need to set up scenarios where a camel crosses the road, a moose crosses the road, a circus parade crosses the road, for the Tesla to figure out how to stop? Or, for that possibility of mapping novel objects into the car’s worldview, is there a different approach to the problem that we’re going to need to solve these edge cases, in cases where you can’t get it wrong? One of the other observations that you’ve made recently in some of the parts of the book that you’ve been sharing on Substack is that a huge amount of AI work has been on problems where it doesn’t really matter if you get it wrong. If you’re predicting what ad I’m more likely to click on, it might be the difference between the 0.01% chance that I click it and the 0.02% chance that I click it.

Most ads are mis-targeted, it doesn’t really matter. Whether or not my Tesla hits a camel or a moose matters a great deal.

Speaking of the difficulty of prediction, I can’t really predict which approach is going to be pursued by the industry and what will succeed. But I can say that it’s not necessarily true that we need to put a camel on the road to be able to teach a car what to do if it does encounter a camel on the road. There are other approaches. I can’t speak to how commercially feasible they are, but in terms of technical limitations, I think other approaches are possible. Let me give you an analogy. There is something called zero-shot learning, and it’s a really cool technique. You have these image captioning models which, let’s say, have never been shown a picture of a zebra, and yet the first time that they ever see a zebra, they’re going to output the label “zebra.” How is that possible?

That seems like a logical impossibility. How is that possible? It turns out they have been trained both on image-to-text data sets as well as on a text-only corpus. The image data set might not contain any zebras, but the text corpus contains some sort of description of the world, which includes the fact that a zebra looks like a horse but has stripes. And so based on that combination of associations, using things called image embeddings and text embeddings, the models are good enough to say: this looks like a horse, this looks like images of a horse that I’ve seen before, but it has some patterns, and based on other images that I’ve seen, I think these patterns are called stripes. And my text corpus tells me that animals that look like a horse but have stripes are probably zebras. I mean, I’ve explained it [inaudible 00:19:58] sequence of logical reasoning steps, but really for the model it’s more like linear algebra. But at the end of the day, it’s able to output “zebra.”
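Here is a schematic sketch of that zero-shot idea. The embed_image and embed_text functions are hypothetical placeholders for a jointly trained image/text encoder; the point is only that the label is chosen by comparing embeddings, not by having seen labeled zebra photos.

```python
# A schematic sketch of zero-shot labeling with shared embeddings.
# embed_image and embed_text stand in for a real jointly trained
# image/text encoder; they are hypothetical placeholders here.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def zero_shot_label(image, candidate_labels, embed_image, embed_text):
    """Pick the label whose text embedding is closest to the image embedding."""
    image_vec = embed_image(image)
    label_vecs = {label: embed_text(f"a photo of a {label}")
                  for label in candidate_labels}
    return max(label_vecs,
               key=lambda label: cosine_similarity(image_vec, label_vecs[label]))

# Usage (assuming some encoders exist):
#   zero_shot_label(photo, ["horse", "zebra", "camel"], embed_image, embed_text)
# Even if no zebra ever appeared in the image training data, the text side's
# knowledge that "zebra" relates to horses and stripes can carry the label over.
```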

That question of extrapolation and embeddings, it takes us over to the hot topic of the moment, which is image generation. And for me, this feels like another one of those, wow, this technology is doing things that we did not expect it would be doing anytime soon. Whether it’s Dall-E, whether it’s Stable Diffusion, there are a number of tools out there that can take a text prompt of an image that probably has not existed in the world before and without a great deal of time, turn out some plausible and often quite striking images. What should we take seriously about this and what should we not overvalue about this? I think a lot of journalists are looking at this right now and saying, the game just changed radically. We have to rethink what we know about AI. What’s the caution on that? Or is this a moment where we should be reevaluating what’s possible with these systems?

So certain things are clear. This is a genuine set of interesting technical advances. I think these text to image models are producing interesting outputs, and they’re probably also going to have some economically useful applications. There is none that is super clear yet. I don’t think they’re quite yet going to take over the jobs of artists. But certain things, stock photos could be … instead of using stock photos, you could use images generated through these models. And I’ve already talked to people who have done that, not because they were experimenting with these models, but that’s because that was the best way they had for generating or finding the image that they wanted. So that’s clearly an economically useful output. Is art going to be one of those economically useful outputs? Unclear. One reason that it might not be, even if these models are very technically capable, is that the reason we value art is because it’s an expression of human creativity.

It’s not obvious that people will value automatically generated art the same way. Just like in chess: the fact that these new engines have less than a one in a million chance of losing even to a grandmaster doesn’t mean that we have devalued human chess playing ability. So that’s one possible way that these models could have an impact on art. They will enrich the art world, perhaps, just like chess engines have enriched the chess world, as opposed to putting artists out of work. Again, hard to predict, I’m just throwing out scenarios. And moving beyond those, there are certainly applications in entertainment that are plausible. I’m going to say something that sounds pretty out there, which is very different from my usual shtick, but I’m saying it just to be provocative, just to say it’s hard to predict, just to say that a lot of things are possible: all these kind of metaverse wet dreams.

The idea that, for example, someone could say, Ethan riding a horse in a forest on the moon, and the model will in real time generate a video of that, and you could be in the metaverse and you could be interacting with this character that you have just created. The reason I’m throwing out this out-there scenario is that I don’t think there are any fundamental technical barriers to creating something like that. Unlike, again, criminal justice, where there are inherent limitations to predicting the future. So the limitations on this kind of metaverse interactive video scenario are simply computational limits, and do we have enough training data, and that sort of thing. And I can very much imagine that in a few years’ time, these models, which already have hundreds of billions, perhaps trillions of parameters, are going to get even bigger. And the data sets they’re trained on are going to continue to get bigger.

The data set, for instance, that Stable Diffusion was trained on is minuscule. When I say minuscule, I mean that it only has on the order of a billion images, whereas YouTube has on the order of a billion videos. So if a video is worth, say, roughly a thousand images, then just with the data that is currently available, there is at least a factor of a thousand by which you can push the size of the training data set. And of course the availability of data continues to increase at an exponential rate. And similarly, the compute that’s being thrown at these models, even though in absolute terms it is huge, in relative terms compared to the data centers that big tech companies have for running their core products, again, is minuscule. So again, there are several orders of magnitude of improvement to go.
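The back-of-the-envelope arithmetic behind that factor of a thousand, using the rough orders of magnitude from the conversation rather than exact counts, looks like this:

```python
# Back-of-the-envelope arithmetic for the scale argument above.
# All figures are rough orders of magnitude, as in the conversation.
stable_diffusion_images = 1e9   # ~a billion training images
youtube_videos = 1e9            # ~a billion videos
images_per_video = 1e3          # assume a video is worth ~a thousand images

potential_images = youtube_videos * images_per_video
headroom = potential_images / stable_diffusion_images
print(f"Roughly a {headroom:,.0f}x larger pool of training data")  # ~1,000x
```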

And so just by pushing these computational abilities, we might be able to do qualitatively new things, like real time video interaction through a headset-like scenario. So all of those things are possible, and if those things are possible, it has huge implications for entertainment. What I’m less sure of is whether any of this translates to other things, like looking at medical images and doing medical diagnoses, or all of those other socially useful and economically important things. It’s much less clear that these large models are going to have a big impact or a transformative impact there.

So let’s go to that question of medical imaging, because this is often considered one of the great AI success stories. There’s a paper from a few years back that looks at the ability to algorithmically detect tumors from imagery of the lungs. And the headline associated with the paper was that this algorithm was significantly more accurate than the most accurate humans. And in a funny way, what this system was sort of able to do was almost look into the future. It was able to train not just on the data of someone’s lung on day one; it could then look at data 90 days into the future, 120 days into the future, and extrapolate that what might look like simply a missed pixel was actually a tumor. Some commentators looked at this and said, stop training radiologists, we won’t need human radiologists anymore, this system’s already better than the best human radiologist. I will note that no one has stopped training radiologists. Why is that? And how did the people projecting that get it so wrong? That feels like a place, image recognition, where these systems actually have been making great leaps and bounds.

It is true. Again, there have been impressive technical improvements in this area. I don’t know how much better they are than human physicians, but even if they are equally accurate and can do things a lot faster, that is still pretty interesting and useful. But you know what? In all the decades past, whenever a really important medical technology came into the field, like an MRI machine or a CAT scan or whatever, no one said we should stop training doctors. Somehow in all those cases, we understood that the tool is something that doctors can certainly use. It’s going to help them do their jobs better, and perhaps more efficiently, more accurately, more productively, all that good stuff. But it’s only a small part of the universe of things that a doctor does, which includes looking at the entire context of the medical profile that the patient presents. I’m sure I’m not using the right medical terms, but we all have an intuition for the fact that these tools are helpful, but they’re not going to replace doctors.

But somehow, perhaps because of this flawed analogy of AI to robots and having agency and being similar to human intelligence, all that, we seem to forget when it comes to AI that it’s just a tool, and instead we think about human labor replacement instead of augmentation and improvement in productivity. I don’t really have much more to say about that. I have this kind of snarky way of putting it. It’s like if the inventor of the typewriter had said that it’s going to make writers or poets or journalists obsolete, not recognizing that the work that they do is more than the act of putting words on paper, it’s more than the externally visible activity. I think some of the confusion around AI replacing various human experts is perhaps along similar lines. I mean that might be a little bit of an unfair example. It is true that what some of these automated systems are able to do is pretty complex and relatively central to the work that a physician does, but by no means the complete set of things the physician does.

And I think that’s why the story here is not, should we stop training doctors? But instead, okay, these systems are accurate enough in the lab, how do we make sure they work actually well in the field? Are there biases? Are they going to stop working if you go to a different country based on a new population? Those things are all still being worked out. And I think those are important questions. We should continue to work on those questions. I think we’ll make progress on those questions. And at the end of the day, we will hopefully have made doctors more productive.

One of the things that comes to mind when people put up this scenario of the machine replacing the human, and you’ve made the point that the much more likely scenario is that the machine is going to enhance the human, is the question of trust. If we are relying on an algorithm to make decisions that are going to lead someone, perhaps, to chemotherapy, we really need to trust that the algorithm is accurate. We need to trust that the algorithm doesn’t have inbuilt biases. We have all sorts of situations where we’ve seen systems that work well for one group of people but don’t work well for another group of people. If I’m a person of color, or if, in my case, I’m obese, then maybe my data is quite different from the training data, and maybe I have reason not to trust that algorithm. As we get more involved with AI making these sorts of real life and death decisions, how do we deal with these questions of trust? How do we deal with auditability and testability of these systems? How do we know that they can do what their creators say they do? And maybe more complicated than that, how do we know that they will work for us as well as their creators say they will?

Yeah, that’s a great question. I don’t have any perfect answers, but there are a lot of historical precedents to look to. Again, medical devices: AI is not the first technology to come into the medical domain. There are important medical devices that are also involved in life and death decisions. And if these machines malfunction, it could lead to loss of human life, and that has happened sometimes. And we’ve managed to put guardrails around those things. There are rigorous testing regimes. That’s one of the reasons the FDA exists, and you have to get your machine approved before you bring it to market. So that’s a very different mindset from the one AI developers currently inhabit, the kind of move fast and break things mindset. And that of course needs to change, as and when we start putting AI into more consequential applications.

There has to be more upfront testing. And I think there has to be much more openness to third party auditors. Right now that is very much a work in progress. Companies have many concerns, some legitimate, some illegitimate, I would say. Things like, oh, the system is proprietary, we can’t share the details, or, oh, we can’t let researchers audit this because that would violate the privacy of the people in the testing data. So I would say there are varying degrees of credibility in those objections that they raise, but whatever the objections are, we’ll have to figure out ways to put guardrails around them and enable a really robust third party auditing regime. And auditability needs to be required by law; I think otherwise there aren’t enough incentives to build that into your system. So those are some examples of guardrails. I’m sure there are others, but I think if there is enough political will, it can be figured out.

Assume for the moment that this new book sells a bajillion copies, that you find yourself being interviewed by much smarter, much more prominent people than I, and that you get the chance to share this message far and wide. What are the biggest hopes you have for changing people’s thinking about AI, and particularly about AI hype? If you succeed beyond your wildest dreams, what does this book and your advocacy around this do?

I’d say [inaudible 00:33:59] and I would hope for a few things. One is to give people a set of mental tools that they can use to understand these emerging technologies and not necessarily be so wowed by them, even while acknowledging that certain things are genuine advances and represent the work of really smart and, more importantly, really hardworking people who built these things. But that doesn’t mean that we should treat it like magic. There’s a difference between the two, and we hope to give people tools to understand that, and to be able to push back against some of the morally objectionable uses of these tools. I think it’s super critical that that conversation, again, expand beyond the bias issue and start talking about what the dangers are when we start using these tools for problems that are really moral questions, not prediction questions. What is lost when we translate them into a prediction question? And what is further lost when the prediction is quite inaccurate? So those are the kinds of questions I think people need to ask and answer. Those are some messages that I’d like to give people in general, but then specifically for journalists, specifically for policy makers, and specifically for some of the technologists building these tools, we hope to have more actionable guidance on some of the ways to make all this better in the future.

It also sounds like we should probably short Tesla and go long on Meta, particularly if they decide to build a metaverse where you can also ride a horse through a forest on the moon. Because that sounds great.

No predictions. But yeah, perhaps.

Arvind Narayanan, thank you so much for being with us. This was really a pleasure and I’m so looking forward to the book being out there.

Wonderful. Thank you again for the opportunity to talk about all this with you.
