We must be ready for the consequences of the conversations breaking out in the media. Everyone working in data science will feel the changes driven by increased attention and public curiosity. We have caught people’s imaginations, and more people are thinking about what AI makes possible.
The visual nature of our latest diffusion models is captivating. Giving people in and outside the research community access to the model implementations is an intelligent business strategy. In Harpreet Sahota’s Artists of Data Science Office Hours yesterday, someone joked about “Vin Vashishta in a Fast and Furious Movie Poster.” This was what DALL-E created with that prompt.
It’s not perfect, but it’s not far off. Oddly enough, I looked a lot like that when I graduated from high school. It is undeniably compelling to play around with that search bar and see what the model returns. AI headlines have been appearing and disappearing for decades, but this has stickiness beyond the business and academic worlds.
John Oliver featured some very odd synthetic images (NSFW language) people created of him. That segment was amusing and is an example of how these technologies get on the general public’s radar. Then human nature takes over, like in my movie poster above.
Most discoveries are relevant for a week or less, but look at how long DALL-E 2 has been discussed. These headlines are different, and we must understand what that means for our field. It’s all fun and games until the weaker parts of human nature take over. We need to talk about that.
This article in the NY Times is an excellent example of what happens when people go from fascination to concern. The author makes 3 self-serving recommendations.
The jockeying has begun for control over the narrative. In the attention economy, grabbing a large share of a popular topic’s audience is profitable. This journalist will soon be followed by thousands of others who see dollar signs and are willing to do what it takes to get them.
Data scientists and researchers will quickly lose control of the narrative because journalists, influencers, and businesses are much more effective at shaping it. All 3 have agendas and will present AI through the lens of their larger narratives.
My Monday post was the setup for a much deeper conversation we must have about what’s next for our field. Why?
Journalists are proving the audience and growing it. Companies are already jumping in to capitalize on easy marketing. Once attention reaches a critical mass and is sustained for months, the next wave piles in, and those people threaten our field.
Incredible images will hold attention for only so long. The people who pick the cycle up and keep it going must take much more extreme views. AI will become a tool of the Deep State. Data science will be a tool of the scientific community to take away freedoms and liberties. Be ready for people’s perceptions of data scientists to become polarized, which has policy implications.
During the pandemic, Rebekah Jones, a former data scientist with the Florida Department of Health, went to the media to expose COVID data manipulation. An audit recently confirmed her claims. A warrant was issued for her arrest, and she had to seek whistleblower protection to avoid prosecution.
Arrests are an extreme case. Cambridge Analytica’s scandal was another major data science whistleblower case with significant fallout. Zuckerberg was called before US and EU oversight hearings to explain Meta’s practices. In the 2016 election cycle, Meta had shown a power that scared politicians.
I commented on one of Zuckerberg’s US hearings that he was the most powerful person in the room and knew it. He had a larger audience than all the people questioning him combined. He runs a company with revenues rivaling the GDP of entire countries. Zuckerberg’s power came from his platform’s massive scale.
It’s been a rough year for Meta. They were an advertising titan until Apple took away a significant data source. The ripple effects of that change have hit Snap, Pinterest, and many other social platforms that relied on Apple for their tracking data. Withholding data is a powerful retaliatory and competitive mechanism between businesses.
Regulators can use their powers to support and suppress businesses similarly. China has been effective in using regulation under the guise of protecting user data to keep a grip on its emerging tech industry. Their government realized that tech companies were developing better models than the government had. Platforms were growing so pervasive that they acquired semi-governmental controls over large population segments.
Did Apple decide to restrict access to users’ data on their own, or was there some prodding from policymakers or regulators? Did Meta get a far milder Alibaba treatment as a warning to the rest of the US tech industry? It’s a feasible conspiracy theory but unlikely. I bring it up to illustrate how policymakers can assert influence on businesses.
What’s essential about the handful of extreme cases and ‘what ifs’ is they set precedents. Data scientists may have criminal liability. Businesses can pay a steep price. Powerbrokers are aware of what our field can do, and they are also looking for control. They want to use data science but not have it used against them.
Search Google for “data science political campaigns,” and ads are served against the results. Companies are paying for those keywords, and that’s telling.
Data science applications can drive public opinion and support narratives. Models can support and hurt political campaigns. They can threaten to undermine governments’ authority. The field will continue the march into politics and political reprisals. The earliest skirmishes over the control of AI will lead to more serious conflict between governments, businesses, and researchers.
It took only weeks for bad actors to crawl out of the woodwork and apply these technologies to the worst possible applications. The barriers to using large-scale models have fallen.
Old barriers were built by large datasets, expensive compute requirements, and significant domain knowledge. Once a model is open-sourced, people start to optimize it and make it easier to train on cheap hardware. They have legitimate purposes for doing it, but all three barriers come down once that happens.
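To make concrete how far those barriers have fallen, here is a minimal sketch, assuming the open-source diffusers library and a publicly released Stable Diffusion checkpoint (neither is named in this piece), of how little it now takes to run a state-of-the-art image model on a single consumer GPU.

```python
# A minimal sketch (not from the article) of how low the barrier is once a model
# is open sourced: a few lines pull a pretrained text-to-image diffusion model
# from a public hub and run it on consumer hardware. The checkpoint name is one
# publicly available example, not something referenced above.
import torch
from diffusers import StableDiffusionPipeline  # assumes the diffusers library is installed

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # example open-source checkpoint
    torch_dtype=torch.float16,         # half precision fits on cheap GPUs
).to("cuda")

# The same prompt-driven workflow described above, with no dataset,
# no training budget, and no domain expertise required.
image = pipe("a movie poster in the style of an action franchise").images[0]
image.save("poster.png")
```

No large dataset, no expensive compute, and no domain knowledge are required at inference time; that is what it means for all three barriers to come down.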
Researchers know and advertise the risks but release their work into the wild anyway. We have a well-documented history of abuse. Nothing is changing. As the NY Times article points out, capabilities are advancing faster. Small steps are turning into leaps.
Our self-discipline is not keeping up. In past cycles, the public’s attention came and went, and stories of abuse were quickly forgotten. The models’ capabilities limited the abuse. At a 2017 conference, I saw a bag that said, “What do we want? CHATBOTS! When do we want them? I’m sorry, but I don’t understand the question.” Today, chatbots are capable, and GPT-3 is auto-completing code well enough for people to pay for it.
It gets fiendish from there, and our field’s lack of self-discipline has led to open sourcing high-end capabilities. In a 2019 Pentagon wargame (published as occurring in 2024), it was revealed that widescale cyberattacks were unnecessary to significantly diminish US military capabilities. Targeted attacks on key people were enough to have a major impact. Many proposed attacks targeted vehicles with autonomous capabilities. The wargame was entirely fictional, but the scenario is feasible.
We have open-sourced adversarial models, advertised vulnerabilities, and published detailed research on exploitation methods. The wargame simulated China as the attacker, but smaller nations or a small group of Redditors could just as easily be the aggressors.
Cognitive warfare is now mainstream, and most major world powers practice different levels of it on the world stage. The weaponized mind is a military concept that’s been public for over 4 years. It is a strategic construct where data scientists team up with scientists from different cognitive fields and use information to weaponize the minds of their enemy’s population. Russia and Ukraine have traded cognitive warfare attacks and defenses.
None of this is a secret, and the tools to build data science weapons of mass destruction are available on arXiv.
There is no way to put the tools back into a controlled environment, so that approach isn’t worth wasting time on. Data scientists must acknowledge we are no longer years away from potential impacts. We will see them more and more often, starting now.
Climate change is a good metaphor. It was easy to ignore when the impacts were decades out. It was easy to deny when consequences were minor and happening somewhere else. Phoenix, Las Vegas, and some parts of Colorado are running out of water. Water shortages have forced China to implement power restrictions because their dams aren’t producing enough electricity. It’s hard to ignore when the lights go out, and you’re required to ration water.
Attention from government, businesses, and the public means each incident will have a massive audience. Ethicists have been raising the alarm for years, but just like climate change, no one cares if no one sees it. Now they are feeling it, and attention is growing.
Ethicists tell anyone who will listen not to use machine learning to predict crime. Companies still make the software available, and law enforcement buys it. Using facial features to determine emotion is another use case that doesn’t work, yet it continues to appear in hiring software suites. Both cases harm people at the fringes, but public sentiment will shift once the impacts are felt more evenly.
People are going to start testing the limits and legality. I am waiting for the first armed, autonomous home defense robot. I expect more sophisticated stalking and harassment tools to be offered for sale. Dozens more use cases are possible and profitable. It’s only a matter of time, so we need to think ahead and plan a response.
What can we do about it? First, we must establish standards for what can be open sourced and what is too dangerous to open source. Potential impact statements must include a risk assessment for harm from abuse. Based on those findings, some projects can be published and demonstrated publicly but not open sourced for everyone to implement.
We should also list known harmful use cases like predictive policing and emotion tracking. Publishing a scam list and calling out any companies working on those projects will significantly improve our field’s credibility. I like the scam label because it puts those use cases into a category with things like multi-level marketing companies. They can be operated legally, but MLMs are always suspect because of that perception.
The rate of progress and the spread of knowledge will slow. Businesses relying on innovation to drive profits and academics locked into publish-or-perish tenure requirements will push back until the pain hits them. I don’t think we can stop everyone from running headlong into the wall, but we can choose not to enable it.
Second, we must use our growing influencer communities to keep expert voices in the conversation. We got into the hype cycle of the early 2010s because the only available representatives were data science illiterates. This time, the field has the reach to keep itself in the conversation and prevent the same thing from happening again. We need consistent messages, or we won’t succeed.
Getting influencers on the same page is like herding cats. I don’t think they need to agree across the board, but they need to rationally bring expert knowledge into the conversation. Influencers must be transparent even when that’s not popular. Credibility is our only advantage; once we give that away, we lose all control and access.
Third, we must teach and open the field to more people. The easiest way to avoid becoming a villain is to have a large, diverse community. We need more data science professionals, and no one is pushing back against that. It’s a high-earning career. Businesses get a solution to the talent gap. Governments get a more robust, more innovative economy.
Right now, academia is going the wrong way. People are leaving higher education for the corporate world, where their earnings potential is much higher. We have a vibrant and growing online education community, but most educators have little experience in the field and aren’t teaching rigorous methods.
The bigger problem is that students demand a fast track into the field, and some educators are happy to take the shortcuts necessary to get them there. A data science curriculum cannot be condensed into a 3-month bootcamp. For data scientists, learning is a never-ending journey, not a destination.
We can hire junior-level talent before they are 100% ready. I believe that’s the right way because the best learning happens in the real world. However, we have the responsibility to continue their education.
Causal knowledge is slowly entering the conversation. We should be preparing for that because causal machine learning represents the next leap in capability. Once models reach that level of reliability, many more use cases become feasible.
Causal models will be reliable enough that people will hand agency over to them. We will trust models enough to let them make decisions or do tasks for us. We must have a framework in place before we get there. People need enough education to decide when to trust models, how much agency to give them, and when to take agency back because the models are behaving unpredictably.
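As a purely hypothetical illustration of what such a framework could look like in practice (nothing below is specified in this piece), even a simple guardrail that only lets a model act autonomously while its confidence is high and no behavioral drift has been flagged captures the give-and-take of agency described above.

```python
# Hypothetical sketch of an agency guardrail: the model keeps agency only while
# its behavior stays inside bounds we understand, and hands control back to a
# human otherwise. Names and thresholds here are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Decision:
    action: str
    confidence: float  # the model's own score, 0.0 to 1.0

def act_or_escalate(decision: Decision,
                    confidence_floor: float = 0.9,
                    drift_detected: bool = False) -> str:
    """Grant the model agency only when confidence is high and no drift is flagged."""
    if drift_detected or decision.confidence < confidence_floor:
        return f"ESCALATE to human review: {decision.action}"
    return f"EXECUTE automatically: {decision.action}"

# The same decision is handled differently once monitoring flags drift.
print(act_or_escalate(Decision("approve loan", 0.95)))
print(act_or_escalate(Decision("approve loan", 0.95), drift_detected=True))
```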
Model literacy and AI ethics should be part of the educational curriculum no matter what field or trade a person is going into. In the next 5 years, ambient intelligent systems will become a reality. That will progress in leaps, and it will be positive if we prepare for it now. If we roll it out like all previous technologies, it’ll be a train wreck, with far more control over our daily lives than any previous generation of technology had.