
The Data Daily

This nonprofit thinks it knows how to solve the A.I. talent shortage

One of the perpetual problems for businesses hoping to use A.I. is finding people with the right skills in data science and machine learning. This talent is sparse and unevenly distributed across the globe. In fact, many countries, and even entire regions, are currently being left behind when it comes to A.I. skills and, as a result, lack companies able to build their own A.I.-enabled software.

Sara Hooker thinks the limited access to real-world experience building cutting-edge A.I. is a problem for everyone. And she wants to change that—enabling more people, especially those not from traditional powerhouse computer science Ph.D. programs at just a handful of major research universities, to work on projects that can push the state of the art in A.I. forward.

A former Google Brain researcher, Hooker is now head of Cohere for AI, a non-profit research lab affiliated with the for-profit A.I. software company Cohere, which has offices in San Francisco, Palo Alto, Toronto, and London. Cohere was also founded by Google Brain alums and specializes in selling access to ultra-large language models, the kind of A.I. behind recent advances in natural language processing.

Last week, Cohere for AI announced a new program that will take people interested in conducting A.I. research from almost any region of the world and provide them with an eight-month, full-time, paid fellowship with Cohere for AI working on large language models. “A lot of what we are trying to do is to change where research can occur and who can participate in it,” she tells me.

Hooker may be particularly attuned to the lack of geographic diversity in A.I. because she grew up in Africa (you can still detect a hint of a South African accent when she speaks). While still at Google Brain, she helped to establish Google’s research lab in Ghana, the first of its kind in Africa, and she says that Africa remains one of many places in the world being left behind by the A.I. revolution—and that this absence is ultimately bad for the development of A.I. itself.

She also argues that the lack of diversity is constraining the field. “When I talk about improving geographic representation, people assume this is a cost we are taking on. They think we are sacrificing progress,” Hooker says. “It is completely the opposite.” Building more diverse teams, she says, is more likely to lead to more innovation, not less.

Cohere for AI’s “Scholar” program will accept candidates based on the strength of their ideas and the projects they want to pursue, regardless of whether they have a traditional academic research background, Hooker says. In fact, one of the criteria for the fellowship is that candidates cannot have previously published an academic research paper in machine learning.

The problem with the A.I. research departments at large tech companies, Hooker tells me, is that a kind of groupthink sets in—it is essentially all the same people who once populated academic A.I. labs, and they all conduct research in the same way. “We want to find new spaces that exist outside that system,” she says.

Cohere for AI is an example of one of several new research “collectives,” where the members themselves decide what problems to pursue. Hooker says that although Cohere as a company uses deep learning to build ultra-large language models, the Cohere for AI collective is a broad church, with members interested in completely different approaches, including symbolic A.I. systems that don’t use neural networks and are instead based on logical rules for manipulating symbols. The only requirement is that members commit to open sourcing the A.I. systems they build. “We want to participate in and contribute to these open forums,” Hooker says. “We want to provide a space for open source code and open discussion.”

She does allow that there might be concerns about openly publishing powerful A.I. software because it might be abused to create misinformation or used for fraud, and that “a nuanced discussion of risk is called for.” But she says she hopes Cohere for AI can be a forum for such discussions and can play a role in educating policymakers globally about both the risks and benefits of A.I.

It’s not clear whether Cohere for AI’s experiment in collectivist research will work. After all, OpenAI, the San Francisco A.I. research company behind the ultra-large language model GPT-3 as well as the text-to-image generation A.I. DALL-E, also began life as a non-profit. It too had some radical ideas about how research should work. In 2016, shortly after its founding, it held a free and open “unconference” on machine learning in San Francisco, which was supposed to be self-organizing and more accepting of a diversity of ideas and people than a traditional academic A.I. conference. It wasn’t clear that it really worked. The experiment was not repeated, and today OpenAI is a for-profit company (well, it has said it will cap its investors’ profits at 100 times their initial funding) tightly partnered with Microsoft. It is primarily focused on building very large deep learning models, with natural language processing as a core component. Gone are the days of “unconferences” and broad-church thinking about how best to drive progress. But whether or not Cohere for AI is able to validate its ideas about organizing research, businesses should take heed of Hooker’s point about diversity. It isn’t a cost. It’s an opportunity. The A.I. talent shortage will never be solved by just hoping a handful of universities churn out more machine learning experts and data scientists. Companies need to think hard about how to train people from other backgrounds to help build and maintain A.I. software.

*** Please join me for what promises to be a fantastic virtual roundtable discussion on A.I. “Values and Value” on Thursday, October 6th, from 12:00 to 1:00 p.m. Eastern Time.

The A.I. and machine-learning systems that underwrite so much of digital transformation are designed to serve millions of customers yet are defined by a relatively small and homogenous group of architects. Irrefutable evidence exists that these systems are learning moral choices and prejudices from these same creators. As companies tackle the ethical problems that arise from the widespread collection, analysis, and use of massive troves of data, join us to discuss where the greatest dangers lie, and how leaders like you should think about them.

You can register to attend by following the link from Fortune’s virtual event page. 

Private medical images have been discovered in a public text-to-image training dataset. An artist tells Ars Technica that she discovered images of her face taken by her doctor had somehow been scraped into a large public dataset of images that has been used to train A.I. software, including the popular text-to-image generation model Stable Diffusion. The artist, who goes by the name Lupine and suffers from a rare genetic condition called Dyskeratosis Congenita, told the publication that she used a reverse image search tool on the website Have I Been Trained to find the photos her doctor had taken of her face in 2013. The doctor died in 2018, and the artist said she suspected the photos had somehow left the control of his office after that.

OpenAI releases free multilingual speech recognition system. The new A.I. software, called Whisper, can recognize speech in many different languages and accents, transcribe it, and translate non-English speech into English. The system was trained on 680,000 hours of audio data scraped from the Internet. OpenAI said in a blog post that it hoped that by releasing the model as free, open-source software it would encourage developers to add speech recognition to their products. But the A.I. research company also warned of the potential for malicious uses of the software, such as enabling greater surveillance.
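For developers curious what that looks like in practice, here is a minimal sketch using the open-source whisper Python package OpenAI released alongside the model; the audio file name is a placeholder, and the package also requires ffmpeg to be installed.

    # A minimal sketch, assuming the open-source "whisper" package
    # (pip install openai-whisper) and ffmpeg are installed;
    # "meeting.mp3" is a placeholder file name.
    import whisper

    model = whisper.load_model("base")          # smaller checkpoints download faster
    result = model.transcribe("meeting.mp3")    # transcribes in the language spoken
    print(result["text"])

    # To translate non-English speech into English instead:
    translation = model.transcribe("meeting.mp3", task="translate")
    print(translation["text"])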

Getty Images bans A.I.-generated content over copyright concerns. The photo agency says it will not allow artists and photographers to upload and sell images generated by A.I. software. Getty’s CEO Craig Peters told tech publication The Verge that “there are real concerns with respect to the copyright of outputs from these models and unaddressed rights issues with respect to the imagery, the image metadata and those individuals contained within the imagery.” Getty fears that customers who bought the images for commercial use could put themselves in legal jeopardy.

Consulting giant McKinsey & Co. has hired Jacky Wright to be its first chief technology officer, Bloomberg News reports. Wright was previously Microsoft’s chief digital officer.

Sensa, the Austin, Texas-based insurance company that uses machine learning-based analytics software to rapidly assess damage to vehicles and injuries to people in the wake of an auto accident, has hired Steven Brown as its chief operating officer, according to trade publication Reinsurance News. Brown, a longtime insurance industry veteran, was previously chief insurance officer at software company Floow.

DeepMind says it has built a better chatbot. The lab has developed a chatbot, called Sparrow, that it says can provide people with more factually accurate information. Sparrow is designed to talk with humans and answer their questions, using DeepMind’s large language model Chinchilla to compose its responses. But the problem with using Chinchilla without any filtering is that, like most large language models, it tends to invent information. It can also regurgitate outdated information that it ingested during its training. To try to make the chatbot more accurate, DeepMind created a system in which Chinchilla’s responses are informed by a live Google search.

To further hone these answers, Sparrow polls human users on which of several different responses they prefer. Reinforcement learning is then used to train Sparrow to predict which answer most people will prefer. The chatbot also follows some 23 hard-coded rules set by DeepMind, such as not offering financial advice, not making threats, and not claiming to be human.
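The general pattern DeepMind describes (sample several candidate replies, drop any that break the rules, then let a preference model pick the winner) can be sketched in a few lines. The Python below is a toy illustration of that idea, not DeepMind’s implementation; all three helper functions are hypothetical stand-ins.

    # A toy sketch, not DeepMind's code: generate candidate replies, filter
    # out any that violate hard-coded rules, then return the reply a learned
    # preference model scores highest. The helpers below are placeholders
    # for the language model, rule classifiers, and reward model.

    def generate_candidates(question: str, n: int = 4) -> list[str]:
        """Stand-in for sampling n draft answers from a large language model."""
        return [f"draft answer {i} to: {question}" for i in range(n)]

    def violates_rules(answer: str) -> bool:
        """Stand-in for classifiers enforcing rules such as no financial
        advice, no threats, and no claiming to be human."""
        banned = ("you should invest", "i am a human")
        return any(phrase in answer.lower() for phrase in banned)

    def preference_score(question: str, answer: str) -> float:
        """Stand-in for a reward model trained on human preference votes."""
        return float(len(answer))  # placeholder heuristic

    def respond(question: str) -> str:
        candidates = [a for a in generate_candidates(question) if not violates_rules(a)]
        return max(candidates, key=lambda a: preference_score(question, a))

    print(respond("Who won the last World Cup?"))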

Sparrow was much better at providing answers that humans liked—and that were accurate—compared with previous systems. But it was not perfect. It would still provide off-topic and inaccurate answers some of the time. It also broke those 23 rules about 8% of the time (which was a third less often than previous chatbots, but still not fantastic).

You can read DeepMind’s non-peer-reviewed paper on Sparrow here, and this MIT Technology Review story contains commentary on Sparrow from experts outside DeepMind.

Commentary: A.I. is not sentient–but we should treat it as such—by Triveni Gandhi

Elon Musk is getting ready to unleash an army of humanoid robots. Here’s what he wants to use them for—by Prarthana Prakash

Is taking copyrighted material off the Internet to train an A.I. system IP theft? That question is increasingly being raised by artists, writers, intellectual property lawyers, and regulators as powerful text-to-image A.I. systems gain in popularity. The likes of DALL-E and Stable Diffusion are trained on millions of images scraped from the Internet, and increasingly artists are saying that this training amounts to IP theft. Concerns over legal challenges led Getty Images to ban people from using the photo agency to sell A.I.-generated images (see the News section above). And the U.S. Patent and Trademark Office has asked people to comment on its proposals for regulating A.I. and the images and text that A.I. generates.

In its comments to the U.S. PTO, OpenAI, which produces two of the best-known generative A.I. systems (GPT-3 for language and DALL-E for images), argues that training on copyrighted material should fall under the “fair use” exception to copyright. It claims that A.I. training is “substantially transformative” of the underlying copyrighted works, and thus qualifies as fair use. It also points to previous cases in which the large-scale processing of copyrighted works for the purposes of data analytics was considered fair use. Finally, it argues that authors and artists who claim that stylistically similar A.I.-generated works harm their ability to profit from their IP can either take legal action in specific cases where an A.I.-generated work is substantially identical to something they’ve copyrighted, or seek other policy solutions, rather than trying to ban the training of A.I. systems on copyrighted material scraped from the Internet.

But OpenAI’s submission to the PTO drew a sharp response from Nicole Miller, who describes herself on Twitter as an A.I. ethics, IP, and copyright advocate. Miller tweeted that OpenAI admits that the majority of the material it takes from the Internet to train large A.I. models is indeed copyrighted. “They’re using it as a 0-cost asset and selling it anyway,” she wrote.

I expect this legal area to get very heated over the next 12 months as generative A.I. systems go mainstream and find their way into all kinds of commercial uses, while regulators and policymakers struggle to keep up.
