Import AI 304: Reality collapse thanks to Facebook; open source speech rec; AI culture wars.

Facebook shows the future of AI-generated videos – and it is delightful and terrifying:

…Prepare for the reality collapse as a consequence of reality generation…

Facebook researchers have built Make-A-Video, a system that lets users generate videos from short text descriptions, edit videos, stitch pictures together into videos, and so on. The most amazing part is that the technique relies on paired text-image data along with unlabeled video footage; it doesn’t require a dataset of paired text-video footage, and therefore sidesteps a potentially expensive data problem. 

How it works: Make-A-Video is built from three components: a base text-to-image (T2I) model trained on text-image pairs; spatiotemporal convolution and attention layers that extend the image network to generate across time; and a frame interpolation network that fills in extra frames. The T2I model generates 64×64 images, and two super-resolution networks upscale these all the way to 768×768 pixels. The three components – the T2I model, the spatiotemporal layers, and the frame interpolation network – are trained separately, then assembled into one architecture. 
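To make that factorization concrete, here’s a minimal PyTorch sketch of the general ‘pseudo-3D’ pattern the paper describes: attend spatially within each frame, then temporally across frames at each spatial location. The class, dimensions, and layer choices below are illustrative, not Facebook’s actual code:

```python
import torch
import torch.nn as nn

class Pseudo3DAttention(nn.Module):
    """Factorized space/time attention (illustrative sketch):
    attend within each frame, then across frames at each location."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, space, dim), where space = height * width.
        b, t, s, d = x.shape
        # Spatial pass: every frame is an independent attention problem.
        xs = x.reshape(b * t, s, d)
        xs, _ = self.spatial(xs, xs, xs)
        x = xs.reshape(b, t, s, d)
        # Temporal pass: every spatial location attends across frames.
        xt = x.permute(0, 2, 1, 3).reshape(b * s, t, d)
        xt, _ = self.temporal(xt, xt, xt)
        return xt.reshape(b, s, t, d).permute(0, 2, 1, 3)

# 2 videos, 16 frames, an 8x8 latent grid, 64-dim features.
out = Pseudo3DAttention(dim=64)(torch.randn(2, 16, 64, 64))
print(out.shape)  # torch.Size([2, 16, 64, 64])
```

The appeal of this kind of factorization is that the temporal layers can be bolted onto a pretrained image model – which is what lets the image half train on text-image pairs while the video half learns from unlabeled footage.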

Data: They trained the system on 2.3 billion text-image pairs from the LAION-5B dataset*, filtered further with an NSFW filter. They also used WebVid-10M* and a 10M-video subset of HD-VILA-100M to train the video generation models, and used WebVid-10M to train the interpolation models.   *Looks like WebVid contains videos scraped from Shutterstock. A good writeup about the phenomenon of even big tech companies using stuff like this here: AI Data Laundering: How Academic and Nonprofit Researchers Shield Tech Companies from Accountability (Waxy).

It’s really good, folks: The results are really, really impressive. Want a short video of a bear painting a portrait of a bear? Done. Want a UFO flying over a desert? Done. Want asteroids tumbling through space? Why, of course. How about variations on existing videos? Sure. Honestly, take a look at the blog and main site linked below and see for yourself – the results are wild. 

   And remember, all we need to do is turn the crank on dataset scale and network complexity to scale this out for longer periods of time and for even greater diversity. “Learning world dynamics from orders of magnitude more videos using unsupervised learning helps researchers break away from the reliance on labeled data,” they write. 

Why this matters: Reality generation and reality collapse: All these generative models point to the same big thing that’s about to alter culture; everyone’s going to be able to generate their own custom and subjective aesthetic realities across text, video, music (and all three) in increasingly delightful, coherent, and lengthy ways. This form of fractal reality is a double-edged sword – everyone gets to create and live in their own fantasies that can be made arbitrarily specific, and that also means everyone loses a further grip on any sense of a shared reality. Society is moving from having a centralized sense of itself to a set of highly individualized choose-your-own-adventure islands, all facilitated by AI. The implications of this are vast and unknowable. Get ready.

   Read more: Introducing Make-A-Video: An AI system that generates videos from text (Facebook research blog).

   Find out more at the main site, and also apply to potentially get access to future systems (Facebook site).

#################################################### OpenAI releases Whisper, an open source speech recognition system:

…Whisper means we’re not going to run out of data to train language models…

OpenAI has trained and released Whisper, a large-scale speech recognition model trained on almost 700,000 hours of internet-collected speech. “We show that the use of such a large and diverse dataset leads to improved robustness to accents, background noise and technical language. Moreover, it enables transcription in multiple languages, as well as translation from those languages into English,” the company writes. A third of the dataset is non-English. 

Whisper performance: Whisper doesn’t get state-of-the-art performance on popular benchmarks like Librispeech. However, it is trained on a sufficiently broad set of data that it does pretty well when exposed to the diversity of the world. “When we measure Whisper’s zero-shot performance across many diverse datasets we find it is much more robust and makes 50% fewer errors than those models,” OpenAI writes. 

Why this matters: There’s a lot of text data on the internet, but do you know what there’s more of? Speech. Especially speech embedded in the vast stream of content people upload day-to-day to places like YouTube, Twitter, TikTok, and so on. Additionally, on any given day hundreds of millions of words are spoken in cities like New York, London, and Beijing. Systems like Whisper are going to make it far easier for people to harvest speech data from the internet and the wider world, transcribe it, and build useful applications. It also gives developers a way to vastly increase the size of their text datasets – an important capability given that recent language modeling papers like Chinchilla have shown you need about 4-5X the amount of data people previously thought to train good systems. 
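To give a sense of how low the barrier to that kind of harvesting is, here’s a minimal sketch using the released whisper package to turn a folder of audio into text; the folder name and checkpoint size are placeholder choices:

```python
import pathlib
import whisper  # pip install openai-whisper

# "base" is one of the released checkpoint sizes; larger checkpoints
# (e.g. "medium", "large") trade speed for accuracy.
model = whisper.load_model("base")

# Hypothetical folder of scraped audio; any ffmpeg-readable format works.
for path in sorted(pathlib.Path("scraped_audio").glob("*.mp3")):
    result = model.transcribe(str(path))
    # result["text"] is the transcript; result["language"] is the
    # auto-detected language, handy for filtering a text corpus.
    print(path.name, result["language"], result["text"][:80])
```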

   Get the code and model from GitHub here (OpenAI GitHub). 

#################################################### US politician says Stable Diffusion is an unsafe AI model:

…While some people cheer open access releases, others have worries…

Rep. Anna Eshoo (a Democrat from California) has sent a letter to the White House National Security Advisor and Office of Science and Technology Policy saying she has “grave concerns about the recent unsafe release of the Stable Diffusion model by Stability AI”. The letter notes that Stable Diffusion can be used to generate egregiously violent and sexual imagery, and that – because it eschews the kinds of controls OpenAI applies to its commercial product DALL-E 2 – the freely accessible model represents a big problem. 

   For those not keeping up, the Stable Diffusion model is behind probably 90% of the recent flurry of activity in the rapidly evolving AI art scene; because Stability released the weights of the model, people have been able to plug it into everything from Photoshop plugins to weird VFX workflows. 
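That plug-it-into-everything dynamic follows directly from the weights being downloadable; as a sketch of how low the barrier is, loading the released v1.4 weights takes a few lines with Hugging Face’s diffusers library (the prompt and fp16 choice here are just examples):

```python
import torch
from diffusers import StableDiffusionPipeline

# Pulls the publicly released Stable Diffusion v1.4 weights from the
# Hugging Face Hub; fp16 keeps it within a consumer GPU's memory.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
).to("cuda")

image = pipe("a UFO flying over a desert, film still").images[0]
image.save("ufo.png")
```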

You want the ‘dual-use’ model? You can’t handle the model! Eshoo says models like Stable Diffusion qualify as “unsafe dual-use AI models”, and asks the National Security Advisor and OSTP to investigate how export controls could be used to clamp down on the sharing of certain models. “I strongly urge you to address the release of unsafe AI models similar in kind to Stable Diffusion using any authorities and methods within your power, including export controls,” she writes. 

Why this matters: Here comes (another) AI culture war: Letters like this are indicative of a culture war brewing among AI researchers; on one side, groups want to slowly and iteratively deploy new technologies via APIs with a bunch of controls applied to them, while on the other side there are people who’d rather take a more libertarian approach to AI development: make models, release the weights, and ride the proverbial lightning. 

   There are reasonable arguments for either approach having some desirable safety qualities (either by limiting foreseen harms via control, or by inoculating people against the models via release). What freaks me out is the sense of this culture war gaining resources and people on both sides; the higher the stakes, the more capital we can expect to flood into both approaches.

#################################################### Tsinghua researchers release CodeGeeX, a free 13B-parameter code model:

…Open source code generation, trained on non-NVIDIA chips…

Researchers with Tsinghua University have released CodeGeeX, a 13 billion parameter programming model. The system works well across Python, C++, Java, JavaScript, Go, and others, and can be used – for free! – within the VS Code editor. It’s also open source. CodeGeeX is roughly equivalent to Salesforce’s ‘CodeGen’ model, and achieves a better average performance across languages (Python, C++, Java, JavaScript, and Go) than other systems. 

Ascend processors: CodeGeeX was trained on 850 billion tokens using a cluster of 1,536 Huawei Ascend 910 AI processors – this is pretty interesting because a) that’s a lot of tokens, which implies the developers grokked DeepMind’s Chinchilla paper, and b) that’s a whole lot of non-NVIDIA processors, which is notable given the recent A100/H100 US-China trade ban. 

Scale rules everything around us: “We find that the model capacity is essential for its multilingual ability. It is not trivial for the model to benefit from learning multiple programming languages,” the researchers write. “The few-shot ability of CodeGeeX requires further exploration. Instead of using costly fine-tuning approaches, we can provide a few examples to inspire the model to generate the desired programs.”
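Their few-shot point is worth making concrete: instead of fine-tuning, you prepend a couple of solved examples and let the model continue the pattern. CodeGeeX ships with its own inference toolkit, so the sketch below substitutes Salesforce’s CodeGen (the comparison model, which loads directly through Hugging Face transformers); the prompt is illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in for CodeGeeX: Salesforce's CodeGen, which the paper
# benchmarks against and which is available via transformers.
name = "Salesforce/codegen-350M-mono"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# Few-shot prompt: one solved (comment -> function) pair, then the task.
prompt = '''# Task: return True if n is even
def is_even(n):
    return n % 2 == 0

# Task: return the factorial of n
def factorial(n):
'''

ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=64, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
```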

Why this matters: Code models are going to make human programmers more efficient and also provide an interesting augmentation to other systems (e.g., language models recursively calling out to code models). 

#################################################### GPT3 only costs $500k to train now:

…Though the frontier still costs millions…

Mosaic, a startup that builds software to make it more efficient to train neural networks, says it now costs only about $450k to train a GPT3-equivalent model. When GPT3 came out it cost millions of dollars to train, but thanks to a) hardware innovations and b) companies like Mosaic improving their training stacks, the cost has come down significantly. “The bottom line: it costs about $450K to train a model that reaches GPT-3 quality*, which is 2x-10x less than people think,” Mosaic writes (specifically, a 30B parameter model which uses the ‘Chinchilla’ insight to train on a compute-optimal amount of data).

Those costs in full: Using Mosaic, it costs about $2k to train a GPT2-style 1.3 billion parameter model, $100,000 for a GPT-13B model, $450,000 for a GPT-30B model, and $2.5 million for a GPT-70B model (trained on 1400B tokens of data, so roughly the same ‘recipe’ DeepMind used to train Chinchilla). There are a few reasons why the costs are this low, which relate to nice engineering inherent to Mosaic’s cloud, but the numbers are worth keeping in mind as they give us a sense of how much we should broadly expect LMs to cost to train given a motivated team and decent infrastructure. 
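As a sanity check on those figures, the standard FLOPs ≈ 6 × parameters × tokens approximation gets you into the same range; the sustained GPU throughput and hourly price below are assumptions of mine, not Mosaic’s numbers:

```python
# Back-of-the-envelope LM training cost: total FLOPs ~= 6 * N * D.
def train_cost_usd(params: float, tokens: float,
                   flops_per_gpu: float = 1.5e14,   # assumed ~50% of A100 bf16 peak
                   usd_per_gpu_hour: float = 2.0) -> float:  # assumed cloud price
    gpu_hours = 6 * params * tokens / flops_per_gpu / 3600
    return gpu_hours * usd_per_gpu_hour

# Chinchilla-style recipe: ~20 training tokens per parameter.
for n in [1.3e9, 13e9, 30e9, 70e9]:
    print(f"{n / 1e9:>5.1f}B params: ~${train_cost_usd(n, 20 * n):,.0f}")
```

Under those assumptions the script prints roughly $750 / $75k / $400k / $2.2M for the four model sizes – within shouting distance of Mosaic’s quoted prices, the point being that the figures are consistent with ordinary cloud arithmetic rather than any exotic discount.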

Why this matters – cost rules everything about (stable) diffusion: You know what also cost about $500k to train? Stable Diffusion.
