Of the many voice applications for AI, speech recognition is the most widely known and deployed, serving as a building block of voice assistants. To get a sense of its scale, the voice and speech recognition market alone is expected to grow from $9.4 billion in 2022 to $28.1 billion by 2027, according to a report by MarketsAndMarkets.
However, voice is a richer medium than text, and there are many interesting products to be built beyond just recognition. Based on speech, one can discern the age, emotion, or identity of a person. We can also generate natural-sounding speech with desired voice timbre and other qualities, or even transform the way people sound. In a previous post, we listed many potential applications of speech technologies.
There is one obstacle to making this vision a reality: most data and AI teams are unable to work with speech data due to the current state of tools. All ML and AI applications - including speech apps - depend on data. Until recently, teams working with audio data had to build bespoke tools. In this post we’ll describe a suite of open source tools that simplifies data processing, data integration, pipelining, and reproducibility for audio data.
There are several quirks associated with each type of data: tabular data can have missing values or unnormalized records; text often needs normalization; images often need to be resized, labeled, and checked for duplicates.
What are the main issues with speech data? Historically, many different formats have been developed for storing and compressing speech data. These formats can be lossless or lossy, each may require a different codec to read, and not all codecs are readily available in Python. Often the data has multiple channels (mono, stereo, or more - the popular Microsoft Kinect gaming sensor has a four-microphone array). These channels can all be in a single file, or in multiple files, depending on the mood of the person releasing the data.
While there are many audio codecs, the speech community has standardized around a few formats (WAV/PCM, MP3, OPUS, FLAC, etc.). The same cannot be said of metadata such as text transcripts or labels used in model training. Common labels include speaker identity, speaker age, speaker-change points, emotion, and sentiment. Typically every audio dataset has its own way of affixing labels and metadata. This ad hoc approach to labeling might be sufficient for academic research and R&D, but it makes it very difficult to combine multiple sources of data when building real-world speech applications.
In addition, speech applications and services often involve real-time processing, where models require special considerations for handling incremental inputs.
We have yet to meet a machine learning engineer who enjoys dealing with the challenges that come with audio data. Thankfully, an open source project called Lhotse resolves most of these common challenges. Lhotse provides fifty recipes to prepare data from commonly used audio datasets. Let's examine some specifics.
Working with audio is challenging due to the length of recordings. Sometimes data is nicely “segmented” into single phrases, but other times we have longer recordings such as podcasts. How can we effectively work with both?
Lhotse allows users to seamlessly retrieve segments of interest. We call these segments “cuts”. Think of audio engineers in a professional studio, cutting magnetic tapes in the 1980s. A great feature of cuts is that each one references all the relevant items: audio, text transcription, speaker label, and any features you might have extracted for that segment. It's like working with rows in a pandas dataframe, but for audio – and, as with dataframe columns, you can extend cuts with any new types of features or metadata you happen to collect.
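As a quick, hedged illustration, here is what inspecting a cut might look like; the manifest path below is made up and would typically be produced by one of Lhotse's data preparation recipes:

```python
from lhotse import CutSet

# "data/cuts.jsonl.gz" is a hypothetical manifest path.
cuts = CutSet.from_file("data/cuts.jsonl.gz")

cut = next(iter(cuts))            # a single segment ("cut")
audio = cut.load_audio()          # numpy array with the waveform
print(cut.duration)               # length of the segment in seconds
for sup in cut.supervisions:      # supervisions carry transcript, speaker, etc.
    print(sup.text, sup.speaker)
```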
With cuts, it is very easy to repurpose an existing dataset for other tasks. For example, one can easily reuse a conversational speech recognition dataset for voice activity detection (see Figure 3). One can also glue different cuts together and mix them with some noise to create a new dataset, or augment existing data.
Figure 3: The same part of a conversation, used to construct the training data for either (A) speech recognition or (B) voice activity detection.
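A hedged sketch of what such repurposing and augmentation might look like in code (method names follow recent Lhotse versions; `cuts` and `noise_cuts` are assumed to be existing CutSet objects):

```python
# Re-segment long conversational recordings into per-utterance cuts for ASR...
utterance_cuts = cuts.trim_to_supervisions()

# ...or keep fixed-length windows (speech and silence alike), a more natural
# unit for voice activity detection.
vad_cuts = cuts.cut_into_windows(duration=5.0)

# Augment by mixing in noise cuts at a 10 dB signal-to-noise ratio.
augmented = utterance_cuts.mix(cuts=noise_cuts, snr=10)
```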
Audio data has traditionally been stored on filesystems, but we’re increasingly seeing teams move to cloud object stores and other cloud-native storage services. Lhotse handles these cases seamlessly, and lets you reference your data using the same set of APIs and abstractions. This means that if you happen to move your data around, there are no code changes required – you simply need to update your metadata manifests.
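To illustrate (a hedged sketch; the bucket path is made up): the audio source recorded in a manifest can point at a local file or, with the right backends installed, at an object-store URL, so relocating the audio is just a matter of rewriting the source field.

```python
from lhotse import Recording
from lhotse.audio import AudioSource

# A Recording manifest entry whose audio lives in object storage rather than
# on the local filesystem (the URL is illustrative).
recording = Recording(
    id="call-001",
    sources=[AudioSource(type="url", channels=[0],
                         source="s3://my-bucket/calls/call-001.wav")],
    sampling_rate=16000,
    num_samples=160000,
    duration=10.0,
)
```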
Another frequent pain point is that feature arrays are variably sized (some recordings are longer than others) and can be very large for long recordings. Lhotse supports several backends (raw audio files, HDF5 feature arrays, and custom formats) for writing and reading data. Lhotse also supports a custom lossy compression format (called lilcom) tailored specifically for speech features, which can reduce storage size by up to 70% without impacting the quality of the trained models.
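Pre-computing features with lilcom compression might look like the following hedged sketch (it assumes an existing CutSet named `cuts`; the storage path is illustrative):

```python
from lhotse import Fbank, LilcomChunkyWriter

# Extract log-mel filterbank features for every cut and store them compressed
# with lilcom; the resulting CutSet references the stored feature arrays.
cuts_with_feats = cuts.compute_and_store_features(
    extractor=Fbank(),
    storage_path="data/fbank",
    storage_type=LilcomChunkyWriter,
)
```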
Data scientists are familiar with pandas’ chain-of-operations API, and Lhotse’s design is heavily inspired by this programming model. Manipulating and transforming your audio data is as simple as the following:
Figure 4: An example of a complex operation on audio data implemented with pandas-style operation chaining in Lhotse.
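In the same spirit, a hypothetical chain (not necessarily the exact one shown in Figure 4; it assumes an existing CutSet named `cuts`) could look like this:

```python
processed = (
    cuts.filter(lambda cut: cut.duration >= 1.0)   # drop very short segments
        .resample(16000)                           # unify the sampling rate
        .perturb_speed(1.1)                        # simple data augmentation
        .trim_to_supervisions()                    # keep only annotated speech
)
processed.to_file("data/processed_cuts.jsonl.gz")  # persist the new manifest
```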
Lhotse's primary purpose is to support machine learning workflows. Designed from the ground up to integrate seamlessly with PyTorch DataLoader API, it maximizes the developer's modeling velocity and productivity.
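A minimal, hedged sketch of that integration for an ASR-style setup (class names follow Lhotse's dataset and sampling modules; it assumes `cuts` have pre-computed features as in the earlier example):

```python
import torch
from lhotse.dataset import K2SpeechRecognitionDataset, SimpleCutSampler

dataset = K2SpeechRecognitionDataset()                # turns cuts into tensors
sampler = SimpleCutSampler(cuts, max_duration=100.0)  # ~100 s of audio per batch

# batch_size=None: batching is handled by the Lhotse sampler, not the DataLoader.
loader = torch.utils.data.DataLoader(
    dataset, sampler=sampler, batch_size=None, num_workers=2
)

for batch in loader:
    ...  # padded features plus supervision info, ready for the model
```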
Lhotse also offers a number of (sampling) classes that can stratify how the data is selected for training. As we’ve noted, a common issue is that speech cuts have uneven durations (similarly to how text data has long and short strings). One possible solution is “bucketing”: selecting data examples of similar size and presenting them to the ML model together. During the early stages of building Lhotse, we noticed that bucketing increased the amount of speech data used at each training step by 40%. Bucketing translated to batches that contained less padding, which led to faster training times.[1] Moreover, in Lhotse bucketing can even happen on the fly with a minimal amount of memory usage. For large datasets, this can be a memory reduction of as much as 95%!
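Enabling bucketing is, roughly, a matter of swapping the sampler in the previous sketch (again a hedged example; the bucket count is an arbitrary illustrative choice):

```python
from lhotse.dataset import DynamicBucketingSampler

sampler = DynamicBucketingSampler(
    cuts,
    max_duration=100.0,   # total seconds of audio per mini-batch
    num_buckets=30,       # group cuts of similar duration together
    shuffle=True,
)
# The DataLoader from the previous example can be used unchanged.
```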
SSD-powered storage is becoming more common these days, but it is still not a commodity for very large datasets. Sometimes it's more convenient to store large amounts of data on slower spinning hard drives (HDDs), or on "cheap" cloud storage systems. A major downside to not using SSDs is that reading data for model training is much slower.
Lhotse integrates with third-party libraries such as WebDataset to get the most out of slower storage. Long story short, because of the way magnetic disks are constructed, it’s much faster to read data sequentially (i.e., when records lie next to each other) than randomly, from all across the physical disk. With the help of WebDataset, Lhotse can “compile” your speech data to prepare it for lightning-fast reads. Across a range of typical workloads, we found that these techniques may speed up data loading by 5-100x compared with random reads. The biggest gains are observed when using multiple types of features for the training examples.
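As a hedged sketch (function and argument names as in recent Lhotse versions; the shard path pattern and sizes are illustrative), "compiling" a CutSet into sequentially readable tar shards might look like this:

```python
from lhotse.dataset.webdataset import export_to_webdataset

# Write the cuts (with their audio/features) into tar shards that can be read
# sequentially during training.
export_to_webdataset(
    cuts,
    output_path="shards/cuts-%06d.tar",  # one tar file per shard
    shard_size=1000,                     # number of cuts per shard
)
```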
If Lhotse sounds like a real game-changer for ML engineers who work with audio, you’re definitely going to like what’s coming next. Lhotse is being developed as part of a collaborative speech community effort dubbed “k2” that includes contributors from organizations like Xiaomi, Johns Hopkins University, Nvidia, Microsoft, Cisco, and Meaning. The name k2[2] is a play on “Kaldi 2” (as a next-gen successor to the Kaldi project). Kaldi is the most popular speech toolkit to date (with 11.9k GitHub stars). It was started in 2010, before the era of TensorFlow and PyTorch, and is written mostly in C++ and Bash, with a bit of Perl and Python.
Within the k2 ecosystem, each project focuses on a specific issue related to speech modeling. The titular k2 implements highly optimized graph (finite-state acceptor, FSA) algorithms for CPU and GPU. k2 also integrates with PyTorch to provide training objectives and inference (decoding) methods specific to sequences such as text and speech. Lhotse, as we described above, deals with everything related to data processing and data integration. Finally, Icefall glues k2, Lhotse, and PyTorch together to provide reproducible recipes for training speech models. To get a sense of the impact of these tools, many pretrained models built with Icefall can be found on the HuggingFace Model Hub and HuggingFace Spaces (see here and here).
Figure 6: The three main projects in the k2 ecosystem - k2, Lhotse, and Icefall.
Lhotse and the other k2 ecosystem libraries are freely available both on PyPI (via “pip install lhotse”) and on GitHub. You can also check our NeurIPS DCAI 2021 paper. Please reach out to us on GitHub Discussions or via email (lhotse@meaning.team) to discuss further.
If you have your own audio data, you can start exploring Lhotse with this short snippet:
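A hedged sketch along those lines (the file paths are made up; substitute your own recordings):

```python
from lhotse import CutSet, Recording, RecordingSet

# Build a manifest from your own audio files and turn it into cuts.
recordings = RecordingSet.from_recordings(
    Recording.from_file(path)
    for path in ["audio/interview1.wav", "audio/interview2.wav"]
)
cuts = CutSet.from_manifests(recordings=recordings)
print(cuts)
```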
Here’s another zero-effort example on Colab to get you started with Lhotse using publicly available datasets. You can find more tutorials here.
Piotr Żelasko is the Head of Research at Meaning. He is an expert on automatic speech recognition (ASR) and the main author of Lhotse. He previously worked in academia (Johns Hopkins University, AGH-UST) and industry (Avaya, Techmo).
Jan Vainer is a Lead Machine Learning Engineer at Meaning. He is an expert in speech synthesis and voice conversion, and a Lhotse contributor.
Tomáš Nekvinda is a Speech Research Scientist at Meaning. He is an expert in speech synthesis and generative AI models, and a Lhotse contributor.
Ben Lorica is a principal at Gradient Flow. He helps organize the Data+AI Summit, Ray Summit, and is co-chair of the NLP Summit and K1st World. He is an advisor to Meaning and several other startups.
[1] This is because you are trying to pack multiple snippets with different durations into a single mini-batch. If you pack utterances that are [10, 8, 5] seconds long, you have to add [0, 2, 5] seconds of padding (silence) to present an input tensor to the GPU. Bucketing allows you to construct mini-batches of something like [10, 10, 9] instead, so instead of padding 7 seconds of silence, you only pad 1 second. On a more technical note, the mini-batches in Lhotse have dynamic batch sizes determined by the total duration of speech in a mini-batch. So if you set the max duration limit at 100s, you keep collecting data until its total duration is close to 100s. When you're bucketing, you can collect either 10x10-second utterances or 50x2-second utterances. Otherwise, if you had both 10-second and 2-second examples in a mini-batch, you would waste close to 40-50% of the input tensor by filling it with padding.
[2] k2 happens to be the name of the second highest mountain in the world.