Caches Considered Harmful for Machine Learning

I’ve been working on a new research paper, and a friend gave me the feedback that he was confused by my statement that “memory accesses can be accurately predicted at the compilation stage” for machine learning workloads, and by the claim that this makes them a poor fit for conventional processor architectures with predictive caches. I realized that this is received wisdom among the ML engineers I know, but I wasn’t aware of any papers that discuss the point. I put out a request for help on Twitter, and while there were a lot of interesting resources in the answers, I still couldn’t find any papers focused on what feels like an important property of machine learning systems. With that in mind, I wanted to at least describe the issue as best I can in this blog post, so there’s a trail of breadcrumbs for anyone else interested in how system designs might need to change to accommodate ML.

So, what am I talking about? Modern processors are almost universally constructed around multiple layers of predictive memory caches. These are small areas of memory that can be accessed much faster than the main system memory, and they’re needed because processors can execute instructions far more quickly than they can fetch values from the DRAM used for main memory. In fact, you can usually run hundreds of instructions in the time it takes to bring one byte from DRAM. This means that if processors executed everything directly from system memory, they would run hundreds of times more slowly than they otherwise could. For decades, the solution to this mismatch has been predictive caches. It’s possible to build memory that’s much faster to access than DRAM, but for power and area reasons it’s not easy to fit large amounts onto a chip. In modern systems you might have gigabytes of DRAM, but only single-digit megabytes of total cache. There are some great papers like What Every Programmer Should Know About Memory that go into a lot more detail about the overall approach, but the most important thing to know is that data stored in these caches can be accessed in a handful of cycles instead of hundreds, so moving data into these areas is crucial if you want to run your programs faster.
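
To make the “hundreds of instructions” figure concrete, here’s a back-of-the-envelope calculation using illustrative round numbers rather than any particular chip’s specs: a DRAM access that takes around 100 nanoseconds costs roughly 300 cycles on a 3 GHz core, and a core that can issue several instructions per cycle could have executed on the order of a thousand instructions in that window if only it had the data it needed.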

How do we decide what data should be placed in these caches though? This requires us to predict what memory locations will be accessed hundreds or thousands of cycles in the future, and for general programs with a lot of data-dependent branches, comparisons, and complex address calculations this isn’t possible to do with complete accuracy. Instead, the caches use heuristics (like “we just accessed address N, so also fetch N+1, N+2, and so on, in case we’re iterating through an array”) to guess how to populate these small, fast areas of memory. The cost of making a mistake is still hundreds of cycles, but as long as most of the accesses are predicted correctly this works pretty well in practice. However, there is still an underlying tension between the model used for programming languages, where memory is treated as a uniform arena, and the reality of hardware, where data lives in multiple different places with very different characteristics. I never thought I’d be linking to a Hacker News comment (the community has enough toxic members that I haven’t read it for years), but this post I was pointed to actually does a good job of talking about all the complexities that are introduced to make processors appear as if they’re working with uniform memory.
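
To see these heuristics at work, here’s a minimal C sketch (the array size, shuffle, and timing approach are arbitrary choices of mine, not drawn from any particular chip). It sums the same array twice: once in order, where a next-line prefetcher can guess every future address, and once through a shuffled index table, where it can’t. On most machines the second loop is several times slower even though it performs the same arithmetic.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)  /* 16M ints, far larger than any cache level */

int main(void) {
    int *data = malloc(N * sizeof(int));
    int *order = malloc(N * sizeof(int));
    for (int i = 0; i < N; i++) { data[i] = i & 0xff; order[i] = i; }

    /* Fisher-Yates shuffle, so the second pass visits addresses in a
       pattern the hardware prefetcher can't anticipate. */
    srand(42);
    for (int i = N - 1; i > 0; i--) {
        int j = rand() % (i + 1);
        int tmp = order[i]; order[i] = order[j]; order[j] = tmp;
    }

    long long sum = 0;
    clock_t t0 = clock();
    for (int i = 0; i < N; i++) sum += data[i];         /* sequential: prefetch-friendly */
    clock_t t1 = clock();
    for (int i = 0; i < N; i++) sum += data[order[i]];  /* shuffled: mostly cache misses */
    clock_t t2 = clock();

    printf("sequential: %.3fs  shuffled: %.3fs  (sum=%lld)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, sum);
    free(data);
    free(order);
    return 0;
}
```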

Why does all this matter for machine learning? The fundamental problem predictive caches are trying to solve is “What data needs to be prefetched into fast memory from DRAM?”. For most computing workloads, like rendering HTML pages or dealing with network traffic, the answer to this question is highly dependent on the input data to the algorithm. The code is full of lines like `if (a[i] == 10) { value = b[j]; } else { value = b[k]; }`, so predicting which addresses will be accessed requires advance knowledge of i, a[i], j, and k, at least. As more of these data-dependent conditionals accumulate, the permutations of possible access addresses become unmanageable, and it’s effectively impossible to predict addresses for code like this without accessing the data itself. Since the problem we’re trying to solve is that we can’t access the underlying data efficiently without a cache, we end up having to rely on heuristics instead.
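
Pointer chasing is the extreme version of this problem. In the sketch below (the types and names are just for illustration), each node’s address is only discovered by loading the node before it, so there’s no way to know the access sequence in advance without walking the data itself:

```c
/* A classic unpredictable access pattern: the address of the next node
   only becomes known once the current node has been fetched from memory,
   so every step of the traversal is a potential cache miss. */
struct node {
    int value;
    struct node *next;
};

int list_sum(const struct node *head) {
    int sum = 0;
    for (const struct node *p = head; p != NULL; p = p->next) {
        sum += p->value;
    }
    return sum;
}
```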

Machine learning operations are very different. The layers that take up the majority of the time for most models tend to be based on operations like convolutions, which can be expressed as matrix multiplies. Crucially, the memory access patterns don’t depend on the input data. There’s no `if` code in the inner loops of these kernels; they’re much simpler. The sizes of the inputs are also usually known ahead of time. These properties mean that we know exactly what data we’ll need in fast memory for the entire execution of the layer, ahead of time, with no dependencies on the values in that data. Each layer can often take hundreds of thousands of arithmetic operations to compute, and each value fetched has the potential to be used in multiple instructions, so making good use of the small amounts of fast memory available is crucial to reducing latency. What quickly becomes frustrating to any programmer trying to optimize these algorithms on conventional processors is that it’s very hard to transfer our complete knowledge of future access patterns into compiled code.
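
A plain matrix multiply shows what that looks like in practice. In this sketch (the row-major layout and names are mine, just for illustration), every address touched is a pure function of the loop counters and the matrix dimensions, all of which are known before the first iteration runs, so the complete sequence of accesses could in principle be written down at compile time:

```c
/* Naive C = A * B for row-major matrices: A is MxK, B is KxN, C is MxN.
   The addresses of A[i*K + k], B[k*N + j], and C[i*N + j] depend only on
   the loop counters and the known dimensions, never on the data values. */
void matmul(const float *A, const float *B, float *C, int M, int K, int N) {
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            float acc = 0.0f;
            for (int k = 0; k < K; k++) {
                acc += A[i * K + k] * B[k * N + j];
            }
            C[i * N + j] = acc;
        }
    }
}
```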

The caches rely almost entirely on heuristics that were designed for conventional usage patterns, so we essentially have to reverse-engineer those heuristics to persuade them to load the data we know we’ll need. There are some tools to help, like prefetch instructions and branch hints, but optimizing inner loops often feels like a struggle against a system that thinks it’s being helpful while actually getting in the way. Optimized matrix multiplication implementations usually require us to gather the needed data into tiles that are a good fit for the fast memory available, so we can do as much as possible with the values while they’re quickly accessible. Getting these tiles the right size and ensuring they’re populated with the correct data before it’s needed requires in-depth knowledge of the capacity, access latencies, and predictive algorithms of every level of the cache hierarchy on a particular processor. An implementation that works well on one chip may produce drastically poorer performance on another in the same family if any of those characteristics change.
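
Tiling is easiest to see in code. The sketch below is deliberately simplified (the 64-element block size is an arbitrary stand-in, and it assumes the dimensions divide evenly and that C starts zeroed; a real implementation would tune the tile size to a specific cache hierarchy, handle ragged edges, and vectorize the inner loop). The idea is to reorder the loops so that small blocks of A, B, and C are reused many times while they’re still resident in fast memory:

```c
#define BLOCK 64  /* placeholder tile size; would be tuned to a specific cache hierarchy */

/* Blocked matrix multiply, accumulating into C (assumed zero-initialized).
   For brevity this assumes M, K, and N are all multiples of BLOCK. */
void matmul_tiled(const float *A, const float *B, float *C, int M, int K, int N) {
    for (int i0 = 0; i0 < M; i0 += BLOCK) {
        for (int j0 = 0; j0 < N; j0 += BLOCK) {
            for (int k0 = 0; k0 < K; k0 += BLOCK) {
                /* Work on one BLOCK x BLOCK tile of C: the same small slices
                   of A and B are touched repeatedly, so ideally they stay in
                   cache for the whole inner computation. */
                for (int i = i0; i < i0 + BLOCK; i++) {
                    for (int j = j0; j < j0 + BLOCK; j++) {
                        float acc = C[i * N + j];
                        for (int k = k0; k < k0 + BLOCK; k++) {
                            acc += A[i * K + k] * B[k * N + j];
                        }
                        C[i * N + j] = acc;
                    }
                }
            }
        }
    }
}
```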

It would make more sense to expose the small, fast memories to the programmer directly, instead of relying on opaque heuristics to populate them. They could be made available as separate address spaces that can be explicitly preloaded with data before it’s needed. We know what address ranges we’ll want and when, so give us a way to use that knowledge to provide perfect predictions for filling those areas of memory. Some embedded chips do offer this capability, known variously as tightly-coupled memory or XY memory, and we use it to improve performance for TensorFlow Lite Micro on platforms that support it.
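
To show what that programming model can look like, here’s a hypothetical sketch of double-buffered tile loading into an explicitly managed scratchpad. The scratch_load_async and scratch_wait functions are invented for illustration (real chips expose this sort of thing through vendor-specific DMA or tightly-coupled-memory APIs, and the details vary widely); the point is that the programmer states exactly which address range is needed next, so no prediction is involved:

```c
#include <stddef.h>

/* Hypothetical API, invented for this sketch: scratch_load_async() starts a
   DMA copy from DRAM into on-chip scratchpad memory and returns a handle,
   and scratch_wait() blocks until that copy has finished. */
typedef int dma_handle_t;
extern dma_handle_t scratch_load_async(void *dst, const void *src, size_t bytes);
extern void scratch_wait(dma_handle_t handle);
extern void compute_on_tile(float *tile);  /* placeholder for the actual kernel */

void process_tiles(const float *dram_input, float *scratch_a, float *scratch_b,
                   int num_tiles, size_t tile_bytes) {
    float *bufs[2] = { scratch_a, scratch_b };
    /* We know the address range of every tile up front, so preload the first
       one before any compute starts. */
    dma_handle_t pending = scratch_load_async(bufs[0], dram_input, tile_bytes);
    for (int t = 0; t < num_tiles; t++) {
        scratch_wait(pending);
        /* Start copying the *next* tile into the other buffer while we work
           on this one: an explicit, perfect "prefetch". */
        if (t + 1 < num_tiles) {
            const char *next_src = (const char *)dram_input + (size_t)(t + 1) * tile_bytes;
            pending = scratch_load_async(bufs[(t + 1) % 2], next_src, tile_bytes);
        }
        compute_on_tile(bufs[t % 2]);
    }
}
```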

There are lots of challenges to making this available more widely though. Modern desktop and mobile apps don’t have the luxury of targeting a single hardware platform, and are expected to run across a wide variety of different chips within the same processor family. It would be very difficult to write efficient code that works for all of those combinations of cache size, speed, and prefetch heuristics. Software libraries from the processor manufacturers themselves (like cuDNN or Intel’s MKL) are usually the best answer right now, since they are written by engineers with detailed knowledge of the hardware and are updated to handle new releases. These still have to work around the underlying challenge of a programming model that tries to hide the cache hierarchy, though, and every engineer I’ve talked to who has worked on these inner loops wishes they had a better way to take advantage of their knowledge of memory access patterns.

This is also the kind of radical workload difference that has inspired a lot of new NPU hardware aimed specifically at deep learning. From my perspective, these have also been hard to work with, because while their programming models may work better for core operations like convolutions, models also require layers like non-max suppression that are only efficiently expressed as procedural code with data-dependent branches. Without the ability to run this kind of general-purpose code, accelerators lose many of their advantages, because they have to keep handing work back to the main CPU at a high latency cost (partly because this kind of handover usually involves flushing all the caches to keep the different memory areas in sync).
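
Non-max suppression is a good example of why. Here’s a heavily simplified one-dimensional sketch (real implementations work on 2-D boxes, sort by score, and handle many more details): which candidates survive, and which branches are taken, depends entirely on the scores and overlaps in the input, which is exactly the kind of data-dependent control flow that fixed-function matrix hardware struggles with.

```c
/* Intersection-over-union of two 1-D intervals [a1, a2] and [b1, b2]. */
static float overlap_1d(float a1, float a2, float b1, float b2) {
    float inter = (a2 < b2 ? a2 : b2) - (a1 > b1 ? a1 : b1);
    if (inter < 0.0f) inter = 0.0f;
    float uni = (a2 - a1) + (b2 - b1) - inter;
    return uni > 0.0f ? inter / uni : 0.0f;
}

/* Greedy non-max suppression over n intervals, assumed already sorted by
   descending score. Writes the indices of the survivors into keep[] and
   returns how many there are. Every comparison here depends on the values
   in the input data, not just on its shape. */
int nms_1d(const float *x1, const float *x2, int n, float iou_threshold, int *keep) {
    int kept = 0;
    for (int i = 0; i < n; i++) {
        int suppressed = 0;
        for (int k = 0; k < kept; k++) {
            if (overlap_1d(x1[i], x2[i], x1[keep[k]], x2[keep[k]]) > iou_threshold) {
                suppressed = 1;  /* data-dependent branch */
                break;
            }
        }
        if (!suppressed) keep[kept++] = i;
    }
    return kept;
}
```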

I don’t know what the ultimate solution will look like, but I’d imagine it will either involve system programmers being able to populate parts of caches using explicit prefetching, maybe even just supplying a set of address ranges as requirements and relying on the processor to sort it out, or something more extreme. One possible idea is making matrix multiplies first-class instructions at the machine code level, and having each processor implement the optimal strategy in microcode, in a similar way to how floating-point operations have migrated from accelerators, to co-processors, and now to the core CPU. Whatever the future holds, I hope this post at least helps explain why conventional predictive caches are so unhelpful when trying to optimize machine learning operations.
