Debugging PyTorch memory use with snapshots
Aug 16, 2022
In a previous post, I gave a detailed guide to how the PyTorch CUDA caching allocator hands out memory. To understand how the allocator behaves in your own training run or application, we have also developed new tools that record and visualize the state of allocated memory by generating memory snapshots. These snapshots record the stack traces where memory was allocated and where it lives inside the state of the caching allocator. By visualizing these snapshots as flamegraphs, we can see where memory is being used at a glance.
Generating snapshots
First, we have to enable the recording of stack frame information for each allocation:
import torch
torch.cuda.memory._record_memory_history(True)
Recording these stack traces is pretty fast (~1 us per allocation; a normal PyTorch kernel call takes at least 8 us), but we leave it off by default. Once it is enabled, we can allocate some memory and take a snapshot:
from torchvision.models import resnet18
from pprint import pprint

model = resnet18().cuda()
input = torch.rand(1, 3, 224, 224, device='cuda')
model.train()
output = model(input)
snapshot = torch.cuda.memory._snapshot()
pprint(snapshot['segments'])
The snapshot records the entire allocator state and looks like:
[{'active_size': 19398656,
  'address': 139896043864064,
  'allocated_size': 19398656,
  'blocks': [{'history': [{'addr': 139896043864064,
                           'frames': [{'filename': '/raid/zdevito/pytorch/torch/nn/modules/module.py',
                                       'line': 745,
                                       'name': '<lambda>'},
                                      {'filename': '/raid/zdevito/pytorch/torch/nn/modules/module.py',
                                       'line': 657,
                                       'name': '_apply'},
                                      ...],
                           'real_size': 1179648}],
              'size': 1179648,
              'state': 'active_allocated'},
             ...],
  'device': 0,
  'segment_type': 'large',
  'stream': 0,
  'total_size': 20971520},
 ...]
The snapshot is a list of Segment dictionaries with this structure:
from __future__ import annotations
from typing import TypedDict, List

class Segment(TypedDict):
    address: int
    total_size: int      # cudaMalloc'd size of segment
    stream: int
    segment_type: str    # 'large' (>1MB) or 'small'
    allocated_size: int  # size of memory in use
    active_size: int     # size of memory in use or in active_awaiting_free state
    blocks: List[Block]

class Block(TypedDict):
    size: int
    state: str  # 'active_allocated': used by a tensor
                # 'active_awaiting_free': we are waiting for another stream to finish
                #     using this, then it will become free
                # 'inactive': free for reuse
    history: List[History]

class History(TypedDict):
    addr: int
    frames: List[Frame]  # stack trace when address was last allocated,
                         # most recent frame first
    real_size: int       # unrounded size requested from the allocator

class Frame(TypedDict):
    filename: str
    line: int
    name: str

class Snapshot(TypedDict):
    segments: List[Segment]

snapshot: Snapshot = torch.cuda.memory._snapshot()
Segments are the memory directly requested from cudaMalloc and cached by the allocator. Because we might only be using part of a segment, the caching allocator splits each segment into one or more Blocks. All of a block is always in the same allocation state. With _record_memory_history enabled, each block also records History objects that remember the last allocation placed in that block, including its stack trace as a list of Frames. For active_allocated blocks, there will be a single History entry describing what is currently allocated in the block. For inactive blocks, there may be multiple entries recording the last things that lived in the block's memory. There can be more than one because the allocator merges split blocks when they become free, and it records the history of both halves. To avoid recording a huge amount of history, we only keep history for ranges of the block that do not overlap with anything newer.
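Because the snapshot is just nested dictionaries and lists, it is easy to work with directly. As a minimal sketch (using the snapshot object from the code above), here is how to tally bytes by block state:

from collections import defaultdict

# Tally how many bytes are in each block state across all segments.
bytes_by_state = defaultdict(int)
for seg in snapshot['segments']:
    for block in seg['blocks']:
        bytes_by_state[block['state']] += block['size']

for state, nbytes in sorted(bytes_by_state.items()):
    print(f'{state}: {nbytes / 2**20:.1f} MiB')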
Saving snapshots
The snapshots are designed to be pickled so that they can be viewed offline later:
from pickle import dump
dump(snapshot, open('snapshot.pickle', 'wb'))
The file _memory_viz.py can be used directly as a command-line tool to work with saved snapshots:
$ wget https://raw.githubusercontent.com/pytorch/pytorch/master/torch/cuda/_memory_viz.py
$ python _memory_viz.py stats snapshot.pickle
{'active_allocated': 65.9MiB, 'inactive': 40.1MiB, 'segments': 14, 'total_size': 106.0MiB}
Visualizing snapshots
The _memory_viz.py tool can also generate flamegraph visualizations of memory:
$ python _memory_viz.py memory snapshot.pickle -o memory.svg
This produces a visualization that segments all the bytes in the allocator into different classes (click for an interactive view):
Flamegraph visualizations are a way of breaking down how a resource (such as memory) is used into different categories, which can then be further broken down into even more fine-grained categories.
The size of segments along the x-axis corresponds to the number of bytes in that segment, while the depth along the y-axis shows the different categories into which we classify how the memory is used. Everything in the tower above the active_allocated category on the left is currently allocated, while the tower above the inactive category holds memory that is cached in the allocator but not in use. Above these broad categories, we further categorize the memory by the stack frames that were active when it was allocated. In this visualization, we can see one stack pattern with lots of _apply frames that holds the model parameters (which were created in the .cuda() call, which calls _apply). The other stack pattern in the active_allocated class includes resnet.py:285:forward and represents the activations saved for the backward pass. A similar tower exists in the inactive category, containing the blocks that were used for temporaries in the forward pass.
The memory view gives a good overview of how memory is being used. For debugging allocator issues in particular, though, it is useful to first categorize memory into individual Segment objects, which are the individual cudaMalloc segments that the allocator tracks:
$ python _memory_viz.py segments snapshot.pickle -o segments.svg
This view has a separate tower for each segment (see the seg_<N> categories), and lets you see how individual allocations get packed into the segments. The interactive view lets you zoom in on individual segments by clicking on them.
Comparing snapshots
The visualizer can also generate a visualization that shows the segments that have been added and removed between two different snapshots. For instance, we can re-run the model with a larger input, and see how the allocator requested more memory for the larger temporaries:
input8 = torch.rand(8, 3, 224, 224, device='cuda')
output = model(input8)
snapshot = torch.cuda.memory._snapshot()
dump(snapshot, open('snapshot2.pickle', 'wb'))
$ python _memory_viz.py compare snapshot.pickle snapshot2.pickle -o compare.svg
only_before = []
only_after = [140636932014080, 140636827156480, 140634912456704, 140634839056384,
              140634843250688, 140634841153536, 140634866319360, 140634811793408,
              140634845347840, 140636806184960, 140636778921984, 140634878902272]

The comparison view shows just the new segments, which can help figure out what code paths prompted more memory to be allocated.
Custom analysis
If these visualizations are not sufficient, the snapshot format and _memory_viz.py are simple enough that they can be modified to produce custom views, filter out noise like small allocations, or compute other aggregate statistics.
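For example, here is a minimal sketch of such a custom analysis (the grouping key and the top-10 cutoff are arbitrary choices, not part of _memory_viz.py): it ranks allocation sites by actively allocated bytes, keyed by the innermost Python frame recorded for each block:

from collections import defaultdict
from pickle import load

snapshot = load(open('snapshot.pickle', 'rb'))

# Sum the requested (unrounded) size of each active block, grouped by the
# innermost Python frame of its most recent History entry.
bytes_by_site = defaultdict(int)
for seg in snapshot['segments']:
    for block in seg['blocks']:
        if block['state'] != 'active_allocated' or not block.get('history'):
            continue
        hist = block['history'][0]  # active blocks have a single History entry
        frames = hist['frames']
        # frames can be empty, e.g. for tensors created by the C++ autograd engine
        site = '{filename}:{line}:{name}'.format(**frames[0]) if frames else '<non-python>'
        bytes_by_site[site] += hist['real_size']

for site, nbytes in sorted(bytes_by_site.items(), key=lambda kv: -kv[1])[:10]:
    print(f'{nbytes / 2**20:8.1f} MiB  {site}')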
Generating Snapshots when Out of Memory
When debugging how your program runs out of memory, one helpful time to generate a snapshot is during the OutOfMemory exception itself. We can do that today by monkey-patching OutOfMemoryError like so:
def oom_init(self, *args, **kwargs):
    super(torch.cuda.OutOfMemoryError, self).__init__(*args, **kwargs)
    # snapshot right after an OOM happened
    print('saving allocated state during OOM')
    snapshot = torch.cuda.memory._snapshot()
    dump(snapshot, open('oom_snapshot.pickle', 'wb'))

torch.cuda.OutOfMemoryError.__init__ = oom_init
Whenever an OutOfMemoryError is created, we will produce a dump of the state of the allocator.
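To check that the hook fires, you can deliberately trigger an out-of-memory error; the allocation size below is just an arbitrarily large value chosen to exceed any GPU:

try:
    # Deliberately over-allocate: 2**40 float32 elements is ~4 TiB.
    too_big = torch.empty(2**40, device='cuda')
except torch.cuda.OutOfMemoryError:
    pass  # the patched __init__ above has already written oom_snapshot.pickle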
Understanding terminology
The visualization uses a number of terms that are useful to understand:
stream_<N> - the stream that is associated with the memory. The allocator keeps a separate cache for each stream.
active_allocated - memory that is in use.
inactive - memory that is considered “reserved” by the allocator but is available for new allocations.
active_awaiting_free - on the CPU we have seen the last use of this memory, but we are waiting on another CUDA stream to finish using it (this happens due to t.record_stream(s) calls; see the sketch after this list).
<non-python> - we didn’t capture any Python stack for when this allocation was created. Most likely this was a tensor created by our C++ autograd engine.
<eval_with_key>.forward - these are stack frames for code generated with torch.fx, which places generated code in a synthetic file named <eval_with_key>.
<empty> - these are parts of segments that are inactive but do not have any history associated with them (either they have never been allocated yet, or we removed the history because it overlapped with a newer allocation).
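As a minimal sketch (not from the original tooling) of how memory ends up in the active_awaiting_free state:

import torch

side = torch.cuda.Stream()
t = torch.empty(1024, device='cuda')  # allocated on the default stream
with torch.cuda.stream(side):
    t.mul_(2)          # use t on the side stream
t.record_stream(side)  # tell the allocator t is in use on `side`
del t                  # the CPU is done with t, but the allocator must wait
                       # for `side` to finish; a snapshot taken in that window
                       # can show the block as 'active_awaiting_free'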