# Add pytorch_cuda_alloc_conf config to tune VRAM memory allocation (#7673)
## Summary
This PR adds a `pytorch_cuda_alloc_conf` config flag to control the
torch memory allocator behavior.
- `pytorch_cuda_alloc_conf` defaults to `None`, preserving the current
behavior.
- The configuration options are explained here:
https://pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf.
Tuning this configuration can reduce peak reserved VRAM and improve
performance.
- Setting `pytorch_cuda_alloc_conf: "backend:cudaMallocAsync"` in
`invokeai.yaml` is expected to work well on many systems. This is a good
first step for those looking to tune this config. (We may make this the
default in the future.)
- The optimal configuration appears to depend on a number of factors (e.g. device type, VRAM, CUDA driver version), so for now users will have to experiment with this config to see whether it helps or hurts on their systems. In most cases, I expect it to help. (A sketch of how the setting reaches PyTorch follows this list.)
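For context, PyTorch reads this allocator configuration from the `PYTORCH_CUDA_ALLOC_CONF` environment variable, which has to be in place before CUDA is initialized (in practice, early in process startup, before anything touches the GPU). The PR's actual wiring isn't shown in this excerpt, so the snippet below is only a minimal sketch of how such a config value could be applied; the `apply_cuda_alloc_conf` helper is hypothetical.

```python
import os
from typing import Optional


def apply_cuda_alloc_conf(pytorch_cuda_alloc_conf: Optional[str]) -> None:
    """Hypothetical helper: export the allocator config before CUDA is initialized.

    PyTorch reads PYTORCH_CUDA_ALLOC_CONF when it sets up the CUDA caching
    allocator, so this must run early in process startup, before any GPU work.
    """
    if pytorch_cuda_alloc_conf:
        os.environ["PYTORCH_CUDA_ALLOC_CONF"] = pytorch_cuda_alloc_conf


# e.g. with the value suggested in this PR:
apply_cuda_alloc_conf("backend:cudaMallocAsync")
```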
### Memory Tests
```
VAE decode memory usage comparison:
- SDXL, fp16, 1024x1024:
  - `cudaMallocAsync`: allocated=2593 MB, reserved=3200 MB
  - `native`:          allocated=2595 MB, reserved=4418 MB
- SDXL, fp32, 1024x1024:
  - `cudaMallocAsync`: allocated=3982 MB, reserved=5536 MB
  - `native`:          allocated=3982 MB, reserved=7276 MB
- SDXL, fp32, 1536x1536:
  - `cudaMallocAsync`: allocated=8643 MB, reserved=12032 MB
  - `native`:          allocated=8643 MB, reserved=15900 MB
```
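For reference, allocated/reserved pairs like those above can be collected with PyTorch's built-in memory statistics. The sketch below is a generic measurement harness under that assumption, not the actual script used for this table; the `vae.decode` call in the usage comment is hypothetical.

```python
import torch


def report_peak_vram(fn, *args, **kwargs):
    """Run `fn` once and print peak allocated vs. reserved VRAM in MB."""
    torch.cuda.reset_peak_memory_stats()
    result = fn(*args, **kwargs)
    torch.cuda.synchronize()
    allocated_mb = torch.cuda.max_memory_allocated() / (1024**2)
    reserved_mb = torch.cuda.max_memory_reserved() / (1024**2)
    print(f"allocated={allocated_mb:.0f} MB, reserved={reserved_mb:.0f} MB")
    return result


# Usage (hypothetical workload):
# report_peak_vram(vae.decode, latents)
```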
## Related Issues / Discussions
N/A
## QA Instructions
- [x] Performance tests with `pytorch_cuda_alloc_conf` unset.
- [x] Performance tests with `pytorch_cuda_alloc_conf:
"backend:cudaMallocAsync"`.
## Merge Plan
- [x] Merge #7668 first and change target branch to `main`
## Checklist
- [x] _The PR has a short but descriptive title, suitable for a
changelog_
- [x] _Tests added / updated (if applicable)_
- [x] _Documentation added / updated (if applicable)_
- [ ] _Updated `What's New` copy (if doing a release after this PR)_
## Files changed

### docs/features/low-vram.md (11 additions, 0 deletions)
````diff
@@ -31,6 +31,7 @@ It is possible to fine-tune the settings for best performance or if you still ge
 Low-VRAM mode involves 4 features, each of which can be configured or fine-tuned:
 
 - Partial model loading (`enable_partial_loading`)
+- PyTorch CUDA allocator config (`pytorch_cuda_alloc_conf`)
 - Dynamic RAM and VRAM cache sizes (`max_cache_ram_gb`, `max_cache_vram_gb`)
 - Working memory (`device_working_mem_gb`)
 - Keeping a RAM weight copy (`keep_ram_copy_of_weights`)
@@ -51,6 +52,16 @@ As described above, you can enable partial model loading by adding this line to
 enable_partial_loading: true
 ```
 
+### PyTorch CUDA allocator config
+
+The PyTorch CUDA allocator's behavior can be configured using the `pytorch_cuda_alloc_conf` config. Tuning the allocator configuration can help to reduce the peak reserved VRAM. The optimal configuration is dependent on many factors (e.g. device type, VRAM, CUDA driver version, etc.), but switching from PyTorch's native allocator to using CUDA's built-in allocator works well on many systems. To try this, add the following line to your `invokeai.yaml` file:
+
+```yaml
+pytorch_cuda_alloc_conf: "backend:cudaMallocAsync"
+```
+
+A more complete explanation of the available configuration options is [here](https://pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf).
+
 ### Dynamic RAM and VRAM cache sizes
 
 Loading models from disk is slow and can be a major bottleneck for performance. Invoke uses two model caches - RAM and VRAM - to reduce loading from disk to a minimum.
````
### invokeai/app/services/config/config_default.py (4 additions, 0 deletions)
```diff
@@ -91,6 +91,7 @@ class InvokeAIAppConfig(BaseSettings):
         ram: DEPRECATED: This setting is no longer used. It has been replaced by `max_cache_ram_gb`, but most users will not need to use this config since automatic cache size limits should work well in most cases. This config setting will be removed once the new model cache behavior is stable.
         vram: DEPRECATED: This setting is no longer used. It has been replaced by `max_cache_vram_gb`, but most users will not need to use this config since automatic cache size limits should work well in most cases. This config setting will be removed once the new model cache behavior is stable.
         lazy_offload: DEPRECATED: This setting is no longer used. Lazy-offloading is enabled by default. This config setting will be removed once the new model cache behavior is stable.
+        pytorch_cuda_alloc_conf: Configure the Torch CUDA memory allocator. This will impact peak reserved VRAM usage and performance. Setting to "backend:cudaMallocAsync" works well on many systems. The optimal configuration is highly dependent on the system configuration (device type, VRAM, CUDA driver version, etc.), so must be tuned experimentally.
         device: Preferred execution device. `auto` will choose the device depending on the hardware platform and the installed torch capabilities.<br>Valid values: `auto`, `cpu`, `cuda`, `cuda:1`, `mps`
         precision: Floating point precision. `float16` will consume half the memory of `float32` but produce slightly lower-quality images. The `auto` setting will guess the proper precision based on your video card and operating system.<br>Valid values: `auto`, `float16`, `bfloat16`, `float32`
         sequential_guidance: Whether to calculate guidance in serial instead of in parallel, lowering memory requirements.
@@ -169,6 +170,9 @@ class InvokeAIAppConfig(BaseSettings):
     vram: Optional[float] = Field(default=None, ge=0, description="DEPRECATED: This setting is no longer used. It has been replaced by `max_cache_vram_gb`, but most users will not need to use this config since automatic cache size limits should work well in most cases. This config setting will be removed once the new model cache behavior is stable.")
     lazy_offload: bool = Field(default=True, description="DEPRECATED: This setting is no longer used. Lazy-offloading is enabled by default. This config setting will be removed once the new model cache behavior is stable.")
 
+    # PyTorch Memory Allocator
+    pytorch_cuda_alloc_conf: Optional[str] = Field(default=None, description="Configure the Torch CUDA memory allocator. This will impact peak reserved VRAM usage and performance. Setting to \"backend:cudaMallocAsync\" works well on many systems. The optimal configuration is highly dependent on the system configuration (device type, VRAM, CUDA driver version, etc.), so must be tuned experimentally.")
+
     # DEVICE
     device: DEVICE = Field(default="auto", description="Preferred execution device. `auto` will choose the device depending on the hardware platform and the installed torch capabilities.")
     precision: PRECISION = Field(default="auto", description="Floating point precision. `float16` will consume half the memory of `float32` but produce slightly lower-quality images. The `auto` setting will guess the proper precision based on your video card and operating system.")
```
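Because the new field is a plain `Optional[str]` with `default=None`, the setting is pass-through: pydantic does not validate the allocator string at config-parse time, and malformed values would only surface when PyTorch itself parses `PYTORCH_CUDA_ALLOC_CONF`. The standalone model below is a hypothetical, trimmed-down stand-in (not the real `InvokeAIAppConfig`) illustrating that behavior:

```python
from typing import Optional

from pydantic import Field
from pydantic_settings import BaseSettings


class AllocConfDemo(BaseSettings):
    """Hypothetical stand-in mirroring only the new field."""

    pytorch_cuda_alloc_conf: Optional[str] = Field(
        default=None,
        description="Configure the Torch CUDA memory allocator.",
    )


# With nothing configured, the field stays None, so existing installs keep
# PyTorch's default allocator behavior.
print(AllocConfDemo().pytorch_cuda_alloc_conf)

# Any string is accepted as-is; PyTorch, not pydantic, interprets it later.
print(AllocConfDemo(pytorch_cuda_alloc_conf="backend:cudaMallocAsync").pytorch_cuda_alloc_conf)
```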