COAP: Memory-Efficient Training with Correlation-Aware Gradient Projection

¹ByteDance, ²Rutgers University
*Work done during internship at ByteDance

Comparison between COAP and other low-rank-based methods. The X-axis shows additional training time (lower is better). The Y-axis shows the change in quantitative metrics (e.g., FID, PPL) relative to the original optimizer (e.g., Adam, Adafactor), with higher values indicating better performance.

Abstract

Training large-scale neural networks in vision, language, and multimodal domains demands substantial memory resources, primarily due to the storage of optimizer states. While LoRA, a popular parameter-efficient method, reduces memory usage, it often suffers from suboptimal performance due to the constraints of low-rank updates. Low-rank gradient projection methods (e.g., GaLore, Flora) reduce optimizer memory by projecting gradients and moment estimates into low-rank spaces via singular value decomposition or random projection. However, they fail to account for inter-projection correlation, causing performance degradation, and their projection strategies often incur high computational costs. In this paper, we present COAP (COrrelation-Aware Gradient Projection), a memory-efficient method that minimizes computational overhead while maintaining training performance. Evaluated across various vision, language, and multimodal tasks, COAP outperforms existing methods in both training speed and model performance. For LLaMA-1B, it reduces optimizer memory by 61% with only 2% additional time cost, achieving the same PPL as AdamW. With 8-bit quantization, COAP cuts optimizer memory by 81% and achieves a 4x speedup over GaLore for LLaVA-v1.5-7B fine-tuning, while delivering higher accuracy.
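For readers unfamiliar with low-rank gradient projection, the sketch below illustrates the general idea referenced in the abstract: the gradient and the Adam moment estimates are kept in a rank-r subspace, so the optimizer state shrinks from m×n to r×n. This is a minimal sketch assuming a truncated-SVD projection in the style of GaLore; it does not implement COAP's correlation-aware update, and all names (`projected_adam_step`, `rank`, etc.) are illustrative, not from the paper.

```python
# Minimal sketch (NOT the authors' implementation) of low-rank gradient
# projection for an Adam-style optimizer. Moments live in a rank-r subspace,
# reducing optimizer memory from m*n to r*n per weight matrix.
import numpy as np

def projected_adam_step(W, G, state, rank=8, lr=1e-3,
                        beta1=0.9, beta2=0.999, eps=1e-8):
    """One optimizer step with the gradient projected into a rank-`rank` subspace."""
    if "P" not in state:
        # Build the projection matrix P (m x r) from the gradient's SVD once;
        # real methods periodically refresh it, which is omitted here.
        U, _, _ = np.linalg.svd(G, full_matrices=False)
        state["P"] = U[:, :rank]                      # top-r left singular vectors
        state["m"] = np.zeros((rank, G.shape[1]))     # first moment, low-rank
        state["v"] = np.zeros((rank, G.shape[1]))     # second moment, low-rank
        state["t"] = 0
    P = state["P"]

    g = P.T @ G                                       # project gradient: r x n
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * g
    state["v"] = beta2 * state["v"] + (1 - beta2) * g ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])    # bias-corrected moments
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    update = m_hat / (np.sqrt(v_hat) + eps)           # Adam update in low-rank space
    return W - lr * (P @ update)                      # project back to full size
```

In practice, methods of this family differ mainly in how the projection P is chosen and refreshed; COAP's contribution, per the abstract, is accounting for the correlation between successive projections while keeping that step computationally cheap.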

Qualitative Comparisons

Quantitative Comparisons

BibTeX

@article{xiao2024coap,
    title   = {COAP: Memory-Efficient Training with Correlation-Aware Gradient Projection},
    author  = {Xiao, Jinqi and Sang, Shen and Zhi, Tiancheng and Liu, Jing and Yan, Qing and Luo, Linjie and Yuan, Bo},
    journal = {arXiv preprint arXiv:2412.00071},
    year    = {2024},
    url     = {https://arxiv.org/abs/2412.00071}
}