AutoLoRA:
Automatically Tuning Matrix Ranks in Low-Rank Adaptation Based on Meta Learning
Annual Conference of the North American Chapter of
the Association for Computational Linguistics (NAACL), 2024

Abstract

Large-scale pretraining followed by task-specific finetuning has achieved great success in various NLP tasks. Since finetuning all parameters of large pretrained models poses substantial computational and memory challenges, several efficient finetuning methods have been developed. Among them, low-rank adaptation (LoRA), which finetunes low-rank incremental update matrices on top of frozen pretrained weights, has proven particularly effective. Nonetheless, LoRA's uniform rank assignment across all layers, along with its reliance on an exhaustive search to find the best rank, leads to high computation costs and suboptimal finetuning performance. To address these limitations, we introduce AutoLoRA, a meta learning based framework for automatically identifying the optimal rank of each LoRA layer. AutoLoRA associates each rank-1 matrix in a low-rank update matrix with a selection variable, which determines whether the rank-1 matrix should be discarded. A meta learning based method is developed to learn these selection variables. The optimal rank is determined by thresholding the values of these variables. Our comprehensive experiments on natural language understanding, generation, and sequence labeling demonstrate the effectiveness of AutoLoRA.


Introduction

Large Language Models (LLMs) have demonstrated state-of-the-art performance across a variety of NLP tasks, spanning from Natural Language Understanding (NLU) to Natural Language Generation (NLG), a trajectory highlighted by the success of models like ChatGPT.

However, as models scale up, finetuning all model parameters becomes highly expensive in computation and memory.

To address this challenge, many efficient finetuning methods have been developed. Among them, LoRA proposes to add low-rank incremental update matrices to pretrained weight matrices. During finetuning, only the incremental matrices are trained while the pretrained ones are frozen. The low-rank parameterization significantly reduces the number of finetuning parameters.

However, the update matrices at different layers share the same rank in LoRA, without considering the varying properties across layers. Moreover, obtaining the optimal rank typically involves an extensive manual hyperparameter search, which is time-consuming and poses scalability issues.

AutoLoRA

In AutoLoRA, we aim to automatically determine the optimal rank for each LoRA layer, instead of manually specifying it as in LoRA. To achieve this goal, we associate each rank-1 matrix in an update matrix with a selection variable and reparameterize the update matrix as a weighted sum of rank-1 matrices. A meta learning based approach is developed to learn these selection variables. After learning, if the value of a selection variable is close to zero, its corresponding rank-1 matrix is removed. In this way, we can determine the optimal rank for each update matrix based on the selection variables.

Reparameterize Update Matrices

In LoRA, the update matrix \(\Delta_l\) at layer \(l\) is parameterized as the product \(U_lV_l\) of two low-rank matrices, which can equivalently be written as the summation of \(k_l\) rank-1 matrices: $$\Delta_l=\sum_{j=1}^{k_l}\Delta_l^j $$ where \(\Delta_l^j\) is the outer product between the \(j\)-th column of \(U_l\) and the \(j\)-th row of \(V_l\).
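The equivalence between the factored form \(U_lV_l\) and the sum of rank-1 outer products can be checked numerically. The sketch below is illustrative (the dimensions and random factors are assumptions, not from the paper):

```python
import numpy as np

# Sketch of the rank-1 decomposition of a LoRA update matrix.
# d_out, d_in, and the rank k_l here are arbitrary illustrative values.
rng = np.random.default_rng(0)
d_out, d_in, k_l = 6, 4, 3

U_l = rng.standard_normal((d_out, k_l))  # first low-rank factor
V_l = rng.standard_normal((k_l, d_in))   # second low-rank factor

# Delta_l = U_l @ V_l equals the sum of k_l rank-1 outer products,
# one per (column of U_l, row of V_l) pair.
delta_l = sum(np.outer(U_l[:, j], V_l[j, :]) for j in range(k_l))
assert np.allclose(delta_l, U_l @ V_l)
```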

In AutoLoRA, we associate each rank-1 matrix \(\Delta_l^j\) in the equation above with a selection variable \(\alpha^j_l\in[0,1]\) and reparameterize \(\Delta_l\) as a weighted sum of rank-1 matrices: $$ \Delta_l=\sum_{j=1}^{k_l}\alpha_l^j\Delta_l^j $$ \(\alpha^j_l\) can be interpreted as the importance of \(\Delta_l^j\).
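The weighted sum above has a compact matrix form: scaling the \(j\)-th column of \(U_l\) by \(\alpha_l^j\) before the product. A minimal sketch (the \(\alpha\) values are illustrative, not learned):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, k_l = 6, 4, 3
U_l = rng.standard_normal((d_out, k_l))
V_l = rng.standard_normal((k_l, d_in))

# Selection variables in [0, 1]; one per rank-1 component.
alpha_l = np.array([0.7, 0.05, 0.25])

# Weighted sum of rank-1 matrices: Delta_l = sum_j alpha_j * u_j v_j^T
delta_l = sum(alpha_l[j] * np.outer(U_l[:, j], V_l[j, :]) for j in range(k_l))

# Equivalent matrix form: scale the columns of U_l by alpha, then multiply.
assert np.allclose(delta_l, (U_l * alpha_l) @ V_l)
```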

Learn Selection Variables

Let \(A = \{\alpha^j_l \mid 1 \leq j \leq k_l, 1 \leq l \leq M\}\) denote all selection variables, where \(M\) is the number of layers in the pretrained model. Given the weight parameters \(W_l=\widetilde{W}_l+\Delta_l\) at layer \(l\), where \(\widetilde{W}_l\) is the frozen pretrained weight matrix, we propose a meta learning based approach to learn \(A\).

We first perform a one-step gradient descent update of \(W_l\) on the training dataset \(D_{tr}\): $$ \widehat{W}_l= W_l-\eta \nabla_{W_l}\mathcal{L}_{tr}(\{W_l\}_{l=1}^M,D_{tr}) $$ where \(\eta\) is a learning rate and \(\mathcal{L}_{tr}\) is the training loss.

Then we evaluate \(\{\widehat{W}_l\}_{l=1}^M\) on a validation dataset \(D_{val}\). The validation loss \(\mathcal{L}_{val}(\{\widehat{W}_l\}_{l=1}^M,D_{val})\) is a function of \(A\) since \(\mathcal{L}_{val}\) depends on \(\{\widehat{W}_l\}_{l=1}^M\) which depends on \(A\). We optimize \(A\) by minimizing the validation loss: $$ \textrm{min}_A\; \mathcal{L}_{val}(\{\widehat{W}_l\}_{l=1}^M,D_{val}) $$ We use an approximate gradient-based algorithm to solve this problem. The updates of \(W\) and \(A\) are iteratively performed until convergence.
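The alternating scheme can be sketched on a toy problem. The following is a simplified illustration, not the authors' implementation: the "model" is a linear map \(W_0 + \Delta\) fit to synthetic train/validation data, and the hypergradient \(\partial\mathcal{L}_{val}/\partial\alpha\) is approximated by finite differences, standing in for the approximate gradient-based algorithm of the paper. All dimensions, data, and hyperparameters are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, k = 5, 3, 2
X_tr, X_val = rng.standard_normal((20, d_in)), rng.standard_normal((10, d_in))
W_true = rng.standard_normal((d_in, d_out))
Y_tr, Y_val = X_tr @ W_true, X_val @ W_true

W0 = np.zeros((d_in, d_out))              # frozen "pretrained" weights
U = rng.standard_normal((d_in, k)) * 0.1  # trainable LoRA factors
V = rng.standard_normal((k, d_out)) * 0.1
alpha = np.full(k, 0.5)                   # selection variables in [0, 1]

def loss(U, V, alpha, X, Y):
    W = W0 + (U * alpha) @ V              # weighted low-rank update
    return np.mean((X @ W - Y) ** 2)

eta, eps = 0.01, 1e-4
loss_init = loss(U, V, alpha, X_tr, Y_tr)
for _ in range(200):
    # Inner step: one gradient descent update of the weights on D_tr.
    W = W0 + (U * alpha) @ V
    G = 2 * X_tr.T @ (X_tr @ W - Y_tr) / len(X_tr)  # dL_tr/dW
    gU = (G @ V.T) * alpha                          # chain rule to U
    gV = (U * alpha).T @ G                          # chain rule to V
    U -= eta * gU
    V -= eta * gV
    # Outer step: update alpha to reduce the validation loss,
    # using a central finite-difference hypergradient.
    g = np.zeros(k)
    for j in range(k):
        a_p, a_m = alpha.copy(), alpha.copy()
        a_p[j] += eps
        a_m[j] -= eps
        g[j] = (loss(U, V, a_p, X_val, Y_val)
                - loss(U, V, a_m, X_val, Y_val)) / (2 * eps)
    alpha = np.clip(alpha - eta * g, 0.0, 1.0)
```

The two updates are interleaved until convergence, mirroring the iterative optimization of \(W\) and \(A\) described above.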

Determine Matrix Rank

Given the optimally learned selection variables \(A^*\), we determine the rank of each update matrix based on \(A^*\). For each layer \(l\), we count the number of entries in \(\{\alpha_l^j\}_{j=1}^{k_l}\) that satisfy \(\alpha_l^j\geq\lambda\), where \(\lambda\) is a threshold set to \(1/k_l\). This count is taken as the optimal rank of \(\Delta_l\).
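The thresholding rule is straightforward; a minimal sketch (the \(\alpha\) values are made up for illustration):

```python
import numpy as np

def optimal_rank(alpha_l):
    """Count selection variables at or above the threshold lambda = 1/k_l."""
    alpha_l = np.asarray(alpha_l)
    k_l = len(alpha_l)
    lam = 1.0 / k_l
    return int(np.sum(alpha_l >= lam))

# k_l = 4, so lambda = 0.25: two entries survive the threshold.
print(optimal_rank([0.55, 0.30, 0.10, 0.05]))  # -> 2
```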

Algorithm

AutoLoRA alternates between updating the weight parameters on the training data and updating the selection variables on the validation data until convergence; the final rank of each layer is then obtained by thresholding the learned selection variables.

Results

We conduct extensive experiments on eight datasets from the General Language Understanding Evaluation (GLUE) benchmark to evaluate the performance of AutoLoRA on NLU tasks.

AutoLoRA achieves the best performance on 6 out of 8 datasets, and obtains an average performance of 85.5, outperforming all baseline methods. As AutoLoRA outperforms LoRA on average, we can conclude that the optimal ranks learned by AutoLoRA are better than the manually tuned ranks in LoRA.


The reasons are two-fold. First, AutoLoRA allows different layers to have distinct ranks, sufficiently accounting for the fact that different layers have varying properties and need to have layer-specific amounts of tunable parameters. In contrast, LoRA uniformly uses the same rank for all layers, without considering the difference across layers. Second, AutoLoRA learns the continuous selection variables (which determine the ranks) by maximizing the finetuning performance on validation data via gradient descent. The search space is continuous, which allows more comprehensive exploration of rank configurations.

Visualization

We present the optimal ranks determined by AutoLoRA for the QQP, MNLI, and E2E datasets.

[Figure: layer-wise optimal ranks learned by AutoLoRA on QQP, MNLI, and E2E]

As can be seen, the optimal ranks learned by AutoLoRA vary across layers. This aligns with the hypothesis discussed above that different layers need different matrix ranks.
