
LoRA, VeRA, Delta-LoRA... AI Researchers Need to Chill with the Acronyms

AI/ML · November 25, 2024 · 9 min read

Alright, so we've got these massive language models that can basically do anything, right? Problem is, fine-tuning the entire thing for your specific use case is computationally insane. Like, "sell your kidney to afford GPU time" expensive.

Enter Parameter-Efficient Fine-Tuning (PEFT), which is basically the art of being lazy in a smart way. Instead of training billions of parameters, you freeze most of the model and only train a tiny fraction of new ones.

But AI researchers, being AI researchers, couldn't just stick with one approach. Nope, we needed an entire alphabet soup of methods:

LoRA (Low-Rank Adaptation)

The OG of the bunch. You take your massive weight matrix W and instead of updating it directly, you add a small update matrix that's the product of two smaller matrices A and B. The "rank" r is way smaller than the original dimension, so you're training way fewer parameters.

Think of it like this: instead of rewriting an entire book, you're just adding footnotes.
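
To make the footnotes thing concrete, here's a minimal sketch of a LoRA-style linear layer in PyTorch (my own illustrative code, not the paper's or any library's implementation): the base weight W is frozen, and only the small A and B matrices pick up gradients.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: frozen base weight W plus a trainable
    low-rank update scaled by alpha / r."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)   # freeze W
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # down-projection, random init
        self.B = nn.Parameter(torch.zeros(d_out, r))        # up-projection, zero init so the update starts at 0
        self.scale = alpha / r

    def forward(self, x):
        # W x + scale * B (A x)  --  only A and B receive gradients
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```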

LoRA-FA

Basically LoRA, but with a Frozen A: the down-projection matrix A stays at its random initialization and only B gets trained, which saves the activation memory you'd otherwise spend computing A's gradients. Because why keep things simple?
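
Assuming the hypothetical LoRALinear sketch from the LoRA section, the "FA" part is basically one extra line:

```python
import torch.nn as nn

# Reusing the illustrative LoRALinear class sketched above
layer = LoRALinear(nn.Linear(4096, 4096), r=8)
layer.A.requires_grad_(False)  # LoRA-FA: A stays at its random init, only B trains
```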

VeRA (Vector-based Random Matrix Adaptation)

This one's interesting. Instead of training separate A and B matrices for each layer, VeRA freezes a single pair of random low-rank matrices shared across all layers and only trains small per-layer scaling vectors. It's like LoRA with aggressive parameter sharing: the frozen randomness does the heavy lifting, the tiny vectors do the adapting.
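
Here's a rough sketch of that idea (again my own illustrative code, with initialization choices simplified): the big random matrices are frozen and shared, and each layer only trains two small vectors.

```python
import torch
import torch.nn as nn

class VeRALinear(nn.Module):
    """VeRA-style sketch: random matrices A and B are frozen and shared
    across layers; only the per-layer vectors d and b are trained."""

    def __init__(self, base: nn.Linear, shared_A: torch.Tensor, shared_B: torch.Tensor):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)        # freeze W
        self.register_buffer("A", shared_A)           # (r, d_in), frozen, shared
        self.register_buffer("B", shared_B)           # (d_out, r), frozen, shared
        r = shared_A.shape[0]
        d_out = shared_B.shape[0]
        self.d = nn.Parameter(torch.full((r,), 0.1))  # trainable scaling over the rank dim
        self.b = nn.Parameter(torch.zeros(d_out))     # trainable scaling over the output dim

    def forward(self, x):
        # W x + b * (B (d * (A x)))
        h = x @ self.A.T            # project down with the shared random A
        h = h * self.d              # per-layer rank scaling
        h = h @ self.B.T            # project up with the shared random B
        return self.base(x) + self.b * h

# The shared matrices are generated once and handed to every adapted layer:
r, d_in, d_out = 4, 4096, 4096
shared_A = torch.randn(r, d_in) * 0.02
shared_B = torch.randn(d_out, r) * 0.02
```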

Delta-LoRA

The rebel of the group. Instead of keeping the original weight matrix W completely frozen, Delta-LoRA actually updates it, but in a structured way using the differences of those AB products over time. So W isn't strictly frozen, but its changes are constrained by the low-rank structure.
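
A hedged sketch of what one training step might look like, building on the hypothetical LoRALinear from earlier (the factor lam stands in for the paper's exact scaling):

```python
import torch

def delta_lora_step(layer, optimizer, loss, lam=0.5):
    # Snapshot the low-rank product B @ A before the optimizer moves A and B
    prev_BA = (layer.B @ layer.A).detach()

    loss.backward()
    optimizer.step()        # only A and B are in the optimizer
    optimizer.zero_grad()

    with torch.no_grad():
        new_BA = layer.B @ layer.A
        # W isn't strictly frozen: it absorbs the *change* in the low-rank product
        layer.base.weight += lam * (new_BA - prev_BA)
```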

LoRA+

Someone looked at regular LoRA and said "but what if we made the learning rates different?" So they give matrix B a higher learning rate (λ > 1) than matrix A. Apparently B benefits from faster updates. Who knew?
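
In code, LoRA+ is less a new architecture than an optimizer config. A sketch, reusing the LoRALinear layer from above (the ratio lam is an illustrative value, not a recommendation from the paper):

```python
import torch

base_lr = 1e-4
lam = 16  # how much faster B learns than A; tuned per setup

optimizer = torch.optim.AdamW([
    {"params": [layer.A], "lr": base_lr},
    {"params": [layer.B], "lr": base_lr * lam},
])
```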

The real question: Do we really need all these variants? Probably not. But the core idea is solid - you can adapt massive models to new tasks without breaking the bank or waiting forever for training to finish.

It's actually pretty elegant engineering when you think about it. Take a frozen foundation model, inject a tiny amount of trainable parameters in just the right way, and suddenly you can teach it to be a customer service bot or a code reviewer or whatever.

The creativity in finding efficient ways to inject new knowledge into frozen models is honestly impressive. Even if the naming conventions are getting out of hand.

Next week someone's probably gonna publish "LoRA-XYZ-Ultra-Pro-Max" and I'm gonna lose it.

References

Essential Papers

  • Original LoRA Paper - Hu et al. (2021) - The foundational paper that started it all. Introduces Low-Rank Adaptation which "freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer." Perfect for understanding the "OG of the bunch."
  • LoRA+ Paper (2024) - Shows that original LoRA "leads to suboptimal finetuning" because "adapter matrices A and B in LoRA are updated with the same learning rate." Directly validates the whole "different learning rates" thing I mentioned.
  • VeRA Paper (2023) - VeRA "reduces the number of trainable parameters by 10x compared to LoRA, yet maintains the same performance" by "using a single pair of low-rank matrices shared across all layers." Exactly what I meant by parameter sharing.
  • Delta-LoRA Paper (2023) - Delta-LoRA "not only updates the low-rank matrices A and B, but also" updates the original weight matrix in a structured way. The "rebel of the group" description is spot on.

Deep Dives

  • Sebastian Raschka's LoRA Guide (2023) - Excellent technical explanation of the math behind LoRA. Great for readers who want deeper understanding beyond my "footnotes" analogy.
  • Nature Machine Intelligence PEFT Survey (2023) - Comprehensive overview of "delta-tuning" methods that update "only a small number of parameters while freezing the remaining parameters." Perfect academic backing for the whole "alphabet soup" criticism.