LoRA and QLoRA: Fine-Tuning Large Models on One GPU

Full fine-tuning of a 7B language model requires between 100 and 120 gigabytes of GPU memory for weights, gradients, and optimizer states under a typical mixed-precision Adam setup. That means at minimum two A100 80GB cards and roughly $50,000 in hardware just to run a single training job. For a 70B model, the math gets worse by a factor of ten.
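The 100 to 120 GB figure can be reproduced with a back-of-the-envelope calculation. The sketch below assumes mixed-precision Adam at 16 bytes per parameter and ignores activations. The per-parameter byte costs are illustrative assumptions, not figures from a specific framework.

```python
# Assumed per-parameter byte costs for mixed-precision Adam (illustrative only).
BYTES_PER_PARAM = {
    "fp16_weights": 2,
    "fp16_gradients": 2,
    "fp32_master_weights": 4,
    "fp32_adam_momentum": 4,
    "fp32_adam_variance": 4,
}


def full_finetune_gb(num_params):
    """Training-state memory in decimal GB, excluding activations."""
    return num_params * sum(BYTES_PER_PARAM.values()) / 1e9


print(full_finetune_gb(7e9))   # nominal 7B model
print(full_finetune_gb(70e9))  # nominal 70B model
```

Activation memory comes on top of this and depends on batch size, sequence length, and whether gradient checkpointing is used.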

LoRA changed this calculation in 2022. QLoRA changed it again in 2023. Together, they made serious fine-tuning of large language models possible on a single consumer GPU. A 7B model now fine-tunes on a $1,500 RTX 4090. A 70B model on a single A100 80GB. The technique is not approximate or low-quality. On most tasks, LoRA with well-chosen rank recovers 90-95% of full fine-tuning performance while training less than 0.5% of the model’s parameters.

Most explanations of LoRA describe what it does without explaining why it works. The answer comes from a 2021 paper on intrinsic dimensionality that is rarely cited in practitioner guides, even though it is the empirical foundation the LoRA authors explicitly built on.

The Intrinsic Dimension Hypothesis: Why Weight Updates Are Low-Rank

In 2021, Armen Aghajanyan, Luke Zettlemoyer, and Sonal Gupta at Facebook AI (now Meta AI) published a paper that should be required reading for anyone working with fine-tuning: “Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning” (ACL 2021).

The paper measured how many trainable parameters a language model actually needs during fine-tuning by restricting optimization to a random low-dimensional subspace and projecting the result back into the full parameter space. The question was: how small can that subspace be while still reaching 90% of the full fine-tuning performance on downstream tasks?

The answer was striking. A RoBERTa-Large model with roughly 355 million parameters reached 90% of full performance on MRPC (a paraphrase detection benchmark) by optimizing just 200 parameters randomly projected back into the full space. The intrinsic dimension of fine-tuning, the minimum number of free parameters required to solve the task adequately, was orders of magnitude smaller than the model’s parameter count.
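The reparameterization behind this measurement can be sketched in a few lines: the full parameter vector is the pre-trained vector plus a fixed random projection of a small trainable vector. The paper uses a structured (Fastfood) projection for efficiency. The dense Gaussian projection, the function names, and the toy sizes below are assumptions chosen to keep the example small.

```python
import random


def make_projection(full_dim, sub_dim, seed=0):
    """Fixed random matrix P of shape (full_dim, sub_dim); it is never trained."""
    rng = random.Random(seed)
    scale = 1.0 / sub_dim ** 0.5
    return [[rng.gauss(0.0, scale) for _ in range(sub_dim)] for _ in range(full_dim)]


def reparameterize(theta0, projection, z):
    """theta = theta0 + P z, where only the small vector z is trainable."""
    return [t + sum(p * zj for p, zj in zip(row, z)) for t, row in zip(theta0, projection)]


full_dim, sub_dim = 1000, 20          # toy sizes, not the paper's
theta0 = [0.1] * full_dim             # stands in for pre-trained weights
P = make_projection(full_dim, sub_dim)

z = [0.0] * sub_dim                   # training starts exactly at theta0
theta = reparameterize(theta0, P, z)
print(len(z), "trainable values steer", len(theta), "parameters")
```

The intrinsic dimension is then the smallest sub_dim for which optimizing z alone reaches the chosen performance threshold.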

The paper also found that pre-training implicitly minimizes intrinsic dimension, and that larger pre-trained models tend to have lower intrinsic dimension after fine-tuning. Intrinsic dimension is not the same quantity as the rank of a weight update, but the implication carries over: the more parameters a model has, and the better its pre-training, the fewer degrees of freedom are needed to adapt it to a new task.

This is the foundation LoRA stands on. The weight change during fine-tuning is low-rank in practice because the optimization problem of adapting a well-pre-trained model to a downstream task has low intrinsic dimension.

How LoRA Works: The Mathematics

Edward Hu, Yelong Shen, Philip Wallis, and colleagues at Microsoft released LoRA on arXiv in 2021 and published it at ICLR 2022. The core idea is to represent the weight update for each linear layer as a product of two low-rank matrices rather than training the full weight matrix directly.

For a weight matrix W of shape (d_out, d_in), standard fine-tuning trains the full update delta_W, which has d_out times d_in parameters. LoRA instead trains two matrices: A of shape (r, d_in) and B of shape (d_out, r), where r is the rank hyperparameter. The effective weight update is B times A, which has the same shape as delta_W but is parameterized by only r times (d_in plus d_out) values.

At rank r = 8, for a typical attention weight matrix with d = 4096, this reduces trainable parameters from roughly 16 million per matrix to about 65,000: a 99.6% reduction.
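The arithmetic behind these numbers is short enough to check directly. The function names below are illustrative.

```python
def full_update_params(d_in, d_out):
    """Parameters in a dense update delta_W of shape (d_out, d_in)."""
    return d_in * d_out


def lora_params(d_in, d_out, r):
    """Parameters in A of shape (r, d_in) plus B of shape (d_out, r)."""
    return r * (d_in + d_out)


d, r = 4096, 8
full = full_update_params(d, d)
low_rank = lora_params(d, d, r)
reduction = 1 - low_rank / full

print(full, low_rank, f"{reduction:.1%}")  # prints: 16777216 65536 99.6%
```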

During training, the pre-trained weights W remain frozen. Only A and B are updated. During inference, A and B are multiplied together and added to W, producing a single merged weight matrix that requires no extra compute. The inference overhead of LoRA is exactly zero once the adapter is merged.

The initialization matters. Matrix A is initialized with Gaussian random values, and B is initialized to zero. This ensures that the product B times A equals zero at training start, meaning the model begins fine-tuning from the pre-trained behavior rather than a random perturbation.
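A minimal sketch of the adapted layer makes the role of the initialization concrete. It uses plain Python lists instead of a tensor library so that it runs anywhere. The class name, the 0.02 standard deviation, and the toy shapes are assumptions for illustration.

```python
import random


def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


class LoRALinear:
    def __init__(self, W, r, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W  # frozen pre-trained weights, never updated
        # A: Gaussian init, shape (r, d_in). The std of 0.02 is an assumption.
        self.A = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(r)]
        # B: zero init, shape (d_out, r), so B @ A is zero before training.
        self.B = [[0.0] * r for _ in range(d_out)]
        self.scale = alpha / r

    def forward(self, x):
        base = matvec(self.W, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]


W = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
layer = LoRALinear(W, r=2, alpha=4)
x = [1.0, 2.0, 3.0]

# At initialization the adapted layer reproduces the frozen layer exactly.
print(layer.forward(x) == matvec(W, x))
```

Because B starts at zero, the first gradient step moves B away from zero while A supplies the random directions it projects through.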

The Scaling Factor Problem: LoRA vs rsLoRA

The original LoRA implementation applies a scaling factor of alpha divided by r to the adapter output, where alpha is a hyperparameter typically set to twice the rank value. This scaling was introduced to make the adapter magnitude consistent across rank choices, with higher rank adapters producing smaller per-parameter updates.

In 2023, Damjan Kalajdzievski published rsLoRA (A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA), identifying a problem with this convention. Standard LoRA’s scaling by alpha/r causes adapter signal strength to decrease as rank increases. This means that increasing rank to improve expressiveness simultaneously decreases the learning signal per parameter, creating a coupling between rank and effective learning rate that forces practitioners to re-tune alpha for each rank choice.

rsLoRA corrects this by scaling with alpha divided by the square root of r instead of alpha divided by r. This stabilizes the adapter gradient magnitude across rank values, allowing higher ranks to produce meaningfully better results without manual re-tuning. The rsLoRA paper demonstrated that models trained with the corrected scaling factor consistently outperformed standard LoRA at the same rank, and that the benefit grew with rank, making rsLoRA especially valuable when rank needs to be high for complex tasks.
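The difference between the two conventions is easiest to see by holding alpha fixed and sweeping the rank. The helper below is an illustrative sketch, not library code.

```python
import math


def lora_scale(alpha, r, rank_stabilized=False):
    """Adapter scaling factor: alpha / r (LoRA) or alpha / sqrt(r) (rsLoRA)."""
    return alpha / math.sqrt(r) if rank_stabilized else alpha / r


alpha = 16
for r in (4, 16, 64, 256):
    print(r, lora_scale(alpha, r), lora_scale(alpha, r, rank_stabilized=True))
```

With alpha fixed, the standard factor shrinks 64-fold between r = 4 and r = 256, while the rank-stabilized factor shrinks only 8-fold.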

rsLoRA is now available in the Hugging Face PEFT library through the use_rslora option of LoraConfig. The practical implication: if you are using LoRA at ranks above 16, rsLoRA is worth enabling. The original convention works well at low ranks but underperforms at high ranks precisely when you need the most expressiveness.

DoRA: Decomposing Magnitude and Direction

Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen published DoRA (Weight-Decomposed Low-Rank Adaptation) in 2024, identifying a structural difference between how LoRA and full fine-tuning modify weight matrices.

Full fine-tuning can change a weight matrix’s magnitude (how strongly it responds to inputs) and direction (which inputs it responds to) independently. LoRA, by adding a low-rank update to the full weight matrix, couples these two changes. A 2024 analysis (LoRA vs Full Fine-tuning: An Illusion of Equivalence) confirmed that LoRA and full fine-tuning learn qualitatively different weight structures even when their downstream task performance is similar, with LoRA producing weight matrices whose singular value decompositions have markedly different structure.

DoRA decomposes the weight matrix into magnitude and direction components, then applies LoRA only to the direction component while allowing the magnitude to change freely. This gives the adapter more expressive power to replicate the learning dynamics of full fine-tuning, while keeping the trainable parameter count similar to standard LoRA.
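The decomposition can be sketched as follows: the adapted weight is a learnable magnitude vector times the unit-normalized direction of the pre-trained weight plus the low-rank update. This sketch takes the norm column by column, following the paper's notation. The function names are illustrative, and delta stands in for the product B times A.

```python
import math


def column_norms(W):
    """L2 norm of each column of W (W is a list of rows)."""
    d_out, d_in = len(W), len(W[0])
    return [math.sqrt(sum(W[i][j] ** 2 for i in range(d_out))) for j in range(d_in)]


def dora_weight(W0, delta, magnitude):
    """W' = magnitude * (W0 + delta) / ||W0 + delta||, applied per column."""
    V = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W0, delta)]
    norms = column_norms(V)
    return [[magnitude[j] * V[i][j] / norms[j] for j in range(len(norms))]
            for i in range(len(V))]


W0 = [[3.0, 1.0], [4.0, 2.0]]
magnitude = column_norms(W0)            # magnitude starts at the pre-trained norms
zero_delta = [[0.0, 0.0], [0.0, 0.0]]   # B @ A is zero at the start of training

W_init = dora_weight(W0, zero_delta, magnitude)
print(W_init)  # matches W0 up to floating-point rounding
```

Scaling an entry of magnitude rescales its column without rotating it, which is the independent magnitude control that plain LoRA lacks.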

In the original DoRA experiments, the method consistently outperformed standard LoRA on commonsense reasoning, visual instruction tuning (LLaVA-style VLMs), and text-to-image generation benchmarks, with the gains most pronounced on complex tasks requiring structural changes to model representations. DoRA is available in the Hugging Face PEFT library as an alternative to standard LoRA and is worth evaluating when standard LoRA fails to close the gap with full fine-tuning on a specific task.

QLoRA: Training 70B Models on a Single A100

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer at the University of Washington published QLoRA in 2023, solving a different problem than LoRA: even when you only train 0.5% of parameters, the frozen base model still occupies GPU memory.

A LLaMA-7B model in fp16 requires 14GB for its frozen weights alone. LLaMA-13B requires 26GB. LLaMA-65B requires 130GB, which is impossible on a single consumer GPU. QLoRA addresses this by quantizing the frozen base model to 4 bits, then applying standard LoRA adapters at bf16 precision on top of the quantized base.
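These figures follow from bytes per parameter. The sketch below uses nominal parameter counts (7, 13, and 65 billion), which is an approximation of the real model sizes, and ignores quantization constants.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for frozen weights in decimal GB."""
    return num_params * bits_per_param / 8 / 1e9


for name, n in [("LLaMA-7B", 7e9), ("LLaMA-13B", 13e9), ("LLaMA-65B", 65e9)]:
    print(name, "fp16:", weight_memory_gb(n, 16), "GB,", "4-bit:", weight_memory_gb(n, 4), "GB")
```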

The key innovation in QLoRA is not 4-bit quantization itself, but a new 4-bit data type called NF4 (NormalFloat4) designed specifically for the distribution of neural network weights. Standard int4 quantization assumes a uniform distribution across the quantization range. Neural network weights follow an approximately normal (Gaussian) distribution. NF4 spaces its 16 quantization levels at the quantiles of a standard normal distribution, minimizing expected quantization error for normally distributed weights compared to uniform int4.
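The idea of quantile-spaced levels can be illustrated with the standard library. This is a simplified sketch, not the exact NF4 codebook: the real data type is constructed asymmetrically so that zero is represented exactly, which the symmetric levels below do not do. All function names are illustrative.

```python
from statistics import NormalDist


def normal_quantile_levels(bits=4):
    """2**bits levels at the midpoints of equal-probability bins of N(0, 1), scaled to [-1, 1]."""
    n = 2 ** bits
    normal = NormalDist()
    raw = [normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    largest = max(abs(v) for v in raw)
    return [v / largest for v in raw]


def quantize_block(weights, levels):
    """Absmax-scale a block into [-1, 1], then store the index of the nearest level."""
    absmax = max(abs(w) for w in weights) or 1.0
    codes = [
        min(range(len(levels)), key=lambda k: abs(levels[k] - w / absmax))
        for w in weights
    ]
    return codes, absmax


def dequantize_block(codes, absmax, levels):
    return [levels[c] * absmax for c in codes]


levels = normal_quantile_levels()
block = [0.02, -0.31, 0.11, 0.0, 0.27, -0.05]
codes, absmax = quantize_block(block, levels)
restored = dequantize_block(codes, absmax, levels)
print(codes)
print([round(v, 3) for v in restored])
```

The levels crowd together near zero, where most weights of a roughly Gaussian block fall, and spread out in the tails.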

QLoRA also introduces Double Quantization (DQ), which quantizes the quantization constants themselves. Each small block of weights has its own scaling factor (a quantization constant). Standard 4-bit quantization stores these constants in fp32, which adds roughly 0.5 bits per parameter. Double Quantization applies 8-bit quantization to the quantization constants, recovering most of this overhead and reducing the average storage per parameter to approximately 4.127 bits.
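The 0.5-bit and 4.127-bit figures follow from the block sizes used in the QLoRA paper: 64 weights per first-level block and 256 first-level constants per second-level block. The function below is an illustrative calculation, not library code.

```python
def bits_per_param(weight_bits=4, block_size=64, double_quant=False, second_block_size=256):
    """Average storage per weight, including quantization constants."""
    if not double_quant:
        # One fp32 constant shared by each block of weights.
        return weight_bits + 32 / block_size
    # 8-bit first-level constants, plus one fp32 constant per group of them.
    return weight_bits + 8 / block_size + 32 / (block_size * second_block_size)


print(bits_per_param())                   # 4.5
print(bits_per_param(double_quant=True))  # 4.126953125
```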

The result: LLaMA-65B, which requires approximately 130GB at fp16, fine-tunes in QLoRA at approximately 41GB, within the memory budget of a single A100 80GB GPU. The Guanaco models that the QLoRA paper used to validate the technique were competitive with ChatGPT on human evaluation benchmarks despite being trained on a single GPU over a single day.

What Rank r to Choose

Rank selection is the primary hyperparameter in LoRA and has no universal answer. The right value depends on the complexity of the task, the size of the base model, and the amount of training data available.

Low ranks (r = 4 or r = 8): Appropriate for simple instruction following, style transfer, or domain adaptation tasks where the target behavior is a small perturbation of the base model’s existing capabilities. These are fast, memory-efficient, and often sufficient for single-domain specialization.

Medium ranks (r = 16 to r = 32): The practical default for most fine-tuning tasks, including instruction tuning, chat format adaptation, and moderately complex capability extension. The original LoRA paper uses r = 4 and r = 8 for its reported results, but the broader practitioner community has found that r = 16 to r = 32 provides a better starting point for general-purpose fine-tuning.

High ranks (r = 64 or above): Required for complex structural changes, code generation fine-tuning, or domain transfers that require substantial departure from base model behavior. At high ranks, rsLoRA’s scaling correction becomes important. Biderman et al. (2024) found that LoRA has persistent difficulty matching full fine-tuning on code generation even at r = 256. DoRA is one variant worth evaluating in that regime, although it was not designed specifically for code.

A practical heuristic from the PEFT literature: start at r = 16 with rsLoRA enabled, evaluate on a held-out validation set, and increase rank only if performance plateaus below the full fine-tuning baseline. Doubling rank from r = 16 to r = 32 roughly doubles trainable parameter count and memory for optimizer states, but often yields diminishing returns beyond r = 64 for most NLP tasks.
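The heuristic can be written as a small search loop. In this sketch, evaluate is a hypothetical callback that trains an adapter at the given rank and returns a validation score where higher is better. The target argument stands for the full fine-tuning baseline, and the min_gain threshold is an arbitrary illustrative value.

```python
def choose_rank(evaluate, target, start=16, max_rank=64, min_gain=0.005):
    """Double the rank while the score is below target and still improving."""
    best_rank, best_score = start, evaluate(start)
    rank = start * 2
    while best_score < target and rank <= max_rank:
        score = evaluate(rank)
        if score - best_score < min_gain:
            break  # diminishing returns: keep the smaller adapter
        best_rank, best_score = rank, score
        rank *= 2
    return best_rank, best_score


# Usage with made-up validation scores standing in for real training runs.
fake_scores = {16: 0.70, 32: 0.74, 64: 0.741}
print(choose_rank(fake_scores.get, target=0.80))
```

Each call to evaluate is a full training run, so in practice the loop is usually executed by hand over a handful of ranks.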

Which Layers to Adapt

LoRA can be applied to any linear layer. The original paper restricted it to the attention projection matrices (query, key, value, and output projections), and much subsequent work extends it to the feed-forward layer weights as well. In the original paper’s experiments, adapting the query and value matrices alone captured most of the benefit on NLP tasks. The output projection and feed-forward layers add measurable improvement on some tasks; the QLoRA paper, for example, reports that adapters on all linear layers were needed to match full fine-tuning in its experiments.

For instruction tuning and conversational fine-tuning, adapting all attention projection matrices at r = 16 is the standard starting configuration. For domain-specific knowledge injection, including the feed-forward layers is often worth the additional parameter cost because factual knowledge in transformers is disproportionately stored in the MLP layers rather than attention.
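The parameter cost of each choice can be estimated up front. The sketch below assumes LLaMA-7B-like shapes (32 layers, hidden size 4096, feed-forward size 11008) and uses the module names from the Hugging Face LLaMA implementation. Other architectures will differ.

```python
HIDDEN, INTERMEDIATE, LAYERS = 4096, 11008, 32

# (d_in, d_out) for each linear layer in one transformer block.
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

ATTENTION = ["q_proj", "k_proj", "v_proj", "o_proj"]
MLP = ["gate_proj", "up_proj", "down_proj"]


def trainable_params(target_modules, r):
    """Total LoRA parameters: r * (d_in + d_out) per adapted matrix, per layer."""
    per_layer = sum(r * sum(SHAPES[name]) for name in target_modules)
    return per_layer * LAYERS


for label, modules in [("q and v only", ["q_proj", "v_proj"]),
                       ("all attention", ATTENTION),
                       ("attention + MLP", ATTENTION + MLP)]:
    print(label, trainable_params(modules, r=16))
```

Under these assumptions, adapting every linear layer at r = 16 still trains well under 1% of a 7B model.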

Merge at Inference: Zero Overhead

The practical advantage most practitioners underuse is LoRA merging. After training, the adapter matrices A and B can be multiplied together and added directly to the corresponding frozen weight matrix W. The merged weight matrix W plus B times A is identical to the original matrix plus the adapter update, and it fits in the same memory footprint as the base model with no additional weights.
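The equivalence is easy to verify on a toy example. The sketch below uses plain Python lists and illustrative function names. It includes the adapter scaling factor as a scale argument, since a real merge has to apply the same alpha-based factor that was used during training.

```python
def matmul(X, Y):
    """Matrix product of X and Y, both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def merge(W, A, B, scale=1.0):
    """Return W + scale * (B @ A), which has the same shape as W."""
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # (d_out=2, d_in=3)
A = [[0.5, -0.5, 1.0]]                   # (r=1, d_in=3)
B = [[2.0], [-1.0]]                      # (d_out=2, r=1)
x = [1.0, 0.0, -1.0]

# Adapter kept separate: W x + B (A x)
unmerged = [b + u for b, u in zip(matvec(W, x), matvec(B, matvec(A, x)))]
# Adapter folded into the weights: (W + B A) x
merged = matvec(merge(W, A, B), x)

print(unmerged, merged)
```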

After merging, inference is identical in compute and memory to running the base model. There is no adapter overhead, no conditional logic, and no additional memory allocation. This makes LoRA-fine-tuned models production-trivial to deploy: ship the merged model exactly as you would ship the base model. A merged LoRA model also pairs cleanly with speculative decoding, since it is an ordinary static checkpoint that can serve as the target model against which a draft model’s proposed tokens are verified.

Merging also enables efficient multi-task deployment through task arithmetic. Multiple LoRA adapters trained on different tasks can be merged into a single base model simultaneously, with each adapter’s contribution scaled by a mixing coefficient. This is not perfect (tasks can interfere), but it allows a single model checkpoint to approximate the behavior of several separately fine-tuned models, which has significant implications for serving infrastructure cost.
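A weighted merge of several adapters can be sketched as a sum of scaled updates. Here each entry of deltas stands for an already-computed product B times A for one task, and the mixing coefficients are arbitrary illustrative values that would need tuning on held-out data.

```python
def combine_adapters(W, deltas, coefficients):
    """Return W + sum_i c_i * delta_i without modifying W."""
    merged = [row[:] for row in W]
    for delta, c in zip(deltas, coefficients):
        for i, d_row in enumerate(delta):
            for j, d in enumerate(d_row):
                merged[i][j] += c * d
    return merged


W = [[1.0, 0.0], [0.0, 1.0]]
delta_task_a = [[1.0, 1.0], [0.0, 0.0]]   # stands in for B_a @ A_a
delta_task_b = [[0.0, 0.0], [2.0, 2.0]]   # stands in for B_b @ A_b

print(combine_adapters(W, [delta_task_a, delta_task_b], [0.5, 0.25]))
```

The summed update can have rank up to the sum of the individual ranks, and overlapping directions are where interference between tasks shows up.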

Where LoRA Falls Short

The 2024 paper “LoRA vs Full Fine-tuning: An Illusion of Equivalence” (Shuttleworth et al., 2024, arXiv:2410.21228) is the most important critical analysis of LoRA published to date. The paper found that even when LoRA and full fine-tuning achieve similar downstream benchmark scores, the weight matrices they produce have structurally different singular value decompositions. Full fine-tuning tends to produce weight changes distributed across many singular directions. LoRA, by construction, concentrates weight change in at most r singular directions regardless of what the task requires.

The practical implication: LoRA may solve the benchmark problem while solving it in a structurally different way than full fine-tuning, which can produce brittle behavior outside the specific distribution the fine-tuning data covered. For production deployments requiring broad generalization, this is worth knowing before committing to LoRA as the sole fine-tuning method.

Biderman et al. (2024) found that LoRA consistently underperforms full fine-tuning on code generation even at high ranks, a finding that has been replicated. DoRA partially addresses this but has not fully closed the gap. If code generation capability is the primary objective, full fine-tuning on a model small enough to fit the compute budget is often worth pursuing over LoRA on a larger model.

QLoRA introduces an additional source of degradation: the 4-bit quantization of the base model. NF4 is carefully designed to minimize quantization error for normal distributions, but it is still lossy compression. The Guanaco results in the original QLoRA paper show competitive performance with full-precision fine-tuning, but more recent work has found that QLoRA typically loses 1-3 perplexity points relative to the same fine-tune at bf16 precision. For tasks where perplexity differences of this magnitude are acceptable, QLoRA is the right choice. For tasks requiring maximum capability extraction from a given model, bf16 LoRA or full fine-tuning at bf16 on a smaller model may perform better.

The Current Fine-Tuning Ecosystem

The Hugging Face PEFT library is the standard implementation for LoRA, QLoRA, rsLoRA, and DoRA. It supports all major model families and integrates directly with the Transformers trainer, making the practical barrier to entry for parameter-efficient fine-tuning extremely low. Axolotl provides a higher-level wrapper around PEFT and Transformers that has become the dominant open-source tool for community fine-tuning of models like LLaMA-3, Mistral, and Qwen.

The Unsloth library implements hand-written Triton kernels for LoRA and QLoRA that reduce memory usage and training time relative to the standard PEFT implementation by roughly 30-50% through kernel fusion. For practitioners pushing the memory limits of a single GPU, Unsloth is worth evaluating before upgrading hardware.

As base models improve in general capability, the intrinsic dimension of fine-tuning for standard tasks continues to decrease. This means that lower ranks and smaller LoRA adapters become progressively sufficient as pre-training improves. The trend suggests that for routine instruction following and style adaptation, LoRA ranks that would have seemed too small two years ago now suffice, a practical benefit of the broader improvements in base model quality driven by frontier-scale training regimes. The same push toward scale that drives genomic foundation models like Evo 2 to 9.3 trillion training tokens is also lowering the cost of adapting well-pre-trained language models.

For the practitioner making a concrete decision: LoRA at r = 16 with rsLoRA enabled and bf16 precision on a model that fits in memory is the right default for most instruction tuning and domain adaptation tasks. Drop to QLoRA only when the target model exceeds available memory, and accept the 1-3 point perplexity loss that quantization introduces as the cost of running at that scale on constrained hardware. Evaluate DoRA when standard LoRA consistently falls short of full fine-tuning performance on your specific task.
