Commit a1e26e4 — docs/lora.md
# **LoRA for LLMs From Scratch**

## **Chapter 1: The Promise - Master LoRA in 30 Minutes**

You've heard of LoRA. It's the key to fine-tuning massive LLMs on a single GPU. You've seen the acronyms: PEFT, low-rank adaptation. But what is it, *really*?

It's not a complex theory. It's a simple, elegant trick.

Instead of training the massive weight matrices of a billion-parameter model, you freeze them. For each frozen matrix `W`, you then train two tiny matrices, `A` and `B`, that represent the *change* to `W`.

This is LoRA. It's this piece of code:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int, alpha: float = 16.0):
        super().__init__()
        self.r = r
        self.alpha = alpha
        self.scaling = self.alpha / self.r

        # Freeze the original linear layer
        self.base = base
        self.base.weight.requires_grad_(False)

        # Create the trainable low-rank matrices
        self.lora_A = nn.Parameter(torch.empty(r, base.in_features))
        self.lora_B = nn.Parameter(torch.empty(base.out_features, r))

        # Initialize the weights
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B)  # Start with no change

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Original path (frozen) + LoRA path (trainable)
        return self.base(x) + (F.linear(F.linear(x, self.lora_A), self.lora_B) * self.scaling)
```
**My promise:** You will understand every line of this code, the math behind it, and why it's so effective, in the next 30 minutes. Let's begin.

## **Chapter 2: The Foundation - The `nn.Linear` Layer**

Before we can modify an LLM, we must understand its most fundamental part: the `nn.Linear` layer. It's the simple workhorse that performs the vast majority of computations in a Transformer.

Its only job is to perform this equation: `output = input @ W.T + b`

#### A Minimal, Reproducible Example

Let's see this in action. We'll create a tiny linear layer that takes a vector of size 3 and outputs a vector of size 2. To make this perfectly clear, we will set the weights and bias manually.

**1. Set up the layer and input:**
```python
import torch
import torch.nn as nn

# A layer that maps from 3 features to 2 features
layer = nn.Linear(in_features=3, out_features=2, bias=True)

# A single input vector (with a batch dimension of 1)
input_tensor = torch.tensor([[1., 2., 3.]])

# Manually set the weights and bias for a clear example
with torch.no_grad():
    layer.weight = nn.Parameter(torch.tensor([[0.1, 0.2, 0.3],
                                              [0.4, 0.5, 0.6]]))
    layer.bias = nn.Parameter(torch.tensor([0.7, 0.8]))
```
**2. Inspect the Exact Components:**

Now we have known values for everything.

* **Input `x`:** `[1., 2., 3.]`
* **Weight `W`:** `[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]`
* **Bias `b`:** `[0.7, 0.8]`

**3. The Forward Pass and Its Output:**

When you call `layer(input_tensor)`, PyTorch computes the result.
```python
# The forward pass
output_tensor = layer(input_tensor)

print("--- PyTorch Calculation ---")
print("Input (x):", input_tensor)
print("Weight (W):\n", layer.weight)
print("Bias (b):", layer.bias)
print("\nOutput (y):", output_tensor)
```
This will print:

```text
--- PyTorch Calculation ---
Input (x): tensor([[1., 2., 3.]])
Weight (W):
 Parameter containing:
tensor([[0.1000, 0.2000, 0.3000],
        [0.4000, 0.5000, 0.6000]], requires_grad=True)
Bias (b): Parameter containing:
tensor([0.7000, 0.8000], requires_grad=True)

Output (y): tensor([[2.1000, 4.7000]], grad_fn=<AddmmBackward0>)
```

The final output is the tensor `[[2.1, 4.7]]`.
**4. Manual Verification: Step-by-Step**

Let's prove this result. The calculation is `x @ W.T + b`.

* **First, the matrix multiplication `x @ W.T`:**
    * `[1, 2, 3] @ [[0.1, 0.4], [0.2, 0.5], [0.3, 0.6]]`
    * `output[0] = (1*0.1) + (2*0.2) + (3*0.3) = 0.1 + 0.4 + 0.9 = 1.4`
    * `output[1] = (1*0.4) + (2*0.5) + (3*0.6) = 0.4 + 1.0 + 1.8 = 3.2`
    * Result: `[1.4, 3.2]`

* **Second, add the bias `+ b`:**
    * `[1.4, 3.2] + [0.7, 0.8]`
    * Result: `[2.1, 4.7]`

The manual calculation matches the PyTorch output exactly. This is all a linear layer does.
#### The Scaling Problem

This seems trivial. So where is the problem? The problem is scale.

* **Our Toy Layer (`3x2`):**
    * Weight parameters: `3 * 2 = 6`
    * Bias parameters: `2`
    * **Total:** `8` trainable parameters.

* **A Single LLM Layer (e.g., `4096x4096`):**
    * Weight parameters: `4096 * 4096 = 16,777,216`
    * Bias parameters: `4096`
    * **Total:** `16,781,312` trainable parameters.

A single layer in an LLM can have over **16 million** parameters. A full model has dozens of these layers. Trying to update all of them during fine-tuning is what melts GPUs. This is the bottleneck LoRA is designed to break.
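You can confirm both counts directly (note that instantiating the `4096x4096` layer allocates roughly 64 MB of float32 weights):

```python
import torch.nn as nn

toy = nn.Linear(3, 2, bias=True)
big = nn.Linear(4096, 4096, bias=True)  # roughly one LLM projection layer

def count_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

print(count_params(toy))  # 8
print(count_params(big))  # 16781312
```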
## **Chapter 3: The LoRA Method - Math and Astonishing Savings**

This is the core idea. Instead of changing the massive weight matrix $W$, we freeze it and learn a tiny "adjustment" matrix, $\Delta W$.

The new, effective weight matrix, $W_{eff}$, is a simple sum:

$W_{eff} = W_{frozen} + \Delta W$

Training the full $\Delta W$ would be too expensive. The breakthrough of LoRA is to force this change to be **low-rank**, meaning we can construct it from two much smaller matrices, $A$ and $B$. We also add a scaling factor, $\frac{\alpha}{r}$, where $r$ is the rank and $\alpha$ is a hyperparameter.

The full LoRA update is defined by this formula:

$\Delta W = \frac{\alpha}{r} B A$

#### A Step-by-Step Numerical Example

Let's build a tiny LoRA update from scratch.

**Given:**
* A frozen weight matrix $W_{frozen}$ of shape `[out=4, in=3]`.
* A LoRA rank $r=2$.
* A scaling hyperparameter $\alpha=4$ (so the scaling factor is $\frac{\alpha}{r} = 2$).

$W_{frozen} = \begin{pmatrix} 1 & 1 & 1 \\ 2 & 2 & 2 \\ 3 & 3 & 3 \\ 4 & 4 & 4 \end{pmatrix}$

Now, we define our trainable LoRA matrices, $A$ and $B$:
* $A$ must have shape `[r, in]`, so `[2, 3]`.
* $B$ must have shape `[out, r]`, so `[4, 2]`.

Let's assume after training they have these values:

$A = \begin{pmatrix} 1 & 0 & 2 \\ 0 & 3 & 0 \end{pmatrix} \quad B = \begin{pmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 2 \\ 1 & 1 \end{pmatrix}$

**Step 1: Calculate the core update, $B A$**

This is a standard matrix multiplication. The result has the same shape as $W_{frozen}$.

$B A = \begin{pmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 2 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 & 2 \\ 0 & 3 & 0 \end{pmatrix} = \begin{pmatrix} (1*1+0*0) & (1*0+0*3) & (1*2+0*0) \\ (0*1+0*0) & (0*0+0*3) & (0*2+0*0) \\ (0*1+2*0) & (0*0+2*3) & (0*2+2*0) \\ (1*1+1*0) & (1*0+1*3) & (1*2+1*0) \end{pmatrix} = \begin{pmatrix} 1 & 0 & 2 \\ 0 & 0 & 0 \\ 0 & 6 & 0 \\ 1 & 3 & 2 \end{pmatrix}$
**Step 2: Apply the scaling factor, $\frac{\alpha}{r}$**

Our scaling factor is $\frac{4}{2} = 2$. We multiply our result by this scalar.

$\Delta W = 2 \times \begin{pmatrix} 1 & 0 & 2 \\ 0 & 0 & 0 \\ 0 & 6 & 0 \\ 1 & 3 & 2 \end{pmatrix} = \begin{pmatrix} 2 & 0 & 4 \\ 0 & 0 & 0 \\ 0 & 12 & 0 \\ 2 & 6 & 4 \end{pmatrix}$

This $\Delta W$ matrix is the total change that our LoRA parameters will apply to the frozen weights.

**Step 3: The "Merge" for Inference**

After training is done, we can create the final, effective weight matrix by adding the frozen weights and the LoRA update.

$W_{eff} = W_{frozen} + \Delta W = \begin{pmatrix} 1 & 1 & 1 \\ 2 & 2 & 2 \\ 3 & 3 & 3 \\ 4 & 4 & 4 \end{pmatrix} + \begin{pmatrix} 2 & 0 & 4 \\ 0 & 0 & 0 \\ 0 & 12 & 0 \\ 2 & 6 & 4 \end{pmatrix} = \begin{pmatrix} 3 & 1 & 5 \\ 2 & 2 & 2 \\ 3 & 15 & 3 \\ 6 & 10 & 8 \end{pmatrix}$

This final $W_{eff}$ matrix is what you would use for deployment. **Crucially, this merge calculation happens only once, after training.** For inference, it's just a standard linear layer, adding zero extra latency.
#### The Forward Pass (How it works during training)

During training, we never compute the full $\Delta W$. That would be inefficient. Instead, we use the decomposed form, which is much faster. The forward pass is:

$y = W_{frozen}x + \frac{\alpha}{r} B(Ax)$

Let's compute this with an input $x = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}$:

1. **LoRA Path (right side):**
    * $Ax = \begin{pmatrix} 1 & 0 & 2 \\ 0 & 3 & 0 \end{pmatrix} \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix} = \begin{pmatrix} (1*1+0*2+2*3) \\ (0*1+3*2+0*3) \end{pmatrix} = \begin{pmatrix} 7 \\ 6 \end{pmatrix}$
    * $B(Ax) = \begin{pmatrix} 1 & 0 \\ 0 & 0 \\ 0 & 2 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} 7 \\ 6 \end{pmatrix} = \begin{pmatrix} (1*7+0*6) \\ (0*7+0*6) \\ (0*7+2*6) \\ (1*7+1*6) \end{pmatrix} = \begin{pmatrix} 7 \\ 0 \\ 12 \\ 13 \end{pmatrix}$
    * Scale it: $2 \times \begin{pmatrix} 7 \\ 0 \\ 12 \\ 13 \end{pmatrix} = \begin{pmatrix} 14 \\ 0 \\ 24 \\ 26 \end{pmatrix}$

2. **Frozen Path (left side):**
    * $W_{frozen}x = \begin{pmatrix} 1 & 1 & 1 \\ 2 & 2 & 2 \\ 3 & 3 & 3 \\ 4 & 4 & 4 \end{pmatrix} \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix} = \begin{pmatrix} (1+2+3) \\ (2+4+6) \\ (3+6+9) \\ (4+8+12) \end{pmatrix} = \begin{pmatrix} 6 \\ 12 \\ 18 \\ 24 \end{pmatrix}$

3. **Final Output:**
    * $y = \begin{pmatrix} 6 \\ 12 \\ 18 \\ 24 \end{pmatrix} + \begin{pmatrix} 14 \\ 0 \\ 24 \\ 26 \end{pmatrix} = \begin{pmatrix} 20 \\ 12 \\ 42 \\ 50 \end{pmatrix}$
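Every number in this chapter can be reproduced in a few lines of PyTorch, including the check that the decomposed training-time path and the merged weights give the same answer:

```python
import torch

W_frozen = torch.tensor([[1., 1., 1.],
                         [2., 2., 2.],
                         [3., 3., 3.],
                         [4., 4., 4.]])
A = torch.tensor([[1., 0., 2.],
                  [0., 3., 0.]])   # shape [r=2, in=3]
B = torch.tensor([[1., 0.],
                  [0., 0.],
                  [0., 2.],
                  [1., 1.]])       # shape [out=4, r=2]
alpha, r = 4.0, 2

# Steps 1-3: build the scaled update and merge it
delta_W = (alpha / r) * (B @ A)
W_eff = W_frozen + delta_W

# Training-time forward pass: never forms delta_W explicitly
x = torch.tensor([1., 2., 3.])
y = W_frozen @ x + (alpha / r) * (B @ (A @ x))

print(y)                             # tensor([20., 12., 42., 50.])
assert torch.allclose(y, W_eff @ x)  # merged and decomposed paths agree
```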
#### The Astonishing Savings

This math is why LoRA works. Let's return to the realistic LLM layer (`4096x4096`) to see the impact.

| Method | Trainable Parameters | Calculation | Parameter Reduction |
| :--- | :--- | :--- | :--- |
| **Full Fine-Tuning** | 16,777,216 | `4096 * 4096` | 0% |
| **LoRA (r=8)** | **65,536** | `(8 * 4096) + (4096 * 8)` | **99.61%** |

By performing the efficient forward pass during training, we only need to store and update the parameters for the tiny `A` and `B` matrices, achieving a >99% parameter reduction while still being able to modify the behavior of the massive base layer.
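The table's numbers are a quick sanity check in plain Python:

```python
full = 4096 * 4096
r = 8
lora = (r * 4096) + (4096 * r)

print(lora)                               # 65536
print(f"{100 * (1 - lora / full):.2f}%")  # 99.61%
```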
## **Chapter 4: The Main Event - Implementing LoRA in PyTorch**

We will now translate the math from the previous chapter into a reusable PyTorch `nn.Module`. Our goal is to create a `LoRALinear` layer that wraps a standard `nn.Linear` layer, freezes it, and adds the trainable `A` and `B` matrices.

#### The `LoRALinear` Module

Here is the complete implementation, followed by a breakdown of each part.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int, alpha: float = 16.0):
        super().__init__()
        # --- Store hyperparameters ---
        self.r = r
        self.alpha = alpha
        self.scaling = self.alpha / self.r

        # --- Store and freeze the original linear layer ---
        self.base = base
        self.base.weight.requires_grad_(False)
        # Also freeze the bias if it exists
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)

        # --- Create the trainable LoRA matrices A and B ---
        # A has shape [r, in_features]
        # B has shape [out_features, r]
        self.lora_A = nn.Parameter(torch.empty(r, self.base.in_features))
        self.lora_B = nn.Parameter(torch.empty(self.base.out_features, r))

        # --- Initialize the weights ---
        # A is initialized with a standard method
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        # B is initialized with zeros
        nn.init.zeros_(self.lora_B)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # 1. The original, frozen path
        base_output = self.base(x)

        # 2. The efficient LoRA path: B(A(x))
        # F.linear(x, self.lora_A) computes x @ A.T
        # F.linear(..., self.lora_B) computes (x @ A.T) @ B.T
        lora_update = F.linear(F.linear(x, self.lora_A), self.lora_B) * self.scaling

        # 3. Return the combined output
        return base_output + lora_update
```
**Breakdown:**

1. **`__init__(self, base, r, alpha)`**:
    * It accepts the original `nn.Linear` layer (`base`) that we want to adapt.
    * `self.base.weight.requires_grad_(False)`: This is the critical **"freezing"** step. We tell PyTorch's autograd engine not to compute gradients for the original weights, so they will never be updated by the optimizer.
    * `nn.Parameter(...)`: We register `lora_A` and `lora_B` as official trainable parameters of the module. Their shapes are derived directly from the base layer and the rank `r`.
    * `nn.init.zeros_(self.lora_B)`: This is a crucial initialization detail. By starting `B` as a zero matrix, the entire LoRA update (`B @ A`) is zero at the beginning of training. This means our `LoRALinear` layer initially behaves exactly like the original frozen layer, and the model learns the "change" from a stable starting point.

2. **`forward(self, x)`**:
    * This is a direct translation of the formula: $y = W_{frozen}x + \frac{\alpha}{r} B(Ax)$
    * We compute the output of the frozen path and the LoRA path separately.
    * The nested `F.linear` calls are an efficient PyTorch way to compute `(x @ A.T) @ B.T` without ever forming the full $\Delta W$ matrix.
    * Finally, we add them together.
#### Applying LoRA to a Model

Now we need a helper function to swap out the `nn.Linear` layers in any given model with our new `LoRALinear` layer. One subtlety: a layer registered directly on the model (e.g., `"0"` in an `nn.Sequential`) has no dot in its name, so its parent is the model itself.

```python
def apply_lora(model: nn.Module, r: int, alpha: float = 16.0):
    """
    Replaces all nn.Linear layers in a model with LoRALinear layers.
    """
    for name, module in list(model.named_modules()):
        if isinstance(module, nn.Linear):
            # Find the parent module to replace the child
            if '.' in name:
                parent_name, child_name = name.rsplit('.', 1)
                parent_module = model.get_submodule(parent_name)
            else:
                parent_module, child_name = model, name

            # Replace the original linear layer
            setattr(parent_module, child_name, LoRALinear(module, r=r, alpha=alpha))
```
#### Minimal End-to-End Demo

Let's see it all work together.

**1. Create a toy model:**

```python
model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 10)  # e.g., for classification
)
```

**2. Inject LoRA layers:**

```python
apply_lora(model, r=8, alpha=16.0)
print(model)
```

The output will show that our `nn.Linear` layers have been replaced by `LoRALinear`.
**3. Isolate the Trainable Parameters:**

This is the most important step. We create an optimizer that *only* sees the LoRA weights.

```python
# Filter for parameters that require gradients (only lora_A and lora_B)
trainable_params = [p for p in model.parameters() if p.requires_grad]
trainable_param_names = [name for name, p in model.named_parameters() if p.requires_grad]

print("\nTrainable Parameters:")
for name in trainable_param_names:
    print(name)

# Create an optimizer that only updates the LoRA weights
optimizer = torch.optim.AdamW(trainable_params, lr=1e-4)
```

**Output:**

```text
Trainable Parameters:
0.lora_A
0.lora_B
2.lora_A
2.lora_B
```

This proves our success. The optimizer is completely unaware of the massive, frozen weights (`0.base.weight`, `2.base.weight`, etc.) and will only update our tiny, efficient LoRA matrices.
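One step from Chapter 3 that we haven't coded yet is the "merge" for deployment. Here is a minimal sketch of a merge helper (`merge_lora` is our own illustrative name, not a library API; the `LoRALinear` class is repeated so the snippet runs standalone):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Compact restatement of the module built in this chapter."""
    def __init__(self, base: nn.Linear, r: int, alpha: float = 16.0):
        super().__init__()
        self.r, self.alpha, self.scaling = r, alpha, alpha / r
        self.base = base
        self.base.weight.requires_grad_(False)
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.empty(r, base.in_features))
        self.lora_B = nn.Parameter(torch.empty(base.out_features, r))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + F.linear(F.linear(x, self.lora_A), self.lora_B) * self.scaling

@torch.no_grad()
def merge_lora(layer: LoRALinear) -> nn.Linear:
    """Fold the LoRA update into the frozen weight: W_eff = W + (alpha/r) * B @ A."""
    merged = nn.Linear(layer.base.in_features, layer.base.out_features,
                       bias=layer.base.bias is not None)
    merged.weight.copy_(layer.base.weight + layer.scaling * (layer.lora_B @ layer.lora_A))
    if layer.base.bias is not None:
        merged.bias.copy_(layer.base.bias)
    return merged

# The adapter and the merged plain Linear produce identical outputs
lora_layer = LoRALinear(nn.Linear(16, 8), r=4)
with torch.no_grad():
    lora_layer.lora_B.normal_()  # pretend training moved B off zero
x = torch.randn(2, 16)
assert torch.allclose(lora_layer(x), merge_lora(lora_layer)(x), atol=1e-5)
```

After merging, you deploy a plain `nn.Linear` with the effective weights, which is exactly why LoRA adds no inference latency.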
## **Chapter 5: Conclusion - From a Toy Model to a Real Transformer**

Let's recap the journey. We started with a simple `nn.Linear` layer and saw how its parameter count explodes at the scale of a real LLM. We then introduced the core mathematical trick of LoRA: approximating the massive update matrix `ΔW` with two small, low-rank matrices `A` and `B`. This simple idea led to a staggering >99% reduction in trainable parameters. Finally, we translated that math into a clean, reusable `LoRALinear` PyTorch module and proved that an optimizer could be set up to *only* train these new, tiny matrices.
#### Where does LoRA actually go in an LLM?

The `nn.Linear` layers we've been working with are not just abstract examples. They are the primary components of a Transformer, the architecture behind virtually all modern LLMs.

When you apply LoRA to a model like Llama or Mistral, you are targeting these specific linear layers:

* **Self-Attention Layers:** The most common targets are the projection matrices for the **query (`q_proj`)** and **value (`v_proj`)**. Adapting these allows the model to change *what it pays attention to* in the input text, which is incredibly powerful for task-specific fine-tuning.
* **Feed-Forward Layers (MLP):** Transformers also have blocks of linear layers that process information after the attention step. Applying LoRA here helps modify the model's learned representations and knowledge.

So, when you see a LoRA implementation for a real LLM, the `apply_lora` function is simply more selective, replacing only the linear layers named `q_proj`, `v_proj`, etc., with the `LoRALinear` module you just built.
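A sketch of what that selectivity looks like; `ToyAttention` and `lora_targets` are illustrative names we made up (not part of any library), but `q_proj`/`v_proj` follow the naming convention of Llama-style attention blocks:

```python
import torch.nn as nn

TARGETS = {"q_proj", "v_proj"}  # attribute names conventionally used in Llama-style blocks

def lora_targets(model: nn.Module) -> list:
    """Return the module paths that a selective apply_lora would wrap."""
    return [name for name, m in model.named_modules()
            if isinstance(m, nn.Linear) and name.rsplit('.', 1)[-1] in TARGETS]

# Toy attention block with Llama-style projection names (hypothetical sizes)
class ToyAttention(nn.Module):
    def __init__(self, d: int = 64):
        super().__init__()
        self.q_proj = nn.Linear(d, d)
        self.k_proj = nn.Linear(d, d)
        self.v_proj = nn.Linear(d, d)
        self.o_proj = nn.Linear(d, d)

block = ToyAttention()
print(lora_targets(block))  # ['q_proj', 'v_proj']
```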
#### Why This Works So Well

The stunning effectiveness of LoRA relies on a powerful hypothesis: the knowledge needed to adapt a pre-trained model to a new task is much simpler than the model's entire knowledge base. You don't need to re-learn the entire English language to make a model a better chatbot. You only need to steer its existing knowledge. This "steering" information lives in a low-dimensional space, which a low-rank update `ΔW = B @ A` can capture remarkably well.

You now have a deep, practical understanding of one of the most important techniques in modern AI. You know the "what," the "why," and the "how" behind LoRA, giving you the foundation to efficiently adapt massive language models.