A complete educational implementation of GPT (Generative Pre-trained Transformer) using only NumPy. This project breaks down every component of the transformer architecture with detailed mathematical explanations and clean, readable code.
This project is designed for learning and understanding transformer architectures from first principles. Every component is implemented from scratch with:
- ✅ Detailed mathematical explanations
- ✅ Clear, readable code with extensive comments
- ✅ No deep learning frameworks (PyTorch, TensorFlow, etc.)
- ✅ Complete backpropagation implementation
GPT is an autoregressive transformer-based language model. Here's what it contains:
```mermaid
graph TD
A[Input Tokens] --> B[Token Embedding]
A --> C[Positional Embedding]
B --> D[Add Embeddings]
C --> D
D --> E[Transformer Block 1]
E --> F[Transformer Block 2]
F --> G[...]
G --> H[Transformer Block N]
H --> I[Layer Normalization]
I --> J[Output Projection]
J --> K[Logits/Predictions]
style B fill:#e1f5ff
style C fill:#e1f5ff
style E fill:#fff9c4
style F fill:#fff9c4
style H fill:#fff9c4
style I fill:#c8e6c9
style J fill:#c8e6c9
```
Each transformer block contains:
```mermaid
graph LR
A[Input] --> B[Layer Norm]
B --> C[Multi-Head Attention]
C --> D[Residual Add]
A --> D
D --> E[Layer Norm]
E --> F[Feed Forward]
F --> G[Residual Add]
D --> G
G --> H[Output]
style C fill:#ffccbc
style F fill:#ffccbc
style D fill:#b2dfdb
style G fill:#b2dfdb
```
Self-attention is the core innovation of transformers: it lets the model weigh the importance of every other word when processing each word.
Mathematical Formula:
```
Attention(Q, K, V) = softmax(QK^T / √d_k) V
```
Where:
- Q (Query): "What am I looking for?"
- K (Key): "What do I contain?"
- V (Value): "What information do I have?"
- d_k: Dimension of key vectors (for scaling)
Intuition: When processing the word "it" in "The cat sat on the mat because it was tired", attention helps the model understand that "it" refers to "cat".
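To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product attention (illustrative names and shapes; not the exact code in `gpt_numpy.py`):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability (log-sum-exp trick)
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) -> output: (seq_len, d_k)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)  # each row is a distribution over positions
    return weights @ V                  # weighted sum of value vectors

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 5, 8))  # three (5, 8) matrices
print(attention(Q, K, V).shape)           # (5, 8)
```

(GPT additionally applies a causal mask so a position cannot attend to future positions; that is omitted here for brevity.)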
Instead of a single attention operation, we run multiple attention "heads" in parallel, each learning different relationships.
Formula:
```
MultiHead(Q, K, V) = Concat(head₁, ..., headₕ)W^O
where head_i = Attention(QW^Q_i, KW^K_i, VW^V_i)
```
Why? Different heads can capture different types of relationships:
- Head 1: Syntactic relationships (subject-verb)
- Head 2: Semantic relationships (synonyms)
- Head 3: Long-range dependencies
- etc.
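In practice all heads are usually computed from one projection and separated with a reshape. A sketch reusing the `attention` helper above (weight shapes are illustrative, and `d_model` is assumed divisible by `num_heads`):

```python
import numpy as np

def multi_head_attention(x, W_qkv, W_o, num_heads):
    # x: (seq_len, d_model); W_qkv: (d_model, 3*d_model); W_o: (d_model, d_model)
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    Q, K, V = np.split(x @ W_qkv, 3, axis=-1)    # each (seq_len, d_model)
    # (seq_len, d_model) -> (num_heads, seq_len, d_head): one slice per head
    split = lambda t: t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    Q, K, V = split(Q), split(K), split(V)
    heads = [attention(Q[h], K[h], V[h]) for h in range(num_heads)]
    return np.concatenate(heads, axis=-1) @ W_o  # concat heads, then project
```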
Since attention has no inherent concept of order, we add positional information to embeddings.
Method: Each position gets a unique learnable embedding vector that's added to the token embedding.
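A sketch of the lookup-and-add (the embedding matrix names here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_seq_len, d_model = 65, 64, 128
token_emb = rng.standard_normal((vocab_size, d_model)) * 0.02  # learnable
pos_emb = rng.standard_normal((max_seq_len, d_model)) * 0.02   # learnable

x = np.array([5, 42, 7])             # token indices, seq_len = 3
E = token_emb[x] + pos_emb[:len(x)]  # (3, d_model): token + position info
```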
Layer normalization rescales activations to mean 0 and variance 1, which stabilizes training.
Formula:
```
LayerNorm(x) = γ * (x - μ) / √(σ² + ε) + β
```
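A direct, forward-only NumPy translation of the formula (a sketch; parameter shapes are illustrative):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each position's feature vector to zero mean, unit variance,
    # then rescale and shift with the learnable gamma and beta
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta
```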
Residual connections add the input of a sub-layer to its output: `output = SubLayer(x) + x`
Why? This helps gradients flow in deep networks, preventing the vanishing gradient problem.
The feed-forward network applies position-wise fully connected layers that process each position independently.
Formula:
```
FFN(x) = GELU(xW₁ + b₁)W₂ + b₂
```
Typically expands to 4x the model dimension, then projects back.
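A forward-only sketch of the formula, assuming the common tanh approximation of GELU (the exact variant in `gpt_numpy.py` may differ):

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU, as used in the original GPT models
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward(x, W1, b1, W2, b2):
    # x: (seq_len, d_model); W1: (d_model, d_ff); W2: (d_ff, d_model)
    return gelu(x @ W1 + b1) @ W2 + b2  # expand to d_ff, then project back
```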
```
gpt-numpy/
├── gpt_numpy.py           # Core GPT implementation
│   ├── Embedding          # Token and positional embeddings
│   ├── LayerNorm          # Layer normalization
│   ├── GELU               # Activation function
│   ├── Linear             # Fully connected layer
│   ├── MultiHeadAttention # Attention mechanism
│   ├── FeedForward        # Position-wise FFN
│   ├── TransformerBlock   # Complete transformer layer
│   └── GPT                # Full model
│
├── train.py               # Training infrastructure
│   ├── AdamOptimizer      # Adam optimization algorithm
│   ├── TextDataset        # Character-level data loader
│   ├── cross_entropy_loss # Loss computation
│   └── train()            # Training loop
│
├── example.py             # Quick demonstration
├── data/
│   └── shakespeare.txt    # Sample training data
└── README.md              # This file
```
```bash
# Only numpy is required!
pip install numpy

# Optional: for visualization
pip install matplotlib
```

Run the example script to see the model in action:
```bash
python example.py
```

This will:
- Create a small GPT model
- Train it for 50 epochs (quick demo)
- Show generations before and after training
- Save and load the model
For better results, run the full training script:
```bash
python train.py
```

Default settings:
- Model size: 128-dimensional embeddings
- Layers: 4 transformer blocks
- Attention heads: 4
- Training epochs: 500
- Batch size: 32
Expected results:
- Training should take 10-30 minutes on CPU (depending on your machine)
- Loss should decrease from ~4.0 to ~1.5-2.0
- Generated text will mimic Shakespeare's style
Replace data/shakespeare.txt with your own text file:
```python
# In train.py, change the data path:
data_path = "data/your_custom_text.txt"
```

The forward pass, for an input sequence of token indices `x`:
1. Embedding:

   ```python
   E_token = EmbeddingMatrix[x]
   E_pos = PositionalEmbedding[0, 1, 2, ..., seq_len - 1]
   E = E_token + E_pos
   ```

2. Transformer Blocks (repeated N times):

   ```python
   # Attention sub-layer
   norm1 = LayerNorm(E)
   attn_out = MultiHeadAttention(norm1)
   E = E + attn_out  # Residual

   # Feed-forward sub-layer
   norm2 = LayerNorm(E)
   ff_out = FeedForward(norm2)
   E = E + ff_out  # Residual
   ```

3. Output:

   ```python
   final = LayerNorm(E)
   logits = Linear(final)  # Project to vocabulary size
   ```
The gradient flows backward through:
- Output projection
- Final layer norm
- Each transformer block (in reverse)
- Embeddings
Each component computes:
- Gradient w.r.t. its input
- Gradients w.r.t. its learnable parameters
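As a concrete instance of that pattern, here is a sketch of the backward pass for a linear layer `y = x @ W + b` (illustrative, not the exact `Linear` code):

```python
import numpy as np

def linear_backward(x, W, grad_out):
    # grad_out is dL/dy with shape (seq_len, d_out); forward was y = x @ W + b
    grad_x = grad_out @ W.T        # dL/dx: flows to the previous layer
    grad_W = x.T @ grad_out        # dL/dW: consumed by the optimizer
    grad_b = grad_out.sum(axis=0)  # dL/db: summed over sequence positions
    return grad_x, grad_W, grad_b
```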
Updates parameters using adaptive learning rates:
```
m_t = β₁ * m_{t-1} + (1 - β₁) * g_t   # First moment
v_t = β₂ * v_{t-1} + (1 - β₂) * g_t²  # Second moment
m̂_t = m_t / (1 - β₁^t)                # Bias correction
v̂_t = v_t / (1 - β₂^t)
θ_t = θ_{t-1} - α * m̂_t / (√v̂_t + ε)  # Update
```
Parameters:
- α: Learning rate (typically 1e-4 to 1e-3)
- β₁: First moment decay (typically 0.9)
- β₂: Second moment decay (typically 0.999)
- ε: Numerical stability constant (1e-8)
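The update equations condense to a few lines of NumPy. A sketch of one step for a single parameter array (not the exact `AdamOptimizer` in `train.py`, which manages this state per parameter):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # t is the 1-based step count used for bias correction
    m = beta1 * m + (1 - beta1) * grad      # first moment (running mean)
    v = beta2 * v + (1 - beta2) * grad**2   # second moment (running variance)
    m_hat = m / (1 - beta1**t)              # correct zero-initialization bias
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```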
Working through this codebase, you will learn:

1. Transformer Architecture
   - How attention works mathematically
   - Why multi-head attention is powerful
   - Role of layer normalization and residual connections

2. Training Deep Networks
   - Backpropagation through complex architectures
   - Gradient computation for matrix operations
   - Optimization with Adam

3. Language Modeling (see the generation sketch after this list)
   - Autoregressive generation
   - Next-token prediction
   - Character-level vs. token-level modeling

4. Implementation Skills
   - Efficient NumPy operations
   - Managing gradients in custom layers
   - Numerical stability considerations
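A sketch of the autoregressive loop behind next-token prediction (`model` here is a hypothetical callable returning logits of shape `(seq_len, vocab_size)`; the repo's actual generation API may differ):

```python
import numpy as np

def generate(model, tokens, num_new, temperature=1.0, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(num_new):
        logits = model(np.array(tokens))[-1]   # logits for the next token only
        scaled = logits / temperature          # T < 1: safer; T > 1: more random
        probs = np.exp(scaled - scaled.max())  # stable softmax
        probs /= probs.sum()
        tokens.append(int(rng.choice(len(probs), p=probs)))
    return tokens
```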
| File | Focus Area | Key Concepts |
|---|---|---|
| `gpt_numpy.py` → `MultiHeadAttention` | Attention mechanism | Scaled dot-product attention, queries/keys/values |
| `gpt_numpy.py` → `TransformerBlock` | Architecture design | Residual connections, layer norm placement |
| `train.py` → `AdamOptimizer` | Optimization | Adaptive learning rates, momentum |
| `train.py` → `cross_entropy_loss` | Training objectives | Likelihood maximization, softmax |
In train.py or example.py:
```python
model = GPT(
    vocab_size=dataset.vocab_size,
    d_model=256,      # Increase for larger model
    num_layers=6,     # More layers = deeper network
    num_heads=8,      # More heads = more parallel attention
    d_ff=1024,        # Larger FFN = more capacity
    max_seq_len=128,  # Longer context window
    dropout=0.1       # Regularization
)

train(
    model=model,
    dataset=dataset,
    num_epochs=1000,     # More epochs = better convergence
    batch_size=64,       # Larger batch = more stable gradients
    learning_rate=1e-4,  # Lower LR = slower but more stable
    eval_interval=50,    # How often to evaluate
    eval_samples=10      # More samples = better val estimate
)
```

| Config | Parameters | Training Time | Quality |
|---|---|---|---|
| Tiny (d=64, L=2) | ~50K | 5 min | Basic patterns |
| Small (d=128, L=4) | ~200K | 20 min | Decent coherence |
| Medium (d=256, L=6) | ~1M | 1-2 hours | Good quality |
| Large (d=512, L=12) | ~10M | 8+ hours | High quality |
*Times are approximate for CPU training on the Shakespeare dataset.*
- GPU Acceleration: While this uses only NumPy, you could adapt it to use CuPy for GPU support
- Gradient Clipping: Add to prevent exploding gradients (see the sketch after this list)
- Learning Rate Scheduling: Decay learning rate over time
- Larger Dataset: More data = better generalization
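For the gradient-clipping idea above, a common global-norm recipe (a sketch; not currently part of the codebase):

```python
import numpy as np

def clip_gradients(grads, max_norm=1.0):
    # Rescale all gradient arrays together if their combined L2 norm is too large
    total_norm = np.sqrt(sum((g ** 2).sum() for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads
```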
**Problem: Loss becomes NaN or explodes**

Causes:
- Learning rate too high
- Numerical instability in softmax
Solutions:
```python
# Reduce learning rate
optimizer = AdamOptimizer(learning_rate=1e-4)

# Numerical instability in softmax is already handled in code
# via the log-sum-exp trick
```
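For reference, a minimal sketch of that trick in a cross-entropy loss (illustrative; not necessarily identical to `cross_entropy_loss` in `train.py`):

```python
import numpy as np

def stable_log_softmax(logits):
    # Subtracting the max keeps np.exp finite without changing the result
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def cross_entropy(logits, targets):
    # Mean negative log-likelihood of the correct token at each position
    log_probs = stable_log_softmax(logits)
    return -log_probs[np.arange(len(targets)), targets].mean()
```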
**Problem: Generated text is repetitive or nonsensical**

Causes:
- Model not trained enough
- Temperature too low
Solutions:
```python
# Train longer
num_epochs = 1000

# Increase temperature for more randomness
model.generate(tokens, temperature=1.0)  # Try 0.8 to 1.2
```

**Problem: Out of memory**

Solutions:
- Reduce model size
- Reduce batch size
- Use shorter sequences
- Consider GPU implementation
1. "Attention Is All You Need" (Vaswani et al., 2017)
   - Original transformer paper
   - https://arxiv.org/abs/1706.03762

2. "Language Models are Unsupervised Multitask Learners" (Radford et al., 2019)
   - GPT-2 paper

3. "Adam: A Method for Stochastic Optimization" (Kingma & Ba, 2014)
   - Adam optimizer
   - https://arxiv.org/abs/1412.6980
This is an educational project. Feel free to:
- Add more documentation
- Improve code comments
- Add visualization tools
- Optimize performance
- Fix bugs
MIT License - Feel free to use for learning and teaching!
Inspired by:
- Andrej Karpathy's educational content
- The original transformer and GPT papers
- The NumPy community
After mastering this implementation:
1. Implement variants:
   - BERT (bidirectional transformer)
   - GPT-2/3 (larger models)
   - Vision Transformer (ViT)

2. Add features:
   - Beam search for generation
   - Top-k / nucleus sampling
   - Attention visualization
   - Gradient checkpointing

3. Scale up:
   - Convert to PyTorch/JAX
   - Implement data parallelism
   - Use larger datasets (e.g., WikiText)
   - Fine-tune on specific tasks

4. Explore applications:
   - Text generation
   - Machine translation
   - Question answering
   - Code generation
Happy Learning! 🚀