Note
TL;DR: In this repo, we study properties and applications of Engram. More findings are on the way!
If you find TinyEngram useful, a ⭐ helps support the project.
📢 Latest Announcements
- 2026.02.02 — 📌 Released reproduction scripts for Engram vs LoRA experiment.
- 2026.01.30 — 📌 Added comparison of catastrophic forgetting between TinyEngram and LoRA.
- 2026.01.30 — 📌 Added parameter ablation studies of TinyEngram with convergence observations.
- 2026.01.23 — 🎉 Initial TinyEngram commit.
🔍 Quick Navigation
Key Finding 1: Engram as Parameter Efficient Fine-Tuning Method
Key Finding 2: Engram Outperforms LoRA in Catastrophic Forgetting
TinyEngram is an open research project exploring the Engram architecture—an LLM enhancement that boosts phrase-level understanding by integrating a compact N-gram memory module and a gated retrieval mechanism into key transformer layers.
Built on Qwen, TinyEngram provides a lightweight, ready-to-train codebase for anyone to reproduce, experiment with, or extend Engram-style models. We actively share new experiments, training logs, and findings right here—making this repo both a toolkit and a living research notebook.
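To make the idea concrete, here is a minimal PyTorch sketch of what an Engram-style block could look like: a hashed N-gram embedding table queried per token, whose retrieved vectors are blended into the hidden states through a learned sigmoid gate. The class and argument names (`EngramMemory`, `engram_vocab_size`, `num_hash_heads`) are illustrative assumptions, not the repo's actual API.

```python
import torch
import torch.nn as nn

class EngramMemory(nn.Module):
    """Illustrative Engram-style block: hashed N-gram memory + gated retrieval.

    This is a conceptual sketch, not the module implemented in this repo.
    """

    def __init__(self, hidden_size: int, engram_vocab_size: int = 10_000,
                 ngram_order: int = 2, num_hash_heads: int = 2):
        super().__init__()
        self.ngram_order = ngram_order
        self.vocab_size = engram_vocab_size
        # Compact memory table shared by all hash heads.
        self.table = nn.Embedding(engram_vocab_size, hidden_size)
        # Gate controls how much retrieved memory is mixed into each hidden state.
        self.gate = nn.Linear(2 * hidden_size, hidden_size)
        # Random integer weights act as a cheap stand-in for n-gram hashing.
        self.register_buffer(
            "hash_weights",
            torch.randint(1, engram_vocab_size, (num_hash_heads, ngram_order)),
        )

    def forward(self, input_ids: torch.LongTensor, hidden: torch.Tensor) -> torch.Tensor:
        # Stack each token with its predecessors to form n-grams
        # (wrap-around at the sequence start is ignored in this sketch).
        grams = torch.stack(
            [torch.roll(input_ids, shifts=k, dims=1) for k in range(self.ngram_order)],
            dim=-1,
        )                                                                 # (B, T, n)
        # Hash every n-gram into the memory vocabulary, once per hash head.
        idx = (grams.unsqueeze(-2) * self.hash_weights).sum(-1) % self.vocab_size  # (B, T, H)
        retrieved = self.table(idx).mean(dim=-2)                          # (B, T, D)
        # Gated residual: the model learns per token how much memory to trust.
        g = torch.sigmoid(self.gate(torch.cat([hidden, retrieved], dim=-1)))
        return hidden + g * retrieved
```

The sigmoid gate is what lets the base model fall back on its own representations whenever the memory lookup is unhelpful.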
Tip
Join the Research: Feel free to raise questions in the Issues; we will gladly burn our own GPUs researching any interesting ones. Join us in evolving how LLMs remember what matters! 🧠✨
Training Setup
We insert several Engram modules into the decoder layers of Qwen and fine-tune them on a subset of the Biomed-Enriched dataset. Only the added parameters are trainable during fine-tuning.
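A minimal sketch of that setup, assuming the illustrative `EngramMemory` class above and a standard Hugging Face model layout (`model.model.layers`, `Qwen/Qwen3-0.6B`): every pre-trained weight is frozen and only the injected modules remain trainable. Wiring each module into its layer's forward pass is omitted here.

```python
from transformers import AutoModelForCausalLM

# Hypothetical setup: freeze the base model, attach Engram modules to a few decoder layers.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")
target_layers = [8, 16, 24]  # illustrative choice of decoder layers

# 1) Freeze every pre-trained parameter.
for p in model.parameters():
    p.requires_grad = False

# 2) Attach an Engram module to each target layer; the new parameters stay trainable.
#    (Hooking the module into the layer's forward pass is not shown.)
for i in target_layers:
    model.model.layers[i].engram = EngramMemory(hidden_size=model.config.hidden_size)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
```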
The training and evaluation losses demonstrate robust convergence. This confirms that the Engram module effectively learns specialized biomedical knowledge while preserving the stability of the underlying pre-trained knowledge base.
Training Loss (figure)
Validation Performance
| Biomedical Task | Qwen3-0.6B | Engram SFT |
|---|---|---|
| MMLU_Clinical Knowledge | 0.3358 | 0.4415 |
| MMLU_Medical Genetics | 0.3700 | 0.4400 |
| MMLU_Prof. Medicine | 0.3199 | 0.4559 |
| PubMedQA | 0.5700 | 0.6250 |
- Objective: Verify whether integrating Engram memory harms the model's pre-trained general capabilities while adapting to a new domain.
- Methodology: We fine-tune Qwen on Biomed-Enriched and evaluate the resulting checkpoint on general benchmarks (MMLU, excluding all biomedical-related subtasks).
- Full Results: Click here to view detailed results
| Task Group | Qwen3-0.6B | Engram SFT |
|---|---|---|
| mmlu (overall) | 0.4034 | 0.4500 (⬆️ +0.0466) |
| humanities | 0.4433 | 0.4691 (⬆️ +0.0258) |
| other | 0.4271 | 0.4696 (⬆️ +0.0425) |
| social sciences | 0.4826 | 0.5389 (⬆️ +0.0563) |
| stem | 0.3508 | 0.4088 (⬆️ +0.0580) |
📌 Update (2026.01.30): We have added a new set of experiments comparing Engram and LoRA on catastrophic forgetting. Please refer to Engram vs LoRA Catastrophic Forgetting Experiment for details.
- Objective: Investigate the relationship between Engram memory size (vocabulary size) and performance gains.
- Methodology: Train multiple models with varying `engram_vocab_size` (e.g., 2k vs 10k vs 20k vs 100k) and observe the impact on biomedical validation loss; a rough parameter-count sketch for these capacities follows the results table below.
- Full Results: Larger representation capacities do not necessarily translate into better performance. In our experiments, we observe an apparent trade-off: smaller capacities may suffer from semantic collisions, while larger ones can become difficult to fully utilize given limited data. Click here to view detailed results
| Task | Nano (2k/0.2k) | Small (10k/1k) | Medium (20k/2k) | Large (100k/10k) | Qwen3-0.6B (Baseline) | Winner |
|---|---|---|---|---|---|---|
| MMLU_Clinical Knowledge | 0.3736 | 0.4415 | 0.4302 | 0.4226 | 0.3358 | Small 🏆 |
| MMLU_Medical Genetics | 0.3900 | 0.4400 | 0.4400 | 0.4100 | 0.3700 | Small/Med 🤝 |
| MMLU_Prof. Medicine | 0.4081 | 0.4559 | 0.4228 | 0.4412 | 0.3199 | Small 🏆 |
| PubMedQA | 0.6240 | 0.6250 | 0.6170 | 0.6150 | 0.5700 | Small 🏆 |
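For a concrete sense of scale, the sketch below reuses the illustrative `EngramMemory` class from above to count how many parameters each swept capacity adds per module at a hidden size of 1024 (Qwen3-0.6B); the numbers are indicative only and do not correspond to the repo's exact configs.

```python
# Rough parameter-count sweep over the ablated memory capacities (illustrative).
HIDDEN_SIZE = 1024  # hidden size of the base model (1024 for Qwen3-0.6B)

for name, vocab in [("Nano", 2_000), ("Small", 10_000),
                    ("Medium", 20_000), ("Large", 100_000)]:
    mem = EngramMemory(hidden_size=HIDDEN_SIZE, engram_vocab_size=vocab)
    added = sum(p.numel() for p in mem.parameters())
    print(f"{name:<6} vocab={vocab:>7,}  added params per module: {added:,}")
```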
📌 Update (2026.01.30):
We have expanded our study with a comprehensive ablation of TinyEngram’s configurable hyperparameters. Please refer to Engram Systematic Hyperparameter Tuning Experiment for details.
To reproduce the experiments conducted in Key Finding 1, please refer to this guide.
LoRA is the de facto PEFT method, so how does Engram compare? We also conduct systematic hyperparameter tuning to better understand Engram.
Preliminary observation: In our experiments, Engram shows noticeably better resistance to catastrophic forgetting than LoRA.
| Model Architecture | Adaptation Metric (Eval Loss) | General Capability (TruthfulQA MC1) | General Capability (TruthfulQA MC2) | Δ MC2 |
|---|---|---|---|---|
| Qwen3-0.6B (Base) | N/A | 0.2583 | 0.4269 | - |
| LoRA (Rank 16) | 0.1862 | 0.2485 | 0.4078 | -1.91% |
| TinyEngram | 0.1850 | 0.2644 | 0.4340 | +0.71% |
It is worth noting that LoRA generally converges faster. In our experiments, LoRA could reach an even lower loss (0.1458) quickly, but the trade-off was severe: catastrophic forgetting worsened significantly (MC1: 0.2472, MC2: 0.3993). Engram provides a safer learning path.
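For reference, the rank-16 LoRA baseline can be set up with Hugging Face PEFT roughly as below; the rank matches the table, while alpha, dropout, and target modules are illustrative choices rather than the exact settings of our runs.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B")

# Rank-16 LoRA baseline; alpha/dropout/target modules are illustrative.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
lora_model = get_peft_model(base, lora_cfg)
lora_model.print_trainable_parameters()
```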
We fine-tune models on "poisoned" function-call-style data (see processing script) based on the glaive-function-calling-v2 dataset, which encourages a strong bias toward structured function-call outputs. We then evaluate both LoRA and Engram on TruthfulQA, a natural language QA benchmark, to examine how well they retain general-language capabilities under this distribution shift. Click here to view detailed results.
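As a rough illustration of the data side, the snippet below loads the public dataset with 🤗 Datasets and keeps only conversations that contain a function call; the dataset id, record layout, and the `<functioncall>` marker are assumptions on our part, and the repo's processing script remains the authoritative reference.

```python
from datasets import load_dataset

# Dataset id and record layout are assumed; see the repo's processing script
# for the pipeline actually used in the experiments.
ds = load_dataset("glaiveai/glaive-function-calling-v2", split="train")
print(ds.column_names)

def has_function_call(example):
    # The dataset marks tool invocations with a "<functioncall>" tag (assumed).
    return "<functioncall>" in str(example)

# Keep only function-call samples so fine-tuning is strongly biased toward
# structured outputs, then take a small subset for the small-scale runs.
poisoned = ds.filter(has_function_call).select(range(5_000))
print(f"kept {len(poisoned)} samples")
```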
During initial trials, we observed that LoRA converges faster than the default Engram configuration. To enable a scientifically sound comparison, we conducted a systematic hyperparameter study to calibrate Engram such that it reaches evaluation loss levels comparable to LoRA on the same training data.
Using the small-scale, filtered glaive-function-calling-v2 dataset, we ablated key Engram parameters beyond vocabulary size (a sketch of such a sweep follows the list), including:
- N-gram order
- Vocabulary size
- Embedding dimension per n-gram
- Number of hash heads per n-gram
- Target layer(s) for Engram injection
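Such a sweep could be enumerated roughly as follows; the key names and value ranges are hypothetical placeholders, not the repo's actual configuration schema.

```python
from itertools import product

# Hypothetical sweep grid over the ablated Engram hyperparameters.
# Names mirror the list above but are illustrative, not real config keys.
grid = {
    "ngram_order":        [2, 3],
    "engram_vocab_size":  [2_000, 10_000, 20_000, 100_000],
    "embed_dim_per_gram": [64, 128, 256],
    "num_hash_heads":     [1, 2, 4],
    "target_layers":      [[8], [8, 16], [8, 16, 24]],
}

configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(f"{len(configs)} candidate configurations")
print(configs[0])  # first point in the sweep
```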
We hope this experiment can serve as a solid starting point for parameter selection in similar small-scale supervised fine-tuning (SFT) scenarios. 🔗 Click here to view detailed results.
To reproduce the experiments conducted in Key Finding 2, please refer to this guide.
| Category | Item | Status |
|---|---|---|
| Engram as PEFT | Engram works | ✅ |
| | Catastrophic Forgetting | ✅ |
| | Vocabulary Scalability | ✅ |
| | vs LoRA | ✅ |
| | Hyperparameter Tuning | ✅ |
| More | More | ⬜ |
We borrowed a lot of code from the following excellent projects:
We thank the authors of the training datasets that support our research:
If you find TinyEngram useful for your research or projects, please cite us:
@software{tinyengram,
author = {Runyuan Cai and Yiming Wang and Yu Lin and Xiaodong Zeng},
title = {TinyEngram},
year = {2026},
version = {0.1.0},
url = {https://github.com/AutoArk/tinyengram},
note = {GitHub repository}
}