This is an LLM learning project inspired by nanoGPT and Stanford CS336. It aims to implement the entire LLM training pipeline from scratch, including tokenizer training, data cleaning, model pre-training, SFT, GRPO, and more.
[2025.07.12]: Added a comprehensive guide to CS336 Assignment 1.
[2025.07.10]: Added code for training tokenizers from scratch.
[2025.07.08]: Added code for pretraining LLMs from scratch with a custom-trained tokenizer.
[2025.07.07]: nanoQwen: Added from-scratch Qwen2.5 implementations and enabled loading of pretrained models from Hugging Face.
1. uv run python -m scripts.train_tokenizer (takes 3 minutes)
2. uv run python -m scripts.tokenize (takes 6 minutes)
3. uv run python -m scripts.pretrain (takes 35 minutes)
4. uv run python -m scripts.eval_pretrain

1. Download the pretrained Qwen2.5 weights from Hugging Face into the huggingface_models folder.
2. Run uv run python -m scripts.test_qwen2_5 to load the open-source weights into your own from-scratch implementation of the LLM and generate text.

To be updated.
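The four pipeline commands (train_tokenizer, tokenize, pretrain, eval_pretrain) can also be chained from a small driver script. The following is only a sketch: the module names come from the commands in this README, while build_commands, run_pipeline, and the dry_run flag are hypothetical helpers that are not part of the repository.

```python
import shlex
import subprocess

# Module names are taken from the commands listed in this README.
# The helper names and the dry_run flag are hypothetical.
PIPELINE = [
    "scripts.train_tokenizer",
    "scripts.tokenize",
    "scripts.pretrain",
    "scripts.eval_pretrain",
]


def build_commands(modules=PIPELINE):
    """Return one `uv run python -m <module>` argv list per pipeline step."""
    return [["uv", "run", "python", "-m", module] for module in modules]


def run_pipeline(dry_run=True):
    """Print each step in order and, unless dry_run is set, execute it."""
    executed = []
    for cmd in build_commands():
        print("$", shlex.join(cmd))
        if not dry_run:
            # check=True stops the pipeline as soon as one step fails.
            subprocess.run(cmd, check=True)
        executed.append(cmd)
    return executed


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

With dry_run=True the script only prints the commands; pass dry_run=False to actually run them from the repository root.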
1. Place your training text in the data/txt folder.
2. Edit scripts/configs/train_tokenizer.yaml as needed.
3. Run uv run python -m scripts.test_train_tokenizer to start training your tokenizer from scratch.
4. The trained tokenizer is saved to the tokenizer_dir set in the config file.

To be updated.
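To give a feel for what training a tokenizer from scratch involves, here is a minimal byte-level BPE sketch. It assumes the tokenizer is BPE-based, as in CS336 Assignment 1; this README does not confirm that, and train_bpe and merge_pair are illustrative names, not the repository's API.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges; return (merges, encoded training ids)."""
    # Start from raw UTF-8 bytes, so the base vocabulary is ids 0..255.
    ids = list(text.encode("utf-8"))
    merges = {}  # (left_id, right_id) -> new_id
    for _ in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        # Pick the most frequent adjacent pair; ties go to the larger pair
        # so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        if pairs[best] < 2:
            break  # nothing repeats, so further merges would not compress
        new_id = 256 + len(merges)
        merges[best] = new_id
        ids = merge_pair(ids, best, new_id)
    return merges, ids


if __name__ == "__main__":
    merges, ids = train_bpe("aaabdaaabac", 3)
    print(merges)
    print(ids)
```

A real trainer would typically pre-tokenize the corpus, handle special tokens, and count pairs incrementally instead of rescanning the whole sequence on every merge.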
1. Place your pretraining data in the data folder.
2. Run uv run python -m scripts.test_pretrain to pretrain your own LLM from scratch.
3. Run uv run python -m scripts.test_eval_pretrain to evaluate your pretrained LLM.

To be updated.
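This README does not say which metric the evaluation script reports. A common choice for a pretrained language model is perplexity, the exponential of the mean next-token cross-entropy. The sketch below assumes that metric and uses plain Python lists in place of model outputs; cross_entropy and perplexity are illustrative names only.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def perplexity(all_logits, targets):
    """exp of the mean per-token cross-entropy over a sequence."""
    losses = [cross_entropy(l, t) for l, t in zip(all_logits, targets)]
    return math.exp(sum(losses) / len(losses))


if __name__ == "__main__":
    # A model that is uniform over a 4-token vocabulary has perplexity 4.
    uniform = [[0.0, 0.0, 0.0, 0.0]] * 3
    print(perplexity(uniform, [0, 1, 2]))
```

Lower perplexity means the model assigns higher probability to the held-out text; a model that guesses uniformly scores a perplexity equal to the vocabulary size.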
This repository is licensed under the Apache-2.0 License.