Load → Quantize → Fine-tune → Export · Any LLM · One Line Each
Quick Start • Features • Examples • Models • Docs
| Challenge | Without QuantLLM | With QuantLLM |
|---|---|---|
| Loading 7B model | 50+ lines of config | turbo("model") |
| Quantization setup | Complex BitsAndBytes config | Automatic |
| Fine-tuning | LoRA config + Trainer setup | model.finetune(data) |
| GGUF export | Manual llama.cpp workflow | model.export("gguf") |
| Memory management | Manual offloading code | Built-in |
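For a sense of what the left column hides, here is a condensed sketch of the manual 4-bit path using plain transformers and bitsandbytes (the model name is only an example; a real setup usually also involves tokenizer padding, generation config, and memory tuning):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Manual 4-bit quantization config of the kind turbo("model") is meant to replace.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)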
QuantLLM handles the complexity so you can focus on building.
# From GitHub (recommended)
pip install git+https://github.com/codewithdark-git/QuantLLM.git
# With all features
pip install "quantllm[full] @ git+https://github.com/codewithdark-git/QuantLLM.git"from quantllm import turbo
# Load with automatic 4-bit quantization, Flash Attention, optimal settings
model = turbo("meta-llama/Llama-3-8B")
# Generate text
print(model.generate("Explain quantum computing in simple terms"))

That's it. QuantLLM automatically:
- ✅ Detects your GPU and memory
- ✅ Chooses optimal quantization (4-bit on most GPUs)
- ✅ Enables Flash Attention 2 if available
- ✅ Configures batch size and memory management
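Conceptually, the choice comes down to fitting the largest usable precision into the memory that is actually free. The following is only a simplified illustration of that kind of heuristic, not QuantLLM's actual code:

from typing import Optional

import torch

def pick_bits() -> Optional[int]:
    # Illustrative heuristic: choose a quantization width from free GPU memory.
    if not torch.cuda.is_available():
        return 4  # CPU-only: stick to small, heavily quantized models
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    free_gb = free_bytes / 1024**3
    if free_gb >= 40:
        return None  # enough headroom to run unquantized in fp16/bf16
    if free_gb >= 16:
        return 8
    return 4  # typical 6-12 GB consumer GPUs

print(pick_bits())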
# One line - everything automatic
model = turbo("mistralai/Mistral-7B")
# Override if needed
model = turbo("Qwen/Qwen2-7B", bits=4, max_length=8192) |
Supports Llama 2/3, Mistral, Mixtral, Qwen/Qwen2, Phi-1/2/3, Gemma, Falcon, GPT-NeoX, StableLM, ChatGLM, Yi, DeepSeek, InternLM, Baichuan, StarCoder, BLOOM, OPT, MPT, and more.
from quantllm import turbo
model = turbo("meta-llama/Llama-3-8B")
# Simple generation
response = model.generate(
"Write a Python function to calculate fibonacci numbers",
max_new_tokens=200,
temperature=0.7,
)
print(response)
# Chat format
messages = [
{"role": "system", "content": "You are a helpful coding assistant."},
{"role": "user", "content": "How do I read a file in Python?"},
]
response = model.chat(messages)
print(response)

from quantllm import turbo
model = turbo("mistralai/Mistral-7B")
# Simple - everything auto-configured
model.finetune("training_data.json", epochs=3)
# Advanced - full control
model.finetune(
"training_data.json",
epochs=5,
learning_rate=2e-4,
lora_r=32,
lora_alpha=64,
batch_size=4,
output_dir="./fine-tuned-model",
)

Supported data formats:
[
{"instruction": "What is Python?", "output": "Python is a programming language..."},
{"text": "Full text for language modeling"},
{"prompt": "Question here", "completion": "Answer here"}
]
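For example, a file in the first (instruction/output) format can be produced with just the standard library; the records and file name below are placeholders matching the fine-tuning call above:

import json

# Placeholder training records in the instruction/output format.
records = [
    {"instruction": "What is Python?", "output": "Python is a programming language..."},
    {"instruction": "Name a Python web framework.", "output": "Django is a popular choice."},
]
with open("training_data.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)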
model = turbo("microsoft/phi-3-mini")
# GGUF for llama.cpp / Ollama / LM Studio
model.export("gguf", "phi3-q4.gguf", quantization="Q4_K_M")
# GGUF quantization options:
# Q2_K, Q3_K_S, Q3_K_M, Q4_0, Q4_K_S, Q4_K_M, Q5_0, Q5_K_M, Q6_K, Q8_0
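# Rule of thumb: lower quants (Q2/Q3) are smallest but lossiest, Q4_K_M is a common
# quality/size balance, and Q8_0 is close to full precision.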
# ONNX for TensorRT / ONNX Runtime
model.export("onnx", "phi3.onnx")
# SafeTensors for HuggingFace
model.export("safetensors", "./phi3-hf/")
# MLX for Apple Silicon Macs
model.export("mlx", "./phi3-mlx/", quantization="4bit")from quantllm import turbo
from quantllm.hub import QuantLLMHubManager
# Load and fine-tune
model = turbo("microsoft/phi-2")
model.finetune("my_data.json", epochs=3)
# Setup Hub manager
manager = QuantLLMHubManager(
repo_id="username/my-fine-tuned-model",
hf_token="your_token",
)
# Track training
manager.track_hyperparameters({
"learning_rate": 0.001,
"epochs": 3,
"base_model": "microsoft/phi-2",
})
# Save and push
manager.save_final_model(model.model, format="safetensors")
manager.push()

QuantLLM supports 45+ model architectures out of the box:
| Category | Models |
|---|---|
| Llama Family | Llama 2, Llama 3, CodeLlama |
| Mistral Family | Mistral 7B, Mixtral 8x7B |
| Qwen Family | Qwen, Qwen2, Qwen2-MoE |
| Microsoft | Phi-1, Phi-2, Phi-3 |
| Google | Gemma, Gemma 2 |
| Falcon | Falcon 7B/40B/180B |
| Code Models | StarCoder, StarCoder2, CodeGen |
| Chinese | ChatGLM, Yi, Baichuan, InternLM |
| Other | DeepSeek, StableLM, MPT, BLOOM, OPT, GPT-NeoX |
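If you are unsure which family a checkpoint falls into, its architecture can be read from the model config with plain transformers before loading (this is independent of QuantLLM; the model name is only an example):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")
print(config.model_type)     # e.g. "mistral"
print(config.architectures)  # e.g. ["MistralForCausalLM"]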
# Basic installation
pip install git+https://github.com/codewithdark-git/QuantLLM.git
# With GGUF export support
pip install "quantllm[gguf] @ git+https://github.com/codewithdark-git/QuantLLM.git"
# With Triton kernels (Linux only)
pip install "quantllm[triton] @ git+https://github.com/codewithdark-git/QuantLLM.git"
# With Flash Attention
pip install "quantllm[flash] @ git+https://github.com/codewithdark-git/QuantLLM.git"
# Full installation (all features)
pip install "quantllm[full] @ git+https://github.com/codewithdark-git/QuantLLM.git"
# Hub lifecycle (for HuggingFace integration)
pip install git+https://github.com/codewithdark-git/huggingface-lifecycle.git

| Configuration | RAM | GPU VRAM | Recommended For |
|---|---|---|---|
| 🟢 CPU Only | 8GB+ | None | Testing, small models (1-3B) |
| 🔵 Entry GPU | 16GB | 6-8GB | 7B models (4-bit) |
| 🟣 Mid-Range | 32GB | 12-24GB | 13B-30B models |
| 🟠 High-End | 64GB+ | 24-80GB | 70B+ models |
- NVIDIA: RTX 3060, 3070, 3080, 3090, 4070, 4080, 4090, A100, H100
- AMD: RX 7900 XTX (with ROCm)
- Apple: M1, M2, M3 (via MLX export)
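A rough rule of thumb behind the table: quantized weights take about parameter count × bits ÷ 8 bytes, with extra headroom needed for activations and the KV cache. The helper below is only an illustrative estimate, not a QuantLLM API:

def estimate_weight_gb(params_billion: float, bits: int = 4) -> float:
    # Size of the quantized weights alone; activations and KV cache come on top.
    return params_billion * 1e9 * (bits / 8) / 1024**3

# A 7B model at 4-bit is ~3.3 GB of weights, which is why it fits an 8 GB GPU.
print(f"{estimate_weight_gb(7, bits=4):.1f} GB")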
| Resource | Description |
|---|---|
| 📖 Examples | Working code examples |
| 📚 API Reference | Full API documentation |
| 🎓 Tutorials | Step-by-step guides |
| 🐛 Issues | Report bugs |
quantllm/
├── core/ # Core functionality
│ ├── turbo_model.py # Main TurboModel API
│ ├── smart_config.py # Auto-configuration
│ ├── hardware.py # Hardware detection
│ ├── compilation.py # torch.compile integration
│ ├── flash_attention.py # Flash Attention 2
│ ├── memory.py # Memory optimization
│ ├── training.py # Training utilities
│ └── export.py # Universal exporter
├── kernels/ # Custom kernels
│ └── triton/ # Triton fused kernels
├── quant/ # Quantization
│ ├── gguf_converter.py # GGUF export (45 models)
│ └── quantization_engine.py
├── hub/ # HuggingFace integration
│ └── hf_manager.py # Lifecycle management
└── utils/ # Utilities
We welcome contributions! Here's how to get started:
# Clone the repository
git clone https://github.com/codewithdark-git/QuantLLM.git
cd QuantLLM
# Install in development mode
pip install -e ".[dev]"
# Run tests
pytest
# Format code
black quantllm/
isort quantllm/

- 🆕 New model architecture support
- 🔧 Performance optimizations
- 📚 Documentation improvements
- 🐛 Bug fixes
- ✨ New export formats
Coming soon! We're working on comprehensive benchmarks comparing:
- Inference speed vs vanilla transformers
- Memory usage comparisons
- Quantization quality metrics
- Export format performance
MIT License - see LICENSE for details.
Made with ❤️ by Dark Coder