QuantLLM is a Python library designed for developers, researchers, and teams who want to fine-tune and deploy large language models (LLMs) efficiently using 4-bit and 8-bit quantization techniques.

🧠 QuantLLM v2.0

🚀 One Line to Rule Them All


Load → Quantize → Fine-tune → Export · Any LLM · One Line Each

Quick Start · Features · Examples · Models · Docs


🤔 Why QuantLLM?

| Challenge | Without QuantLLM | With QuantLLM |
|---|---|---|
| Loading 7B model | 50+ lines of config | turbo("model") |
| Quantization setup | Complex BitsAndBytes config | Automatic |
| Fine-tuning | LoRA config + Trainer setup | model.finetune(data) |
| GGUF export | Manual llama.cpp workflow | model.export("gguf") |
| Memory management | Manual offloading code | Built-in |

QuantLLM handles the complexity so you can focus on building.


⚡ Quick Start

Installation

# From GitHub (recommended)
pip install git+https://github.com/codewithdark-git/QuantLLM.git

# With all features
pip install "quantllm[full] @ git+https://github.com/codewithdark-git/QuantLLM.git"

Your First Model in 3 Lines

from quantllm import turbo

# Load with automatic 4-bit quantization, Flash Attention, optimal settings
model = turbo("meta-llama/Llama-3-8B")

# Generate text
print(model.generate("Explain quantum computing in simple terms"))

That's it. QuantLLM automatically:

  • ✅ Detects your GPU and memory
  • ✅ Chooses optimal quantization (4-bit on most GPUs; see the sketch after this list)
  • ✅ Enables Flash Attention 2 if available
  • ✅ Configures batch size and memory management
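
Under the hood, that auto-configuration starts with a hardware probe. The snippet below is only a rough sketch of the idea using public torch APIs — the actual detection logic lives in quantllm/core/hardware.py and may differ:

import torch

# Illustrative only: the kind of probe that drives the automatic defaults.
def probe_hardware():
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        vram_gb = props.total_memory / 1024**3
        supports_flash = props.major >= 8        # Flash Attention 2 needs Ampere or newer
        bits = 4 if vram_gb < 24 else 8          # favor 4-bit on smaller cards
        return {"device": "cuda", "vram_gb": round(vram_gb, 1),
                "bits": bits, "flash_attention": supports_flash}
    return {"device": "cpu", "bits": 8, "flash_attention": False}

print(probe_hardware())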

✨ Features

🎯 Ultra-Simple API

# One line - everything automatic
model = turbo("mistralai/Mistral-7B")

# Override if needed
model = turbo("Qwen/Qwen2-7B", bits=4, max_length=8192)

⚡ Speed Optimizations

  • Triton Kernels - Fused dequant+matmul
  • torch.compile - Graph optimization
  • Flash Attention 2 - Fast attention
  • Weight Caching - No re-dequantization
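
Flash Attention 2 and torch.compile in particular map onto standard Transformers/PyTorch switches; turbo() flips them for you when the hardware allows. A minimal sketch of the equivalent manual setup (assumes transformers, accelerate, and flash-attn are installed):

import torch
from transformers import AutoModelForCausalLM

# Roughly what enabling Flash Attention 2 and torch.compile looks like by hand.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # needs flash-attn and an Ampere+ GPU
    device_map="auto",
)
model = torch.compile(model)  # graph-level optimization of the forward pass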

🧠 45+ Model Architectures

Llama 2/3, Mistral, Mixtral, Qwen/Qwen2, Phi-1/2/3, Gemma, Falcon, GPT-NeoX, StableLM, ChatGLM, Yi, DeepSeek, InternLM, Baichuan, StarCoder, BLOOM, OPT, MPT...

📦 6 Export Formats

  • GGUF - llama.cpp, Ollama, LM Studio
  • ONNX - ONNX Runtime, TensorRT
  • SafeTensors - HuggingFace
  • MLX - Apple Silicon
  • AWQ - AutoAWQ
  • PyTorch - Standard .pt

🔧 Zero-Config Smart Defaults

  • Hardware auto-detection (GPU, memory, capabilities)
  • Optimal quantization selection
  • Automatic batch size calculation (see the sketch after this list)
  • Memory-aware loading
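
The batch size calculation is essentially a memory budget. A toy illustration (the per-sample cost here is hypothetical; the library's real heuristic is more involved):

import torch

def suggest_batch_size(per_sample_gb: float = 1.5) -> int:
    """Toy heuristic: fit as many samples as the free VRAM allows, with headroom."""
    if not torch.cuda.is_available():
        return 1
    free_bytes, _total = torch.cuda.mem_get_info()
    free_gb = free_bytes / 1024**3
    return max(1, int((free_gb * 0.8) // per_sample_gb))  # keep ~20% safety margin

print(suggest_batch_size())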

💾 Memory Optimizations

  • Dynamic CPU ↔ GPU offloading
  • Gradient checkpointing
  • CPU optimizer states
  • Layer-wise memory tracking
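
For reference, two of these techniques correspond to standard Transformers/Accelerate features. A hedged sketch of the manual equivalents, which QuantLLM wires up automatically:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    torch_dtype=torch.float16,
    device_map="auto",          # Accelerate splits layers across GPU and CPU as needed
    offload_folder="offload",   # spill weights to disk if CPU RAM also runs out
)
model.gradient_checkpointing_enable()  # trade extra compute for activation memory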

🎮 Usage Examples

Chat with Any Model

from quantllm import turbo

model = turbo("meta-llama/Llama-3-8B")

# Simple generation
response = model.generate(
    "Write a Python function to calculate fibonacci numbers",
    max_new_tokens=200,
    temperature=0.7,
)
print(response)

# Chat format
messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "How do I read a file in Python?"},
]
response = model.chat(messages)
print(response)

Fine-Tune with Your Data

from quantllm import turbo

model = turbo("mistralai/Mistral-7B")

# Simple - everything auto-configured
model.finetune("training_data.json", epochs=3)

# Advanced - full control
model.finetune(
    "training_data.json",
    epochs=5,
    learning_rate=2e-4,
    lora_r=32,
    lora_alpha=64,
    batch_size=4,
    output_dir="./fine-tuned-model",
)

Supported data formats (each record may use any one of these schemas):

[
  {"instruction": "What is Python?", "output": "Python is a programming language..."},
  {"text": "Full text for language modeling"},
  {"prompt": "Question here", "completion": "Answer here"}
]
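
For example, you could assemble a small instruction-tuning file yourself and pass its path straight to finetune(). The records below are made up purely for illustration:

import json

records = [
    {"instruction": "What is Python?",
     "output": "Python is a high-level programming language."},
    {"instruction": "Name one benefit of quantization.",
     "output": "It lets large models run on GPUs with limited memory."},
]

with open("training_data.json", "w") as f:
    json.dump(records, f, indent=2)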

Export to Multiple Formats

from quantllm import turbo

model = turbo("microsoft/phi-3-mini")

# GGUF for llama.cpp / Ollama / LM Studio
model.export("gguf", "phi3-q4.gguf", quantization="Q4_K_M")

# GGUF quantization options:
# Q2_K, Q3_K_S, Q3_K_M, Q4_0, Q4_K_S, Q4_K_M, Q5_0, Q5_K_M, Q6_K, Q8_0

# ONNX for TensorRT / ONNX Runtime
model.export("onnx", "phi3.onnx")

# SafeTensors for HuggingFace
model.export("safetensors", "./phi3-hf/")

# MLX for Apple Silicon Macs
model.export("mlx", "./phi3-mlx/", quantization="4bit")

Push to HuggingFace Hub

from quantllm import turbo
from quantllm.hub import QuantLLMHubManager

# Load and fine-tune
model = turbo("microsoft/phi-2")
model.finetune("my_data.json", epochs=3)

# Setup Hub manager
manager = QuantLLMHubManager(
    repo_id="username/my-fine-tuned-model",
    hf_token="your_token",
)

# Track training
manager.track_hyperparameters({
    "learning_rate": 0.001,
    "epochs": 3,
    "base_model": "microsoft/phi-2",
})

# Save and push
manager.save_final_model(model.model, format="safetensors")
manager.push()

🧠 Supported Models

QuantLLM supports 45+ model architectures out of the box:

| Category | Models |
|---|---|
| Llama Family | Llama 2, Llama 3, CodeLlama |
| Mistral Family | Mistral 7B, Mixtral 8x7B |
| Qwen Family | Qwen, Qwen2, Qwen2-MoE |
| Microsoft | Phi-1, Phi-2, Phi-3 |
| Google | Gemma, Gemma 2 |
| Falcon | Falcon 7B/40B/180B |
| Code Models | StarCoder, StarCoder2, CodeGen |
| Chinese | ChatGLM, Yi, Baichuan, InternLM |
| Other | DeepSeek, StableLM, MPT, BLOOM, OPT, GPT-NeoX |

📦 Installation Options

# Basic installation
pip install git+https://github.com/codewithdark-git/QuantLLM.git

# With GGUF export support
pip install "quantllm[gguf] @ git+https://github.com/codewithdark-git/QuantLLM.git"

# With Triton kernels (Linux only)
pip install "quantllm[triton] @ git+https://github.com/codewithdark-git/QuantLLM.git"

# With Flash Attention
pip install "quantllm[flash] @ git+https://github.com/codewithdark-git/QuantLLM.git"

# Full installation (all features)
pip install "quantllm[full] @ git+https://github.com/codewithdark-git/QuantLLM.git"

# Hub lifecycle (for HuggingFace integration)
pip install git+https://github.com/codewithdark-git/huggingface-lifecycle.git

💻 Hardware Requirements

| Configuration | RAM | GPU VRAM | Recommended For |
|---|---|---|---|
| 🟢 CPU Only | 8GB+ | None | Testing, small models (1-3B) |
| 🔵 Entry GPU | 16GB | 6-8GB | 7B models (4-bit) |
| 🟣 Mid-Range | 32GB | 12-24GB | 13B-30B models |
| 🟠 High-End | 64GB+ | 24-80GB | 70B+ models |
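
As a rule of thumb behind these numbers, quantized weights alone take roughly parameters × bits / 8 bytes, with activations and the KV cache adding overhead on top. A quick back-of-the-envelope helper (illustrative, not part of the library):

def weight_footprint_gb(n_params_billion: float, bits: int) -> float:
    """Approximate size of the quantized weights alone, in GiB."""
    return n_params_billion * 1e9 * bits / 8 / 1024**3

print(f"7B  @ 4-bit: {weight_footprint_gb(7, 4):.1f} GB")   # ~3.3 GB -> fits a 6-8 GB card
print(f"70B @ 4-bit: {weight_footprint_gb(70, 4):.1f} GB")  # ~32.6 GB -> needs high-end GPUs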

Tested GPUs

  • NVIDIA: RTX 3060, 3070, 3080, 3090, 4070, 4080, 4090, A100, H100
  • AMD: RX 7900 XTX (with ROCm)
  • Apple: M1, M2, M3 (via MLX export)

📚 Documentation

| Resource | Description |
|---|---|
| 📖 Examples | Working code examples |
| 📚 API Reference | Full API documentation |
| 🎓 Tutorials | Step-by-step guides |
| 🐛 Issues | Report bugs |

🏗️ Architecture

quantllm/
├── core/                    # Core functionality
│   ├── turbo_model.py      # Main TurboModel API
│   ├── smart_config.py     # Auto-configuration
│   ├── hardware.py         # Hardware detection
│   ├── compilation.py      # torch.compile integration
│   ├── flash_attention.py  # Flash Attention 2
│   ├── memory.py           # Memory optimization
│   ├── training.py         # Training utilities
│   └── export.py           # Universal exporter
├── kernels/                # Custom kernels
│   └── triton/             # Triton fused kernels
├── quant/                  # Quantization
│   ├── gguf_converter.py   # GGUF export (45 models)
│   └── quantization_engine.py
├── hub/                    # HuggingFace integration
│   └── hf_manager.py       # Lifecycle management
└── utils/                  # Utilities

🤝 Contributing

We welcome contributions! Here's how to get started:

# Clone the repository
git clone https://github.com/codewithdark-git/QuantLLM.git
cd QuantLLM

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest

# Format code
black quantllm/
isort quantllm/

Areas for Contribution

  • 🆕 New model architecture support
  • 🔧 Performance optimizations
  • 📚 Documentation improvements
  • 🐛 Bug fixes
  • ✨ New export formats

📈 Benchmarks

Coming soon! We're working on comprehensive benchmarks comparing:

  • Inference speed vs vanilla transformers
  • Memory usage comparisons
  • Quantization quality metrics
  • Export format performance

📜 License

MIT License - see LICENSE for details.

