# Performance Benchmarks
Real-world benchmarks on actual hardware running llcuda v1.1.0.
## Cloud Platforms
llcuda v1.1.0 now supports Google Colab and Kaggle!
### Tesla T4 (Google Colab / Kaggle) - 15GB VRAM
| Model | Quantization | GPU Layers | Speed | VRAM Usage |
|---|---|---|---|---|
| Gemma 3 1B | Q4_K_M | 26 (all) | ~15 tok/s | ~1.2 GB |
| Llama 3.1 7B | Q4_K_M | 20 | ~5-8 tok/s | ~8 GB |
| Llama 3.1 7B | Q4_K_M | 32 (all) | ~8-10 tok/s | ~12 GB |
- **Platform:** Google Colab Free / Kaggle (2x T4, 30 GB total VRAM)
- **llcuda:** v1.1.0
- **Use case:** Research, education, rapid prototyping
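These numbers can be reproduced with the same API used in the Testing Methodology section below; a minimal sketch for the partial-offload Llama 3.1 7B row (the prompt and token count are arbitrary):

```python
import llcuda

# Llama 3.1 7B with 20 of 32 layers offloaded to the T4 (middle row of the table above)
engine = llcuda.InferenceEngine()
engine.load_model("llama-3.1-7b-Q4_K_M", gpu_layers=20)

result = engine.infer("Summarize the history of GPU computing.", max_tokens=100)
print(f"{result.tokens_per_sec:.1f} tok/s")
```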
### Tesla P100 (Google Colab) - 16GB VRAM
| Model | Quantization | GPU Layers | Speed | VRAM Usage |
|---|---|---|---|---|
| Gemma 3 1B | Q4_K_M | 26 (all) | ~18 tok/s | ~1.2 GB |
| Llama 3.1 7B | Q4_K_M | 32 (all) | ~10 tok/s | ~12 GB |
| Llama 3.1 13B | Q4_K_M | 20 | ~5-7 tok/s | ~14 GB |
- **Platform:** Google Colab (varies by availability)
- **llcuda:** v1.1.0
- **Use case:** Larger models, faster inference
## Local GPUs

### GeForce 940M - 1GB VRAM

- **Hardware:** GeForce 940M (384 CUDA cores, Maxwell architecture)
- **OS:** Ubuntu 22.04 LTS
- **llcuda:** v1.1.0
- **CUDA:** 12.8 (bundled)

#### Gemma 3 1B Q4_K_M (Recommended)

- **Model:** google/gemma-3-1b-it (Q4_K_M quantization)
- **Size:** 700 MB
- **VRAM Usage:** ~800 MB
- **Performance:** ~15 tokens/second
- **GPU Layers:** 20 (auto-configured)
- **Context:** 512 tokens
- **Use case:** General chat, Q&A, code assistance
#### TinyLlama 1.1B Q5_K_M (Fastest)

- **Model:** TinyLlama-1.1B-Chat (Q5_K_M quantization)
- **Size:** 800 MB
- **VRAM Usage:** ~750 MB
- **Performance:** ~18 tokens/second
- **GPU Layers:** 20 (auto-configured)
- **Context:** 512 tokens
- **Use case:** Fast responses, simple tasks
#### Llama 3.2 1B Q4_K_M (Best Quality)

- **Model:** meta-llama/Llama-3.2-1B-Instruct (Q4_K_M)
- **Size:** 750 MB
- **VRAM Usage:** ~800 MB
- **Performance:** ~16 tokens/second
- **GPU Layers:** 20 (auto-configured)
- **Context:** 512 tokens
- **Use case:** Best quality for 1GB VRAM
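All three 1GB configurations load the same way; a minimal sketch for the recommended Gemma setup, using the layer count and context size listed above:

```python
import llcuda

# GeForce 940M (1GB VRAM): 20 offloaded layers and a 512-token context
# keep VRAM usage around 800 MB, per the figures above
engine = llcuda.InferenceEngine()
engine.load_model("gemma-3-1b-Q4_K_M", gpu_layers=20, ctx_size=512)

result = engine.infer("Write a haiku about CUDA.", max_tokens=64)
print(result.text)
print(f"{result.tokens_per_sec:.1f} tok/s")
```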
## Supported GPU Architectures
llcuda v1.1.0 supports all modern NVIDIA GPUs with compute capability 5.0+:
| Architecture | Compute Cap | Examples | Platforms |
|---|---|---|---|
| Maxwell | 5.0-5.3 | GTX 900, 940M | Local |
| Pascal | 6.0-6.2 | GTX 10xx, P100 | Local, Colab |
| Volta | 7.0 | V100 | Colab Pro |
| Turing | 7.5 | T4, RTX 20xx | Colab, Kaggle |
| Ampere | 8.0-8.6 | A100, RTX 30xx | Colab Pro, Local |
| Ada Lovelace | 8.9 | RTX 40xx | Local |
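To check which row your GPU falls into before installing, you can query the driver directly; this sketch does not use llcuda and assumes a recent NVIDIA driver where `compute_cap` is a supported `nvidia-smi` query field:

```python
import subprocess

# Ask the driver for each GPU's name and compute capability
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.strip().splitlines():
    name, cap = (field.strip() for field in line.split(","))
    status = "supported" if float(cap) >= 5.0 else "not supported"
    print(f"{name}: compute capability {cap} -> {status} by llcuda v1.1.0")
```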
## Full Model Registry
llcuda includes 11 curated models optimized for different VRAM tiers:
### 1GB VRAM Tier (GeForce 940M)
| Model | Size | Performance (940M) | Use Case |
|---|---|---|---|
| tinyllama-1.1b-Q5_K_M | 800 MB | ~18 tok/s | Fastest option |
| gemma-3-1b-Q4_K_M | 700 MB | ~15 tok/s | Recommended |
| llama-3.2-1b-Q4_K_M | 750 MB | ~16 tok/s | Best quality |
### 2GB+ VRAM Tier
| Model | Size | Performance (T4) | Use Case |
|---|---|---|---|
| phi-3-mini-Q4_K_M | 2.2 GB | ~12 tok/s | Code-focused |
| gemma-2-2b-Q4_K_M | 1.6 GB | ~14 tok/s | Balanced |
| llama-3.2-3b-Q4_K_M | 2.0 GB | ~12 tok/s | Quality |
### 4GB+ VRAM Tier
| Model | Size | Performance (T4) | Use Case |
|---|---|---|---|
| mistral-7b-Q4_K_M | 4.1 GB | ~8 tok/s | Highest quality |
| llama-3.1-7b-Q4_K_M | 4.3 GB | ~5-8 tok/s | Latest Llama |
| phi-3-medium-Q4_K_M | 8.0 GB | ~5 tok/s | Code expert |
### 8GB+ VRAM Tier
| Model | Size | Performance (P100) | Use Case |
|---|---|---|---|
| llama-3.1-13b-Q4_K_M | 7.8 GB | ~5-7 tok/s | Large context |
| mixtral-8x7b-Q4_K_M | 26 GB | ~3-5 tok/s | MoE architecture |
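If you want to pick a registry model programmatically, here is a simple sketch that mirrors the tiers above (the tier mapping is hand-written from these tables; it is not an llcuda API):

```python
import llcuda

# Hand-written mapping of the tiers above: (minimum free VRAM in GB, registry name)
TIERS = [
    (8.0, "llama-3.1-13b-Q4_K_M"),
    (4.0, "mistral-7b-Q4_K_M"),
    (2.0, "gemma-2-2b-Q4_K_M"),
    (1.0, "gemma-3-1b-Q4_K_M"),
]

def pick_model(free_vram_gb: float) -> str:
    """Return the largest registry model that fits the given VRAM budget."""
    for min_vram, name in TIERS:
        if free_vram_gb >= min_vram:
            return name
    raise ValueError("Even the 1GB-tier models need roughly 1 GB of free VRAM")

engine = llcuda.InferenceEngine()
engine.load_model(pick_model(free_vram_gb=15.0), auto_configure=True)
```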
## Latency Metrics
llcuda tracks P50/P95/P99 latencies:
### GeForce 940M (1GB VRAM)

- **Model:** gemma-3-1b-Q4_K_M
- **GPU Layers:** 20
- **Context:** 512 tokens
- **P50 latency:** 65 ms
- **P95 latency:** 72 ms
- **P99 latency:** 78 ms
### Tesla T4 (Colab/Kaggle)

- **Model:** gemma-3-1b-Q4_K_M
- **GPU Layers:** 26
- **Context:** 2048 tokens
- **P50 latency:** 62 ms
- **P95 latency:** 68 ms
- **P99 latency:** 74 ms
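One way to collect comparable percentiles yourself is to time individual requests with the same API as the Testing Methodology section; in this sketch the per-token conversion assumes each request generates the full 50 tokens:

```python
import time
import llcuda

engine = llcuda.InferenceEngine()
engine.load_model("gemma-3-1b-Q4_K_M", gpu_layers=20, ctx_size=512)

# Time 100 requests and convert each to milliseconds per generated token
latencies_ms = []
for _ in range(100):
    start = time.perf_counter()
    engine.infer("Explain AI", max_tokens=50)
    elapsed_ms = (time.perf_counter() - start) * 1000
    latencies_ms.append(elapsed_ms / 50)  # assumes all 50 tokens were generated

latencies_ms.sort()
for pct in (50, 95, 99):
    idx = min(len(latencies_ms) - 1, int(len(latencies_ms) * pct / 100))
    print(f"P{pct}: {latencies_ms[idx]:.1f} ms")
```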
## Testing Methodology

All benchmarks were measured with the following script:
```python
import llcuda

engine = llcuda.InferenceEngine()
engine.load_model("gemma-3-1b-Q4_K_M")

# Warmup
for _ in range(5):
    engine.infer("Test", max_tokens=50)

# Measure
results = []
for _ in range(100):
    result = engine.infer("Explain AI", max_tokens=50)
    results.append(result.tokens_per_sec)

print(f"Mean: {sum(results)/len(results):.1f} tok/s")
```
## Tips for Best Performance

### 1. GPU Layer Tuning

```python
# Start with auto-configuration
engine.load_model("model.gguf", auto_configure=True)

# Then manually tune
engine.load_model("model.gguf", gpu_layers=20)  # Adjust based on VRAM
```
### 2. Context Size

```python
# Smaller context = less VRAM
engine.load_model("model.gguf", ctx_size=512)   # 1GB VRAM
engine.load_model("model.gguf", ctx_size=2048)  # 4GB+ VRAM
```
### 3. Batch Size
### 4. Quantization Selection

Fewer bits per weight mean a smaller model, lower VRAM use, and faster loading, at a slight cost in quality:
- Q4_K_M: Best balance (recommended)
- Q5_K_M: Higher quality, larger size
- Q6_K: Highest quality, biggest size
- Q3_K_M: Smallest, fastest, lower quality
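If the quantization you want is not in the registry, `load_model` also accepts an explicit GGUF file spec, as in the Kaggle quick start below; the Q5_K_M filename here is an assumption, so check the repository for the quantizations it actually ships:

```python
import llcuda

engine = llcuda.InferenceEngine()

# Same repo:file spec format as the Kaggle quick start below.
# The Q5_K_M filename is an assumption -- verify it exists in the repo.
engine.load_model(
    "unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q5_K_M.gguf",
    gpu_layers=26,
)
```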
## Cloud Platform Quick Start

### Google Colab
```python
!pip install llcuda

import llcuda

engine = llcuda.InferenceEngine()
engine.load_model("gemma-3-1b-Q4_K_M", gpu_layers=26)

result = engine.infer("What is AI?")
print(result.text)
```
### Kaggle
```python
!pip install llcuda

import llcuda

engine = llcuda.InferenceEngine()
engine.load_model(
    "unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf",
    gpu_layers=26
)

result = engine.infer("Explain machine learning")
print(result.text)
```
View complete cloud platform guide →
## Comparison: v1.0.x vs v1.1.0
| Metric | v1.0.x | v1.1.0 |
|---|---|---|
| Supported GPUs | Compute 5.0 only | Compute 5.0-8.9 (Maxwell through Ada Lovelace) |
| Cloud Platforms | ❌ None | ✅ Colab, Kaggle |
| PyPI Package Size | 50 MB | 51 KB (hybrid architecture) |
| First Install | Quick | +3-5 min (downloads binaries/model) |
| Subsequent Use | ~1 sec | ~1 sec (cached) |
| Performance (940M) | ~15 tok/s | ~15 tok/s (same) |
| T4 Support | ❌ Failed | ✅ Works (~15 tok/s) |
| P100 Support | ❌ Failed | ✅ Works (~18 tok/s) |
Hybrid Bootstrap Architecture: v1.1.0 uses a lightweight PyPI package (51 KB) that automatically downloads optimized binaries (~700 MB) and models (~770 MB) on first import based on your GPU. This solves PyPI's 100 MB file size limit while providing universal GPU support.
Backward compatible: v1.1.0 works exactly the same on GeForce 940M as v1.0.x.
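A quick way to see the bootstrap behavior is to time the very first `import llcuda` in a fresh environment; per the table above, expect a few minutes on first use and roughly a second once everything is cached:

```python
import time

start = time.perf_counter()
import llcuda  # first import in a fresh environment downloads binaries and a default model
print(f"import llcuda took {time.perf_counter() - start:.1f} s")
```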
## Links
- PyPI: pypi.org/project/llcuda
- GitHub: github.com/waqasm86/llcuda
- Latest Release: v1.1.0
- Cloud Guide: Cloud Platforms