Performance Benchmarks

Real-world benchmarks measured on local and cloud hardware running llcuda v1.1.0.


Cloud Platforms

llcuda v1.1.0 now supports Google Colab and Kaggle!

Tesla T4 (Google Colab / Kaggle) - 15GB VRAM

Model | Quantization | GPU Layers | Speed | VRAM Usage
Gemma 3 1B | Q4_K_M | 26 (all) | ~15 tok/s | ~1.2 GB
Llama 3.1 7B | Q4_K_M | 20 | ~5-8 tok/s | ~8 GB
Llama 3.1 7B | Q4_K_M | 32 (all) | ~8-10 tok/s | ~12 GB

Platform: Google Colab Free / Kaggle (2x T4, 30GB total VRAM)
llcuda: v1.1.0
Use case: Research, education, rapid prototyping
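
The partial-offload rows above can be reproduced with the load_model parameters documented on this page. A minimal sketch for a single T4, using the registry name llama-3.1-7b-Q4_K_M from the model registry below:

import llcuda

engine = llcuda.InferenceEngine()

# Offload 20 of the model's layers to the T4 (~8 GB VRAM per the table above);
# raise gpu_layers to 32 to offload everything if ~12 GB is free.
engine.load_model("llama-3.1-7b-Q4_K_M", gpu_layers=20, ctx_size=2048)

result = engine.infer("Summarize the benefits of quantization.", max_tokens=50)
print(f"{result.tokens_per_sec:.1f} tok/s")  # roughly 5-8 tok/s expected with 20 layers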

Tesla P100 (Google Colab) - 16GB VRAM

Model | Quantization | GPU Layers | Speed | VRAM Usage
Gemma 3 1B | Q4_K_M | 26 (all) | ~18 tok/s | ~1.2 GB
Llama 3.1 7B | Q4_K_M | 32 (all) | ~10 tok/s | ~12 GB
Llama 3.1 13B | Q4_K_M | 20 | ~5-7 tok/s | ~14 GB

Platform: Google Colab (availability varies)
llcuda: v1.1.0
Use case: Larger models, faster inference


Local GPUs

GeForce 940M - 1GB VRAM

Hardware: GeForce 940M (384 CUDA cores, Maxwell architecture)
OS: Ubuntu 22.04 LTS
llcuda: v1.1.0
CUDA: 12.8 (bundled)

Gemma 3 1B Q4_K_M (Recommended)

Model: google/gemma-3-1b-it (Q4_K_M quantization)
Size: 700 MB
VRAM Usage: ~800 MB
Performance: ~15 tokens/second
GPU Layers: 20 (auto-configured)
Context: 512 tokens

Use case: General chat, Q&A, code assistance

TinyLlama 1.1B Q5_K_M (Fastest)

Model: TinyLlama-1.1B-Chat (Q5_K_M quantization)
Size: 800 MB
VRAM Usage: ~750 MB
Performance: ~18 tokens/second
GPU Layers: 20 (auto-configured)
Context: 512 tokens

Use case: Fast responses, simple tasks

Llama 3.2 1B Q4_K_M (Best Quality)

Model: meta-llama/Llama-3.2-1B-Instruct (Q4_K_M)
Size: 750 MB
VRAM Usage: ~800 MB
Performance: ~16 tokens/second
GPU Layers: 20 (auto-configured)
Context: 512 tokens

Use case: Best quality for 1GB VRAM
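
To compare the three 1 GB-tier options on your own card, here is a short sketch using only the registry names and load_model parameters shown above (a fresh engine is created per model, so no assumption is made about reloading on an existing engine):

import llcuda

# Benchmark the three 1 GB-tier registry models on the same prompt.
for name in ("tinyllama-1.1b-Q5_K_M", "gemma-3-1b-Q4_K_M", "llama-3.2-1b-Q4_K_M"):
    engine = llcuda.InferenceEngine()
    engine.load_model(name, gpu_layers=20, ctx_size=512)
    result = engine.infer("Explain AI", max_tokens=50)
    print(f"{name}: {result.tokens_per_sec:.1f} tok/s")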


Supported GPU Architectures

llcuda v1.1.0 supports all modern NVIDIA GPUs with compute capability 5.0+:

Architecture | Compute Capability | Examples | Platforms
Maxwell | 5.0-5.3 | GTX 900, 940M | Local
Pascal | 6.0-6.2 | GTX 10xx, P100 | Local, Colab
Volta | 7.0 | V100 | Colab Pro
Turing | 7.5 | T4, RTX 20xx | Colab, Kaggle
Ampere | 8.0-8.6 | A100, RTX 30xx | Colab Pro, Local
Ada Lovelace | 8.9 | RTX 40xx | Local
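
To check which row your GPU falls into, you can query its compute capability with nvidia-smi (a standard NVIDIA tool, not part of llcuda; the compute_cap field requires a reasonably recent driver):

import subprocess

# Print the GPU name and compute capability, e.g. "Tesla T4, 7.5"
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(out.stdout.strip())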

Full Model Registry

llcuda includes 11 curated models optimized for different VRAM tiers:

1GB VRAM Tier (GeForce 940M)

Model | Size | Performance (940M) | Use Case
tinyllama-1.1b-Q5_K_M | 800 MB | ~18 tok/s | Fastest option
gemma-3-1b-Q4_K_M | 700 MB | ~15 tok/s | Recommended
llama-3.2-1b-Q4_K_M | 750 MB | ~16 tok/s | Best quality

2GB+ VRAM Tier

Model | Size | Performance (T4) | Use Case
phi-3-mini-Q4_K_M | 2.2 GB | ~12 tok/s | Code-focused
gemma-2-2b-Q4_K_M | 1.6 GB | ~14 tok/s | Balanced
llama-3.2-3b-Q4_K_M | 2.0 GB | ~12 tok/s | Quality

4GB+ VRAM Tier

Model | Size | Performance (T4) | Use Case
mistral-7b-Q4_K_M | 4.1 GB | ~8 tok/s | Highest quality
llama-3.1-7b-Q4_K_M | 4.3 GB | ~5-8 tok/s | Latest Llama
phi-3-medium-Q4_K_M | 8.0 GB | ~5 tok/s | Code expert

8GB+ VRAM Tier

Model | Size | Performance (P100) | Use Case
llama-3.1-13b-Q4_K_M | 7.8 GB | ~5-7 tok/s | Large context
mixtral-8x7b-Q4_K_M | 26 GB | ~3-5 tok/s | MoE architecture
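
A small illustrative helper (not part of the llcuda API) that maps total VRAM to a model from the tier tables above, reading VRAM via nvidia-smi:

import subprocess

def pick_registry_model(vram_mb: int) -> str:
    # Thresholds follow the VRAM tier labels above; larger models may still
    # need partial offload (lower gpu_layers) depending on context size.
    if vram_mb >= 8192:
        return "llama-3.1-13b-Q4_K_M"   # 8GB+ tier
    if vram_mb >= 4096:
        return "llama-3.1-7b-Q4_K_M"    # 4GB+ tier
    if vram_mb >= 2048:
        return "gemma-2-2b-Q4_K_M"      # 2GB+ tier
    return "gemma-3-1b-Q4_K_M"          # 1GB tier (recommended)

# Total VRAM of the first GPU, reported by nvidia-smi in MiB.
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
)
vram_mb = int(out.stdout.split()[0])
print(pick_registry_model(vram_mb))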

Latency Metrics

llcuda tracks P50/P95/P99 latencies:

GeForce 940M (1GB VRAM)

Model: gemma-3-1b-Q4_K_M
GPU Layers: 20
Context: 512 tokens

P50 latency: 65 ms
P95 latency: 72 ms
P99 latency: 78 ms

Tesla T4 (Colab/Kaggle)

Model: gemma-3-1b-Q4_K_M
GPU Layers: 26
Context: 2048 tokens

P50 latency: 62 ms
P95 latency: 68 ms
P99 latency: 74 ms

Testing Methodology

All benchmarks were measured using:

import llcuda

engine = llcuda.InferenceEngine()
engine.load_model("gemma-3-1b-Q4_K_M")

# Warmup
for _ in range(5):
    engine.infer("Test", max_tokens=50)

# Measure
results = []
for _ in range(100):
    result = engine.infer("Explain AI", max_tokens=50)
    results.append(result.tokens_per_sec)

print(f"Mean: {sum(results)/len(results):.1f} tok/s")

Tips for Best Performance

1. GPU Layer Tuning

# Start with auto-configuration
engine.load_model("model.gguf", auto_configure=True)

# Then manually tune
engine.load_model("model.gguf", gpu_layers=20)  # Adjust based on VRAM

2. Context Size

# Smaller context = less VRAM
engine.load_model("model.gguf", ctx_size=512)   # 1GB VRAM
engine.load_model("model.gguf", ctx_size=2048)  # 4GB+ VRAM

3. Batch Size

# Larger batch = better throughput (if VRAM allows)
engine.load_model("model.gguf", batch_size=512)

4. Quantization Selection

Lower-bit quantization = smaller model = faster loading, but slightly lower quality (see the sketch after this list):

  • Q4_K_M: Best balance (recommended)
  • Q5_K_M: Higher quality, larger size
  • Q6_K: Highest quality, biggest size
  • Q3_K_M: Smallest, fastest, lower quality
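
When a model is pulled from Hugging Face with the repo:filename form shown in the Kaggle quick start below, switching quantizations is just a matter of pointing at a different GGUF file, assuming the repo publishes that variant:

# Recommended balance: Q4_K_M (as used throughout these benchmarks).
engine.load_model(
    "unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf",
    gpu_layers=20,
)

# For a smaller, faster download at lower quality, point at a Q3_K_M file
# instead, if the repo provides one (check its file list), e.g.
# "unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q3_K_M.gguf".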

Cloud Platform Quick Start

Google Colab

!pip install llcuda

import llcuda
engine = llcuda.InferenceEngine()
engine.load_model("gemma-3-1b-Q4_K_M", gpu_layers=26)
result = engine.infer("What is AI?")
print(result.text)

Kaggle

!pip install llcuda

import llcuda
engine = llcuda.InferenceEngine()
engine.load_model(
    "unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf",
    gpu_layers=26
)
result = engine.infer("Explain machine learning")
print(result.text)

View complete cloud platform guide →


Comparison: v1.0.x vs v1.1.0

Metric | v1.0.x | v1.1.0
Supported GPUs | Compute 5.0 only | Compute 5.0-8.9 (8 architectures)
Cloud Platforms | ❌ None | ✅ Colab, Kaggle
PyPI Package Size | 50 MB | 51 KB (hybrid architecture)
First Install | Quick | +3-5 min (downloads binaries/model)
Subsequent Use | ~1 sec | ~1 sec (cached)
Performance (940M) | ~15 tok/s | ~15 tok/s (same)
T4 Support | ❌ Failed | ✅ Works (~15 tok/s)
P100 Support | ❌ Failed | ✅ Works (~18 tok/s)

Hybrid Bootstrap Architecture: v1.1.0 ships a lightweight PyPI package (51 KB) that automatically downloads optimized binaries (~700 MB) and models (~770 MB) matched to your GPU on first import. This works around PyPI's 100 MB file size limit while providing universal GPU support.

Backward compatible: v1.1.0 works exactly the same on GeForce 940M as v1.0.x.