
llcuda v1.1.0 - PyTorch-Style CUDA LLM Inference

Zero-configuration CUDA-accelerated LLM inference for Python. Works on all modern NVIDIA GPUs, Google Colab, and Kaggle.

Python 3.11+ • CUDA 12 • Google Colab • Kaggle • MIT License

Perfect for: Google Colab • Kaggle • Local GPUs (940M to RTX 4090) • Zero-configuration • PyTorch-style API


🎉 What's New in v1.1.0

🚀 Major Update: Universal GPU Support + Cloud Platform Compatibility

Before (v1.0.x):

# On Kaggle/Colab T4
!pip install llcuda
engine = llcuda.InferenceEngine()
engine.load_model("gemma-3-1b-Q4_K_M")
# ❌ Error: no kernel image is available for execution on the device

Now (v1.1.0):

# On Kaggle/Colab T4
!pip install llcuda
engine = llcuda.InferenceEngine()
engine.load_model("gemma-3-1b-Q4_K_M")
# ✅ Works! Auto-detects T4, loads model, runs inference at ~15 tok/s

New Features

  • Multi-GPU Architecture Support - Works on all modern NVIDIA GPUs (compute capability 5.0-8.9)
  • Google Colab - Full support for T4, P100, V100, A100 GPUs
  • Kaggle - Works on Tesla T4 notebooks
  • GPU Auto-Detection - Automatic platform and GPU compatibility checking
  • Better Error Messages - Clear guidance when issues occur
  • No Breaking Changes - Fully backward compatible with v1.0.x

🎯 Supported GPUs

llcuda v1.1.0 supports all modern NVIDIA GPUs with compute capability 5.0+:

| Architecture | Compute Capability | GPUs | Cloud Platforms |
|---|---|---|---|
| Maxwell | 5.0-5.3 | GTX 900 series, GeForce 940M | Local |
| Pascal | 6.0-6.2 | GTX 10xx, Tesla P100 | ✅ Colab |
| Volta | 7.0 | Tesla V100 | ✅ Colab Pro |
| Turing | 7.5 | Tesla T4, RTX 20xx, GTX 16xx | ✅ Colab, ✅ Kaggle |
| Ampere | 8.0-8.6 | A100, RTX 30xx | ✅ Colab Pro |
| Ada Lovelace | 8.9 | RTX 40xx | Local |

Cloud Platform Support:

  • ✅ Google Colab (Free & Pro)
  • ✅ Kaggle Notebooks
  • ✅ JupyterLab (Local)
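Not sure which row your GPU falls into? One way to check the compute capability directly from the driver is via nvidia-smi. This is a minimal sketch; the compute_cap query field assumes a reasonably recent NVIDIA driver (otherwise, use llcuda.check_gpu_compatibility() as shown below).

import subprocess

# Prints "<GPU name>, <compute capability>" per GPU, e.g. "Tesla T4, 7.5".
# The compute_cap query field requires a recent NVIDIA driver.
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,compute_cap", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(out.stdout.strip())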


🚀 Quick Start

Installation

pip install llcuda

That's all you need! The package includes:

  • llama-server executable (CUDA 12.8, multi-arch)
  • All required shared libraries (114 MB CUDA library with multi-GPU support)
  • Auto-configuration on import
  • Works immediately on Colab/Kaggle
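To confirm the install, a quick smoke test along these lines should be enough. This is a sketch: the __version__ attribute is an assumption, while check_gpu_compatibility() is part of the API covered below.

import llcuda

# Report the installed version (__version__ is assumed; fall back to "unknown" if absent)
print(getattr(llcuda, "__version__", "unknown"))

# Confirm the bundled runtime can see a supported GPU
print(llcuda.check_gpu_compatibility()["compatible"])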

Local Usage

import llcuda

# Create inference engine
engine = llcuda.InferenceEngine()

# Load model (auto-downloads with confirmation)
engine.load_model("gemma-3-1b-Q4_K_M")

# Run inference
result = engine.infer("Explain quantum computing in simple terms.")
print(result.text)
print(f"Speed: {result.tokens_per_sec:.1f} tok/s")

Google Colab

# Install llcuda
!pip install llcuda

import llcuda

# Check GPU compatibility
compat = llcuda.check_gpu_compatibility()
print(f"Platform: {compat['platform']}")  # 'colab'
print(f"GPU: {compat['gpu_name']}")       # 'Tesla T4' or 'Tesla P100'
print(f"Compatible: {compat['compatible']}")  # True

# Create engine and load model
engine = llcuda.InferenceEngine()
engine.load_model("gemma-3-1b-Q4_K_M", gpu_layers=26)

# Run inference
result = engine.infer("What is artificial intelligence?", max_tokens=100)
print(result.text)
print(f"Speed: {result.tokens_per_sec:.1f} tok/s")

Kaggle

# Install llcuda
!pip install llcuda

import llcuda

# Load model (auto-downloads from HuggingFace)
engine = llcuda.InferenceEngine()
engine.load_model(
    "unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf",
    gpu_layers=26,
    ctx_size=2048
)

# Run inference
result = engine.infer("Explain machine learning", max_tokens=100)
print(result.text)

Complete Cloud Guide: See the cloud platforms guide for detailed examples, troubleshooting, and best practices.


🔍 Check GPU Compatibility

import llcuda

# Check your GPU
compat = llcuda.check_gpu_compatibility()
print(f"Platform: {compat['platform']}")      # local/colab/kaggle
print(f"GPU: {compat['gpu_name']}")
print(f"Compute Capability: {compat['compute_capability']}")
print(f"Compatible: {compat['compatible']}")
print(f"Reason: {compat['reason']}")

Example Output (Kaggle):

Platform: kaggle
GPU: Tesla T4
Compute Capability: 7.5
Compatible: True
Reason: GPU Tesla T4 (compute capability 7.5) is compatible.


📊 Performance Benchmarks

Tesla T4 (Google Colab / Kaggle) - 15GB VRAM

| Model | Quantization | GPU Layers | Speed | VRAM Usage |
|---|---|---|---|---|
| Gemma 3 1B | Q4_K_M | 26 (all) | ~15 tok/s | ~1.2 GB |
| Gemma 3 3B | Q4_K_M | 28 (all) | ~10 tok/s | ~3.5 GB |
| Llama 3.1 7B | Q4_K_M | 20 | ~5 tok/s | ~8 GB |
| Llama 3.1 7B | Q4_K_M | 32 (all) | ~8 tok/s | ~12 GB |

Tesla P100 (Google Colab) - 16GB VRAM

| Model | Quantization | GPU Layers | Speed | VRAM Usage |
|---|---|---|---|---|
| Gemma 3 1B | Q4_K_M | 26 (all) | ~18 tok/s | ~1.2 GB |
| Llama 3.1 7B | Q4_K_M | 32 (all) | ~10 tok/s | ~12 GB |

GeForce 940M (Local) - 1GB VRAM

| Model | Quantization | GPU Layers | Speed | VRAM Usage |
|---|---|---|---|---|
| Gemma 3 1B | Q4_K_M | 20 | ~15 tok/s | ~1.0 GB |
| Llama 3.2 1B | Q4_K_M | 18 | ~12 tok/s | ~0.9 GB |

All benchmarks with default settings. Your mileage may vary.
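To get a rough tokens-per-second figure on your own hardware, something along these lines works with the API shown above. It is only a sketch: the prompts and token counts are arbitrary, and the first request is used as a warm-up so one-time initialization doesn't skew the average.

import llcuda

engine = llcuda.InferenceEngine()
engine.load_model("gemma-3-1b-Q4_K_M")

# Warm-up request so model/kernel initialization doesn't skew the numbers
engine.infer("Hello", max_tokens=8)

prompts = ["What is AI?", "Explain machine learning", "What are neural networks?"]
speeds = [engine.infer(p, max_tokens=100).tokens_per_sec for p in prompts]
print(f"Average speed: {sum(speeds) / len(speeds):.1f} tok/s")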


💡 Key Features

1. Zero Configuration

# Just import and use - no setup required
import llcuda
engine = llcuda.InferenceEngine()
engine.load_model("gemma-3-1b-Q4_K_M")

2. Smart Model Loading

# Three ways to load models:

# 1. Registry name (easiest)
engine.load_model("gemma-3-1b-Q4_K_M")  # Auto-downloads

# 2. HuggingFace syntax
engine.load_model("unsloth/gemma-3-1b-it-GGUF:gemma-3-1b-it-Q4_K_M.gguf")

# 3. Local path
engine.load_model("/path/to/model.gguf")

3. Hardware Auto-Configuration

# Automatically detects GPU VRAM and optimizes settings
engine.load_model("model.gguf", auto_configure=True)
# Sets optimal gpu_layers, ctx_size, batch_size, ubatch_size

4. Platform Detection

# Automatically detects where you're running
compat = llcuda.check_gpu_compatibility()
# Returns: 'local', 'colab', or 'kaggle'

5. Performance Metrics

result = engine.infer("What is AI?")
print(f"Tokens: {result.tokens_generated}")
print(f"Latency: {result.latency_ms:.0f}ms")
print(f"Speed: {result.tokens_per_sec:.1f} tok/s")

# Get detailed metrics
metrics = engine.get_metrics()
print(f"P50 latency: {metrics['latency']['p50_ms']:.0f}ms")
print(f"P95 latency: {metrics['latency']['p95_ms']:.0f}ms")

📖 Documentation

Full documentation: https://waqasm86.github.io/


🛠️ Advanced Usage

Context Manager (Auto-Cleanup)

with llcuda.InferenceEngine() as engine:
    engine.load_model("model.gguf", auto_start=True)
    result = engine.infer("Hello!")
    print(result.text)
# Server automatically stopped

Batch Inference

prompts = [
    "What is AI?",
    "Explain machine learning",
    "What are neural networks?"
]

results = engine.batch_infer(prompts, max_tokens=100)
for prompt, result in zip(prompts, results):
    print(f"Q: {prompt}")
    print(f"A: {result.text}\n")

Custom Server Settings

engine.load_model(
    "model.gguf",
    gpu_layers=20,        # Manual GPU layer count
    ctx_size=2048,        # Context window
    batch_size=512,       # Logical batch size
    ubatch_size=128,      # Physical batch size
    n_parallel=1          # Parallel sequences
)
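Picking gpu_layers by hand is mostly trial and error. A rough starting point is to scale the model's full layer count by how much free VRAM you have relative to the full-offload figure in the benchmark tables above. The helper below is a heuristic sketch, not part of the llcuda API; the example layer count and VRAM figure are taken from the Llama 3.1 7B Q4_K_M row of the T4 benchmark table.

import subprocess
import llcuda

def suggest_gpu_layers(total_layers: int, full_offload_vram_gb: float) -> int:
    """Rough heuristic: scale the layer count by free VRAM vs. full-offload VRAM."""
    # Free VRAM (MiB) on the first GPU, as reported by the driver
    free_mib = int(subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()[0])
    free_gb = free_mib / 1024
    # Keep ~10% headroom for the KV cache and clamp to [1, total_layers]
    return max(1, min(total_layers, int(total_layers * 0.9 * free_gb / full_offload_vram_gb)))

# Example: Llama 3.1 7B Q4_K_M uses ~12 GB with all 32 layers offloaded (see benchmarks above)
engine = llcuda.InferenceEngine()
engine.load_model("model.gguf", gpu_layers=suggest_gpu_layers(32, 12.0))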

Skip GPU Check (Advanced)

# Skip automatic GPU compatibility check
# Use only if you know what you're doing
engine.load_model("model.gguf", skip_gpu_check=True)

🔧 Troubleshooting

Common Issues

Issue: "No kernel image available for execution on the device" Solution: Upgrade to llcuda 1.1.0+

pip install --upgrade llcuda

Issue: Out of memory on GPU
Solutions:

# 1. Reduce GPU layers
engine.load_model("model.gguf", gpu_layers=10)

# 2. Reduce context size
engine.load_model("model.gguf", ctx_size=1024)

# 3. Use smaller model
engine.load_model("gemma-3-1b-Q4_K_M")  # Instead of 7B

Issue: Slow inference (<5 tok/s)
Solution: Check that the GPU is actually being used:

compat = llcuda.check_gpu_compatibility()
assert compat['compatible'], f"GPU issue: {compat['reason']}"
assert compat['compute_capability'] >= 5.0

See the cloud platforms guide for more troubleshooting.


🤝 Contributing

Contributions welcome! Found a bug? Open an issue: https://github.com/waqasm86/llcuda/issues


📄 License

MIT License - Free for commercial and personal use.

See LICENSE for details.


🙏 Acknowledgments

  • llama.cpp team for the excellent CUDA backend
  • GGML team for the tensor library
  • HuggingFace for model hosting
  • Google Colab and Kaggle for free GPU access
  • All contributors and users

🔗 Links

  • PyPI: https://pypi.org/project/llcuda/
  • GitHub: https://github.com/waqasm86/llcuda
  • Documentation: https://waqasm86.github.io/
  • Bug Tracker: https://github.com/waqasm86/llcuda/issues

⭐ Star History

If llcuda helps you, please star the repo! ⭐



Happy Inferencing! 🚀

Built with ❤️ for the LLM community

Generated with Claude Code