About Me
Product-minded engineer building on-device AI tools for hardware people actually own.
Philosophy
I build tools that actually work on real hardware, not tools validated only by theoretical benchmarks on cutting-edge GPUs. My approach is rooted in empirical engineering: every claim is backed by measurements on actual hardware, and installation should be as simple as pip install.
Core Beliefs:
- Build for Real Hardware: GeForce 940M with 1GB VRAM is my baseline, not an afterthought
- Zero-Configuration Design: If it requires manual compilation or complex setup, I haven't finished the job
- Empirical Testing: Benchmarks on simulators don't count. Real hardware or it didn't happen
- Production Quality: Publish to PyPI, not just GitHub repos. Users deserve properly packaged software
- Documentation as a Feature: If users can't figure out how to use it, it doesn't exist
Current Work
llcuda - LLM Inference for Legacy GPUs
PyPI Package | Documentation | GitHub | v1.0.1 Release
PyTorch-style Python package that brings LLM inference to old NVIDIA GPUs with zero configuration. Published to PyPI, tested extensively on GeForce 940M (1GB VRAM).
Key Achievements (v1.0.1):
- Published production-ready package to PyPI with bundled CUDA 12.8 binaries
- Achieved ~15 tokens/second on GeForce 940M
- Zero-configuration installation - no manual path setup required
- Smart model loading with HuggingFace registry (11 curated models)
- Hardware auto-configuration based on VRAM detection (see the sketch below)
- Performance metrics tracking (P50/P95/P99 latency)
- JupyterLab-first design for data science workflows
- Comprehensive documentation and examples
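As one illustration of what VRAM-based auto-configuration involves, here is a minimal sketch using the pynvml bindings (installed via nvidia-ml-py). The NVML query calls are real; the layer-offload heuristic and thresholds are illustrative assumptions, not llcuda's actual internals.

```python
# Illustrative VRAM probe using NVML via the pynvml bindings
# (pip install nvidia-ml-py).
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # first NVIDIA GPU
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)    # bytes: .total / .free / .used
vram_gb = mem.total / 1024**3
pynvml.nvmlShutdown()

# Made-up offload heuristic for illustration: fewer GPU layers on small cards.
n_gpu_layers = -1 if vram_gb > 2 else (16 if vram_gb >= 1 else 0)
print(f"Detected {vram_gb:.1f} GB VRAM -> n_gpu_layers={n_gpu_layers}")
```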
Technical Stack:
- Python packaging and PyPI distribution (47 MB wheel with binaries)
- CUDA programming for legacy GPUs (compute capability 5.0)
- Bundled llama.cpp binaries and shared libraries
- Empirical performance testing and optimization
- Auto-configuration and hardware detection
Technical Expertise
Programming Languages
- Python: PyPI packaging, library design, API development
- CUDA/C++: GPU programming, performance optimization
- Shell/Bash: Build automation, deployment scripts
Tools & Technologies
- CUDA Development: Programming for legacy GPUs (Maxwell architecture)
- Build Systems: CMake, static linking, cross-compilation
- Package Management: PyPI publishing, semantic versioning
- Version Control: Git, GitHub workflows, CI/CD
- Documentation: MkDocs, technical writing, developer experience
Specializations
- GPU Computing: Optimizing for low-VRAM, legacy NVIDIA GPUs
- Python Packaging: Creating production-ready PyPI packages
- Performance Testing: Empirical benchmarking on real hardware
- Developer Experience: Zero-configuration tools, comprehensive docs
- LLM Systems: Integration with llama.cpp, model quantization, inference optimization
Approach to Engineering
1. Product-Minded Development
I don't just write code; I build products that solve real problems for real users.
Example: llcuda isn't just a Python wrapper around llama.cpp. It's a complete solution that handles model downloading, GPU detection, and error recovery, and it provides a Jupyter-friendly API.
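To make that concrete, here is a purely hypothetical usage sketch: apart from the import, every name below (InferenceEngine, load_model, generate, the model id) is an illustrative assumption, not llcuda's documented API, so consult the linked documentation for the real interface.

```python
import llcuda  # real package name; everything below it is hypothetical

engine = llcuda.InferenceEngine()        # hypothetical: detect GPU, pick safe defaults
engine.load_model("gemma-2b-q4_k_m")     # hypothetical: resolve via the model registry
result = engine.generate("Explain CUDA in one sentence.", max_tokens=64)
print(result.text)
print(result.metrics)                    # e.g. tok/s, P50/P95/P99 latency
```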
2. Empirical Testing
All performance claims are backed by measurements on actual hardware.
Example: Every benchmark in the llcuda documentation was run on a GeForce 940M. No simulations, no theoretical calculations.
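In the same spirit, here is a minimal sketch of how such numbers can be gathered, assuming a generate callable that returns the number of tokens produced. It is a generic measurement loop, not llcuda's built-in metrics code.

```python
import statistics
import time

def benchmark(generate, prompt, runs=20, max_tokens=64):
    """Time repeated generations; report mean throughput and latency percentiles."""
    latencies, rates = [], []
    for _ in range(runs):
        start = time.perf_counter()
        n_tokens = generate(prompt, max_tokens)  # stand-in inference call;
        elapsed = time.perf_counter() - start    # must return tokens produced
        latencies.append(elapsed)
        rates.append(n_tokens / elapsed)
    q = statistics.quantiles(latencies, n=100)   # 99 percentile cut points
    print(f"throughput: {statistics.mean(rates):.1f} tok/s")
    print(f"latency P50/P95/P99: {q[49]:.2f}/{q[94]:.2f}/{q[98]:.2f} s")
```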
3. Documentation First
Documentation isn't an afterthought—it's a core feature.
Example: llcuda has a complete quick start guide, installation guide, performance guide, and production-ready examples. Users can get running in under 5 minutes.
4. Zero-Configuration Design
Installation complexity is a bug, not a feature.
Example: llcuda installs with pip install llcuda. No CUDA toolkit, no compilation, no configuration files. It just works.
5. Production Quality
If it's not on PyPI with proper versioning, it's not production-ready.
Example: llcuda is published to PyPI with semantic versioning, not just a GitHub repo with a README.
Background
Education
Computer Science Background with focus on:
- Algorithms and Data Structures
- Systems Programming
- GPU Computing and Parallel Processing
- Machine Learning and AI
Professional Experience
Product-Minded Software Engineer specializing in:
- Python package development and PyPI publishing
- CUDA programming and GPU optimization
- Building tools for machine learning workflows
- Technical documentation and developer experience
Key Projects:
- llcuda v1.0.1: PyPI package for LLM inference on legacy GPUs with bundled CUDA binaries
- CUDA Systems Research: Empirical testing on Maxwell-era GPUs
Why Legacy GPUs?
The Problem
The AI/ML community often assumes everyone has:
- Modern RTX GPUs (3000/4000 series)
- 8GB+ VRAM
- Willingness to spend $500-$2000 on hardware
Reality: Millions of people have perfectly capable older GPUs collecting dust because the tools don't support them.
The Opportunity
Legacy GPUs like the GeForce 940M can run modern LLMs with proper optimization:
- 1GB VRAM is enough for 2B parameter models
- Maxwell architecture (2014) still has hundreds of CUDA cores
- Quantization (Q4_K_M) makes models fit in limited memory (see the sketch below)
- Performance is acceptable for interactive use (~15 tok/s)
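A quick back-of-the-envelope check of the memory claim above, assuming roughly 4.5 effective bits per weight for Q4_K_M (the exact figure varies by model):

```python
params = 2e9             # 2B-parameter model
bits_per_weight = 4.5    # assumed effective rate for Q4_K_M
size_gib = params * bits_per_weight / 8 / 1024**3
print(f"~{size_gib:.2f} GiB of weights")  # ~1.05 GiB

# Borderline on a 1 GB card, which is why techniques like partial layer
# offload (some layers on GPU, the rest on CPU, as llama.cpp supports)
# matter so much on the 940M.
```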
The Mission
Make AI tools accessible on hardware people already own. No expensive upgrades needed.
Projects
Active Project
llcuda v1.0.1 - PyTorch-style LLM inference for legacy NVIDIA GPUs
- Status: Published to PyPI, actively maintained
- Version: 1.0.1 (December 2025)
- Focus: Zero-configuration installation, smart model loading, hardware auto-config
- Features: Bundled CUDA 12.8 binaries, HuggingFace registry, performance metrics
- Links: PyPI | Docs | GitHub | v1.0.1 Release
Future Directions
Planned Work:
- Windows Support: Pre-built binaries for Windows + CUDA
- Model Optimization: Custom quantization for legacy GPUs
- Advanced Features: Speculative decoding, FlashAttention integration
- Broader Hardware Support: AMD GPUs (ROCm), Intel GPUs (oneAPI)
Values
Accessibility
AI tools should be accessible to everyone, not just those with expensive hardware.
Empiricism
Claims should be backed by real measurements, not marketing promises.
Transparency
Open source code, public documentation, honest benchmarks.
Quality
Production-ready tools, not just research prototypes.
User-Centric
Design for the user's experience, not the developer's convenience.
Technical Writing
I believe in documentation as a core feature of software. Good documentation:
- Gets users productive in minutes, not hours
- Provides realistic benchmarks and expectations
- Includes production-ready code examples
- Anticipates and answers common questions
- Is maintained alongside the code
Example: The llcuda documentation includes:
- 5-minute quick start guide
- Comprehensive installation guide with troubleshooting
- Real benchmarks on actual hardware
- Production-ready code examples
- Clear explanations of design decisions
Open Source
All my projects are open source:
llcuda v1.0.1
- License: MIT
- Repository: github.com/waqasm86/llcuda
- PyPI: pypi.org/project/llcuda
- Contributions: Bug reports, feature requests, model testing, and PRs welcome
Contact
I'm always interested in:
- Feedback on llcuda and related projects
- Collaboration on making AI more accessible
- Discussions about GPU optimization and LLM systems
- Opportunities to build production-quality AI tools
Email: waqasm86@gmail.com
GitHub: github.com/waqasm86
PyPI: pypi.org/project/llcuda
Resume
For a detailed resume including professional experience, education, and technical skills:
Testimonials
From the Community
"Finally, a tool that actually works on my old laptop GPU! llcuda made LLM development accessible without buying new hardware." — Data Science Student
"The documentation is excellent. Got up and running in under 5 minutes, exactly as promised." — Python Developer
"Impressive performance on legacy hardware. The empirical benchmarks gave me realistic expectations." — ML Engineer
Inspiration
My work is inspired by engineers who build practical, accessible tools:
- Dan McCreary: Excellent technical documentation and knowledge graphs
- Georgi Gerganov: Creator of llama.cpp, making LLMs accessible
- Yann LeCun: Advocate for open, accessible AI research
- Linus Torvalds: Focus on practical engineering over hype
Fun Facts
- Favorite GPU: GeForce 940M (1GB VRAM) - my primary testing platform
- Favorite Language: Python for APIs, C++ for performance
- Favorite Tool: JupyterLab for interactive development
- Favorite Metric: Tokens per second (measured on real hardware)
- Favorite Documentation Style: MkDocs Material (this site!)
What's Next?
I'm continually working to make AI more accessible on legacy hardware:
- Expand llcuda: Windows support, more models, better performance
- Explore New Architectures: AMD GPUs (ROCm), Intel GPUs (oneAPI)
- Build Community: Help others run LLMs on their existing hardware
- Document Everything: Share knowledge through comprehensive guides
Follow my work:
- GitHub: github.com/waqasm86
- PyPI: pypi.org/project/llcuda
- This Site: Regular updates on projects and learnings
Let's Build Together
Interested in collaborating on making AI tools more accessible? Have an old GPU and want to contribute to testing? Want to improve the documentation?
I'd love to hear from you.