Building the future of
AI Infrastructure

Open-source tools for GPU-accelerated computing, intelligent retrieval systems, and production-grade ML operations.

Projects

Production-ready tools designed for real-world AI workloads

Retrieval

Pensive

Hierarchical Context Retrieval

A three-tier context management system with an L1 hot cache, L2 vector retrieval via FAISS, and an L3 persistent archive. Handles 1M+ token contexts with sub-350ms latency; a minimal sketch of the hybrid retrieval step follows the tag list below.

  • Hybrid BM25 + dense vector fusion
  • Async dependency chain orchestration
  • KV summarization & deduplication
  • Multi-hop logic chaining
FAISS · sentence-transformers · BM25
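
To give a feel for the hybrid BM25 + dense vector fusion bullet above, here is a minimal sketch using rank_bm25 and sentence-transformers (the latter named in the tags). The corpus, the alpha weighting, and the hybrid_search function are illustrative assumptions, not Pensive's actual API.

```python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

# Toy corpus standing in for an L2 retrieval tier (illustrative only).
docs = [
    "ROCm kernels for quantized matrix multiplication",
    "FAISS index maintenance and shard compaction",
    "KV-cache summarization for long-context inference",
]

# Sparse scorer: BM25 over whitespace-tokenized documents.
bm25 = BM25Okapi([d.lower().split() for d in docs])

# Dense scorer: cosine similarity over sentence embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = model.encode(docs, convert_to_tensor=True)

def hybrid_search(query: str, alpha: float = 0.5):
    """Blend normalized BM25 and dense scores; alpha is an assumed knob."""
    sparse = bm25.get_scores(query.lower().split())
    sparse = sparse / (sparse.max() + 1e-9)  # scale to roughly [0, 1]
    q_emb = model.encode(query, convert_to_tensor=True)
    dense = util.cos_sim(q_emb, doc_emb)[0].cpu().numpy()
    fused = alpha * sparse + (1 - alpha) * dense
    return sorted(zip(docs, fused), key=lambda p: -p[1])

print(hybrid_search("long context KV summarization")[0])
```
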
Training

mud-puppy

ROCm-First LLM Fine-tuning

A lightweight fine-tuning framework optimized for AMD GPUs. Supports LoRA, QLoRA, DPO, GRPO, and GPTQ quantization with no bitsandbytes dependency; a minimal LoRA sketch follows the tag list below.

  • Full, LoRA, and QLoRA fine-tuning
  • DPO/IPO/KTO/ORPO preference tuning
  • Custom ROCm kernels (qgemm, fbgemm)
  • Memory-efficient streaming & offloading
ROCm · TRL · HuggingFace · GPTQ
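
Since the tags point at the standard HuggingFace + PEFT stack, here is a minimal sketch of what wrapping a model with LoRA adapters looks like using stock PEFT. The base checkpoint, rank, alpha, and target modules are illustrative assumptions, not mud-puppy's own API or defaults.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# ROCm builds of PyTorch expose AMD GPUs through the usual torch.cuda API.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Small public checkpoint used purely for illustration.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m").to(device)

# A typical LoRA config: low-rank adapters on the attention projections.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of weights train
```

Because only the small adapter matrices receive gradients, this style of fine-tuning fits comfortably in consumer-GPU VRAM, which is the point of the memory-efficient streaming and offloading bullet above.
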

Built For

AMD ROCm

First-class support for AMD GPUs with custom kernels

Production ML

Battle-tested infrastructure for real workloads

Open Source

Transparent, auditable, community-driven

Memory Efficient

Optimized for maximum VRAM utilization on consumer hardware

About Tuklus Labs

We build tools that make advanced AI accessible. Our focus is on GPU-vendor-agnostic infrastructure that works on real hardware—not just datacenter clusters with unlimited CUDA cores.

Every project is designed with ROCm-first principles, ensuring AMD GPU users aren't second-class citizens in the AI ecosystem. We believe in efficient, production-ready code over flashy demos.

3 Core Projects
ROCm-First Design
100% Open Source