v0.2 · Apache 2.0 · 360 tests passing

Deploy frontier LLMs
on your customer's hardware.

Orchestration, security, and lifecycle layer wrapped around best-of-breed inference engines. Auth, audit, signing, multi-tenant routing, canary upgrades, fine-tune bridge — one CLI, one OpenAI-compatible API, every hardware tier from RTX 5090 to 8× H100.

~/customer-site
$ ai5 probe
hardware: 8× H100, 1.5 TB RAM, NVLink
recommended tier: frontier — vLLM, TP=8, EP, FP8
$ ai5 serve deepseek-v3 --tier frontier --port 8000
launched deepseek-v3-a1b2 (pid 12847) at 127.0.0.1:8000
$ ai5 gateway add-key prod --rate-limit-rpm 600 --tenant customer-x
key: ai5_X8gK…q4 (save this — shown once)
$ ai5 gateway start --port 8080 --from-deployments
ai5 gateway listening on 0.0.0.0:8080
  deepseek-v3 → http://127.0.0.1:8000
  keys: … · cors: * · portal: /portal/
7 inference engines · 9 hardware tiers · 21 CLI commands · 360 tests passing
why ai5labs/deploy

The operational layer that turns a GPU box into a real LLM service.

Inference engines like vLLM and SGLang serve tokens. ai5labs/deploy is everything else a real deployment needs: auth, audit, signing, supervision, routing, regression gates, hot-swap LoRA, fine-tune pipelines, and a portable bundle format for air-gap customers.

Hardware-aware tiers

Probe detects NVIDIA / AMD / Apple Silicon / Tenstorrent / CPU. Resolver picks the right engine, quantization, and parallelism. Same CLI on a Mac Studio and an 8× H100.
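As an illustrative sketch of what tier resolution could look like (the tier names come from the table in the engines & tiers section, but the probe fields and thresholds here are assumptions, not ai5labs/deploy's actual heuristics):

```python
# Illustrative tier resolver. Vendor strings, VRAM/RAM thresholds, and the
# decision order are assumptions made for this sketch.
def resolve_tier(vendor: str, gpu_count: int, vram_gb: int, ram_gb: int) -> str:
    if vendor == "apple":
        return "apple_silicon"
    if vendor == "amd":
        return "amd_pro"
    if vendor == "tenstorrent":
        return "tenstorrent"
    if vendor == "none":
        return "cpu"
    # NVIDIA paths, largest shapes first
    if gpu_count >= 8 and vram_gb >= 640:
        return "frontier"
    if gpu_count >= 2 and vram_gb >= 160:
        return "pro"
    if gpu_count == 1 and ram_gb >= 256:
        return "moe_offload"          # expert-offload MoE on one GPU + big RAM
    if vram_gb >= 48:
        return "workstation_pro"
    return "workstation_5090"

print(resolve_tier("nvidia", 8, 8 * 80, 1536))  # 8× H100 box → frontier
```

The point is only that the same probe output deterministically selects an engine + quantization combination; the real resolver presumably weighs more signals (NVLink, driver versions, quantized-weight availability).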

Multi-tenant gateway

Bearer-token auth (sha256, timing-safe), per-key rate limit (in-memory or Redis), audit log with hash chain, multi-deployment routing, Prometheus metrics, OpenTelemetry traces.

Signed bundles for air-gap

Pull, sign with ed25519 (file or HSM), export as a portable tarball, verify with pinned trust store on import. Strict mode refuses tampered, unsigned, or untrusted bundles.

Eval-gated canary deploys

Capture a baseline against your prod deployment. Spin up the candidate, wait for health, run your eval suite, compare. Promote on pass, rollback on regression — built in, no extra tooling.
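The promote/rollback decision reduces to a small gate. A minimal sketch, assuming a 0–1 eval score and interpreting `--max-drop` as a maximum allowed percentage regression (both assumptions; the real scoring scale isn't specified here):

```python
# Hypothetical eval gate: promote unless the candidate regresses more than
# max_drop_pct percent relative to the baseline score.
def canary_gate(baseline: float, candidate: float, max_drop_pct: float) -> str:
    drop_pct = (baseline - candidate) / baseline * 100.0
    return "promote" if drop_pct <= max_drop_pct else "rollback"

print(canary_gate(0.93, 0.94, max_drop_pct=2.0))  # improvement → "promote"
print(canary_gate(0.93, 0.85, max_drop_pct=2.0))  # ~8.6% drop  → "rollback"
```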

Hot-swap LoRA adapters

Load and unload LoRAs without restarting the engine on vLLM and SGLang. Per-tenant adapters, served from a base model, swapped via the OpenAI model field.

Fine-tune to production

Recipe-driven bridge to axolotl, unsloth, TRL. Validates the schema, generates the training script, registers the resulting adapter in your local cache, ready to hot-load into prod.

how it works

A thin layer over best-of-breed engines.

ai5labs/deploy isn't an inference engine — it delegates to vLLM, SGLang, TensorRT-LLM, llama.cpp, MLX, ktransformers, or tt-metal. What it owns is the productization layer: authentication, audit trail, signing, supervision, routing, lifecycle.

Client
Existing OpenAI SDK, curl, or your application. HTTPS, OpenAI-compatible.
ai5labs/deploy — Gateway
Auth · rate limit · audit (hash-chained) · route by model · metrics · traces · /livez · /readyz · /portal/
Inference engine (yours)
vLLM · SGLang · TensorRT-LLM · llama.cpp · MLX · ktransformers · tt-metal
Control plane (out-of-band CLI)
ai5 drives daemon supervision, signed cache, fine-tune bridge, eval-gated canary, sneakernet bundle export/import. State lives in the filesystem under ~/.config/ai5-deploy/ and ~/.cache/ai5-deploy/ — all 0700.
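Because the gateway exposes the standard OpenAI HTTP surface, any stock client works unmodified. A sketch using only the Python stdlib — the host, key, and model name below are placeholders, not real values:

```python
import json
import urllib.request

# Build a chat completion request against the gateway's OpenAI-compatible
# endpoint. Host, bearer key, and model name are illustrative placeholders.
req = urllib.request.Request(
    "http://gateway.internal:8080/v1/chat/completions",
    data=json.dumps({
        "model": "deepseek-v3",
        "messages": [{"role": "user", "content": "hello"}],
    }).encode(),
    headers={
        "Authorization": "Bearer ai5_EXAMPLE_KEY",
        "Content-Type": "application/json",
    },
)
# urllib.request.urlopen(req)  # would actually send it; omitted in this sketch
print(req.get_full_url())
```

Swapping the `model` field is also how the gateway routes between deployments (and, per the LoRA section, between adapters) without the client knowing which engine sits behind it.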
engines & tiers

One CLI across every hardware shape your customers will hand you.

Same registry, same gateway, same OpenAI HTTP — different engine + quantization picked automatically based on what ai5 probe finds.

Tier              Hardware                    Engine                         Typical model
frontier          8× H100 / H200              vLLM · SGLang · TensorRT-LLM   DeepSeek-V3 671B (FP8), Qwen3-MoE
pro               2–4× H100 / 4–8× L40S       vLLM · SGLang                  Llama-3.3-70B AWQ, 200B MoE
amd_pro           1–8× MI300X / MI250X        vLLM-rocm                      70B AWQ; FP8 via env flag
tenstorrent       Wormhole / Blackhole        tt-metal                       Llama-3 scaffold included
workstation_pro   1–2× RTX 6000 Ada           vLLM · llama.cpp               70B 4-bit, MoE w/ ktransformers
moe_offload       1× GPU + 256 GB+ RAM        ktransformers                  DeepSeek-V3 GGUF expert-offload
workstation_5090  RTX 5090 / 4090             llama.cpp · vLLM               32B 4-bit, 70B 3-bit
apple_silicon     Mac Studio / Pro M-series   MLX · llama.cpp                70B mlx-q4, MoE via unified memory
cpu               Server CPU only             llama.cpp                      7B 4-bit GGUF fallback
get started

From an empty box to a running, audited, signed LLM service in four commands.

  1. Probe the hardware

    Detects GPUs/VRAM/RAM and picks the recommended tier.

    ai5 probe
  2. Serve the model

    Daemonized, OpenAI-compatible HTTP. State in ~/.local/share/ai5-deploy/run/.

    ai5 serve qwen3-32b --tier pro --port 8001
  3. Create a key

    Bearer token, sha256 storage, optional tenant tag for audit isolation.

    ai5 gateway add-key prod --rate-limit-rpm 600
  4. Front it with the gateway

    Auth, audit, rate limit, metrics, traces, multi-deployment routing, web portal.

    ai5 gateway start --port 8080
production-host
# install
$ pip install -e ".[otel]"
$ ai5 --version
ai5 0.2.0

# upgrade with a regression-gated canary
$ ai5 eval baseline smoke --target prod-deploy
✓ saved baseline
$ ai5 canary run prod-deploy qwen3-32b \
    --suite smoke --max-drop 2.0
phase 1/5: capture baseline ✓
phase 2/5: start candidate ✓ (port 8002, pid 13991)
phase 3/5: wait for health ✓ (4.2s)
phase 4/5: eval ✓ (score 0.94 vs baseline 0.93)
phase 5/5: promote ✓ (swap_name)
canary complete — prod-deploy now serves qwen3-32b

# export for air-gap delivery
$ ai5 models sign --all --signer-id ai5labs
$ ai5 models export ./bundle.tar \
    --model qwen3-32b:pro --sign
✓ bundle.tar (43 GB) signed by ai5labs · 2,847 files · ed25519 sha256
security & compliance

Built for the security review you'll actually have to pass.

Threat model documented, defenses tested. The audit log is tamper-evident. Bundles are signed. The framework was built on the assumption that it would ship to a hospital, a bank, or a defense contractor — and it behaves accordingly.

Timing-safe key compare

secrets.compare_digest over sha256 hashes; never plaintext on disk.
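A minimal sketch of that scheme (key material here is fabricated for illustration):

```python
import hashlib
import secrets

# Only the sha256 hex digest of a key is ever stored; the plaintext key
# exists only in the caller's hands.
def hash_key(key: str) -> str:
    return hashlib.sha256(key.encode()).hexdigest()

def check_key(presented: str, stored_hash: str) -> bool:
    # compare_digest runs in time independent of where the strings differ,
    # so an attacker can't binary-search a key byte by byte via timing.
    return secrets.compare_digest(hash_key(presented), stored_hash)

stored = hash_key("ai5_example")          # this hash is what touches disk
print(check_key("ai5_example", stored))   # True
print(check_key("wrong-key", stored))     # False
```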

Tamper-evident audit log

SHA-256 hash chain over every entry. ai5 gateway audit verify walks the chain.
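The mechanism is easy to see in miniature. A sketch, assuming a zero genesis value and string-serialized entries (both details are assumptions about the real format):

```python
import hashlib

# Each entry's hash covers the previous entry's hash, so editing or deleting
# any row changes every hash after it — that's what verify walks and checks.
def chain(entries):
    h = b"\x00" * 32                      # genesis value (an assumption)
    hashes = []
    for entry in entries:
        h = hashlib.sha256(h + entry.encode()).digest()
        hashes.append(h)
    return hashes

log = ["key=prod route=deepseek-v3 200", "key=prod route=deepseek-v3 429"]
original = chain(log)
tampered = chain(["key=prod route=deepseek-v3 500", log[1]])
print(original[-1] != tampered[-1])       # True — the tail hash changed
```

Storing the tail hash out of band (or just comparing it against yesterday's) is enough to detect retroactive edits anywhere in the log.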

Signed bundles + trust store

ed25519, file or PKCS#11 HSM (YubiHSM2, SoftHSM2). --strict refuses unsigned.

Path-traversal & symlink guards

Bundle import refuses ../, absolute paths, symlinks, device files.
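The checks above amount to a member-level allow-list at extraction time. A simplified sketch (the real import surely checks more, e.g. resolved-path containment after normalization):

```python
import os
import tarfile

# Reject any archive member that could escape the extraction root or
# smuggle in a special file: parent refs, absolute paths, links, devices.
def is_safe_member(m: tarfile.TarInfo) -> bool:
    if os.path.isabs(m.name) or ".." in m.name.split("/"):
        return False
    if m.issym() or m.islnk():            # symlinks and hardlinks
        return False
    if m.isdev():                         # block/char device files
        return False
    return True

bad = tarfile.TarInfo("../../etc/passwd")
good = tarfile.TarInfo("models/qwen3-32b/config.json")
print(is_safe_member(bad), is_safe_member(good))  # False True
```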

OIDC + bearer auth

Portal works with Okta, Auth0, Keycloak, Google Workspace. Group allow-list. Bearer fallback.

Per-tenant audit isolation

Every audit row tagged. Filter at export time without leaking other tenants' data.

mTLS to upstream

Client cert + custom CA bundle on the gateway → engine connection. Configurable per-route.

DoS hardening

Body-size cap, Prometheus cardinality bucketing, in-memory or Redis rate limiter.
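For the in-memory limiter, a sliding-window sketch shows the shape (window semantics are an assumption; the Redis variant would share the same logic across gateway processes):

```python
import time
from collections import deque

# Hypothetical per-key sliding-window RPM limiter: keep timestamps of the
# last minute's requests, refuse once the window is full.
class RpmLimiter:
    def __init__(self, rpm: int):
        self.rpm = rpm
        self.hits = deque()

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        while self.hits and now - self.hits[0] >= 60.0:
            self.hits.popleft()           # drop requests older than the window
        if len(self.hits) >= self.rpm:
            return False
        self.hits.append(now)
        return True

lim = RpmLimiter(rpm=2)
print(lim.allow(0.0), lim.allow(1.0), lim.allow(2.0))  # True True False
print(lim.allow(61.0))                                 # True — first hit expired
```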

Read the full threat model in docs/security.md · disclosure to security@ai5labs.com.

use cases

Who this is for.

ai5labs/deploy is the layer ai5labs uses to deploy LLMs for clients who can't (or won't) send data to API providers, and don't want to glue together vLLM + nginx + Vault + Grafana + axolotl themselves.

regulated

Healthcare · Legal · Finance

data can't leave the building

  • Frontier-quality LLMs running inside the customer's VPC or on-prem.
  • Cryptographically signed model artifacts, auditable end to end.
  • Every request logged with tenant tag, request-id, and tamper-evident chain.
internal

Internal LLM platforms

platform and ML engineering teams

  • Per-team API keys, per-key rate limits, per-tenant audit.
  • Multi-deployment routing — different models on one gateway.
  • Read-only ops portal for SREs without giving SSH.

Ship a signed, audited LLM service on customer hardware.

Open architecture, seven engines, Apache 2.0. Get started in four commands.