Orchestration, security, and lifecycle layer wrapped around best-of-breed inference engines. Auth, audit, signing, multi-tenant routing, canary upgrades, fine-tune bridge — one CLI, one OpenAI-compatible API, every hardware tier from RTX 5090 to 8× H100.
Inference engines like vLLM and SGLang serve tokens. ai5labs/deploy is everything else
a real deployment needs: auth, audit, signing, supervision, routing, regression gates,
hot-swap LoRA, fine-tune pipelines, and a portable bundle format for air-gapped customers.
Probe detects NVIDIA / AMD / Apple Silicon / Tenstorrent / CPU. Resolver picks the right engine, quantization, and parallelism. Same CLI on a Mac Studio and an 8× H100.
Bearer-token auth (sha256, timing-safe), per-key rate limit (in-memory or Redis), audit log with hash chain, multi-deployment routing, Prometheus metrics, OpenTelemetry traces.
Pull, sign with ed25519 (file or HSM), export as a portable tarball, verify with pinned trust store on import. Strict mode refuses tampered, unsigned, or untrusted bundles.
Capture a baseline against your prod deployment. Spin up the candidate, wait for health, run your eval suite, compare. Promote on pass, rollback on regression — built in, no extra tooling.
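The promote-or-rollback decision above boils down to a metric comparison. A minimal sketch of such a gate, assuming per-metric scores and a fixed tolerance (the metric names, `gate` function, and tolerance value are illustrative, not ai5's actual implementation):

```python
def gate(baseline: dict, candidate: dict, tolerance: float = 0.02) -> str:
    """Return 'promote' if no candidate metric regresses by more than
    `tolerance` relative to the baseline, else 'rollback'."""
    for metric, base_score in baseline.items():
        if candidate.get(metric, 0.0) < base_score - tolerance:
            return "rollback"
    return "promote"

baseline = {"mmlu": 0.71, "gsm8k": 0.83}
candidate = {"mmlu": 0.72, "gsm8k": 0.80}  # gsm8k regressed by 0.03 > 0.02
print(gate(baseline, candidate))  # prints: rollback
```

The real gate also waits for candidate health before running evals; the comparison itself stays this simple.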
Load and unload LoRAs without restarting the engine on vLLM and SGLang. Per-tenant adapters, served from a base model, swapped via the OpenAI model field.
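Selecting a per-tenant adapter is just a matter of naming it in the standard OpenAI `model` field. A sketch of such a request body — the adapter name `tenant-a-support` is hypothetical, not one shipped by ai5labs/deploy:

```python
import json

# Illustrative request body for an OpenAI-compatible /v1/chat/completions call.
payload = {
    "model": "tenant-a-support",  # LoRA adapter name, not the base model
    "messages": [{"role": "user", "content": "Reset my password"}],
}
body = json.dumps(payload)
```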
Recipe-driven bridge to axolotl, unsloth, TRL. Validates the schema, generates the training script, registers the resulting adapter in your local cache, ready to hot-load into prod.
ai5labs/deploy isn't an inference engine — it delegates to vLLM, SGLang, TensorRT-LLM,
llama.cpp, MLX, ktransformers, or tt-metal. What it owns is the productization layer:
authentication, audit trail, signing, supervision, routing, lifecycle.
/livez · /readyz · /portal/
ai5 drives daemon supervision, signed cache, fine-tune bridge, eval-gated
canary, sneakernet bundle export/import. State lives in the filesystem under
~/.config/ai5-deploy/ and ~/.cache/ai5-deploy/ — all 0700.
Same registry, same gateway, same OpenAI HTTP — different engine + quantization picked
automatically based on what ai5 probe finds.
| Tier | Hardware | Engine | Typical model |
|---|---|---|---|
| frontier | 8× H100 / H200 | vLLM / SGLang / TensorRT-LLM | DeepSeek-V3 671B (FP8), Qwen3-MoE |
| pro | 2–4× H100 / 4–8× L40S | vLLM / SGLang | Llama-3.3-70B AWQ, 200B MoE |
| amd_pro | 1–8× MI300X / MI250X | vLLM-rocm | 70B AWQ; FP8 via env flag |
| tenstorrent | Wormhole / Blackhole | tt-metal | Llama-3 scaffold included |
| workstation_pro | 1–2× RTX 6000 Ada | vLLM / llama.cpp | 70B 4-bit, MoE w/ ktransformers |
| moe_offload | 1× GPU + 256 GB+ RAM | ktransformers | DeepSeek-V3 GGUF expert-offload |
| workstation_5090 | RTX 5090 / 4090 | llama.cpp / vLLM | 32B 4-bit, 70B 3-bit |
| apple_silicon | Mac Studio / Pro M-series | MLX / llama.cpp | 70B mlx-q4, MoE via unified memory |
| cpu | Server CPU only | llama.cpp | 7B 4-bit GGUF fallback |
Detects GPUs/VRAM/RAM and picks the recommended tier.
ai5 probe
Daemonized, OpenAI-compatible HTTP. State in ~/.local/share/ai5-deploy/run/.
ai5 serve qwen3-32b --tier pro --port 8001
Bearer token, sha256 storage, optional tenant tag for audit isolation.
ai5 gateway add-key prod --rate-limit-rpm 600
Auth, audit, rate limit, metrics, traces, multi-deployment routing, web portal.
ai5 gateway start --port 8080
Threat model documented, defenses tested. Audit log is tamper-evident. Bundles are signed. The framework was built assuming it'd ship to a hospital, a bank, or a defense contractor — and it acts like it.
secrets.compare_digest over sha256 hashes; never plaintext on disk.
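The pattern is standard: store only the sha256 of each key and compare digests in constant time. A minimal sketch (the `hash_key`/`check` names and the example token are illustrative, not the project's actual code):

```python
import hashlib
import secrets

def hash_key(raw: str) -> str:
    """sha256 hex digest of a raw API key; only this is persisted."""
    return hashlib.sha256(raw.encode()).hexdigest()

STORED = hash_key("sk-example-token")  # hypothetical key, hash-only on disk

def check(presented: str) -> bool:
    # Timing-safe comparison over equal-length hex digests, so attackers
    # can't learn prefix matches from response latency.
    return secrets.compare_digest(hash_key(presented), STORED)
```

Hashing before comparing also normalizes length, which `compare_digest` prefers.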
SHA-256 hash chain over every entry. ai5 gateway audit verify walks the chain.
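Conceptually, each audit entry's hash covers the previous entry's hash, so editing any row breaks every hash after it. A sketch of the idea, assuming JSON entries — the `append`/`verify` helpers are illustrative, not the actual `ai5` code:

```python
import hashlib
import json

GENESIS = "0" * 64  # chain anchor for the first entry

def append(log: list, entry: dict) -> None:
    """Append an entry whose hash commits to the previous entry's hash."""
    prev = log[-1]["hash"] if log else GENESIS
    payload = json.dumps(entry, sort_keys=True)
    digest = hashlib.sha256((prev + payload).encode()).hexdigest()
    log.append({"entry": entry, "prev": prev, "hash": digest})

def verify(log: list) -> bool:
    """Walk the chain; any edited or reordered row breaks verification."""
    prev = GENESIS
    for row in log:
        payload = json.dumps(row["entry"], sort_keys=True)
        expected = hashlib.sha256((prev + payload).encode()).hexdigest()
        if row["prev"] != prev or row["hash"] != expected:
            return False
        prev = row["hash"]
    return True
```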
ed25519, file or PKCS#11 HSM (YubiHSM2, SoftHSM2). --strict refuses unsigned.
Bundle import refuses ../, absolute paths, symlinks, device files.
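Those rejections map onto simple per-member checks during tarball extraction. A sketch of such a filter over `tarfile` members — the `safe_member` helper is illustrative, not the project's importer:

```python
import tarfile

def safe_member(m: tarfile.TarInfo) -> bool:
    """Reject tar members that could escape or abuse the extraction dir."""
    name = m.name.replace("\\", "/")
    if name.startswith("/"):          # absolute path
        return False
    if ".." in name.split("/"):       # parent-directory traversal
        return False
    if m.issym() or m.islnk():        # symlinks / hardlinks
        return False
    if m.isdev():                     # character / block device files
        return False
    return True
```

Python 3.12+ also offers `tarfile`'s built-in `filter="data"` extraction filter for similar protections.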
Portal works with Okta, Auth0, Keycloak, Google Workspace. Group allow-list. Bearer fallback.
Every audit row tagged. Filter at export time without leaking other tenants' data.
Client cert + custom CA bundle on the gateway → engine connection. Configurable per-route.
Body-size cap, Prometheus cardinality bucketing, in-memory or Redis rate limiter.
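An in-memory per-key RPM limiter can be as small as a sliding window of timestamps. A sketch under that assumption (class and method names are illustrative; the real limiter also has a Redis backend):

```python
import time
from collections import defaultdict, deque

class RpmLimiter:
    """Sliding-window limiter: at most `rpm` requests per key per 60 s."""

    def __init__(self, rpm: int):
        self.rpm = rpm
        self.hits = defaultdict(deque)  # key -> timestamps of recent hits

    def allow(self, key: str, now=None) -> bool:
        now = time.monotonic() if now is None else now
        window = self.hits[key]
        # Drop hits older than the 60-second window.
        while window and now - window[0] >= 60.0:
            window.popleft()
        if len(window) >= self.rpm:
            return False
        window.append(now)
        return True
```

A Redis variant would replace the deque with a sorted set keyed per token, trading memory locality for multi-process consistency.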
Read the full threat model in docs/security.md · disclosure to
security@ai5labs.com.
ai5labs/deploy is the layer ai5labs uses to deploy LLMs for clients who can't (or
won't) send data to API providers, and don't want to glue together vLLM + nginx + Vault +
Grafana + axolotl themselves.
data can't leave the building
defense · government · isolated networks
--strict import rejects unsigned or tampered drops.
platform / ML eng teams at companies
Open architecture, seven engines, Apache 2.0. Get started in four commands.