Orchestration, security, and lifecycle layer wrapped around best-of-breed inference engines. Auth, audit, signing, multi-tenant routing, canary upgrades, fine-tune bridge — one CLI, one OpenAI-compatible API, every hardware tier from RTX 5090 to 8× H100.
Inference engines like vLLM and SGLang serve tokens. ai5labs/deploy is everything else
a real deployment needs: auth, audit, signing, supervision, routing, regression gates,
hot-swap LoRA, fine-tune pipelines, and a portable bundle format for air-gapped customers.
Probe detects NVIDIA / AMD / Apple Silicon / Tenstorrent / CPU. Resolver picks the right engine, quantization, and parallelism. Same CLI on a Mac Studio and an 8× H100.
Bearer-token auth (sha256, timing-safe), per-key rate limit (in-memory or Redis), audit log with hash chain, multi-deployment routing, Prometheus metrics, OpenTelemetry traces.
Pull, sign with ed25519 (file or HSM), export as a portable tarball, verify with pinned trust store on import. Strict mode refuses tampered, unsigned, or untrusted bundles.
Capture a baseline against your prod deployment. Spin up the candidate, wait for health, run your eval suite, compare. Promote on pass, rollback on regression — built in, no extra tooling.
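The promote-or-rollback decision above boils down to a metric comparison. A minimal sketch of such a gate, assuming per-metric scores and a fixed tolerance (the metric names, `gate` function, and tolerance value are illustrative, not ai5's actual implementation):

```python
def gate(baseline: dict, candidate: dict, tolerance: float = 0.02) -> str:
    """Return 'promote' if no candidate metric regresses by more than
    `tolerance` relative to the baseline, else 'rollback'."""
    for metric, base_score in baseline.items():
        if candidate.get(metric, 0.0) < base_score - tolerance:
            return "rollback"
    return "promote"

baseline = {"mmlu": 0.71, "gsm8k": 0.83}
candidate = {"mmlu": 0.72, "gsm8k": 0.80}  # gsm8k regressed by 0.03 > 0.02
print(gate(baseline, candidate))  # prints: rollback
```

The real gate also waits for candidate health before running evals; the comparison itself stays this simple.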
Load and unload LoRAs without restarting the engine on vLLM and SGLang. Per-tenant adapters, served from a base model, swapped via the OpenAI model field.
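Selecting a per-tenant adapter is just a matter of naming it in the standard OpenAI `model` field. A sketch of such a request body — the adapter name `tenant-a-support` is hypothetical, not one shipped by ai5labs/deploy:

```python
import json

# Illustrative request body for an OpenAI-compatible /v1/chat/completions call.
payload = {
    "model": "tenant-a-support",  # LoRA adapter name, not the base model
    "messages": [{"role": "user", "content": "Reset my password"}],
}
body = json.dumps(payload)
```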
Recipe-driven bridge to axolotl, unsloth, TRL. Validates the schema, generates the training script, registers the resulting adapter in your local cache, ready to hot-load into prod.
ai5labs/deploy isn't an inference engine — it delegates to vLLM, SGLang, TensorRT-LLM,
llama.cpp, MLX, ktransformers, or tt-metal. What it owns is the productization layer:
authentication, audit trail, signing, supervision, routing, lifecycle.
/livez · /readyz · /portal/
ai5 drives daemon supervision, signed cache, fine-tune bridge, eval-gated
canary, sneakernet bundle export/import. State lives in the filesystem under
~/.config/ai5-deploy/ and ~/.cache/ai5-deploy/ — all 0700.
Same registry, same gateway, same OpenAI HTTP — different engine + quantization picked
automatically based on what ai5 probe finds.
| Tier | Hardware | Engine | Typical model |
|---|---|---|---|
| frontier | 8× H100 / H200 | vLLM / SGLang / TensorRT-LLM | DeepSeek-V3 671B (FP8), Qwen3-MoE |
| pro | 2–4× H100 / 4–8× L40S | vLLM / SGLang | Llama-3.3-70B AWQ, 200B MoE |
| amd_pro | 1–8× MI300X / MI250X | vLLM-rocm | 70B AWQ; FP8 via env flag |
| tenstorrent | Wormhole / Blackhole | tt-metal | Llama-3 scaffold included |
| workstation_pro | 1–2× RTX 6000 Ada | vLLM / llama.cpp | 70B 4-bit, MoE w/ ktransformers |
| moe_offload | 1× GPU + 256 GB+ RAM | ktransformers | DeepSeek-V3 GGUF expert-offload |
| workstation_5090 | RTX 5090 / 4090 | llama.cpp / vLLM | 32B 4-bit, 70B 3-bit |
| apple_silicon | Mac Studio / Pro M-series | MLX / llama.cpp | 70B mlx-q4, MoE via unified memory |
| cpu | Server CPU only | llama.cpp | 7B 4-bit GGUF fallback |
Detects GPUs/VRAM/RAM and picks the recommended tier.
ai5 probe
Daemonized, OpenAI-compatible HTTP. State in ~/.local/share/ai5-deploy/run/.
ai5 serve qwen3-32b --tier pro --port 8001
Bearer token, sha256 storage, optional tenant tag for audit isolation.
ai5 gateway add-key prod --rate-limit-rpm 600
Auth, audit, rate limit, metrics, traces, multi-deployment routing, web portal.
ai5 gateway start --port 8080
Threat model documented, defenses tested. Audit log is tamper-evident. Bundles are signed. The framework was built assuming it'd ship to a hospital, a bank, or a defense contractor — and it acts like it.
secrets.compare_digest over sha256 hashes; never plaintext on disk.
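The pattern is standard: store only the sha256 of each key and compare digests in constant time. A minimal sketch (the `hash_key`/`check` names and the example token are illustrative, not the project's actual code):

```python
import hashlib
import secrets

def hash_key(raw: str) -> str:
    """sha256 hex digest of a raw API key; only this is persisted."""
    return hashlib.sha256(raw.encode()).hexdigest()

STORED = hash_key("sk-example-token")  # hypothetical key, hash-only on disk

def check(presented: str) -> bool:
    # Timing-safe comparison over equal-length hex digests, so attackers
    # can't learn prefix matches from response latency.
    return secrets.compare_digest(hash_key(presented), STORED)
```

Hashing before comparing also normalizes length, which `compare_digest` prefers.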
SHA-256 hash chain over every entry. ai5 gateway audit verify walks the chain.
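Conceptually, each audit entry's hash covers the previous entry's hash, so editing any row breaks every hash after it. A sketch of the idea, assuming JSON entries — the `append`/`verify` helpers are illustrative, not the actual `ai5` code:

```python
import hashlib
import json

GENESIS = "0" * 64  # chain anchor for the first entry

def append(log: list, entry: dict) -> None:
    """Append an entry whose hash commits to the previous entry's hash."""
    prev = log[-1]["hash"] if log else GENESIS
    payload = json.dumps(entry, sort_keys=True)
    digest = hashlib.sha256((prev + payload).encode()).hexdigest()
    log.append({"entry": entry, "prev": prev, "hash": digest})

def verify(log: list) -> bool:
    """Walk the chain; any edited or reordered row breaks verification."""
    prev = GENESIS
    for row in log:
        payload = json.dumps(row["entry"], sort_keys=True)
        expected = hashlib.sha256((prev + payload).encode()).hexdigest()
        if row["prev"] != prev or row["hash"] != expected:
            return False
        prev = row["hash"]
    return True
```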
ed25519, file or PKCS#11 HSM (YubiHSM2, SoftHSM2). --strict refuses unsigned.
Bundle import refuses ../, absolute paths, symlinks, device files.
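Those rejections map onto simple per-member checks during tarball extraction. A sketch of such a filter over `tarfile` members — the `safe_member` helper is illustrative, not the project's importer:

```python
import tarfile

def safe_member(m: tarfile.TarInfo) -> bool:
    """Reject tar members that could escape or abuse the extraction dir."""
    name = m.name.replace("\\", "/")
    if name.startswith("/"):          # absolute path
        return False
    if ".." in name.split("/"):       # parent-directory traversal
        return False
    if m.issym() or m.islnk():        # symlinks / hardlinks
        return False
    if m.isdev():                     # character / block device files
        return False
    return True
```

Python 3.12+ also offers `tarfile`'s built-in `filter="data"` extraction filter for similar protections.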
Portal works with Okta, Auth0, Keycloak, Google Workspace. Group allow-list. Bearer fallback.
Every audit row tagged. Filter at export time without leaking other tenants' data.
Client cert + custom CA bundle on the gateway → engine connection. Configurable per-route.
Body-size cap, Prometheus cardinality bucketing, in-memory or Redis rate limiter.
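An in-memory per-key RPM limiter can be as small as a sliding window of timestamps. A sketch under that assumption (class and method names are illustrative; the real limiter also has a Redis backend):

```python
import time
from collections import defaultdict, deque

class RpmLimiter:
    """Sliding-window limiter: at most `rpm` requests per key per 60 s."""

    def __init__(self, rpm: int):
        self.rpm = rpm
        self.hits = defaultdict(deque)  # key -> timestamps of recent hits

    def allow(self, key: str, now=None) -> bool:
        now = time.monotonic() if now is None else now
        window = self.hits[key]
        # Drop hits older than the 60-second window.
        while window and now - window[0] >= 60.0:
            window.popleft()
        if len(window) >= self.rpm:
            return False
        window.append(now)
        return True
```

A Redis variant would replace the deque with a sorted set keyed per token, trading memory locality for multi-process consistency.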
Read the full threat model in docs/security.md · disclosure to
security@ai5labs.com.
ai5labs/deploy is the layer ai5labs uses to deploy LLMs for clients who can't (or
won't) send data to API providers, and don't want to glue together vLLM + nginx + Vault +
Grafana + axolotl themselves.
data can't leave the building
defense · government · isolated networks
--strict import rejects unsigned or tampered drops.
platform / ML eng teams at companies
Open architecture, seven engines, Apache 2.0. Get started in four commands.