LLM Inference
Summon Llama, Mistral, Qwen, or any Hugging Face model. One function call, one OpenAI-compatible endpoint. Your model, served.
Why Basilica for LLM Inference?
- OpenAI-compatible API: Drop-in replacement for OpenAI API clients
- Optimized performance: PagedAttention, continuous batching, tensor parallelism
- Simple deployment: One function call deploys a full inference server
- Cost-effective: Pay only for GPU time you use
Supported Frameworks
| Framework | Best For | Performance |
|---|---|---|
| vLLM | Production workloads | High throughput, low latency |
| SGLang | Complex pipelines | Structured generation, multi-turn |
Quick Start
Deploy a model with a single function call:
```python
from basilica import BasilicaClient

client = BasilicaClient()

# Deploy vLLM with Qwen 2.5
deployment = client.deploy_vllm(
    name="qwen-api",
    model="Qwen/Qwen2.5-7B-Instruct",
    gpu_count=1,
)

print(f"OpenAI-compatible API: {deployment.url}/v1")
```

Use with any OpenAI client:
```python
from openai import OpenAI

client = OpenAI(
    base_url=f"{deployment.url}/v1",
    api_key="not-needed",  # Basilica handles auth
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)

print(response.choices[0].message.content)
```

Choosing a Framework
Use vLLM When
- You need maximum throughput for chat completions
- You want battle-tested production stability
- Your workload is request/response based
- You need tensor parallelism for large models
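For large models, the idea above can be sketched as a multi-GPU deployment. This is a sketch, not SDK reference: it assumes `deploy_vllm` uses `gpu_count` as the tensor-parallel degree, and the deployment name is illustrative.

```python
def deploy_large_model():
    """Sketch: serve a 70B-class model sharded across several GPUs.

    Assumes deploy_vllm maps gpu_count to the tensor-parallel degree;
    adjust to the SDK's actual behavior.
    """
    from basilica import BasilicaClient  # SDK assumed installed

    client = BasilicaClient()
    return client.deploy_vllm(
        name="llama-70b-api",  # illustrative deployment name
        model="meta-llama/Llama-3.1-70B-Instruct",
        gpu_count=4,  # shard the weights across 4 GPUs
    )
```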
Use SGLang When
- You have complex multi-turn conversations
- You need structured generation (JSON schemas, constrained outputs)
- You want RadixAttention for KV cache reuse
- Your pipeline involves branching logic
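To make "structured generation" concrete, here is a sketch of a request body that constrains output to a JSON schema. The `response_format` shape follows OpenAI's `json_schema` convention, which SGLang's OpenAI-compatible API is assumed to accept; exact support may vary by version, and the schema itself is only an example.

```python
import json

def structured_request_body(prompt: str) -> dict:
    """Build a chat-completions body that constrains output to a JSON schema."""
    schema = {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "population": {"type": "integer"},
        },
        "required": ["city", "population"],
    }
    return {
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "city_info", "schema": schema},
        },
    }

body = structured_request_body("Name a large city and its population.")
print(json.dumps(body, indent=2))
```

POST this body to `/v1/chat/completions` and the returned message content should parse as JSON matching the schema.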
Dedicated Methods
The SDK provides dedicated methods for LLM inference:
| Method | Description |
|---|---|
| `client.deploy_vllm()` | Deploy a vLLM OpenAI-compatible server |
| `client.deploy_sglang()` | Deploy an SGLang server |
These methods handle:
- Container image selection
- Environment configuration
- Health check configuration
- Optimized resource defaults
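Since only `deploy_vllm` is shown above, here is a sketch of the SGLang counterpart. Its signature is assumed to mirror `deploy_vllm`; the deployment name is illustrative.

```python
def deploy_sglang_server():
    """Sketch: deploy an SGLang server, assuming parameters mirror deploy_vllm."""
    from basilica import BasilicaClient  # SDK assumed installed

    client = BasilicaClient()
    return client.deploy_sglang(
        name="qwen-sglang",  # illustrative deployment name
        model="Qwen/Qwen2.5-7B-Instruct",
        gpu_count=1,
    )
```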
Model Compatibility
Both vLLM and SGLang support most Hugging Face models:
Chat Models
- `meta-llama/Llama-3.1-8B-Instruct`
- `meta-llama/Llama-3.1-70B-Instruct`
- `mistralai/Mistral-7B-Instruct-v0.3`
- `mistralai/Mixtral-8x7B-Instruct-v0.1`
- `Qwen/Qwen2.5-7B-Instruct`
- `Qwen/Qwen2.5-72B-Instruct`
- `deepseek-ai/DeepSeek-V2-Chat`
Base Models
- `meta-llama/Llama-3.1-8B`
- `mistralai/Mistral-7B-v0.3`
- `Qwen/Qwen2.5-7B`
Code Models
- `codellama/CodeLlama-34b-Instruct-hf`
- `deepseek-ai/deepseek-coder-6.7b-instruct`
- `Qwen/Qwen2.5-Coder-7B-Instruct`
Some models require accepting a license agreement on Hugging Face. Set `HF_TOKEN` in your environment or pass it via the `env` parameter.
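Forwarding the token might look like the sketch below. It assumes `env` accepts a dict of environment variables for the serving container; the deployment name is illustrative.

```python
import os

def deploy_gated_model(model_id: str):
    """Sketch: deploy a license-gated model, forwarding HF_TOKEN.

    Assumes deploy_vllm accepts an `env` dict of environment variables
    passed through to the serving container.
    """
    from basilica import BasilicaClient  # SDK assumed installed

    token = os.environ["HF_TOKEN"]  # raises KeyError if not set locally
    client = BasilicaClient()
    return client.deploy_vllm(
        name="llama-api",  # illustrative deployment name
        model=model_id,
        gpu_count=1,
        env={"HF_TOKEN": token},
    )
```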
API Endpoints
Both vLLM and SGLang expose OpenAI-compatible endpoints:
| Endpoint | Description |
|---|---|
| `POST /v1/chat/completions` | Chat completions (recommended) |
| `POST /v1/completions` | Text completions |
| `GET /v1/models` | List available models |
| `GET /health` | Health check |
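The `/health` endpoint is useful for waiting on a fresh deployment before sending traffic. A minimal stdlib-only polling helper might look like this (the URL comes from your own deployment):

```python
import time
import urllib.request

def wait_until_healthy(base_url: str, timeout_s: float = 300.0) -> bool:
    """Poll GET /health until the server answers 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # server still starting; retry
        time.sleep(2)
    return False
```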
Chat Completions
```shell
curl -X POST "${DEPLOYMENT_URL}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is 2+2?"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'
```

Streaming
```python
from openai import OpenAI

client = OpenAI(base_url=f"{deployment.url}/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```

Next Steps
- Deploy with vLLM: Production-grade inference
- Deploy with SGLang: Structured generation
- Deploy custom models: Fine-tuned and private models