LLM Inference
vLLM
vLLM is a high-throughput LLM inference engine built on PagedAttention and continuous batching. Deploy it with deploy_vllm().
Quick Start
Deploy a vLLM inference server with one function call:
from basilica import BasilicaClient
client = BasilicaClient()
deployment = client.deploy_vllm(
name="qwen-api",
model="Qwen/Qwen2.5-7B-Instruct",
gpu_count=1,
)
print(f"API endpoint: {deployment.url}/v1")

API Reference
client.deploy_vllm(
name: str, # Deployment name
model: str, # Hugging Face model ID
gpu_count: int = 1, # Number of GPUs
tensor_parallel_size: int = None, # Tensor parallelism (default: gpu_count)
max_model_len: int = None, # Maximum sequence length
quantization: str = None, # Quantization method
gpu_models: List[str] = None, # GPU model filter
min_gpu_memory_gb: int = None, # Minimum GPU VRAM
memory: str = "16Gi", # CPU memory allocation
env: Dict[str, str] = None, # Additional environment variables
timeout: int = 600, # Deployment timeout
) -> DeploymentParameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| name | str | required | Deployment name (DNS-safe) |
| model | str | required | Hugging Face model ID (e.g., "Qwen/Qwen2.5-7B-Instruct") |
| gpu_count | int | 1 | Number of GPUs to allocate |
| tensor_parallel_size | int | gpu_count | Number of GPUs for tensor parallelism |
| max_model_len | int | model default | Maximum context length |
| quantization | str | None | Quantization method: "awq", "gptq", "squeezellm" |
| gpu_models | List[str] | None | Acceptable GPU models |
| min_gpu_memory_gb | int | None | Minimum GPU VRAM in GB |
| memory | str | "16Gi" | CPU memory allocation |
| env | Dict[str, str] | None | Additional environment variables |
| timeout | int | 600 | Seconds to wait for the deployment to become ready |
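The name must be DNS-safe. A quick pre-flight check is possible before deploying; this sketch assumes Basilica follows standard DNS-1123 label rules (lowercase alphanumerics and hyphens, at most 63 characters), which is an assumption rather than documented behavior:

```python
import re

# DNS-1123 label: lowercase alphanumerics and hyphens, starting and ending
# with an alphanumeric character, at most 63 characters (assumed rule set).
_DNS_LABEL = re.compile(r"^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$")

def is_dns_safe(name: str) -> bool:
    """Return True if `name` is a valid DNS-1123 label."""
    return bool(_DNS_LABEL.match(name))

print(is_dns_safe("qwen-api"))   # valid
print(is_dns_safe("Qwen_API"))   # invalid: uppercase and underscore
```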
Basic Examples
Deploy Qwen 2.5
from basilica import BasilicaClient
client = BasilicaClient()
deployment = client.deploy_vllm(
name="qwen-7b",
model="Qwen/Qwen2.5-7B-Instruct",
)

Deploy Llama 3.1
deployment = client.deploy_vllm(
name="llama-8b",
model="meta-llama/Llama-3.1-8B-Instruct",
env={"HF_TOKEN": "hf_..."}, # Llama requires license acceptance
)

Deploy Mistral
deployment = client.deploy_vllm(
name="mistral-7b",
model="mistralai/Mistral-7B-Instruct-v0.3",
)

Using the API
vLLM exposes an OpenAI-compatible API. Use any OpenAI client:
Python (OpenAI SDK)
from openai import OpenAI
client = OpenAI(
base_url=f"{deployment.url}/v1",
api_key="not-needed",
)
# Chat completion
response = client.chat.completions.create(
model="Qwen/Qwen2.5-7B-Instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain quantum computing in simple terms."},
],
temperature=0.7,
max_tokens=500,
)
print(response.choices[0].message.content)

Streaming Responses
stream = client.chat.completions.create(
model="Qwen/Qwen2.5-7B-Instruct",
messages=[{"role": "user", "content": "Write a short poem about AI"}],
stream=True,
)
for chunk in stream:
content = chunk.choices[0].delta.content
if content:
print(content, end="", flush=True)

cURL
curl -X POST "${DEPLOYMENT_URL}/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-7B-Instruct",
"messages": [
{"role": "user", "content": "Hello!"}
]
}'

JavaScript/TypeScript
import OpenAI from 'openai';
const client = new OpenAI({
baseURL: `${deploymentUrl}/v1`,
apiKey: 'not-needed',
});
const response = await client.chat.completions.create({
model: 'Qwen/Qwen2.5-7B-Instruct',
messages: [{ role: 'user', content: 'Hello!' }],
});
console.log(response.choices[0].message.content);

Advanced Configuration
Tensor Parallelism
For large models that don't fit on a single GPU:
# 70B model across 4 GPUs
deployment = client.deploy_vllm(
name="llama-70b",
model="meta-llama/Llama-3.1-70B-Instruct",
gpu_count=4,
tensor_parallel_size=4,
memory="128Gi",
timeout=900, # Large models take longer to load
)

Custom Context Length
# Shorter context for faster inference
deployment = client.deploy_vllm(
name="fast-qwen",
model="Qwen/Qwen2.5-7B-Instruct",
max_model_len=4096, # Default is 32k
)
# Extended context (requires more VRAM)
deployment = client.deploy_vllm(
name="long-context",
model="Qwen/Qwen2.5-7B-Instruct",
max_model_len=32768,
min_gpu_memory_gb=24,
)

Quantized Models
Deploy AWQ or GPTQ quantized models for lower memory usage:
deployment = client.deploy_vllm(
name="qwen-awq",
model="Qwen/Qwen2.5-7B-Instruct-AWQ",
quantization="awq",
)

See Custom Models for GPTQ examples and custom quantization.
Performance Tuning
deployment = client.deploy_vllm(
name="high-throughput",
model="Qwen/Qwen2.5-7B-Instruct",
env={
# Increase batch size for throughput
"VLLM_MAX_NUM_BATCHED_TOKENS": "8192",
"VLLM_MAX_NUM_SEQS": "256",
# Enable eager mode (can help with some models)
"VLLM_ENFORCE_EAGER": "true",
# Swap space for larger batches
"VLLM_SWAP_SPACE": "4", # GB
},
)

Specific GPU Selection
deployment = client.deploy_vllm(
name="a100-inference",
model="meta-llama/Llama-3.1-70B-Instruct",
gpu_count=2,
gpu_models=["A100"],
min_gpu_memory_gb=80,
)

Gated Models
For models requiring Hugging Face license acceptance:
deployment = client.deploy_vllm(
name="llama-api",
model="meta-llama/Llama-3.1-8B-Instruct",
env={"HF_TOKEN": "hf_..."},
)

See Custom Models for detailed setup instructions.
Monitoring
Check Status
status = deployment.status()
print(f"State: {status.state}")
print(f"Phase: {status.phase}")
print(f"Ready: {status.is_ready}")

View Logs
# Recent logs
print(deployment.logs(tail=50))
# All logs
print(deployment.logs())

Health Check
import httpx
response = httpx.get(f"{deployment.url}/health")
print(response.json())  # {"status": "healthy"}

API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /v1/chat/completions | POST | Chat completions |
| /v1/completions | POST | Text completions |
| /v1/models | GET | List models |
| /v1/embeddings | POST | Generate embeddings |
| /health | GET | Health check |
| /version | GET | vLLM version |
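For example, GET /v1/models returns an OpenAI-style list object. A small helper (a sketch, assuming the standard OpenAI response shape) to pull the served model IDs from that response:

```python
def parse_model_ids(models_response: dict) -> list[str]:
    """Extract model IDs from a GET /v1/models response body.

    Assumes the standard OpenAI list shape:
    {"object": "list", "data": [{"id": "...", ...}, ...]}
    """
    return [m["id"] for m in models_response.get("data", [])]

# Against a live deployment (e.g., with httpx):
#   ids = parse_model_ids(httpx.get(f"{deployment.url}/v1/models").json())
```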
Supported Parameters
Chat completions support these parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | required | Model identifier |
| messages | list | required | Conversation messages |
| temperature | float | 1.0 | Sampling temperature |
| top_p | float | 1.0 | Nucleus sampling |
| top_k | int | -1 | Top-k sampling |
| max_tokens | int | 16 | Maximum tokens to generate |
| stream | bool | false | Enable streaming |
| stop | list | None | Stop sequences |
| presence_penalty | float | 0.0 | Presence penalty |
| frequency_penalty | float | 0.0 | Frequency penalty |
| n | int | 1 | Number of completions |
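Putting several of these together: a raw request body for POST /v1/chat/completions. Note that max_tokens defaults to only 16, so set it explicitly, and that top_k is a vLLM extension beyond the standard OpenAI schema:

```python
import json

# Request body exercising the sampling parameters above.
payload = {
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Name three prime numbers."}],
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 40,       # vLLM extension; not in the standard OpenAI schema
    "max_tokens": 64,  # default is only 16
    "stop": ["\n\n"],
    "n": 1,
}
body = json.dumps(payload)
```

Send this body with cURL or httpx. When using the OpenAI Python SDK, pass top_k via extra_body, since the SDK does not expose it as a keyword argument.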
Troubleshooting
Model Won't Load
Check logs for errors:
print(deployment.logs())

Common causes:
- Out of memory: Use a smaller max_model_len or more GPUs
- Missing token: Gated models need HF_TOKEN
- Wrong quantization: Match quantization to the model type
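These checks can be scripted as a rough log triage. The signature strings below are illustrative examples of common failure messages, not an exhaustive or guaranteed list:

```python
# Map assumed failure signatures in vLLM logs to the fixes above.
SIGNATURES = {
    "CUDA out of memory": "reduce max_model_len or use more GPUs",
    "401": "gated model: set HF_TOKEN in env",
    "quantization": "check that quantization matches the checkpoint",
}

def diagnose(logs: str) -> list[str]:
    """Return hints for any known failure signature found in the logs."""
    return [hint for sig, hint in SIGNATURES.items() if sig in logs]

# Against a live deployment:
#   for hint in diagnose(deployment.logs()):
#       print(hint)
```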
Slow Startup
Large models take time to download and load:
deployment = client.deploy_vllm(
name="large-model",
model="meta-llama/Llama-3.1-70B-Instruct",
timeout=1200, # 20 minutes
)

High Latency
- Reduce max_model_len for shorter context
- Enable tensor parallelism with more GPUs
- Use quantized models for faster inference
Next Steps
- Deploy with SGLang for structured generation
- Deploy custom models for fine-tuned models
- GPU configuration for advanced GPU settings