Basilica
LLM Inference

vLLM

The gold standard for LLM inference: PagedAttention, continuous batching, and high throughput. Deploy with deploy_vllm().

Quick Start

Deploy a vLLM inference server with one function call:

from basilica import BasilicaClient

client = BasilicaClient()

deployment = client.deploy_vllm(
    name="qwen-api",
    model="Qwen/Qwen2.5-7B-Instruct",
    gpu_count=1,
)

print(f"API endpoint: {deployment.url}/v1")

API Reference

deployment = client.deploy_vllm(
    name: str,                                   # Deployment name
    model: str,                                  # Hugging Face model ID
    gpu_count: int = 1,                          # Number of GPUs
    tensor_parallel_size: Optional[int] = None,  # Tensor parallelism (default: gpu_count)
    max_model_len: Optional[int] = None,         # Maximum sequence length
    quantization: Optional[str] = None,          # Quantization method
    gpu_models: Optional[List[str]] = None,      # GPU model filter
    min_gpu_memory_gb: Optional[int] = None,     # Minimum GPU VRAM
    memory: str = "16Gi",                        # CPU memory allocation
    env: Optional[Dict[str, str]] = None,        # Additional environment variables
    timeout: int = 600,                          # Deployment timeout in seconds
) -> Deployment

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| name | str | required | Deployment name (DNS-safe) |
| model | str | required | Hugging Face model ID (e.g., "Qwen/Qwen2.5-7B-Instruct") |
| gpu_count | int | 1 | Number of GPUs to allocate |
| tensor_parallel_size | int | gpu_count | Number of GPUs for tensor parallelism |
| max_model_len | int | model default | Maximum context length |
| quantization | str | None | Quantization method: "awq", "gptq", "squeezellm" |
| gpu_models | List[str] | None | Acceptable GPU models |
| min_gpu_memory_gb | int | None | Minimum GPU VRAM in GB |
| memory | str | "16Gi" | CPU memory allocation |
| env | Dict[str, str] | None | Additional environment variables |
| timeout | int | 600 | Seconds to wait for ready |

Basic Examples

Deploy Qwen 2.5

from basilica import BasilicaClient

client = BasilicaClient()

deployment = client.deploy_vllm(
    name="qwen-7b",
    model="Qwen/Qwen2.5-7B-Instruct",
)

Deploy Llama 3.1

deployment = client.deploy_vllm(
    name="llama-8b",
    model="meta-llama/Llama-3.1-8B-Instruct",
    env={"HF_TOKEN": "hf_..."},  # Llama requires license acceptance
)

Deploy Mistral

deployment = client.deploy_vllm(
    name="mistral-7b",
    model="mistralai/Mistral-7B-Instruct-v0.3",
)

Using the API

vLLM exposes an OpenAI-compatible API. Use any OpenAI client:

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url=f"{deployment.url}/v1",
    api_key="not-needed",
)

# Chat completion
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."},
    ],
    temperature=0.7,
    max_tokens=500,
)

print(response.choices[0].message.content)

Streaming Responses

stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Write a short poem about AI"}],
    stream=True,
)

for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
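The accumulation pattern above can be factored into a small helper and exercised offline: `SimpleNamespace` here is a stand-in for the SDK's chunk objects, so the logic can be checked without a live server.

```python
from types import SimpleNamespace


def collect_stream(stream):
    """Concatenate the delta fragments from a chat-completion stream."""
    parts = []
    for chunk in stream:
        content = chunk.choices[0].delta.content
        if content:  # the final chunk's delta content is typically None
            parts.append(content)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in shaped like the OpenAI SDK's streaming chunk objects."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


chunks = [fake_chunk("Hello"), fake_chunk(", "), fake_chunk("world"), fake_chunk(None)]
print(collect_stream(chunks))  # → Hello, world
```

The same helper works unchanged on a real `stream` returned with `stream=True`.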

cURL

curl -X POST "${DEPLOYMENT_URL}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'

JavaScript/TypeScript

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: `${deploymentUrl}/v1`,
  apiKey: 'not-needed',
});

const response = await client.chat.completions.create({
  model: 'Qwen/Qwen2.5-7B-Instruct',
  messages: [{ role: 'user', content: 'Hello!' }],
});

console.log(response.choices[0].message.content);

Advanced Configuration

Tensor Parallelism

For large models that don't fit on a single GPU:

# 70B model across 4 GPUs
deployment = client.deploy_vllm(
    name="llama-70b",
    model="meta-llama/Llama-3.1-70B-Instruct",
    gpu_count=4,
    tensor_parallel_size=4,
    memory="128Gi",
    timeout=900,  # Large models take longer to load
)
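As a rule of thumb, fp16/bf16 weights take about 2 bytes per parameter, so a 70B model needs roughly 140 GB for weights alone before the KV cache — hence the 4-GPU split and memory="128Gi" above. A quick sketch of the arithmetic:

```python
def weight_gb(params_billion, bytes_per_param=2):
    """Approximate weight memory in GB (fp16/bf16 = 2 bytes per parameter)."""
    return params_billion * bytes_per_param


def per_gpu_gb(params_billion, tensor_parallel_size):
    """Weights are sharded roughly evenly across the tensor-parallel group."""
    return weight_gb(params_billion) / tensor_parallel_size


print(weight_gb(70))       # ~140 GB of weights total
print(per_gpu_gb(70, 4))   # ~35 GB per GPU, plus KV cache and activations
```

This is a back-of-the-envelope estimate only; real per-GPU usage also includes the KV cache, activations, and CUDA overhead.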

Custom Context Length

# Shorter context for faster inference
deployment = client.deploy_vllm(
    name="fast-qwen",
    model="Qwen/Qwen2.5-7B-Instruct",
    max_model_len=4096,  # Default is 32k
)

# Extended context (requires more VRAM)
deployment = client.deploy_vllm(
    name="long-context",
    model="Qwen/Qwen2.5-7B-Instruct",
    max_model_len=32768,
    min_gpu_memory_gb=24,
)
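max_model_len matters because the KV cache grows linearly with context: each token stores one key and one value per layer per KV head. The defaults below assume the published config for Qwen2.5-7B-Instruct (28 layers, 4 KV heads via GQA, head dim 128) — verify these against the model card for your model:

```python
def kv_cache_gb(seq_len, layers=28, kv_heads=4, head_dim=128, dtype_bytes=2):
    """Approximate KV cache for one sequence at full length.

    Per token: 2 (K and V) * layers * kv_heads * head_dim * dtype_bytes bytes.
    """
    bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
    return seq_len * bytes_per_token / 2**30


print(f"{kv_cache_gb(4096):.2f} GB")   # short context → 0.22 GB per sequence
print(f"{kv_cache_gb(32768):.2f} GB")  # full 32k context → 1.75 GB per sequence
```

Multiply by the number of concurrent sequences to see why long contexts at high batch sizes drive the min_gpu_memory_gb requirement.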

Quantized Models

Deploy AWQ or GPTQ quantized models for lower memory usage:

deployment = client.deploy_vllm(
    name="qwen-awq",
    model="Qwen/Qwen2.5-7B-Instruct-AWQ",
    quantization="awq",
)

See Custom Models for GPTQ examples and custom quantization.
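AWQ and GPTQ store weights at roughly 4 bits per parameter, cutting weight memory to about a quarter of fp16. A rule-of-thumb estimate (ignoring quantization overhead such as scales and zero points):

```python
def weight_gb_at_bits(params_billion, bits=4):
    """Approximate weight memory in GB at the given bit width per parameter."""
    return params_billion * bits / 8


print(weight_gb_at_bits(7))      # ~3.5 GB at 4-bit (AWQ/GPTQ)
print(weight_gb_at_bits(7, 16))  # ~14 GB fp16 baseline
```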

Performance Tuning

deployment = client.deploy_vllm(
    name="high-throughput",
    model="Qwen/Qwen2.5-7B-Instruct",
    env={
        # Increase batch size for throughput
        "VLLM_MAX_NUM_BATCHED_TOKENS": "8192",
        "VLLM_MAX_NUM_SEQS": "256",

        # Enable eager mode (can help with some models)
        "VLLM_ENFORCE_EAGER": "true",

        # Swap space for larger batches
        "VLLM_SWAP_SPACE": "4",  # GB
    },
)

Specific GPU Selection

deployment = client.deploy_vllm(
    name="a100-inference",
    model="meta-llama/Llama-3.1-70B-Instruct",
    gpu_count=2,
    gpu_models=["A100"],
    min_gpu_memory_gb=80,
)

Gated Models

For models requiring Hugging Face license acceptance:

deployment = client.deploy_vllm(
    name="llama-api",
    model="meta-llama/Llama-3.1-8B-Instruct",
    env={"HF_TOKEN": "hf_..."},
)

See Custom Models for detailed setup instructions.

Monitoring

Check Status

status = deployment.status()

print(f"State: {status.state}")
print(f"Phase: {status.phase}")
print(f"Ready: {status.is_ready}")
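Building on status() and is_ready above, a generic wait loop can block until the deployment is serving. This is a sketch, not a Basilica API — it is verified offline here against a stub deployment object:

```python
import time
from types import SimpleNamespace


def wait_until_ready(deployment, timeout=600, interval=5, sleep=time.sleep):
    """Poll deployment.status() until is_ready, or raise TimeoutError."""
    waited = 0
    while waited < timeout:
        status = deployment.status()
        if status.is_ready:
            return status
        sleep(interval)
        waited += interval
    raise TimeoutError(f"deployment not ready after {timeout}s")


class StubDeployment:
    """Offline stand-in that reports ready on the third poll."""

    def __init__(self):
        self.polls = 0

    def status(self):
        self.polls += 1
        return SimpleNamespace(is_ready=self.polls >= 3, state="deploying")


print(wait_until_ready(StubDeployment(), sleep=lambda s: None).is_ready)  # → True
```

With a real deployment you would call `wait_until_ready(deployment)` and let `sleep` default to `time.sleep`.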

View Logs

# Recent logs
print(deployment.logs(tail=50))

# All logs
print(deployment.logs())

Health Check

import httpx

response = httpx.get(f"{deployment.url}/health")
print(response.json())  # {"status": "healthy"}

API Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| /v1/chat/completions | POST | Chat completions |
| /v1/completions | POST | Text completions |
| /v1/models | GET | List models |
| /v1/embeddings | POST | Generate embeddings |
| /health | GET | Health check |
| /version | GET | vLLM version |

Supported Parameters

Chat completions support these parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model | str | required | Model identifier |
| messages | list | required | Conversation messages |
| temperature | float | 1.0 | Sampling temperature |
| top_p | float | 1.0 | Nucleus sampling |
| top_k | int | -1 | Top-k sampling (-1 disables) |
| max_tokens | int | 16 | Maximum tokens to generate |
| stream | bool | false | Enable streaming |
| stop | list | None | Stop sequences |
| presence_penalty | float | 0.0 | Presence penalty |
| frequency_penalty | float | 0.0 | Frequency penalty |
| n | int | 1 | Number of completions |
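A hypothetical helper that assembles a /v1/chat/completions request body with the defaults from the table and sanity-checks the sampling ranges (useful when calling the endpoint with raw HTTP rather than the SDK):

```python
def chat_payload(model, messages, temperature=1.0, top_p=1.0, top_k=-1,
                 max_tokens=16, stream=False, stop=None, n=1):
    """Build a /v1/chat/completions JSON body using the defaults above."""
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be in [0, 2]")
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    payload = {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "max_tokens": max_tokens,
        "stream": stream,
        "n": n,
    }
    if stop is not None:
        payload["stop"] = stop
    return payload


body = chat_payload(
    "Qwen/Qwen2.5-7B-Instruct",
    [{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    max_tokens=100,
)
```

Note that top_k is a vLLM extension not exposed by the OpenAI SDK's typed parameters; when using the SDK, pass it through the request body (e.g., the SDK's extra_body mechanism).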

Troubleshooting

Model Won't Load

Check logs for errors:

print(deployment.logs())

Common causes:

  • Out of memory: Use smaller max_model_len or more GPUs
  • Missing token: Gated models need HF_TOKEN
  • Wrong quantization: Match quantization to model type
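The checks above can be automated with a small, hypothetical log-triage helper; the patterns below are illustrative, not an exhaustive list of vLLM's actual error messages:

```python
def triage_logs(logs):
    """Map common failure strings in deployment logs to likely fixes (illustrative)."""
    hints = []
    lowered = logs.lower()
    if "out of memory" in lowered:
        hints.append("OOM: reduce max_model_len, add GPUs, or use a quantized model")
    if "401" in logs or "gated" in lowered:
        hints.append("Gated model: pass env={'HF_TOKEN': ...}")
    if "quantization" in lowered:
        hints.append("Quantization mismatch: match the method to the checkpoint")
    return hints


print(triage_logs("torch.cuda.OutOfMemoryError: CUDA out of memory"))
```

Feed it `deployment.logs()` to get a first guess before reading the full log output.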

Slow Startup

Large models take time to download and load:

deployment = client.deploy_vllm(
    name="large-model",
    model="meta-llama/Llama-3.1-70B-Instruct",
    timeout=1200,  # 20 minutes
)

High Latency

  • Reduce max_model_len for shorter context
  • Enable tensor parallelism with more GPUs
  • Use quantized models for faster inference
