Basilica
LLM Inference

vLLM

The gold standard for LLM inference: PagedAttention, continuous batching, and high throughput. Deploy with deploy_vllm().

Quick Start

Deploy a vLLM inference server with one function call:

from basilica import BasilicaClient

client = BasilicaClient()

deployment = client.deploy_vllm(
    name="qwen-api",
    model="Qwen/Qwen2.5-7B-Instruct",
    gpu_count=1,
)

print(f"API endpoint: {deployment.url}/v1")

API Reference

deployment = client.deploy_vllm(
    name: str,                                   # Deployment name
    model: str,                                  # Hugging Face model ID
    gpu_count: int = 1,                          # Number of GPUs
    tensor_parallel_size: Optional[int] = None,  # Tensor parallelism (default: gpu_count)
    max_model_len: Optional[int] = None,         # Maximum sequence length
    quantization: Optional[str] = None,          # Quantization method
    gpu_models: Optional[List[str]] = None,      # GPU model filter
    min_gpu_memory_gb: Optional[int] = None,     # Minimum GPU VRAM
    memory: str = "16Gi",                        # CPU memory allocation
    env: Optional[Dict[str, str]] = None,        # Additional environment variables
    timeout: int = 600,                          # Deployment timeout in seconds
) -> Deployment

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| name | str | required | Deployment name (DNS-safe) |
| model | str | required | Hugging Face model ID (e.g., "Qwen/Qwen2.5-7B-Instruct") |
| gpu_count | int | 1 | Number of GPUs to allocate |
| tensor_parallel_size | int | gpu_count | Number of GPUs for tensor parallelism |
| max_model_len | int | model default | Maximum context length |
| quantization | str | None | Quantization method: "awq", "gptq", "squeezellm" |
| gpu_models | List[str] | None | Acceptable GPU models |
| min_gpu_memory_gb | int | None | Minimum GPU VRAM in GB |
| memory | str | "16Gi" | CPU memory allocation |
| env | Dict[str, str] | None | Additional environment variables |
| timeout | int | 600 | Seconds to wait for ready |

Basic Examples

Deploy Qwen 2.5

from basilica import BasilicaClient

client = BasilicaClient()

deployment = client.deploy_vllm(
    name="qwen-7b",
    model="Qwen/Qwen2.5-7B-Instruct",
)

Deploy Llama 3.1

deployment = client.deploy_vllm(
    name="llama-8b",
    model="meta-llama/Llama-3.1-8B-Instruct",
    env={"HF_TOKEN": "hf_..."},  # Llama requires license acceptance
)

Deploy Mistral

deployment = client.deploy_vllm(
    name="mistral-7b",
    model="mistralai/Mistral-7B-Instruct-v0.3",
)

Using the API

vLLM exposes an OpenAI-compatible API. Use any OpenAI client:

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url=f"{deployment.url}/v1",
    api_key="not-needed",
)

# Chat completion
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."},
    ],
    temperature=0.7,
    max_tokens=500,
)

print(response.choices[0].message.content)

Streaming Responses

stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Write a short poem about AI"}],
    stream=True,
)

for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
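The accumulation pattern above can be factored into a small helper and exercised offline: `SimpleNamespace` here is a stand-in for the SDK's chunk objects, so the logic can be checked without a live server.

```python
from types import SimpleNamespace


def collect_stream(stream):
    """Concatenate the delta fragments from a chat-completion stream."""
    parts = []
    for chunk in stream:
        content = chunk.choices[0].delta.content
        if content:  # the final chunk's delta content is typically None
            parts.append(content)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in shaped like the OpenAI SDK's streaming chunk objects."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


chunks = [fake_chunk("Hello"), fake_chunk(", "), fake_chunk("world"), fake_chunk(None)]
print(collect_stream(chunks))  # → Hello, world
```

The same helper works unchanged on a real `stream` returned with `stream=True`.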

cURL

curl -X POST "${DEPLOYMENT_URL}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'

JavaScript/TypeScript

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: `${deploymentUrl}/v1`,
  apiKey: 'not-needed',
});

const response = await client.chat.completions.create({
  model: 'Qwen/Qwen2.5-7B-Instruct',
  messages: [{ role: 'user', content: 'Hello!' }],
});

console.log(response.choices[0].message.content);

Advanced Configuration

Tensor Parallelism

For large models that don't fit on a single GPU:

# 70B model across 4 GPUs
deployment = client.deploy_vllm(
    name="llama-70b",
    model="meta-llama/Llama-3.1-70B-Instruct",
    gpu_count=4,
    tensor_parallel_size=4,
    memory="128Gi",
    timeout=900,  # Large models take longer to load
)
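As a rule of thumb, fp16/bf16 weights take about 2 bytes per parameter, so a 70B model needs roughly 140 GB for weights alone before the KV cache — hence the 4-GPU split and memory="128Gi" above. A quick sketch of the arithmetic:

```python
def weight_gb(params_billion, bytes_per_param=2):
    """Approximate weight memory in GB (fp16/bf16 = 2 bytes per parameter)."""
    return params_billion * bytes_per_param


def per_gpu_gb(params_billion, tensor_parallel_size):
    """Weights are sharded roughly evenly across the tensor-parallel group."""
    return weight_gb(params_billion) / tensor_parallel_size


print(weight_gb(70))       # ~140 GB of weights total
print(per_gpu_gb(70, 4))   # ~35 GB per GPU, plus KV cache and activations
```

This is a back-of-the-envelope estimate only; real per-GPU usage also includes the KV cache, activations, and CUDA overhead.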

Custom Context Length

# Shorter context for faster inference
deployment = client.deploy_vllm(
    name="fast-qwen",
    model="Qwen/Qwen2.5-7B-Instruct",
    max_model_len=4096,  # Default is 32k
)

# Extended context (requires more VRAM)
deployment = client.deploy_vllm(
    name="long-context",
    model="Qwen/Qwen2.5-7B-Instruct",
    max_model_len=32768,
    min_gpu_memory_gb=24,
)
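max_model_len matters because the KV cache grows linearly with context: each token stores one key and one value per layer per KV head. The defaults below assume the published config for Qwen2.5-7B-Instruct (28 layers, 4 KV heads via GQA, head dim 128) — verify these against the model card for your model:

```python
def kv_cache_gb(seq_len, layers=28, kv_heads=4, head_dim=128, dtype_bytes=2):
    """Approximate KV cache for one sequence at full length.

    Per token: 2 (K and V) * layers * kv_heads * head_dim * dtype_bytes bytes.
    """
    bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
    return seq_len * bytes_per_token / 2**30


print(f"{kv_cache_gb(4096):.2f} GB")   # short context → 0.22 GB per sequence
print(f"{kv_cache_gb(32768):.2f} GB")  # full 32k context → 1.75 GB per sequence
```

Multiply by the number of concurrent sequences to see why long contexts at high batch sizes drive the min_gpu_memory_gb requirement.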

Quantized Models

Deploy AWQ or GPTQ quantized models for lower memory usage:

deployment = client.deploy_vllm(
    name="qwen-awq",
    model="Qwen/Qwen2.5-7B-Instruct-AWQ",
    quantization="awq",
)

See Custom Models for GPTQ examples and custom quantization.
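AWQ and GPTQ store weights at roughly 4 bits per parameter, cutting weight memory to about a quarter of fp16. A rule-of-thumb estimate (ignoring quantization overhead such as scales and zero points):

```python
def weight_gb_at_bits(params_billion, bits=4):
    """Approximate weight memory in GB at the given bit width per parameter."""
    return params_billion * bits / 8


print(weight_gb_at_bits(7))      # ~3.5 GB at 4-bit (AWQ/GPTQ)
print(weight_gb_at_bits(7, 16))  # ~14 GB fp16 baseline
```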

Performance Tuning

deployment = client.deploy_vllm(
    name="high-throughput",
    model="Qwen/Qwen2.5-7B-Instruct",
    env={
        # Increase batch size for throughput
        "VLLM_MAX_NUM_BATCHED_TOKENS": "8192",
        "VLLM_MAX_NUM_SEQS": "256",

        # Enable eager mode (can help with some models)
        "VLLM_ENFORCE_EAGER": "true",

        # Swap space for larger batches
        "VLLM_SWAP_SPACE": "4",  # GB
    },
)

Specific GPU Selection

deployment = client.deploy_vllm(
    name="a100-inference",
    model="meta-llama/Llama-3.1-70B-Instruct",
    gpu_count=2,
    gpu_models=["A100"],
    min_gpu_memory_gb=80,
)

Gated Models

For models requiring Hugging Face license acceptance:

deployment = client.deploy_vllm(
    name="llama-api",
    model="meta-llama/Llama-3.1-8B-Instruct",
    env={"HF_TOKEN": "hf_..."},
)

See Custom Models for detailed setup instructions.

Monitoring

Check Status

status = deployment.status()

print(f"State: {status.state}")
print(f"Phase: {status.phase}")
print(f"Ready: {status.is_ready}")
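Building on status() and is_ready above, a generic wait loop can block until the deployment is serving. This is a sketch, not a Basilica API — it is verified offline here against a stub deployment object:

```python
import time
from types import SimpleNamespace


def wait_until_ready(deployment, timeout=600, interval=5, sleep=time.sleep):
    """Poll deployment.status() until is_ready, or raise TimeoutError."""
    waited = 0
    while waited < timeout:
        status = deployment.status()
        if status.is_ready:
            return status
        sleep(interval)
        waited += interval
    raise TimeoutError(f"deployment not ready after {timeout}s")


class StubDeployment:
    """Offline stand-in that reports ready on the third poll."""

    def __init__(self):
        self.polls = 0

    def status(self):
        self.polls += 1
        return SimpleNamespace(is_ready=self.polls >= 3, state="deploying")


print(wait_until_ready(StubDeployment(), sleep=lambda s: None).is_ready)  # → True
```

With a real deployment you would call `wait_until_ready(deployment)` and let `sleep` default to `time.sleep`.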

View Logs

# Recent logs
print(deployment.logs(tail=50))

# All logs
print(deployment.logs())

Health Check

import httpx

response = httpx.get(f"{deployment.url}/health")
print(response.json())  # {"status": "healthy"}

API Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| /v1/chat/completions | POST | Chat completions |
| /v1/completions | POST | Text completions |
| /v1/models | GET | List models |
| /v1/embeddings | POST | Generate embeddings |
| /health | GET | Health check |
| /version | GET | vLLM version |

Supported Parameters

Chat completions support these parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model | str | required | Model identifier |
| messages | list | required | Conversation messages |
| temperature | float | 1.0 | Sampling temperature |
| top_p | float | 1.0 | Nucleus sampling |
| top_k | int | -1 | Top-k sampling (-1 disables) |
| max_tokens | int | 16 | Maximum tokens to generate |
| stream | bool | false | Enable streaming |
| stop | list | None | Stop sequences |
| presence_penalty | float | 0.0 | Presence penalty |
| frequency_penalty | float | 0.0 | Frequency penalty |
| n | int | 1 | Number of completions |
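A hypothetical helper that assembles a /v1/chat/completions request body with the defaults from the table and sanity-checks the sampling ranges (useful when calling the endpoint with raw HTTP rather than the SDK):

```python
def chat_payload(model, messages, temperature=1.0, top_p=1.0, top_k=-1,
                 max_tokens=16, stream=False, stop=None, n=1):
    """Build a /v1/chat/completions JSON body using the defaults above."""
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be in [0, 2]")
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    payload = {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "max_tokens": max_tokens,
        "stream": stream,
        "n": n,
    }
    if stop is not None:
        payload["stop"] = stop
    return payload


body = chat_payload(
    "Qwen/Qwen2.5-7B-Instruct",
    [{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    max_tokens=100,
)
```

Note that top_k is a vLLM extension not exposed by the OpenAI SDK's typed parameters; when using the SDK, pass it through the request body (e.g., the SDK's extra_body mechanism).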

Troubleshooting

Model Won't Load

Check logs for errors:

print(deployment.logs())

Common causes:

  • Out of memory: Use smaller max_model_len or more GPUs
  • Missing token: Gated models need HF_TOKEN
  • Wrong quantization: Match quantization to model type
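The checks above can be automated with a small, hypothetical log-triage helper; the patterns below are illustrative, not an exhaustive list of vLLM's actual error messages:

```python
def triage_logs(logs):
    """Map common failure strings in deployment logs to likely fixes (illustrative)."""
    hints = []
    lowered = logs.lower()
    if "out of memory" in lowered:
        hints.append("OOM: reduce max_model_len, add GPUs, or use a quantized model")
    if "401" in logs or "gated" in lowered:
        hints.append("Gated model: pass env={'HF_TOKEN': ...}")
    if "quantization" in lowered:
        hints.append("Quantization mismatch: match the method to the checkpoint")
    return hints


print(triage_logs("torch.cuda.OutOfMemoryError: CUDA out of memory"))
```

Feed it `deployment.logs()` to get a first guess before reading the full log output.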

Slow Startup

Large models take time to download and load:

deployment = client.deploy_vllm(
    name="large-model",
    model="meta-llama/Llama-3.1-70B-Instruct",
    timeout=1200,  # 20 minutes
)

High Latency

  • Reduce max_model_len for shorter context
  • Enable tensor parallelism with more GPUs
  • Use quantized models for faster inference
