LLM Inference
Summon Llama, Mistral, Qwen, or any Hugging Face model. One function call, one OpenAI-compatible endpoint. Your model, served.
Why Basilica for LLM Inference?
- OpenAI-compatible API: Drop-in replacement for OpenAI API clients
- Optimized performance: PagedAttention, continuous batching, tensor parallelism
- Simple deployment: One function call deploys a full inference server
- Cost-effective: Pay only for GPU time you use
Supported Frameworks
| Framework | Best For | Performance |
|---|---|---|
| vLLM | Production workloads | High throughput, low latency |
| SGLang | Complex pipelines | Structured generation, multi-turn |
Quick Start
Deploy a model with a single function call:
```python
from basilica import BasilicaClient

client = BasilicaClient()

# Deploy vLLM with Qwen 2.5
deployment = client.deploy_vllm(
    name="qwen-api",
    model="Qwen/Qwen2.5-7B-Instruct",
    gpu_count=1,
)

print(f"OpenAI-compatible API: {deployment.url}/v1")
```

Use with any OpenAI client:
```python
from openai import OpenAI

client = OpenAI(
    base_url=f"{deployment.url}/v1",
    api_key="not-needed",  # Basilica handles auth
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)

print(response.choices[0].message.content)
```

Choosing a Framework
Use vLLM When
- You need maximum throughput for chat completions
- You want battle-tested production stability
- Your workload is request/response based
- You need tensor parallelism for large models
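For large models, the idea above can be sketched as a multi-GPU deployment. This is a sketch, not SDK reference: it assumes `deploy_vllm` uses `gpu_count` as the tensor-parallel degree, and the deployment name is illustrative.

```python
def deploy_large_model():
    """Sketch: serve a 70B-class model sharded across several GPUs.

    Assumes deploy_vllm maps gpu_count to the tensor-parallel degree;
    adjust to the SDK's actual behavior.
    """
    from basilica import BasilicaClient  # SDK assumed installed

    client = BasilicaClient()
    return client.deploy_vllm(
        name="llama-70b-api",  # illustrative deployment name
        model="meta-llama/Llama-3.1-70B-Instruct",
        gpu_count=4,  # shard the weights across 4 GPUs
    )
```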
Use SGLang When
- You have complex multi-turn conversations
- You need structured generation (JSON schemas, constrained outputs)
- You want RadixAttention for KV cache reuse
- Your pipeline involves branching logic
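To make "structured generation" concrete, here is a sketch of a request body that constrains output to a JSON schema. The `response_format` shape follows OpenAI's `json_schema` convention, which SGLang's OpenAI-compatible API is assumed to accept; exact support may vary by version, and the schema itself is only an example.

```python
import json

def structured_request_body(prompt: str) -> dict:
    """Build a chat-completions body that constrains output to a JSON schema."""
    schema = {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "population": {"type": "integer"},
        },
        "required": ["city", "population"],
    }
    return {
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "city_info", "schema": schema},
        },
    }

body = structured_request_body("Name a large city and its population.")
print(json.dumps(body, indent=2))
```

POST this body to `/v1/chat/completions` and the returned message content should parse as JSON matching the schema.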
Dedicated Methods
The SDK provides dedicated methods for LLM inference:
| Method | Description |
|---|---|
| `client.deploy_vllm()` | Deploy a vLLM OpenAI-compatible server |
| `client.deploy_sglang()` | Deploy an SGLang server |
These methods handle:
- Container image selection
- Environment configuration
- Health check configuration
- Optimized resource defaults
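Since only `deploy_vllm` is shown above, here is a sketch of the SGLang counterpart. Its signature is assumed to mirror `deploy_vllm`; the deployment name is illustrative.

```python
def deploy_sglang_server():
    """Sketch: deploy an SGLang server, assuming parameters mirror deploy_vllm."""
    from basilica import BasilicaClient  # SDK assumed installed

    client = BasilicaClient()
    return client.deploy_sglang(
        name="qwen-sglang",  # illustrative deployment name
        model="Qwen/Qwen2.5-7B-Instruct",
        gpu_count=1,
    )
```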
Model Compatibility
Both vLLM and SGLang support most Hugging Face models:
Chat Models
- `meta-llama/Llama-3.1-8B-Instruct`
- `meta-llama/Llama-3.1-70B-Instruct`
- `mistralai/Mistral-7B-Instruct-v0.3`
- `mistralai/Mixtral-8x7B-Instruct-v0.1`
- `Qwen/Qwen2.5-7B-Instruct`
- `Qwen/Qwen2.5-72B-Instruct`
- `deepseek-ai/DeepSeek-V2-Chat`
Base Models
- `meta-llama/Llama-3.1-8B`
- `mistralai/Mistral-7B-v0.3`
- `Qwen/Qwen2.5-7B`
Code Models
- `codellama/CodeLlama-34b-Instruct-hf`
- `deepseek-ai/deepseek-coder-6.7b-instruct`
- `Qwen/Qwen2.5-Coder-7B-Instruct`
Some models require accepting a license agreement on Hugging Face. Set `HF_TOKEN` in your environment or pass it via the `env` parameter.
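Forwarding the token might look like the sketch below. It assumes `env` accepts a dict of environment variables for the serving container; the deployment name is illustrative.

```python
import os

def deploy_gated_model(model_id: str):
    """Sketch: deploy a license-gated model, forwarding HF_TOKEN.

    Assumes deploy_vllm accepts an `env` dict of environment variables
    passed through to the serving container.
    """
    from basilica import BasilicaClient  # SDK assumed installed

    token = os.environ["HF_TOKEN"]  # raises KeyError if not set locally
    client = BasilicaClient()
    return client.deploy_vllm(
        name="llama-api",  # illustrative deployment name
        model=model_id,
        gpu_count=1,
        env={"HF_TOKEN": token},
    )
```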
API Endpoints
Both vLLM and SGLang expose OpenAI-compatible endpoints:
| Endpoint | Description |
|---|---|
| `POST /v1/chat/completions` | Chat completions (recommended) |
| `POST /v1/completions` | Text completions |
| `GET /v1/models` | List available models |
| `GET /health` | Health check |
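The `/health` endpoint is useful for waiting on a fresh deployment before sending traffic. A minimal stdlib-only polling helper might look like this (the URL comes from your own deployment):

```python
import time
import urllib.request

def wait_until_healthy(base_url: str, timeout_s: float = 300.0) -> bool:
    """Poll GET /health until the server answers 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # server still starting; retry
        time.sleep(2)
    return False
```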
Chat Completions
```shell
curl -X POST "${DEPLOYMENT_URL}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is 2+2?"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'
```

Streaming
```python
from openai import OpenAI

client = OpenAI(base_url=f"{deployment.url}/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```

Next Steps
- Deploy with vLLM: Production-grade inference
- Deploy with SGLang: Structured generation
- Deploy custom models: Fine-tuned and private models