Basilica

LLM Inference

Summon Llama, Mistral, Qwen, or any Hugging Face model. One function call, one OpenAI-compatible endpoint. Your model, served.

Why Basilica for LLM Inference?

  • OpenAI-compatible API: Drop-in replacement for OpenAI API clients
  • Optimized performance: PagedAttention, continuous batching, tensor parallelism
  • Simple deployment: One function call deploys a full inference server
  • Cost-effective: Pay only for GPU time you use

Supported Frameworks

| Framework | Best For | Performance |
|-----------|----------|-------------|
| vLLM | Production workloads | High throughput, low latency |
| SGLang | Complex pipelines | Structured generation, multi-turn |

Quick Start

Deploy a model with a single function call:

from basilica import BasilicaClient

client = BasilicaClient()

# Deploy vLLM with Qwen 2.5
deployment = client.deploy_vllm(
    name="qwen-api",
    model="Qwen/Qwen2.5-7B-Instruct",
    gpu_count=1,
)

print(f"OpenAI-compatible API: {deployment.url}/v1")

Use with any OpenAI client:

from openai import OpenAI

client = OpenAI(
    base_url=f"{deployment.url}/v1",
    api_key="not-needed",  # Basilica handles auth
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)

print(response.choices[0].message.content)

Choosing a Framework

Use vLLM When

  • You need maximum throughput for chat completions
  • You want battle-tested production stability
  • Your workload is request/response based
  • You need tensor parallelism for large models
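As a sketch of the multi-GPU case, the arguments below describe a 70B deployment; with gpu_count greater than 1, vLLM shards the model across GPUs via tensor parallelism. The deployment name is illustrative, and only the parameters already shown in the Quick Start are assumed:

```python
# Arguments for a hypothetical multi-GPU vLLM deployment. With gpu_count > 1,
# vLLM shards the model's weights across GPUs via tensor parallelism.
deploy_kwargs = dict(
    name="llama-70b-api",                       # illustrative deployment name
    model="meta-llama/Llama-3.1-70B-Instruct",  # too large for a single GPU
    gpu_count=4,                                # one tensor-parallel rank per GPU
)

# deployment = client.deploy_vllm(**deploy_kwargs)
print(deploy_kwargs["gpu_count"])  # → 4
```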

Use SGLang When

  • You have complex multi-turn conversations
  • You need structured generation (JSON schemas, constrained outputs)
  • You want RadixAttention for KV cache reuse
  • Your pipeline involves branching logic
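For structured generation, the request carries a JSON schema that constrains the model's output. The snippet below builds such a payload with the standard library; the "response_format"/"json_schema" shape follows the OpenAI-style structured-outputs convention, and the exact field your SGLang version accepts may differ, so treat this as a sketch:

```python
import json

# Schema the generated output must satisfy.
schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "temperature_c": {"type": "number"},
    },
    "required": ["city", "temperature_c"],
}

# Chat-completions payload asking for schema-constrained JSON output.
payload = {
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Report the weather in Paris as JSON."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "weather", "schema": schema},
    },
}

print(json.dumps(payload, indent=2))
```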

Dedicated Methods

The SDK provides dedicated methods for LLM inference:

| Method | Description |
|--------|-------------|
| client.deploy_vllm() | Deploy vLLM OpenAI-compatible server |
| client.deploy_sglang() | Deploy SGLang server |

These methods handle:

  • Container image selection
  • Environment configuration
  • Health check configuration
  • Optimized resource defaults

Model Compatibility

Both vLLM and SGLang support most Hugging Face models:

Chat Models

  • meta-llama/Llama-3.1-8B-Instruct
  • meta-llama/Llama-3.1-70B-Instruct
  • mistralai/Mistral-7B-Instruct-v0.3
  • mistralai/Mixtral-8x7B-Instruct-v0.1
  • Qwen/Qwen2.5-7B-Instruct
  • Qwen/Qwen2.5-72B-Instruct
  • deepseek-ai/DeepSeek-V2-Chat

Base Models

  • meta-llama/Llama-3.1-8B
  • mistralai/Mistral-7B-v0.3
  • Qwen/Qwen2.5-7B

Code Models

  • codellama/CodeLlama-34b-Instruct-hf
  • deepseek-ai/deepseek-coder-6.7b-instruct
  • Qwen/Qwen2.5-Coder-7B-Instruct

Some models require accepting license agreements on Hugging Face. Set HF_TOKEN in your environment or pass it via the env parameter.
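For example, a gated model can be deployed by forwarding the token from your environment; the deploy call mirrors the Quick Start, with env as the extra parameter mentioned above (the deployment name is illustrative):

```python
import os

# Gated models (e.g. the Llama 3.1 family) require a Hugging Face token.
# Read it from the environment rather than hard-coding it.
env = {"HF_TOKEN": os.environ.get("HF_TOKEN", "")}

# deployment = client.deploy_vllm(
#     name="llama-api",
#     model="meta-llama/Llama-3.1-8B-Instruct",
#     gpu_count=1,
#     env=env,  # forwarded to the inference container
# )
```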

API Endpoints

Both vLLM and SGLang expose OpenAI-compatible endpoints:

| Endpoint | Description |
|----------|-------------|
| POST /v1/chat/completions | Chat completions (recommended) |
| POST /v1/completions | Text completions |
| GET /v1/models | List available models |
| GET /health | Health check |
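All of these paths hang off the deployment URL. A small helper (hypothetical, shown with an illustrative URL) composes them without doubling slashes:

```python
def endpoint(base_url: str, path: str) -> str:
    """Join a deployment URL and an API path without doubling slashes."""
    return base_url.rstrip("/") + path

# Illustrative URL; use your own deployment.url in practice.
base = "https://qwen-api.example.basilica.ai/"
print(endpoint(base, "/v1/chat/completions"))
# → https://qwen-api.example.basilica.ai/v1/chat/completions
```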

Chat Completions

curl -X POST "${DEPLOYMENT_URL}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is 2+2?"}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'
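The same request can be built with Python's standard library. The snippet below constructs the request without sending it; the URL is illustrative, and the final call is commented out since it requires a live deployment:

```python
import json
import urllib.request

# Same body as the curl example above.
payload = {
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is 2+2?"},
    ],
    "temperature": 0.7,
    "max_tokens": 100,
}

req = urllib.request.Request(
    "https://qwen-api.example.basilica.ai/v1/chat/completions",  # illustrative URL
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# urllib defaults to POST whenever a request body is supplied.
print(req.get_method())  # → POST
# response = urllib.request.urlopen(req)  # send it against a live deployment
```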

Streaming

from openai import OpenAI

client = OpenAI(base_url=f"{deployment.url}/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
