SGLang

SGLang is for when you need structured outputs and complex pipelines: RadixAttention, constrained generation, and multi-turn efficiency. Deploy it with deploy_sglang().

Quick Start

Deploy an SGLang inference server with one function call:

from basilica import BasilicaClient

client = BasilicaClient()

deployment = client.deploy_sglang(
    name="qwen-sglang",
    model="Qwen/Qwen2.5-7B-Instruct",
    gpu_count=1,
)

print(f"API endpoint: {deployment.url}/v1")

API Reference

deployment = client.deploy_sglang(
    name: str,                              # Deployment name
    model: str,                             # Hugging Face model ID
    gpu_count: int = 1,                     # Number of GPUs
    tensor_parallel_size: int = None,       # Tensor parallelism (default: gpu_count)
    context_length: int = None,             # Maximum context length
    quantization: str = None,               # Quantization method
    gpu_models: List[str] = None,           # GPU model filter
    min_gpu_memory_gb: int = None,          # Minimum GPU VRAM
    memory: str = "16Gi",                   # CPU memory allocation
    env: Dict[str, str] = None,             # Additional environment variables
    timeout: int = 600,                     # Deployment timeout
) -> Deployment

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| name | str | required | Deployment name (DNS-safe) |
| model | str | required | Hugging Face model ID |
| gpu_count | int | 1 | Number of GPUs to allocate |
| tensor_parallel_size | int | gpu_count | Number of GPUs for tensor parallelism |
| context_length | int | model default | Maximum context length |
| quantization | str | None | Quantization method |
| gpu_models | List[str] | None | Acceptable GPU models |
| min_gpu_memory_gb | int | None | Minimum GPU VRAM in GB |
| memory | str | "16Gi" | CPU memory allocation |
| env | Dict[str, str] | None | Additional environment variables |
| timeout | int | 600 | Seconds to wait for ready |

Why SGLang?

RadixAttention

SGLang's RadixAttention efficiently reuses KV cache across requests with shared prefixes. This is beneficial for:

  • System prompts: Shared system message across conversations
  • Few-shot learning: Reuse examples across queries
  • Document QA: Multiple questions about the same document
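The shared-prefix pattern above can be sketched with plain message construction. This is illustrative only: the requests would be sent through the OpenAI-compatible client shown later, and the prompts are made up.

```python
# Illustrative sketch: two conversations that share a system prompt.
# RadixAttention keeps the KV cache for the common prefix, so only
# the differing user turns need fresh prefill computation.
SYSTEM_PROMPT = "You are a helpful assistant."

def build_messages(user_text: str) -> list:
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_text},
    ]

conversations = [
    build_messages("Summarize the plot of Hamlet."),
    build_messages("List three uses of Python."),
]

# The first message is byte-identical across requests, which is
# exactly the prefix RadixAttention can reuse.
shared_prefix = conversations[0][0]
```

The same effect applies to few-shot examples or a long document pasted into the system turn: keep the shared text identical across requests so the cached prefix matches.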

Structured Generation

SGLang excels at generating structured outputs:

  • JSON with schema validation
  • Regex-constrained outputs
  • Choice selection from options

Basic Examples

Deploy Qwen 2.5

from basilica import BasilicaClient

client = BasilicaClient()

deployment = client.deploy_sglang(
    name="qwen-sglang",
    model="Qwen/Qwen2.5-7B-Instruct",
)

Deploy Llama 3.1

deployment = client.deploy_sglang(
    name="llama-sglang",
    model="meta-llama/Llama-3.1-8B-Instruct",
    env={"HF_TOKEN": "hf_..."},
)

Deploy Mistral

deployment = client.deploy_sglang(
    name="mistral-sglang",
    model="mistralai/Mistral-7B-Instruct-v0.3",
)

Using the API

SGLang provides an OpenAI-compatible API:

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url=f"{deployment.url}/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain machine learning in simple terms."},
    ],
    temperature=0.7,
    max_tokens=500,
)

print(response.choices[0].message.content)

Streaming

stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about programming"}],
    stream=True,
)

for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)

cURL

curl -X POST "${DEPLOYMENT_URL}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'

Structured Generation

Structured generation is SGLang's standout capability. Use the native API for constrained outputs:

JSON Output

import httpx

response = httpx.post(
    f"{deployment.url}/generate",
    json={
        "text": "Generate a user profile:",
        "sampling_params": {
            "temperature": 0.7,
            "max_new_tokens": 200,
        },
        "return_logprob": False,
        "regex": r'\{"name": "[^"]+", "age": \d+, "email": "[^"]+"\}',
    },
)

print(response.json()["text"])
# {"name": "John Smith", "age": 28, "email": "john@example.com"}
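Before sending a constraint, it can help to sanity-check the pattern locally against the shape of output you expect. This is a hypothetical pre-flight check, not part of the SGLang API:

```python
import re

# The same pattern passed as the "regex" field above.
USER_PROFILE_RE = r'\{"name": "[^"]+", "age": \d+, "email": "[^"]+"\}'

sample = '{"name": "John Smith", "age": 28, "email": "john@example.com"}'

# fullmatch ensures the whole output conforms, not just a substring.
assert re.fullmatch(USER_PROFILE_RE, sample) is not None
```

If the pattern rejects an output you consider valid, loosen it locally first rather than iterating against the deployed server.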

Choice Selection

response = httpx.post(
    f"{deployment.url}/generate",
    json={
        "text": "Is Python a compiled or interpreted language?",
        "sampling_params": {"temperature": 0},
        "choices": ["compiled", "interpreted"],
    },
)

print(response.json()["text"])  # "interpreted"

JSON Schema

schema = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
        "keywords": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["sentiment", "confidence", "keywords"],
}

response = httpx.post(
    f"{deployment.url}/generate",
    json={
        "text": "Analyze the sentiment: 'I love this product! It works great.'",
        "sampling_params": {"temperature": 0.1, "max_new_tokens": 100},
        "json_schema": schema,
    },
)

import json
result = json.loads(response.json()["text"])
print(result)
# {"sentiment": "positive", "confidence": 0.95, "keywords": ["love", "great", "works"]}
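A quick way to confirm the constrained output actually satisfies the schema is a local check. The sketch below uses only the standard library; the jsonschema package would do this more generally:

```python
import json

# The example response shown above, parsed back into a dict.
result = json.loads(
    '{"sentiment": "positive", "confidence": 0.95, '
    '"keywords": ["love", "great", "works"]}'
)

# Mirror the schema's constraints.
assert set(result) >= {"sentiment", "confidence", "keywords"}      # required
assert result["sentiment"] in {"positive", "negative", "neutral"}  # enum
assert 0 <= result["confidence"] <= 1                              # bounds
assert all(isinstance(k, str) for k in result["keywords"])         # item type
```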

Advanced Configuration

Tensor Parallelism

For models too large to fit on a single GPU, shard them with tensor parallelism:

deployment = client.deploy_sglang(
    name="llama-70b-sglang",
    model="meta-llama/Llama-3.1-70B-Instruct",
    gpu_count=4,
    tensor_parallel_size=4,
    memory="128Gi",
    timeout=900,
)

Custom Context Length

deployment = client.deploy_sglang(
    name="long-context",
    model="Qwen/Qwen2.5-7B-Instruct",
    context_length=32768,
    min_gpu_memory_gb=24,
)

Performance Tuning

deployment = client.deploy_sglang(
    name="optimized-sglang",
    model="Qwen/Qwen2.5-7B-Instruct",
    env={
        # Memory management
        "SGL_MEM_FRACTION_STATIC": "0.85",

        # Chunked prefill for better throughput
        "SGL_CHUNKED_PREFILL_SIZE": "8192",

        # Attention backend
        "SGL_ATTENTION_BACKEND": "flashinfer",
    },
)

SGLang Native API

Beyond OpenAI compatibility, SGLang provides a native API with additional features:

Generate Endpoint

import httpx

response = httpx.post(
    f"{deployment.url}/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 50,
        },
    },
)

print(response.json()["text"])

Batch Generation

response = httpx.post(
    f"{deployment.url}/generate",
    json={
        "text": [
            "Translate to French: Hello",
            "Translate to French: Goodbye",
            "Translate to French: Thank you",
        ],
        "sampling_params": {"temperature": 0, "max_new_tokens": 50},
    },
)

for result in response.json():
    print(result["text"])

Get Model Info

response = httpx.get(f"{deployment.url}/get_model_info")
print(response.json())

Gated Models

For models requiring Hugging Face authentication:

deployment = client.deploy_sglang(
    name="llama-sglang",
    model="meta-llama/Llama-3.1-8B-Instruct",
    env={"HF_TOKEN": "hf_..."},
)

See Custom Models for detailed setup.

Monitoring

Check Status

status = deployment.status()

print(f"State: {status.state}")
print(f"Ready: {status.is_ready}")

View Logs

print(deployment.logs(tail=50))

Health Check

import httpx

response = httpx.get(f"{deployment.url}/health")
print(response.json())

API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| /v1/chat/completions | POST | OpenAI-compatible chat |
| /v1/completions | POST | OpenAI-compatible completion |
| /v1/models | GET | List models |
| /generate | POST | Native SGLang generation |
| /get_model_info | GET | Model information |
| /health | GET | Health check |
| /health_generate | GET | Health check with generation test |
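The table above can be wrapped in a small helper. The endpoint paths come from the table; the mapping names, function, and example base URL are illustrative:

```python
# Illustrative helper mapping the endpoint table to full URLs.
SGLANG_ENDPOINTS = {
    "chat": "/v1/chat/completions",
    "completions": "/v1/completions",
    "models": "/v1/models",
    "generate": "/generate",
    "model_info": "/get_model_info",
    "health": "/health",
    "health_generate": "/health_generate",
}

def endpoint_url(base_url: str, name: str) -> str:
    """Join deployment.url with a named endpoint path."""
    return base_url.rstrip("/") + SGLANG_ENDPOINTS[name]

url = endpoint_url("http://example.internal:30000/", "generate")
```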

SGLang vs vLLM

| Feature | SGLang | vLLM |
|---|---|---|
| OpenAI API | Yes | Yes |
| Throughput | High | Very high |
| RadixAttention | Yes | No |
| Structured generation | Native | Limited |
| JSON Schema | Yes | No |
| Regex constraints | Yes | No |
| Choice selection | Yes | No |
| Multi-turn efficiency | Better | Good |

Choose SGLang when:

  • You need structured outputs (JSON, regex)
  • Your workload has shared prefixes (system prompts, documents)
  • You're building complex multi-turn conversations

Choose vLLM when:

  • Raw throughput is the priority
  • You have simple request/response patterns
  • You need maximum battle-tested stability

Troubleshooting

Model Won't Load

print(deployment.logs())

Common issues:

  • Out of memory: Reduce context length or add GPUs
  • Missing token: Set HF_TOKEN for gated models

Generation Errors

For structured generation issues:

  • Ensure schema is valid JSON Schema
  • Check regex patterns are correct
  • Verify model supports the output format

Slow Performance

  • Enable tensor parallelism for large models
  • Use SGL_CHUNKED_PREFILL_SIZE for better batching
  • Reduce context length if not needed
