SGLang
Use SGLang when you need structured outputs and complex pipelines: RadixAttention, constrained generation, and multi-turn efficiency. Deploy with deploy_sglang().
Quick Start
Deploy an SGLang inference server with one function call:
from basilica import BasilicaClient
client = BasilicaClient()
deployment = client.deploy_sglang(
name="qwen-sglang",
model="Qwen/Qwen2.5-7B-Instruct",
gpu_count=1,
)
print(f"API endpoint: {deployment.url}/v1")
API Reference
deployment = client.deploy_sglang(
name: str, # Deployment name
model: str, # Hugging Face model ID
gpu_count: int = 1, # Number of GPUs
tensor_parallel_size: int = None, # Tensor parallelism (default: gpu_count)
context_length: int = None, # Maximum context length
quantization: str = None, # Quantization method
gpu_models: List[str] = None, # GPU model filter
min_gpu_memory_gb: int = None, # Minimum GPU VRAM
memory: str = "16Gi", # CPU memory allocation
env: Dict[str, str] = None, # Additional environment variables
timeout: int = 600, # Deployment timeout
) -> DeploymentParameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| name | str | required | Deployment name (DNS-safe) |
| model | str | required | Hugging Face model ID |
| gpu_count | int | 1 | Number of GPUs to allocate |
| tensor_parallel_size | int | gpu_count | Number of GPUs for tensor parallelism |
| context_length | int | model default | Maximum context length |
| quantization | str | None | Quantization method |
| gpu_models | List[str] | None | Acceptable GPU models |
| min_gpu_memory_gb | int | None | Minimum GPU VRAM in GB |
| memory | str | "16Gi" | CPU memory allocation |
| env | Dict[str, str] | None | Additional environment variables |
| timeout | int | 600 | Seconds to wait for ready |
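The hardware-selection parameters from the table can be combined; a sketch (the deployment name, model ID, and GPU model strings below are illustrative, not prescribed values):

```python
# Hypothetical deployment pinning hardware: only schedule on A100/H100
# nodes with at least 40 GB of VRAM, and use an AWQ-quantized checkpoint.
deploy_kwargs = dict(
    name="qwen-32b-awq",                  # DNS-safe deployment name
    model="Qwen/Qwen2.5-32B-Instruct-AWQ",
    quantization="awq",                   # must match the checkpoint format
    gpu_count=2,
    tensor_parallel_size=2,               # defaults to gpu_count anyway
    gpu_models=["A100", "H100"],
    min_gpu_memory_gb=40,
    timeout=900,                          # larger models load more slowly
)
# deployment = client.deploy_sglang(**deploy_kwargs)
```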
Why SGLang?
RadixAttention
SGLang's RadixAttention efficiently reuses KV cache across requests with shared prefixes. This is beneficial for:
- System prompts: Shared system message across conversations
- Few-shot learning: Reuse examples across queries
- Document QA: Multiple questions about the same document
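RadixAttention needs no client-side changes: requests whose token prefixes are identical (for example, the same system message) reuse the cached KV automatically. A minimal sketch, assuming the OpenAI-compatible endpoint shown below; the prompt and questions are illustrative:

```python
# Several chat payloads sharing one system prompt. With RadixAttention,
# the server computes KV cache for the shared prefix once and reuses it
# for every request in the batch.
SYSTEM_PROMPT = "You are a support agent for Acme Widgets. Be concise."

questions = [
    "How do I reset my password?",
    "What is your refund policy?",
    "Do you ship internationally?",
]

payloads = [
    {
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": q},
        ],
    }
    for q in questions
]

# Every payload starts from a byte-identical prefix, which is what the
# radix cache keys on.
assert all(p["messages"][0]["content"] == SYSTEM_PROMPT for p in payloads)
```

Send each payload with `client.chat.completions.create(**payload)`; only the per-question suffix is prefilled from scratch.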
Structured Generation
SGLang excels at generating structured outputs:
- JSON with schema validation
- Regex-constrained outputs
- Choice selection from options
Basic Examples
Deploy Qwen 2.5
from basilica import BasilicaClient
client = BasilicaClient()
deployment = client.deploy_sglang(
name="qwen-sglang",
model="Qwen/Qwen2.5-7B-Instruct",
)
Deploy Llama 3.1
deployment = client.deploy_sglang(
name="llama-sglang",
model="meta-llama/Llama-3.1-8B-Instruct",
env={"HF_TOKEN": "hf_..."},
)
Deploy Mistral
deployment = client.deploy_sglang(
name="mistral-sglang",
model="mistralai/Mistral-7B-Instruct-v0.3",
)
Using the API
SGLang provides an OpenAI-compatible API:
Python (OpenAI SDK)
from openai import OpenAI
client = OpenAI(
base_url=f"{deployment.url}/v1",
api_key="not-needed",
)
response = client.chat.completions.create(
model="Qwen/Qwen2.5-7B-Instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain machine learning in simple terms."},
],
temperature=0.7,
max_tokens=500,
)
print(response.choices[0].message.content)
Streaming
stream = client.chat.completions.create(
model="Qwen/Qwen2.5-7B-Instruct",
messages=[{"role": "user", "content": "Write a haiku about programming"}],
stream=True,
)
for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
cURL
curl -X POST "${DEPLOYMENT_URL}/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-7B-Instruct",
"messages": [
{"role": "user", "content": "Hello!"}
]
}'
Structured Generation
SGLang's key feature is structured generation. Use the native API for advanced features:
JSON Output
import httpx
response = httpx.post(
f"{deployment.url}/generate",
json={
"text": "Generate a user profile:",
"sampling_params": {
"temperature": 0.7,
"max_new_tokens": 200,
},
"return_logprob": False,
"regex": r'\{"name": "[^"]+", "age": \d+, "email": "[^"]+"\}',
},
)
print(response.json()["text"])
# {"name": "John Smith", "age": 28, "email": "john@example.com"}
Choice Selection
response = httpx.post(
f"{deployment.url}/generate",
json={
"text": "Is Python a compiled or interpreted language?",
"sampling_params": {"temperature": 0},
"choices": ["compiled", "interpreted"],
},
)
print(response.json()["text"])  # "interpreted"
JSON Schema
schema = {
"type": "object",
"properties": {
"sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
"confidence": {"type": "number", "minimum": 0, "maximum": 1},
"keywords": {"type": "array", "items": {"type": "string"}},
},
"required": ["sentiment", "confidence", "keywords"],
}
response = httpx.post(
f"{deployment.url}/generate",
json={
"text": "Analyze the sentiment: 'I love this product! It works great.'",
"sampling_params": {"temperature": 0.1, "max_new_tokens": 100},
"json_schema": schema,
},
)
import json
result = json.loads(response.json()["text"])
print(result)
# {"sentiment": "positive", "confidence": 0.95, "keywords": ["love", "great", "works"]}
Advanced Configuration
Tensor Parallelism
For large models:
deployment = client.deploy_sglang(
name="llama-70b-sglang",
model="meta-llama/Llama-3.1-70B-Instruct",
gpu_count=4,
tensor_parallel_size=4,
memory="128Gi",
timeout=900,
)
Custom Context Length
deployment = client.deploy_sglang(
name="long-context",
model="Qwen/Qwen2.5-7B-Instruct",
context_length=32768,
min_gpu_memory_gb=24,
)
Performance Tuning
deployment = client.deploy_sglang(
name="optimized-sglang",
model="Qwen/Qwen2.5-7B-Instruct",
env={
# Memory management
"SGL_MEM_FRACTION_STATIC": "0.85",
# Chunked prefill for better throughput
"SGL_CHUNKED_PREFILL_SIZE": "8192",
# Attention backend
"SGL_ATTENTION_BACKEND": "flashinfer",
},
)
SGLang Native API
Beyond OpenAI compatibility, SGLang provides a native API with additional features:
Generate Endpoint
import httpx
response = httpx.post(
f"{deployment.url}/generate",
json={
"text": "The capital of France is",
"sampling_params": {
"temperature": 0,
"max_new_tokens": 50,
},
},
)
print(response.json()["text"])
Batch Generation
response = httpx.post(
f"{deployment.url}/generate",
json={
"text": [
"Translate to French: Hello",
"Translate to French: Goodbye",
"Translate to French: Thank you",
],
"sampling_params": {"temperature": 0, "max_new_tokens": 50},
},
)
for result in response.json():
    print(result["text"])
Get Model Info
response = httpx.get(f"{deployment.url}/get_model_info")
print(response.json())
Gated Models
For models requiring Hugging Face authentication:
deployment = client.deploy_sglang(
name="llama-sglang",
model="meta-llama/Llama-3.1-8B-Instruct",
env={"HF_TOKEN": "hf_..."},
)
See Custom Models for detailed setup.
Monitoring
Check Status
status = deployment.status()
print(f"State: {status.state}")
print(f"Ready: {status.is_ready}")
View Logs
print(deployment.logs(tail=50))
Health Check
import httpx
response = httpx.get(f"{deployment.url}/health")
print(response.json())
API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /v1/chat/completions | POST | OpenAI-compatible chat |
| /v1/completions | POST | OpenAI-compatible completion |
| /v1/models | GET | List models |
| /generate | POST | Native SGLang generation |
| /get_model_info | GET | Model information |
| /health | GET | Health check |
| /health_generate | GET | Health check with generation test |
SGLang vs vLLM
| Feature | SGLang | vLLM |
|---|---|---|
| OpenAI API | Yes | Yes |
| Throughput | High | Very High |
| RadixAttention | Yes | No |
| Structured Generation | Native | Limited |
| JSON Schema | Yes | No |
| Regex Constraints | Yes | No |
| Choice Selection | Yes | No |
| Multi-turn Efficiency | Better | Good |
Choose SGLang when:
- You need structured outputs (JSON, regex)
- Your workload has shared prefixes (system prompts, documents)
- You're building complex multi-turn conversations
Choose vLLM when:
- Raw throughput is the priority
- You have simple request/response patterns
- You need maximum battle-tested stability
Troubleshooting
Model Won't Load
print(deployment.logs())
Common issues:
- Out of memory: Reduce context length or add GPUs
- Missing token: Set HF_TOKEN for gated models
Generation Errors
For structured generation issues:
- Ensure schema is valid JSON Schema
- Check regex patterns are correct
- Verify model supports the output format
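Regex constraints can be sanity-checked locally before involving the server at all. A small stdlib-only sketch reusing the pattern from the JSON Output example; the sample string is hand-written for illustration:

```python
import re

# Pre-flight check: compile the constraint regex locally and confirm a
# hand-written sample output matches it before attaching it to /generate.
profile_re = r'\{"name": "[^"]+", "age": \d+, "email": "[^"]+"\}'
sample = '{"name": "Ada Lovelace", "age": 36, "email": "ada@example.com"}'

pattern = re.compile(profile_re)  # raises re.error if the pattern is malformed
assert pattern.fullmatch(sample) is not None
```

If `re.compile` raises or the sample fails to match, fix the pattern locally first; server-side constrained decoding can only ever emit strings the pattern accepts.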
Slow Performance
- Enable tensor parallelism for large models
- Use SGL_CHUNKED_PREFILL_SIZE for better batching
- Reduce context length if not needed
Next Steps
- Deploy with vLLM for maximum throughput
- Deploy custom models for fine-tuned models
- GPU configuration for advanced GPU settings