SGLang
Use SGLang when you need structured outputs and complex pipelines: RadixAttention, constrained generation, and multi-turn efficiency. Deploy with deploy_sglang().
Quick Start
Deploy an SGLang inference server with one function call:
from basilica import BasilicaClient
client = BasilicaClient()
deployment = client.deploy_sglang(
name="qwen-sglang",
model="Qwen/Qwen2.5-7B-Instruct",
gpu_count=1,
)
print(f"API endpoint: {deployment.url}/v1")
API Reference
deployment = client.deploy_sglang(
name: str, # Deployment name
model: str, # Hugging Face model ID
gpu_count: int = 1, # Number of GPUs
tensor_parallel_size: int = None, # Tensor parallelism (default: gpu_count)
context_length: int = None, # Maximum context length
quantization: str = None, # Quantization method
gpu_models: List[str] = None, # GPU model filter
min_gpu_memory_gb: int = None, # Minimum GPU VRAM
memory: str = "16Gi", # CPU memory allocation
env: Dict[str, str] = None, # Additional environment variables
timeout: int = 600, # Deployment timeout
) -> DeploymentParameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| name | str | required | Deployment name (DNS-safe) |
| model | str | required | Hugging Face model ID |
| gpu_count | int | 1 | Number of GPUs to allocate |
| tensor_parallel_size | int | gpu_count | Number of GPUs for tensor parallelism |
| context_length | int | model default | Maximum context length |
| quantization | str | None | Quantization method |
| gpu_models | List[str] | None | Acceptable GPU models |
| min_gpu_memory_gb | int | None | Minimum GPU VRAM in GB |
| memory | str | "16Gi" | CPU memory allocation |
| env | Dict[str, str] | None | Additional environment variables |
| timeout | int | 600 | Seconds to wait for ready |
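The hardware-selection parameters from the table can be combined; a sketch (the deployment name, model ID, and GPU model strings below are illustrative, not prescribed values):

```python
# Hypothetical deployment pinning hardware: only schedule on A100/H100
# nodes with at least 40 GB of VRAM, and use an AWQ-quantized checkpoint.
deploy_kwargs = dict(
    name="qwen-32b-awq",                  # DNS-safe deployment name
    model="Qwen/Qwen2.5-32B-Instruct-AWQ",
    quantization="awq",                   # must match the checkpoint format
    gpu_count=2,
    tensor_parallel_size=2,               # defaults to gpu_count anyway
    gpu_models=["A100", "H100"],
    min_gpu_memory_gb=40,
    timeout=900,                          # larger models load more slowly
)
# deployment = client.deploy_sglang(**deploy_kwargs)
```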
Why SGLang?
RadixAttention
SGLang's RadixAttention efficiently reuses KV cache across requests with shared prefixes. This is beneficial for:
- System prompts: Shared system message across conversations
- Few-shot learning: Reuse examples across queries
- Document QA: Multiple questions about the same document
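RadixAttention needs no client-side changes: requests whose token prefixes are identical (for example, the same system message) reuse the cached KV automatically. A minimal sketch, assuming the OpenAI-compatible endpoint shown below; the prompt and questions are illustrative:

```python
# Several chat payloads sharing one system prompt. With RadixAttention,
# the server computes KV cache for the shared prefix once and reuses it
# for every request in the batch.
SYSTEM_PROMPT = "You are a support agent for Acme Widgets. Be concise."

questions = [
    "How do I reset my password?",
    "What is your refund policy?",
    "Do you ship internationally?",
]

payloads = [
    {
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": q},
        ],
    }
    for q in questions
]

# Every payload starts from a byte-identical prefix, which is what the
# radix cache keys on.
assert all(p["messages"][0]["content"] == SYSTEM_PROMPT for p in payloads)
```

Send each payload with `client.chat.completions.create(**payload)`; only the per-question suffix is prefilled from scratch.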
Structured Generation
SGLang excels at generating structured outputs:
- JSON with schema validation
- Regex-constrained outputs
- Choice selection from options
Basic Examples
Deploy Qwen 2.5
from basilica import BasilicaClient
client = BasilicaClient()
deployment = client.deploy_sglang(
name="qwen-sglang",
model="Qwen/Qwen2.5-7B-Instruct",
)
Deploy Llama 3.1
deployment = client.deploy_sglang(
name="llama-sglang",
model="meta-llama/Llama-3.1-8B-Instruct",
env={"HF_TOKEN": "hf_..."},
)
Deploy Mistral
deployment = client.deploy_sglang(
name="mistral-sglang",
model="mistralai/Mistral-7B-Instruct-v0.3",
)
Using the API
SGLang provides an OpenAI-compatible API:
Python (OpenAI SDK)
from openai import OpenAI
client = OpenAI(
base_url=f"{deployment.url}/v1",
api_key="not-needed",
)
response = client.chat.completions.create(
model="Qwen/Qwen2.5-7B-Instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain machine learning in simple terms."},
],
temperature=0.7,
max_tokens=500,
)
print(response.choices[0].message.content)
Streaming
stream = client.chat.completions.create(
model="Qwen/Qwen2.5-7B-Instruct",
messages=[{"role": "user", "content": "Write a haiku about programming"}],
stream=True,
)
for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
cURL
curl -X POST "${DEPLOYMENT_URL}/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-7B-Instruct",
"messages": [
{"role": "user", "content": "Hello!"}
]
}'
Structured Generation
SGLang's key feature is structured generation. Use the native API for advanced features:
JSON Output
import httpx
response = httpx.post(
f"{deployment.url}/generate",
json={
"text": "Generate a user profile:",
"sampling_params": {
"temperature": 0.7,
"max_new_tokens": 200,
},
"return_logprob": False,
"regex": r'\{"name": "[^"]+", "age": \d+, "email": "[^"]+"\}',
},
)
print(response.json()["text"])
# {"name": "John Smith", "age": 28, "email": "john@example.com"}
Choice Selection
response = httpx.post(
f"{deployment.url}/generate",
json={
"text": "Is Python a compiled or interpreted language?",
"sampling_params": {"temperature": 0},
"choices": ["compiled", "interpreted"],
},
)
print(response.json()["text"])  # "interpreted"
JSON Schema
schema = {
"type": "object",
"properties": {
"sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
"confidence": {"type": "number", "minimum": 0, "maximum": 1},
"keywords": {"type": "array", "items": {"type": "string"}},
},
"required": ["sentiment", "confidence", "keywords"],
}
response = httpx.post(
f"{deployment.url}/generate",
json={
"text": "Analyze the sentiment: 'I love this product! It works great.'",
"sampling_params": {"temperature": 0.1, "max_new_tokens": 100},
"json_schema": schema,
},
)
import json
result = json.loads(response.json()["text"])
print(result)
# {"sentiment": "positive", "confidence": 0.95, "keywords": ["love", "great", "works"]}
Advanced Configuration
Tensor Parallelism
For large models:
deployment = client.deploy_sglang(
name="llama-70b-sglang",
model="meta-llama/Llama-3.1-70B-Instruct",
gpu_count=4,
tensor_parallel_size=4,
memory="128Gi",
timeout=900,
)
Custom Context Length
deployment = client.deploy_sglang(
name="long-context",
model="Qwen/Qwen2.5-7B-Instruct",
context_length=32768,
min_gpu_memory_gb=24,
)
Performance Tuning
deployment = client.deploy_sglang(
name="optimized-sglang",
model="Qwen/Qwen2.5-7B-Instruct",
env={
# Memory management
"SGL_MEM_FRACTION_STATIC": "0.85",
# Chunked prefill for better throughput
"SGL_CHUNKED_PREFILL_SIZE": "8192",
# Attention backend
"SGL_ATTENTION_BACKEND": "flashinfer",
},
)
SGLang Native API
Beyond OpenAI compatibility, SGLang provides a native API with additional features:
Generate Endpoint
import httpx
response = httpx.post(
f"{deployment.url}/generate",
json={
"text": "The capital of France is",
"sampling_params": {
"temperature": 0,
"max_new_tokens": 50,
},
},
)
print(response.json()["text"])
Batch Generation
response = httpx.post(
f"{deployment.url}/generate",
json={
"text": [
"Translate to French: Hello",
"Translate to French: Goodbye",
"Translate to French: Thank you",
],
"sampling_params": {"temperature": 0, "max_new_tokens": 50},
},
)
for result in response.json():
    print(result["text"])
Get Model Info
response = httpx.get(f"{deployment.url}/get_model_info")
print(response.json())
Gated Models
For models requiring Hugging Face authentication:
deployment = client.deploy_sglang(
name="llama-sglang",
model="meta-llama/Llama-3.1-8B-Instruct",
env={"HF_TOKEN": "hf_..."},
)
See Custom Models for detailed setup.
Monitoring
Check Status
status = deployment.status()
print(f"State: {status.state}")
print(f"Ready: {status.is_ready}")
View Logs
print(deployment.logs(tail=50))
Health Check
import httpx
response = httpx.get(f"{deployment.url}/health")
print(response.json())
API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /v1/chat/completions | POST | OpenAI-compatible chat |
| /v1/completions | POST | OpenAI-compatible completion |
| /v1/models | GET | List models |
| /generate | POST | Native SGLang generation |
| /get_model_info | GET | Model information |
| /health | GET | Health check |
| /health_generate | GET | Health check with generation test |
SGLang vs vLLM
| Feature | SGLang | vLLM |
|---|---|---|
| OpenAI API | Yes | Yes |
| Throughput | High | Very High |
| RadixAttention | Yes | No |
| Structured Generation | Native | Limited |
| JSON Schema | Yes | No |
| Regex Constraints | Yes | No |
| Choice Selection | Yes | No |
| Multi-turn Efficiency | Better | Good |
Choose SGLang when:
- You need structured outputs (JSON, regex)
- Your workload has shared prefixes (system prompts, documents)
- You're building complex multi-turn conversations
Choose vLLM when:
- Raw throughput is the priority
- You have simple request/response patterns
- You need maximum battle-tested stability
Troubleshooting
Model Won't Load
print(deployment.logs())
Common issues:
- Out of memory: Reduce context length or add GPUs
- Missing token: Set HF_TOKEN for gated models
Generation Errors
For structured generation issues:
- Ensure schema is valid JSON Schema
- Check regex patterns are correct
- Verify model supports the output format
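Regex constraints can be sanity-checked locally before involving the server at all. A small stdlib-only sketch reusing the pattern from the JSON Output example; the sample string is hand-written for illustration:

```python
import re

# Pre-flight check: compile the constraint regex locally and confirm a
# hand-written sample output matches it before attaching it to /generate.
profile_re = r'\{"name": "[^"]+", "age": \d+, "email": "[^"]+"\}'
sample = '{"name": "Ada Lovelace", "age": 36, "email": "ada@example.com"}'

pattern = re.compile(profile_re)  # raises re.error if the pattern is malformed
assert pattern.fullmatch(sample) is not None
```

If `re.compile` raises or the sample fails to match, fix the pattern locally first; server-side constrained decoding can only ever emit strings the pattern accepts.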
Slow Performance
- Enable tensor parallelism for large models
- Use SGL_CHUNKED_PREFILL_SIZE for better batching
- Reduce context length if not needed
Next Steps
- Deploy with vLLM for maximum throughput
- Deploy custom models for fine-tuned models
- GPU configuration for advanced GPU settings