# Custom Models

Your fine-tuned creations deserve a home. Deploy from Hugging Face, local files, or private registries.
## Hugging Face Models

### Public Models

Deploy any public Hugging Face model:
```python
from basilica import BasilicaClient

client = BasilicaClient()

# vLLM deployment
deployment = client.deploy_vllm(
    name="my-model",
    model="your-username/your-model",
    gpu_count=1,
)

# SGLang deployment
deployment = client.deploy_sglang(
    name="my-model",
    model="your-username/your-model",
    gpu_count=1,
)
```

### Private Models
For private Hugging Face models, provide your access token:
```python
deployment = client.deploy_vllm(
    name="private-model",
    model="your-org/private-model",
    env={"HF_TOKEN": "hf_..."},
)
```

Keep your Hugging Face token secure. Consider using environment variables or secrets management.
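One way to follow that advice, sketched with the standard library (the `env` dict here mirrors the `env` argument used above):

```python
import os

# Read the token from the environment instead of hardcoding it;
# fall back to an empty string so the snippet runs anywhere.
hf_token = os.environ.get("HF_TOKEN", "")

# Pass this dict as the `env` argument to deploy_vllm / deploy_sglang.
env = {"HF_TOKEN": hf_token}
```

This keeps the token out of source control; a secrets manager works the same way, with the lookup replacing the `os.environ` call.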
### Gated Models
Some models require accepting a license agreement on Hugging Face:
1. Visit the model page (e.g., https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
2. Accept the license agreement
3. Deploy with your token:
```python
deployment = client.deploy_vllm(
    name="llama-api",
    model="meta-llama/Llama-3.1-8B-Instruct",
    env={"HF_TOKEN": "hf_..."},
)
```

## Fine-Tuned Models
### LoRA Adapters

Deploy a base model with a LoRA adapter:
```python
# Upload your adapter to Hugging Face first:
# huggingface-cli upload your-org/my-adapter ./adapter

deployment = client.deploy_vllm(
    name="lora-model",
    model="meta-llama/Llama-3.1-8B-Instruct",
    env={
        "HF_TOKEN": "hf_...",
        "VLLM_LORA_MODULES": "my-adapter=your-org/my-adapter",
    },
)
```

Use the adapter in requests:
```python
from openai import OpenAI

client = OpenAI(base_url=f"{deployment.url}/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="my-adapter",  # Use the adapter name
    messages=[{"role": "user", "content": "Hello!"}],
)
```

### Multiple LoRA Adapters
Serve multiple adapters from one deployment:
```python
deployment = client.deploy_vllm(
    name="multi-lora",
    model="meta-llama/Llama-3.1-8B-Instruct",
    env={
        "HF_TOKEN": "hf_...",
        "VLLM_LORA_MODULES": "adapter1=org/adapter1,adapter2=org/adapter2",
        "VLLM_MAX_LORAS": "4",
    },
)
```

### Merged Models
For fine-tuned models merged with base weights:
```python
# Upload the merged model to Hugging Face:
# huggingface-cli upload your-org/merged-model ./merged_weights

deployment = client.deploy_vllm(
    name="merged-model",
    model="your-org/merged-model",
    env={"HF_TOKEN": "hf_..."},
)
```

## Quantized Models
### AWQ Quantization

Deploy AWQ-quantized models for lower memory usage:
```python
deployment = client.deploy_vllm(
    name="qwen-awq",
    model="Qwen/Qwen2.5-7B-Instruct-AWQ",
    quantization="awq",
    gpu_count=1,
)
```

### GPTQ Quantization
```python
deployment = client.deploy_vllm(
    name="mistral-gptq",
    model="TheBloke/Mistral-7B-Instruct-v0.2-GPTQ",
    quantization="gptq",
    gpu_count=1,
)
```

### Custom Quantized Models
Upload your quantized model to Hugging Face:
```python
# Quantize locally
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

model.quantize(tokenizer, quant_config={"w_bit": 4, "q_group_size": 128})
model.save_quantized("./quantized-model")

# Upload to Hugging Face:
# huggingface-cli upload your-org/llama-awq ./quantized-model
```

Deploy:
```python
deployment = client.deploy_vllm(
    name="custom-awq",
    model="your-org/llama-awq",
    quantization="awq",
)
```

## Custom Docker Images
For complete control, build a custom image with a `Dockerfile`:

```dockerfile
FROM vllm/vllm-openai:latest

# Pre-download model weights
ENV HF_TOKEN=hf_...
RUN python -c "from huggingface_hub import snapshot_download; snapshot_download('your-org/your-model')"

# Custom dependencies
RUN pip install custom-package

# Custom entrypoint
COPY start.sh /start.sh
ENTRYPOINT ["/start.sh"]
```

### Deploy Custom Image
```python
deployment = client.deploy(
    name="custom-vllm",
    image="ghcr.io/your-org/custom-vllm:latest",
    port=8000,
    gpu_count=1,
    memory="16Gi",
    env={
        "MODEL": "your-org/your-model",
    },
)
```

## Model Configuration
### Chat Templates

For models without a default chat template:
```python
deployment = client.deploy_vllm(
    name="custom-chat",
    model="your-org/base-model",
    env={
        "VLLM_CHAT_TEMPLATE": "/templates/chat.jinja",
    },
)
```

Create a custom template in your image or mount it via storage.
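As a reference point, a minimal template might look like the sketch below; `messages` and `add_generation_prompt` are the conventional chat-template inputs, but a real template should reproduce your model's training format exactly:

```jinja
{% for message in messages %}
{{ message.role }}: {{ message.content }}
{% endfor %}
{% if add_generation_prompt %}assistant: {% endif %}
```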
### Special Tokens

Configure special tokens for custom models:
```python
deployment = client.deploy_vllm(
    name="custom-tokens",
    model="your-org/your-model",
    env={
        # Custom stop tokens
        "VLLM_STOP_TOKEN_IDS": "2,32000",
    },
)
```

## Embedding Models
Deploy embedding models for semantic search:
```python
deployment = client.deploy_vllm(
    name="embeddings",
    model="BAAI/bge-large-en-v1.5",
    env={
        "VLLM_TASK": "embed",
    },
)
```

Use embeddings:
```python
from openai import OpenAI

client = OpenAI(base_url=f"{deployment.url}/v1", api_key="not-needed")
response = client.embeddings.create(
    model="BAAI/bge-large-en-v1.5",
    input=["Hello world", "Goodbye world"],
)
print(response.data[0].embedding[:5])  # First 5 dimensions
```

## Vision-Language Models
Deploy multimodal models:
```python
deployment = client.deploy_vllm(
    name="vision-llm",
    model="llava-hf/llava-1.5-7b-hf",
    gpu_count=1,
    memory="16Gi",
)
```

Use with images:
```python
import base64

with open("image.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_data}"}},
            ],
        }
    ],
)
```

## Testing Your Model
Before production, verify your model works:
```python
from basilica import BasilicaClient
from openai import OpenAI

basilica = BasilicaClient()

# Deploy
deployment = basilica.deploy_vllm(
    name="test-model",
    model="your-org/your-model",
    env={"HF_TOKEN": "hf_..."},
)
deployment.wait_until_ready(timeout=600)

# Test basic completion
client = OpenAI(base_url=f"{deployment.url}/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="your-org/your-model",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=50,
)
print(f"Response: {response.choices[0].message.content}")
print(f"Tokens: {response.usage.total_tokens}")

# Test streaming
stream = client.chat.completions.create(
    model="your-org/your-model",
    messages=[{"role": "user", "content": "Count to 5"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

# Cleanup
deployment.delete()
```

## Troubleshooting
### Model Not Loading
Check logs for errors:
```python
print(deployment.logs())
```

Common issues:

- Authentication: verify `HF_TOKEN` is correct
- Memory: the model may need more GPU VRAM
- Architecture: ensure the model architecture is supported
### Wrong Outputs
- Verify chat template matches your model's training format
- Check tokenizer configuration
- Ensure model wasn't corrupted during upload
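To see why the training format matters, the sketch below contrasts two simplified, made-up serializations of the same messages; neither is any real model's exact template, but a mismatch like this is what garbles outputs:

```python
messages = [{"role": "user", "content": "Hello!"}]

def format_chatml(msgs):
    # ChatML-style: explicit role markers around each turn
    return "".join(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in msgs)

def format_inst(msgs):
    # Llama-2-style: user turns wrapped in [INST] markers
    return "".join(f"[INST] {m['content']} [/INST]" for m in msgs if m["role"] == "user")

print(format_chatml(messages))
print(format_inst(messages))
```

A model trained on one of these formats will see out-of-distribution input if served with the other, so always check which template your fine-tuning data used.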
### Slow Inference
- Enable tensor parallelism for large models
- Use quantization to reduce memory bandwidth
- Check if context length is too large
### Out of Memory

- Reduce `max_model_len`
- Use quantization
- Add more GPUs with tensor parallelism
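As a rough illustration of why reducing `max_model_len` helps, the sketch below estimates per-sequence KV-cache size using illustrative Llama-3.1-8B-like dimensions (actual usage also depends on batch size and the serving engine's allocator):

```python
def kv_cache_gb(max_model_len, num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2):
    # KV cache per sequence: 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes
    return 2 * num_layers * num_kv_heads * head_dim * max_model_len * dtype_bytes / 1e9

print(f"{kv_cache_gb(128_000):.1f} GB per sequence at 128k context")
print(f"{kv_cache_gb(8_192):.1f} GB per sequence at 8k context")
```

Under these assumptions, the full 128k context costs roughly 16x the memory of an 8k window, which is why shrinking `max_model_len` is often the quickest OOM fix.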
## Next Steps
- Deploy with vLLM for production inference
- Deploy with SGLang for structured generation
- GPU configuration for advanced GPU settings