Custom Models

Your fine-tuned creations deserve a home. Deploy from Hugging Face, local files, or private registries.

Hugging Face Models

Public Models

Deploy any public Hugging Face model:

from basilica import BasilicaClient

client = BasilicaClient()

# vLLM deployment
deployment = client.deploy_vllm(
    name="my-model",
    model="your-username/your-model",
    gpu_count=1,
)

# SGLang deployment
deployment = client.deploy_sglang(
    name="my-model",
    model="your-username/your-model",
    gpu_count=1,
)

Private Models

For private Hugging Face models, provide your access token:

deployment = client.deploy_vllm(
    name="private-model",
    model="your-org/private-model",
    env={"HF_TOKEN": "hf_..."},
)

Keep your Hugging Face token secure. Consider using environment variables or secrets management.
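One common pattern for the advice above is to read the token from the environment at deploy time rather than embedding it in source. A minimal sketch (assumes you exported `HF_TOKEN` in your shell first, e.g. `export HF_TOKEN=hf_...`):

```python
import os

# Read the token from the environment instead of hard-coding it.
hf_token = os.environ.get("HF_TOKEN", "")

# Pass this as env= to deploy_vllm exactly as in the example above.
env = {"HF_TOKEN": hf_token} if hf_token else {}
```

This keeps the token out of version control; for production, a secrets manager that injects the variable at runtime is preferable.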

Gated Models

Some models require accepting a license agreement on Hugging Face:

  1. Visit the model page (e.g., https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
  2. Accept the license agreement
  3. Deploy with your token:

deployment = client.deploy_vllm(
    name="llama-api",
    model="meta-llama/Llama-3.1-8B-Instruct",
    env={"HF_TOKEN": "hf_..."},
)

Fine-Tuned Models

LoRA Adapters

Deploy a base model with a LoRA adapter:

# Upload your adapter to Hugging Face first
# huggingface-cli upload your-org/my-adapter ./adapter

deployment = client.deploy_vllm(
    name="lora-model",
    model="meta-llama/Llama-3.1-8B-Instruct",
    env={
        "HF_TOKEN": "hf_...",
        "VLLM_LORA_MODULES": "my-adapter=your-org/my-adapter",
    },
)

Use the adapter in requests:

from openai import OpenAI

client = OpenAI(base_url=f"{deployment.url}/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="my-adapter",  # Use adapter name
    messages=[{"role": "user", "content": "Hello!"}],
)

Multiple LoRA Adapters

Serve multiple adapters from one deployment:

deployment = client.deploy_vllm(
    name="multi-lora",
    model="meta-llama/Llama-3.1-8B-Instruct",
    env={
        "HF_TOKEN": "hf_...",
        "VLLM_LORA_MODULES": "adapter1=org/adapter1,adapter2=org/adapter2",
        "VLLM_MAX_LORAS": "4",
    },
)
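The `VLLM_LORA_MODULES` value is a comma-separated list of `name=repo` pairs. A hypothetical helper (not part of the SDK) that builds it from a mapping:

```python
# Build the VLLM_LORA_MODULES value from a mapping of
# adapter name -> Hugging Face repo.
def lora_modules_env(adapters):
    return ",".join(f"{name}={repo}" for name, repo in adapters.items())

value = lora_modules_env({
    "adapter1": "org/adapter1",
    "adapter2": "org/adapter2",
})
```

Each adapter is then addressed by its name in the `model` field of a request, as in the single-adapter example above.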

Merged Models

For fine-tuned models merged with base weights:

# Upload merged model to Hugging Face
# huggingface-cli upload your-org/merged-model ./merged_weights

deployment = client.deploy_vllm(
    name="merged-model",
    model="your-org/merged-model",
    env={"HF_TOKEN": "hf_..."},
)

Quantized Models

AWQ Quantization

Deploy AWQ-quantized models for lower memory usage:

deployment = client.deploy_vllm(
    name="qwen-awq",
    model="Qwen/Qwen2.5-7B-Instruct-AWQ",
    quantization="awq",
    gpu_count=1,
)

GPTQ Quantization

deployment = client.deploy_vllm(
    name="mistral-gptq",
    model="TheBloke/Mistral-7B-Instruct-v0.2-GPTQ",
    quantization="gptq",
    gpu_count=1,
)

Custom Quantized Models

Upload your quantized model to Hugging Face:

# Quantize locally
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

model.quantize(tokenizer, quant_config={"w_bit": 4, "q_group_size": 128})
model.save_quantized("./quantized-model")

# Upload to Hugging Face
# huggingface-cli upload your-org/llama-awq ./quantized-model

Deploy:

deployment = client.deploy_vllm(
    name="custom-awq",
    model="your-org/llama-awq",
    quantization="awq",
)

Custom Docker Images

For complete control, build a custom image:

Dockerfile

FROM vllm/vllm-openai:latest

# Pre-download model weights. Use a build argument rather than ENV so
# the token is not baked into the final image as an environment
# variable (for stricter handling, use BuildKit secret mounts):
#   docker build --build-arg HF_TOKEN=hf_... .
ARG HF_TOKEN
RUN python -c "from huggingface_hub import snapshot_download; snapshot_download('your-org/your-model')"

# Custom dependencies
RUN pip install custom-package

# Custom entrypoint
COPY start.sh /start.sh
ENTRYPOINT ["/start.sh"]

Deploy Custom Image

deployment = client.deploy(
    name="custom-vllm",
    image="ghcr.io/your-org/custom-vllm:latest",
    port=8000,
    gpu_count=1,
    memory="16Gi",
    env={
        "MODEL": "your-org/your-model",
    },
)

Model Configuration

Chat Templates

For models without a default chat template:

deployment = client.deploy_vllm(
    name="custom-chat",
    model="your-org/base-model",
    env={
        "VLLM_CHAT_TEMPLATE": "/templates/chat.jinja",
    },
)

Create a custom template in your image or mount it via storage.
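A chat template is a Jinja file that renders the messages list into the prompt format the model was trained on. A minimal sketch, written here as a Python string so you can write it out to the path you pass in `VLLM_CHAT_TEMPLATE` (the role-prefix format is illustrative, not any particular model's):

```python
# A minimal Jinja chat template (sketch only; a real template must
# match the model's training format). Save to e.g.
# /templates/chat.jinja inside your image.
CHAT_TEMPLATE = (
    "{% for message in messages %}"
    "{{ message['role'] }}: {{ message['content'] }}\n"
    "{% endfor %}"
    "assistant:"
)
```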

Special Tokens

Configure special tokens for custom models:

deployment = client.deploy_vllm(
    name="custom-tokens",
    model="your-org/your-model",
    env={
        # Custom stop tokens
        "VLLM_STOP_TOKEN_IDS": "2,32000",
    },
)

Embedding Models

Deploy embedding models for semantic search:

deployment = client.deploy_vllm(
    name="embeddings",
    model="BAAI/bge-large-en-v1.5",
    env={
        "VLLM_TASK": "embed",
    },
)

Use embeddings:

from openai import OpenAI

client = OpenAI(base_url=f"{deployment.url}/v1", api_key="not-needed")

response = client.embeddings.create(
    model="BAAI/bge-large-en-v1.5",
    input=["Hello world", "Goodbye world"],
)

print(response.data[0].embedding[:5])  # First 5 dimensions
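To turn those vectors into a search score, compare them with cosine similarity. A plain-Python sketch (in practice you would use numpy or a vector database):

```python
import math

# Cosine similarity: dot product of the vectors divided by the
# product of their magnitudes. 1.0 = identical direction, 0 = unrelated.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# With the response above:
# score = cosine_similarity(response.data[0].embedding,
#                           response.data[1].embedding)
```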

Vision-Language Models

Deploy multimodal models:

deployment = client.deploy_vllm(
    name="vision-llm",
    model="llava-hf/llava-1.5-7b-hf",
    gpu_count=1,
    memory="16Gi",
)

Use with images:

import base64

with open("image.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_data}"}},
            ],
        }
    ],
)
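The base64 step above can be wrapped in a small helper (hypothetical, not part of the SDK) that produces the full data URL:

```python
import base64

# Encode raw image bytes as a data URL suitable for the
# image_url content part above.
def image_data_url(raw: bytes, mime: str = "image/jpeg") -> str:
    return f"data:{mime};base64,{base64.b64encode(raw).decode()}"

# Usage:
# with open("image.jpg", "rb") as f:
#     url = image_data_url(f.read())
```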

Testing Your Model

Before production, verify your model works:

from basilica import BasilicaClient
from openai import OpenAI

basilica = BasilicaClient()

# Deploy
deployment = basilica.deploy_vllm(
    name="test-model",
    model="your-org/your-model",
    env={"HF_TOKEN": "hf_..."},
)

deployment.wait_until_ready(timeout=600)

# Test basic completion
client = OpenAI(base_url=f"{deployment.url}/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="your-org/your-model",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=50,
)

print(f"Response: {response.choices[0].message.content}")
print(f"Tokens: {response.usage.total_tokens}")

# Test streaming
stream = client.chat.completions.create(
    model="your-org/your-model",
    messages=[{"role": "user", "content": "Count to 5"}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

# Cleanup
deployment.delete()

Troubleshooting

Model Not Loading

Check logs for errors:

print(deployment.logs())

Common issues:

  • Authentication: Verify HF_TOKEN is correct
  • Memory: Model may need more GPU VRAM
  • Architecture: Ensure model architecture is supported
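For the memory bullet, a rough back-of-envelope is weight memory ≈ parameter count × bytes per parameter (fp16/bf16 = 2 bytes, 4-bit quantization ≈ 0.5). A sketch; real usage adds KV cache and activations on top:

```python
# Estimate weight memory in GB from parameter count in billions.
def approx_weight_gb(params_billions, bytes_per_param=2):
    return params_billions * bytes_per_param

# An 8B model in fp16 needs ~16 GB for weights alone, so it will not
# fit on a 16 GB GPU unquantized once KV cache is accounted for.
```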

Wrong Outputs

  • Verify chat template matches your model's training format
  • Check tokenizer configuration
  • Ensure model wasn't corrupted during upload

Slow Inference

  • Enable tensor parallelism for large models
  • Use quantization to reduce memory bandwidth
  • Check if context length is too large

Out of Memory

  • Reduce max_model_len
  • Use quantization
  • Add more GPUs with tensor parallelism
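To see why reducing max_model_len helps: the per-sequence KV cache grows linearly with context length, following the standard estimate 2 (key + value) × layers × KV heads × head dim × context length × bytes per element. A sketch, using Llama-3.1-8B's published architecture values as the example:

```python
# Per-sequence KV cache size in GB (fp16 = 2 bytes per element;
# the leading 2 covers the key and value tensors).
def kv_cache_gb(layers, kv_heads, head_dim, context_len, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem / 1e9

# Llama-3.1-8B (32 layers, 8 KV heads via GQA, head_dim 128) at an
# 8192-token context needs about 1 GB of KV cache per sequence.
```

Halving the context length halves this figure, which is often enough to get a deployment under the VRAM limit.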
