Custom Models

Your fine-tuned creations deserve a home. Deploy from Hugging Face, local files, or private registries.

Hugging Face Models

Public Models

Deploy any public Hugging Face model:

from basilica import BasilicaClient

client = BasilicaClient()

# vLLM deployment
deployment = client.deploy_vllm(
    name="my-model",
    model="your-username/your-model",
    gpu_count=1,
)

# SGLang deployment
deployment = client.deploy_sglang(
    name="my-model",
    model="your-username/your-model",
    gpu_count=1,
)

Private Models

For private Hugging Face models, provide your access token:

deployment = client.deploy_vllm(
    name="private-model",
    model="your-org/private-model",
    env={"HF_TOKEN": "hf_..."},
)

Keep your Hugging Face token secure. Consider using environment variables or secrets management.
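One common pattern for the advice above is to read the token from the environment at deploy time rather than embedding it in source. A minimal sketch (assumes you exported `HF_TOKEN` in your shell first, e.g. `export HF_TOKEN=hf_...`):

```python
import os

# Read the token from the environment instead of hard-coding it.
hf_token = os.environ.get("HF_TOKEN", "")

# Pass this as env= to deploy_vllm exactly as in the example above.
env = {"HF_TOKEN": hf_token} if hf_token else {}
```

This keeps the token out of version control; for production, a secrets manager that injects the variable at runtime is preferable.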

Gated Models

Some models require accepting a license agreement on Hugging Face:

  1. Visit the model page (e.g., https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)
  2. Accept the license agreement
  3. Deploy with your token:

deployment = client.deploy_vllm(
    name="llama-api",
    model="meta-llama/Llama-3.1-8B-Instruct",
    env={"HF_TOKEN": "hf_..."},
)

Fine-Tuned Models

LoRA Adapters

Deploy a base model with a LoRA adapter:

# Upload your adapter to Hugging Face first
# huggingface-cli upload your-org/my-adapter ./adapter

deployment = client.deploy_vllm(
    name="lora-model",
    model="meta-llama/Llama-3.1-8B-Instruct",
    env={
        "HF_TOKEN": "hf_...",
        "VLLM_LORA_MODULES": "my-adapter=your-org/my-adapter",
    },
)

Use the adapter in requests:

from openai import OpenAI

client = OpenAI(base_url=f"{deployment.url}/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="my-adapter",  # Use adapter name
    messages=[{"role": "user", "content": "Hello!"}],
)

Multiple LoRA Adapters

Serve multiple adapters from one deployment:

deployment = client.deploy_vllm(
    name="multi-lora",
    model="meta-llama/Llama-3.1-8B-Instruct",
    env={
        "HF_TOKEN": "hf_...",
        "VLLM_LORA_MODULES": "adapter1=org/adapter1,adapter2=org/adapter2",
        "VLLM_MAX_LORAS": "4",
    },
)
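The `VLLM_LORA_MODULES` value is a comma-separated list of `name=repo` pairs. A hypothetical helper (not part of the SDK) that builds it from a mapping:

```python
# Build the VLLM_LORA_MODULES value from a mapping of
# adapter name -> Hugging Face repo.
def lora_modules_env(adapters):
    return ",".join(f"{name}={repo}" for name, repo in adapters.items())

value = lora_modules_env({
    "adapter1": "org/adapter1",
    "adapter2": "org/adapter2",
})
```

Each adapter is then addressed by its name in the `model` field of a request, as in the single-adapter example above.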

Merged Models

For fine-tuned models merged with base weights:

# Upload merged model to Hugging Face
# huggingface-cli upload your-org/merged-model ./merged_weights

deployment = client.deploy_vllm(
    name="merged-model",
    model="your-org/merged-model",
    env={"HF_TOKEN": "hf_..."},
)

Quantized Models

AWQ Quantization

Deploy AWQ-quantized models for lower memory usage:

deployment = client.deploy_vllm(
    name="qwen-awq",
    model="Qwen/Qwen2.5-7B-Instruct-AWQ",
    quantization="awq",
    gpu_count=1,
)

GPTQ Quantization

deployment = client.deploy_vllm(
    name="mistral-gptq",
    model="TheBloke/Mistral-7B-Instruct-v0.2-GPTQ",
    quantization="gptq",
    gpu_count=1,
)

Custom Quantized Models

Upload your quantized model to Hugging Face:

# Quantize locally
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

model.quantize(tokenizer, quant_config={"w_bit": 4, "q_group_size": 128})
model.save_quantized("./quantized-model")

# Upload to Hugging Face
# huggingface-cli upload your-org/llama-awq ./quantized-model

Deploy:

deployment = client.deploy_vllm(
    name="custom-awq",
    model="your-org/llama-awq",
    quantization="awq",
)

Custom Docker Images

For complete control, build a custom image:

Dockerfile

FROM vllm/vllm-openai:latest

# Pre-download model weights. Use a build argument rather than ENV so
# the token is not baked into the final image as an environment
# variable (for stricter handling, use BuildKit secret mounts):
#   docker build --build-arg HF_TOKEN=hf_... .
ARG HF_TOKEN
RUN python -c "from huggingface_hub import snapshot_download; snapshot_download('your-org/your-model')"

# Custom dependencies
RUN pip install custom-package

# Custom entrypoint
COPY start.sh /start.sh
ENTRYPOINT ["/start.sh"]

Deploy Custom Image

deployment = client.deploy(
    name="custom-vllm",
    image="ghcr.io/your-org/custom-vllm:latest",
    port=8000,
    gpu_count=1,
    memory="16Gi",
    env={
        "MODEL": "your-org/your-model",
    },
)

Model Configuration

Chat Templates

For models without a default chat template:

deployment = client.deploy_vllm(
    name="custom-chat",
    model="your-org/base-model",
    env={
        "VLLM_CHAT_TEMPLATE": "/templates/chat.jinja",
    },
)

Create a custom template in your image or mount it via storage.
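A chat template is a Jinja file that renders the messages list into the prompt format the model was trained on. A minimal sketch, written here as a Python string so you can write it out to the path you pass in `VLLM_CHAT_TEMPLATE` (the role-prefix format is illustrative, not any particular model's):

```python
# A minimal Jinja chat template (sketch only; a real template must
# match the model's training format). Save to e.g.
# /templates/chat.jinja inside your image.
CHAT_TEMPLATE = (
    "{% for message in messages %}"
    "{{ message['role'] }}: {{ message['content'] }}\n"
    "{% endfor %}"
    "assistant:"
)
```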

Special Tokens

Configure special tokens for custom models:

deployment = client.deploy_vllm(
    name="custom-tokens",
    model="your-org/your-model",
    env={
        # Custom stop tokens
        "VLLM_STOP_TOKEN_IDS": "2,32000",
    },
)

Embedding Models

Deploy embedding models for semantic search:

deployment = client.deploy_vllm(
    name="embeddings",
    model="BAAI/bge-large-en-v1.5",
    env={
        "VLLM_TASK": "embed",
    },
)

Use embeddings:

from openai import OpenAI

client = OpenAI(base_url=f"{deployment.url}/v1", api_key="not-needed")

response = client.embeddings.create(
    model="BAAI/bge-large-en-v1.5",
    input=["Hello world", "Goodbye world"],
)

print(response.data[0].embedding[:5])  # First 5 dimensions
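To turn those vectors into a search score, compare them with cosine similarity. A plain-Python sketch (in practice you would use numpy or a vector database):

```python
import math

# Cosine similarity: dot product of the vectors divided by the
# product of their magnitudes. 1.0 = identical direction, 0 = unrelated.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# With the response above:
# score = cosine_similarity(response.data[0].embedding,
#                           response.data[1].embedding)
```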

Vision-Language Models

Deploy multimodal models:

deployment = client.deploy_vllm(
    name="vision-llm",
    model="llava-hf/llava-1.5-7b-hf",
    gpu_count=1,
    memory="16Gi",
)

Use with images:

import base64

with open("image.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_data}"}},
            ],
        }
    ],
)
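The base64 step above can be wrapped in a small helper (hypothetical, not part of the SDK) that produces the full data URL:

```python
import base64

# Encode raw image bytes as a data URL suitable for the
# image_url content part above.
def image_data_url(raw: bytes, mime: str = "image/jpeg") -> str:
    return f"data:{mime};base64,{base64.b64encode(raw).decode()}"

# Usage:
# with open("image.jpg", "rb") as f:
#     url = image_data_url(f.read())
```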

Testing Your Model

Before production, verify your model works:

from basilica import BasilicaClient
from openai import OpenAI

basilica = BasilicaClient()

# Deploy
deployment = basilica.deploy_vllm(
    name="test-model",
    model="your-org/your-model",
    env={"HF_TOKEN": "hf_..."},
)

deployment.wait_until_ready(timeout=600)

# Test basic completion
client = OpenAI(base_url=f"{deployment.url}/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="your-org/your-model",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=50,
)

print(f"Response: {response.choices[0].message.content}")
print(f"Tokens: {response.usage.total_tokens}")

# Test streaming
stream = client.chat.completions.create(
    model="your-org/your-model",
    messages=[{"role": "user", "content": "Count to 5"}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

# Cleanup
deployment.delete()

Troubleshooting

Model Not Loading

Check logs for errors:

print(deployment.logs())

Common issues:

  • Authentication: Verify HF_TOKEN is correct
  • Memory: Model may need more GPU VRAM
  • Architecture: Ensure model architecture is supported
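For the memory bullet, a rough back-of-envelope is weight memory ≈ parameter count × bytes per parameter (fp16/bf16 = 2 bytes, 4-bit quantization ≈ 0.5). A sketch; real usage adds KV cache and activations on top:

```python
# Estimate weight memory in GB from parameter count in billions.
def approx_weight_gb(params_billions, bytes_per_param=2):
    return params_billions * bytes_per_param

# An 8B model in fp16 needs ~16 GB for weights alone, so it will not
# fit on a 16 GB GPU unquantized once KV cache is accounted for.
```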

Wrong Outputs

  • Verify chat template matches your model's training format
  • Check tokenizer configuration
  • Ensure model wasn't corrupted during upload

Slow Inference

  • Enable tensor parallelism for large models
  • Use quantization to reduce memory bandwidth
  • Check if context length is too large

Out of Memory

  • Reduce max_model_len
  • Use quantization
  • Add more GPUs with tensor parallelism
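To see why reducing max_model_len helps: the per-sequence KV cache grows linearly with context length, following the standard estimate 2 (key + value) × layers × KV heads × head dim × context length × bytes per element. A sketch, using Llama-3.1-8B's published architecture values as the example:

```python
# Per-sequence KV cache size in GB (fp16 = 2 bytes per element;
# the leading 2 covers the key and value tensors).
def kv_cache_gb(layers, kv_heads, head_dim, context_len, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem / 1e9

# Llama-3.1-8B (32 layers, 8 KV heads via GQA, head_dim 128) at an
# 8192-token context needs about 1 GB of KV cache per sequence.
```

Halving the context length halves this figure, which is often enough to get a deployment under the VRAM limit.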
