GPU Configuration

Access NVIDIA's finest silicon. A100s, H100s, B200s. No waitlists, no quotas.

Quick Start

Request a GPU with a few parameters:

from basilica import BasilicaClient

client = BasilicaClient()

deployment = client.deploy(
    name="gpu-app",
    source="app.py",
    image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    gpu_count=1,
    memory="8Gi",
)
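The memory value is a Kubernetes-style quantity string ("8Gi" = 8 × 2^30 bytes). If you want to sanity-check a value locally before deploying, a small helper like this works — parse_memory is a hypothetical utility for illustration, not part of the SDK:

```python
def parse_memory(quantity: str) -> int:
    """Convert a Kubernetes-style quantity string like '8Gi' to bytes."""
    units = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
    for suffix, factor in units.items():
        if quantity.endswith(suffix):
            return int(quantity[:-len(suffix)]) * factor
    return int(quantity)  # no suffix: plain byte count

print(parse_memory("8Gi"))  # 8589934592
```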

GPU Parameters

Parameter           Type        Description
gpu_count           int         Number of GPUs (1-8)
gpu_models          List[str]   Acceptable GPU models
min_cuda_version    str         Minimum CUDA version
min_gpu_memory_gb   int         Minimum GPU VRAM in GB

Available GPUs

Model               VRAM    CUDA Compute
NVIDIA RTX A4000    16GB    8.6

GPU availability varies by region and demand. The scheduler will find a suitable GPU matching your requirements.
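Conceptually, the scheduler filters the available inventory against your request. A simplified, illustrative sketch of that matching — the real scheduler also weighs region and demand, and the record field names here are assumptions:

```python
def matches(gpu, gpu_models=None, min_cuda_version=None, min_gpu_memory_gb=None):
    """Return True if a GPU record satisfies the requested constraints."""
    if gpu_models is not None and gpu["model"] not in gpu_models:
        return False
    if min_cuda_version is not None:
        # Compare versions numerically, not lexically ("12.10" > "12.2")
        have = tuple(int(x) for x in gpu["cuda_version"].split("."))
        want = tuple(int(x) for x in min_cuda_version.split("."))
        if have < want:
            return False
    if min_gpu_memory_gb is not None and gpu["vram_gb"] < min_gpu_memory_gb:
        return False
    return True

inventory = [{"model": "NVIDIA-RTX-A4000", "vram_gb": 16, "cuda_version": "12.1"}]
print([g["model"] for g in inventory if matches(g, min_gpu_memory_gb=16)])
```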

Basic GPU Deployment

Single GPU

deployment = client.deploy(
    name="gpu-test",
    source="""
import torch

print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Device count: {torch.cuda.device_count()}")
print(f"Device name: {torch.cuda.get_device_name(0)}")

# Keep running
import time
while True:
    time.sleep(60)
""",
    image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    gpu_count=1,
    memory="8Gi",
)

Specific GPU Model

deployment = client.deploy(
    name="specific-gpu",
    source="app.py",
    image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    gpu_count=1,
    gpu_models=["NVIDIA-RTX-A4000"],  # Only this model
    memory="8Gi",
)

Multiple Acceptable Models

deployment = client.deploy(
    name="flexible-gpu",
    source="app.py",
    image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    gpu_count=1,
    gpu_models=["H100", "A100", "NVIDIA-RTX-A4000"],  # Any of these
    memory="8Gi",
)

CUDA Version Requirements

deployment = client.deploy(
    name="cuda12-app",
    source="app.py",
    image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    gpu_count=1,
    min_cuda_version="12.0",  # Requires CUDA 12.0+
    memory="8Gi",
)

VRAM Requirements

deployment = client.deploy(
    name="large-model",
    source="app.py",
    image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    gpu_count=1,
    min_gpu_memory_gb=16,  # Requires 16GB+ VRAM
    memory="16Gi",
)
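A useful rule of thumb when choosing min_gpu_memory_gb: model weights alone need roughly parameters × bytes-per-parameter (2 bytes each for fp16/bf16), plus headroom for activations and KV cache. A rough estimator — the helper is illustrative, not part of the SDK:

```python
def estimate_weight_vram_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate VRAM for model weights alone (fp16/bf16 = 2 bytes per param)."""
    return params_billion * bytes_per_param

# A 7B model in fp16 needs ~14 GB for weights, so 16 GB VRAM is a sensible floor.
print(estimate_weight_vram_gb(7))  # 14.0
```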

Multi-GPU Deployments

Multiple GPUs

deployment = client.deploy(
    name="multi-gpu",
    source="app.py",
    image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    gpu_count=4,
    memory="32Gi",
)

Tensor Parallelism

For large models that don't fit on a single GPU:

deployment = client.deploy(
    name="llm-tp4",
    source="""
from vllm import LLM

# Model sharded across 4 GPUs
llm = LLM(
    model="meta-llama/Llama-2-70b",
    tensor_parallel_size=4,
)

# Serve...
""",
    image="vllm/vllm-openai:latest",
    gpu_count=4,
    memory="128Gi",
)
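The arithmetic behind gpu_count=4 here: a 70B-parameter model in fp16 holds roughly 140 GB of weights, so even sharding across four GPUs puts about 35 GB on each — too much for one card, but feasible per-shard on 40 GB-class hardware. A back-of-envelope check (illustrative helper, ignores activation and KV-cache overhead):

```python
def weight_vram_per_gpu_gb(params_billion: float, tensor_parallel_size: int,
                           bytes_per_param: int = 2) -> float:
    """Weight memory per GPU when a model is sharded evenly via tensor parallelism."""
    return params_billion * bytes_per_param / tensor_parallel_size

print(weight_vram_per_gpu_gb(70, 4))  # 35.0
```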

Decorator Pattern

Use the @basilica.deployment decorator for GPU deployments:

import basilica

@basilica.deployment(
    name="pytorch-app",
    image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    port=8000,
    gpu_count=1,
    gpu_models=["NVIDIA-RTX-A4000"],
    memory="8Gi",
)
def serve():
    import torch
    from http.server import HTTPServer, BaseHTTPRequestHandler
    import json

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            info = {
                "cuda_available": torch.cuda.is_available(),
                "device_count": torch.cuda.device_count(),
                "device_name": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
                "cuda_version": torch.version.cuda,
            }

            self.send_response(200)
            self.send_header('Content-Type', 'application/json')
            self.end_headers()
            self.wfile.write(json.dumps(info).encode())

    HTTPServer(('', 8000), Handler).serve_forever()

deployment = serve()

Common GPU Images

Use Case       Image
PyTorch        pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
TensorFlow     tensorflow/tensorflow:2.14.0-gpu
vLLM           vllm/vllm-openai:latest
SGLang         lmsysorg/sglang:latest
NVIDIA Base    nvidia/cuda:12.1-runtime-ubuntu22.04

GPU with Persistent Storage

Cache model weights to avoid re-downloading:

import basilica

model_cache = basilica.Volume.from_name("model-cache", create_if_missing=True)

@basilica.deployment(
    name="cached-model",
    image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    gpu_count=1,
    memory="16Gi",
    volumes={"/root/.cache/huggingface": model_cache}
)
def serve():
    from transformers import AutoModelForCausalLM

    # First run: downloads to cache
    # Subsequent runs: loads from cache
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b",
        device_map="auto",
    )

    # Serve model...

PyTorch Example

Full PyTorch inference server:

import basilica

@basilica.deployment(
    name="pytorch-inference",
    image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    port=8000,
    gpu_count=1,
    memory="8Gi",
    pip_packages=["fastapi", "uvicorn"],
)
def serve():
    import torch
    from fastapi import FastAPI
    import uvicorn

    app = FastAPI()
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    @app.get("/")
    def info():
        return {
            "device": str(device),
            "cuda": torch.cuda.is_available(),
            "device_name": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
        }

    @app.post("/predict")
    def predict(data: dict):
        # Your inference logic here
        tensor = torch.tensor(data["input"]).to(device)
        result = tensor * 2  # Example operation
        return {"output": result.cpu().tolist()}

    uvicorn.run(app, host="0.0.0.0", port=8000)

deployment = serve()
print(f"API: {deployment.url}")

Resource Recommendations

Model Size              GPU Count   GPU VRAM   Memory
Small (< 3B params)     1           8GB        8Gi
Medium (3-7B params)    1           16GB       16Gi
Large (7-13B params)    1-2         24GB+      32Gi
XL (30-70B params)      2-4         40GB+      64Gi+
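These recommendations can be encoded as a simple lookup for picking defaults — a hypothetical helper with thresholds taken from the table, using the upper end of each GPU-count range:

```python
def recommend_resources(params_billion: float) -> dict:
    """Map a model size to the resource tier recommended in the table."""
    tiers = [
        (3,  {"gpu_count": 1, "gpu_vram_gb": 8,  "memory": "8Gi"}),
        (7,  {"gpu_count": 1, "gpu_vram_gb": 16, "memory": "16Gi"}),
        (13, {"gpu_count": 2, "gpu_vram_gb": 24, "memory": "32Gi"}),
        (70, {"gpu_count": 4, "gpu_vram_gb": 40, "memory": "64Gi"}),
    ]
    for upper_bound, tier in tiers:
        if params_billion <= upper_bound:
            return tier
    raise ValueError("larger than the XL tier; size the deployment manually")

print(recommend_resources(7)["memory"])  # 16Gi
```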

GPU scheduling may take longer than CPU-only deployments. Set an appropriate timeout value for large deployments.

Debugging GPU Issues

Check CUDA Availability

import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
print(f"cuDNN version: {torch.backends.cudnn.version()}")

Check GPU Memory

import torch
if torch.cuda.is_available():
    print(f"Total memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    print(f"Allocated: {torch.cuda.memory_allocated(0) / 1e9:.1f} GB")
    print(f"Cached: {torch.cuda.memory_reserved(0) / 1e9:.1f} GB")

Common Issues

Issue                 Solution
CUDA not available    Use a GPU image and set gpu_count >= 1
Out of memory         Increase min_gpu_memory_gb or use multiple GPUs
Slow startup          Use persistent storage to cache models
Driver mismatch       Check CUDA version compatibility with the image
