Python SDK
GPU Configuration
Access NVIDIA's finest silicon. A100s, H100s, B200s. No waitlists, no quotas.
Quick Start
Request a GPU with a few parameters:
```python
from basilica import BasilicaClient

client = BasilicaClient()

deployment = client.deploy(
    name="gpu-app",
    source="app.py",
    image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    gpu_count=1,
    memory="8Gi",
)
```
GPU Parameters
| Parameter | Type | Description |
|---|---|---|
| gpu_count | int | Number of GPUs (1-8) |
| gpu_models | List[str] | Acceptable GPU models |
| min_cuda_version | str | Minimum CUDA version |
| min_gpu_memory_gb | int | Minimum GPU VRAM in GB |
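Taken together, these parameters describe a GPU request. As a plain-Python illustration of how they combine (this `GPURequest` class is a sketch for clarity, not the SDK's internal type), a request could be modeled as:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class GPURequest:
    """Illustrative model of a GPU request; not the SDK's internal type."""
    gpu_count: int = 1
    gpu_models: List[str] = field(default_factory=list)  # empty list = any model
    min_cuda_version: Optional[str] = None
    min_gpu_memory_gb: Optional[int] = None

    def __post_init__(self):
        # Mirror the documented 1-8 range for gpu_count
        if not 1 <= self.gpu_count <= 8:
            raise ValueError("gpu_count must be between 1 and 8")

req = GPURequest(gpu_count=2, gpu_models=["H100", "A100"], min_gpu_memory_gb=40)
print(req.gpu_count)  # 2
```

The same keyword arguments are passed directly to `client.deploy(...)` as shown in the examples that follow.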
Available GPUs
| Model | VRAM | CUDA Compute |
|---|---|---|
| NVIDIA RTX A4000 | 16GB | 8.6 |
GPU availability varies by region and demand. The scheduler will find a suitable GPU matching your requirements.
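Conceptually, the scheduler filters available hardware against your constraints. The sketch below shows matching logic of that general shape (an illustration only, not Basilica's actual scheduler); note that CUDA versions must be compared numerically, since string comparison would rank "12.2" above "12.10":

```python
from typing import List, Optional

def cuda_tuple(version: str) -> tuple:
    """Parse a CUDA version so "12.10" compares greater than "12.2"."""
    return tuple(int(p) for p in version.split("."))

def matches(gpu: dict,
            gpu_models: Optional[List[str]] = None,
            min_cuda_version: Optional[str] = None,
            min_gpu_memory_gb: Optional[int] = None) -> bool:
    """Sketch of the filtering a scheduler might perform; not Basilica's real logic."""
    if gpu_models and gpu["model"] not in gpu_models:
        return False
    if min_cuda_version and cuda_tuple(gpu["cuda"]) < cuda_tuple(min_cuda_version):
        return False
    if min_gpu_memory_gb and gpu["vram_gb"] < min_gpu_memory_gb:
        return False
    return True

# Hypothetical inventory for illustration
inventory = [{"model": "NVIDIA-RTX-A4000", "vram_gb": 16, "cuda": "12.1"}]
candidates = [g for g in inventory
              if matches(g, min_cuda_version="12.0", min_gpu_memory_gb=16)]
print(len(candidates))  # 1
```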
Basic GPU Deployment
Single GPU
```python
deployment = client.deploy(
    name="gpu-test",
    source="""
import torch

print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Device count: {torch.cuda.device_count()}")
print(f"Device name: {torch.cuda.get_device_name(0)}")

# Keep running
import time
while True:
    time.sleep(60)
""",
    image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    gpu_count=1,
    memory="8Gi",
)
```
Specific GPU Model
```python
deployment = client.deploy(
    name="specific-gpu",
    source="app.py",
    image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    gpu_count=1,
    gpu_models=["NVIDIA-RTX-A4000"],  # Only this model
    memory="8Gi",
)
```
Multiple Acceptable Models
```python
deployment = client.deploy(
    name="flexible-gpu",
    source="app.py",
    image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    gpu_count=1,
    gpu_models=["H100", "A100", "NVIDIA-RTX-A4000"],  # Any of these
    memory="8Gi",
)
```
CUDA Version Requirements
```python
deployment = client.deploy(
    name="cuda12-app",
    source="app.py",
    image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    gpu_count=1,
    min_cuda_version="12.0",  # Requires CUDA 12.0+
    memory="8Gi",
)
```
VRAM Requirements
```python
deployment = client.deploy(
    name="large-model",
    source="app.py",
    image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    gpu_count=1,
    min_gpu_memory_gb=16,  # Requires 16GB+ VRAM
    memory="16Gi",
)
```
Multi-GPU Deployments
Multiple GPUs
```python
deployment = client.deploy(
    name="multi-gpu",
    source="app.py",
    image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    gpu_count=4,
    memory="32Gi",
)
```
Tensor Parallelism
For large models that don't fit on a single GPU:
```python
deployment = client.deploy(
    name="llm-tp4",
    source="""
from vllm import LLM

# Model sharded across 4 GPUs
llm = LLM(
    model="meta-llama/Llama-2-70b",
    tensor_parallel_size=4,
)
# Serve...
""",
    image="vllm/vllm-openai:latest",
    gpu_count=4,
    memory="128Gi",
)
```
Decorator Pattern
Use the decorator for GPU deployments:
```python
import basilica

@basilica.deployment(
    name="pytorch-app",
    image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    port=8000,
    gpu_count=1,
    gpu_models=["NVIDIA-RTX-A4000"],
    memory="8Gi",
)
def serve():
    import torch
    from http.server import HTTPServer, BaseHTTPRequestHandler
    import json

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            info = {
                "cuda_available": torch.cuda.is_available(),
                "device_count": torch.cuda.device_count(),
                "device_name": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
                "cuda_version": torch.version.cuda,
            }
            self.send_response(200)
            self.send_header('Content-Type', 'application/json')
            self.end_headers()
            self.wfile.write(json.dumps(info).encode())

    HTTPServer(('', 8000), Handler).serve_forever()

deployment = serve()
```
Common GPU Images
| Use Case | Image |
|---|---|
| PyTorch | pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime |
| TensorFlow | tensorflow/tensorflow:2.14.0-gpu |
| vLLM | vllm/vllm-openai:latest |
| SGLang | lmsysorg/sglang:latest |
| NVIDIA Base | nvidia/cuda:12.1.0-runtime-ubuntu22.04 |
GPU with Persistent Storage
Cache model weights to avoid re-downloading:
```python
import basilica

model_cache = basilica.Volume.from_name("model-cache", create_if_missing=True)

@basilica.deployment(
    name="cached-model",
    image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    gpu_count=1,
    memory="16Gi",
    volumes={"/root/.cache/huggingface": model_cache},
)
def serve():
    from transformers import AutoModelForCausalLM

    # First run: downloads to cache
    # Subsequent runs: loads from cache
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b",
        device_map="auto",
    )
    # Serve model...
```
PyTorch Example
Full PyTorch inference server:
```python
import basilica

@basilica.deployment(
    name="pytorch-inference",
    image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    port=8000,
    gpu_count=1,
    memory="8Gi",
    pip_packages=["fastapi", "uvicorn"],
)
def serve():
    import torch
    from fastapi import FastAPI
    import uvicorn

    app = FastAPI()
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    @app.get("/")
    def info():
        return {
            "device": str(device),
            "cuda": torch.cuda.is_available(),
            "device_name": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
        }

    @app.post("/predict")
    def predict(data: dict):
        # Your inference logic here
        tensor = torch.tensor(data["input"]).to(device)
        result = tensor * 2  # Example operation
        return {"output": result.cpu().tolist()}

    uvicorn.run(app, host="0.0.0.0", port=8000)

deployment = serve()
print(f"API: {deployment.url}")
```
Resource Recommendations
| Model Size | GPU Count | GPU VRAM | Memory |
|---|---|---|---|
| Small (< 3B params) | 1 | 8GB | 8Gi |
| Medium (3-7B params) | 1 | 16GB | 16Gi |
| Large (7-13B params) | 1-2 | 24GB+ | 32Gi |
| XL (30-70B params) | 2-4 | 40GB+ | 64Gi+ |
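The tiers above can be turned into a small sizing helper. This is a rough heuristic for illustration only (the thresholds simply mirror the table; real requirements depend on precision, batch size, and framework overhead):

```python
def recommend(params_billion: float) -> dict:
    """Map model size to the resource tiers from the table above.
    Illustrative heuristic only; actual needs vary with precision
    (fp16 vs int8), batch size, and framework overhead."""
    if params_billion < 3:
        return {"gpu_count": 1, "min_gpu_memory_gb": 8, "memory": "8Gi"}
    if params_billion <= 7:
        return {"gpu_count": 1, "min_gpu_memory_gb": 16, "memory": "16Gi"}
    if params_billion <= 13:
        return {"gpu_count": 2, "min_gpu_memory_gb": 24, "memory": "32Gi"}
    return {"gpu_count": 4, "min_gpu_memory_gb": 40, "memory": "64Gi"}

print(recommend(7))  # {'gpu_count': 1, 'min_gpu_memory_gb': 16, 'memory': '16Gi'}
```

The returned dictionary maps directly onto the `gpu_count`, `min_gpu_memory_gb`, and `memory` keyword arguments of `client.deploy(...)`.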
GPU scheduling may take longer than CPU-only deployments. Set an appropriate timeout value for large deployments.
Debugging GPU Issues
Check CUDA Availability
```python
import torch

print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
print(f"cuDNN version: {torch.backends.cudnn.version()}")
```
Check GPU Memory
```python
import torch

if torch.cuda.is_available():
    print(f"Total memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    print(f"Allocated: {torch.cuda.memory_allocated(0) / 1e9:.1f} GB")
    print(f"Cached: {torch.cuda.memory_reserved(0) / 1e9:.1f} GB")
```
Common Issues
| Issue | Solution |
|---|---|
| CUDA not available | Ensure using GPU image and gpu_count >= 1 |
| Out of memory | Increase min_gpu_memory_gb or use multiple GPUs |
| Slow startup | Use persistent storage to cache models |
| Driver mismatch | Check CUDA version compatibility with image |
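For the out-of-memory row, a back-of-the-envelope check before deploying can save a failed scheduling round. This estimate covers model weights only (activations and KV cache add more on top) and is a rough heuristic, not a Basilica API:

```python
import math

def min_vram_gb(params_billion: float, bytes_per_param: int = 2) -> int:
    """Estimate VRAM for model weights alone, with ~20% headroom.
    fp16 = 2 bytes/param; ignores activations and KV cache, which add more."""
    weights_gb = params_billion * bytes_per_param  # 1e9 params * N bytes/param = N GB
    return math.ceil(weights_gb * 1.2)

# A 7B model in fp16: ~14 GB of weights, so request at least 17 GB of VRAM
print(min_vram_gb(7))  # 17
```

The result can be passed as `min_gpu_memory_gb`; if it exceeds a single GPU's VRAM, use multiple GPUs with tensor parallelism as shown earlier.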