Python SDK
GPU Configuration
Access NVIDIA's finest silicon. A100s, H100s, B200s. No waitlists, no quotas.
Quick Start
Request a GPU with a few parameters:
```python
from basilica import BasilicaClient

client = BasilicaClient()

deployment = client.deploy(
    name="gpu-app",
    source="app.py",
    image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    gpu_count=1,
    memory="8Gi",
)
```
GPU Parameters
| Parameter | Type | Description |
|---|---|---|
| gpu_count | int | Number of GPUs (1-8) |
| gpu_models | List[str] | Acceptable GPU models |
| min_cuda_version | str | Minimum CUDA version |
| min_gpu_memory_gb | int | Minimum GPU VRAM in GB |
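Taken together, these parameters describe a GPU request. As a plain-Python illustration of how they combine (this `GPURequest` class is a sketch for clarity, not the SDK's internal type), a request could be modeled as:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class GPURequest:
    """Illustrative model of a GPU request; not the SDK's internal type."""
    gpu_count: int = 1
    gpu_models: List[str] = field(default_factory=list)  # empty list = any model
    min_cuda_version: Optional[str] = None
    min_gpu_memory_gb: Optional[int] = None

    def __post_init__(self):
        # Mirror the documented 1-8 range for gpu_count
        if not 1 <= self.gpu_count <= 8:
            raise ValueError("gpu_count must be between 1 and 8")

req = GPURequest(gpu_count=2, gpu_models=["H100", "A100"], min_gpu_memory_gb=40)
print(req.gpu_count)  # 2
```

The same keyword arguments are passed directly to `client.deploy(...)` as shown in the examples that follow.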
Available GPUs
| Model | VRAM | CUDA Compute |
|---|---|---|
| NVIDIA RTX A4000 | 16GB | 8.6 |
GPU availability varies by region and demand. The scheduler will find a suitable GPU matching your requirements.
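Conceptually, the scheduler filters available hardware against your constraints. The sketch below shows matching logic of that general shape (an illustration only, not Basilica's actual scheduler); note that CUDA versions must be compared numerically, since string comparison would rank "12.2" above "12.10":

```python
from typing import List, Optional

def cuda_tuple(version: str) -> tuple:
    """Parse a CUDA version so "12.10" compares greater than "12.2"."""
    return tuple(int(p) for p in version.split("."))

def matches(gpu: dict,
            gpu_models: Optional[List[str]] = None,
            min_cuda_version: Optional[str] = None,
            min_gpu_memory_gb: Optional[int] = None) -> bool:
    """Sketch of the filtering a scheduler might perform; not Basilica's real logic."""
    if gpu_models and gpu["model"] not in gpu_models:
        return False
    if min_cuda_version and cuda_tuple(gpu["cuda"]) < cuda_tuple(min_cuda_version):
        return False
    if min_gpu_memory_gb and gpu["vram_gb"] < min_gpu_memory_gb:
        return False
    return True

# Hypothetical inventory for illustration
inventory = [{"model": "NVIDIA-RTX-A4000", "vram_gb": 16, "cuda": "12.1"}]
candidates = [g for g in inventory
              if matches(g, min_cuda_version="12.0", min_gpu_memory_gb=16)]
print(len(candidates))  # 1
```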
Basic GPU Deployment
Single GPU
```python
deployment = client.deploy(
    name="gpu-test",
    source="""
import torch

print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Device count: {torch.cuda.device_count()}")
print(f"Device name: {torch.cuda.get_device_name(0)}")

# Keep running
import time
while True:
    time.sleep(60)
""",
    image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    gpu_count=1,
    memory="8Gi",
)
```
Specific GPU Model
```python
deployment = client.deploy(
    name="specific-gpu",
    source="app.py",
    image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    gpu_count=1,
    gpu_models=["NVIDIA-RTX-A4000"],  # Only this model
    memory="8Gi",
)
```
Multiple Acceptable Models
```python
deployment = client.deploy(
    name="flexible-gpu",
    source="app.py",
    image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    gpu_count=1,
    gpu_models=["H100", "A100", "NVIDIA-RTX-A4000"],  # Any of these
    memory="8Gi",
)
```
CUDA Version Requirements
```python
deployment = client.deploy(
    name="cuda12-app",
    source="app.py",
    image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    gpu_count=1,
    min_cuda_version="12.0",  # Requires CUDA 12.0+
    memory="8Gi",
)
```
VRAM Requirements
```python
deployment = client.deploy(
    name="large-model",
    source="app.py",
    image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    gpu_count=1,
    min_gpu_memory_gb=16,  # Requires 16GB+ VRAM
    memory="16Gi",
)
```
Multi-GPU Deployments
Multiple GPUs
```python
deployment = client.deploy(
    name="multi-gpu",
    source="app.py",
    image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    gpu_count=4,
    memory="32Gi",
)
```
Tensor Parallelism
For large models that don't fit on a single GPU:
```python
deployment = client.deploy(
    name="llm-tp4",
    source="""
from vllm import LLM

# Model sharded across 4 GPUs
llm = LLM(
    model="meta-llama/Llama-2-70b",
    tensor_parallel_size=4,
)
# Serve...
""",
    image="vllm/vllm-openai:latest",
    gpu_count=4,
    memory="128Gi",
)
```
Decorator Pattern
Use the decorator for GPU deployments:
```python
import basilica

@basilica.deployment(
    name="pytorch-app",
    image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    port=8000,
    gpu_count=1,
    gpu_models=["NVIDIA-RTX-A4000"],
    memory="8Gi",
)
def serve():
    import torch
    from http.server import HTTPServer, BaseHTTPRequestHandler
    import json

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            info = {
                "cuda_available": torch.cuda.is_available(),
                "device_count": torch.cuda.device_count(),
                "device_name": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
                "cuda_version": torch.version.cuda,
            }
            self.send_response(200)
            self.send_header('Content-Type', 'application/json')
            self.end_headers()
            self.wfile.write(json.dumps(info).encode())

    HTTPServer(('', 8000), Handler).serve_forever()

deployment = serve()
```
Common GPU Images
| Use Case | Image |
|---|---|
| PyTorch | pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime |
| TensorFlow | tensorflow/tensorflow:2.14.0-gpu |
| vLLM | vllm/vllm-openai:latest |
| SGLang | lmsysorg/sglang:latest |
| NVIDIA Base | nvidia/cuda:12.1.0-runtime-ubuntu22.04 |
GPU with Persistent Storage
Cache model weights to avoid re-downloading:
```python
import basilica

model_cache = basilica.Volume.from_name("model-cache", create_if_missing=True)

@basilica.deployment(
    name="cached-model",
    image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    gpu_count=1,
    memory="16Gi",
    volumes={"/root/.cache/huggingface": model_cache},
)
def serve():
    from transformers import AutoModelForCausalLM

    # First run: downloads to cache
    # Subsequent runs: loads from cache
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b",
        device_map="auto",
    )
    # Serve model...
```
PyTorch Example
Full PyTorch inference server:
```python
import basilica

@basilica.deployment(
    name="pytorch-inference",
    image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    port=8000,
    gpu_count=1,
    memory="8Gi",
    pip_packages=["fastapi", "uvicorn"],
)
def serve():
    import torch
    from fastapi import FastAPI
    import uvicorn

    app = FastAPI()
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    @app.get("/")
    def info():
        return {
            "device": str(device),
            "cuda": torch.cuda.is_available(),
            "device_name": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
        }

    @app.post("/predict")
    def predict(data: dict):
        # Your inference logic here
        tensor = torch.tensor(data["input"]).to(device)
        result = tensor * 2  # Example operation
        return {"output": result.cpu().tolist()}

    uvicorn.run(app, host="0.0.0.0", port=8000)

deployment = serve()
print(f"API: {deployment.url}")
```
Resource Recommendations
| Model Size | GPU Count | GPU VRAM | Memory |
|---|---|---|---|
| Small (< 3B params) | 1 | 8GB | 8Gi |
| Medium (3-7B params) | 1 | 16GB | 16Gi |
| Large (7-13B params) | 1-2 | 24GB+ | 32Gi |
| XL (30-70B params) | 2-4 | 40GB+ | 64Gi+ |
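The tiers above can be turned into a small sizing helper. This is a rough heuristic for illustration only (the thresholds simply mirror the table; real requirements depend on precision, batch size, and framework overhead):

```python
def recommend(params_billion: float) -> dict:
    """Map model size to the resource tiers from the table above.
    Illustrative heuristic only; actual needs vary with precision
    (fp16 vs int8), batch size, and framework overhead."""
    if params_billion < 3:
        return {"gpu_count": 1, "min_gpu_memory_gb": 8, "memory": "8Gi"}
    if params_billion <= 7:
        return {"gpu_count": 1, "min_gpu_memory_gb": 16, "memory": "16Gi"}
    if params_billion <= 13:
        return {"gpu_count": 2, "min_gpu_memory_gb": 24, "memory": "32Gi"}
    return {"gpu_count": 4, "min_gpu_memory_gb": 40, "memory": "64Gi"}

print(recommend(7))  # {'gpu_count': 1, 'min_gpu_memory_gb': 16, 'memory': '16Gi'}
```

The returned dictionary maps directly onto the `gpu_count`, `min_gpu_memory_gb`, and `memory` keyword arguments of `client.deploy(...)`.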
GPU scheduling may take longer than CPU-only deployments. Set an appropriate timeout value for large deployments.
Debugging GPU Issues
Check CUDA Availability
```python
import torch

print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
print(f"cuDNN version: {torch.backends.cudnn.version()}")
```
Check GPU Memory
```python
import torch

if torch.cuda.is_available():
    print(f"Total memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    print(f"Allocated: {torch.cuda.memory_allocated(0) / 1e9:.1f} GB")
    print(f"Cached: {torch.cuda.memory_reserved(0) / 1e9:.1f} GB")
```
Common Issues
| Issue | Solution |
|---|---|
| CUDA not available | Ensure using GPU image and gpu_count >= 1 |
| Out of memory | Increase min_gpu_memory_gb or use multiple GPUs |
| Slow startup | Use persistent storage to cache models |
| Driver mismatch | Check CUDA version compatibility with image |
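For the out-of-memory row, a back-of-the-envelope check before deploying can save a failed scheduling round. This estimate covers model weights only (activations and KV cache add more on top) and is a rough heuristic, not a Basilica API:

```python
import math

def min_vram_gb(params_billion: float, bytes_per_param: int = 2) -> int:
    """Estimate VRAM for model weights alone, with ~20% headroom.
    fp16 = 2 bytes/param; ignores activations and KV cache, which add more."""
    weights_gb = params_billion * bytes_per_param  # 1e9 params * N bytes/param = N GB
    return math.ceil(weights_gb * 1.2)

# A 7B model in fp16: ~14 GB of weights, so request at least 17 GB of VRAM
print(min_vram_gb(7))  # 17
```

The result can be passed as `min_gpu_memory_gb`; if it exceeds a single GPU's VRAM, use multiple GPUs with tensor parallelism as shown earlier.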