Section 2 — Configure Your Host

Before running training jobs, make sure each host is correctly configured.

Automatic Environment Inspection

When a host connects, SkyPortal inspects:

  • GPU model and driver versions
  • Installed Python environments
  • Active processes
  • Network settings and ports

  1. Verify GPU Drivers
    Ensure the installed GPU driver version is compatible with your frameworks (e.g., the CUDA version that your PyTorch or TensorFlow build targets).

  2. Choose Python Environment
    Select or create a virtual environment for your ML stack.

  3. Dependency Resolution
    Use SkyPortal’s integrated tooling to install packages and detect conflicts.

  4. Test Run
    Launch a small sample script to verify compute readiness and observability.
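
The four steps above can also be cross-checked by hand on the host. The sketch below is a minimal, generic preflight script and not SkyPortal's own tooling or API. It assumes NVIDIA GPUs with the nvidia-smi CLI on the PATH and uses pip check as a stand-in for conflict detection. The function names (gpu_drivers, in_virtualenv, dependency_conflicts, smoke_test, preflight) are hypothetical.

```python
import shutil
import subprocess
import sys


def gpu_drivers():
    """Step 1: return [(gpu_name, driver_version), ...], or [] if unavailable.

    Assumes NVIDIA hardware; other vendors need a different query tool.
    """
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return []
    try:
        result = subprocess.run(
            [exe, "--query-gpu=name,driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=10,
        )
    except (OSError, subprocess.SubprocessError):
        return []
    if result.returncode != 0:
        return []
    rows = []
    for line in result.stdout.splitlines():
        # Split on the last comma so a comma in the GPU name is harmless.
        name, sep, driver = line.rpartition(",")
        if sep:
            rows.append((name.strip(), driver.strip()))
    return rows


def in_virtualenv():
    """Step 2: True when this interpreter runs inside a virtual environment."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def dependency_conflicts():
    """Step 3: run `pip check`; return reported problems, or [] if none."""
    try:
        result = subprocess.run(
            [sys.executable, "-m", "pip", "check"],
            capture_output=True, text=True, timeout=60,
        )
    except (OSError, subprocess.SubprocessError):
        return ["could not run pip check"]
    if result.returncode == 0:
        return []
    return result.stdout.splitlines() or result.stderr.splitlines()


def smoke_test():
    """Step 4: run a tiny script in a child interpreter; True on exit code 0."""
    script = "import math; assert math.isclose(math.sqrt(16.0), 4.0)"
    try:
        result = subprocess.run(
            [sys.executable, "-c", script], capture_output=True, timeout=30
        )
    except (OSError, subprocess.SubprocessError):
        return False
    return result.returncode == 0


def preflight():
    """Collect all four checks into one report dictionary."""
    return {
        "gpus": gpu_drivers(),
        "in_virtualenv": in_virtualenv(),
        "conflicts": dependency_conflicts(),
        "smoke_test_ok": smoke_test(),
    }
```

Calling preflight() returns a dictionary that you can print or log. An empty "gpus" list means nvidia-smi was missing or failed, which is worth investigating before moving on to a real test run. The smoke test here only proves that the interpreter starts, so replace its script with a short framework-specific job to exercise the GPU itself.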

Integrated Observability

Once configured, you’ll immediately see:

  • CPU and GPU health
  • Memory usage
  • Active workflows and metrics
  • Alerts and event logs

Use these views to confirm that your host is production-ready before launching real workloads.
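
To sanity-check the readings listed above against the host itself, a local snapshot can help. The following is a rough sketch under stated assumptions, not how SkyPortal collects its metrics: it assumes a Linux host (for /proc/meminfo and load averages) and NVIDIA GPUs with nvidia-smi available. All function names are hypothetical.

```python
import os
import shutil
import subprocess


def cpu_load():
    """1-minute load average per CPU core, or None where unsupported."""
    try:
        load1, _, _ = os.getloadavg()
    except (AttributeError, OSError):
        return None
    return load1 / (os.cpu_count() or 1)


def memory_used_fraction(path="/proc/meminfo"):
    """Fraction of RAM in use, read from a meminfo-style file (Linux only)."""
    fields = {}
    try:
        with open(path) as fh:
            for line in fh:
                key, _, rest = line.partition(":")
                parts = rest.split()
                if parts and parts[0].isdigit():
                    fields[key.strip()] = int(parts[0])
    except OSError:
        return None
    total = fields.get("MemTotal")
    available = fields.get("MemAvailable")
    if not total or available is None:
        return None
    return 1.0 - available / total


def gpu_stats():
    """Per-GPU utilization, memory, and temperature as strings, or []."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return []
    keys = ["utilization.gpu", "memory.used", "memory.total", "temperature.gpu"]
    try:
        result = subprocess.run(
            [exe, "--query-gpu=" + ",".join(keys),
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, timeout=10,
        )
    except (OSError, subprocess.SubprocessError):
        return []
    if result.returncode != 0:
        return []
    stats = []
    for line in result.stdout.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) == len(keys):
            # Values stay as strings because some fields may report "[N/A]".
            stats.append(dict(zip(keys, parts)))
    return stats


def snapshot():
    """Combine the CPU, memory, and GPU readings into one dictionary."""
    return {
        "cpu_load_per_core": cpu_load(),
        "memory_used_fraction": memory_used_fraction(),
        "gpus": gpu_stats(),
    }
```

A value of None or an empty list means that reading is unavailable on this platform, not that the host is unhealthy.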