Section 2 — Configure Your Host
Before running training jobs, make sure each host is correctly configured.
Automatic Environment Inspection
When a host connects, SkyPortal inspects:
- GPU model and driver versions
- Installed Python environments
- Active processes
- Network settings and ports
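To get a feel for the kind of data such an inspection gathers, here is a minimal local sketch covering only the GPU and Python items above. It is not SkyPortal's implementation: the function names `parse_gpu_csv` and `inspect_host` are hypothetical, and the sketch assumes an NVIDIA GPU with `nvidia-smi` on the PATH (it reports an empty GPU list otherwise).

```python
import shutil
import subprocess
import sys


def parse_gpu_csv(text):
    """Parse `nvidia-smi --query-gpu=name,driver_version --format=csv,noheader` output."""
    gpus = []
    for line in text.strip().splitlines():
        # Split on the last comma so a comma inside the GPU name does not break parsing.
        name, _, driver = line.rpartition(",")
        if name:
            gpus.append({"model": name.strip(), "driver": driver.strip()})
    return gpus


def inspect_host():
    """Collect a small, local approximation of the inspected properties (hypothetical helper)."""
    gpus = []
    if shutil.which("nvidia-smi"):
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=False,
        )
        if result.returncode == 0:
            gpus = parse_gpu_csv(result.stdout)
    return {
        "gpus": gpus,
        "python_executable": sys.executable,
        "python_version": sys.version.split()[0],
    }


if __name__ == "__main__":
    print(inspect_host())
```

Running the script on a host prints one dictionary; comparing it with what SkyPortal reports is a quick way to confirm you are looking at the machine you expect.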
Recommended Configuration Steps
- Verify GPU Drivers: Ensure the installed driver supports the CUDA version your frameworks require (e.g., PyTorch or TensorFlow).
- Choose Python Environment: Select or create a virtual environment for your ML stack.
- Dependency Resolution: Use SkyPortal’s integrated tooling to install packages and detect conflicts.
- Test Run: Launch a small sample script to verify compute readiness and observability.
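For the test run, a small script along the following lines may be enough. This is a sketch under stated assumptions, not a script shipped with SkyPortal: the function name `smoke_test` is hypothetical, and the GPU check assumes PyTorch as the framework. If PyTorch is not installed, the report simply says so.

```python
import importlib.util
import sys


def smoke_test(framework="torch"):
    """Return a small readiness report for the current Python environment."""
    report = {
        "python_version": sys.version.split()[0],
        # True when running inside a venv-style virtual environment.
        "in_virtualenv": sys.prefix != sys.base_prefix,
        "framework_installed": importlib.util.find_spec(framework) is not None,
        "gpu_visible": False,
        "gpu_compute_ok": False,
    }
    # The GPU check below is PyTorch-specific (an assumption of this sketch).
    if framework == "torch" and report["framework_installed"]:
        import torch

        report["gpu_visible"] = torch.cuda.is_available()
        if report["gpu_visible"]:
            # A tiny matrix multiply on the GPU exercises driver, runtime, and memory.
            x = torch.rand(256, 256, device="cuda")
            report["gpu_compute_ok"] = bool(torch.isfinite(x @ x).all())
    return report


if __name__ == "__main__":
    for key, value in smoke_test().items():
        print(f"{key}: {value}")
```

If `gpu_visible` is false on a GPU host, a driver or CUDA mismatch is a likely cause and worth ruling out first, which ties back to the first step.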
Integrated Observability
Once configured, you’ll immediately see:
- CPU and GPU health
- Memory usage
- Active workflows and metrics
- Alerts and event logs
These signals let you confirm that the host is production-ready before you launch real workloads.
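As an illustration of how metrics like these can turn into alerts, here is a minimal threshold check. It does not use or describe SkyPortal's API: the function `evaluate_alerts`, the metric names, and the limits are all hypothetical values chosen for the example.

```python
def evaluate_alerts(sample, limits):
    """Return an alert message for every metric in `sample` that exceeds its limit."""
    alerts = []
    for metric, limit in limits.items():
        value = sample.get(metric)
        # Metrics missing from the sample are skipped rather than treated as failures.
        if value is not None and value > limit:
            alerts.append(f"{metric}={value} exceeds limit {limit}")
    return alerts


# Hypothetical metric names and thresholds, for illustration only.
sample = {"gpu_util_pct": 97.0, "gpu_mem_used_pct": 62.5, "gpu_temp_c": 88}
limits = {"gpu_mem_used_pct": 90, "gpu_temp_c": 85}

print(evaluate_alerts(sample, limits))
# → ['gpu_temp_c=88 exceeds limit 85']
```

High utilization is left without a limit here on purpose: a busy GPU during training is expected, while temperature and memory pressure are more likely to indicate a problem.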