Section 4 — Monitor in Real Time
Monitoring isn’t an afterthought — it’s built into every part of the SkyPortal workflow.
Observability Dashboards
As soon as a job runs, you get:
- Training metrics: loss, accuracy, MAE, MSE
- System metrics: GPU utilization, CPU load, memory, I/O
- Budget insights: cost per run, warnings on overspend
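For readers unfamiliar with the two error metrics named above, here is a minimal sketch of how MAE (mean absolute error) and MSE (mean squared error) are conventionally computed. The function names are illustrative and are not part of any SkyPortal API. The dashboard presumably reports values of this kind, but how it computes them internally is not described here.

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average of |target - prediction| over all samples."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average of (target - prediction)^2 over all samples."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Example: absolute errors are 1, 0, 2 and squared errors are 1, 0, 4.
targets = [3.0, 5.0, 2.0]
predictions = [2.0, 5.0, 4.0]
mae = mean_absolute_error(targets, predictions)  # 1.0
mse = mean_squared_error(targets, predictions)   # 5/3, about 1.667
```

MSE penalizes large errors more heavily than MAE, so watching both curves side by side can show whether a few outliers are dominating the loss.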
Log Streams
All log output — stdout, stderr, system events — streams in real time.
Alerts & Thresholds
- Automatic alerts when metrics cross thresholds
- Optionally auto-stop jobs that exceed budget or error tolerance
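The alerting behavior described above can be sketched as a small rule evaluator. This is a hypothetical illustration of the logic only: the `Threshold` class, the `evaluate` function, and metric names such as `cost_usd` are assumptions made for this example and do not reflect SkyPortal's actual configuration format or API.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Threshold:
    metric: str             # hypothetical metric name, e.g. "cost_usd"
    limit: float            # alert when the observed value exceeds this
    stop_job: bool = False  # if True, crossing the limit also requests a stop


def evaluate(
    thresholds: List[Threshold], observed: Dict[str, float]
) -> Tuple[List[str], bool]:
    """Return (alert messages, whether any crossed rule requests a stop)."""
    alerts: List[str] = []
    should_stop = False
    for rule in thresholds:
        value = observed.get(rule.metric)
        if value is None:
            continue  # metric not reported in this sample; nothing to check
        if value > rule.limit:
            alerts.append(f"{rule.metric}={value} exceeds limit {rule.limit}")
            should_stop = should_stop or rule.stop_job
    return alerts, should_stop


rules = [
    Threshold("gpu_util", 0.95),                # alert only
    Threshold("cost_usd", 50.0, stop_job=True)  # alert and auto-stop
]
alerts, should_stop = evaluate(rules, {"gpu_util": 0.5, "cost_usd": 62.5})
```

Keeping the stop decision as an opt-in flag per rule mirrors the "optionally auto-stop" wording: a crossed threshold always produces an alert, but only rules marked for it end the job.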
Advanced Views
- Multi-host overviews across clouds
- Historical comparisons between runs
- Experiment tracking tied to parameters, datasets, and outcomes
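As a rough illustration of what a historical comparison between two runs involves, the sketch below computes per-metric differences for the metrics both runs report. The `compare_runs` function and the plain-dictionary run format are assumptions made for this example. The platform's own data model is not described in this section.

```python
from typing import Dict


def compare_runs(
    baseline: Dict[str, float], candidate: Dict[str, float]
) -> Dict[str, float]:
    """Return candidate minus baseline for every metric present in both runs."""
    shared = sorted(set(baseline) & set(candidate))
    return {name: candidate[name] - baseline[name] for name in shared}


# Hypothetical final metrics from two training runs.
run_a = {"loss": 0.5, "accuracy": 0.75}
run_b = {"loss": 0.25, "accuracy": 0.875, "mae": 1.0}

deltas = compare_runs(run_a, run_b)
# "mae" is skipped because run_a did not report it.
```

A negative delta for `loss` and a positive delta for `accuracy` both indicate that the candidate run improved on the baseline.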
This consolidated observability eliminates the need for external dashboards or tools.