Section 4 — Monitor in Real Time

Monitoring isn’t an afterthought — it’s built into every part of the SkyPortal workflow.

Observability Dashboards

As soon as a job runs, you get:

  • Training metrics: loss, accuracy, MAE, MSE
  • System metrics: GPU utilization, CPU load, memory, I/O
  • Budget insights: cost per run, warnings on overspend
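
For readers less familiar with the two error metrics listed above, here is a minimal sketch of how MAE (mean absolute error) and MSE (mean squared error) are conventionally computed. The function names and sample values are illustrative only; this is not SkyPortal's implementation or API.

```python
def mae(y_true, y_pred):
    """Mean absolute error: the average of |target - prediction|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error: the average of (target - prediction)^2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical targets and predictions for three samples.
y_true = [3.0, 5.0, 2.0]
y_pred = [2.0, 5.0, 4.0]

mae_value = mae(y_true, y_pred)  # (1 + 0 + 2) / 3
mse_value = mse(y_true, y_pred)  # (1 + 0 + 4) / 3
```

MSE penalizes large errors more heavily than MAE, so watching both on a dashboard can help distinguish a few large misses from many small ones.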

Log Streams

All log output — stdout, stderr, system events — streams in real time.

Alerts & Thresholds

  • Automatic alerts when metrics cross thresholds
  • Optionally auto-stop jobs that exceed budget or error tolerance
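
The threshold behavior described above can be sketched roughly as follows. This is a generic illustration, not SkyPortal's API: the `Threshold` class, the `check_thresholds` function, and the metric names `gpu_util_pct` and `cost_usd` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Threshold:
    metric: str             # hypothetical metric name, e.g. "cost_usd"
    limit: float            # alert when the sampled value exceeds this limit
    stop_job: bool = False  # if True, crossing the limit also requests a stop


def check_thresholds(
    sample: Dict[str, float], thresholds: List[Threshold]
) -> Tuple[List[str], bool]:
    """Return (alert messages, should_stop) for one metrics sample."""
    alerts: List[str] = []
    should_stop = False
    for t in thresholds:
        value = sample.get(t.metric)
        # Metrics missing from the sample are skipped rather than treated as errors.
        if value is not None and value > t.limit:
            alerts.append(f"{t.metric}={value} exceeded limit {t.limit}")
            should_stop = should_stop or t.stop_job
    return alerts, should_stop


# Assumed configuration: alert on high GPU utilization, stop on overspend.
thresholds = [
    Threshold("gpu_util_pct", 95.0),
    Threshold("cost_usd", 50.0, stop_job=True),
]

alerts, stop = check_thresholds({"gpu_util_pct": 97.0, "cost_usd": 12.5}, thresholds)
over_alerts, over_stop = check_thresholds({"gpu_util_pct": 80.0, "cost_usd": 60.0}, thresholds)
```

In this sketch the first sample raises one alert without stopping the job, while the second crosses the budget limit and requests a stop. Keeping the stop decision as a per-threshold flag mirrors the "optionally auto-stop" wording: alerting is the default, and stopping is opt-in.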

Advanced Views

  • Multi-host overviews across clouds
  • Historical comparisons between runs
  • Experiment tracking tied to parameters, datasets, and outcomes

This consolidated observability eliminates the need for external dashboards or tools.