SkyPortal Documentation
The AI-native platform that simplifies how teams build, train, and manage machine learning models.
The AI Agent Platform
The AI agent that sets up and runs your ML jobs across any cloud
Any AI use case accelerated 10X
What is SkyPortal?
SkyPortal unifies your entire ML workflow in one platform. Instead of juggling multiple tools like Weights & Biases, GitHub, cloud consoles, and observability platforms, SkyPortal brings everything together.
- Multi-Cloud Management: Connect to every host from any cloud provider. Track server health, job runtime, and experiment progress with convenient tagging and a single terminal interface.
- Job Orchestration: Launch and manage training jobs on cloud or on-prem GPUs, with automatic scheduling and resource optimization.
- Real-Time Observability: Monitor accuracy, loss, epochs, GPU usage, and budget spend in real time with comprehensive dashboards.
- Team Collaboration: Centralized logging, experiment tracking, and reproducibility for your entire team.
- AI Agents: Smart copilots that debug, optimize, and automate workflows for ML engineers and their teams.
- ML-Specific Code Editor: Lint code locally or on any host. Diff files across repo, host, and local environments.
Turn One ML Engineer into Ten
ML teams today juggle separate environments, repos, and results, which makes collaboration slow and error-prone. SkyPortal gives you an immediate 10x improvement in productivity and scale.
Key Features
Experiment Tracking
Automatically log hyperparameters, code versions, datasets, and metrics for complete reproducibility.
Usage Monitoring
Track GPU hours, storage consumption, and job status across all your infrastructure.
Budget Control
Early stopping and overage alerts prevent runaway costs and optimize resource usage.
Multi-Cloud Training
Run training jobs across AWS, GCP, Azure, and independent GPU providers seamlessly.
Advanced Metrics
MAE, MSE, accuracy, throughput, and infrastructure usage, all in one dashboard.
Enterprise Ready
Private cloud deployment, role-based access control, and compliance features.
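To make the metrics and budget controls listed under Key Features concrete, here is a minimal, generic Python sketch of MAE, MSE, a patience-based early-stopping check, and a budget overage check. This is not SkyPortal's API: the function names (`mae`, `mse`, `should_stop_early`, `over_budget`) and the parameters (`patience`, `min_delta`, `hourly_rate`) are hypothetical and are used only to illustrate the underlying ideas.

```python
from statistics import fmean


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return fmean(abs(t - p) for t, p in zip(y_true, y_pred))


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return fmean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def should_stop_early(val_losses, patience=3, min_delta=0.0):
    """Return True if none of the last `patience` epochs improved on the
    best earlier validation loss by more than `min_delta`.

    Hypothetical helper: a common patience-based rule, not a SkyPortal call.
    """
    if len(val_losses) <= patience:
        return False  # not enough history to judge yet
    best_before = min(val_losses[:-patience])
    recent = val_losses[-patience:]
    return all(loss >= best_before - min_delta for loss in recent)


def over_budget(gpu_hours, hourly_rate, budget):
    """Return True when the estimated spend exceeds the budget.

    Assumes a flat hourly rate; real billing is usually more complex.
    """
    return gpu_hours * hourly_rate > budget
```

In a training loop, a check like `should_stop_early` would typically run once per epoch on the validation-loss history, while a check like `over_budget` would run against accumulated GPU hours to trigger an alert.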
Who Uses SkyPortal?
- ML Engineers: Manage multiple cloud hosts in one place with unified observability and control.
- Data Scientists: Seamless training and easy experiment comparison without infrastructure headaches.
- Product Managers: Visibility into training progress, costs, and performance metrics in real time.
- Engineering Leaders: Cost transparency, reproducibility, and compliance for AI initiatives at scale.
Integration Ecosystem
SkyPortal works with your existing tools:
- ML Frameworks: PyTorch, TensorFlow, Hugging Face
- Cloud Providers: AWS, GCP, Azure, RunPod, Vast.ai
- Data Storage: S3, MinIO
- Version Control: GitHub
- Experiment Tracking: Weights & Biases
- Infrastructure: Kubernetes
Get Started
- Get Started with SkyPortal: Visit our main website to sign up and get started.
- User Guide: Comprehensive guides on using all SkyPortal features.
- API Reference: Complete API documentation for programmatic access.
- Release Notes: Latest features, improvements, and bug fixes.
Ready to accelerate your ML workflow?