MLflow in Production on Cloud Run
Tracking, Registry, and Plugin-Based Extensions
These notes describe a production-grade ML platform setup in which MLflow tracking and the model registry run as a Cloud Run service, extended via custom plugins to support enterprise workflows.
```mermaid
graph LR
    subgraph Container [Cloud Run Container]
        MLflow[MLflow Core]
        Plugin[Custom Plugins]
        MLflow <-->|Extends| Plugin
    end
    User[User / Pipelines] --> Container
    Container -->|Enables| Enterprise[Enterprise Workflows]
    style Plugin fill:#f9f,stroke:#333,stroke-width:2px
```
The emphasis is on system architecture, extensibility, and operational tradeoffs rather than tutorial-style deployment.
Problem Statement
Out-of-the-box MLflow works well for:
- local experimentation
- small teams
- notebook-driven workflows

It begins to break down when requirements include:
- centralized tracking across multiple teams
- custom authentication, metadata, and validation
- controlled model promotion and governance
- cloud-native scalability and reliability
This architecture addresses those gaps while keeping MLflow upgradeable.
High-Level Architecture
```mermaid
graph TD
    Client[Clients<br>training jobs, notebooks, CI] --> CR[Cloud Run<br>MLflow Tracking + Registry API]
    subgraph Service[Cloud Run Service]
        CR
        Plugin[Plugin layer<br>auth, validation, metadata]
        CR -.- Plugin
    end
    CR --> DB[(Backend Store<br>Postgres / Cloud SQL)]
    CR --> Storage[(Artifact Store<br>GCS)]
```
Key idea: MLflow operates as a stateless control plane, not a monolithic ML system.
Why Cloud Run
Cloud Run was chosen because it provides:
- Stateless HTTP execution
- Automatic horizontal scaling (including scale-to-zero)
- Native IAM integration
- Simple container-based deployment
- Cost efficiency for bursty traffic

This fits MLflow well because:
- MLflow APIs are request/response oriented
- All state lives in external systems
- Training workloads are fully decoupled
Core Components
1. MLflow Tracking Server
- Exposes REST APIs for:
  - experiment management
  - run tracking
  - parameter and metric logging
- Runs as a containerized, stateless service
- Makes no assumptions about local filesystem persistence
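A minimal client-side sketch of those APIs; the service URL and experiment name below are placeholders:

```python
import mlflow

# Placeholder Cloud Run URL; in practice this comes from config or service discovery.
mlflow.set_tracking_uri("https://mlflow-tracking-xyz.a.run.app")
mlflow.set_experiment("churn-model")                    # experiment management

with mlflow.start_run(run_name="baseline") as run:      # run tracking
    mlflow.log_param("learning_rate", 0.05)             # parameter logging
    mlflow.log_metric("val_auc", 0.91, step=1)          # metric logging
    print(run.info.run_id)
```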
2. Backend Store
Responsible for:
- experiment metadata
- runs
- parameters and metrics
- model registry state

Typical implementations:
- Cloud SQL (Postgres)
- Managed MySQL

Key requirements:
- strong transactional guarantees
- schema stability across upgrades
- automated backups and recovery
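For reference, a hedged sketch of what the backend store URI can look like when Cloud SQL (Postgres) is attached to the Cloud Run service over its unix-socket mount; all instance names and credentials are placeholders:

```python
from sqlalchemy import create_engine, text

# Placeholder values; Cloud Run mounts the Cloud SQL unix socket under /cloudsql
# when a Cloud SQL connection is attached to the service.
BACKEND_STORE_URI = (
    "postgresql+psycopg2://mlflow:REDACTED@/mlflow"
    "?host=/cloudsql/my-project:us-central1:mlflow-sql"
)

# Quick connectivity probe, e.g. from a startup check or a debugging shell.
engine = create_engine(BACKEND_STORE_URI, pool_pre_ping=True)
with engine.connect() as conn:
    conn.execute(text("SELECT 1"))
```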
3. Artifact Store
Responsible for:
- model artifacts
- checkpoints
- feature snapshots
- evaluation outputs

Typical choice:
- Google Cloud Storage (GCS)

Design notes:
- artifacts are referenced by URI
- artifacts are never served directly by the MLflow service
- access is controlled via IAM or signed URLs
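A short sketch of the by-reference behavior from the client's perspective, assuming the server is configured with a GCS artifact root and the client holds its own GCS credentials:

```python
import mlflow

mlflow.set_tracking_uri("https://mlflow-tracking-xyz.a.run.app")  # placeholder URL

with mlflow.start_run():
    # The client uploads the file to GCS itself; the tracking server only records the URI.
    mlflow.log_artifact("evaluation_report.json")
    print(mlflow.get_artifact_uri())  # e.g. gs://<bucket>/<experiment>/<run_id>/artifacts
```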
Plugin-Based Extension Model
MLflow plugins are used to extend core behavior without forking MLflow.
Why Plugins
- Avoid long-lived forks of upstream MLflow
- Preserve a clean upgrade path
- Isolate organization-specific logic
- Treat governance as a first-class concern
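Plugins ship as a separate internal package and register themselves through Python entry points, which is what keeps the upgrade path clean. A sketch of the packaging side: the distribution and module names are hypothetical, while the entry-point groups are the ones MLflow discovers plugins under (the provider classes are sketched in the subsections below).

```python
# setup.py for a hypothetical internal plugin package
from setuptools import find_packages, setup

setup(
    name="acme-mlflow-plugins",          # hypothetical internal distribution
    version="0.1.0",
    packages=find_packages(),
    install_requires=["mlflow"],
    entry_points={
        # Attach auth headers to every client request
        "mlflow.request_header_provider": [
            "cloud-run-auth=acme_mlflow_plugins.auth:CloudRunIdTokenProvider"
        ],
        # Add mandatory tags to every run at creation time
        "mlflow.run_context_provider": [
            "job-metadata=acme_mlflow_plugins.context:JobMetadataContext"
        ],
    },
)
```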
Common Plugin Responsibilities
Authentication & Authorization
- Validate caller identity
- Enforce experiment- and registry-level access
- Integrate with cloud IAM or internal identity systems
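On the client side, one way to satisfy Cloud Run IAM is a request-header-provider plugin that attaches a Google-signed ID token to every tracking call. A minimal sketch; the service URL is a placeholder, and server-side authorization still has to be enforced separately:

```python
import google.auth.transport.requests
import google.oauth2.id_token
from mlflow.tracking.request_header.abstract_request_header_provider import (
    RequestHeaderProvider,
)

AUDIENCE = "https://mlflow-tracking-xyz.a.run.app"  # placeholder Cloud Run URL


class CloudRunIdTokenProvider(RequestHeaderProvider):
    """Attach a short-lived ID token so Cloud Run IAM can authenticate the caller."""

    def in_context(self):
        return True  # always active; could be gated on an env var instead

    def request_headers(self):
        token = google.oauth2.id_token.fetch_id_token(
            google.auth.transport.requests.Request(), AUDIENCE
        )
        return {"Authorization": f"Bearer {token}"}
```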
Metadata Enrichment
- Attach required contextual metadata:
  - git SHA
  - training job ID
  - dataset or feature version
- Enforce mandatory tagging policies
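A sketch of the enrichment side as a run-context-provider plugin; the environment variable names are assumptions about how training jobs are launched, and enforcing the tagging policy server-side is a separate concern:

```python
import os

from mlflow.tracking.context.abstract_context import RunContextProvider


class JobMetadataContext(RunContextProvider):
    """Stamp every new run with the metadata the platform requires."""

    def in_context(self):
        # Only activate inside managed training jobs (env var names are assumptions).
        return "TRAINING_JOB_ID" in os.environ

    def tags(self):
        return {
            "git_sha": os.environ.get("GIT_COMMIT", "unknown"),
            "training_job_id": os.environ["TRAINING_JOB_ID"],
            "dataset_version": os.environ.get("DATASET_VERSION", "unknown"),
        }
```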
Registry Controls
- Validate model registration and stage transitions
- Enforce promotion rules (e.g., staging → production)
- Block unsafe or non-compliant transitions
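Wherever the check runs (plugin hook or promotion tooling), the gate reduces to something like the sketch below; the required-tag policy and stage names are assumptions:

```python
from mlflow import MlflowClient

REQUIRED_TAGS = {"validation_status": "passed"}  # assumed promotion policy


def promote(name: str, version: str, target_stage: str = "Production") -> None:
    client = MlflowClient()
    mv = client.get_model_version(name=name, version=version)

    # Block the transition unless the version carries the required tags.
    missing = {k: v for k, v in REQUIRED_TAGS.items() if mv.tags.get(k) != v}
    if missing:
        raise ValueError(
            f"Blocked {name} v{version} -> {target_stage}: unmet requirements {missing}"
        )

    client.transition_model_version_stage(
        name=name, version=version, stage=target_stage
    )
```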
Audit & Observability
- Emit structured audit logs
- Track registry actions and transitions
- Support compliance and forensic analysis
Model Registry Workflow
Typical promotion flow:
```mermaid
graph TD
    Job[Training Job] --> Log[Log model to MLflow]
    Log --> Reg[Register versioned model]
    Reg --> Hook[Plugin validation hooks]
    Hook --> Stage[Stage transition<br>staging / production]
    Stage --> Deploy[Downstream deployment automation]
```
Key principle:
The model registry is a control surface, not just a metadata store.
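The same flow written against the standard Python client; the toy model, model name, and target stage are placeholders, and in this architecture the plugin hooks validate the registration and transition before they are applied:

```python
import mlflow
from mlflow import MlflowClient
from sklearn.linear_model import LogisticRegression

mlflow.set_tracking_uri("https://mlflow-tracking-xyz.a.run.app")  # placeholder URL

# Toy model purely to illustrate the flow.
model = LogisticRegression().fit([[0.0], [1.0]], [0, 1])

with mlflow.start_run() as run:
    mlflow.sklearn.log_model(model, artifact_path="model")      # log model to MLflow
    model_uri = f"runs:/{run.info.run_id}/model"

mv = mlflow.register_model(model_uri, "churn-model")            # register versioned model
MlflowClient().transition_model_version_stage(                  # request stage transition
    name="churn-model", version=mv.version, stage="Staging"
)
```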
Integration with Training Pipelines
MLflow clients are invoked from:
- batch training jobs
- CI pipelines
- scheduled workflows

Key design decisions:
- clients authenticate using short-lived credentials
- training jobs never access databases directly
- all interactions flow through the MLflow API

This enforces:
- consistency
- auditability
- centralized policy enforcement
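A sketch of the client-side wiring inside a training job: mint a short-lived ID token for the Cloud Run service and let the MLflow client send it as a bearer token. The service URL is a placeholder; the request-header-provider plugin shown earlier is an equivalent way to do this:

```python
import os

import google.auth.transport.requests
import google.oauth2.id_token
import mlflow

SERVICE_URL = "https://mlflow-tracking-xyz.a.run.app"  # placeholder Cloud Run URL

# Short-lived ID token scoped to the MLflow service; no database credentials involved.
token = google.oauth2.id_token.fetch_id_token(
    google.auth.transport.requests.Request(), SERVICE_URL
)
os.environ["MLFLOW_TRACKING_TOKEN"] = token  # sent as a Bearer token on every API call
mlflow.set_tracking_uri(SERVICE_URL)

with mlflow.start_run():
    mlflow.log_metric("train_loss", 0.12)
```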
Operational Considerations
Scalability
- Cloud Run autoscaling absorbs bursty metric logging traffic
- Database connection pooling is critical
- Artifact uploads occur out-of-band from request paths
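On the pooling point, MLflow's SQLAlchemy-backed store reads pool settings from environment variables on the server, so they can be set directly on the Cloud Run service. The values below are illustrative and the variable names should be verified against the MLflow version in use:

```python
# Illustrative Cloud Run service environment for the tracking server.
# Keep (pool_size + max_overflow) * max_instances below the Cloud SQL connection limit.
TRACKING_SERVER_ENV = {
    "MLFLOW_SQLALCHEMYSTORE_POOL_SIZE": "5",
    "MLFLOW_SQLALCHEMYSTORE_MAX_OVERFLOW": "2",
    "MLFLOW_SQLALCHEMYSTORE_POOL_RECYCLE": "1800",  # seconds
}
```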
Reliability
- Stateless service enables fast restarts and redeployments
- The backend store is the primary single point of failure (SPOF)
- Health checks and alerts focus on:
- API latency
- error rates
- database connectivity
Security
- No public access to artifacts
- IAM-scoped service accounts for all components
- Plugins act as policy enforcement boundaries
Tradeoffs & Limitations
Pros
- Cloud-native and cost-efficient
- Highly extensible without forking MLflow
- Clear separation between control plane and data plane
- Evolves cleanly as governance needs grow
Cons
- Plugin APIs are lightly documented
- Debugging plugin behavior requires MLflow internals familiarity
- Registry workflows still require process discipline and tooling
Interview Takeaways
Key points to emphasize in interviews:
- MLflow functions as a metadata and control plane, not a training system
- Plugins enable enterprise governance without upstream divergence
- Cloud Run is well-suited for stateless ML control services
- Most real-world ML complexity lies in data, serving, and governance
TL;DR
- MLflow runs as a stateless service on Cloud Run
- Metadata lives in SQL; artifacts live in object storage
- Plugins enforce auth, validation, and policy
- The registry acts as a production gate
- Designed for scale, auditability, and fast iteration
These notes reflect a real production ML platform, not a tutorial deployment.