MLflow in Production on Cloud Run

Tracking, Registry, and Plugin-Based Extensions

These notes describe a production-grade ML platform in which the MLflow tracking server and model registry are deployed as a Cloud Run service and extended via custom plugins to support enterprise workflows.

graph LR
    subgraph Container [Cloud Run Container]
        MLflow[MLflow Core]
        Plugin[Custom Plugins]
        MLflow <-->|Extends| Plugin
    end

    User[User / Pipelines] --> Container
    Container -->|Enables| Enterprise[Enterprise Workflows]
    style Plugin fill:#f9f,stroke:#333,stroke-width:2px

The emphasis is on system architecture, extensibility, and operational tradeoffs rather than tutorial-style deployment.


Problem Statement

Out-of-the-box MLflow works well for:

  • local experimentation
  • small teams
  • notebook-driven workflows

It begins to break down when requirements include:

  • centralized tracking across multiple teams
  • custom authentication, metadata, and validation
  • controlled model promotion and governance
  • cloud-native scalability and reliability

This architecture addresses those gaps while keeping MLflow upgradeable.


High-Level Architecture

graph TD
    Client[Clients<br>training jobs, notebooks, CI] --> CR[Cloud Run<br>MLflow Tracking + Registry API]

    subgraph Service[Cloud Run Service]
        CR
        Plugin[Plugin layer<br>auth, validation, metadata]
        CR -.- Plugin
    end

    CR --> DB[(Backend Store<br>Postgres / Cloud SQL)]
    CR --> Storage[(Artifact Store<br>GCS)]

Key idea: MLflow operates as a stateless control plane, not a monolithic ML system.
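
Concretely, the container's entrypoint reduces to a single stateless server process. A sketch of the launch command (project, instance, bucket, and credential names are hypothetical):

```shell
# Container entrypoint (hypothetical names). The service itself holds no state:
# metadata goes to Cloud SQL, artifacts go straight to GCS from clients.
exec mlflow server \
  --host 0.0.0.0 \
  --port "${PORT:-8080}" \
  --backend-store-uri "postgresql+psycopg2://mlflow:${DB_PASSWORD}@/mlflow?host=/cloudsql/my-project:us-central1:mlflow-db" \
  --default-artifact-root gs://my-mlflow-artifacts/experiments \
  --no-serve-artifacts
```

Cloud Run injects traffic on the configured port; with --no-serve-artifacts, clients read and write artifacts in GCS directly rather than proxying bytes through the service.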


Why Cloud Run

Cloud Run was chosen because it provides:

  • Stateless HTTP execution
  • Automatic horizontal scaling (including scale-to-zero)
  • Native IAM integration
  • Simple container-based deployment
  • Cost efficiency for bursty traffic

This fits MLflow well because:

  • MLflow APIs are request/response oriented
  • All state lives in external systems
  • Training workloads are fully decoupled


Core Components

1. MLflow Tracking Server

  • Exposes REST APIs for:
      • experiment management
      • run tracking
      • parameter and metric logging
  • Runs as a containerized, stateless service
  • Makes no assumptions about local filesystem persistence

2. Backend Store

Responsible for:

  • experiment metadata
  • runs
  • parameters and metrics
  • model registry state

Typical implementations:

  • Cloud SQL (Postgres)
  • Managed MySQL

Key requirements:

  • strong transactional guarantees
  • schema stability across upgrades
  • automated backups and recovery


3. Artifact Store

Responsible for:

  • model artifacts
  • checkpoints
  • feature snapshots
  • evaluation outputs

Typical choice:

  • Google Cloud Storage (GCS)

Design notes:

  • artifacts are referenced by URI
  • artifacts are never served directly by the MLflow service
  • access is controlled via IAM or signed URLs


Plugin-Based Extension Model

MLflow plugins are used to extend core behavior without forking MLflow.

Why Plugins

  • Avoid long-lived forks of upstream MLflow
  • Preserve a clean upgrade path
  • Isolate organization-specific logic
  • Treat governance as a first-class concern
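
Mechanically, a plugin is just a pip-installable package that registers itself under one of MLflow's setuptools entry-point groups. A sketch of the packaging metadata (package, module, and class names are hypothetical; the entry-point group names are MLflow's):

```toml
# pyproject.toml of a hypothetical internal plugin package
[project]
name = "acme-mlflow-plugins"
version = "0.1.0"
dependencies = ["mlflow"]

# MLflow discovers plugins through these entry-point groups at import time.
[project.entry-points."mlflow.request_auth_provider"]
acme-auth = "acme_mlflow_plugins.auth:IdTokenAuthProvider"

[project.entry-points."mlflow.run_context_provider"]
acme-context = "acme_mlflow_plugins.context:CIContextProvider"
```

Because the plugin ships as an ordinary package, upgrading MLflow itself is a dependency bump, not a rebase of a fork.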

Common Plugin Responsibilities

Authentication & Authorization

  • Validate caller identity
  • Enforce experiment- and registry-level access
  • Integrate with cloud IAM or internal identity systems

Metadata Enrichment

  • Attach required contextual metadata:
      • git SHA
      • training job ID
      • dataset or feature version
  • Enforce mandatory tagging policies
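
The tagging policy itself can stay plugin-agnostic. A minimal sketch of the enforcement logic (tag names are illustrative):

```python
# Mandatory-tag policy: reject runs that omit required contextual metadata.
REQUIRED_TAGS = {"git.sha", "training.job_id", "dataset.version"}


def check_required_tags(tags):
    """Return the set of missing mandatory tags (empty means compliant)."""
    return REQUIRED_TAGS - set(tags)


def enforce_required_tags(tags):
    """Raise if a run's tags violate the mandatory tagging policy."""
    missing = check_required_tags(tags)
    if missing:
        raise ValueError(f"run rejected, missing mandatory tags: {sorted(missing)}")
```

A metadata-enrichment plugin would call something like this on every run creation, so compliance is enforced centrally rather than by convention.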

Registry Controls

  • Validate model registration and stage transitions
  • Enforce promotion rules (e.g., staging → production)
  • Block unsafe or non-compliant transitions
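
The promotion rules above can be captured as an explicit transition table. A sketch (stage names follow MLflow's classic registry stages; the evaluation-report requirement is an illustrative policy, not an MLflow feature):

```python
# Allowed registry stage transitions; anything not listed is blocked.
ALLOWED_TRANSITIONS = {
    ("None", "Staging"),
    ("Staging", "Production"),
    ("Staging", "Archived"),
    ("Production", "Archived"),
}


def validate_transition(current_stage, target_stage, has_eval_report=False):
    """Raise if a stage transition violates promotion policy."""
    if (current_stage, target_stage) not in ALLOWED_TRANSITIONS:
        raise ValueError(
            f"blocked: {current_stage} -> {target_stage} is not an allowed transition"
        )
    if target_stage == "Production" and not has_eval_report:
        raise ValueError(
            "blocked: promotion to Production requires an evaluation report"
        )
```

Encoding the rules as data makes the policy auditable and trivially unit-testable, and it blocks shortcuts such as promoting straight from None to Production.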

Audit & Observability

  • Emit structured audit logs
  • Track registry actions and transitions
  • Support compliance and forensic analysis
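
Audit events are most useful as structured, machine-parseable records. A sketch of an emitter (field names and the logger name are illustrative):

```python
import json
import logging
from datetime import datetime, timezone

audit_logger = logging.getLogger("mlflow.audit")


def emit_audit_event(actor, action, resource, outcome="allowed"):
    """Emit one structured audit record; return it for forwarding/testing."""
    event = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "action": action,
        "resource": resource,
        "outcome": outcome,
    }
    audit_logger.info(json.dumps(event))
    return event
```

On Cloud Run, JSON lines written to stdout/stderr land in Cloud Logging automatically, which is usually enough for compliance queries and forensic reconstruction.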

Model Registry Workflow

Typical promotion flow:

graph TD
    Job[Training Job] --> Log[Log model to MLflow]
    Log --> Reg[Register versioned model]
    Reg --> Hook[Plugin validation hooks]
    Hook --> Stage[Stage transition<br>staging / production]
    Stage --> Deploy[Downstream deployment automation]

Key principle:

The model registry is a control surface, not just a metadata store.


Integration with Training Pipelines

MLflow clients are invoked from:

  • batch training jobs
  • CI pipelines
  • scheduled workflows

Key design decisions:

  • clients authenticate using short-lived credentials
  • training jobs never access databases directly
  • all interactions flow through the MLflow API

This enforces:

  • consistency
  • auditability
  • centralized policy enforcement


Operational Considerations

Scalability

  • Cloud Run autoscaling absorbs bursty metric logging traffic
  • Database connection pooling is critical
  • Artifact uploads occur out-of-band from request paths

Reliability

  • Stateless service enables fast restarts and redeployments
  • The backend store is the primary single point of failure (SPOF)
  • Health checks and alerts focus on:
      • API latency
      • error rates
      • database connectivity

Security

  • No public access to artifacts
  • IAM-scoped service accounts for all components
  • Plugins act as policy enforcement boundaries

Tradeoffs & Limitations

Pros

  • Cloud-native and cost-efficient
  • Highly extensible without forking MLflow
  • Clear separation between control plane and data plane
  • Evolves cleanly as governance needs grow

Cons

  • Plugin APIs are lightly documented
  • Debugging plugin behavior requires MLflow internals familiarity
  • Registry workflows still require process discipline and tooling

Interview Takeaways

Key points to emphasize in interviews:

  • MLflow functions as a metadata and control plane, not a training system
  • Plugins enable enterprise governance without upstream divergence
  • Cloud Run is well-suited for stateless ML control services
  • Most real-world ML complexity lies in data, serving, and governance


TL;DR

  • MLflow runs as a stateless service on Cloud Run
  • Metadata lives in SQL; artifacts live in object storage
  • Plugins enforce auth, validation, and policy
  • The registry acts as a production gate
  • Designed for scale, auditability, and fast iteration

These notes reflect a real production ML platform, not a tutorial deployment.