Building a platform from scratch is ultimately about standardization and developer experience.
The Problem
How do small companies go from scrappy startup to unicorn-level operations?
We all know the pattern: early-stage teams, especially in ML and DS, prioritize speed over robustness.
Ship features → hit deadlines → experiment fast. Best practices come later.
But what happens when the company starts raising bigger rounds, scaling operations, hiring dozens of new engineers, and introducing more requirements?
Suddenly:
- Every mid-level engineer can deploy a FastAPI model.
- But not every engineer can deploy a FastAPI model that is scalable, observable, reproducible, and ready for fast iteration.
That’s where I stepped in.
I needed to define a new, unified way to deploy all models across three teams in the company.
Understanding the Business Need
“Let’s build a huge MLOps platform!”
Easy to say. But reality is different.
Before writing a single line of code or drafting epics, we had to ask the right questions:
- What does the business actually need right now?
- How much time do we have to build core components internally?
- Should we assemble an open-source stack ourselves?
- Or adopt a Platform-as-a-Service like Weights & Biases, H2O, or Databricks?
- Is a hybrid approach more feasible?
After several weeks of conversations with stakeholders, directors, and the CTO, we narrowed down the options:
- Fully internal platform built on open-source
- Open-source + PaaS mix
- Full PaaS platform
After evaluating cost, maintainability, and time-to-value, we chose a hybrid infrastructure: some components would be managed by a PaaS, while others would remain internal.
The First Platform Proposal
We drafted our first platform architecture using open-source components.
It looked like this:

[Architecture diagram: first platform proposal, assembled from open-source components]
For each component, we created user stories and defined clear acceptance criteria.
We built:
- Common containers
- Monitoring and model drift tools
- Documentation for every stage of training and deployment
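The post doesn't name the drift metric we used, but the Population Stability Index (PSI) is a common choice for model-drift tooling like this. A self-contained sketch, where the bin count and the add-0.5 smoothing are illustrative assumptions:

```python
import math

def psi(expected: list[float], actual: list[float], bins: int = 10) -> float:
    """Population Stability Index between a baseline sample and live data.

    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 alert.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def fractions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # add-0.5 smoothing so empty bins don't blow up the log term
        total = len(values) + 0.5 * bins
        return [(c + 0.5) / total for c in counts]

    return sum(
        (a - e) * math.log(a / e)
        for e, a in zip(fractions(expected), fractions(actual))
    )
```

Run on a schedule against each model's feature distributions, a metric like this turns "the model feels off" into an alert with a threshold.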
We also defined the entire end-to-end process:
- Branching strategy (trunk-based vs. Gitflow)
- CI/CD steps required to validate newly trained models
- Ingestion of logs and metrics into the data lake for long-term analysis
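As one example of a CI/CD validation step, a promotion gate can compare a candidate model's metrics against the production baseline and fail the pipeline on any regression. A hypothetical sketch; the metric names and tolerance are assumptions, not our actual gate:

```python
def validate_candidate(
    candidate: dict[str, float],
    baseline: dict[str, float],
    max_regression: float = 0.01,
) -> list[str]:
    """Return a list of failure reasons; an empty list means 'safe to promote'."""
    failures = []
    for metric, base_value in baseline.items():
        cand_value = candidate.get(metric)
        if cand_value is None:
            failures.append(f"missing metric: {metric}")
        elif cand_value < base_value - max_regression:
            failures.append(
                f"{metric} regressed: {cand_value:.3f} vs baseline {base_value:.3f}"
            )
    return failures
```

In CI, the step exits nonzero whenever the list is nonempty, which blocks the merge or deployment until the regression is explained or fixed.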
This allowed the company to scale model deployments predictably and consistently.