Day 3 — Data Engineering Basics for MLOps (Plain and Practical)




So far, you’ve learned:

Day 1 → What MLOps is and why it matters

Day 2 → The ML lifecycle from idea to production

Today, we go deeper into the foundation of every ML system: Data. Because in real-world ML systems:

  • Good data beats a good model.

  • Bad data destroys even the best model.

Think of data as fuel for a vehicle. If the fuel is dirty, the engine gets damaged.

Data Engineering is the process of collecting, storing, cleaning, transforming, and delivering data so ML systems can use it.

And MLOps heavily depends on strong Data Engineering, just like DevOps depends on strong infrastructure.


1. What Is Data Engineering? (Simple Definition)

Data Engineering is:

The process of building systems that move data from where it is generated to where it is needed — safely, reliably, and at scale.

Meaning:

Raw Data → Processing → Usable Data for ML

Just like a water treatment plant:

  • collects raw water

  • filters it

  • distributes clean water to homes

Data Engineering does the same for data.


2. Where Does Data Come From? (Real-World Examples)

Food Delivery App

  • app events

  • GPS signals

  • delivery times

  • restaurant ratings

Banking

  • transactions

  • balances

  • repayment history

  • loan and EMI details

Hospitals

  • patient records

  • lab results

  • prescriptions

E-commerce

  • product clicks

  • search terms

  • orders

Data may come from:

  • databases

  • APIs

  • logs

  • sensors

  • external data sources

MLOps ensures this data flows smoothly end-to-end.


3. Types of Data (Very Simple Explanation)

Structured Data

Like rows in Excel.

Examples:

  • bank statements

  • customer info

  • transaction tables

Stored in databases like:

  • MySQL

  • PostgreSQL

  • Snowflake

  • BigQuery


Unstructured Data

Not in row-column form.

Examples:

  • emails

  • chat messages

  • medical reports

  • images

  • audio


Semi-Structured Data

Partly organized.

Examples:

  • JSON

  • XML

  • log files


Understanding data type matters because:

  • different tools handle different data

  • ML pipelines need the right formats
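
To make the three data types concrete, here is a minimal Python sketch. The sample values, field names, and the `flatten_event` helper are made up for illustration; a real pipeline would read from actual files or APIs.

```python
import csv
import io
import json

# Structured: fixed columns, like a spreadsheet or a SQL table.
csv_text = "customer_id,amount\n101,250.0\n102,99.5\n"
structured_rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: JSON has keys, but fields can be nested or missing.
json_text = '{"customer_id": 101, "device": {"os": "android"}, "tags": ["new"]}'
event = json.loads(json_text)


def flatten_event(evt):
    """Turn one nested event into a flat row with fixed columns (hypothetical helper)."""
    return {
        "customer_id": evt["customer_id"],
        "device_os": evt.get("device", {}).get("os"),
        "tag_count": len(evt.get("tags", [])),
    }


flat_row = flatten_event(event)

# Unstructured: free text has no schema at all; here we only count words.
review_text = "Food arrived late but it was still hot"
word_count = len(review_text.split())

print(structured_rows[0])
print(flat_row)
print(word_count)
```

Notice that the CSV reader returns every value as a string, while JSON keeps numbers as numbers. Small differences like this are why the data type decides which tool and which cleaning step you need.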


4. Batch vs Real-Time Data (Important Concept)

This is a key difference in ML systems.

Batch Data

Processed at intervals:

  • every hour

  • every day

  • every week

Example: A bank recalculates each customer's credit score once a month.


Real-Time Data

Processed as each event arrives, usually within milliseconds or seconds.

Example: Fraud detection when you swipe your card.


MLOps systems must support both reliably.
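
The difference can be sketched in a few lines of Python. This is only an illustration: the transactions, the `StreamingFraudCheck` class, and the "5 times the customer's average" rule are assumptions, not a real fraud model.

```python
from collections import defaultdict

# Hypothetical card transactions: (customer_id, amount)
transactions = [("c1", 40.0), ("c2", 15.0), ("c1", 900.0), ("c2", 20.0)]


def batch_totals(records):
    """Batch: process the whole collected set in one go (for example, nightly)."""
    totals = defaultdict(float)
    for customer_id, amount in records:
        totals[customer_id] += amount
    return dict(totals)


class StreamingFraudCheck:
    """Real-time: decide on each event as it arrives, keeping only small running state."""

    def __init__(self, threshold_multiplier=5.0):
        self.threshold_multiplier = threshold_multiplier
        self.count = defaultdict(int)
        self.total = defaultdict(float)

    def process(self, customer_id, amount):
        seen = self.count[customer_id]
        average = self.total[customer_id] / seen if seen else None
        # Flag the event if it is far above this customer's running average.
        suspicious = average is not None and amount > self.threshold_multiplier * average
        self.count[customer_id] += 1
        self.total[customer_id] += amount
        return suspicious


totals = batch_totals(transactions)

checker = StreamingFraudCheck()
flags = [checker.process(cid, amt) for cid, amt in transactions]

print(totals)
print(flags)
```

The batch function needs all records before it can answer. The streaming check answers immediately for every event, which is what a card swipe requires.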


5. What Do Data Engineers Actually Build?

They build data pipelines. A data pipeline is like a factory assembly line for data.

Flow looks like this:

Source System
↓
Ingestion
↓
Processing
↓
Storage
↓
Serving to ML

Example — Food Delivery ETA Model

Orders come in
↓
Data pipeline collects order time, distance, traffic
↓
Data is cleaned
↓
Stored in warehouse
↓
Used for model training & prediction

If the pipeline breaks:

  • model starves

  • predictions fail

  • business suffers

That’s why MLOps + Data Engineering must work together.
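
The Source → Ingestion → Processing → Storage → Serving flow can be sketched as four small Python functions. This is a toy version under stated assumptions: the order records are invented, and an in-memory SQLite database stands in for a real warehouse.

```python
import sqlite3


def ingest():
    """Source + ingestion: hypothetical raw order events."""
    return [
        {"order_id": 1, "distance_km": 2.5, "minutes": 18},
        {"order_id": 2, "distance_km": None, "minutes": 25},  # missing distance
        {"order_id": 3, "distance_km": 4.0, "minutes": 30},
    ]


def clean(rows):
    """Processing: drop rows the model cannot learn from."""
    return [r for r in rows if r["distance_km"] is not None and r["minutes"] > 0]


def store(conn, rows):
    """Storage: write cleaned rows into a table with a fixed schema."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, distance_km REAL, minutes REAL)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (:order_id, :distance_km, :minutes)", rows
    )
    conn.commit()


def serve_training_data(conn):
    """Serving: hand (feature, label) pairs to model training."""
    return conn.execute(
        "SELECT distance_km, minutes FROM orders ORDER BY order_id"
    ).fetchall()


conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
store(conn, clean(ingest()))
training_rows = serve_training_data(conn)
print(training_rows)
```

If any one of these functions fails or returns nothing, `training_rows` is empty and the model has nothing to train on. That is the "model starves" case in miniature.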


6. Data Quality — Why It Matters So Much

Bad data leads to bad predictions. This is called Garbage In, Garbage Out (GIGO).

Examples:

Bank: wrong salary → wrong loan approval

Hospital: missing symptoms → wrong diagnosis model

E-commerce: duplicate users → wrong recommendations

Food delivery: wrong distance values → wrong ETA

So we must constantly check:

  • missing values

  • incorrect values

  • outdated data

  • bias

  • anomalies

MLOps automates these checks inside pipelines, just as DevOps automates software testing.
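
An automated check can be as simple as a function that returns a list of problems. The sketch below covers three of the checks listed above (missing values, incorrect values, duplicates). The `check_quality` function, the sample rows, and the 0 to 100 km range are illustrative assumptions.

```python
def check_quality(rows, required, ranges, key):
    """Return a list of human-readable problems found in `rows`."""
    problems = []
    seen_keys = set()
    for index, row in enumerate(rows):
        # Missing values
        for field in required:
            if row.get(field) is None:
                problems.append(f"row {index}: missing {field}")
        # Incorrect values (outside an allowed range)
        for field, (low, high) in ranges.items():
            value = row.get(field)
            if value is not None and not (low <= value <= high):
                problems.append(f"row {index}: {field}={value} outside [{low}, {high}]")
        # Duplicates
        if row.get(key) in seen_keys:
            problems.append(f"row {index}: duplicate {key}={row.get(key)}")
        seen_keys.add(row.get(key))
    return problems


# Hypothetical delivery records with three deliberate defects.
deliveries = [
    {"order_id": 1, "distance_km": 3.2},
    {"order_id": 2, "distance_km": None},
    {"order_id": 3, "distance_km": -5.0},
    {"order_id": 1, "distance_km": 2.0},
]

issues = check_quality(
    deliveries,
    required=["order_id", "distance_km"],
    ranges={"distance_km": (0.0, 100.0)},
    key="order_id",
)

for issue in issues:
    print(issue)
```

In a pipeline, a non-empty `issues` list would typically stop the run or raise an alert before the bad data reaches training.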


7. Intro to SQL — The Language of Data

SQL helps you:

  • retrieve data

  • filter data

  • join tables

  • aggregate results

For example:

A bank wants to know: “How many loans were approved last month?”

SQL answers that. Even ML engineers and MLOps engineers must know basic SQL, because it’s used everywhere.
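
Here is how the loan question could look in practice, run from Python with the built-in `sqlite3` module. The `loans` table, its columns, and the dates are assumptions made for this example; "last month" is fixed to May 2024 so the result is reproducible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE loans (loan_id INTEGER PRIMARY KEY, status TEXT, decided_on TEXT)"
)
conn.executemany(
    "INSERT INTO loans VALUES (?, ?, ?)",
    [
        (1, "approved", "2024-05-03"),
        (2, "rejected", "2024-05-10"),
        (3, "approved", "2024-05-28"),
        (4, "approved", "2024-06-02"),
    ],
)

# Retrieve + filter: loans approved in May 2024.
count_query = """
SELECT COUNT(*) FROM loans
WHERE status = 'approved'
  AND decided_on >= ? AND decided_on < ?
"""
(approved_last_month,) = conn.execute(
    count_query, ("2024-05-01", "2024-06-01")
).fetchone()

# Aggregate: how many loans per status overall.
per_status = conn.execute(
    "SELECT status, COUNT(*) FROM loans GROUP BY status ORDER BY status"
).fetchall()

print(approved_last_month)
print(per_status)
```

The `?` placeholders keep the dates out of the SQL string, which is the safer habit once values come from users or other systems.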


8. Data Storage — Where Does All Data Live?

Data Lakes

Store raw data in its original form. Examples: Amazon S3, Google Cloud Storage (GCS), Azure Blob Storage

Think of this as a warehouse full of mixed boxes.


Data Warehouses

Store cleaned, structured data ready for analytics.

Examples:

  • Snowflake

  • BigQuery

  • Redshift

Think of this as a well-organized supermarket.


MLOps interacts with both.
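
The contrast can be imitated on a laptop. In this rough sketch a temporary folder plays the role of the lake and an in-memory SQLite table plays the role of the warehouse. Both stand-ins, the file name, and the sample events are assumptions for illustration only.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical raw events: extra fields and missing fields are both allowed.
raw_events = [
    {"order_id": 1, "restaurant": "A", "rating": 4.5, "debug": "x1"},
    {"order_id": 2, "restaurant": "B"},  # rating missing in the raw data
]

with tempfile.TemporaryDirectory() as lake_dir:
    # "Lake": keep events exactly as received, one JSON object per line.
    raw_path = Path(lake_dir) / "orders.jsonl"
    raw_path.write_text(
        "\n".join(json.dumps(e) for e in raw_events), encoding="utf-8"
    )

    # "Warehouse": a fixed schema that only accepts complete, clean rows.
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute(
        "CREATE TABLE ratings "
        "(order_id INTEGER PRIMARY KEY, restaurant TEXT NOT NULL, rating REAL NOT NULL)"
    )
    for line in raw_path.read_text(encoding="utf-8").splitlines():
        event = json.loads(line)
        if "rating" in event:  # skip rows that do not fit the schema
            warehouse.execute(
                "INSERT INTO ratings VALUES (?, ?, ?)",
                (event["order_id"], event["restaurant"], event["rating"]),
            )
    warehouse_rows = warehouse.execute("SELECT * FROM ratings").fetchall()

print(warehouse_rows)
```

The lake file keeps everything, including the `debug` field and the incomplete record. The warehouse table keeps only what fits its schema.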


9. Data Lineage & Versioning (Very Important for MLOps)

Companies need to answer:

  • Where did this data come from?

  • Which version was used to train this model?

  • Who changed it?

  • When?

  • Why?

This is called data lineage & governance.

It matters especially in:

  • banking

  • healthcare

  • insurance

  • finance

Because mistakes can cause legal risk. MLOps ensures tracking is automatic and auditable.
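
One simple way to answer "which version was used to train this model?" is to derive a version id from the data itself. The sketch below hashes the dataset content and stores the answers to the questions above next to it. The function names, the record fields, and the sample values (`claims_db`, `alice`, `bob`) are hypothetical; dedicated data versioning tools do this at much larger scale.

```python
import hashlib
import json


def dataset_version(rows):
    """Content hash: the same rows always give the same version id."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def lineage_record(rows, source, changed_by, reason, timestamp):
    """Capture where the data came from, who changed it, when, and why."""
    return {
        "version": dataset_version(rows),
        "source": source,
        "changed_by": changed_by,
        "changed_at": timestamp,
        "reason": reason,
        "row_count": len(rows),
    }


# Hypothetical insurance claims, before and after a new batch arrives.
claims_v1 = [{"claim_id": 1, "amount": 1200}, {"claim_id": 2, "amount": 300}]
claims_v2 = claims_v1 + [{"claim_id": 3, "amount": 950}]

record_v1 = lineage_record(
    claims_v1, "claims_db", "alice", "initial load", "2024-01-01T00:00:00Z"
)
record_v2 = lineage_record(
    claims_v2, "claims_db", "bob", "added January claims", "2024-02-01T00:00:00Z"
)

# Store the data version alongside the model so the link is auditable.
model_metadata = {"model": "risk_model", "trained_on": record_v2["version"]}

print(record_v1["version"], record_v2["version"])
print(model_metadata)
```

Because the id is computed from the content, any change to the data, even a single value, produces a different version.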


10. How Data Engineering and MLOps Work Together

Here’s the connection in simple form:

Step | Data Engineering Role | MLOps Role
Collect Data | Build pipelines | Validate & monitor
Store Data | Manage warehouses | Track versions
Prepare Data | Transform data | Ensure reproducibility
Train Model | Provide clean data | Track experiments
Deploy Model | Supply live data | Deploy & monitor models

So:

Data Engineering feeds the model. MLOps manages the model lifecycle. Both are essential.


11. A Simple Real-World Story

Imagine an insurance company predicting accident risk.

Data Engineers:

  • collect claim history

  • remove errors

  • build pipelines

  • store data safely

ML Engineers:

  • build the model

MLOps Engineers:

  • deploy

  • monitor

  • retrain

  • track versions

If any part fails → the system fails. That’s why modern ML needs strong collaboration.


12. Data Scientist vs ML Engineer vs MLOps Engineer (Simple Comparison)

Aspect | Data Scientist | ML Engineer | MLOps Engineer
Primary Focus | Discover insights & build models | Turn models into products | Run ML systems reliably
Main Responsibility | Create the ML model | Productionize the model | Automate & operate ML lifecycle
Think of Them As | Researcher + Analyst | Software Engineer for ML | DevOps for ML
Where They Work | Notebook / research environment | Application / backend services | Cloud, infra, pipelines
Input | Raw data | Trained ML model | Models, data, infra
Output | Trained & evaluated model | Scalable inference service | End-to-end ML platform

What They Actually Do (With Examples)

Role | Example (Food Delivery App)
Data Scientist | Builds a model to predict delivery time using past orders & traffic data
ML Engineer | Wraps the model in an API so the app can call it in real time
MLOps Engineer | Sets up pipelines to retrain the model weekly & monitor accuracy drift

Core Responsibilities Breakdown

AreaData ScientistML EngineerMLOps Engineer
Business understanding⚠️⚠️
Data exploration & cleaning⚠️ (automation)
Model selection & tuning
Model optimization
API development
Training pipelines
CI/CD for ML
Monitoring & drift detection
Cloud & Kubernetes⚠️

✅ = Primary responsibility

⚠️ = Limited involvement

❌ = Not responsible


13. Quick Recap of Day 3

Today you learned:

  • what Data Engineering is

  • why it is critical for MLOps

  • types of data

  • batch vs real-time pipelines

  • data quality importance

  • SQL basics

  • data lakes vs warehouses

  • lineage and governance

  • how Data Engineering and MLOps work together

