Day 3 — Data Engineering Basics for MLOps (Plain and Practical)

So far, you’ve learned:
Day 1 → What MLOps is and why it matters
Day 2 → The ML lifecycle from idea to production

Today, we go deeper into the foundation of every ML system: Data. Because in real-world ML systems:
Good data beats a good model.
Bad data destroys even the best model.
Think of data as fuel for a vehicle. If the fuel is dirty, the engine gets damaged.
Data Engineering is the process of collecting, storing, cleaning, transforming, and delivering data so ML systems can use it.
And MLOps heavily depends on strong Data Engineering, just like DevOps depends on strong infrastructure.
1. What Is Data Engineering? (Simple Definition)

Data Engineering is:
The process of building systems that move data from where it is generated to where it is needed — safely, reliably, and at scale.
Meaning:
Raw Data → Processing → Usable Data for ML
Just like a water treatment plant:
collects raw water
filters it
distributes clean water to homes
Data Engineering does the same for data.
2. Where Does Data Come From? (Real-World Examples)
Food Delivery App
app events
GPS signals
delivery times
restaurant ratings
Banking
transactions
balances
repayment history
loan and EMI details
Hospitals
patient records
lab results
prescriptions
E-commerce
product clicks
search terms
orders
Data may come from:
databases
APIs
logs
sensors
external data sources
MLOps ensures this data flows smoothly end-to-end.
3. Types of Data (Very Simple Explanation)
Structured Data
Like rows in Excel.
Examples:
bank statements
customer info
transaction tables
Stored in databases like:
MySQL
PostgreSQL
Snowflake
BigQuery
Unstructured Data
Not in row-column form.
Examples:
emails
chat messages
medical reports
images
audio
Semi-Structured Data
Partly organized: the data carries its own keys or tags, but does not follow a fixed table schema.
Examples:
JSON
XML
log files
Understanding data type matters because:
different tools handle different data
ML pipelines need the right formats
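To make the three categories concrete, here is a minimal Python sketch using only the standard library. The sample values and field names (`customer_id`, `balance`, `event`) are made up for illustration and do not come from a real system.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same shape (hypothetical sample).
structured_text = "customer_id,balance\n101,2500.50\n102,310.00\n"
rows = list(csv.DictReader(io.StringIO(structured_text)))

# Semi-structured: JSON has keys, but fields can be nested or missing.
semi_structured_text = '{"event": "order_placed", "user": {"id": 101}, "items": ["pizza"]}'
event = json.loads(semi_structured_text)

# Unstructured: free text with no schema; it needs extra processing
# (here, a very naive word split) before a model can use it.
unstructured_text = "Patient reports mild headache since Monday."
tokens = unstructured_text.lower().rstrip(".").split()

print(rows[0]["balance"])   # column access by name
print(event["user"]["id"])  # nested key access
print(tokens[:3])           # first few words
```

Notice that each type needs a different reader: `csv` for rows and columns, `json` for nested keys, and custom text handling for free text.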
4. Batch vs Real-Time Data (Important Concept)
This is a key difference in ML systems.
Batch Data
Processed at intervals:
every hour
every day
every week
Example: A bank recalculates each customer's credit score once a month.
Real-Time Data
Processed as soon as each event arrives.
Example: Fraud detection when you swipe your card.
MLOps systems must support both reliably.
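The difference can be sketched in a few lines of Python. This is only an illustration: the `transactions` list, the function names, and the 5000 threshold are assumptions, and a real fraud system would use a trained model and a streaming platform rather than a simple rule.

```python
from typing import Iterable, Iterator

# Hypothetical card transactions; amounts are illustrative only.
transactions = [
    {"card": "A", "amount": 40.0},
    {"card": "A", "amount": 9500.0},
    {"card": "B", "amount": 15.0},
]

def batch_total_per_card(records: Iterable[dict]) -> dict:
    """Batch style: collect all records first, then process them together."""
    totals: dict = {}
    for rec in records:
        totals[rec["card"]] = totals.get(rec["card"], 0.0) + rec["amount"]
    return totals

def stream_flag_suspicious(records: Iterable[dict], limit: float = 5000.0) -> Iterator[dict]:
    """Real-time style: look at each record as it arrives and react right away."""
    for rec in records:
        if rec["amount"] > limit:  # assumed toy rule, not a real fraud model
            yield rec

print(batch_total_per_card(transactions))
print(list(stream_flag_suspicious(transactions)))
```

The batch function needs the full set of records before it can answer, while the streaming function produces a result the moment a suspicious record shows up.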
5. What Do Data Engineers Actually Build?
They build data pipelines. A data pipeline is like a factory assembly line for data.
Flow looks like this:
Source System
↓
Ingestion
↓
Processing
↓
Storage
↓
Serving to ML
Example — Food Delivery ETA Model
Orders come in
↓
Data pipeline collects order time, distance, traffic
↓
Data is cleaned
↓
Stored in warehouse
↓
Used for model training & prediction
If the pipeline breaks:
model starves
predictions fail
business suffers
That’s why MLOps + Data Engineering must work together.
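The pipeline stages described in this section can be sketched as plain Python functions. Everything here is a simplified assumption: the `raw_orders` sample, the function names, and the dictionary standing in for a warehouse are hypothetical, and production pipelines usually run on an orchestrator instead of a single script.

```python
# Hypothetical raw order events; None marks a missing distance reading.
raw_orders = [
    {"order_id": 1, "distance_km": 3.2, "minutes": 24},
    {"order_id": 2, "distance_km": None, "minutes": 31},
    {"order_id": 3, "distance_km": 5.0, "minutes": 40},
]

def ingest(source):
    """Ingestion: copy records out of the source system."""
    return list(source)

def process(records):
    """Processing: drop records that cannot be used for training."""
    return [r for r in records if r["distance_km"] is not None]

def store(records, warehouse):
    """Storage: keep cleaned records, keyed by order id."""
    for r in records:
        warehouse[r["order_id"]] = r
    return warehouse

def serve(warehouse):
    """Serving: hand (feature, label) pairs to model training."""
    return [(r["distance_km"], r["minutes"]) for r in warehouse.values()]

warehouse = store(process(ingest(raw_orders)), {})
training_pairs = serve(warehouse)
print(training_pairs)
```

Each stage takes the output of the previous one, so a failure in any stage leaves the model with no fresh data.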
6. Data Quality — Why It Matters So Much
Bad data leads to bad predictions. This is called Garbage In, Garbage Out (GIGO).
Examples:
Bank: wrong salary → wrong loan approval
Hospital: missing symptoms → wrong diagnosis model
E-commerce: duplicate users → wrong recommendations
Food delivery: wrong distance values → wrong ETA
So we must constantly check:
missing values
incorrect values
outdated data
bias
anomalies
MLOps automates these checks in pipelines, just as DevOps automates software testing.
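An automated check can be as simple as a function that runs before training. The sketch below covers three of the checks (missing values, incorrect values, duplicates) on a toy loan dataset. The `check_quality` function, the field names, and the salary range are assumptions chosen for illustration.

```python
def check_quality(rows, max_salary=1_000_000):
    """Return a list of (row_index, problem) tuples for a toy loan dataset."""
    problems = []
    seen_ids = set()
    for i, row in enumerate(rows):
        if row.get("salary") is None:
            problems.append((i, "missing salary"))
        elif not 0 <= row["salary"] <= max_salary:
            problems.append((i, "salary out of range"))
        if row["customer_id"] in seen_ids:
            problems.append((i, "duplicate customer"))
        seen_ids.add(row["customer_id"])
    return problems

# Hypothetical applicants with one missing, one invalid, and one duplicate row.
applicants = [
    {"customer_id": 1, "salary": 52000},
    {"customer_id": 2, "salary": None},
    {"customer_id": 3, "salary": -10},
    {"customer_id": 1, "salary": 52000},
]

issues = check_quality(applicants)
print(issues)
# A pipeline step could stop here instead of training on bad data.
if issues:
    print(f"{len(issues)} data quality problems found")
```

In a pipeline, a non-empty result would typically fail the run, the same way a failing unit test blocks a software release.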
7. Intro to SQL — The Language of Data
SQL helps you:
retrieve data
filter data
join tables
aggregate results
For example:
Bank wants: “How many loans were approved last month?”
SQL answers that. Even ML engineers and MLOps engineers must know basic SQL, because it’s used everywhere.
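Here is how that bank question might look in practice, run from Python against an in-memory SQLite database. The `loans` table, its columns, and the sample rows are hypothetical. The reference date is fixed so the result is reproducible, whereas a real query would more likely use the current date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loans (id INTEGER PRIMARY KEY, status TEXT, decided_on TEXT)")
conn.executemany(
    "INSERT INTO loans (status, decided_on) VALUES (?, ?)",
    [
        ("approved", "2024-05-03"),
        ("rejected", "2024-05-10"),
        ("approved", "2024-05-28"),
        ("approved", "2024-06-02"),
    ],
)

# "Last month" is computed relative to a fixed reference date;
# in a live system this would often be date('now') instead.
reference_date = "2024-06-15"
query = """
    SELECT COUNT(*)
    FROM loans
    WHERE status = 'approved'
      AND decided_on >= date(?, 'start of month', '-1 month')
      AND decided_on <  date(?, 'start of month')
"""
approved_last_month = conn.execute(query, (reference_date, reference_date)).fetchone()[0]
print(approved_last_month)
```

The query filters (`WHERE`) and aggregates (`COUNT`) in one statement; joining other tables would follow the same pattern.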
8. Data Storage — Where Does All Data Live?
Data Lakes
Store raw data in its original form. Examples: Amazon S3, Google Cloud Storage (GCS), Azure Blob Storage
Think of this as a warehouse full of mixed boxes.
Data Warehouses
Store cleaned, structured data ready for analytics.
Examples:
Snowflake
BigQuery
Redshift
Think of this as a well-organized supermarket.
MLOps interacts with both.
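A rough local analogy may help: a folder of raw files plays the role of the lake, and a typed SQLite table plays the role of the warehouse. This is only a sketch under those assumptions. The file name `orders.jsonl`, the fields, and the cleaning rule are invented, and real lakes and warehouses are managed cloud services.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# Stand-in for a data lake: a folder of raw files kept exactly as received.
lake_dir = Path(tempfile.mkdtemp())
raw_events = [
    {"order_id": 1, "amount": "19.99", "note": "leave at door"},
    {"order_id": 2, "amount": "bad-value"},
]
(lake_dir / "orders.jsonl").write_text("\n".join(json.dumps(e) for e in raw_events))

# Stand-in for a data warehouse: a typed table holding only cleaned rows.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL NOT NULL)")

for line in (lake_dir / "orders.jsonl").read_text().splitlines():
    event = json.loads(line)
    try:
        amount = float(event["amount"])
    except ValueError:
        continue  # the raw record stays in the lake; it just is not loaded
    warehouse.execute("INSERT INTO orders VALUES (?, ?)", (event["order_id"], amount))

loaded = warehouse.execute("SELECT order_id, amount FROM orders").fetchall()
print(loaded)
```

The lake still holds both raw records, including the bad one, while the warehouse holds only the row that passed cleaning.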
9. Data Lineage & Versioning (Very Important for MLOps)
Companies need to answer:
Where did this data come from?
Which version was used to train this model?
Who changed it?
When?
Why?
This is called data lineage & governance.
It matters especially in:
banking
healthcare
insurance
finance
In these industries, mistakes can create legal risk. MLOps ensures tracking is automatic and auditable.
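One simple way to picture versioning is to derive a version id from the data itself and store it alongside the answers to the questions above. The sketch below is an illustration only: `dataset_version`, `lineage_record`, and the sample claims are hypothetical names and values, and dedicated data versioning tools handle this at scale.

```python
import hashlib
import json

def dataset_version(rows):
    """Derive a version id from the data itself: same rows give the same id."""
    canonical = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:12]

def lineage_record(rows, source, changed_by, reason, timestamp):
    """Capture where the data came from, which version, who, when, and why."""
    return {
        "source": source,
        "version": dataset_version(rows),
        "changed_by": changed_by,
        "changed_at": timestamp,
        "reason": reason,
    }

claims_v1 = [{"claim_id": 1, "amount": 1200}]
claims_v2 = [{"claim_id": 1, "amount": 1250}]  # a corrected amount

rec1 = lineage_record(claims_v1, "claims_db", "etl-job", "initial load", "2024-01-01T00:00:00Z")
rec2 = lineage_record(claims_v2, "claims_db", "analyst", "fixed amount", "2024-01-02T00:00:00Z")
print(rec1["version"], rec2["version"])
```

Because the id changes whenever the data changes, storing it with a trained model makes it possible to say exactly which data that model saw.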
10. How Data Engineering and MLOps Work Together
Here’s the connection in simple form:
| Step | Data Engineering Role | MLOps Role |
| Collect Data | Build pipelines | Validate & monitor |
| Store Data | Manage warehouses | Track versions |
| Prepare Data | Transform data | Ensure reproducibility |
| Train Model | Provide clean data | Track experiments |
| Deploy Model | Supply live data | Deploy & monitor models |
So:
Data Engineering feeds the model. MLOps manages the model lifecycle. Both are essential.
11. A Simple Real-World Story
Imagine an insurance company predicting accident risk.
Data Engineers:
collect claim history
remove errors
build pipelines
store data safely
ML Engineers:
build the model
MLOps Engineers:
deploy
monitor
retrain
track versions
If any part fails → the system fails. That’s why modern ML needs strong collaboration.
12. Data Scientist vs ML Engineer vs MLOps Engineer (Simple Comparison)
| Aspect | Data Scientist | ML Engineer | MLOps Engineer |
| Primary Focus | Discover insights & build models | Turn models into products | Run ML systems reliably |
| Main Responsibility | Create the ML model | Productionize the model | Automate & operate ML lifecycle |
| Think of Them As | Researcher + Analyst | Software Engineer for ML | DevOps for ML |
| Where They Work | Notebook / research environment | Application / backend services | Cloud, infra, pipelines |
| Input | Raw data | Trained ML model | Models, data, infra |
| Output | Trained & evaluated model | Scalable inference service | End-to-end ML platform |
What They Actually Do (With Examples)
| Role | Example (Food Delivery App) |
| Data Scientist | Builds a model to predict delivery time using past orders & traffic data |
| ML Engineer | Wraps the model in an API so the app can call it in real time |
| MLOps Engineer | Sets up pipelines to retrain the model weekly & monitor accuracy drift |
Core Responsibilities Breakdown
| Area | Data Scientist | ML Engineer | MLOps Engineer |
| Business understanding | ✅ | ⚠️ | ⚠️ |
| Data exploration & cleaning | ✅ | ❌ | ⚠️ (automation) |
| Model selection & tuning | ✅ | ❌ | ❌ |
| Model optimization | ❌ | ✅ | ❌ |
| API development | ❌ | ✅ | ❌ |
| Training pipelines | ❌ | ❌ | ✅ |
| CI/CD for ML | ❌ | ❌ | ✅ |
| Monitoring & drift detection | ❌ | ❌ | ✅ |
| Cloud & Kubernetes | ❌ | ⚠️ | ✅ |
✅ = Primary responsibility
⚠️ = Limited involvement
❌ = Not responsible
13. Quick Recap of Day 3
Today you learned:
what Data Engineering is
why it is critical for MLOps
types of data
batch vs real-time pipelines
data quality importance
SQL basics
data lakes vs warehouses
lineage and governance
how Data Engineering and MLOps work together




