Big Data Project
Apr 2026
Distributed data processing and machine learning pipeline using Apache Spark, Docker, and Python.
The pipeline covers data ingestion, unification, preprocessing, and feature engineering, and it runs at scale on a containerized Spark cluster.
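The stages named above could be chained roughly as in the sketch below. It is a minimal illustration under stated assumptions, not the project's actual code: the column names (`id`, `amount`, `category`), the CSV input format, and every function name are hypothetical, since the real schema and sources are not described here.

```python
from functools import reduce

# Hypothetical column names; the project's real schema is not stated.
KEY_COL, NUMERIC_COL, CATEGORY_COL = "id", "amount", "category"


def ingest(spark, paths):
    """Read each CSV source into its own DataFrame (CSV is an assumption)."""
    return [spark.read.csv(p, header=True, inferSchema=True) for p in paths]


def unify(frames):
    """Union sources by column name, filling columns a source lacks with nulls."""
    return reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), frames)


def preprocess(df):
    """Drop duplicate rows and rows without a key, then fill missing amounts."""
    return df.dropDuplicates().na.drop(subset=[KEY_COL]).na.fill({NUMERIC_COL: 0.0})


def engineer_features(df):
    """Index the categorical column and assemble a single feature vector."""
    # Imported here so the module still loads where PySpark is not installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler

    indexer = StringIndexer(
        inputCol=CATEGORY_COL, outputCol="category_idx", handleInvalid="keep"
    )
    assembler = VectorAssembler(
        inputCols=[NUMERIC_COL, "category_idx"], outputCol="features"
    )
    return Pipeline(stages=[indexer, assembler]).fit(df).transform(df)


# Stages applied in order after ingestion.
STAGES = [unify, preprocess, engineer_features]


def run(spark, paths):
    """Ingest the sources, then pass the data through each stage in turn."""
    return reduce(lambda data, stage: stage(data), STAGES, ingest(spark, paths))
```

Given a `SparkSession` named `spark`, a call such as `run(spark, ["data/source_a.csv", "data/source_b.csv"])` (placeholder paths) would return a DataFrame with a `features` column ready for a Spark ML estimator. Note that `unionByName(..., allowMissingColumns=True)` requires Spark 3.1 or later.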
The workflow complements my graduate work in Big Data and Cloud by emphasizing scalable analytics and reproducible data pipelines.