Big Data Project

Apr 2026

Distributed data processing and machine learning pipeline using Apache Spark, Docker, and Python.

This project covers data ingestion, unification, preprocessing, and feature engineering for machine learning, with each stage executed at scale across a containerized Spark cluster.
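
To make the stage order concrete, here is a minimal plain-Python sketch of unification, preprocessing, and feature engineering on a few in-memory records. It is an illustration only and not the project's code: the two sources, the column names (`id`, `record_id`, `temp_c`, `temperature`, `temp_f`), and the function names are all hypothetical. In the Spark pipeline, each function would presumably correspond to a DataFrame transformation run across the cluster.

```python
# Hypothetical raw records "ingested" from two sources with different schemas.
SOURCE_A = [{"id": "1", "temp_c": "21.5"}, {"id": "2", "temp_c": ""}]
SOURCE_B = [{"record_id": "3", "temperature": "19.0"}]


def unify(source_a, source_b):
    """Map both sources onto one shared schema: id, temp_c."""
    rows = [{"id": r["id"], "temp_c": r["temp_c"]} for r in source_a]
    rows += [{"id": r["record_id"], "temp_c": r["temperature"]} for r in source_b]
    return rows


def preprocess(rows):
    """Drop rows with a missing reading and cast string fields to numbers."""
    return [
        {"id": int(r["id"]), "temp_c": float(r["temp_c"])}
        for r in rows
        if r["temp_c"]  # an empty string counts as a missing value here
    ]


def add_features(rows):
    """Derive a new column (Fahrenheit) from an existing one (Celsius)."""
    return [{**r, "temp_f": r["temp_c"] * 9 / 5 + 32} for r in rows]


# Run the stages in order: unify -> preprocess -> feature engineering.
features = add_features(preprocess(unify(SOURCE_A, SOURCE_B)))
for row in features:
    print(row)
```

Keeping each stage as a separate function that takes rows and returns rows makes the steps easy to test in isolation and to reorder or replace, which is the same property a reproducible Spark pipeline relies on.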

The workflow complements my graduate work in Big Data and Cloud by emphasizing scalable analytics and reproducible data pipelines.