Case Study: Predicting Flight Delays

Case study showing how data can be acquired, aggregated, and enriched in order to predict flight delays. Details can be found on the Apache Spark for Data Analytics Course Overview page.

Demo (EDA): Show how data can be summarized, visualized, and analyzed using Spark DataFrames and Python visualization tools (Seaborn).
Demo (ML/Classification): Show how to generate a model from the raw flights data (flights data from 2018).
Exercise (ETL): Transform raw data (available as CSV) into a semi-structured format (JSON) for consumption, using Spark SQL and the Spark DataFrame API, to produce a dataset covering 2012 and 2013.
Exercise (ETL): Enrich (join) the JSON data with weather data (available in Avro format).
Exercise (ML/Classification): Generate a second machine learning classification model that uses the weather data, and compare its performance to that of the first model.
Exercise (ETL): Summarize raw data (consumed as JSON) by airport as well as hour/day (two separate datasets) to include total number of flights, number of flights on time, and number of flights delayed. Combine with weather data.
Exercise (ML/Regression): Generate a machine learning model predicting the number of delayed flights per day (and per hour), combining all features.
Demo (Streaming): Show how a model can be deployed as part of a streaming application. Use log-replay to demonstrate the function and behavior of a streaming application.
Exercise (Streaming/ETL): Build a custom Kafka consumer/producer to enrich flight data with weather conditions by consuming a microservice.
Exercise (Streaming/ML): Deploy the aggregated model (regression) for predicting the number of delayed flights for each day.
Demo (GraphX): Demonstrate how to perform graph analysis and incorporate distance-based considerations into the understanding of which factors contribute to flight delays.
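
The summarization exercise above (counting total, on-time, and delayed flights per airport and day, then combining the result with weather data) can be sketched in plain Python to make the intended logic concrete. This is only an illustrative stand-in, not the course solution: in the exercise itself the same grouping and join would be done with Spark DataFrames. The field names (airport, date, dep_delay_min, precip_mm), the sample records, and the 15-minute delay threshold are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical sample records; field names and values are assumptions.
flights = [
    {"airport": "ORD", "date": "2013-01-05", "dep_delay_min": 0},
    {"airport": "ORD", "date": "2013-01-05", "dep_delay_min": 42},
    {"airport": "ORD", "date": "2013-01-06", "dep_delay_min": 5},
    {"airport": "SFO", "date": "2013-01-05", "dep_delay_min": 20},
]

# Hypothetical weather observations keyed by (airport, date).
weather = {
    ("ORD", "2013-01-05"): {"precip_mm": 3.2},
    ("SFO", "2013-01-05"): {"precip_mm": 0.0},
}

DELAY_THRESHOLD_MIN = 15  # assumed cutoff for counting a flight as delayed


def summarize_by_airport_day(records, threshold=DELAY_THRESHOLD_MIN):
    """Count total, on-time, and delayed flights per (airport, date)."""
    summary = defaultdict(lambda: {"total": 0, "on_time": 0, "delayed": 0})
    for rec in records:
        row = summary[(rec["airport"], rec["date"])]
        row["total"] += 1
        if rec["dep_delay_min"] >= threshold:
            row["delayed"] += 1
        else:
            row["on_time"] += 1
    return dict(summary)


def join_weather(summary, weather_by_key):
    """Left join: keep every summary row, attach weather when available."""
    return {
        key: {**counts, "weather": weather_by_key.get(key)}
        for key, counts in summary.items()
    }


daily = join_weather(summarize_by_airport_day(flights), weather)
for key in sorted(daily):
    print(key, daily[key])
```

The left join keeps airport/day rows that have no matching weather observation (their weather value is None), so missing weather data stays visible as a gap and no flights are silently dropped. The hourly dataset would follow the same pattern with an hour field added to the grouping key.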