You are given nine years of individual EPA data in CSV format. The data files are not very large (each file is approximately 1 MB). Each yearly file contains thousands of vehicles along with their vital information and pollution testing records. Each file contains 42 columns, which are described in the Data Dictionary document. Please note that the original data had more columns, some of which were removed for consistency. The deleted columns still appear in the data dictionary; ignore them when referring to it.
There are three sections to this case study: Merging and Cleaning (20 points), Data Analysis (60 points), and Visualization (20 points), totaling 100 points.
Important Note: Keep the three time periods separate throughout the following analysis, i.e., perform each step separately for each time period.
Merging and Cleaning (20 Points)
The first objective is to combine the yearly files and stack them into three large files, one for each time period. Run basic EDA and descriptive statistics on selected columns and remove any obvious outliers from each time period. Make sure that no more than 1% of the data is removed from each time period in this process. Clearly document the details of your outlier detection and descriptive analysis.
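For illustration only, a minimal sketch of one way to stack the files and cap outlier removal at 1% is shown below. The file names, the year-to-period mapping, and the column used for the outlier rule are assumptions made for the example, not part of the assignment; use the actual file names and the columns you choose from the data dictionary.

```python
import pandas as pd

# Assumed year-to-period mapping and file naming; replace with the real ones.
periods = {
    "period_1": [2012, 2013, 2014],
    "period_2": [2015, 2016, 2017],
    "period_3": [2018, 2019, 2020],
}

merged = {}
for name, years in periods.items():
    # Stack the yearly files for this period into one DataFrame.
    frames = [pd.read_csv(f"epa_{year}.csv") for year in years]
    df = pd.concat(frames, ignore_index=True)

    # Flag outliers with a simple z-score rule on one numeric column
    # (hypothetical column name), keeping the cut under the 1% budget.
    col = "test_result"  # placeholder; pick columns from the data dictionary
    z = (df[col] - df[col].mean()) / df[col].std()
    outliers = z.abs() > 4          # conservative starting threshold
    if outliers.mean() > 0.01:      # loosen the rule if it removes >1% of rows
        outliers = z.abs() > 5
    merged[name] = df[~outliers]
```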
Analysis (60 Points)
This section is further broken down into two parts:
Part A: (30 points)
The datasets contain several numeric columns. Use the dimension-reduction tools learned during the course to condense these columns into a smaller number of dimensions for each time period separately.
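As a sketch only, one possible reduction using PCA is shown below (assuming PCA is among the techniques covered); the list of numeric columns and the 90% variance target are illustrative assumptions.

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def reduce_dimensions(df, numeric_cols, variance_target=0.90):
    """Standardize the chosen numeric columns and keep enough principal
    components to explain the chosen share of variance."""
    X = StandardScaler().fit_transform(df[numeric_cols])
    pca = PCA(n_components=variance_target)
    scores = pca.fit_transform(X)
    return scores, pca

# Run separately for each time period, e.g. (using the merged dict from the
# earlier sketch and a hypothetical numeric_cols list):
# scores_1, pca_1 = reduce_dimensions(merged["period_1"], numeric_cols)
```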
Use the reduced dimensions to group similar vehicles. Keep the number of groups between 5 and 8 for each time period. Clearly define each group based on its characteristics by running descriptive analytics on it. Then compare the groups across the three time periods and point out any vehicles that jumped from one group to another over time. Also explain, in your own words, what such a jump means.
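For illustration, a minimal sketch of the grouping step follows, assuming k-means clustering with k chosen from the required 5 to 8 range by silhouette score; the vehicle ID column used to track jumps is a hypothetical name.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_vehicles(scores, k_range=range(5, 9)):
    """Fit k-means for each allowed k (5 through 8) and keep the labeling
    with the best silhouette score."""
    best_k, best_labels, best_score = None, None, -1.0
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(scores)
        score = silhouette_score(scores, labels)
        if score > best_score:
            best_k, best_labels, best_score = k, labels, score
    return best_k, best_labels

# One way to spot vehicles that changed groups between periods
# (hypothetical "vehicle_id" column; labels_1/labels_2 come from the call above):
# p1 = merged["period_1"].assign(group=labels_1)[["vehicle_id", "group"]]
# p2 = merged["period_2"].assign(group=labels_2)[["vehicle_id", "group"]]
# moves = p1.merge(p2, on="vehicle_id", suffixes=("_p1", "_p2"))
# jumped = moves[moves["group_p1"] != moves["group_p2"]]
```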