Seattle Car Accident Project (IBM Applied Data Science)

José Manuel García Portillo
Published in The Startup · 11 min read · Sep 27, 2020


Introduction

An estimated 1.2 million people are killed in road crashes each year, and as many as 50 million are injured.

Road traffic injuries are predictable and preventable, and the study of data is of utmost importance: understanding the factors around which these tragedies occur gives us the insight needed to prevent them.

Data

Using the Database of the SDOT Traffic Management Division of Seattle, the main objectives of this project are to build a Classification model that can predict which of these two types of accident (property damage or injury) is likely to occur, and to get a better understanding of the key factors involved in road crashes.

The Database consists of 194,673 rows and 38 columns/features, one of which is the dependent variable (“SEVERITYCODE”) and thus the one to predict. Its values are either 1 (property damage) or 2 (injury).

Overview of the Database

Methodology

Missing values

For starters, it is important to check the missing values in the Database. That will be the basis for the next steps.

Missing values in our Database

In a Database of roughly 195,000 rows, columns with almost 110,000 missing values represent a massive lack of data. For this reason, it is a good idea to drop the six features with the highest number of null (NaN) values, as shown in the image above.

Together with these, there are also some features that don’t bring much value to the deployment of the model, such as “REPORTNO”, so they will also be dropped. To decide which features would not be meaningful for building the model, I used the official Attribute Information provided together with the Database.
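As a rough sketch of this step (assuming the Database has been loaded into a pandas DataFrame called df, and that the file is named Data-Collisions.csv), the missing-value count and the column drop could look like this:

```python
import pandas as pd

# Load the collision data (the file name is an assumption)
df = pd.read_csv("Data-Collisions.csv")

# Count the missing values per column, most affected columns first
missing = df.isnull().sum().sort_values(ascending=False)
print(missing.head(10))

# Drop the six columns with the most NaNs, plus identifiers with no
# predictive value such as the report number
cols_to_drop = missing.head(6).index.tolist() + ["REPORTNO"]
df = df.drop(columns=cols_to_drop)
```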

It is now time to take a look at the Categorical Features. Interestingly enough, there is a date mixed in that Python has not recognised as a date datatype. Thus, the Feature Engineering begins:

Some of the Categorical Features of the Database

Feature engineering

I am especially interested in the time of day when the accidents occurred. Since it is only a part of the whole date, Regular Expressions will be useful to match the desired pattern and store it in a brand new column in our database.

Pattern Matching with Regular Expressions
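A minimal sketch of that pattern matching, with the raw date/time column assumed to be called “INCDTTM” as in the original dataset, could be:

```python
# Extract the "HH:MM:SS AM/PM" portion of the raw date/time string into a
# new column; rows that only contain a date simply yield NaN here
df["Time"] = df["INCDTTM"].str.extract(r"(\d{1,2}:\d{2}:\d{2}\s?[AP]M)", expand=False)
```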

Now, onto the main issue: even with missing values present, parsing is necessary for Python to recognise the values as a proper date datatype. The goal of this whole procedure is to extract the hour (as an int) and to create time frames (parts of the day), which is done in the following steps:

Parsing and Time Frames
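Sketched out, with the exact time-frame boundaries being my own assumption rather than the ones used in the kernel, the parsing could look like this:

```python
import pandas as pd

# Parse the raw strings into datetimes; unparseable or missing values
# become NaT instead of raising an error
df["INCDTTM"] = pd.to_datetime(df["INCDTTM"], errors="coerce")

# Extract the hour as a (nullable) integer
df["Hour"] = df["INCDTTM"].dt.hour.astype("Int64")

def to_frame(hour):
    """Map an hour of the day to a coarse time frame (boundaries assumed)."""
    if pd.isna(hour):
        return pd.NA
    if 6 <= hour < 12:
        return "Morning"
    if 12 <= hour < 18:
        return "Afternoon"
    if 18 <= hour < 22:
        return "Evening"
    return "Night"

df["Frame"] = df["Hour"].apply(to_frame)
```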

But what about the NaN values? Since this database has multiple columns with interdependent information, it is possible to use another column to deduce the missing values.

In this case, the feature “LIGHTCOND” will be used as the reference column to fill the missing values of the “Frame” feature we just created. Thus, the goal is to deduce in which time frame of the day the accident happened by using the information on the light conditions when each accident occurred.

Since it is difficult to infer anything if both columns are missing in the same row, such rows will be dropped.

That aside, in order to be as exact as possible, “groupby” will be the tool used in this part of the kernel. This way, it is possible to see which values are most strongly associated with each other, based on how often they appear in the same row.

For example, we can see from the image below that when the reference column “LIGHTCOND” has a value of “Dark — No Street Lights”, it is relatively safe to conclude that the missing value of the column “Frame” should be “Night”.

Understanding the relationship between columns using GroupBy
Assigning new values making use of interconnected columns

Just as the values of the column “Frame” were deduced from the values of the reference column “LIGHTCOND”, the same is done the other way around. This way, we can fill both columns’ missing values.
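In the kernel the mapping is assigned by inspecting the GroupBy counts; the sketch below automates the same idea by taking, for each value of one column, the most frequent value of the other:

```python
import pandas as pd

# Rows where both columns are missing cannot be inferred, so drop them
df = df.dropna(subset=["LIGHTCOND", "Frame"], how="all")

# For each light condition, the time frame it most often co-occurs with,
# and vice versa
frame_by_light = df.groupby("LIGHTCOND")["Frame"].agg(
    lambda s: s.mode().iloc[0] if not s.mode().empty else pd.NA
)
light_by_frame = df.groupby("Frame")["LIGHTCOND"].agg(
    lambda s: s.mode().iloc[0] if not s.mode().empty else pd.NA
)

# Fill each column's gaps using the most frequent partner value of the other
df["Frame"] = df["Frame"].fillna(df["LIGHTCOND"].map(frame_by_light))
df["LIGHTCOND"] = df["LIGHTCOND"].fillna(df["Frame"].map(light_by_frame))
```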

I believe this process has really enriched the Database, which had many columns with missing data. Indeed, meaningful connections between columns could be created thanks to this series of steps. But will it create a high correlation between them? Let’s explore that later.

The same procedure is followed for each column with missing data, by creating pairs of columns and filling the gaps left by the missing values (see Road Condition & Weather / Junction Type & Place of Accident / Collision Description & Collision Code).

And just like that, at last, the time has come to work with the X and Y coordinates.

What I want to achieve by working with these two features is to group the accidents into geographical clusters and add that information as a new feature in the database.

Folium is a good tool to visualise the accidents grouped by location in an interactive way. The map can be zoomed in or out to check bigger clusters of accidents or even the exact place where each accident occurred:

Setting up the map visualisation using Folium
Clusters of accidents in Seattle from our Database
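A minimal version of that map, plotting a sample of the accidents (the sample size is arbitrary, just to keep the map responsive) with Folium’s MarkerCluster plugin:

```python
import folium
from folium.plugins import MarkerCluster

# Centre the map roughly on Seattle
seattle_map = folium.Map(location=[47.6062, -122.3321], zoom_start=11)
cluster = MarkerCluster().add_to(seattle_map)

# "Y" is the latitude and "X" the longitude in this dataset
for _, row in df[["Y", "X"]].dropna().sample(5000, random_state=0).iterrows():
    folium.Marker(location=[row["Y"], row["X"]]).add_to(cluster)

seattle_map.save("seattle_accidents.html")
```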

To do the actual clustering in the database, K-Means will be used. To reduce the workload, it is advisable to create another dataset by slicing the main database and taking only the columns needed.

The dataset created in this case is called “X” and has the coordinates X and Y and the column “OBJECTID”, which will be useful to connect this dataset to the main database once the clustering is finished.

New “X” dataset is created from our database

Before fitting K-Means into the “X” dataset, it is necessary to check the optimal number of clusters by using the Elbow Method.

Following this step, I ended up choosing 4 clusters and fitting the model to our dataset. Once the clustering is done and each pair of coordinates is assigned to a cluster (0, 1, 2 or 3), it is time to add the column “Clusters” to the main database.

Finally, the columns used for the clustering are dropped since they won’t be needed any further.

Use of K-Means in our “X” dataset
Merging the “X” dataset and our main database using “OBJECTID” feature
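Put together, the Elbow Method, the clustering and the merge back into the main database could be sketched as follows (the range of k values tried is an assumption):

```python
from sklearn.cluster import KMeans

# Slice the coordinates plus "OBJECTID" (needed to merge back later)
X = df[["OBJECTID", "X", "Y"]].dropna().copy()

# Elbow Method: compute the inertia for k = 1..10; plotting these values
# against k reveals the "elbow"
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[["X", "Y"]]).inertia_
    for k in range(1, 11)
]

# Fit the final model with the chosen number of clusters (4 in this project)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
X["Clusters"] = kmeans.fit_predict(X[["X", "Y"]])

# Attach the cluster labels to the main database and drop the raw coordinates
df = df.merge(X[["OBJECTID", "Clusters"]], on="OBJECTID", how="left")
df = df.drop(columns=["X", "Y"])
```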

Categorical Variable Encoding

In the database, both numerical and categorical variables coexist. But in order to carry on with the following steps, the categorical values need to be encoded as numerical data. Label Encoding will be helpful for this endeavour.

Categorical Variable Encoding
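As a sketch, every remaining object-dtype column can be encoded with scikit-learn’s LabelEncoder (missing values are cast to the string "nan" here, which is a simplification):

```python
from sklearn.preprocessing import LabelEncoder

# Encode every remaining categorical (object-dtype) column as integers
le = LabelEncoder()
for col in df.select_dtypes(include="object").columns:
    df[col] = le.fit_transform(df[col].astype(str))
```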

Correlations

The features in the database are correlated both with each other and with the dependent variable (“SEVERITYCODE”). The problem comes when they are highly correlated.

Features with high correlation are more linearly dependent and hence have almost the same effect on the dependent variable. So, when two features are highly correlated, we can drop one of them.

First, the correlation between the independent features and the dependent variable is checked. The features highly correlated with the dependent variable are dropped (in this case, “SEVERITYDESC” and “SEVERITYCODE.1”), since they essentially encode the target itself and would leak it into the model.

Correlation chart with our dependent variable as the reference

After that, we check how the independent variables are correlated between each other using the Correlation Matrix:

Correlation Matrix for our independent features
Visualisation of the Correlation Matrix

The features “ADDRTYPE” and “JUNCTIONTYPE” are highly correlated (0.92), so it is advisable to drop one of them. In this case, “JUNCTIONTYPE” is the one chosen, since it had more missing values to begin with.
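The whole correlation step could be sketched like this, assuming all columns are already numeric after the encoding:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation of every feature with the dependent variable
print(df.corr()["SEVERITYCODE"].sort_values(ascending=False))

# Drop the target duplicates, then inspect how the independent features
# correlate with each other
df = df.drop(columns=["SEVERITYDESC", "SEVERITYCODE.1"], errors="ignore")
corr = df.drop(columns=["SEVERITYCODE"]).corr()
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.show()

# "ADDRTYPE" and "JUNCTIONTYPE" correlate at ~0.92, so one of them goes
df = df.drop(columns=["JUNCTIONTYPE"])
```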

Feature Selection

Another important step in building a meaningful model is to select only the most meaningful features to feed it. Remember: garbage in, garbage out. That is why this step is so important and why I decided to select the top 10 most relevant features in the database.

What’s more, in an attempt to compare different selection models, two different Feature Selection methods will be applied for this project: Chi-Squared and Feature Importance.

Top 10 Features by importance (Chi-Squared Method)
Top 10 Features by importance (Feature Importance Method)

From the images above, it is noticeable that the features chosen by each method are different.

For this reason, we will build two different modelling processes, one for each Feature Selection method. At the end, we will compare the results and see which one achieves better accuracy in its predictions.
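Both selection methods can be sketched as follows; the tree ensemble used for the Feature Importance ranking is my assumption, since the article does not say which estimator backs it:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import ExtraTreesClassifier

y = df["SEVERITYCODE"]
X_all = df.drop(columns=["SEVERITYCODE"])

# Chi-Squared: score each feature against the target and keep the top 10
# (chi2 needs non-negative inputs, which the label-encoded columns satisfy)
selector = SelectKBest(score_func=chi2, k=10).fit(X_all, y)
chi2_top10 = pd.Series(selector.scores_, index=X_all.columns).nlargest(10)

# Feature Importance: rank the features with a tree ensemble (assumed here
# to be ExtraTrees) and keep its 10 highest importances
forest = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X_all, y)
importance_top10 = pd.Series(forest.feature_importances_, index=X_all.columns).nlargest(10)
```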

Undersampling

It is not wise to jump straight into the modelling. As I explained in the beginning, the dependent variable (“SEVERITYCODE”) has two values (1 = “property damage” / 2 = “injury”). The issue here is that there is no balance between the number of rows that have one value and those that have the other.

In fact, the database has around three times more samples/rows where the dependent variable is equal to 1 than when it is equal to 2.

This bias in the training dataset can influence many machine learning algorithms, leading some to ignore the minority class entirely.

Thus, the sampling needs to be adjusted so that both values of the dependent variable appear the same number of times. This will be done during the training of the model.
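A simple way to do that undersampling with pandas (applied, in practice, to the training portion of the data only) is shown below:

```python
import pandas as pd

# Downsample the majority class (SEVERITYCODE == 1) to the size of the
# minority class (SEVERITYCODE == 2), then shuffle the result
minority = df[df["SEVERITYCODE"] == 2]
majority = df[df["SEVERITYCODE"] == 1].sample(n=len(minority), random_state=0)
df_balanced = pd.concat([majority, minority]).sample(frac=1, random_state=0)
```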

Modelling

It is finally time to start the first modelling. The features selected by the Chi-Squared method are the ones that will be used in this first modelling.

For starters, the database is split into a Training and a Test Set in an 8:2 ratio.

Splitting into Training and Test Set (Chi-Squared version)

After this, Feature Scaling is applied to put the independent features on a comparable scale. It is really important to always do it AFTER splitting the database, to avoid data leakage into the Machine Learning model.

Applying Feature Scaling to X_train and X_test
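A sketch of both steps, with chi2_features standing for the 10 columns chosen by the Chi-Squared method sketched earlier:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# The 10 columns picked by the Chi-Squared step sketched earlier
chi2_features = chi2_top10.index.tolist()

# 80/20 split of the database
X_train, X_test, y_train, y_test = train_test_split(
    df[chi2_features], df["SEVERITYCODE"], test_size=0.2, random_state=0
)

# Fit the scaler on the Training Set only, then reuse it on the Test Set,
# so no information from the Test Set leaks into the model
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```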

Now it is time to use a few interesting Classification Models and compare their individual performance. Also, in order to have a solid grasp of their accuracy, I will use Cross-Validation and rank each of them on a customised chart. All of this will be carried out on the Training Set.

Classification models chosen for this project

It is important to note that I will be using Stratified K-Fold Cross-Validation instead of the usual K-Fold Cross-Validation, since we are dealing with an unbalanced database, as mentioned before.

The idea behind this is to train the model on a balanced Training Set, so that it learns from the same number of samples where the dependent variable equals 1 and where it equals 2.

Once trained on the balanced Training Set, we will make the predictions on the unbalanced Test Set, which is the real one, and check how accurate the trained model is. The reason we cannot test it on a balanced Test Set is that the results would not be adjusted to the real distribution of the database.

Stratified Cross-Validation and chart of each ML model
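The evaluation loop could be sketched like this; the exact list of classifiers is an assumption, the pattern (Stratified K-Fold plus cross_val_score) is the point. In the kernel, the balanced Training Set described above is what feeds these runs:

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Candidate models (assumed selection)
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

# Stratified folds keep the class proportions intact in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")
    print(f"{name}: {scores.mean():.4f} (+/- {scores.std():.4f})")
```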

The results can be further improved. To that end, the next step will be to optimise their hyper-parameters.

Once the hyper-parameters of each model are adjusted, it is time to train the models again (using Stratified K-Fold Cross-Validation) and check their performance on the Training Set one more time.

Example of Hyper-parameter Tuning with the Random Forest model
Stratified K-Fold Cross Validation with the optimised models
Accuracy with Tuned Parameters for each model
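A Grid Search over a hypothetical parameter grid (the actual values tried in the kernel may differ) would look like this for the Random Forest:

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

# Hypothetical search grid
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 5],
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
    n_jobs=-1,
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```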

It is time to wrap up the first modelling. Once the training is finished and the models have learned from the training data, all that is left is to check how each model performs (in terms of accuracy) on the unbalanced Test Set (which, as said before, represents the reality of the database). This will be the final hurdle for the models.

Predictions on the Test Set (Chi-Squared version)
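Evaluating the tuned Random Forest from the grid search sketch above on the untouched Test Set is then a couple of lines per model:

```python
from sklearn.metrics import accuracy_score

# Score the tuned Random Forest on the unbalanced Test Set
best_rf = grid.best_estimator_
y_pred = best_rf.predict(X_test)
print("Random Forest test accuracy:", accuracy_score(y_test, y_pred))
```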

The results seem to be fairly consistent, with slightly better accuracy on the Test Set than on the Training Set, which shows the good performance of the trained models; especially the Random Forest model, with 73.93% accuracy on the Test Set.

Now it is time to move on to the second modelling, but since the modelling based on the features selected by the Feature Importance method follows the same steps, I will skip it and move on to the final showdown: the comparison between the accuracy results of both modelling versions on the Test Set.

At last, the results of each modelling are concatenated and we finally get closure for our modelling competition:

Comparison between the two versions of our modelling based on each Feature Selection

Results

The results of the comparison between both modelling versions show that the Random Forest Classification Model has given the best results, with 73.93% accuracy in its predictions on the Test Set.

Now, with this trained model and given a dataset with the features selected by the Chi-Squared method, it would be possible to predict, with that level of accuracy, whether an accident will end up in property damage or injury.

But let’s not forget that another objective of this project was to get a better understanding of the importance of the key factors involved.

The model with the best results (Random Forest) together with the features selected by the Chi-squared method will be the chosen combination to achieve this.

Importance of each feature for the Random Forest model
Visualisation of the importance of each feature for the model
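As a quick sketch, the importances of the trained Random Forest can be pulled out and plotted directly (chi2_features being the assumed list of selected column names):

```python
import pandas as pd
import matplotlib.pyplot as plt

# One importance value per selected feature, highest at the top
importances = pd.Series(best_rf.feature_importances_, index=chi2_features)
importances.sort_values().plot.barh()
plt.title("Random Forest feature importances")
plt.tight_layout()
plt.show()
```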

Conclusion

With this project, we have built a model that can predict with 73.93% accuracy whether an accident will end only in property damage (type 1) or in an injury (type 2), given a set of features. This could help allocate resources and aid the decision-making of the emergency services when a traffic crash occurs.

Especially interesting is the information regarding the features that are critical when it comes to predicting which type of accident will occur. A better design of the traffic infrastructure, by reducing the number of parked cars, better accommodating bicycles and pedestrians, or improving the signposting of crosswalks, could significantly reduce the severity of road accidents.

Finally, it is quite surprising that factors like the weather or the light conditions have not played as big a role in predicting the outcome of the accidents. Perhaps these features would have been more useful if the goal had been to predict whether an accident occurs at all, instead of assuming that the accident has already happened.

For more information on the Kernel of this project please visit: https://github.com/josem-gp
