Source materials

The links below lead to the Jupyter notebook and original datasets used to generate the findings on this page:

Raw data EDA, data cleansing, and feature engineering

To view the preliminary EDA findings and summaries of the feature engineering activities that were performed on the component datasets eventually merged into the model data examined on this page, please follow the links listed below. On each of those linked pages, you will also find links to the original source datasets used for each raw data EDA:

  1. Crime incident raw data EDA
  2. Property assessment raw data EDA
  3. Streetlight location raw data EDA
  4. NOAA weather raw data EDA
  5. Neighborhood demographics raw data EDA
  6. Liquor licensing raw data EDA
  7. Educational institutions raw data EDA
  8. Property violations raw data EDA

Introduction

This page outlines the comprehensive exploratory data analysis (EDA) conducted on our training feature set engineered for our predictive model. As is outlined above, a number of different data sources were investigated, cleansed, and enhanced in order to achieve the finished feature set investigated on this page.

EDA Contents

The EDA on this page is divided into the following sections:

  1. Variable of interest and number of overall records examined
  2. Predictors used in this analysis
  3. Geographical distributions of crime records
  4. Geographical change in crime records over time
  5. Pair-wise relationship strength between predictors and crime-type classes
  6. Evaluating distance-based geographical predictors by class
  7. Dimensionality reduction with principal component analysis (PCA)
  8. Correlation of predictors and multi-collinearity risk

Our variable of interest and number of overall records examined

Because our analysis seeks to predict, given any location in the City of Boston, the “type of crime” most likely to occur at that location, the response variable against which our analysis is built is crime-type.

This crime-type variable was derived from Crime Incident Records data maintained by the City of Boston, and available online here. After our initial investigation of this data, a summary of which can be found here, we subset the data to include all records containing corresponding geo-locational data recorded during the Jan. 2016 through Aug. 2019 period. In addition, we consolidated a subset of the 66 original crime type classes in the raw dataset down to a set of 9 remaining crime classes:

crime-type    crime-type-name           crime-count
0             other                           6,321
1             burglary                        5,664
2             drugs-substances               13,082
3             fraud                           9,587
4             harassment-disturbance         20,767
5             robbery                         3,423
6             theft                          34,555
7             vandalism-property             13,710
8             violence-aggression            21,243

Once our data was subset, classes were filtered and consolidated, our final engineered features were merged with the crime incident data, and any records with missing data for any engineered features were removed, we were left with a dataset containing 160,440 observed crime records.

Then, after performing an 80/20 train-test split, stratified by crime-type, our resulting training set (the data analyzed in this EDA) contained 128,352 training records.
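To make the stratified split concrete, here is a minimal sketch in plain Python. It is not the code used in our notebook (a library routine such as scikit-learn's `train_test_split` with its `stratify` argument would be the more usual choice); the function name, the toy labels, and the seed are all assumptions made for illustration. The idea is simply that each class is shuffled and split 80/20 on its own, so class proportions carry over to both sets. Note that 80% of 160,440 records is exactly 128,352.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), splitting each class in the same ratio."""
    rng = random.Random(seed)

    # Group record indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (hypothetical, not our data): 50 theft, 30 fraud, 20 robbery.
labels = ["theft"] * 50 + ["fraud"] * 30 + ["robbery"] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Because the split is done per class, a rare class such as robbery keeps the same share in the training set as in the full dataset.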

Secondary subset of crime-type classes

Please note that we have also performed a secondary subsetting of crime-type classes in our subsequent prediction models. This was done as a comparative analysis to attempt to overcome the imbalanced classes existing in our primary set of crime-type classes and to examine potential changes in predictive accuracy when we subset for a smaller grouping of particularly meaningful classes.

The secondary subsetting of classes can be summarized as follows:

crime-type    crime-type-name           crime-count
0             drugs-substances               13,082
1             theft                          34,555
2             violence-aggression            24,666

Note that in this secondary grouping of classes, robbery has been combined into violence-aggression due to the physical nature of the crime and the increased risk or threat of violence that such a crime entails.
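The consolidation into the secondary classes can be sketched as a simple mapping. The dictionary names below are hypothetical and the snippet is only an illustration of the bookkeeping, using the primary class counts from the first table on this page. It confirms that the secondary violence-aggression count is the sum of the primary violence-aggression and robbery counts (21,243 + 3,423 = 24,666).

```python
# Counts for the primary classes, copied from the first table on this page.
primary_counts = {
    "other": 6321,
    "burglary": 5664,
    "drugs-substances": 13082,
    "fraud": 9587,
    "harassment-disturbance": 20767,
    "robbery": 3423,
    "theft": 34555,
    "vandalism-property": 13710,
    "violence-aggression": 21243,
}

# Hypothetical mapping: primary classes not listed here are dropped.
secondary_map = {
    "drugs-substances": "drugs-substances",
    "theft": "theft",
    "violence-aggression": "violence-aggression",
    "robbery": "violence-aggression",  # robbery is merged into violence-aggression
}

secondary_counts = {}
for name, count in primary_counts.items():
    target = secondary_map.get(name)
    if target is None:
        continue  # class is not part of the secondary subset
    secondary_counts[target] = secondary_counts.get(target, 0) + count

print(secondary_counts)
```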

Predictors used in this analysis

For a full listing of the predictors engineered for our final dataset, please see the “Models” page of this analysis.

Geographical distributions of crime records

The plot below illustrates the distribution of crime records by each record’s associated geo-location coordinates in the City of Boston. By color-coding each plotted location point according to its associated crime-type class, we can see that crime types demonstrate heavy spatial mixing, with the different crime-type classes having occurred in close proximity to, or overlapping, one another from 2016 through 2019.

crime-locs

crime-locs-by-type

If we further subset our training records by Boston neighborhood, we can begin to examine a bit more closely whether this spatial mixing of classes holds for all areas of the city, as well as how it might vary by neighborhood.

First, we can see that the greatest proportion of all records occurred in the Dorchester neighborhood, followed by Roxbury, and then others.

crime-by-hood

Examining this neighborhood-by-neighborhood breakdown a little more closely, we can see which neighborhoods experienced larger proportions of one crime-type versus another, relative to each neighborhood’s own crime incidents.

For example, in the plot below we can see that Back Bay experienced the highest proportion of theft among its own crime incidents when compared to other neighborhoods. The West End leads with the highest proportion of violence-aggression type crimes, whereas Beacon Hill has the lowest incidence of those crimes. Mattapan appears to lead with the highest in-neighborhood proportion of harassment-disturbance incidents, while there are relatively few in the Leather District.
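The in-neighborhood proportions described here amount to normalizing crime-type counts within each neighborhood rather than across the whole city. A minimal sketch, assuming records are available as (neighborhood, crime-type) pairs; the function name and the toy records are hypothetical and the numbers are not taken from our data:

```python
from collections import Counter, defaultdict


def within_group_proportions(records):
    """records: iterable of (neighborhood, crime_type) pairs.

    Returns {neighborhood: {crime_type: share of that neighborhood's records}}.
    """
    counts = defaultdict(Counter)
    for hood, crime in records:
        counts[hood][crime] += 1
    return {
        hood: {crime: n / sum(c.values()) for crime, n in c.items()}
        for hood, c in counts.items()
    }


# Toy records (hypothetical values for illustration only).
records = (
    [("Back Bay", "theft")] * 3
    + [("Back Bay", "fraud")]
    + [("Mattapan", "theft")]
    + [("Mattapan", "harassment-disturbance")] * 3
)
props = within_group_proportions(records)
print(props["Back Bay"]["theft"])
```

Normalizing this way lets a small neighborhood be compared fairly with Dorchester, whose raw counts would otherwise dominate.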

crime-by-hood-type

By looking at spatially mapped crime incidents in individual neighborhoods, we can also see a bit more clearly the overlapping nature of crime records by class. Starting with Dorchester, which we know to have the highest overall number of records, we can see just how densely mixed the crime incidents are.

crime-dorchester

And, looking at Jamaica Plain and East Boston, as they are plotted below, we can see visually how crimes are more heavily clustered in the more heavily populated areas of each neighborhood.

crime-jamaica-plain

crime-east-boston

Even in neighborhoods with a sparser spatial distribution of crime records, we can still see evidence of heavy spatial mixing among crime-type classes.

crime-north-end

crime-w-roxbury

PRELIMINARY CONCLUSION:

These observations suggest to us that purely locational predictors such as Latitude and Longitude alone will be insufficient for generating accurate predictions for each crime-type class.

Geographical change in crime records over time

As we discovered during the initial EDA of the raw crime incident records, meaningful choropleth plots of crime records require a sufficiently granular set of geographic areas. As shown below, at the neighborhood level the disproportionately high volume of crime records in Dorchester overwhelms the plotted distribution. However, the same level of aggregation looks very different when we consider the neighborhood-by-neighborhood change in the proportion of crime records over the four years from 2016 to 2019, as shown in the second plot below. There we can see that Dorchester's overall proportion is actually decreasing, while those of the South End, Roxbury, Downtown, South Boston, and Mattapan are all increasing.

crime-hood-choro

crime-hood-choro-4yr
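One way to compute a change in proportion like the one plotted above is sketched below. We assume here that "change" means the simple difference between an area's share of all records in the last year and its share in the first year; the notebook may define it differently (for example as a growth rate). Function names and toy records are hypothetical.

```python
from collections import Counter


def share_by_area(records, year):
    """Share of that year's records falling in each area."""
    counts = Counter(area for area, y in records if y == year)
    total = sum(counts.values())
    return {area: n / total for area, n in counts.items()}


def share_change(records, start_year, end_year):
    """Difference in each area's share between end_year and start_year."""
    start = share_by_area(records, start_year)
    end = share_by_area(records, end_year)
    areas = set(start) | set(end)
    return {a: end.get(a, 0.0) - start.get(a, 0.0) for a in areas}


# Toy (area, year) records; values are illustrative only.
records = (
    [("Dorchester", 2016)] * 6 + [("Roxbury", 2016)] * 4
    + [("Dorchester", 2019)] * 5 + [("Roxbury", 2019)] * 5
)
change = share_change(records, 2016, 2019)
print(change)
```

The same functions work unchanged if the area key is a census tract identifier instead of a neighborhood name.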

Then, by plotting with a more granular shape area such as census tract, we can see much more specifically, with a set of smaller sub-regions, where the highest concentrations of crime incidents have occurred, as well as where they are increasing the most.

crime-tract-choro

crime-tract-choro-4yr

Therefore, for the next set of plots we will use census tracts as our level of aggregation when examining the 4-year change in locational crime record proportions by crime-type class (for just a subset of our classes, to preserve space).

Our first example, shown below, is burglary crime records. Here we can see that, while there are particular “hotspots” for burglaries in our dataset in Downtown and in Allston, when we view the same records with respect to their 4-year change in proportions, a different set of locational relationships emerges, with the same Allston tract's proportion dropping relatively heavily over time.

crime-tract-burg

crime-tract-burg-4yr

Next, looking at drugs-substances crime records, we can see a different distribution of hotspots overall, with particularly high 4-year growth in the tract straddling the South Boston, South End, and Roxbury borders.

crime-tract-drug

crime-tract-drug-4yr

Then, when we look at harassment-disturbance records, another distribution emerges, more heavily distributed in tracts corresponding to Dorchester, Roxbury, Mattapan, and Hyde Park. In this case, the 4-year growth is most heavily focused in the Hyde Park tract bordering Mattapan, with Dorchester dropping most rapidly.

crime-tract-harass

crime-tract-harass-4yr

To view similar plots for all crime-type classes, please review the original notebook in which these plots were produced, which can be found online here.

PRELIMINARY CONCLUSIONS:

These findings on locational change by crime-type suggest that including time-based predictors in our models, such as the year-based property-related metrics described in detail on our property assessment data EDA page, should increase the overall predictive accuracy of our models.

Pair-wise relationship strength between predictors and crime-type classes

Looking beyond purely location-based predictors and time-based predictors related to location, we will now examine the pair-wise relationships between our individual predictors and each of our crime-type classes. To do so, we measure the strength of each relationship by calculating a t-statistic for each pairing. Effectively, this looks at each predictor and class combination in isolation to determine how strongly that predictor predicts that class outcome in a “one-vs-rest” fashion.

Outlined below are the top 6 predictors, as measured by t-statistic, for each crime-type class:
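A one-vs-rest t-statistic of this kind can be sketched as follows. We assume a two-sample Welch t-statistic (unequal variances) comparing a predictor's values inside one class against its values in all other classes, and we report the absolute value; the exact test variant used in our notebook is not stated on this page, so treat this as an illustration. The function names and toy data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch two-sample t-statistic (does not assume equal variances)."""
    va, vb = variance(sample_a), variance(sample_b)
    standard_error = sqrt(va / len(sample_a) + vb / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error


def one_vs_rest_tstats(values, labels):
    """Absolute t-statistic of one predictor for each class vs. all others."""
    tstats = {}
    for cls in sorted(set(labels)):
        in_class = [v for v, lab in zip(values, labels) if lab == cls]
        rest = [v for v, lab in zip(values, labels) if lab != cls]
        tstats[cls] = abs(welch_t(in_class, rest))
    return tstats


# Toy predictor values (illustrative): class "b" is clearly shifted.
values = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1, 1.0, 1.1, 0.9, 1.2]
labels = ["a"] * 4 + ["b"] * 4 + ["c"] * 4
tstats = one_vs_rest_tstats(values, labels)
print(max(tstats, key=tstats.get))
```

Repeating this for every predictor and sorting by the resulting statistic gives a per-class ranking of predictors.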

The predictors with the largest t-statistics related
to each crime-type class are:

class 0: other
                         tstat
owner-occupied-ratio     22.27
lat                      18.27
poverty-rate             17.13
median-age               17.01
enrolled-college-perc    16.39
streetlights-night        9.76


class 1: burglary
                                tstat
commercial-mix-ratio            15.76
night                           14.72
lon                             13.69
less-than-high-school-perc      10.11
college-near                     9.62
streetlights-night               9.40


class 2: drugs-substances
                              tstat
night                         34.35
streetlights-night            27.68
industrial-mix-ratio          26.97
Sunday                        25.63
less-than-high-school-perc    23.50
enrolled-college-perc         20.39


class 3: fraud
                                     tstat
less-than-high-school-perc           16.96
Sunday                               16.77
bachelor-degree-or-more-perc         14.74
residential-median-value-3yr-cagr    14.62
median-income                        11.65
enrolled-college-perc                11.11


class 4: harassment-disturbance
                                tstat
residential-gini-coef           64.20
bachelor-degree-or-more-perc    64.11
enrolled-college-perc           63.76
commercial-mix-ratio            63.10
residential-median-value        57.11
college-near                    57.01


class 5: robbery
                                     tstat
less-than-high-school-perc           11.04
streetlights-night                   10.14
lon                                   8.77
night                                 8.13
bachelor-degree-or-more-perc          7.53
residential-median-value-3yr-cagr     7.50


class 6: theft
                                tstat
bachelor-degree-or-more-perc    68.57
residential-gini-coef           61.65
enrolled-college-perc           60.34
residential-median-value        59.61
less-than-high-school-perc      54.19
college-near                    50.52


class 7: vandalism-property
                                tstat
commercial-mix-ratio            26.47
college-near                    22.72
residential-median-value        21.42
residential-gini-coef           20.61
enrolled-college-perc           19.23
owner-occupied-ratio            12.65


class 8: violence-aggression
                                tstat
less-than-high-school-perc      20.03
bachelor-degree-or-more-perc    17.88
median-income                   15.81
streetlights-night              15.32
residential-gini-coef           15.11
poverty-rate                    14.03

Now, for the sake of consistency with the previous section of this EDA, we will use plots to examine these top predictors for the same subset of classes: burglary, drugs-substances, and harassment-disturbance.

tstat-burg

tstat-drug

tstat-harass

PRELIMINARY CONCLUSIONS:

In the summary tables and plots above, we can see a fairly broad mixture of top predictors across all of our crime-type classes. This leads us to believe that we may have a suitable variety of predictors for the classes we hope to predict. However, while some crime-type classes such as theft and harassment-disturbance have top predictors with t-statistics exceeding 60, classes such as violence-aggression, robbery, and burglary have much smaller top t-statistic values, all less than 25. This suggests those classes may prove harder to predict accurately using this set of predictors. What these t-statistic measures do not account for, however, is the potential interaction effect of generating predictions for each crime-type class with multiple predictors at once.

Evaluating distance-based geographical predictors by class

In the previous section of this EDA, we saw a couple of our distance-based predictors among the top predictors for several of the crime-type classes. This leads us to wonder how these predictors might vary across all of our classes. The specific predictors we consider distance-based are streetlights-night, which measures the number of streetlights within 100 meters of each crime record for crimes occurring at night; college-near, which indicates whether a crime incident occurred within 500 meters of a college or university; and highschool-near, which indicates whether a crime incident occurred within 500 meters of a public or non-public high school.

All measurements used in these predictors were calculated using the Haversine formula to measure the distance between sets of latitude and longitude coordinates. A detailed description of the Haversine implementation built for our analysis can be found on this site page.
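For readers unfamiliar with it, the Haversine formula gives the great-circle distance between two latitude/longitude points on a sphere. The sketch below is not our project implementation (that is described on the linked page); it is a minimal version assuming a mean Earth radius of 6,371 km, with hypothetical helper names and made-up coordinates, showing how a count within 100 meters and a 500-meter proximity indicator could be derived from it.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # assumed mean Earth radius, in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def count_within(point, others, radius_m):
    """Number of locations in `others` within radius_m of `point`."""
    return sum(haversine_m(*point, *other) <= radius_m for other in others)


def is_near(point, others, radius_m):
    """True if any location in `others` lies within radius_m of `point`."""
    return any(haversine_m(*point, *other) <= radius_m for other in others)


# Made-up coordinates for illustration only.
incident = (42.3601, -71.0589)
streetlights = [(42.3601, -71.0589), (42.3605, -71.0589), (42.3701, -71.0589)]
colleges = [(42.3701, -71.0589)]

night = 1  # 1 if the incident occurred at night, else 0
streetlights_night = count_within(incident, streetlights, 100) * night
college_near = is_near(incident, colleges, 500)
print(streetlights_night, college_near)
```

This brute-force version compares every incident with every location, which is fine for illustration but would likely need a spatial index or vectorization at the scale of our dataset.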

Note that, to better understand the underlying nuances of distance to the nearest college or university and high school, we plot the actual distances per crime record rather than the proximity indicator for each.

For each of the predictors below, we do see some variation in distributions among the different crime-type classes. Most notable are the differences in distance to the nearest college or university for some of the classes. The difference between the college distance distributions for harassment-disturbance versus theft records provides a good example of how these relationships can vary, and likely contributes to college-near showing up several times in our top t-statistics tables above.

dist-streetlight

dist-college

dist-highschool

Dimensionality reduction with principal component analysis (PCA)

To gain a better sense of how well we might be able to separate classes given the predictors we currently have on hand, we now use the unsupervised learning method PCA to reduce the number of dimensions in our predictor set. Doing so helps us to understand explained variance in our predictors regardless of our target response classes, and lets us visualize our crime-type classes along the two PCA-reduced dimensions with the greatest explained variance, giving us a sense of how easily separable our crime-type classes might be given these predictors.

First, we can examine the growth curve in cumulative explained variance plotted against the number of principal components derived from our standardized predictor data. Here we can see a distinct elbow in cumulative explained variance growth when we reach 14 components and approximately 85% explained variance, indicating diminishing returns for all components beyond that point.

pca
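A cumulative explained-variance curve like the one above can be computed directly from the singular values of the standardized predictor matrix. The sketch below uses NumPy on synthetic data rather than our predictors, and the function name is hypothetical; it also reads off the number of components needed to reach an 85% threshold, which is a simpler criterion than visually locating an elbow.

```python
import numpy as np


def cumulative_explained_variance(X):
    """Cumulative explained-variance ratio of PCA on standardized columns of X."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Squared singular values of the standardized data are proportional
    # to the variance captured by each principal component.
    singular_values = np.linalg.svd(Z, compute_uv=False)
    explained = singular_values ** 2
    return np.cumsum(explained / explained.sum())


# Synthetic data: three strongly related columns plus two independent ones.
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 1))
related = [base + 0.1 * rng.normal(size=(500, 1)) for _ in range(3)]
X = np.hstack(related + [rng.normal(size=(500, 2))])

cumvar = cumulative_explained_variance(X)
# Smallest number of components whose cumulative ratio reaches 85%.
n_components = int(np.searchsorted(cumvar, 0.85) + 1)
print(cumvar.round(3), n_components)
```

Because the three related columns share most of their variance, the first component alone captures more than half of the total in this synthetic example.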

Next, we isolate our top 2 principal components and plot them against one another. In this plot we have labeled the crime-type class of each observation. Immediately we can see that PCA does very little to separate our response classes using these top 2 components. The observations do however appear to have achieved some sort of separation across some other dimension not measured here. Regardless, this indicates to us that our prediction models will also likely have difficulty separating out the crime-type classes we have outlined here given our predictor variables.

pca-top-2

Correlation of predictors and multi-collinearity risk

As a final step in our model data EDA, we will investigate correlation and multi-collinearity of the predictors we have engineered for our model.

To achieve this, we first generate a pair-wise correlation matrix for each unique pair of predictors, sorted in descending order, as shown below for all resulting correlation values greater than 0.50. Here we can see 27 predictor pairs with moderate to high correlation (involving 14 unique predictors). Of these, we would expect night and streetlights-night to be strongly correlated because streetlights-night is an interaction term partly derived from the values in night. In fact, streetlights-night only contains non-zero values for crime records that occurred at night. The other top correlated pairings also make sense. We should expect poverty-rate and median-income, both measured at the neighborhood level, to have a strong negative relationship. Likewise, we should expect the same negative relationship between bachelor-degree-or-more-perc and less-than-high-school-perc.

The most strongly correlated predictors (corr > 0.50) in our predictor
set and their corresponding correlation values are:
                                                                corr coef
poverty-rate                       median-income                    0.832
bachelor-degree-or-more-perc       less-than-high-school-perc       0.830
night                              streetlights-night               0.814
enrolled-college-perc              bachelor-degree-or-more-perc     0.776
owner-occupied-ratio               enrolled-college-perc            0.736
median-age                         owner-occupied-ratio             0.727
residential-gini-coef              residential-median-value         0.719
median-income                      less-than-high-school-perc       0.718
bachelor-degree-or-more-perc       median-income                    0.714
bachelor-degree-or-more-perc       residential-gini-coef            0.672
college-near                       enrolled-college-perc            0.670
poverty-rate                       median-age                       0.669
enrolled-college-perc              residential-gini-coef            0.663
bachelor-degree-or-more-perc       residential-median-value         0.629
median-age                         enrolled-college-perc            0.624
lat                                enrolled-college-perc            0.621
college-near                       residential-gini-coef            0.612
owner-occupied-ratio               lat                              0.610
residential-median-value           commercial-mix-ratio             0.610
residential-gini-coef              commercial-mix-ratio             0.605
college-near                       bachelor-degree-or-more-perc     0.583
lat                                bachelor-degree-or-more-perc     0.562
residential-median-value           college-near                     0.556
lat                                residential-gini-coef            0.555
residential-median-value           enrolled-college-perc            0.554
enrolled-college-perc              commercial-mix-ratio             0.550
median-income                      residential-median-value         0.523
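A table of this kind can be produced by computing the Pearson correlation for every unique pair of predictors and keeping the pairs above the threshold. In the sketch below we filter and sort on the absolute value of the coefficient; that is our assumption about how a table like the one above is built, given that pairs described as negatively related appear in it with positive values. The function names and the toy columns are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def strong_pairs(columns, threshold=0.50):
    """Unique predictor pairs with |r| > threshold, strongest first."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) > threshold:
            pairs.append((name_a, name_b, r))
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)


# Toy columns (made-up values, not our data).
columns = {
    "poverty-rate": [30, 25, 45, 15, 40, 10],
    "median-income": [50, 60, 30, 70, 40, 80],
    "night": [0, 1, 1, 0, 1, 0],
    "streetlights-night": [0, 4, 6, 0, 5, 0],
}
pairs = strong_pairs(columns)
for name_a, name_b, r in pairs:
    print(f"{name_a:<20}{name_b:<20}{r:6.3f}")
```

Keeping the signed coefficient in the output, while ranking by magnitude, preserves the direction of each relationship for later interpretation.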

To further examine these relationships we can begin by plotting scatter-matrices of the 14 unique predictors contained in the correlation table above.

scatter

PRELIMINARY FINDINGS:

Given the strength of the correlations among a handful of our predictors, addressing them is high on our list of priorities for future iterations of our analysis. For the most strongly correlated predictor pairings, we will consider methods such as centering to alleviate potential correlation issues, as well as the removal of at least one predictor from each pairing, to see what effect each change has on our models and the interpretability of our results.
