Baseline model using logistic regression
The notebook used to develop this baseline model can be found here.
Summary
As an initial baseline model, against which we could compare the results of our subsequent models, we attempted to predict crime types using only latitude and longitude (lat and lon) as our predictors. To accomplish this, we fit a multinomial logistic regression model with effectively no regularization (an L2 penalty with C=100000), and we fit it both without and with "balanced" class weightings for comparison. Given the heavy spatial mixing of crime-type classes shown in our initial training data EDA, we had very low expectations for the quality of predictions that might be generated using only geospatial coordinates as our predictors.
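For concreteness, the sketch below shows how these two fits might look in scikit-learn. The variable and column names (train_df, lat, lon, crime_type) are illustrative assumptions, not taken from the project notebook; the hyperparameter values match the fitted models reported in the next section.

```python
from sklearn.linear_model import LogisticRegression

X_train = train_df[["lat", "lon"]]   # assumed predictor columns
y_train = train_df["crime_type"]     # assumed response column

# MODEL 1: no class weighting. The very large C makes the default L2
# penalty negligible, approximating an unregularized fit.
model_1 = LogisticRegression(C=100000, multi_class="multinomial",
                             solver="lbfgs", max_iter=1000, random_state=20)
model_1.fit(X_train, y_train)

# MODEL 2: identical, except each class is reweighted inversely to its
# frequency in the training data.
model_2 = LogisticRegression(C=100000, multi_class="multinomial",
                             solver="lbfgs", max_iter=1000, random_state=20,
                             class_weight="balanced")
model_2.fit(X_train, y_train)
```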
Model parameters
For comparison, the parameters of the two scikit-learn LogisticRegression model objects fitted for our baseline model are:
MODEL 1: Without balanced class weights
LogisticRegression(C=100000, class_weight=None, dual=False,
fit_intercept=True, intercept_scaling=1,
l1_ratio=None, max_iter=1000,
multi_class='multinomial', n_jobs=None,
penalty='l2', random_state=20, solver='lbfgs',
tol=0.0001, verbose=0, warm_start=False)
MODEL 2: With balanced class weights
LogisticRegression(C=100000, class_weight='balanced', dual=False,
fit_intercept=True, intercept_scaling=1,
l1_ratio=None, max_iter=1000,
multi_class='multinomial', n_jobs=None,
penalty='l2', random_state=20, solver='lbfgs',
tol=0.0001, verbose=0, warm_start=False)
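For reference, scikit-learn's 'balanced' mode sets each class weight to n_samples / (n_classes * n_samples_in_class), so rarer classes receive proportionally larger weights. A small self-contained illustration:

```python
import numpy as np

# class_weight='balanced' assigns weight_c = n_samples / (n_classes * count_c)
y = np.array([0, 0, 0, 0, 1, 1, 2, 2])            # toy labels: class 0 dominates
classes, counts = np.unique(y, return_counts=True)
weights = len(y) / (len(classes) * counts)
print(dict(zip(classes, np.round(weights, 3))))   # {0: 0.667, 1: 1.333, 2: 1.333}
```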
Model accuracy and AUC
Ultimately, these models resulted in the following training accuracies and average AUCs. As we can see, Model 1, fit without balanced class weights, achieved a TEST accuracy of only 0.2733, which is not surprising given the expected difficulty of predicting crime types from our previous exploration of this data. What is interesting, however, is the far lower TEST accuracy of Model 2, fit using balanced class weights. Surprisingly, the AUC of Model 2 suffers very little as a result of balancing the class weights, which led us to further inspect the prediction results generated by each model in the next section below.
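The notebook's exact scoring code is not reproduced here, but the figures below are consistent with scikit-learn's standard accuracy and one-vs-rest AUC metrics, along the lines of this sketch (X_test and y_test are assumed holdout arrays):

```python
from sklearn.metrics import accuracy_score, roc_auc_score

y_pred = model_1.predict(X_test)
y_proba = model_1.predict_proba(X_test)

print("accuracy:", accuracy_score(y_test, y_pred))

# One-vs-rest AUC averaged across classes: "weighted" weights each class
# by its prevalence; "macro" (unweighted) averages all classes equally.
print("weighted AUC:  ", roc_auc_score(y_test, y_proba,
                                       multi_class="ovr", average="weighted"))
print("unweighted AUC:", roc_auc_score(y_test, y_proba,
                                       multi_class="ovr", average="macro"))
```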
MODEL 1: Without balanced class weights
This model resulted in the following accuracy:
Training 0.2706
Test 0.2733
The model AUC is:
weighted unweighted
Training 0.5519 0.5396
Test 0.5558 0.5370
MODEL 2: With balanced class weights
This model resulted in the following accuracy:
Training 0.1295
Test 0.1332
The model AUC is:
weighted unweighted
Training 0.5487 0.5382
Test 0.5529 0.5395
Model predictions
By viewing the number of predictions generated by each version of the model, we can see that Model 2 generated a more balanced set of predictions, whereas Model 1 predicted only occurrences of crime-type classes 4 (harassment-disturbance) and 6 (theft), two of the most heavily over-represented classes in our training set. Even with so many false negatives predicted for all other classes in Model 1, these two classes together represent 43% of our total true TEST observations, which explains how we were able to achieve the TEST accuracy score we did with that model. Likewise, because Model 2's predictions are distributed among all potential response classes, the heavy over-representation of crime-type classes 0 (other) and 4 in its results very negatively impacts its overall accuracy on the test set.
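Tables like those below can be produced by tallying each model's predicted labels; a minimal sketch, assuming the fitted models and test data from the earlier sketches:

```python
import pandas as pd

# Count each predicted class and its share of all predictions.
preds = pd.Series(model_1.predict(X_test), name="class")
dist = preds.value_counts().sort_index().to_frame("count")
dist["proportion"] = dist["count"] / dist["count"].sum()
print(dist)
```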
MODEL 1: Without balanced class weights
The number of predictions generated per class is:
TRAINING
count proportion
class
4 15446 0.120341
6 112906 0.879659
TEST
count proportion
class
4 3939 0.122756
6 28149 0.877244
MODEL 2: With balanced class weights
The number of predictions generated per class is:
TRAINING
count proportion
class
0 53841 0.419479
1 6662 0.051904
2 1595 0.012427
3 2222 0.017312
4 52740 0.410901
5 3164 0.024651
6 3877 0.030206
7 1837 0.014312
8 2414 0.018808
TEST
count proportion
class
0 13476 0.419970
1 1624 0.050611
2 423 0.013182
3 527 0.016424
4 13313 0.414890
5 763 0.023778
6 964 0.030042
7 409 0.012746
8 589 0.018356
To further understand this effect, we can look at the confusion matrices and resulting classification metrics generated from the TEST results of each model, as shown below. Here again we can see the result of Model 2 predicting a larger number of response classes for the TEST set. For example, the true positive rate (TPR) for crime-type 6 in Model 1 is 0.905, but in Model 2 the TPR for this same class drops to just 0.026. Because crime-type 6 is the most heavily represented class in our actual TEST data, this metric, too, helps explain the discrepancy between the Model 1 and Model 2 accuracy results seen above.
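The per-class metrics tabulated below follow directly from the confusion matrix. A sketch of the derivation (note that scikit-learn's confusion_matrix puts true classes on rows and predictions on columns, the transpose of the tables shown here):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, model_1.predict(X_test))  # rows=true, cols=predicted
tp = np.diag(cm)
fp = cm.sum(axis=0) - tp        # predicted as class c, actually another class
fn = cm.sum(axis=1) - tp        # actually class c, predicted as another class
tn = cm.sum() - (tp + fp + fn)

tpr = tp / (tp + fn)            # true positive rate (recall)
fpr = fp / (fp + tn)            # false positive rate
```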
MODEL 1: Without balanced class weights
The resulting confusion matrix:
TEST
Actual 0 1 2 3 4 5 6 7 8 Total
Predicted
0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0
4 127 154 329 357 954 105 824 458 631 3939
5 0 0 0 0 0 0 0 0 0 0
6 1453 1262 2942 2040 4238 751 7815 2969 4679 28149
7 0 0 0 0 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 0 0
Total 1580 1416 3271 2397 5192 856 8639 3427 5310 32088
The classification metrics derived from the confusion matrix are:
TEST
TP FP FN TN TPR FNR FPR TNR
class
0 0 0 1580 30508 0.000 1.000 0.000 1.000
1 0 0 1416 30672 0.000 1.000 0.000 1.000
2 0 0 3271 28817 0.000 1.000 0.000 1.000
3 0 0 2397 29691 0.000 1.000 0.000 1.000
4 954 2985 4238 23911 0.184 0.816 0.111 0.889
5 0 0 856 31232 0.000 1.000 0.000 1.000
6 7815 20334 824 3115 0.905 0.095 0.867 0.133
7 0 0 3427 28661 0.000 1.000 0.000 1.000
8 0 0 5310 26778 0.000 1.000 0.000 1.000
MODEL 2: With balanced class weights
The resulting confusion matrix:
TEST
Actual 0 1 2 3 4 5 6 7 8 Total
Predicted
0 697 479 1532 1088 1207 394 4715 1245 2119 13476
1 102 85 97 98 353 50 383 173 283 1624
2 21 21 27 30 85 6 120 46 67 423
3 29 24 24 28 99 16 125 73 109 527
4 525 675 1246 974 3041 320 2690 1572 2270 13313
5 41 46 122 51 101 25 155 79 143 763
6 114 32 112 62 133 25 222 100 164 964
7 16 30 30 25 79 8 88 64 69 409
8 35 24 81 41 94 12 141 75 86 589
Total 1580 1416 3271 2397 5192 856 8639 3427 5310 32088
The classification metrics derived from the confusion matrix are:
TEST
TP FP FN TN TPR FNR FPR TNR
class
0 697 12779 883 17729 0.441 0.559 0.419 0.581
1 85 1539 1331 29133 0.060 0.940 0.050 0.950
2 27 396 3244 28421 0.008 0.992 0.014 0.986
3 28 499 2369 29192 0.012 0.988 0.017 0.983
4 3041 10272 2151 16624 0.586 0.414 0.382 0.618
5 25 738 831 30494 0.029 0.971 0.024 0.976
6 222 742 8417 22707 0.026 0.974 0.032 0.968
7 64 345 3363 28316 0.019 0.981 0.012 0.988
8 86 503 5224 26275 0.016 0.984 0.019 0.981
Receiver operating characteristic (ROC) curves by model and class
As a final step toward better understanding the quality of our results and the differences we are seeing between Model 1 and Model 2, we now plot the ROC curves for each model and class, and report the individual AUC values by class and model.
As we can see in the first two plots below, the ROC curves plotted by class for each model show very similar results between Model 1 and Model 2 (plotted in that order below), explaining the very similar average overall AUC for both models (note that this metric is also reported on the plots). However, what is evident in both of these models is the unfavorable overall ROC curve shape for all classes, even the best-performing class, crime-type class 4. What we would hope to see here are ROC curves rising very steeply toward a true positive rate (TPR) of 1.0 at very low values of the corresponding false positive rate (FPR), ideally forming a very high, far-left, sharp elbow in each curve. Such a curve would indicate that, by modifying our classification probability thresholds for each class (or a particular class of interest, such as violence-aggression), there would be a very low trade-off in FPR to achieve higher TPRs for individual class predictions. A good illustrative example of this relationship can be seen in the related Wikipedia article, most notably in the illustrative image located here.
However, what we see in our plots below are wavering ROC curves with very low rates of increase across each plot. Notably, some class ROC curves sit at or even below the 0.5 AUC threshold, as indicated by curves falling below the diagonal dashed black line plotted as a reference in each figure. This relationship to the AUC=0.5 reference line can be seen more easily in the third plot below, which breaks out the ROC curve for each individual crime-type class for Model 1 as a visual example. Ideally, in future models, we would like to see each of these curves pulled upward, toward the left, and as far away from that 0.5 AUC dashed line as possible, which would indicate more favorable opportunities for increasing the TPR for particular crime-type classes of interest. A sketch of how such per-class curves can be generated follows.
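A minimal sketch of plotting one-vs-rest, per-class ROC curves with scikit-learn and matplotlib (model and data names carried over from the earlier sketches):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import auc, roc_curve
from sklearn.preprocessing import label_binarize

classes = model_1.classes_
y_bin = label_binarize(y_test, classes=classes)  # one indicator column per class
y_proba = model_1.predict_proba(X_test)

# One ROC curve per class, treating that class as "positive" vs. the rest.
for i, c in enumerate(classes):
    fpr, tpr, _ = roc_curve(y_bin[:, i], y_proba[:, i])
    plt.plot(fpr, tpr, label=f"class {c} (AUC = {auc(fpr, tpr):.3f})")

plt.plot([0, 1], [0, 1], "k--", label="AUC = 0.5 reference")
plt.xlabel("False positive rate (FPR)")
plt.ylabel("True positive rate (TPR)")
plt.legend(fontsize="small")
plt.show()
```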
Conclusion and next steps
As expected from our initial EDA and data visualization efforts, due to the heavy spatial mixing of crime-type classes, we achieved very low accuracy scores and far from favorable AUC values and ROC curves with the two versions of our baseline model above. Also interesting is the sharp decrease in accuracy that resulted from balancing the class weights of our fitted model, as discussed above. These are both challenges we will seek to address in our subsequent models.
To some degree, we expect that logistic regression will fail to adequately separate prediction classes due to the linear nature of the model, and that we will eventually find better results by using non-parametric methods such as k-Nearest Neighbors classifiers or Decision Tree-based classifiers, which can provide more flexible decision boundaries. However, after viewing these results, we were still very interested to see what additional predictive accuracy we could achieve and how different our AUC and ROC results might be once we add in a more complete predictor set. To view the results of our subsequent Logistic Regression classifier models in which we generated predictions using the entire set of predictors, please see the Logistic Regression Classifiers summary page. As you can see on that page, we were able to achieve some very interesting results, even with just logistic regression as our classifier.