Selecting an Evaluation Metric for Race and Ethnicity Imputation in FARS
In machine learning, the evaluation metric is a crucial component of the model development process: it is used to assess how well a model performs on a given task, and its choice can have a significant impact on the resulting model. Here, I provide an overview of how I selected the evaluation metric for the imputation model that predicts the race and ethnicity of people in the Fatality Analysis Reporting System (FARS).
I provide a refresher on model performance evaluation, focusing on why the chosen evaluation metric matters, the limitations of accuracy in imbalanced datasets, and the strengths and weaknesses of the Multiclass Receiver Operating Characteristic Area Under the Curve (\(MROC_{AUC}\)). I select the \(MROC_{AUC}\) for the race and ethnicity imputation model because it aligns with the disparity estimation objective of the study.
Evaluation Metric Considerations
Model Objective
Since the intended use of the imputation model is to enable downstream disparity estimation and policy evaluation across racial and ethnic groups, model evaluation needs to heed systematic misclassification errors. A preferred performance metric is therefore a measure of statistical discriminatory power that penalizes models that overfit to the majority class at the expense of the minority classes.1 For example, it is undesirable to have a model that is very good at predicting, say, White individuals, but systematically miscategorizes Black individuals as Hispanic, because doing so will bias comparisons between these groups in downstream analyses.
Class Imbalance
An important factor to consider in selecting the evaluation metric is the class balance of the training data. The FARS data is quite imbalanced, as the table below shows: the majority class, White, accounts for 62 percent of observations, while the minority classes, Black, Hispanic, and Other, account for 14, 15, and 9 percent, respectively.
Race/Ethnicity | n | Proportion |
---|---|---|
White | 172,548 | 0.62 |
Black | 38,826 | 0.14 |
Hispanic | 40,564 | 0.15 |
Other | 24,661 | 0.09 |
Total | 276,599 | 1.00 |
In cases with class imbalance, choosing accuracy (the proportion of correctly classified instances) as the evaluation metric will tend to favor model performance on the majority class (see the sketch after this list) because:
- There is likely more information about the majority class.
- The model can achieve high accuracy by predicting the majority class for all observations.
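To make the second point concrete, here is a minimal sketch in Python. The labels are simulated to match the class proportions in the table above; nothing here comes from the actual FARS model:

```python
import numpy as np

# Simulate labels with the class proportions from the FARS table above.
rng = np.random.default_rng(0)
classes = ["White", "Black", "Hispanic", "Other"]
y_true = rng.choice(classes, size=276_599, p=[0.62, 0.14, 0.15, 0.09])

# A degenerate "model" that always predicts the majority class.
y_pred = np.full_like(y_true, "White")

# Accuracy rewards this model with ~0.62 despite zero discriminatory power.
accuracy = (y_true == y_pred).mean()
print(f"Accuracy of always predicting White: {accuracy:.2f}")
```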
Another property of accuracy is that it is class agnostic: it does not distinguish between classes, so it does not matter which races or ethnicities are misclassified or how they are misclassified. This is clearly problematic when the goal is to compare differences between the classified groups, because systematic misclassification errors will bias the results.
To safeguard against such imputation bias, the evaluation metric needs to give equal weight to all racial and ethnic classes and to the model’s ability to statistically discriminate between them. One way to measure the discriminatory ability of a model is the area under the receiver operating characteristic curve (\(ROC_{AUC}\)). The ROC curve is a graphical representation of the trade-off between the true positive rate and the false positive rate at different threshold values. The \(ROC_{AUC}\) summarizes the ROC curve into a single value, with a higher value indicating better discriminatory power.
To illustrate, here is the \(ROC\) curve for a binary White-versus-all-other-classes version of the FARS model:
The \(ROC\) curve traces the true positive rate against the false positive rate at all decision thresholds. In the figure above, we see that at a decision threshold of .27 (anyone with a predicted probability of being White greater than .27 is classified as White):
- The true positive rate, \(P(\text{Model predicts White} \mid \text{Person is White})\), is .82.
- The false positive rate, \(P(\text{Model predicts White} \mid \text{Person is not White})\), is .19.
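For intuition, here is a minimal sketch of how the two rates are computed at a fixed threshold. The scores are simulated stand-ins, not the FARS model’s predictions:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
y_white = rng.random(n) < 0.62                 # True if the person is White
# Simulated predicted probabilities, mildly informative about the label.
p_white = 0.3 * y_white + 0.7 * rng.random(n)

threshold = 0.27
pred_white = p_white > threshold               # classify as White above the cutoff

tpr = pred_white[y_white].mean()               # P(predicts White | is White)
fpr = pred_white[~y_white].mean()              # P(predicts White | is not White)
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```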
The area under this curve is the \(ROC_{AUC}\). A perfect model would predict all White individuals as White with 100% certainty and everyone else as not White with 100% certainty. At every threshold, the true positive rate would then be 1 and the false positive rate would be 0, so the \(ROC\) curve would be pushed all the way to the upper left corner and the \(ROC_{AUC}\) would be 1. Thus, the more area under the curve, the better. Another interpretation of the \(ROC_{AUC}\) is that it is the probability that, for a randomly chosen White person and a randomly chosen non-White person, the model will have assigned the White person the higher probability of being White.
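That pairwise interpretation is easy to verify empirically. Here is a sketch with simulated scores, assuming scikit-learn is available:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
n = 5_000
y = rng.random(n) < 0.62                                    # True = White
# Simulated scores: White individuals tend to get higher scores.
scores = np.where(y, rng.normal(1.0, 1.0, n), rng.normal(0.0, 1.0, n))

auc = roc_auc_score(y, scores)

# Directly estimate P(score of a random White person > score of a random
# non-White person) by comparing all cross-group pairs of scores.
pairwise = (scores[y][:, None] > scores[~y][None, :]).mean()
print(f"ROC_AUC = {auc:.3f}, pairwise probability = {pairwise:.3f}")  # match
```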
The FARS race and ethnicity imputation model outputs predicted probabilities for four racial and ethnic categories: White, Black, Hispanic, and Other. The \(MROC_{AUC}\) is a pairwise generalization of the \(ROC_{AUC}\) to the multiclass setting (Hand and Till 2001):
\[MROC_{AUC} = \frac{1}{C \times (C - 1)} \sum_{i=0}^{C-1} \sum_{j=0, j \neq i}^{C-1} A(i,j)\]
where \(C\) is the number of classes and \(A(i,j)\) is the \(ROC_{AUC}\) of class \(i\) against class \(j\).
The \(MROC_{AUC}\) is essentially an average of averages: for each racial or ethnic class, an \(AUC\) is calculated against each of the other classes, these pairwise \(AUC\)s are averaged, and the \(MROC_{AUC}\) is the average of those pairwise averages across all classes. Similar to the \(AUC\), the \(MROC_{AUC}\) ranges from .5 to 1, with .5 indicating no discriminatory power and 1 indicating perfect discriminatory power.
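In practice, the \(MROC_{AUC}\) does not need to be computed by hand: scikit-learn’s one-vs-one macro average implements the Hand and Till (2001) measure. Here is a sketch on simulated four-class data (the data and model are hypothetical, not the FARS model):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Simulated four-class data with roughly the FARS class proportions.
X, y = make_classification(n_samples=5_000, n_classes=4, n_informative=6,
                           weights=[0.62, 0.14, 0.15, 0.09], random_state=0)
proba = LogisticRegression(max_iter=1_000).fit(X, y).predict_proba(X)

# Pairwise-averaged multiclass AUC (Hand and Till 2001).
mroc_auc = roc_auc_score(y, proba, multi_class="ovo", average="macro")

# The same quantity from the formula: average A(i, j) over all ordered pairs.
C = proba.shape[1]
aucs = []
for i in range(C):
    for j in range(C):
        if i != j:
            mask = np.isin(y, [i, j])        # keep only classes i and j
            aucs.append(roc_auc_score(y[mask] == i, proba[mask, i]))
print(f"sklearn: {mroc_auc:.3f}, by hand: {np.mean(aucs):.3f}")  # should agree
```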
References
Hand, David J., and Robert J. Till. 2001. “A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems.” Machine Learning 45 (2): 171–86.
Footnotes
Here the terms majority and minority are statistical terms that refer to class prevalence within the data set.↩︎