What's the BEST performance on your data? 🤖
When tackling a dataset, it's important to know the best possible performance. There's no point studying a solved problem!
If a group of annotators always labeled examples the same way, then models could score perfectly. In practice, only a handful of annotators decide each label, which introduces some randomness into the process. So even the perfect model can't get the perfect score.
We can formalize “the perfect model” as an oracle that knows how people annotate each example on average. This demo reports an estimate of that oracle's score on your data, which we call the BEST (Bayesian Estimated Score Terminus) performance.
Submit your dataset and discover the BEST performance! You can also check out the code, read the paper, or learn some background on how we compute the estimate.
To compute the BEST performance, you'll need a classification dataset with a fixed set of classes and multiple annotators per question. We'll estimate the score from the label counts. To try it out, you can select a preset example below.
| name | metric | score |
| --- | --- | --- |
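Concretely, the label counts are just per-class annotation tallies, one row per example. Here's a small illustrative sketch in Python (the variable name is ours, not something the demo requires):

# Three examples from a two-class task. Each row records how many
# annotators chose each class; e.g. the first example received 1
# annotation for class 0 and 3 annotations for class 1.
label_counts = [
    [1, 3],
    [4, 0],
    [2, 2],
]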
This demo provides a REST API at /api/score. To use it, POST a JSON object with two keys: "labelCounts", a list of per-example annotation counts (one count for each class), and "metrics", a list of the metrics to compute. The available metrics are:
"accuracy"
"balanced accuracy"
"f1 (macro)"
"cross entropy (soft labels)"
Using cURL, you could hit the API as follows:
$ curl \
> --request POST \
> --header "Content-Type: application/json" \
> --data '{"metrics": ["accuracy", "f1 (macro)"], "labelCounts": [[1, 3], [4, 0]]}' \
> $DOMAIN/api/score
[
  {
    "metric": "accuracy",
    "score": 0.8878
  },
  {
    "metric": "f1 (macro)",
    "score": 0.8485666666666668
  }
]
Or using HTTPie:
$ http \
> post \
> $DOMAIN/api/score \
> labelCounts:='[[3, 2], [0, 5]]' \
> metrics:='["accuracy", "f1 (macro)"]'
HTTP/1.0 200 OK
Content-Length: 118
Content-Type: application/json
Date: Wed, 04 Dec 2019 23:15:50 GMT
Server: Werkzeug/0.15.4 Python/3.7.0
[
  {
    "metric": "accuracy",
    "score": 0.7626
  },
  {
    "metric": "f1 (macro)",
    "score": 0.6836
  }
]
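You could also call the API from Python. Here's a minimal sketch using the requests library (our example, not part of the demo itself; DOMAIN is a placeholder for wherever the demo is hosted):

import requests

# Placeholder: replace with the demo's actual base URL.
DOMAIN = "http://localhost:5000"

response = requests.post(
    f"{DOMAIN}/api/score",
    json={
        "metrics": ["accuracy", "f1 (macro)"],
        "labelCounts": [[1, 3], [4, 0]],
    },
)
response.raise_for_status()

# The API returns a list of {"metric": ..., "score": ...} objects.
for result in response.json():
    print(result["metric"], result["score"])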
To formalize “best possible” performance, we define an oracle which knows exactly how people annotate each example on average. For example \(i\), let \(p_{ij}\) be the probability that a random annotator assigns it to class \(j\), \(Y_{ij}\) be the count of annotations placing it in class \(j\), and \(N_i\) be the total number of annotations it received. Using slice syntax to denote vectors, we have: \[ Y_{i:} \sim \operatorname{Multinomial}(p_{i:}, N_i) \] Given a loss, \(\ell\), the oracle predicts as well as possible knowing the class probabilities for each example (\(p\)) but not the annotations that were actually collected (\(Y\)). For ease of computation, the oracle predicts the most probable label for hard-label metrics and the full label distribution for soft-label metrics like cross entropy, even though these predictions might not be optimal for every metric.
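For intuition, here is a rough sketch (ours, not the demo's actual code) of how such an oracle would predict once \(p\) is known:

import numpy as np

def oracle_predictions(p, metric):
    """Predict as the oracle would, given the true class probabilities.

    p is an array of shape (n_examples, n_classes) whose rows are the
    annotator class probabilities p_i: for each example.
    """
    p = np.asarray(p, dtype=float)
    if metric == "cross entropy (soft labels)":
        # For soft-label metrics, the oracle predicts the distribution itself.
        return p
    # For hard-label metrics (accuracy, balanced accuracy, f1 (macro)),
    # the oracle predicts each example's most probable class.
    return p.argmax(axis=-1)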
Since the annotations (\(Y\)) are multinomial, we model the class probabilities (\(p\)) with the conjugate prior, the Dirichlet distribution: \[ p_{i:} \sim \operatorname{Dirichlet}(\alpha) \] Following the empirical Bayesian approach, we fit the prior (\(\alpha\)) via maximum likelihood to obtain \(\hat{\alpha}\). Our estimate, the BEST performance, is then the oracle's expected loss over the posterior: \[ s \coloneqq \mathbb{E}_{p \mid Y, \hat{\alpha}}[\ell(p, Y)] \]
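As a rough sketch of how that estimate could be computed (again ours, not the demo's exact implementation, and assuming for illustration that each example's gold label is the majority vote of its annotations): fit \(\hat{\alpha}\) by maximizing the Dirichlet-multinomial likelihood of the counts, then average the oracle's score over samples of \(p\) drawn from the posterior, which by conjugacy is \(\operatorname{Dirichlet}(\hat{\alpha} + Y_{i:})\) for each example.

import numpy as np
from scipy import optimize, special

def dirichlet_multinomial_nll(alpha, counts):
    """Dirichlet-multinomial negative log-likelihood, up to a constant in alpha.

    The alpha-independent multinomial coefficient is dropped, which does not
    affect the maximum-likelihood estimate.
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1)
    log_likelihoods = (
        special.gammaln(alpha.sum())
        - special.gammaln(totals + alpha.sum())
        + np.sum(special.gammaln(counts + alpha) - special.gammaln(alpha), axis=1)
    )
    return -log_likelihoods.sum()

def fit_alpha(counts, n_classes):
    """Empirical Bayes: fit the Dirichlet prior by maximum likelihood."""
    # Optimize over log(alpha) to keep the prior parameters positive.
    result = optimize.minimize(
        lambda log_alpha: dirichlet_multinomial_nll(np.exp(log_alpha), counts),
        x0=np.zeros(n_classes),
        method="Nelder-Mead",
    )
    return np.exp(result.x)

def best_accuracy(counts, alpha, n_samples=1000, seed=0):
    """Monte Carlo estimate of the oracle's expected accuracy."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    # Assumption for illustration: gold labels are the majority-vote labels.
    gold = counts.argmax(axis=1)
    scores = []
    for _ in range(n_samples):
        # By conjugacy, each example's posterior is Dirichlet(alpha + Y_i:).
        p = np.vstack([rng.dirichlet(alpha + row) for row in counts])
        # The oracle predicts each example's most probable class under p.
        scores.append(np.mean(p.argmax(axis=1) == gold))
    return float(np.mean(scores))

With these pieces, something like best_accuracy(label_counts, fit_alpha(label_counts, n_classes=2)) gives a rough Monte Carlo analogue of the accuracy score the API reports.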
For the mathematical details, check out the paper or the code for computing the BEST performance.
If you build off this work or use the BEST performance, please cite the paper as follows:
@article{Lourie2020Scruples,
    author = {Nicholas Lourie and Ronan Le Bras and Yejin Choi},
    title = {Scruples: A Corpus of Community Ethical Judgments on 32,000 Real-Life Anecdotes},
    journal = {arXiv e-prints},
    year = {2020},
    archivePrefix = {arXiv},
    eprint = {2008.09094},
}