Allen Institute for AI

Scoracle

What's the BEST performance on your data? 🤖

About

When tackling a dataset, it's important to know the best possible performance. There's no point studying a solved problem!

If a group of annotators always labeled examples the same way, then models could score perfectly. In practice, only a handful of annotators decide each label, introducing some randomness in the process. So, even the perfect model can't get the perfect score.

We can formalize “the perfect model” as an oracle that knows how people annotate each example on average. This demo reports an estimate of that oracle's score on your data, which we call the BEST (Bayesian Estimated Score Terminus) performance.

Submit your dataset and discover the BEST performance! You can also check out the code, read the paper, or learn some background on how we compute the estimate.

Demo

To compute the BEST performance, you'll need a classification dataset with a fixed set of classes and multiple annotators per example. We'll estimate the score from the label counts. To try it out, you can select a preset example below.

Upload Data

Provide the data as a matrix of integers whose \(ij^{th}\) entry gives the number of annotators that placed example \(i\) in class \(j\).
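
For instance, if you start from raw per-annotator labels, you can tabulate them into this matrix yourself. Here is a minimal Python sketch (the class names and variables are made up for illustration):


# Tabulate raw per-annotator labels into a label counts matrix.
from collections import Counter

CLASSES = ["yes", "no"]  # the fixed set of classes (hypothetical)

# one list of annotator labels per example
raw_annotations = [
    ["yes", "yes", "no"],      # example 0
    ["no", "no", "no", "no"],  # example 1
]

# label_counts[i][j] = number of annotators placing example i in class j
label_counts = [
    [Counter(labels)[c] for c in CLASSES]
    for labels in raw_annotations
]

print(label_counts)  # [[2, 1], [0, 4]]
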

View Results

The BEST performance estimates appear in a table with columns name, metric, and score.

API

Documentation

This demo provides a REST API at /api/score. To use it, POST a JSON object with the following keys:

metrics
A list of strings representing metrics to compute. Supported metrics include:
  • "accuracy"
  • "balanced accuracy"
  • "f1 (macro)"
  • "cross entropy (soft labels)"
labelCounts
An array of arrays of integers, where the \(ij^{th}\) entry is the number of annotators that placed example \(i\) in class \(j\).

Examples

Using cURL, you could hit the API as follows:


$ curl \
>   --request POST \
>   --header "Content-Type: application/json" \
>   --data '{"metrics": ["accuracy", "f1 (macro)"], "labelCounts": [[1, 3], [4, 0]]}' \
>   $DOMAIN/api/score
[
  {
    "metric": "accuracy",
    "score": 0.8878
  },
  {
    "metric": "f1 (macro)",
    "score": 0.8485666666666668
  }
]
          

Or using HTTPie:


$ http \
>   post \
>   $DOMAIN/api/score \
>   labelCounts:='[[3, 2], [0, 5]]' \
>   metrics:='["accuracy", "f1 (macro)"]'
HTTP/1.0 200 OK
Content-Length: 118
Content-Type: application/json
Date: Wed, 04 Dec 2019 23:15:50 GMT
Server: Werkzeug/0.15.4 Python/3.7.0

[
    {
        "metric": "accuracy",
        "score": 0.7626
    },
    {
        "metric": "f1 (macro)",
        "score": 0.6836
    }
]
          
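Or, from Python, you could use the requests library. This is only a sketch: DOMAIN stands in for wherever the demo is hosted, and requests must be installed separately.


# POST label counts and metrics to the scoring endpoint.
import requests

DOMAIN = "http://localhost:5000"  # placeholder for the demo's domain

response = requests.post(
    f"{DOMAIN}/api/score",
    json={
        "metrics": ["accuracy", "f1 (macro)"],
        "labelCounts": [[1, 3], [4, 0]],
    },
)
response.raise_for_status()

# the response is a list of {"metric": ..., "score": ...} objects
for result in response.json():
    print(result["metric"], result["score"])
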

Background

This section provides some quick background on how we define and estimate the BEST performance for a dataset. For an implementation, see the code. For more information, check out the paper.

Defining the Oracle

To formalize “best possible” performance, we define an oracle that knows exactly how people annotate each example on average. For example \(i\), let \(p_{ij}\) be the probability that a random annotator assigns it to class \(j\), \(Y_{ij}\) be the count of annotations placing it in class \(j\), and \(N_i\) be the total number of annotations it has. Using slice syntax to denote vectors, we have: \[ Y_{i:} \sim \operatorname{Multinomial}(p_{i:}, N_i) \] Given a loss \(\ell\), the oracle predicts as well as possible knowing the class probabilities for each example (\(p\)) but not the annotations that were actually collected (\(Y\)). For ease of computation, the oracle predicts either the most probable label or the full label distribution, whichever makes more sense for the metric, even though these predictions might not be optimal for every metric.
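
Concretely, for hard metrics such as accuracy the oracle predicts each example's most probable class, while for soft metrics like cross entropy it predicts \(p_{i:}\) itself. The toy Python sketch below illustrates this; it is not the repository's implementation, the numbers are invented, and scoring against majority-vote labels is one plausible choice of loss:


import numpy as np

# class probabilities the oracle is assumed to know
p = np.array([
    [0.7, 0.3],  # example 0: a random annotator picks class 0 w.p. 0.7
    [0.1, 0.9],  # example 1: a random annotator picks class 1 w.p. 0.9
])

# observed annotation counts, which the oracle does NOT get to see
Y = np.array([
    [2, 1],
    [0, 4],
])

hard_predictions = p.argmax(axis=-1)  # used for accuracy, F1, ...
soft_predictions = p                  # used for cross entropy on soft labels

# e.g., scoring the oracle's accuracy against majority-vote labels from Y
majority_labels = Y.argmax(axis=-1)
print((hard_predictions == majority_labels).mean())  # 1.0
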

The BEST Performance: Estimating the Oracle's Score

Since the annotations (\(Y\)) are multinomial, we model the class probabilities (\(p\)) with their conjugate prior, the Dirichlet distribution: \[ p_{i:} \sim \operatorname{Dirichlet}(\alpha) \] Following the empirical Bayesian approach, we fit the prior (\(\alpha\)) via maximum likelihood to obtain \(\hat{\alpha}\). Our estimate, the BEST performance, is then the oracle's expected loss under the posterior: \[ s \coloneqq \mathbb{E}_{p \mid Y, \hat{\alpha}}[\ell(p, Y)] \]
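
This expectation can be approximated by Monte Carlo: sample \(p_{i:}\) from the posterior \(\operatorname{Dirichlet}(\hat{\alpha} + Y_{i:})\), score the oracle's predictions, and average. The sketch below illustrates the idea for accuracy under simplifying assumptions (a hand-picked symmetric prior instead of the maximum likelihood \(\hat{\alpha}\), and majority-vote labels derived from \(Y\)); the actual computation lives in the code.


import numpy as np

rng = np.random.default_rng(0)

Y = np.array([[1, 3], [4, 0]])       # label counts (examples x classes)
alpha = np.ones(Y.shape[1])          # stand-in for the fitted prior alpha-hat
majority_labels = Y.argmax(axis=-1)  # evaluate against majority-vote labels
n_samples = 10_000

scores = []
for _ in range(n_samples):
    # sample class probabilities from the posterior Dirichlet(alpha + Y_i)
    p = np.vstack([rng.dirichlet(alpha + y) for y in Y])
    # the oracle predicts each example's most probable class
    oracle_predictions = p.argmax(axis=-1)
    scores.append((oracle_predictions == majority_labels).mean())

print(np.mean(scores))  # Monte Carlo estimate of the BEST accuracy
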

For the mathematical details, check out the paper or the code for computing the BEST performance.

Cite

If you build off this work or use the BEST performance, please cite the paper as follows:


@article{Lourie2020Scruples,
    author = {Nicholas Lourie and Ronan Le Bras and Yejin Choi},
    title = {Scruples: A Corpus of Community Ethical Judgments on 32,000 Real-Life Anecdotes},
    journal = {arXiv e-prints},
    year = {2020},
    archivePrefix = {arXiv},
    eprint = {2008.09094},
}