Agreement calculation

This section explains how RedBrick AI calculates inter-annotator agreement between two users.

To compare two sets of labels, RedBrick AI first groups annotation instances by category. Within each category, instances are paired so that the overall agreement score is maximized. Each pair of same-category instances is then scored with one of the following similarity functions, depending on the annotation type.
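The pairing step can be illustrated with a minimal sketch. It assumes the pairwise similarity scores for one category are already available as a matrix, and it searches all possible pairings by brute force, which is only practical for a handful of instances. The function name `best_matching` is hypothetical, and the source does not say which search method RedBrick AI uses or how unmatched instances are scored.

```python
from itertools import permutations

def best_matching(scores):
    """Pair instances from two annotators so the summed similarity is maximal.

    scores[i][j] is the similarity between instance i of user A and
    instance j of user B (both of the same category).
    Returns (pairs, total), where pairs is a list of (i, j) index tuples.
    Brute force: fine for a sketch, too slow for many instances.
    """
    n_a = len(scores)
    n_b = len(scores[0]) if scores else 0
    if n_a == 0 or n_b == 0:
        return [], 0.0

    if n_a <= n_b:
        # Every instance of A gets a distinct partner from B.
        candidates = (list(zip(range(n_a), perm))
                      for perm in permutations(range(n_b), n_a))
    else:
        # Every instance of B gets a distinct partner from A.
        candidates = (list(zip(perm, range(n_b)))
                      for perm in permutations(range(n_a), n_b))

    def total(pairs):
        return sum(scores[i][j] for i, j in pairs)

    best = max(candidates, key=total)
    return best, total(best)


# Pairing greedily on the single best score (0.9) would give a lower total.
pairs, total_score = best_matching([[0.9, 0.8],
                                    [0.85, 0.1]])
print(pairs, total_score)
```

In this example the pairing `(0, 1), (1, 0)` wins because 0.8 + 0.85 exceeds 0.9 + 0.1, which is why the pairs are chosen jointly rather than one at a time.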

Bounding box, Polygon, and Pixel Segmentation

RedBrick AI uses Intersection over Union (IOU) for these annotation types. For two annotations A and B, IOU is the size of their overlap divided by the size of their combined region:

$$IOU = \frac{|A \cap B|}{|A \cup B|}$$
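As an illustration, IOU (intersection area divided by union area) can be computed for two axis-aligned bounding boxes as follows. This is a minimal sketch: the `(x_min, y_min, x_max, y_max)` box format and the function name `box_iou` are assumptions made for the example, not RedBrick AI's internal representation.

```python
def box_iou(a, b):
    """IOU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Width and height of the overlapping rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection

    # Two degenerate (zero-area) boxes are treated as non-overlapping here.
    return intersection / union if union > 0 else 0.0


print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap 1, union 7
```

The same ratio applies to polygons and pixel segmentations, with areas (or pixel or voxel counts) of the intersection and union in place of the rectangle arithmetic.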


Landmarks and Keypoints

For landmarks/keypoints, RedBrick AI uses a normalized Root Mean Squared Error (RMSE) to compute similarity, where $Similarity = 1 - RMSE$.

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(P_i - \hat{P}_i)^2$$

$$RMSE = \sqrt{MSE}$$

Where $n$ is the number of components of the point (2 for 2D, 3 for 3D), and $P_i$ and $\hat{P}_i$ are the components of the two points, each normalized by the width, height, or depth of the image.
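The landmark similarity can be sketched in a few lines. The function name `keypoint_similarity` and the convention of passing raw coordinates together with the image dimensions are assumptions made for this example.

```python
import math

def keypoint_similarity(p, q, dims):
    """Similarity = 1 - RMSE between two points.

    p, q: point coordinates in image units, e.g. (x, y) or (x, y, z).
    dims: image size along each axis, e.g. (width, height[, depth]),
          used to normalize every component.
    """
    n = len(p)  # 2 for 2D points, 3 for 3D points
    mse = sum(((pi - qi) / d) ** 2 for pi, qi, d in zip(p, q, dims)) / n
    return 1.0 - math.sqrt(mse)


# Two landmarks 10 px apart horizontally on a 1000 x 500 image.
print(keypoint_similarity((100, 200), (110, 200), (1000, 500)))
```

Because each component is divided by the matching image dimension, the RMSE stays between 0 and 1 for points inside the image, so the similarity does too.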

Length Measurements

Length measurements are compared through the two points that define each length line, using the landmark technique described above.
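One way to picture this is the sketch below. The source does not state how the two endpoint comparisons are combined, so two assumptions are made here purely for illustration: the endpoint similarities are averaged, and both endpoint orderings are tried so that a line drawn in the opposite direction is not penalized. The function names are hypothetical.

```python
import math

def point_similarity(p, q, dims):
    """1 - normalized RMSE between two points (see the landmark section)."""
    n = len(p)
    mse = sum(((a - b) / d) ** 2 for a, b, d in zip(p, q, dims)) / n
    return 1.0 - math.sqrt(mse)

def length_similarity(line_a, line_b, dims):
    """Compare two length lines, each given as a pair of endpoints.

    Assumption: the score is the mean endpoint similarity, taking the
    better of the two possible endpoint orderings.
    """
    (a1, a2), (b1, b2) = line_a, line_b
    same_order = (point_similarity(a1, b1, dims) + point_similarity(a2, b2, dims)) / 2
    flipped = (point_similarity(a1, b2, dims) + point_similarity(a2, b1, dims)) / 2
    return max(same_order, flipped)


# The same line drawn in opposite directions scores as a perfect match.
print(length_similarity(((0, 0), (100, 0)), ((100, 0), (0, 0)), (200, 200)))
```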

Angle Measurements

For angle measurements, each arm of the measurement is treated as a vector. Each arm of one measurement is compared with the corresponding arm of the other measurement, which yields two angles. The similarity score is then defined by:

$$Similarity = 1 - \frac{\theta_1 + \theta_2}{2\pi}$$

Where $\theta_1$ and $\theta_2$ are the angles between the two sets of measurement arms.
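A minimal sketch of this score follows. It assumes each measurement is supplied as two non-zero arm vectors (vertex to endpoint) and that the arms of the two measurements are already in corresponding order; the function names are hypothetical.

```python
import math

def angle_between(u, v):
    """Angle in radians (0 to pi) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    # Clamp to guard against floating-point drift outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def angle_similarity(arms_a, arms_b):
    """Similarity = 1 - (theta1 + theta2) / (2 * pi).

    arms_a, arms_b: (first_arm, second_arm) vectors for each annotator.
    """
    theta1 = angle_between(arms_a[0], arms_b[0])
    theta2 = angle_between(arms_a[1], arms_b[1])
    return 1.0 - (theta1 + theta2) / (2 * math.pi)


# First arms differ by 90 degrees, second arms coincide.
print(angle_similarity(((1, 0), (0, 1)), ((0, 1), (0, 1))))
```

Since each angle lies between 0 and π, the score ranges from 1 (both arms coincide) down to 0 (both arms point in opposite directions).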


Classification

For classification labels, agreement is binary. If the chosen category and attributes match, the consensus score is 100%; otherwise, it is 0%.

Generating a single score

To generate a single score between two sets of labels, a series of averages is computed:

  1. Scores of matching annotation instances of the same category are averaged to generate a single score per category.

  2. Scores are then averaged per category.

  3. Scores are then averaged per label type to generate a single score per label type.

  4. For videos, scores are calculated per frame and averaged to generate a single score per sequence.

  5. For multi-series studies, scores are averaged by volume to generate a single score per study.
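The chain of averages above can be sketched as nested means. This is an illustration under stated assumptions, not RedBrick AI's implementation: the nested dictionary layout, the function names, the final mean across label types, and the 0.0 returned for an empty label set are all choices made for the example.

```python
from statistics import mean

def label_set_score(pair_scores):
    """Collapse matched-pair scores into one score for a single image or frame.

    pair_scores: {label_type: {category: [score, ...]}}
    """
    per_type = []
    for by_category in pair_scores.values():
        # One score per category: mean of its matched-pair scores.
        per_category = [mean(scores) for scores in by_category.values() if scores]
        if per_category:
            # One score per label type: mean of its category scores.
            per_type.append(mean(per_category))
    # Assumption: label types are averaged into a single number.
    return mean(per_type) if per_type else 0.0

def sequence_score(frames):
    """For videos: average the per-frame scores over a non-empty sequence."""
    return mean([label_set_score(frame) for frame in frames])


frame = {
    "bbox": {"car": [0.8, 0.6], "person": [1.0]},
    "polygon": {"road": [0.5]},
}
print(label_set_score(frame))
print(sequence_score([frame, {"bbox": {"car": [1.0]}}]))
```

In the example frame, the `car` pairs average to 0.7, the `bbox` categories average to 0.85, and combining that with the `polygon` score of 0.5 gives roughly 0.675.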
