BRAVO: A Workflow for Improving Rating Reliability in Behavioural Research

Authors

  • Damien Neadle, Birmingham City University, Department of Psychology, Birmingham, UK; University of Birmingham, School of Psychology, Birmingham, UK
  • Alba Motes-Rodrigo, University of Lausanne, Department of Ecology and Evolution, Lausanne, Switzerland
  • Sarah R. Beck, University of Birmingham, School of Psychology, Birmingham, UK
  • Claudio Tennie, University of Tübingen, WG Early Prehistory and Quaternary Ecology, Tübingen, Germany; Words, Bones, Genes, and Tools: DFG Center for Advanced Studies, Tübingen, Germany

Keywords

reliability, replication crisis, classifier validity, simulation, behavioural sciences

Abstract

Reliability assessments are a quality-control protocol commonly employed in fields of research that deal with video-recorded behavioural data. During these assessments, the same sample of videos is coded (at least) twice, either by the same researcher (intrarater reliability) or, more often, by two different researchers independently (interrater reliability). Levels of agreement between the codings are then quantified to assess how reliable the behavioural classification is. In this manuscript, we concentrate on interrater reliability, though our points generally hold for both cases. Despite the importance of interrater reliability assessments for ensuring research quality, to the best of our knowledge there is to date no guideline specifying how they should be conducted so as to avoid the potentially detrimental effects of ‘coders’ degrees of freedom’ (CDF) and ‘questionable coder practices’ (QCP). For instance, there is no consensus regarding how large the sample of evaluated behaviours should be, what its composition should be, whether negative controls should be included, or which statistical measures should be used to compare the raters’ classifications. To begin to fill this methodological gap, we provide a list of best practices for conducting reliability tests, which we term the BRAVO (Balanced Reliability Assessment of Video Observations) workflow. We complement these recommendations with a series of simulations highlighting the properties of BRAVO and its use cases. BRAVO represents a first step towards a methodological gold standard that researchers can use to perform valid reliability assessments. Given the widespread use of behavioural data across fields, we hope that the BRAVO workflow will be adopted by researchers from disciplines such as psychology, ethology, behavioural economics, and anthropology to increase quality control and scientific transparency.
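
To make the agreement measures mentioned in the abstract concrete, the minimal Python sketch below (our illustration, not code or data from the paper) computes Cohen's kappa, one widely used statistic for comparing two raters' categorical classifications; the behaviour labels and ratings are hypothetical.

from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    n = len(rater_a)
    # Observed agreement: proportion of items both raters coded identically.
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: overlap expected from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (p_obs - p_chance) / (1 - p_chance)

# Hypothetical codings of ten video clips by two independent raters.
rater_1 = ["groom", "play", "rest", "play", "groom",
           "rest", "rest", "play", "groom", "rest"]
rater_2 = ["groom", "play", "rest", "groom", "groom",
           "rest", "play", "play", "groom", "rest"]
print(f"Cohen's kappa: {cohens_kappa(rater_1, rater_2):.2f}")  # prints 0.70

Kappa corrects raw percentage agreement for the agreement expected by chance given each rater's label frequencies: values near 1 indicate strong interrater reliability, while values near 0 indicate agreement no better than chance.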

Published

2025-04-16
