Repeatability and Reproducibility of Comparison Decisions by Firearms Examiners

A study assesses the consistency of firearms examiners, finding high reliability for definitive Identification and Elimination conclusions but significant variability in the use of the subjective "Inconclusive" category.

Simplyforensic

Repeatability and reproducibility are often cited as the hallmarks of good science. In forensic science, establishing these characteristics for pattern comparison disciplines—where the examiner is the de facto “instrument”—is crucial for courtroom admissibility. Forensic firearms and toolmark analysis, which involves matching microscopic striations on bullets and cartridge cases, is one such discipline. A comprehensive new study, published in the Journal of Forensic Sciences, provides key data on the reliability of firearms examiners, finding that while definitive conclusions are highly consistent, the indeterminate range remains subjective.

The Examiner as the ‘Instrument’ of Comparison

In pattern comparison, the reliability of the technique is directly linked to the consistency of the human examiner. This study aims to quantify that consistency in forensic firearms examiners by measuring both:

  • Repeatability: The ability of one examiner to make the same decision when re-examining the same material.
  • Reproducibility: The ability of two different examiners to come to the same conclusion when independently evaluating the same material.
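Both measures reduce to the same computation: the fraction of paired conclusions that agree. A minimal sketch in Python, using hypothetical toy data (the labels follow the AFTE range, but none of the names or values come from the study):

```python
# Agreement rate over paired conclusions. Toy illustration only;
# not the study's data or analysis code.

def agreement_rate(pairs):
    """Fraction of (first, second) conclusion pairs that are identical.

    For repeatability, each pair is one examiner's original and repeat
    decision; for reproducibility, it is two different examiners'
    decisions on the same comparison.
    """
    if not pairs:
        return 0.0
    return sum(1 for a, b in pairs if a == b) / len(pairs)

# Hypothetical re-examinations by a single examiner (repeatability).
repeat_pairs = [
    ("Identification", "Identification"),
    ("Elimination", "Inconclusive-B"),
    ("Identification", "Identification"),
    ("Inconclusive-A", "Inconclusive-B"),
]
print(f"repeatability: {agreement_rate(repeat_pairs):.1%}")  # prints 50.0%
```

The same function scores reproducibility; only the source of the pairs changes (two examiners instead of one examiner twice).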

The data gathered uses the AFTE Range of Conclusions (Identification, Inconclusive A, B, or C, Elimination, or Unsuitable), which offers a granular view of examiner decisions.

The Research: Quantifying Consistency Across Examiners

The study was based on thousands of comparisons of both bullets and cartridge cases, with test sets blindly resubmitted to examiners.

Methodology: Repeatability vs. Reproducibility

The researchers used comparison sets from three different types of firearms. The same examiner re-examined samples for repeatability (over 5,700 comparisons by 105 examiners), and different examiners compared the same material for reproducibility (over 5,700 comparisons by over 190 examiner pairs). The analysis focused purely on the agreement of the paired conclusions, not their overall accuracy.

Key Findings: High Reliability, Low Consistency in the Gray Area

The data demonstrated a clear trend. Definitive conclusions are highly reliable, but the indeterminate range is not:

  • Repeatability (Same Examiner): Averaged over bullets and cartridge cases, repeatability was high: 78.3% for known matches (Identification) and 64.5% for known nonmatches (Elimination). Disagreements were predominantly between a definitive decision and an Inconclusive category.
  • Reproducibility (Different Examiners): Consistency between different examiners was lower, averaging 67.3% for known matches and 36.5% for known nonmatches.
  • Reliability of Definitive Calls: The reliability of Identification and Elimination conclusions was high; instances of contradictory definitive decisions (ID to Elimination or vice versa) were rare (around 0.11% to 2.68% depending on the comparison).
  • The Inconclusive Problem: The vast majority of disagreements were contained within the Inconclusive categories. When the three sub-levels of Inconclusive were pooled into a single category, agreement increased substantially, particularly for nonmatching sets. This highlights that the subjectivity is not in the final definitive call, but in choosing the level of Inconclusive to report.
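The pooling effect described above can be illustrated with a hypothetical sketch: collapse Inconclusive A/B/C into a single "Inconclusive" label and recompute agreement. The function and data names are assumptions, not the study's code, but they show why sub-level disagreements disappear under pooling:

```python
# Toy illustration of pooling Inconclusive sub-levels (A/B/C) into one
# category before scoring agreement. Hypothetical data, not the study's.

def pool_inconclusive(label):
    """Map Inconclusive-A/B/C onto a single 'Inconclusive' label."""
    return "Inconclusive" if label.startswith("Inconclusive") else label

def agreement_rate(pairs):
    """Fraction of conclusion pairs that are identical."""
    return sum(a == b for a, b in pairs) / len(pairs)

pairs = [
    ("Inconclusive-A", "Inconclusive-B"),  # disagree at sub-level only
    ("Inconclusive-C", "Inconclusive-C"),
    ("Elimination", "Inconclusive-A"),     # disagreement survives pooling
    ("Identification", "Identification"),
]

raw = agreement_rate(pairs)
pooled = agreement_rate([(pool_inconclusive(a), pool_inconclusive(b))
                         for a, b in pairs])
print(f"raw: {raw:.0%}, pooled: {pooled:.0%}")  # prints raw: 50%, pooled: 75%
```

Only the pairs that differ solely in their Inconclusive sub-level are reclassified as agreements, which is exactly the pattern the study reports: pooling raises agreement substantially without masking any definitive disagreement.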

The Need for Standardizing the ‘Inconclusive’ Call

This research is vital because it scientifically validates concerns about the subjective gray area of firearms and toolmark analysis. The data confirm that examiners are reliable in their core function, but the consistency of their reporting needs refinement.

The Statistical Value of Definitive Conclusions

The study’s finding that definitive conclusions are rarely reversed and rarely confused for their opposite is a crucial piece of data that supports the efficacy of firearms and toolmark analysis in court. However, the high variability in the Inconclusive category presents a challenge to transparency and trustworthiness. It suggests that while the AFTE Range of Conclusions provides a framework, the middle ground relies too heavily on individual judgment, which is detrimental to the scientific rigor of the discipline.

Lessons from STR DNA Analysis

The issues found in toolmark analysis—inconsistent grading and subjectivity in non-definitive calls—mirror historical challenges in other pattern-comparison disciplines. As a senior DNA analyst experienced in STR DNA analysis, I recognize the parallel. Our field addressed similar issues in interpreting complex DNA mixtures by moving toward standardized, statistical software that removed much of the subjectivity. The high disagreement rate within the Inconclusive categories strongly suggests that forensic firearms and toolmark analysis would benefit from implementing objective, statistically driven decision models to guide or replace the subjective gradations of Inconclusive opinions.

My Perspective: Upholding Confidence in Pattern Evidence

This research is an essential step in upholding the confidence placed in pattern evidence. It confirms that the underlying principle of individualization is sound, but that the process by which examiners report uncertainty needs to be standardized. By embracing objective approaches (such as the category pooling demonstrated here), the field can reduce subjective variability and ensure that the evidence presented is not only accurate but also consistent across all laboratories and examiners.


Conclusion

This study provides compelling evidence that the repeatability and reproducibility of forensic firearms examiners are high for definitive conclusions, demonstrating their essential reliability. However, the research also reveals a significant source of variability within the subjective Inconclusive categories of the AFTE Range of Conclusions. These findings underscore the necessity for the firearms and toolmark analysis community to refine its decision-making framework, moving toward standardized, statistically guided protocols to ensure consistency and transparency in all reported conclusions.

Original Research Paper

Monson, K. L., Smith, E. D., & Peters, E. M. (2023). Repeatability and reproducibility of comparison decisions by firearms examiners. Journal of Forensic Sciences, 68(5), 1721-1740. https://doi.org/10.1111/1556-4029.15318

Term Definitions

  • Firearms and Toolmark Analysis: The forensic discipline that examines and compares microscopic markings on bullets, cartridge cases, and other objects to link them to a specific firearm or tool.
  • Repeatability: The degree of consistency shown by a single examiner when comparing the same evidence multiple times.
  • Reproducibility: The degree of consistency shown when different examiners evaluate the same evidence and reach the same conclusion.
  • Reliability (Test-Retest Reliability): The overall consistency of a measurement method, encompassing both repeatability and reproducibility.
  • AFTE Range of Conclusions (Association of Firearm & Tool Mark Examiners): The standardized set of possible conclusions used by examiners, including Identification, Elimination, and Inconclusive (often with sub-levels A, B, and C).
  • Inconclusive Call: A conclusion where the examiner cannot definitively identify or eliminate a source, often deemed the “gray area” of subjective judgment.
A forensic analyst by profession, striving through Simplyforensic.com to provide a one-stop platform with accessible, reliable, and media-rich content related to forensic science. Educational background: B.Sc. in Biotechnology and M.Sc. in Forensic Science.