Common problems with the usage of F-measure and accuracy metrics in medical research



Lavazza, Luigi; Morasca, Sandro

2023-01-01

Abstract

Problem: Binary classifiers are widely used in medical research, especially for diagnosis. They are usually evaluated via performance metrics computed from confusion matrices. Accuracy and F-measure are among the most frequently used performance metrics, but they make implicit assumptions and do not take into account important characteristics of classifiers. As a consequence, evaluations based on Accuracy or F-measure may turn out to be incorrect, unreliable, or inadequate for the specific application context. The use of Accuracy and F-measure is particularly critical in the medical domain, where selecting a sub-optimal classifier may lead to incorrect diagnoses, with potentially serious or even fatal consequences.

Aim: We investigated whether the improper or naive use of Accuracy and F-measure can lead to partial or incorrect evaluations. If so, a procedure is needed to reinterpret the conclusions reported in research articles, whenever possible.

Method: After discussing a few important properties of Accuracy and F-measure, we examine a set of representative research articles to assess their conclusions, and illustrate a procedure to reinterpret those conclusions.

Results: The conclusions of the examined research articles appear to be largely affected by the performance metrics used, which in some cases lead to very misleading conclusions. Applying the proposed procedure makes it possible to retrieve the confusion matrices and derive reliable indications of the classifiers' performance.

Conclusion: F-measure and Accuracy should be used with care, with awareness of their characteristics and limits. We recommend that future evaluations of binary classifiers be accompanied by the complete confusion matrices, so that users can formulate evaluations based on specific contexts and priorities.
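To make the abstract's concern concrete, the following minimal sketch computes Accuracy and F-measure (here the F1 variant) from a confusion matrix using their standard definitions. The `ConfusionMatrix` class and the two example matrices are hypothetical illustrations, not data or code from the article, and the sketch does not reproduce the authors' reinterpretation procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Counts for a binary classifier (hypothetical helper, not from the article)."""
    tp: int  # true positives: diseased patients flagged as diseased
    fp: int  # false positives: healthy patients flagged as diseased
    fn: int  # false negatives: diseased patients missed
    tn: int  # true negatives: healthy patients correctly cleared

    def accuracy(self) -> float:
        total = self.tp + self.fp + self.fn + self.tn
        return (self.tp + self.tn) / total if total else 0.0

    def precision(self) -> float:
        denom = self.tp + self.fp
        return self.tp / denom if denom else 0.0

    def recall(self) -> float:
        denom = self.tp + self.fn
        return self.tp / denom if denom else 0.0

    def f1(self) -> float:
        # Harmonic mean of precision and recall; note that tn never appears.
        denom = 2 * self.tp + self.fp + self.fn
        return 2 * self.tp / denom if denom else 0.0


# Made-up example: 1000 patients, 50 of whom have the condition.
always_negative = ConfusionMatrix(tp=0, fp=0, fn=50, tn=950)   # never diagnoses anyone
screening = ConfusionMatrix(tp=45, fp=100, fn=5, tn=850)       # catches 45 of 50 cases

# Same tp, fp, fn as `screening`, but with no true negatives at all.
screening_no_tn = ConfusionMatrix(tp=45, fp=100, fn=5, tn=0)

if __name__ == "__main__":
    for name, cm in [("always_negative", always_negative),
                     ("screening", screening),
                     ("screening_no_tn", screening_no_tn)]:
        print(f"{name}: accuracy={cm.accuracy():.3f} "
              f"recall={cm.recall():.3f} f1={cm.f1():.3f}")
```

With these made-up counts, Accuracy ranks the classifier that never detects a case above the one that detects 45 of 50, while F1 is unchanged when the true negatives are removed. Both effects are only visible when the full confusion matrix is available.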


Year: 2023
Year of online publication: 2023
Journal: IEEE Access
URL: https://doi.org/10.1109/access.2023.3278996
DOI: https://dx.doi.org/10.1109/ACCESS.2023.3278996
Web of Science code: WOS:001005995300001
Scopus code: 2-s2.0-85161000395
Keywords: Accuracy, binary classifiers, F-measure, F-score, performance metrics.
All authors: Lavazza, Luigi; Morasca, Sandro
Appears in types: Journal article

Files in this item:

File	Size	Format
IEEE_Access_2023.pdf (open access; Description: Main article; Type: Publisher's version (PDF); License: Creative Commons)	928.17 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11383/2155431
