Comparing φ and the F-measure as Performance Metrics for Software-related Classifications

L. Lavazza; S. Morasca

2022-01-01

Abstract

Context: The F-measure has been widely used as a performance metric when selecting binary classifiers for prediction, but it has also been widely criticized, especially given the availability of alternatives such as φ (also known as Matthews Correlation Coefficient).

Objectives: Our goals are to (1) investigate possible issues related to the F-measure in depth and show how φ can address them, and (2) explore the relationships between the F-measure and φ.

Method: Based on the definitions of φ and the F-measure, we derive a few mathematical properties of these two performance metrics and of the relationships between them. To demonstrate the practical effects of these mathematical properties, we illustrate the outcomes of an empirical study involving 70 Empirical Software Engineering datasets and 837 classifiers.

Results: We show that φ can be defined as a function of Precision and Recall, which are the only two performance metrics used to define the F-measure, and the rate of actually positive software modules in a dataset. Also, φ can be expressed as a function of the F-measure and the rates of actual and estimated positive software modules. We derive the minimum and maximum value of φ for any given value of the F-measure, and the conditions under which both the F-measure and φ rank two classifiers in the same order.

Conclusions: Our results show that φ is a sensible and useful metric for assessing the performance of binary classifiers. We also recommend that the F-measure should not be used by itself to assess the performance of a classifier, but that the rate of positives should always be specified as well, at least to assess if and to what extent a classifier performs better than random classification. The mathematical relationships described here can also be used to reinterpret the conclusions of previously published papers that relied mainly on the F-measure as a performance metric.
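The relationship stated in the Results can be illustrated with a minimal Python sketch. It is not taken from the article: the function names and the two confusion matrices are hypothetical, and the closed form in `phi_from_pr` is derived here from the standard definitions of Precision, Recall, and φ, so the article's own formulation may be written differently. The sketch assumes all denominators are non-zero.

```python
import math


def precision_recall(tp, fp, fn):
    """Precision and Recall from confusion-matrix counts."""
    return tp / (tp + fp), tp / (tp + fn)


def f_measure(precision, recall):
    """F-measure (F1): harmonic mean of Precision and Recall."""
    return 2 * precision * recall / (precision + recall)


def phi_from_counts(tp, fp, fn, tn):
    """phi (Matthews Correlation Coefficient) from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom


def phi_from_pr(precision, recall, rho):
    """phi from Precision, Recall, and rho, the rate of actual positives.

    sigma, the rate of estimated positives, follows from
    TP = recall * AP and EP = TP / precision.
    """
    sigma = recall * rho / precision
    numerator = recall * rho - sigma * rho
    return numerator / math.sqrt(sigma * rho * (1 - sigma) * (1 - rho))


# Two made-up confusion matrices with identical Precision and Recall
# but different rates of actual positives.
examples = {
    "balanced": dict(tp=40, fp=10, fn=10, tn=40),
    "mostly positive": dict(tp=40, fp=10, fn=10, tn=0),
}

for name, cm in examples.items():
    p, r = precision_recall(cm["tp"], cm["fp"], cm["fn"])
    rho = (cm["tp"] + cm["fn"]) / sum(cm.values())
    print(
        f"{name}: F={f_measure(p, r):.2f} "
        f"phi(counts)={phi_from_counts(**cm):.2f} "
        f"phi(P,R,rho)={phi_from_pr(p, r, rho):.2f}"
    )
```

In both made-up cases the F-measure is 0.8, while φ is about 0.6 for the balanced data and about -0.2 for the mostly positive data. The F-measure ignores true negatives, which is why the rate of positives is needed to interpret it.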

Year: 2022
Year of online publication: 2022
Journal: EMPIRICAL SOFTWARE ENGINEERING
URL: https://link.springer.com/article/10.1007/s10664-022-10199-2
DOI: https://dx.doi.org/10.1007/s10664-022-10199-2
Web of Science code: WOS:000862564000002
Scopus code: 2-s2.0-85139262470
Keywords: Binary classification · Software defect prediction · Performance evaluation · Performance metrics · Matthews Correlation Coefficient · F-measure · F-score
All authors: Lavazza, L.; Morasca, S.
Publication type: Journal article

Files in this item:

File	Access	Description	Type	License	Size	Format
EMSE_2022.pdf	Open access	Article	Publisher's version (PDF)	Creative Commons	2.96 MB	Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/11383/2140856
