On the reliability of the area under the ROC curve in empirical software engineering

IRIS - Institutional Research Information System
IRIS è il sistema di gestione integrata dei dati della ricerca (persone, progetti, pubblicazioni, attività) adottato dall'Università degli Studi dell’Insubria.

IRInSubria - Institutional Repository Insubria
IRInSubria raccoglie, conserva, documenta e dissemina le informazioni sulla produzione scientifica dell'Università degli Studi dell’Insubria anche ai fini della valutazione della ricerca.

Binary classifiers are commonly used in software engineering research to estimate several software qualities, e.g., defectiveness or vulnerability. Thus, it is important to adequately evaluate how well binary classifiers perform, before they are used in practice. The Area Under the Curve (AUC) of Receiver Operating Characteristic curves has often been used to this end. However, AUC has been the target of some criticisms, so it is necessary to evaluate under what conditions and to what extent AUC can be a reliable performance metric. We analyze AUC in relation to ϕ (also known as Matthews Correlation Coefficient), often considered a more reliable performance metric, by building the lines in the ROC space with constant value of ϕ, for several values of ϕ, and computing the corresponding values of AUC. By their very definitions, AUC and ϕ depend on the prevalence ρ of a dataset, which is the proportion of its positive instances (e.g., the defective software modules). Hence, so does the relationship between AUC and ϕ. It turns out that AUC and ϕ are very well correlated, and therefore provide concordant indications, for balanced datasets (those with ρ ≃ 0.5). Instead, AUC tends to become quite large, and hence provide over-optimistic indications, for very imbalanced datasets (those with ρ ≃ 0 or ρ ≃ 1). We use examples from the software engineering literature to illustrate the analytical relationship linking AUC, ϕ, and ρ. We show that, for some values of ρ, the evaluation of performance based exclusively on AUC can be deceiving. In conclusion, this paper provides some guidelines for an informed usage and interpretation of AUC.

On the reliability of the area under the ROC curve in empirical software engineering

Lavazza, Luigi;Morasca, Sandro;Rotoloni, Gabriele

2023-01-01

Abstract

Binary classifiers are commonly used in software engineering research to estimate several software qualities, e.g., defectiveness or vulnerability. Thus, it is important to adequately evaluate how well binary classifiers perform, before they are used in practice. The Area Under the Curve (AUC) of Receiver Operating Characteristic curves has often been used to this end. However, AUC has been the target of some criticisms, so it is necessary to evaluate under what conditions and to what extent AUC can be a reliable performance metric. We analyze AUC in relation to ϕ (also known as Matthews Correlation Coefficient), often considered a more reliable performance metric, by building the lines in the ROC space with constant value of ϕ, for several values of ϕ, and computing the corresponding values of AUC. By their very definitions, AUC and ϕ depend on the prevalence ρ of a dataset, which is the proportion of its positive instances (e.g., the defective software modules). Hence, so does the relationship between AUC and ϕ. It turns out that AUC and ϕ are very well correlated, and therefore provide concordant indications, for balanced datasets (those with ρ ≃ 0.5). Instead, AUC tends to become quite large, and hence provide over-optimistic indications, for very imbalanced datasets (those with ρ ≃ 0 or ρ ≃ 1). We use examples from the software engineering literature to illustrate the analytical relationship linking AUC, ϕ, and ρ. We show that, for some values of ρ, the evaluation of performance based exclusively on AUC can be deceiving. In conclusion, this paper provides some guidelines for an informed usage and interpretation of AUC.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno
	
				2023
			
	Titolo del volume
	
				EASE '23: Proceedings of the 27th International Conference on Evaluation and Assessment in Software Engineering
			
	ISBN
	
				9798400700446
			
	Titolo del congresso
	
				EASE '23: the 27th International Conference on Evaluation and Assessment in Software Engineering
			
	Luogo del Congresso
	
				Oulu, Finland
			
	Data del Congresso
	
				June 14-16, 2023 |
			
	Appare nelle tipologie:
	
				Relazione (in Volume)

File in questo prodotto:

File	Dimensione	Formato
paper_PRE.pdf non disponibili Descrizione: articolo principale Tipologia: Documento in Pre-print Licenza: Copyright dell'editore Dimensione 626.49 kB Formato Adobe PDF Visualizza/Apri Richiedi una copia	626.49 kB	Adobe PDF	Visualizza/Apri Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11383/2155551

Citazioni

ND

10

4

social impact