
Dealing with Uncertainty in Binary Logistic Regression Fault-proneness Models

Lavazza, Luigi; Morasca, Sandro
2019-01-01

Abstract

Background: Binary Logistic Regression is widely used in Empirical Software Engineering to build estimation models, e.g., fault-proneness models, which estimate the probability that a given module is faulty, based on some measures of the module. Fault-proneness models are then used to build faultiness models, i.e., models that estimate whether a given module is faulty or non-faulty.

Objective: Because of the very nature of Binary Logistic Regression, there is always a range of values of the independent variable in which estimates are very close to the estimates that would be obtained via random estimation. These estimates are hardly accurate and should be regarded as unreliable. For Binary Logistic Regression models used to build faultiness models, i.e., binary classifiers, a range where estimates are inaccurate can be regarded as an "uncertainty" area, where the model is uncertain about the faultiness of the given modules. We define and empirically validate a simple method to identify the uncertainty region.

Method: We compute the standard deviation of a probability estimate provided by a Binary Logistic Regression model. If the random estimate is within the range centered on the estimate and spanning a standard deviation, we regard the estimate as "too close" to the random estimate, hence unreliable. Conversely, estimates that are far enough from the random estimate are considered reliable. This method was tested on 54 datasets from the PROMISE (now SEACRAFT) repository.

Results: Our results show that the variance of estimates can be effectively used to detect the uncertainty region. Estimates in the uncertainty area are rarely statistically significant and are always much less accurate than estimates obtained outside the uncertainty area.

Conclusions: Practitioners and researchers can use our results to assess the reliability of estimates obtained via Binary Logistic Regression models, and reject (or challenge) those estimates that fall in the uncertainty region.
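The Method paragraph lends itself to a short illustration. Below is a minimal Python sketch, not the authors' code, of the uncertainty-region check: it fits a logistic regression with statsmodels, obtains the standard deviation of each probability estimate via the delta method, and flags estimates too close to the random estimate. Two details are assumptions not confirmed by the abstract: that the "random estimate" is the dataset's observed fault rate, and that "spanning a standard deviation" means the interval extends one standard deviation on each side of the estimate.

```python
# Minimal sketch of the uncertainty-region check described in the abstract.
# Assumptions (not confirmed by the abstract): the "random estimate" is the
# observed fault rate, and the range around each estimate has half-width
# equal to one standard deviation of the estimate.
import numpy as np
import statsmodels.api as sm

def uncertainty_flags(X, y):
    """Return a boolean array: True where the estimate falls in the uncertainty region."""
    X = sm.add_constant(np.asarray(X, dtype=float))
    y = np.asarray(y, dtype=float)
    model = sm.Logit(y, X).fit(disp=0)

    p_hat = model.predict(X)  # estimated fault-proneness per module
    # Delta method: sd(p_hat) = p(1 - p) * sd(eta), where eta is the
    # linear predictor and var(eta_i) = x_i' Sigma x_i.
    cov = model.cov_params()
    var_eta = np.einsum("ij,jk,ik->i", X, cov, X)
    sd_p = p_hat * (1.0 - p_hat) * np.sqrt(var_eta)

    p_random = y.mean()  # assumed "random estimate"
    return np.abs(p_hat - p_random) <= sd_p
```

Estimates flagged True would, per the paper's thesis, be rejected or challenged rather than used to classify modules as faulty or non-faulty.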
2019
Proceedings of EASE 2019 - Evaluation and Assessment in Software Engineering
9781450371452
23rd Evaluation and Assessment in Software Engineering Conference, EASE 2019
IT University of Copenhagen, Denmark
2019
Files in this item:
There are no files associated with this item.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11383/2079008

Citations
  • Scopus: 1
  • Web of Science (ISI): 0