
Dealing with Uncertainty in Binary Logistic Regression Fault-proneness Models

Lavazza, Luigi; Morasca, Sandro
2019-01-01

Abstract

Background: Binary Logistic Regression is widely used in Empirical Software Engineering to build estimation models, e.g., fault-proneness models, which estimate the probability that a given module is faulty, based on some measures of the module. Fault-proneness models are then used to build faultiness models, i.e., models that estimate whether a given module is faulty or non-faulty.

Objective: Because of the very nature of Binary Logistic Regression, there is always a range of values of the independent variable in which estimates are very close to the estimates that would be obtained via random estimation. These estimates are hardly accurate and should be regarded as unreliable. For Binary Logistic Regression models used to build faultiness models, i.e., binary classifiers, a range where estimates are inaccurate can be regarded as an "uncertainty" area, where the model is uncertain about the faultiness of the given modules. We define and empirically validate a simple method to identify the uncertainty region.

Method: We compute the standard deviation of a probability estimate provided by a Binary Logistic Regression model. If the random estimate is within the range centered on the estimate and spanning a standard deviation, we regard the estimate as "too close" to the random estimate, hence unreliable. Conversely, estimates that are far enough from the random estimate are considered reliable. This method was tested on 54 datasets from the PROMISE (now SEACRAFT) repository.

Results: Our results show that the variance of estimates can be effectively used to detect the uncertainty region. Estimates in the uncertainty area are rarely statistically significant and are always much less accurate than estimates obtained outside the uncertainty area.

Conclusions: Practitioners and researchers can use our results to assess the reliability of estimates obtained via Binary Logistic Regression models, and reject (or challenge) those estimates that fall in the uncertainty region.
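The Method paragraph lends itself to a short illustration. Below is a minimal Python sketch, not the authors' code, of the uncertainty-region check: it fits a logistic regression with statsmodels, obtains the standard deviation of each probability estimate via the delta method, and flags estimates too close to the random estimate. Two details are assumptions not confirmed by the abstract: that the "random estimate" is the dataset's observed fault rate, and that "spanning a standard deviation" means the interval extends one standard deviation on each side of the estimate.

```python
# Minimal sketch of the uncertainty-region check described in the abstract.
# Assumptions (not confirmed by the abstract): the "random estimate" is the
# observed fault rate, and the range around each estimate has half-width
# equal to one standard deviation of the estimate.
import numpy as np
import statsmodels.api as sm

def uncertainty_flags(X, y):
    """Return a boolean array: True where the estimate falls in the uncertainty region."""
    X = sm.add_constant(np.asarray(X, dtype=float))
    y = np.asarray(y, dtype=float)
    model = sm.Logit(y, X).fit(disp=0)

    p_hat = model.predict(X)  # estimated fault-proneness per module
    # Delta method: sd(p_hat) = p(1 - p) * sd(eta), where eta is the
    # linear predictor and var(eta_i) = x_i' Sigma x_i.
    cov = model.cov_params()
    var_eta = np.einsum("ij,jk,ik->i", X, cov, X)
    sd_p = p_hat * (1.0 - p_hat) * np.sqrt(var_eta)

    p_random = y.mean()  # assumed "random estimate"
    return np.abs(p_hat - p_random) <= sd_p
```

Estimates flagged True would, per the paper's thesis, be rejected or challenged rather than used to classify modules as faulty or non-faulty.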
2019
Proceedings of EASE 2019 - Evaluation and Assessment in Software Engineering
9781450371452
23rd Evaluation and Assessment in Software Engineering Conference, EASE 2019
IT University of Copenhagen, Denmark
2019
Files in this item:
There are no files associated with this item.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11383/2079008

Citations
  • Scopus: 1
  • Web of Science (ISI): 0