Dealing with Uncertainty in Binary Logistic Regression Fault-proneness Models
Lavazza, Luigi; Morasca, Sandro
2019-01-01
Abstract
Background: Binary Logistic Regression is widely used in Empirical Software Engineering to build estimation models, e.g., fault-proneness models, which estimate the probability that a given module is faulty based on measures of the module. Fault-proneness models are then used to build faultiness models, i.e., models that estimate whether a given module is faulty or non-faulty.

Objective: Because of the very nature of Binary Logistic Regression, there is always a range of values of the independent variable in which estimates are very close to those that would be obtained via random estimation. These estimates are hardly accurate and should be regarded as unreliable. For Binary Logistic Regression models used to build faultiness models (i.e., binary classifiers), a range where estimates are inaccurate can be regarded as an "uncertainty" area, where the model is uncertain about the faultiness of the given modules. We define and empirically validate a simple method to identify the uncertainty region.

Method: We compute the standard deviation of a probability estimate provided by a Binary Logistic Regression model. If the random estimate is within the range centered on the estimate and spanning a standard deviation, we regard the estimate as "too close" to the random estimate, hence unreliable. Conversely, estimates that are far enough from the random estimate are considered reliable. This method was tested on 54 datasets from the PROMISE (now SEACRAFT) repository.

Results: Our results show that the variance of estimates can be used effectively to detect the uncertainty region. Estimates in the uncertainty area are rarely statistically significant, and are always much less accurate than estimates obtained outside the uncertainty area.

Conclusions: Practitioners and researchers can use our results to assess the reliability of estimates obtained via Binary Logistic Regression models, and reject (or challenge) those estimates that fall in the uncertainty region.
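
The check described under Method can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not confirm: the model is univariate; the standard deviation of the probability estimate is approximated with the delta method from the covariance matrix of the fitted coefficients; the "range spanning a standard deviation" is read as the estimate plus or minus one standard deviation; and the random estimate is taken to be the proportion of faulty modules in the training set. The function names and all numeric values are hypothetical.

```python
import math


def predict_with_sd(x, b0, b1, cov):
    """Return (p, sd) for a univariate logistic model.

    p  = 1 / (1 + exp(-(b0 + b1 * x)))
    sd = delta-method approximation of the standard deviation of p
         (an assumption; the abstract does not say how sd is computed).
    cov is the 2x2 covariance matrix of the estimates of (b0, b1).
    """
    eta = b0 + b1 * x
    p = 1.0 / (1.0 + math.exp(-eta))
    # Variance of the linear predictor: [1, x] * cov * [1, x]^T
    var_eta = cov[0][0] + 2.0 * x * cov[0][1] + x * x * cov[1][1]
    # Delta method: dp/d(eta) = p * (1 - p)
    sd = p * (1.0 - p) * math.sqrt(var_eta)
    return p, sd


def is_uncertain(x, b0, b1, cov, random_estimate):
    """True if the random estimate lies within [p - sd, p + sd]."""
    p, sd = predict_with_sd(x, b0, b1, cov)
    return abs(p - random_estimate) <= sd


# Hypothetical fitted model: fault-proneness as a function of one size measure.
B0, B1 = -2.0, 0.02
COV = [[0.09, -0.0005],
       [-0.0005, 0.00001]]
RANDOM_ESTIMATE = 0.3  # assumed: proportion of faulty modules in the training set

if __name__ == "__main__":
    for x in (0, 50, 60, 100, 300):
        p, sd = predict_with_sd(x, B0, B1, COV)
        label = "uncertain" if is_uncertain(x, B0, B1, COV, RANDOM_ESTIMATE) else "reliable"
        print(f"x={x:4d}  p={p:.3f}  sd={sd:.3f}  {label}")
```

Under these assumptions, the uncertainty region is the interval of values of the independent variable for which `is_uncertain` returns `True`; it surrounds the point where the estimated probability equals the random estimate. With real data, the coefficients and their covariance matrix would come from the fitted model rather than from hard-coded values.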