An Empirical Evaluation of Distribution-based Thresholds for Internal Software Measures
Lavazza, Luigi Antonio; Morasca, Sandro
2016-01-01
Abstract
Background: Setting thresholds is important for the practical use of internal software measures: software modules can be classified as having either acceptable or unacceptable quality, so that software practitioners can take appropriate quality improvement actions. Quite a few methods have been proposed for setting thresholds, and several of them are based on the distribution of an internal measure's values (and, possibly, of other internal measures), without any explicit relationship to any external software quality of interest.

Objective: In this paper, we empirically investigate the consequences of defining thresholds on internal measures without taking into account the external measures that quantify qualities of practical interest. We focus on fault-proneness as the specific quality of practical interest.

Method: We analyzed datasets from the PROMISE repository. First, we computed the thresholds of code measures according to three distribution-based methods. Then, we derived statistically significant models of fault-proneness that use internal measures as independent variables. Finally, we evaluated the indications provided by the distribution-based thresholds when used along with the fault-proneness models.

Results: Some methods for defining distribution-based thresholds require that code measures be normally distributed. However, we found that this is hardly ever the case with the PROMISE datasets, making that entire class of methods inapplicable. We adapted these methods for non-normal distributions and obtained thresholds that appear reasonable, but are characterized by a large variation in the fault-proneness risk level they entail. Given a dataset, the thresholds for different internal measures, when used as independent variables of statistically significant models, yield fairly different values of fault-proneness. This is quite dangerous for practitioners, since they get thresholds that are presented as equally important but may in practice correspond to very different levels of user-perceivable quality. For other distribution-based methods, we found that the proposed thresholds are practically useless, as many modules whose internal-measure values are deemed acceptable according to the thresholds actually have high fault-proneness. Also, the accuracy of all of these methods appears to be lower than the accuracy obtained by simply classifying modules at random.

Conclusions: Our results indicate that distribution-based thresholds appear to be unreliable in providing sensible indications about the quality of software modules. Practitioners should instead use other kinds of threshold-setting methods, such as those that take into account data about the presence of faults in software modules, in addition to the values of internal software measures.
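To make the workflow described in the Method paragraph concrete, the sketch below shows one way such an analysis could be carried out: checking the normality assumption of a code measure, computing a classic distribution-based threshold (mean plus one standard deviation), fitting a logistic regression model of fault-proneness, and reading off the fault-proneness the model assigns at the threshold. This is an illustrative sketch, not the authors' actual pipeline; the file name "promise_dataset.csv" and the column names "v(g)" and "defects" are hypothetical, PROMISE-style assumptions.

    # Illustrative sketch only; file and column names are assumptions.
    import numpy as np
    import pandas as pd
    from scipy import stats
    import statsmodels.api as sm

    df = pd.read_csv("promise_dataset.csv")      # hypothetical PROMISE-style dataset
    x = df["v(g)"].to_numpy(dtype=float)         # an internal (code) measure
    y = df["defects"].astype(int).to_numpy()     # 1 = module contains faults

    # 1. Check the normality assumption that some distribution-based methods rely on.
    _, p_normal = stats.shapiro(x)
    print(f"Shapiro-Wilk p-value: {p_normal:.4f}")  # typically << 0.05 for code measures

    # 2. A classic distribution-based threshold: mean + one standard deviation.
    threshold = x.mean() + x.std(ddof=1)

    # 3. A fault-proneness model: logistic regression of fault presence on the measure.
    model = sm.Logit(y, sm.add_constant(x)).fit(disp=0)

    # 4. Fault-proneness implied by the threshold, according to the fitted model.
    fp_at_threshold = model.predict(np.array([[1.0, threshold]]))[0]
    print(f"threshold = {threshold:.2f}, "
          f"estimated fault-proneness at threshold = {fp_at_threshold:.2f}")

Repeating step 4 for thresholds computed on different internal measures of the same dataset is what reveals the issue reported in the Results paragraph: thresholds presented as equally important can correspond to very different fault-proneness levels.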