Background. Outliers and corrupted data points may unduly bias software development effort estimation models. However, given the usually limited size of software engineering data sets, removing too many data points may seriously reduce the power of the statistical tests used and the likelihood of statistically significant result. Also, statistical techniques are typically based on assumptions that are either believed to be true a priori or, at best, checked via statistical tests, without ever achieving 100% certainty on their truthfulness. Estimation models based on less strict assumptions have broader applicability and lower risks of drawing unwarranted conclusions. Aim. We investigate the usefulness of Robust Regression when building effort estimation models, by varying the degree of robustness and, thus, the number of data points that are excluded from the data analysis as outliers. Method. We have used Least Quantile of Squares (LQS) Robust Regression, a generalization of the Least Median of Squares (LMS). LMS builds a regression line by minimizing the median squared residual. LQS minimizes the order statistic of square residuals corresponding to any specified quantile, and not just the median, which is the order statistic corresponding to the 50% quantile. We have extended a statistical significance test for univariate LQS regression models. We have also built a weighted model, obtained from statistically significant LQS models, where each LQS model contributes proportionally to the quantile used. Results. We have applied LQS Linear Regression to estimate development effort on four projects from the PROMISE data set and obtained valid and significant univariate models. Conclusions. LQS may provide a valid alternative to LMS and Ordinary Least Square regressions to build estimation models when (1) balancing the need for excluding outliers and keeping enough data points to build statistically significant models and (2) using less strict assumptions underlying the regression technique.

Software Effort Estimation with a Generalized Robust Linear Regression Technique

LAVAZZA, LUIGI ANTONIO;MORASCA, SANDRO
2012-01-01

Abstract

Background. Outliers and corrupted data points may unduly bias software development effort estimation models. However, given the usually limited size of software engineering data sets, removing too many data points may seriously reduce the power of the statistical tests used and the likelihood of statistically significant result. Also, statistical techniques are typically based on assumptions that are either believed to be true a priori or, at best, checked via statistical tests, without ever achieving 100% certainty on their truthfulness. Estimation models based on less strict assumptions have broader applicability and lower risks of drawing unwarranted conclusions. Aim. We investigate the usefulness of Robust Regression when building effort estimation models, by varying the degree of robustness and, thus, the number of data points that are excluded from the data analysis as outliers. Method. We have used Least Quantile of Squares (LQS) Robust Regression, a generalization of the Least Median of Squares (LMS). LMS builds a regression line by minimizing the median squared residual. LQS minimizes the order statistic of square residuals corresponding to any specified quantile, and not just the median, which is the order statistic corresponding to the 50% quantile. We have extended a statistical significance test for univariate LQS regression models. We have also built a weighted model, obtained from statistically significant LQS models, where each LQS model contributes proportionally to the quantile used. Results. We have applied LQS Linear Regression to estimate development effort on four projects from the PROMISE data set and obtained valid and significant univariate models. Conclusions. LQS may provide a valid alternative to LMS and Ordinary Least Square regressions to build estimation models when (1) balancing the need for excluding outliers and keeping enough data points to build statistically significant models and (2) using less strict assumptions underlying the regression technique.
2012
Maria Teresa Baldassarre, Marcela Genero, Emilia Mendes, Mario Piattini
16th International Conference on Evaluation & Assessment in Software Engineering, EASE 2012, Ciudad Real, Spain, May 14-15, 2012. Proceedings
9781849195416
16th International Conference on Evaluation and Assessment in Software Engineering (EASE 2012)
Ciudad Real – Spain
14-15 May, 2012
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11383/1759050
 Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 6
  • ???jsp.display-item.citation.isi??? ND
social impact