Software Effort Estimation with a Generalized Robust Linear Regression Technique

Lavazza, Luigi Antonio; Morasca, Sandro

doi:10.1049/ic.2012.0027

Background. Outliers and corrupted data points may unduly bias software development effort estimation models. However, given the usually limited size of software engineering data sets, removing too many data points may seriously reduce the power of the statistical tests used and the likelihood of obtaining statistically significant results. Also, statistical techniques are typically based on assumptions that are either believed to be true a priori or, at best, checked via statistical tests, without ever achieving 100% certainty that they hold. Estimation models based on less strict assumptions have broader applicability and carry a lower risk of drawing unwarranted conclusions.

Aim. We investigate the usefulness of Robust Regression when building effort estimation models, by varying the degree of robustness and, thus, the number of data points that are excluded from the data analysis as outliers.

Method. We have used Least Quantile of Squares (LQS) Robust Regression, a generalization of Least Median of Squares (LMS). LMS builds a regression line by minimizing the median squared residual. LQS minimizes the order statistic of the squared residuals corresponding to any specified quantile, and not just the median, which is the order statistic corresponding to the 50% quantile. We have extended a statistical significance test for univariate LQS regression models. We have also built a weighted model, obtained from statistically significant LQS models, where each LQS model contributes proportionally to the quantile used.

Results. We have applied LQS Linear Regression to estimate development effort on four projects from the PROMISE data set and obtained valid and significant univariate models.

Conclusions. LQS may provide a valid alternative to LMS and Ordinary Least Squares regression for building estimation models when (1) balancing the need to exclude outliers against the need to keep enough data points to build statistically significant models and (2) using less strict assumptions underlying the regression technique.
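
Illustration. The LQS criterion described under Method can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names lqs_fit and weighted_estimate are hypothetical, the search only tries lines through pairs of observed points (an approximation to the true minimizer), the rank convention ceil(quantile * n) is one of several in use, and the weighted combination is one possible reading of "contributes proportionally to the quantile used". The significance test mentioned in the abstract is not reproduced here.

```python
import math
from itertools import combinations


def lqs_fit(xs, ys, quantile=0.5):
    """Approximate univariate LQS fit (hypothetical helper).

    Tries every line through two observed points and keeps the one whose
    squared residual at the chosen quantile rank is smallest.
    Returns (criterion, intercept, slope).
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    if not 0.0 < quantile <= 1.0:
        raise ValueError("quantile must be in (0, 1]")
    # Assumed rank convention: the k-th smallest squared residual.
    k = max(1, math.ceil(quantile * n))
    best = None
    for i, j in combinations(range(n), 2):
        if xs[i] == xs[j]:
            continue  # vertical line, not a valid regression line
        slope = (ys[j] - ys[i]) / (xs[j] - xs[i])
        intercept = ys[i] - slope * xs[i]
        squared = sorted((y - (intercept + slope * x)) ** 2
                         for x, y in zip(xs, ys))
        criterion = squared[k - 1]
        if best is None or criterion < best[0]:
            best = (criterion, intercept, slope)
    if best is None:
        raise ValueError("all x values are identical")
    return best


def weighted_estimate(models, x):
    """Combine LQS models with weights proportional to their quantiles.

    models: iterable of (quantile, intercept, slope) tuples, assumed to
    have already passed whatever significance test is being used.
    """
    models = list(models)
    total = sum(q for q, _, _ in models)
    return sum(q * (a + b * x) for q, a, b in models) / total


if __name__ == "__main__":
    # Made-up data: size vs. effort on a clean line, plus one gross outlier.
    size = [1, 2, 3, 4, 5, 6, 7, 8]
    effort = [3, 5, 7, 9, 11, 13, 15, 100]
    print(lqs_fit(size, effort, quantile=0.5))
    print(lqs_fit(size, effort, quantile=1.0))
```

With quantile=0.5 the fit ignores the outlier entirely, because at least half of the points lie on a single line; with quantile=1.0 the criterion is the largest squared residual, so the outlier pulls the line toward itself. Varying the quantile between these extremes is the "degree of robustness" the abstract refers to.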