Data analysis in software engineering: an approach to incremental data-driven effort estimation.

Lenarduzzi, Valentina

Cost and effort estimation in software projects have been investigated for several years. Nonetheless, compared to other engineering fields, there is still a large number of projects that fail into different phases due to prediction errors. On average, large IT projects run 45 percent over budget and seven percent over time, while delivering 56 percent less value than predicted. Several effort estimation models have been defined in the past, mainly based on user experience or on data collected in previous projects, but no studies support an incremental effort estimation and tracking. Iterative development techniques, and in particular Agile techniques, partially support the incremental effort estimation, but due to the complexity of the estimation, the total effort always tend to be higher than expected. Therefore, this work focuses on defining an adequate incremental and data driven estimation model so as to support developers and project managers to keep track of the remaining effort incrementally. The result of this work is a set of estimation models for effort estimation, based on a set of context factors, such as the domain of application developed, size of the project team and other characteristics. Moreover, in this work we do not aim at defining a model with generic parameters to be applied in similar context, but we define a mathematical approach so as to customize the model for each development team. The first step of this work focused on analysis of the existing estimation models and collection of evidence on the accuracy of each model. We then defined our approach based on Ordinary Least Squares regression analysis (OLS)so as to investigate the existence of a correlation between the actual effort and other characteristics. While building the OLS models we analyzed the data set and removed the outliers to prevent them from unduly influencing the OLS regression lines obtained. In order to validate the result we apply a 10-fold cross-validation assessing the accuracy of the results in terms of R2, MRE and MdMRE. The model has been applied to two different case studies. First, we analyzed a large number of projects developed by means of the waterfall process. Then, we analyzed an Agile process, so as to understand if the developed model is also applicable to agile methodologies. In the first case study we want to understand if we can define an effort estimation model to predict the effort of the next development phase based on the effort already spent. For this reason, we investigated if it is possible to use: • the effort of one phase for estimating the effort of the next development phase • the effort of one phase for estimating the remaining project effort • the effort spent up to a development phase to estimate its effort • the effort spent up to a development phase to estimate the remaining project effort Then, we investigated if the prediction accuracy can be improved considering other common context factors such as project domain, development language, development platform, development process, programming language and number of Function Points. We analyzed projects collected in the ISBSG dataset and, considering the different context factors available, we run a total of 4500 analysis, to understand which are the more suitable factors to be applied in a specific context. The results of this first case study show a set of statistically significant correlations between: (1) the effort spent in one phase and the effort spent in the following one; (2) the effort spent in a phase and the remaining effort; (3) the cumulative effort up to the current phase and the remaining effort. However, the results also show that these estimation models come with different degrees of goodness of fit. Finally, including further information, such as the functional size, does not significantly improve estimation quality. In the second case study, a project developed with an agile methodology (SCRUM) has been analyzed. In this case, we want to understand if is possible to use our estimation approach, so as to help developers to increase the accuracy of the expert based estimation. SCRUM, effort estimations are carried out at the beginning of each sprint, usually based on story points. The usage of functional size measures, specifically selected for the type of application and development conditions, is expected to allow for more accurate effort estimates. The goal of the work presented here is to verify this hypothesis, based on experimental data. The association of story measures to actual effort and the accuracy of the resulting effort model is evaluated. The study shows that developers’ estimation is more accurate than those based on functional measurement. In conclusion, our study shows that, easy to collect functional measures do not help developers in improving the accuracy of the effort estimation in Moonlight SCRUM. These models derived in our work can be used by project managers and developers that need to estimate or control the project effort in a development process. These models can also be used by the developers to track their performances and understand the reasons of effort estimation errors. Finally the model help project managers to react as soon as possible and reduce project failures due to estimation errors. The detailed results are reported in the next sections as follows: • Chapter 1 reports the introduction to this work • Chapter 2 reports the related literature review on effort estimation techniques • Chapter 3 reports the proposed effort estimation approach • Chapter 4 describe the application of our approach to Waterfall process • Chapter 5 describe the application of our approach to SCRUM • Chapter 6 reports the conclusion and the future works

Data analysis in software engineering: an approach to incremental data-driven effort estimation / Lenarduzzi, Valentina. - (2015).