process-product's Introduction

TCKCCA

Transfer Kernel Canonical Correlation Analysis

process-product's People

Contributors

suvodeep90

process-product's Issues

Reviewer 3

  1. However, the study features a number of unexpected design choices that somehow over-simplify the study. In particular, relating to some extent to the authors' earlier work on local vs. global models, the methodology in this paper is extremely "global", basically comparing 700 projects head-to-head without distinction in terms of age, community size, activity, domain, #releases, etc. The high-level stats in Table 4, especially for process metrics, but also the data statistics (including the imbalanced nature of the defect data), suggest that a 700-project study somehow is comparing apples to oranges.
  2. I believe that clustering the projects, then performing analyses within the clusters, would still yield an "in-the-large" analysis, but a more meaningful one at that. In particular, such a cluster-based analysis could validate the extent to which the in-the-large results are really due to the scale of the data vs. the context of specific projects (e.g., domain, size, age, time between releases, etc.). If context is the decisive factor, this would mean that the in-the-small results were actually correct in the context of the studied projects. Along the same lines, the current in-the-large results could be dominated by specific types of projects, leading in turn to incorrect generalizations.
  3. Apart from clustering, the paper would improve substantially by manually checking some of the many outlier data points, since there could be very interesting stories or observations in those that might deepen the discussion of the RQs' results. Right now, the paper does not go very deep into the interpretation of the findings, often stopping at formulating a hypothesis for future work. For example, the time-aware models could lead to highly different results for projects with few releases compared to those with weekly releases.
  4. Furthermore, the paper is rather vague regarding the actual value of in-the-large analysis. First of all, the initial pages of the paper do not define "in-the-large" well. In fact, the concept could easily be misread as meaning "pooling the data of all projects together, then building one model" instead of "building models for many projects, then pooling their results". Second, the paper does not really discuss how companies should interpret the in-the-large findings. Do they need to care about the in-the-large results (since these are validated on a larger sample), or only about the in-the-small results (since they will be building models for only a few projects)? Do future papers need to stick to in-the-large studies, or should they just make sure they use the in-the-large findings about metric importance and learning algorithms in their in-the-small studies? In other words, what are the real implications of this work?
  5. The literature review in Section 2.2 also raises a number of questions. It seems like the in-the-small papers (the larger population, not just the 2 reproduced papers) do not agree among themselves on issues like the most important complexity metrics. As such, how do the results of this paper that disagree with the 2 reproduced papers relate to those other papers? Are there in-the-small studies that actually obtained findings similar to this in-the-large study (e.g., recommending random forest models)?
  6. Regarding the choice of papers, what criteria were used by the authors to identify the "important or highly influential" papers that were added on top of the papers with sufficient citations? Similarly, what was the last paper added to the surveyed set? It would also be relevant to add the number of papers comparing or using process and product metrics that eventually were excluded, since this seems to be a much larger set than the paper suggests.
  7. Another design choice that was unexpected was the lack of hyper-parameter tuning in the models, especially since the authors' prior work has stressed the importance of such tuning. The paper states that "the performance of the default parameters was so promising that we left such optimization for future work", yet earlier studies have found unstable defect prediction results across hyper-parameter configurations, which could impact the results of the study.
  8. Furthermore, in RQ8, the in-the-small models use a logistic regression learner instead of random forest. While this is based on earlier findings that such models work better in-the-small, the fact that this paper found random forest to be better in-the-large could warrant the use of random forest for in-the-small as well. I guess this goes back to my earlier comment on how the in-the-large results should be used by practitioners.
  9. Some other expressions like "A defect in software is a failure or error" or "commits which have bugs in them" (bug-introducing or -fixing?) should be rephrased.
  10. Finally, it is not clear whether statistical tests are used on the boxplots in the earlier RQs and, if so, whether Bonferroni correction is applied, since there are typically three groups to compare against each other.
  11. Other Questions

Page 2:
Content: " unwise to trust metric importance results fromanalytics in-the-small studies since those change, dramatically when movingto analytics in-the-larg"
Comment: OK, but why? The current motivation is not that strong.

Page 3:
Content: "Now we can access data on hundreds to thousands of projects. Howdoes this change software analytics?"
Comment: In what respect? Does this refer to pooling data of multiple projects together, or to building models for more projects? The motivation is rather vague at this point.

Page 3:
Content: " For example, for 722,471 commitsstudied in this paper, data collected required 500 days of CPU (using fivemachines, 16 cores, 7days)."
Comment: OK, but actual companies building a model for themselves would not need to do this on thousands of projects, hence the effort would be lower for them?

Page 6:
Content: " in both released based and JIT based setting. A"
Comment: These study settings should be mentioned more explicitly before the RQs.

Page 6:
Content: "process metrics have significantly lower correlation than product metrics inboth released based and JIT based setting"
Comment: Is this a good or a bad thing?

Table 2: Do metrics like "age" only apply to JIT models? What are "neighbors"? What is "recent"?

Table 3: It might be better to order the papers by year to prove the point about recent studies including relatively few projects in their data set.

Page 11:
Content: "The papers in the intersectionare [60, 48, 24, 6] explore both process and product metrics."
Comment: Where is the 5th one?

Page 12:
Content: " more than 8 issues."
Comment: Why 8?

Page 12:
Content: " eight contributors."
Comment: Why 8?

Page 12:
Content: " modified version of Commit Guru [65] "
Comment: What modifications were made?

Table 4: Are these metrics calculated on the last version of each repo, for java files only, or across all commits of all repos? The latter could bias the results to older, larger projects?

Page 13:
Content: " using a keyword based search."
Comment: What keywords were used?

Page 14:
Content: " uses SZZ algorithm"
Comment: Which SZZ implementation? Is the bug report date-heuristic used?

Page 14:
Content: " use the release number, release date informationsupplied from the API to group commits into releases and thus dividingeach project the into multiple releases for each of the metrics."
Comment: Did all projects have releases?

Page 14:
Content: " or was changed in a defective child commit."
Comment: Why?

Page 16:
Content: "But by reporting on results fromboth methods, it is more likely that other researchers will be able to comparetheir results against ours. "
Comment: Nice.

Page 20:
Content: "see any significant benefit when accessing the performance in regards to thePopt20, which is another effort aware evaluation criteria used by Kamei et al.and this study."
Comment: Somehow, even product metrics seem to perform equally well on this metric.

Page 21:
Content: "With the exception of AUC"
Comment: Popt20?

Page 23:
Content: " evident from the results, thatfile level prediction shows statistically significant improvement "
Comment: Supported by statistical test results?

Page 24:
Content: " then check each of the 3 subsequent releases"
Comment: In terms of what?

Page 24:
Content: " see in both process based and product based models thePopt20 does significantly better in the third release"
Comment: Perhaps many projects have only few releases?

Page 24:
Content: " This basically means if either process or product metrics can capturesuch differences, then the metric values for a file between release R and R+1should not be highly correlated."
Comment: Since process metrics capture the development process, would a low correlation imply changes in the process?

Page 24:
Content: " Spearman correlation values for every file between two consecutivereleases for all the projects explored as a violin plot for each type of metrics."
Comment: Basically, for each file there is one Spearman correlation across all its metrics, then those correlations are aggregated across all files, all commits and all projects into one violin plot?
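
A minimal sketch of the computation this comment conjectures, assuming hypothetical pandas DataFrames release_r and release_r1, each indexed by file path with one column per metric. This illustrates the conjectured per-file aggregation, not the paper's confirmed implementation.

    # Sketch only: one Spearman rho per file, computed across that file's metric
    # vector in release R and release R+1; files missing from either release are
    # skipped. `release_r` and `release_r1` are hypothetical DataFrames.
    import pandas as pd
    from scipy.stats import spearmanr

    def per_file_correlations(release_r: pd.DataFrame, release_r1: pd.DataFrame) -> pd.Series:
        shared_files = release_r.index.intersection(release_r1.index)
        rhos = {}
        for f in shared_files:
            rho, _p = spearmanr(release_r.loc[f].values, release_r1.loc[f].values)
            rhos[f] = rho
        return pd.Series(rhos)

    # The per-file rhos from every consecutive release pair of every project would
    # then be pooled into one distribution per metric family and drawn as a violin
    # plot (e.g., with seaborn.violinplot).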

Page 27:
Content: "indicate the models are proba-bility learning to predict the same set of files defective and finding the samedefect percentage in the test set as training set and it is not able to prop-erly differentiate between defective and non-defective files. "
Comment: What is the difference with RQ5?

Page 27:
Content: "Spearman rank correlation between the learned and predicted probability formodels built using process and product metrics."
Comment: Is this analysis per file, then aggregated across all files and all projects?

Page 27:
Content: " part 3 only contains files which are defective intraining and not in test set,"
Comment: The other way around?

Page 28:
Content: " using both process and product metrics "
Comment: "both" or "either"?

Page 30:
Content: " sorted by the absolute value of their β-coefficients within thelearned regression equation."
Comment: Are coefficients comparable across the different metrics in a logistic regression model? Why not use odds ratios?
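
As an aside on the odds-ratio suggestion: since the odds ratio is exp(β), ranking by |β| and by |log(odds ratio)| coincide; reporting odds ratios mainly changes the interpretation, and standardizing features makes magnitudes comparable across metrics. A hedged sketch, assuming a hypothetical fitted scikit-learn LogisticRegression called model and a matching list feature_names:

    # Sketch only: reporting logistic-regression importance as odds ratios.
    # `model` and `feature_names` are hypothetical; features are assumed to be
    # standardized so coefficient magnitudes are comparable across metrics.
    import numpy as np

    def ranked_odds_ratios(model, feature_names):
        # exp(beta) is the multiplicative change in the odds of "defective" for a
        # one-standard-deviation increase in the feature.
        odds = np.exp(model.coef_.ravel())
        # Ranking by |log(odds ratio)| is identical to ranking by |beta|.
        return sorted(zip(feature_names, odds), key=lambda t: abs(np.log(t[1])), reverse=True)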

Page 31:
Content: " have relied on issues marked as a 'bug' or 'enhancement' to count bugsor enhancements"
Comment: Which metrics leverage information about enhancements?

Page 32:
Content: " took precaution to remove any pull merge requests from thecommits to remove any extra contributions added to the hero programmer."
Comment: Any details about this?

Page 32:
Content: " process metrics generate better predictors than process metrics"
Comment: Something seems wrong in this sentence.

Reviewer comments

Reviewer 1:

  • My major concern is about the experimental design and the definition of the defect models in this paper. (Update according to JIT and Released based setting)
  • Specifically, it's not clear to me if this paper focused on release-based defect prediction or JIT defect prediction. Please clearly state at the beginning of the paper. (Update according to JIT and Released based setting)
  • readers need strong justification and convincing statements that the proposed approach is sound and correct. Currently, the use of product metrics in this paper does not sound right to me. The prediction granularity of JIT models is at the commit level, and it's almost impossible to map the product metrics to each commit. That's why the original Kamei paper does not have product metrics. (modify section 3.1 to be more clear about data set creation)
  • Step 3 in 3.1 is very hard to understand, thus it needs an overview diagram to convince reviewers and support future replication. My quick understanding is that the authors aim to generate product metrics for each file in each commit. So my question is how the authors can aggregate these product metrics of multiple files to the commit level. (modify section 3.1 to be more clear about data set creation)
  • the experimental setup of this paper is very different from the Rahman paper. (1) The process metrics in Table 2 are proposed by Kamei, but are not the same set that was used in Rahman et al. (2) Rahman et al focused on release-based DP, not JIT DP. (Update according to JIT and Released based setting)
  • if the paper focused on JIT models, why not formulate the paper to replicate Kamei et al, instead of Rahman et al.
  • The authors stated that "Only one paper argues that product metrics are best. Arisholm et al. [6] found one project where product metrics perform better". I don't think it is correct to draw conclusions from "one project" of the Arisholm paper. (Yes we agree, that's why this study)
  • In 3.1 Collaboration, the authors used #pull requests as a proxy of collaboration. I don't think this is correct. Do the authors have any reference to support this? (add citations)
  • Please report the ratios of bug-introducing commits as well in Table 4. (add a table including dataset details)
  • Please include a reference to support the definition of "thick-slice" validation. (We don't have any citations. This is a new validation approach. Add details why it is useful in JIT)
  • The selection criteria look fine in general. But I do not understand how many projects were in the starting set and how many projects were removed by each criterion in order to arrive at the final set of 770 projects. Please clearly explain step by step. (Add details about initial set of projects)

Reviewer 2:

  • IMHO, the novelty of this work is marginal. I did not learn much from the paper as the findings are very similar to the prior works, i.e., the process metrics are the best and the experimental method is pretty standard. The only interesting part of this work is comparing the findings derived from a small vs. large dataset, but it is still not so well executed.
  • Despite the fair novelty, the topic addressed in this work is very significant as defect prediction and other predictive models are widely used in SE research and the size of the dataset could potentially impact the findings and results. The problem is well-motivated, where the authors show that many papers use a small number of datasets. This should highlight the generalizability threat of empirical studies.
  • The authors describe that they aim to revisit the work of Rahman and Devanbu. Although the authors emphasize that they did not want to do an exact replication, this experimental design is too far different from the work of Rahman and Devanbu. More specifically, Rahman and Devanbu investigate this problem in the context of release-level defect prediction (i.e., predicting which file will be defective after the software is released). On the other hand, this work is more like change-level defect prediction (i.e., which change will induce a fix). Hence, the context is different. Consequently, the studied metrics and the ways these metrics are calculated are quite different. Hence, IMHO, the authors should clearly situate the work in the context of change-level defect prediction, which is unlike the work of Rahman and Devanbu.
  • the paper does not provide an overview of the collected dataset (e.g., LOC, #Files, #PR, #Dev, Defective Ratio) (Added LOC, #Files, #Dev, Defective Ratio)
  • Since this work is not generally comparable with the work of Rahman and Devanbu (as per my comment 1), it would be better to conduct an RQ3-style experiment for all the RQs. More specifically, the authors should run the experiments on a small dataset and check whether the process metrics are still the best in both the small and the large dataset in this context of change-level defect prediction.
  • While I did like the study approach of RQ3, I’m still concerned about its comparison. More specifically, the authors compare the metric importance between the Random forest model that is built in-the-large, and the Logistic Regression model that is built in-the-small. Why different classification techniques?
  • RQ3 investigates the metric importance; the highly correlated metrics should be removed first in order to achieve a stable conclusion (one possible filtering step is sketched after this reviewer's comments).
  • It would be nice if the authors also conduct a statistical analysis.
  • I’m very confused by the results of RQ4. In the introduction, the authors describe that product metrics are harder to compute than the process metrics. However, in the RQ4 results, the authors present it the other way around. So, I’m not sure which one is correct. Also, the results are only briefly reported. It would be better to describe the average computation time for each project for process metrics and for product metrics.
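
Regarding the correlated-metrics point above, one illustrative way to filter (not the paper's procedure) is to drop one metric from every pair whose |Spearman rho| exceeds a chosen threshold before computing importance. A sketch, with X a hypothetical pandas DataFrame of metric values and a 0.7 threshold picked only for illustration:

    # Illustrative correlation filter, not the paper's procedure: keep a metric
    # only if its |Spearman rho| with every already-kept metric stays below the
    # threshold. `X` is a hypothetical DataFrame with one column per metric.
    import pandas as pd

    def drop_correlated(X: pd.DataFrame, threshold: float = 0.7) -> pd.DataFrame:
        corr = X.corr(method="spearman").abs()
        kept = []
        for col in corr.columns:
            if all(corr.loc[col, k] < threshold for k in kept):
                kept.append(col)
        return X[kept]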

Reviewer 3:

  • Authors said in part “Defective Commits step (2)” that you used Commit_Guru to identify commits. However, I think the classification of buggy commits made by Commit_Guru is based on a simple list of keywords. With this method they might miss a number of commit messages containing complex sentences without the Commit_Guru keywords. Automatic language processing and classifier-based bug triage methods could give more relevant results.
  • Authors said that thick-slice release-based validation is used to generate the train/test data, with the first 60% of the history as training set and the next 40% as test set. But this raises a question: what is the impact on predictor quality of choosing the first 60% of the history for a project containing only a few hundred commits and/or only one or two releases?
    Moreover, why choose a 60/40 split and not 80/20 or another train/test configuration?
  • Evaluation criteria are relevant and explained. It is appreciable to have criteria other than the classic recall, precision and ROC, all the more so as Ifa and Popt20 are clearly applicable to predictors in software engineering. However, the choice could be better justified by the authors.
  • In RQ3, the authors give the metric importance for Random Forest and Logistic Regression, but what about the other two predictors (i.e., Naive Bayes and SVM) used in the study? For example, the authors could give the ranked coefficients for the SVM and the ranked probabilities for the Naive Bayes to have a point of comparison. Is it because Random Forests use a majority vote?
    A numerical comparison with predictors built on a small set of software projects might have been interesting and relevant.

Reviewer 2

  1. In the introduction, the authors said that to calculate product metrics of 700k commits, the process required 500 CPU-days, while it required 10 times less for process metrics. However, it is not clear how many metrics are considered in one case and how many in the other. Moreover, it is quite strange that the product metrics require 10 times more than the process metrics. Usually, many of the process metrics can be built on top of the product metrics considering the evolution of the latter in time. Are you also counting the time to extract the product metrics? Did you investigate why in your case this was surprisingly the opposite? Have you adopted a caching approach to avoid recalculating product metrics every time? Perhaps this point is just an engineering detail and not crucial, but clarifying it can help to understand the adopted approach.
  2. My second concern regards the prediction granularity. At what granularity is the prediction performed? It is not clear from reading the methodology at what granularity level the authors are performing the defect prediction evaluation. What do you consider defective? Is it an entire package, a class, a file, a method, or a line? The granularity of the prediction impacts drastically the quality of the results. Of course, a fine-grained granularity is what developers wish, but on the other hand, it is challenging to achieve. For example see *calikli2009effect.
  3. My third concern regards the outcome and the lessons learned from this study. I have the impression that the interest of this study is limited by its nature. The authors mainly focus on machine learning classifiers such as SVM, Naive Bayes, Logistic Regression, and Random Forest, excluding deep learning approaches that, for a large-scale study, seem to be more promising; see *yang2015deep. In addition, this work is merely cited in the "3.4 Evaluation Criteria" section in regard to recall rather than for its use of deep learning. To know more about the topic see also *chen2019deepcpdp, *hoang2019deepjit, *qiao2020deep. Since the majority of the effort for conducting this study has already been spent (the metrics are already extracted), it would be fantastic to extend this study by comparing the current in-the-large results with a deep-learning model that leverages the availability of a huge amount of data for training.
  4. Another limitation may affect the sets of metrics: the authors arbitrarily decide to investigate the debate about product and process metrics, but it may also be interesting to consider other kinds of metrics that may influence the performance in-the-large. For example, *radjenovic2013software, *pascarella2020performance, *li2018progress suggest and compare other sets of metrics to understand whether different sets can achieve different results. Have you considered extending your pool of metrics with additional, non-conventional metrics?
  5. An additional point that may require clarification regards the overlap between product and process metrics. What is the role of these sets of metrics when they have to capture different aspects of the software? To what extent do process-based results overlap product-based results, and vice versa? In other words, how many defects are caught by only a single set of metrics?
  6. Some clarifications are needed from the validation perspective. The authors use cross- and release-based validation strategies. That is good for giving an overview of the model behavior; however, while the former is useful for getting a rapid summary of the results, it may be misleading during the training and testing phases. See *pascarella2020performance. It would be outstanding to read more about the countermeasures used to address such a limitation.
  7. Authors use several filtering criteria such as the minimum number of commits, pull requests, issues, etc. However, some of the criteria seem not properly defined and rather arbitrarily chosen. Since the goal of the study is to understand the behavior of metrics in-the-large, I would like to read more about the reasons behind these choices, which contribute to the definition of what is so-called large. For instance, the authors select as representative projects for a large-scale study all those projects that have at least 20 commits, 10 defective commits, and 50 weeks of development. Such a selection implies that half of the commits are defective and a commit is released on average every 2 or 3 weeks. Is this representative of a project in a good state? See also *bird2009promises for git-related thoughts.
  8. What about forked projects? On GitHub, open projects can be forked by any user. This increases the number of mined projects if forked projects are not removed from the query results. How do you deal with forked projects? Do those 700 results contain forked projects? How do you recognize the original versus the forked projects?
  9. Release selection. How do you use the GitHub API to select releases? Do you also consider tags and branches? What kind of heuristic are you using to identify releases?
  10. Finally, while I generally appreciate this work's results, it leaves me with some doubts that I wish to clarify before battling in favor of it. In conclusion, the authors claim that this work can shed light on which machine learning method is more promising when used in a realistic in-the-large scenario. However, due to the ease of use, the always-increasing computational power, and the huge availability of data (which is also the strongest point of this paper), neural networks seem to be the future also in defect prediction; see *wang2018fast. The evaluation of a neural network model may allow the authors to extend their advice on which prediction model to choose while designing defect prediction tools.

Reviewer 1

  1. In p.1, l.27, what did not hold should be written in addition to what held.
  2. In p.11, l.43, the number of projects before selection should be written. The information is useful to know how large the 700 projects (i.e., those which could be analyzed with confidence) are relative to the GitHub space.
  3. In Section 3.2, l.34, the statement "The performance of the default parameters was so promising…" is not confirmed by the following sections. In particular, SVM was not tuned and did not look promising. The usefulness of ensemble learning (i.e., RandomForest in this paper) is a key finding of this paper (p.3, l.15), and SVM must be tuned at least to defend the finding.
  4. In p.16, l.26, the sentence "… in the order proposed by the learner," brings some questions. First, does it mean that the learners return a probability of being fault-prone?
  5. If so, how was SVM configured? SVC in scikit-learn does not return a probability by default; the option "probability" needs to be set to True. The configuration must be written in Section 3.2.1. If not, how the order was defined for Popt20 must be written in Section 3.4. Second, if the learners returned a probability, how was it turned into a binary value? (A minimal configuration sketch is given after this list.)
  6. In p.18, RQ1, it is unclear what data was used for plotting Figs. 3-5. The cross-validation study repeated 5-fold cross-validation 5 times and yielded 5 x 700 results for each metric setting. The release study had 3 releases to be tested and yielded 3 x 700 results for each metric setting. Each boxplot of the figures was based on 700 projects. So, what summarization or aggregation was applied to the results?
  7. In p.23, RQ5, it is unclear how the correlation between two consecutive releases was calculated. A file can be updated multiple times between releases. Thus, each of the two consecutive release datasets can have multiple instances attributed to the same file. There are several possible combinations among them for calculating a correlation. Therefore, how the correlation was calculated must be detailed.
  8. In p.26, RQ6-7, it is unclear how training instances and test instances were linked. As described, there are several possible combinations if multiple instances are attributed to the same file. Also, some files only appeared in either the training or the test set. How these cases were treated must be detailed.
  9. In p.29, l.23, it looks inappropriate to use logistic regression. The focus here is the difference between the large-scale study and the small-scale study. Using a different learner adds noise to the comparison. The analysis must be carried out with Random Forests, not logistic regression.
  10. Some typos and mistakes to be fixed were also found as follows (not all):
    P.3, l.16 "had" -> "hand"
    P.4, l.45 "in-term" -> "in terms"
    P.5, l.26 "then" -> "than"
    P.9, l.25 "that" -> "than"
    P.17, l.48 "AUC" -> "Popt20"
    P.18, l.17 "AUC" -> "Popt20"
    P.18, l.48 "… where process but…" -> "… where using and…"
    P.20, l.40 "AUC" -> "Popt20"
    P.23, l.18 "build" -> "built"
    P.23, l.28 "build" -> "built"
    P.26, l.47 "defective in training and not in test set" -> "defective in test set and not in training set"
    P.27, Fig. 10: Add labels "recall" and "pf" to Y-axis
    P.29, l.9 "left to right" -> "bottom to top"
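
On the SVC question in point 5 above, a minimal sketch (assuming scikit-learn, and not necessarily the authors' configuration) of how SVC can be made to return the probabilities needed to order files for Popt20, and one common way to reduce those probabilities to binary labels:

    # Sketch only: SVC estimates class probabilities (via Platt scaling) only if
    # probability=True is set before fitting. Synthetic data stands in for the
    # per-file metric tables used in the paper.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = SVC(kernel="rbf", probability=True, random_state=0)
    clf.fit(X_train, y_train)

    proba = clf.predict_proba(X_test)[:, 1]   # probability of the defective class -> ranking for Popt20
    pred = (proba >= 0.5).astype(int)         # thresholding at 0.5 is one common binarization; note that
                                              # clf.predict() uses the SVM decision function and can
                                              # disagree slightly with these thresholded probabilities.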
