
phasedelay's Introduction

phaseDelay

does phase delay increase bug costs?

phasedelay's People

Contributors

timm, ai4se, williamnichols

Stargazers

Lucas Layman

Watchers

James Cloos, Lucas Layman, Naveen K L, Neha Rungta, Forrest Shull

phasedelay's Issues

Please add an N column to Fig 13.

I am troubled by Fig. 12, how many defects have in total been fixed in the same phase in which they have been found? What is the proportion from the initial 47,376 defect logs that are concerned in the statistics in Fig. 13?

write the reply text to reviewers.

Editor

The reviewers have carefully scrutinized the revised version, and all of them found the new version globally improved with respect to the major concerns, and thus found the manuscript worth publishing. At the same time, the changes have introduced some inconsistencies and minor issues (such as a figure caption that is too long) that the authors should fix before submitting the final version.

  • Reply E0: Thank you for your comments. We have removed the excessively long figure caption and fixed the typos and other small inconsistencies mentioned by the reviewers. We hope this revision meets with your approval.

Reviewer1

Reviewer #1: Thank you for carefully revising the manuscript. All my comments have been addressed - either elaborated in the paper or explained in the rejoinder. The manuscript is mature and my remaining comments are all minor. Unfortunately, it is not easy to refer to specific comments in the rejoinder, but I do my best below.

I think the manuscript has improved considerably, especially in regard to three key aspects. First, the data and the overall development context has greatly improved. This will help future readers understand the results and allow further interpretation. Second, the paper now connects to previous research on theory building in software engineering - already in the introduction section. While no new theories are put forward in this paper, I think the discussion is important, and the pointers to future work could inspire more work. Third, I appreciate the authors' efforts to stratify the data. Although there are no new findings, I believe the discussion belongs in the paper - I missed it in the previous version.

R1.1: Note that there is no Section 6.6 in the manuscript, though; make sure you didn't forget anything.

  • Reply1.1: Our bad. Wrong section reference. One more LaTeX compile and that bad ref went away.

Minor comments:

R1.2: Section 4: I think the paragraphing could be improved - the highly interesting point about agile is hidden. "A goal of agile methods is to reduce /---/ little empirical data exist" appears in the middle of a paragraph headed by "Shull et al. conducted a literature survey and held a series of e-workshops". Consider restructuring this section to bring forward the agile perspective.

  • Reply1.2: We agree that this section "buries the lead". We have moved those critical sentences to the front of the section.

R1.3: I don't fully understand Section 5.2.2 paragraph 4, and the relation between plan items and the time tracking logs. "One or more defects are reported against a single plan item, e.g., a review session, an inspection meeting, a test execution". Multiple defects can be mapped to a plan item? But a plan item can also be "resolving a defect" (page 11:46). Also, the time tracking logs include time to collect data, prepare a fix, and its validation (page 12:50)… But what if a plan item includes both running a test suite and resolving the identified defect? The equation presented on page 13 suggests that the "time-to-fix a defect" could be inflated by e.g. running slow test cases. I'm sure the authors have all this covered, but the text confuses me. Could it possibly be revised? Maybe another figure could clarify the situation?

Reply1.3: We apologize for the confusion. We have rewritten this paragraph to be more understandable. The key point is that we measure defect cost both directly and by summing effort in the defect removal phases. Summing over the removal phases produces a greater cost estimate because it includes overhead.

R1.4: Section 5.2.3: There is a minor discrepancy between Figure 7 and the text. In the running text you list system tests before acceptance tests, but the opposite order is depicted in the figure. Make sure the order is correct, also in Figure 13.

Reply1.4: Our bad. System comes before acceptance. Changed throughout. Good catch!

R1.5: Figure 10 - I like this figure, especially the final distribution column! But please clarify the year… I'm sure many projects are not completed within a calendar year. Does the year show when the project started or finished (released)?

Reply 1.5: Thank you for catching this ambiguity and for the positive comment on the distributions. We reported the start year and have modified the figure to clarify.

R1.6: Figure 12 - Found and fixed? Does this mean each defect appears twice in the figure? Could you separate them? The current figure is hard to penetrate.

Reply 1.6: Each defect is counted only once. The term "found and fixed" has a specific meaning, but this meaning did not come through clearly, as more than one reviewer noted. To TSP practitioners, "found and fixed" means that the required change was "identified and implemented". We modified the text to clarify that this operational term refers to a single activity.

R1.7 Figure 13 - The caption is far too long. Could parts of it be moved to a paragraph in the running text?

Reply1.7: Moved into the running text.

R1.8: Section 6.1 - It is not obvious to me how to interpret "conformance with Benford distribution". Could you please elaborate somewhat?

Reply 1.8: A Benford test is used in forensic accounting to detect human manipulation of data entries. The distribution is roughly, though not exactly, log-normal and occurs naturally in many frequency distributions, such as log tables and accounting data entries. Human alterations (guesses) cause deviations from the expected distribution. We have applied this test to the log entries to estimate which were recorded in real time and which were estimated or guessed after the fact. We have rewritten the paragraph and included a citation to clarify the meaning.
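
For readers of this thread, a minimal sketch of the kind of first-digit Benford check described above (illustrative only; this is not the paper's analysis code, and the function name and inputs are ours):

```python
import math

def benford_deviation(values):
    """Compare the leading-digit distribution of `values` (e.g. logged
    effort entries) against Benford's law, P(d) = log10(1 + 1/d), and
    return the summed absolute deviation over digits 1..9.  Larger
    deviations hint that entries were estimated or guessed rather than
    recorded in real time."""
    digits = [int(str(abs(v)).lstrip("0.")[0]) for v in values if v > 0]
    if not digits:
        return float("nan")
    n = len(digits)
    return sum(abs(digits.count(d) / n - math.log10(1 + 1 / d))
               for d in range(1, 10))
```

A formal goodness-of-fit test (e.g. chi-squared) would be the stricter way to decide conformance; the sketch only reports the raw deviation.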

R1.9: Section 6.1 paragraph 7 - "Fourth" appears twice.

Reply 1.9: Fixed

R1.10: Section 6.3 paragraph 5 - "Figure 14 shows that no such effect occurs…" How? Severity is not presented in the figure. Moreover, did your stratification cover high-severity defects?

Reply 1.10: You are correct. That is an incorrect inference. We have deleted that sentence.

R1.11:

Details:
Page 24:12 - "the cased study"
Page 7:24 - "Some studies that report less-than" Remove extra "that".
Page 12:41 - "quote elaborate" quite?
Page 14:30 - "the of the"
Page 14:45 - "complete submitted"
Page 15-16 - No thousands separators.
Page 16:48 - "only 25% of the teams…" Was this supposed to be a separate item in the bullet list?
Page 18:51 - "raises" -> "raised"

Reply 1.11: Fixed. Thanks!


Reviewer2

Reviewer #2: This revised version (R1) of the paper contains many corrections, changes, and enhancements. The authors have considered all the reviewer's comments and have provided precise responses. My own observations (i.e. Reviewer #2) have been globally satisfied, and thanks to comments from reviewers #1 and #3, the paper has been significantly enhanced.

Thank you for that comment.

Nonetheless, there are still some issues that need some clarifications.

R2.1 My main concerns are related to data and statistics sections:

  • As requested, the authors have added descriptive statistics about the data. However, some of these are not very relevant here, i.e. number of projects per organization (Fig. 9), year distribution (Fig.11 last row). Moreover, Fig. 11 could be complemented with the number of defects distribution.

Reply 2.1: We have added a defect count distribution as suggested.

R2.2: The statistical analysis section (§5.5) can be enhanced by better explicating the calculations. The Scott-Knott algorithm is a particular clustering algorithm, and the reader needs some clarification of how it contributes to the demonstration.

Reply 2.2: You are quite correct; our description of Scott-Knott was opaque. We have added more text at the start of Section 5.5.

R2.3: Moreover, Fig. 13 is still insufficiently clear. The authors added a lengthy legend; I tend to think this would better be placed in the text. The column on the left-hand side (i.e. "rank") is still unclear to me.

Reply 2.3: You are correct: the explanatory text in the figure caption was confusing. By moving it to Section 5.6, it can be expanded into a dot list (easier to read) and extra details can be added (for example, as you suggest, we can discuss the meaning of the "rank" column in more depth).

R2.4: I am troubled by Fig. 12: how many defects have in total been fixed in the same phase in which they were found? What is the proportion of the initial 47,376 defect logs that are concerned in the statistics in Fig. 13?

Reply 2.4: "Find and fix" is a single sub-task, so all defects are "found and fixed" in the same phase. This is distinct from injected and found.

R2.5: Finally, looking at Fig. 6 and the defect type distribution, wouldn't it be relevant to explore any possible effect of defect type on the results in Fig. 13? As defect severity is absent, maybe defect type has some influence. It could be that certain categories of defect can be corrected at any time during the project; it could even be that these defects, in particular, are delayed for later (because they are easy to repair).

Reply 2.5: Indeed, some activities are more likely to find certain types of defects. This was explored in a pair of TSP Symposium papers using PSP data.
D. Vallespir and W. Nichols, “An Analysis of Code Defect Injection and Removal in PSP,” in Proceedings of the TSP Symposium 2012, 2012; and
D. Vallespir, “Analysis of Design Defect Injection and Removal in PSP,” in TSP Symposium 2011, 2011, pp. 1–28.
However, exploring this with data from so many distinct projects presents challenges and analysis beyond the scope of this paper. This is an interesting question that should be pursued in future work.
Although some types of defect are clearly more challenging than others (syntax defects are often simple), the discovery phase and activity have a much larger effect. Examining this relationship is a very good idea, but we judged it to be a significant effort best left to future work that elaborates on the more basic finding.

R2.6: Minor remarks:

  • p3, line 1: "that" repeated
  • §5.2: the numbers about the TSP database can be regrouped in a table
  • Fig.11: the presentation of the table could be enhanced by reducing the size of columns N, Min, etc. and enlarging the column Distribution to make the graphics visible.
  • p. 17: the full name should be mentioned for the "A12 effect size test" (i.e. Vargha and Delaney's A12 test, cf. [1])
  • p.17, line 36: "it applies"
  • p.17, line 38: "of" repeated
  • §5.6: the last paragraphs about the possible explanations could be regrouped in a separate subsection
  • p.21, line 20: "are" repeated

Reply 2.6: Fixed


Reviewer3

Reviewer #3: The authors have adequately addressed my observations. Below is a list of minor comments (with one exception, all of them are typos).

R3.1: Section 5.2.2
"quote elaborate" should be "quite elaborate"

Section 5.4
"discuss the of the projects" should probably be "discuss those of the projects"
"the logical order is described section 5.2.3 is followed" should be "the logical order described in Section 5.2.3 is followed"

In Fig. 11, the "min" Year is not reported (even though the text says it's 2006).

"teams meet at least weekly" should be "teams met at least weekly" (everything else is in the past tense)

Section 5.5

"it apples some statistical hypothesis test" should be "it applies some statistical hypothesis test"
"the division of of l treatments" should be "the division of l treatments"

Reply 3.1: These items have been fixed. Thanks!

3.2 The description of the Scott-Knott ranker is not terribly clear. You write "Scott-Knott seeks the division of of l treatments into subsets of size m, n of sizes ls,ms, ns and median values l.μ,m.μ, n.μ (respectively) in order to maximize ms/ls abs(m.μ − l.μ)^2 + ns/ls abs(n.μ − l.μ)^2" but it's not clear to me what "subsets of size m, n of sizes ls,ms, ns and median values l.μ,m.μ, n.μ (respectively)" means. Are these 5 subsets? How are they related?

Reply 3.2: That text was unclear and we have fixed it. See the second dot list in Section 5.5.
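
For readers of this thread, here is our reading of the quoted split criterion as a small sketch (illustrative only; the names are ours, and the paper's actual ranker also applies the hypothesis/effect-size tests mentioned in Section 5.5 before accepting a split):

```python
import statistics

def scott_knott_split(treatments):
    """`treatments` is a list of treatments, each a list of numbers,
    pre-sorted by their medians.  Try every cut of that list into a
    left part m and a right part n and return the cut maximizing
        ms/ls * (m.mu - l.mu)**2 + ns/ls * (n.mu - l.mu)**2
    where l is all values pooled, ls/ms/ns are value counts, and mu
    denotes the median (our reading of the quoted formula)."""
    l = [x for t in treatments for x in t]
    ls, l_mu = len(l), statistics.median(l)
    best_score, best_cut = -1.0, None
    for cut in range(1, len(treatments)):
        m = [x for t in treatments[:cut] for x in t]
        n = [x for t in treatments[cut:] for x in t]
        score = ((len(m) / ls) * (statistics.median(m) - l_mu) ** 2
                 + (len(n) / ls) * (statistics.median(n) - l_mu) ** 2)
        if score > best_score:
            best_score, best_cut = score, cut
    return best_cut, best_score
```

In the full ranker, the winning split is kept, and the procedure recursed on each half, only when the statistical test mentioned in Section 5.5 says the two halves genuinely differ.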

3.3 In the caption of Figure 13, you write "(these values are calculated by sorting all resolution time, then reporting the middle values of that sort) The", which should be "(these values are calculated by sorting all resolution times, then reporting the middle values of that sort). The"

Section 5.6
"examples where there exists at least N >= 30 examples" should be "examples where there exist at least N >= 30 examples"

Should "stratification's" be "stratifications"?

Section 6.1
"the data are are consistent" should be "the data is consistent" (you've always used "data" as a singular noun everywhere else in the paper)

Section 6.4
"cased study" should be "case study"

"the data are" should be "the data is"

Reply 3.3: Fixed.

external validity items

  • We believe that the fact that the study was limited to small to medium sized projects conducted using TSP is a major threat to making the generalized claims in the paper. There are also aspects of the data set and the analysis that were not clearly explained, which introduced concerns about the validity of some of the presentation. We think the authors are exploring a very important idea and have a valuable data set - we encourage them to continue this line of investigation and share the results with the community, even if the claims can only be made about TSP-like processes.
    • I have some concerns about your "low level" approach to data collection for the fix time. Your effort data is collected at a very low level - per developer, in terms of time. The Boehm data and the Beck assertions are at the project level and in terms of cost (usually). I don't think you can directly compare these things, as they may not be linearly related and it's not clear how they would be aggregated. For example, a simple sum would not take into account the critical path: two tasks that take 50 minutes each to fix delay the project by 50 minutes, but two that take 1 minute and 99 minutes delay the project by 99 minutes. I would expect that the cost is greater when the critical path is longer.
  • It would seem to me that "interruptions" like meetings and requests for technical help might actually be work that is required to fix issues. It seems questionable not to count this time toward something.
  • In Figure 10 - I have some concerns about your sample sizes considering that there were 171 projects and most of the rows have less than 171 items. This would seem to indicate that different projects might have very different data sets and that some interesting results might be lost in merging them all together for the sake of getting the numbers to be big enough to make "general" conclusions. It seems like some projects could easily match the old view and this could be lost in the statistics.

fix fig8 comments

"The first thing to note in Figure 8, is that this data does not follow
the exponential increase feared by Beck’s Figure 1"
Figure 8 does show increase but it does not even show the cost of defect:
it shows the numbers of defects?!

revise fig3

I don’t see why in Fig. 3 you report the whole list of 10 laws surveyed, while there is no discussion or introduction of the other nine laws beyond the Phase Delay one. To me, either you should include some more discussion of them or, better, remove the list.

is there evidence re the cost of TSP versus anything else?

One sentence I am writing needs some comment on the overhead of TSP. Ideally, the overhead is negligible (after some initial training... how much training? How much does an added mentor like the distinguished Dr Nichols cost? Is that mentor needed?)

it's not time, but cost

reviewer2:

  • Authors mix multiple quite distinct concepts, such as effort,
    cost, time, and the number of defects as if they were all
    interchangeable.
  • Authors appear to confuse the terminology. Most traditional
    definitions of phase delay involve cost. Cost may be dramatically
    higher than effort when customers get involved and software problems
    require hardware replacement. The time card data used by authors
    appears to represent effort, not cost.

reviewer1:

  • The paper refers in the title and in the abstract to the time to fix defects, whereas in the introduction and motivation sections the discussion focuses on the cost to fix defects. It is finally in Section 4.1 that the authors explain that they measure the cost to fix through the average time to fix, which is not a straightforward assumption. I would recommend moving this definition earlier, or changing the title and abstract to mention cost rather than time. In fact, the literature more generally refers to the cost of fixing a defect as increasing with phase delay.

restructure section 5

I do not understand how section 5 is structured. I would make three separate subsections, i.e. I would number 5.2.1 as 5.2 and 5.2.2 as 5.3. Moreover, for consistency, I would include H1 in the title of Subsection 5.1.
In 5.2.2 the numbers reported are not consistent with Figure 10, please check. Also, the caption refers to Left but I think you mean Right.

2005 or 2006?

In Section 4.2 you say “we examined 171 software projects conducted between 2005 and 2014.”, but in the rest of paper you mentioned 2006.

need to define before and planning phases

Despite the amount of the paper that describes PSP and TSP and lifecycles - I don't know what happens in the Before and Planning phases. I don't know what this could be that isn't what is traditionally considered the Requirements phase. Without this - I don't see how you can compare your results with other studies that would presumably consider these things as part of the requirements phase. Meanwhile, a lot of the Project Phases section seems irrelevant to understanding the data.

Process clarifications in Section 5.2.3

Section 5.2.3: There is a minor discrepancy between Figure 7 and the text. In the running text you list system tests before acceptance tests, but the opposite order is depicted in the figure. Make sure the order is correct, also in Figure 13. To me acceptance testing normally happens as the final V&V step before release. Did all 171 projects really do system testing last? Maybe this is prescribed in TSP?

Scott-Knott

  • The statistical analysis section (§5.5) can be enhanced by better explicating the calculations. The Scott-Knott algorithm is a particular clustering algorithm, and the reader needs some clarification how it contributes to the demonstration (this can be related to sections §6.1 and §6.3 about threats to validity).
    Moreover, Fig.13 is still insufficiently clear. The authors added a lengthy legend; I tend to think this would better be placed in the text. The column on the left hand (i.e. "rank") is still unclear to me. I also wonder if the size of the clusters should not be reported, i.e. how many defects are considered when 50th percentile growth is calculated? Does the calculation hold even if the clusters are minuscule in size? This aspect doesn't seem to be mentioned in §6.

Section 4: I think the paragraphing could be improved

Section 4: I think the paragraphing could be improved - the highly interesting point about agile is hidden. "A goal of agile methods is to reduce /---/ little empirical data exist" appears in the middle of a paragraph headed by "Shull et al. conducted a literature survey and held a series of e-workshops". Consider restructuring this section to bring forward the agile perspective.

reply to emse

DEAR AUTHORS-- THIS TEXT NOW MOVED TO PAPER

Dear authors,

Three knowledgeable reviewers have scrutinized the manuscript
and found it interesting and relevant, although in need of
certain improvements. Therefore, the authors are
invited to submit a revised version for a new review, where
the reviewers' comments
are addressed. Particularly, this holds for:

- the connection to activities of theory building in ESE
(reviewer 1).
  • [x ] TM

This new draft now maps our work into the SE literature on theory building.

- clarification about the data origin and
characteristics (reviewer 1). 
  • [x ] TM

This new draft now contains extensive notes on the origin and characteristics of our data.

- use of statistical analysis (reviewers 2 and 3) 
  • [ x] TM

All our results are now augmented with statistical significance tests as well as effect size tests. Those tests greatly strengthen the overall message of this paper.

- writing style, which would benefit from being more
precise (reviewer 3)
  • TM >
Reviewer #1: PAPER SUMMARY:

One of the most widespread beliefs in software engineering
is that the sooner you identify and resolve  an issue the
less effort it requires (the authors call it DIE).  This is
also being taught in software engineering education; fix
something in the requirements phase and it is a quick fix;
wait until testing and you will have to debug source code,
modify design, update documentation, and then change the
requirement - a much more complex task. This rule-of-thumb
was expressed in the 80s, and much has changed in the
software business since, e.g., programming languages,
editors, agile development and stack overflow. To what
extent does DIE actually exist in modern software
engineering projects?

The authors investigate the DIE phenomenon quantitatively by
analyzing data from 171 completed projects. All included
projects used Team Software Process (TSP), a development
methodology developed by the Software Engineering Institute
at Carnegie Mellon University. A central component in TSP is
detailed time logging, i.e., a TSP developer reports how
much time he spends on various tasks including issue
resolution. Thanks to access to a large dataset of completed
TSP-driven projects, the authors have been able to analyze
effort spent on issues in different development phases.

The results show that DIE does not apply to all projects,
and the effect is not as strong as previous publications
suggest. The authors thus conclude that DIE cannot be
assumed to always hold.  Furthermore, they present five
explanations of the observed lack-of-effect: i) DIE is a
historic relic, ii) DIE is intermittent, iii) DIE applies
only to very large systems, iv) DIE has been mitigated by
modern development methodologies, and v) DIE has been
mitigated by modern development tools. Finally, the authors
state the relevant question: how could DIE become such a
widespread idea in the literature, despite limited empirical
evidence?


REVIEW SUMMARY: + The paper investigates whether a widely
accepted "rule" in software engineering holds by looking at
a large set of empirical data. The research topic is highly
relevant and the results are interesting.  + The discussion
on reassessing old truism fits the scope of the journal very
well.  + The language in the paper is excellent, and the
structure is (although non-standard) easy to follow.
  • TM Thank you for those comments.
- It is not easy for the reader to understand what parts
of the "life of an issue" the authors claim are affected by
DIE, and it is also not too easy to understand what data the
authors have available. I believe adding another figure
could address both these aspects.
  • @llayman: We have added an example of the DIE in the Introduction, and more precisely defined how we measure this effect in Section 6.
- The different projects studied are all grouped together. I
believe studying clusters of development contexts would be
highly interesting, but unfortunately the current
characterization of projects is inadequate.  
- The authors do not relate their reassessment of an
"old truism" to the growing set of papers on theory building
in software engineering. With access to such large amounts
of empirical evidence, it should be possible to take some
steps toward improved DIE theories. 

[ ] @fshull need words on this

DETAILED REVIEW: The authors should expand on what they
include in the "delayed issue effect" (DIE), in particular
in relation to standard milestones in the life of an issue
and the corresponding timespans. I would suggest adding a
figure to point out important issue milestones on a
timeline, e.g., 1) injected, 2) found/reported, 3) work on
resolution started, 4) work on resolution ended, 5) V&V
completed, 6) resolution deployed/integrated. I believe the
authors refer to an increase in time between 3) and 4) as
the DIE, but I'm not really sure. I'm particularly
interested in whether all "indirect" issue resolution steps
are covered, e.g., change impact analysis, updating
documentation, and regression testing. The effort involved
in these steps clearly depends on the development context of
the specific project. The short a-c) list on page 14:50
suggests that the issues studied are restricted to minor
fixes, i.e., no updates to architecture, changed hardware,
recertification, updated user manuals etc.

A figure with issue milestones could also help the reader
understand what data are actually available for the
analysis. Section 6.3 describes the data, but some aspects
should be further explained. Logging interruption time
sounds very useful, but I wonder how carefully the
developers actually did this -
  • @llayman: We now provide a more explicit definition of what constitutes a defect (our study of issues is limited to defects) and the "defect lifecycle" phases we measure in Section 6.2.2: "a defect is any change to a product, after its construction, that is necessary to make the product correct". This includes both minor issues (typography) and severe issues (architecture changes, requirements errors). Time on a defect includes a) time to investigate/analyze a defect once discovered, b) time to craft and implement a fix, and c) time to verify and close the defect.
 it must be really hard to keep up the discipline required.
 If you are interrupted as a developer (e.g., phone calls,
 urgent mail, someone asks a question) I don't think the
 first thing you do is to stop a timer on your computer.
 Moreover, developers often work on several defects in
 parallel, and might interweave bug fixing with new
 development. I don't think the "interruption time" captures
 all such multi-tasking, and it should at least be properly
 discussed in Section 7 "Threats to validity".
  • @llayman Yes, there is undoubtedly some inaccuracy in the time-tracking. We provide some evidence on the TSP data set's integrity in Section 6.3, and a more extensive discussion of time-tracking and defect logging accuracy in Section 7.1 on Conclusion Validity.
The discussion on reassessing old truisms (Section 3) is
interesting, but it should be complemented by the
perspective of theory building in software engineering - An
active research topic lately, with a dedicated workshop
series (GTSE). I suggest looking into the following
references for a start: Sjøberg et al. (2008) "Building
Theories in Software Engineering", Smolander and Päivärinta
(2013) "Theorizing about software development practices",
and Stol and Fitzgerald (2015) "Theory-oriented software
engineering". Considering new empirical evidence is
obviously critical to theory building, and discussing your
new results in the light of theory creation would be
valuable. The authors mention that DIE appears to occur
intermittently in certain kinds of projects - maybe the
authors could elaborate on this idea and present an improved
DIE theory based on what they now know? I believe the
authors have the best available data to do so, and I would
expect the paper to go beyond simply questioning the "old
truth". 
  • @fshull: is this material you know? can you do a
    page or two casting this work in terms of: Sjøberg et al.
    (2008) "Building Theories in Software Engineering",
    Smolander and Päivärinta (2013) "Theorizing about software
    development practices", and Stol and Fitzgerald (2015)
    "Theory-oriented software engineering"?
The authors study three claims in this paper, and the
third claim is the central one: "delayed issues are not
harder to resolve". To study whether issues require longer
resolution times in later phases, the authors analyze a
large set of issue data from historical projects. Thus the
manuscript reports from an observational study rather than
an experiment with a controlled delivery of treatments.
While I believe the authors' approach is practical, I would
like to see a critical discussion on threats to validity of
observational studies (e.g., Madigan et al., A Systematic
Statistical Approach to Evaluating Evidence from
Observational Studies, Annu. Rev. Stat. Appl. 2014. 1:11-39
and Carlson and Morison, Study Design, Precision, and
Validity in Observational Studies, J Palliat Med. 2009 Jan;
12(1): 77-82). An alternative study design (although
difficult to realize in a system of industrial size) would
be to let different developers resolve the same issues for
the same software system, i.e., one group resolves an issue
during design, another during implementation, and a third
during testing - some discussion along these lines would
strengthen the validity section. 
  • @llayman We have reviewed and cited the mentioned papers. We have incorporated many of the concerns raised in these papers into our expanded Section 7.4 on External Validity.
The first and second claims are studied with much less
rigor. "DIE is a commonly held belief" is studied using a
survey of software engineers, both practitioners and
researchers. According to Fig. 2 (actually a table), the
number of respondents is 16 and 30 for practitioners and
researchers, respectively. Sixteen respondents from industry
represent a tiny survey of a very general claim that any
software engineer could respond to. Why were not more
answers collected? The authors do not have much evidence to
defend the first claim, and there is no discussion of the
corresponding validity threats. The second claim, "DIE is
poorly documented", is studied using a literature review.
Unfortunately, the method underlying the literature review
is not presented. Although it doesn't need to be an SLR of
the most rigorous kind, the authors should report how the
papers were identified. I suspect the terminology used to
describe the DIE phenomenon is highly diverse, thus it would
strengthen the paper if the authors reported how they
reviewed the literature. According to Page 19:34 only eight
publications were identified. 
  • @llayman need to show that these were the only empirical
    results for DIE in tow
The iterative fashion of modern software development,
with agile at the extreme end, is not fully discussed (the
short discussion section could be extended). The phases of
the linear development of the 80s, (such as in Fig. 1)
probably don't exist in many of the projects in the TSP
dataset, still the authors discuss the DIE-effect from the
perspective of the 80s. Page 16:31 states "DIE has been used
to justify many changes to early lifecycle software
engineering" - does this mean the agile movement
successfully mitigated DIE? This possibility is not fully
considered in the paper.
How many of the 171 projects were
(more or less) agile? This appears to be an important
characteristic of the included projects - very important to
describe! 
  • @llayman An astute observation. We provide more information on the types of projects included in our sample in Section 6.4. More to the point -- does agile help mitigate DIE? The answer is: we don't know. A strong argument can be made that all iterative and incremental methodologies are meant to mitigate the DIE. Further, huge advances in software engineering technology and process have occurred since DIE was first observed, many of which have precisely targeted the DIE (static analysis, test-first development, automated builds). Perhaps our analysis provides evidence that DIE can be defeated, at a large scale across multiple varied projects, using modern technologies. We do not yet know the causes here, but certainly it appears the DIE should no longer be treated as a truism, but rather as a variable that can be controlled. We discuss this in Section 8.
  • @WilliamNichols : need a guesstimate of how many of the
    projects were "agile"
Concerning characterization of the 171 projects, the
paper needs to report much more detail. I would expect to
see some descriptive statistics. On page 17:18 the authors
say "perhaps we should split our SE principles into two sets
/---/" - of course SE practices need to be adapted to the
development context, and also two sets of principles is a
too simple split. The paper does not report much
characterization of the 171 projects. I strongly suggest the
authors to dig deeper into the data, and analyzing for which
types of projects the findings hold. What patterns are there
to discover? I suggest Petersen and Wohlin, "Context in
industrial software engineering research" (ESEM2009) for
details on how to characterize development contexts.
Moreover, I would really like to see what practitioners from
the 171 projects think of your findings - an interesting
option for a qualitative angle on your study. 
Fig 1: "Operation" dwarfs everything in Boehm's diagram.
Since you do not study anything post release in this paper,
I think this figure skews the reader's mental picture. If
you remove the rightmost bar and rescale accordingly, the
plot better matches the findings you report in Fig. 10. You
still identify an interesting result though, but the
presentation turns fairer. 
MINOR COMMENTS: Keywords: I believe the keywords could be
improved to help readers find this work.  
Several figures are copied from previous work. Are all copyrights properly
managed?  

Fig 10: Black and red is not a good pair for
grayscale printouts. Please use black and gray instead.


- [ ] @timm 

Some figures are actually tables, thus their captions should
be replaced accordingly.


- [ ] @llayman

Some figures should be resized to better match the page width.


- [ ] @llayman

Page 3:14 - "The above argument" Which argument? Could be precisely specified.


- [X]

Page 3:22 - Please provide the full link to the dataset.


- [ ] @timm 

Page 4:8 - "More difficult" but "harder" is used in claim3. Why this inconsistency?

- [ ] 

Page 18:26 - First sentence: "Unexpected results such as this one". Ambiguous reference,
please be specific.


- [ ]

Page 18:27 - "We also survey the state of SW dev. practice /---/": given the size of the survey,
this statement feels a bit bold.

- [ ] 

TYPOS ETC.
Sec 1, §1: [3] and [30] have been swapped?
Page 14:46 - Spell out IV&V the first (and only) time it appears.
Page 3:24 - missing verb (is)
Page 16:43 - "Did this study failed"
Page 18:44 - Appache
Page 19:19 - Al
Ref 26 - oo --> OO


- [X]
_____________________________________________

Reviewer #2: The topic of this paper is the empirical
evaluation of a largely admitted claim in software
engineering, i.e. the exponential cost of correcting errors
according to the phase in which the errors are discovered.
The claim is first confirmed by surveying and interviewing
practitioners and experts. Next, the authors use a large set
of data from projects in SEI database. The results show
that, strictly speaking, the claim does not hold (although
some effects of delayed corrections can be noticed).

Globally, the paper is well written and the authors develop
a convincing demonstration. The topic of the paper is highly
relevant for both practitioners and people from academia.
Among many, I personally believed in this claim and had
regularly taught it in my courses. Thus, reading this paper
offers a refreshing perspective on our understandings of
software engineering background and theoretical knowledge. I
also like the suggested posture that insists on the
necessary skepticism that we should have against such
prevailing claims.

Beyond this positive global impression, I have some concerns
about the study reporting and analysis: - It seems to me
that more descriptive statistics about the sample would
contribute to a better understanding of the scope of the
study.


- [ ] @timm more  stats

I think in particular about the size (lines of code and/or
number of software components); a duration histogram would
also be relevant. Accordingly, the formula for calculating
variable "Total effort" in Fig. 7 could be explained, and
Figure 9 made more readable (bigger size).


- [X] @llayman We have added more description of the projects to Section 6.4. We have also added more precise definitions of our measures in Section 6.2. We have adjusted the size of figures throughout the paper to be more readable.

If certain attributes for the issues and errors are
available in the data (e.g. severity, priority, etc.), they
should also be brought to the reader's knowledge.


- [ ] @WilliamNichols: got any stats on these

Figure 10 is central to the paper's demonstration; its
expressiveness could be enhanced: i) the reader tends to
think that the "right hand side bars" are different from the
BLACK and RED bars, ii) the formula for calculating the 50th
and 95th percentile ratios could be provided, iii) the unit of the column
"Percentile" could be mentioned.


- [ ] @timm split into two
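
On the "provide the formula" point above, this is our guess at how the Figure 10 ratios might be computed, sketched for concreteness (the data layout and names are illustrative, not the paper's code):

```python
import numpy as np

def percentile_ratios(times_by_phase, injection_phase):
    """For each removal phase, compute the 50th and 95th percentile of
    issue-resolution times and express them as ratios of the same
    percentiles in the phase where the issues were injected.
    `times_by_phase` maps a phase name to a list of resolution times."""
    base = times_by_phase[injection_phase]
    base50, base95 = np.percentile(base, 50), np.percentile(base, 95)
    return {phase: (np.percentile(t, 50) / base50,
                    np.percentile(t, 95) / base95)
            for phase, t in times_by_phase.items()}
```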

I like the discussion section, in particular, the idea
of software architectures that could be a contributing
factor for enhancing software evolution and reducing the
cost of issues and errors fixing. Is there any available
knowledge in the data set concerning any architectural
design choices made in each project? If so, would it make
sense to seek any correlation between these architectural
choices and how expensive it was to solve issues?


- [ ] @WilliamNichols: anything on architectural style? or should
we just say "many and varied and hard to get a precise
picture of it all"

In the same trend of ideas, I think what has
fundamentally changed since early days of SE is the
development of requirements engineering. In actual SE
practices, the problem and solution spaces are
systematically explored thanks to RE techniques; together
with architecture, this could also explain why things have
changed since the '70s. It could be, for example, that people
tend to make less severe errors in early software project
phases (thanks to RE techniques); a similar phenomenon was
observed when SE practices became more mature (see Harter et
al. 2012).

References: Harter, D. E., Kemerer, C. F., and Slaughter, S. (2012).
Does Software Process Improvement Reduce the Severity of
Defects? A Longitudinal Field Study. IEEE Transactions on
Software Engineering, 38(4), 810-827.


- [ ] @timm reference harter
- [X] @llayman Absolutely - there are potentially many explanations for why the DIE was NOT observed in our dataset. We discuss some more of the potential causes in Section 8.

In the conclusion section, I feel uncomfortable with
the assertion "That data held no trace of the delayed issued
effect" (line 17). As mentioned elsewhere in the paper, the
delayed issue is not absent; it is much less significant and
systematic than what is usually claimed. Moreover, this
reduction has been demonstrated for medium sized projects;
we cannot generalize to larger projects.


- [ ] @timm  

Minor remarks: - p.9, line 29 : " … reported by Shull [52], found that the cost to find certain non-critical classes of defects …" => is it the "cost to find" or the "cost to FIX"?


- [X] @llayman It should be "cost to fix". We have corrected this.

  • p.10, line 21: duplicate word "there" -
    p.12, line19: missing word "One of THE guiding …" - p.16,
    line 33: "distinguish" instead of "distinguished"? - p.18,
    line 42: explain acronym MEAN - p.20, line 20: useless
    enumeration A1?

- [X] 
_______________________

Reviewer #3: I like the idea of the paper and I certainly
would like to see more empirical studies that periodically
check if our beliefs about software engineering practices
still hold (or even if they have ever held). The authors
address one of the beliefs that are more entrenched in the
population of researchers and practitioners, and a very
important one too.

At the same time, I think that the paper needs to be
improved before it can be published.

The empirical analysis (Section 6) is a bit of a let down. I
would like to see some sort of more robust and rigorous
statistical analysis, based on statistical tests. Instead,
the paper only provides qualitative comparisons between the
results obtained at the 50th and 95th percentiles. This is
not to say that the results or the discussions are
incorrect, but only that they need to be better supported.
The lack of this kind of statistical analysis should at
least be mentioned in the Threats to Validity. I understand
that you say that your "claims" (including Claim 3) should
not be considered "hypotheses" in the statistical sense and
I agree with you on Claim 1 and Claim 2. However, Claim 3
lends itself to a statistical analysis, without which your
results become a bit more anecdotal, and this is what you
rightly criticize in the previous claims that DIE actually
exists. In addition, you may want to analyze some additional
and somewhat unexpected results that can be derived from
your data (see Detail Comments below).


- [ ] do the stats

The paper seems to be written in a somewhat casual style,
which makes the paper more pleasant to read, but less clear
in some parts (see Detail Comments below).

DETAIL COMMENTS

Section 2

I'm not really sure what the authors mean by "We say that a
measure collected in phase 1, ., i, .. j is very much more
when that measure at phase j is larger than the sum of that
measure in earlier phases 1 <= i < j." You are defining a
property of a measure, but, the way this definition is
written, it seems as if you are defining the property of a
measure of "being much more," without any further
qualifications. I can guess that you mean that a measure
$m$ (collected in phases, so you can denote by $m(i)$ the
value of $m$ in phase $i$) has this property in phase $j$
(so, it's a property m has in a specific phase and not a
general property of the measure) if $\sum_{1 \le i < j} m(i)
< m(j)$. Then, it's up to you to make this a property of the
measure, for example by using an existential "policy" ($m$
has this property if there exists one phase $j$ where
$\sum_{1 \le i < j} m(i) < m(j)$) or a universal one ($m$ has
this property if for all phases $j > 1$, $\sum_{1 \le i < j}
m(i) < m(j)$).


- [ ] @timm just do that
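
For concreteness, here is the reviewer's reading of that property as a tiny executable check (illustrative only; the `policy` argument mirrors the existential/universal choice discussed above):

```python
def very_much_more(m, policy="exists"):
    """m is the list of per-phase values m(1)..m(k).  Phase j has the
    property when m(j) exceeds the sum of m(i) over all earlier phases
    1 <= i < j.  `policy` selects the existential or universal reading
    discussed above."""
    hits = [m[j] > sum(m[:j]) for j in range(1, len(m))]
    return any(hits) if policy == "exists" else all(hits)
```

For example, very_much_more([1, 2, 10]) holds under either reading, while very_much_more([3, 1, 2]) holds under neither.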

The real problem with this definition, however, is that it
doesn't seem to be used anywhere in the paper. The only
point where it might be used is in the discussion of Figure
1 (which precedes it anyway). So, you may want to remove
this property altogether.


- [ ] @timm you're right. the real thing here is that there is a
statistically significant increase

Your definition of "difficult issue" is "Issues are more
difficult when their resolution takes more time or costs
more (e.g. needs expensive debugging tools or the skills of
expensive developers)." That's not a very precise
definition, though, because it has two different
interpretations, one in terms of time and the other in terms
of effort. The reasons why you introduce this definition
become clearer only in Section 7.2, in the Threats to
Validity, and that's too late. You should move some of the
discussion of Section 7.2 here.


- [ ] @timm roger

Why would the term "delayed issue effect" be a
generalization of the rule "requirements errors are hardest
to fix"? "Delayed" seems to be quite specific as it appears
to refer to time, while "hardest to fix" may refer to other
variables, like effort or cost.


- [ ] @timm text

In Claim 3, "very much more harder" sounds like a
little bit too much ... ("more harder"?)


- [ ] 

Section 4

A very minor issue: why is Figure 2 a ... figure, instead of
a table? Same for all other tables ...


- [ ]

Section 5.1

This is not a complete sentence "All the literature
described above that reports the onset of DIE prior to
delivery."


- [X] 

Section 6.3

You should rephrase your definition "a defect is any change," since a defect is what existed before the change was made and not the change itself.


- [X] @llayman We now provide a more precise definition of defect in Section 6.2.

You do not explain what "QualTest" in Fig. 8 is.


- [ ]

The definition of "time per defect - The total # of defects found in a plan item during a removal phase divided by the total time spent on that plan item in that phase." is not correct as it is, as this would basically be the number of defects per unit time, instead. The roles of time and
defects should be reversed.


- [X] We now provide a more precise definition of time measurement related to defects in Section 6.2.
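
For concreteness, a minimal sketch of the corrected ratio (time divided by defects, as the comment above notes); the plan-item representation is illustrative, not the paper's schema:

```python
def time_per_defect(plan_items):
    """`plan_items` is a list of (minutes_in_phase, defects_found)
    pairs for one removal phase.  Time per defect is the total time
    spent on those plan items divided by the total number of defects
    found there, i.e. the reverse of the ratio in the quoted text."""
    total_minutes = sum(t for t, _ in plan_items)
    total_defects = sum(d for _, d in plan_items)
    return total_minutes / total_defects if total_defects else None
```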


Section 6.4

Besides adding a more rigorous statistical analysis, you
need to improve your explanations.

Figure 9. The caption says "Distribution of defects found
and fixed by phase." Is that the distribution of defects
found in some phase and fixed in that phase? The text says
"The distribution of defects found and fixed per phase in
our data is shown in Figure 9. A high percentage of defects
(44%) were found and fixed in the early phases, i.e.,
requirements, high level design, and design reviews and
inspections." which doesn't add much. If that's the
distribution of defects introduced in a phase and fixed
immediately in that phase, it is not clear why you introduce
it, though.


- [ ]

Figure 10. The explanation of the data and histograms in the
figure is, at best, confusing. The caption says "50th and
95th percentiles of issue resolution times" and the opening
sentence of Section 6.4 says "Figure 10 shows the 50th and
95th percentile of the time spent resolving issues ..."
However, these are not "times," because you write "expressed
as ratios of resolution time in the phase where they were
first injected" a couple of paragraphs below. This seems to
be the right interpretation, but right after that you write
"The BLACK and RED bars show the increases for the 50th
(median) and 95th percentile values, respectively," which
would be a third interpretation of the data in Figure 10.


- [ ]

You should provide the value of the mean, in addition to the
50-th percentile (i.e., the median).


- [ ]

You need to at least mention (or, better, discuss) some of
your results, because they appear to be "counterintuitive,"
since they show that, for example, median resolution times
can even decrease if issues are not fixed immediately, but
in later phases. For example, looking at the Reqts section,
most percentages are way below 1, which seems to indicate
that some issues are actually much easier to fix at later
stages (maybe because more or better information about the
software system is available only later). That would be a
very interesting result by itself, especially if you can
provide some statistical support for it.


- [ ] @timm really, no statistical difference

Section 7.1

Maybe you could analyze the projects developed in a
"traditional" way and check if there is some sort of DIE
there. For example, if you look at the projects developed
with the waterfall life cycle, maybe you will find larger
DIE than for the other life cycles, which might partially
justify the claims of previous researchers about the
existence of DIE. This would also show that newer kinds of
life cycles (like Agile ones) help solve DIE. You do provide
hints about this in Section 8 ("We also note that other
development practices have changed in ways that could
mitigate the delayed issued effect") and Section 9, but
that's clearly not enough.


- [ ] @williamNichols: can we separate out the agile and the trad?

Section 8

I'm not sure what you are really trying to discuss in this
section, because you are not discussing the results of the
paper. It sounds like you are saying that maybe it's the
newer software development approaches that made DIE
disappear?


- [ ]


- [ ] Verify that the Section #'s in our responses are accurate prior to submission.

Figure 12 - Found and fixed?

Figure 12 - Found and fixed? Does this mean each defect appears twice in the figure? Could you separate them? The current figure is hard to penetrate.

trying to explain phases in a cyclic progress

Hey Bill,
is there any issue here:

  • The phases logged by our data are shown in \fig{waterfall}.
    Although the representation suggests a waterfall model, the SEI experience is that all real implementations of any size follow a spiral approach, with many teams performing the work in iterative and/or incremental development cycles.

Is each such cycle its own before/planning/req/des/code/test thing? or, do folks go agile spinning between (say) req/des or (say) des/code?

please advise

t

fix histogram

In Figure 10 - why is the histogram for Code Review in the Code phase not 10 *'s like all the others? Also, the histograms - since they are scaled by phase - would seem not to be comparable across phases, but they are visually presented in a way that makes them look comparable. I'm not sure why you scaled them.

comment on post-release bugs

Reviewer3:

  • You compare Figure 8 to Figure 1 to point out that they don't match. These are graphs of different things. The y-axis of Figure 1 is cost and the y-axis of Figure 8 is number of defects. You've pointed out that the past data ignored the phase where the defect was introduced - so I don't see the point of Figure 9. (Overall I don't see the point of investigating H1, but if you are going to include it, this is a big problem with the analysis.)

Reviewer 1:

  • The claims are not commensurate with evidence. The projects in
    the sample are homogeneous, small, and do not contain post-release
    defects.
  • To note a few weaker sides, the study is based on mostly small to
    medium projects, but the authors themselves pointed out (in section
    3, Figure 4) that Phase Delay effect and the cost to fix an error
    are most important for larger projects. More importantly, one would
    assume requirement errors would have more effect for larger
    projects, since they are supposed to contain more functionality and
    also to be general, in contrast with the small or medium projects
    that are specifically designed for some particular process/ intended
    for a specific group of users in many cases. So it may not be
    justified to claim that hypothesis 3 , "Requirements errors are the
    most expensive to fix", is wrong, as mentioned in the paper, when
    considering only small & medium projects and ignoring post-release
    issues. Moreover, the study accounts for the phases in which some
    defect was injected and the phase in which it was fixed, but no
    information is provided about the absolute time scale. One would
    assume a long running project is supposed to have more serious
    impact in terms of cost-to-fix.
  • The blanket claims are not justified by the narrow domain size and practices
  • The problem is supposed to be most prominent in large projects
  • No post release information: the critical concern of phase delay
  • Confusing terminology

Defect type not discussed

  • Finally, looking at Fig. 6 and the defect type distribution, wouldn't it be relevant to explore any possible effect of defect type on the results in Fig. 13? As defect severity is absent, maybe defect type has some influence. It could be that certain categories of defect can be corrected at any time during the project; it could even be that these defects, in particular, are delayed for later (because they are easy to repair) … It is indeed a mystery why certain defects have been fixed immediately (as reported in Fig. 12, how many in total?) and others were delayed for later (as analyzed in Fig. 13, how many in each cluster and in total?).

Bill, can you perhaps offer some response to this?

fix references

Please check your references throughout; some include the authors' full first names, others only the initial.

Effect of high severity bugs?

Section 6.3 paragraph 5 - "Figure 14 shows that no such effect occurs…" How? Severity is not presented in the figure. Moreover, did your stratification cover high-severity defects?

The argument referenced is from Tim -- I confess to never understanding it myself. Tim or maybe Forrest needs to rework it to justify why we don't think the severity distribution matters.

Scott-Knott ranker description

The description of the Scott-Knott ranker is not terribly clear. You write "Scott-Knott seeks the division of of l treatments into subsets of size m, n of sizes ls,ms, ns and median values l.μ,m.μ, n.μ (respectively) in order to maximize ms/ls abs(m.μ − l.μ)^2 + ns/ls abs(n.μ − l.μ)^2" but it's not clear to me what "subsets of size m, n of sizes ls,ms, ns and median values l.μ,m.μ, n.μ (respectively)" means. Are these 5 subsets? How are they related?

Authors do not define what they mean by phase delay.

Reviewer1:

  • Authors do not define what they mean by phase delay. That makes it very difficult to evaluate the claims and the relationship to prior work.

Reviewer3:

  • I find the initial statement of the first two hypotheses to be unclear. When you say "across phases" I thought you meant that the defect survived across phases - but actually you did not (necessarily) mean that. You simply meant "over the course of the software lifecycle." And "as the phase delay increases" is not clear because you haven't clearly defined "phase delay". Your descriptions of both in the conclusion are much clearer - but why do you have two different ways of stating the hypotheses? Please be clear and consistent when you are stating the research questions.

revise and lighten tones throughout.

In some parts I found the tone of the text a little assumptive. For example, “Lest the reader find this too little to justify our conclusions, we note that …”: trust the ESEC FSE readers to be able to draw their own conclusions. Later on: “the phase delay myth persists. Hence, this paper.” Let readers judge whether the paper convinces them or not. Please revise and lighten the tone throughout.

I don't fully understand Section 5.2.2 paragraph 4, and the relation between plan items and the time tracking logs.

I don't fully understand Section 5.2.2 paragraph 4, and the relation between plan items and the time tracking logs. "One or more defects are reported against a single plan item, e.g., a review session, an inspection meeting, a test execution". Multiple defects can be mapped to a plan item? But a plan item can also be "resolving a defect" (page 11:46). Also, the time tracking logs include time to collect data, prepare a fix, and its validation (page 12:50)… But what if a plan item includes both running a test suite and resolving the identified defect? The equation presented on page 13 suggests that the "time-to-fix a defect" could be inflated by e.g. running slow test cases. I'm sure the authors have all this covered, but the text confuses me. Could it possibly be revised? Maybe another figure could clarify the situation?

been mercilessly pruning details from the TSP description

bill--

worried you won't like the reduced description of TSP in this draft. now, space has to be made for the extra discussion points demanded by the reviewers. but i would not be surprised if you start adding back some of the culled details

setting up

  • port to new format
  • define phase delay type1 and phase delay type2
  • add the 2002 paper

comment on threat to validity

Also, one would assume that the TSP process was developed while
keeping the Phase Delay effect in mind. That might be one cause why
the effect is not predominantly seen in the projects studied; that
point is not mentioned in the paper anywhere. One independent metric
would have been to study some loosely developed projects and see if
the effect is seen. The authors themselves pointed out that "The
time in final testing is particularly low suggesting that few
defects survived into testing". So, no major defect might have
survived long enough in the system.

are phases accurately measured?

Bill: any comment on how we know developers found the right root-cause phase of a defect?

  • I was concerned that there appears to be no assessment done if
    developers can accurately determine the phase when defect was
    injected.

(Actually, I don't even get this comment... how does anyone find the origin phase accurately? Boehm? anyone? why are we being pinged on this issue?)

Ignore this: support comment

  • In Figure 10, there are many cells missing, for example, no defects
    from planning found in system test, integration test, or acceptance
    test. This suggests that some of the cells may come predominantly
    from one set of projects while others from a different set.

more discussion on the evolution of SE discipline and how novel paradigms may have impacted on the truisms here challenged

Weaknesses

  • even though a huge base of projects has been analysed, the sample cannot be externally generalized as all projects share the TSP process

I would have also liked to see more discussion on the evolution of the SE discipline and how novel paradigms may have impacted the truisms challenged here. The fact that the studied projects adopted TSP, and that this could be a reason for the observed results, is only briefly mentioned among the threats to validity. I think this is an important feature of this study that should be taken into account when making claims throughout the paper.

need to explain

Bill: you got any comments on this:

  • In Figure 10 - I have some very serious concerns about some of the data. The most obvious example is that there are a large number of Before, Planning and Requirements errors being found during UnitTest. I find this extremely unexpected. Unit testing should be focused on verifying a low level component in the context of very specific design requirements. It is amazing that it would be so effective at discovering "Requirements" errors after those requirements have been through inspections and reviews during the requirements and design phase. This is the most amazing thing I see in the data. What is going on during unit testing?

ignore this bit, just support text:

  • Are people inappropriately tagging defects as "requirements" issues? This data goes completely against the traditional "V" diagram that aligns System Test with Requirements, Integration Test with Design and Unit Test with Coding
