greg4cr / sbse-sigsoft-standard
Proposed ACM SIGSOFT Standard for Optimization Studies in SE (including SBSE).
Home Page: https://arxiv.org/abs/2010.03525
From Lionel Briand: "The only thing that I would contend with is the discussion about “importance” or what I would call relevance. Research, by definition, is exploratory and about taking risks. But in an engineering discipline the problem should be well defined, with clearly justified assumptions."
"We stress that the use of optimization in SE is still a rapidly evolving field. Hence, the following criteria are approximate and many exceptions exist to these criteria. Reviewers should reward sound and novel work and, where possible, support a diverse range of studies."
Paul states: "Take this paragraph out and try to build this flexibility into the essential attributes. Most readers are only going to read the essential, desirable, and extraordinary attributes so all the critical stuff has to go in there somehow".
Paul believes this should shift to "desirable", as it "may not always apply". Do we agree?
This point is (a) quite long, and (b) not immediately clear. Would suggest an editing pass.
From Twitter:
Daniel Struber - "Looks great, thanks for this! I have a remark: "one should at least consider the option space" I think that this criterion needs to be more specific to allow a fair application, given that the option space (any available optimization technique) is huge."
"To enable open science, a replication package should be made available that conforms to SIGSOFT standards for artifacts."
What are some of the possible components of this replication package? We should provide a list of suggested elements.
This is a great initiative, and I would like to thank the authors and everyone contributing.
I would suggest adding a point under the "essential" category related to datasets used for evaluation. It is important that there is an appropriate justification of the dataset used for evaluation, and a description of the main features of the dataset that characterise the different problem instances in terms of "hardness". For example, if the size of the problem instances is an important feature that affects the performance of the optimisation approach, the dataset should be described in terms of this feature.
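To make this suggestion concrete, here is a minimal sketch of describing a dataset in terms of a hardness feature (the instance names and sizes are hypothetical; a real study would use its actual benchmark subjects and whichever feature drives difficulty):

```python
import statistics

# Hypothetical benchmark: each problem instance described by the
# feature believed to drive hardness (here, instance size).
instance_sizes = {"inst-a": 120, "inst-b": 450, "inst-c": 900, "inst-d": 2400}

# Describe the dataset in terms of that feature so readers can judge
# how hard the instances are and whether the selection is balanced.
sizes = sorted(instance_sizes.values())
description = {
    "n_instances": len(sizes),
    "min_size": sizes[0],
    "median_size": statistics.median(sizes),
    "max_size": sizes[-1],
}
```

Reporting such a summary (alongside a justification for why the feature matters) would satisfy the proposed essential attribute.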
Under the "desirable" category I would suggest including something like "efforts should be made to avoid any bias in the selection of the dataset".
Under the "invalid criticism" I would include "the paper uses only one dataset". Reviewers should provide valid criticism as to why that single dataset is not sufficient, or ask for more clarification from the authors. Papers should not be rejected because the authors use a single dataset.
I have one point of critique, though. All examples seem to come from the field of SBSE. However, there are many other optimisation techniques; in particular, there is constraint solving (SAT, SMT, CSP, and others). These all deal with optimisation problems, and not all of them are stochastic. Nevertheless, I would imagine most points apply (define the search space, fitness/evaluation criteria, etc.).
I would thus suggest removing "(e.g., metaheuristics and evolutionary algorithms)" from the introduction and adding further examples: e.g., change
"The algorithm underlying an approach (e.g., the metaheuristic) "
to "The algorithm underlying an approach (e.g., the metaheuristic, CDCL)"
and change:
"One should sample from data multiple times in a controlled manner."
to "One should sample from data multiple times in a controlled manner, where appropriate." (or something similar, as it is not always necessary)
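A minimal Python sketch of what the suggested "controlled sampling" criterion could mean in practice (the toy dataset stands in for real benchmark subjects; the split fraction is illustrative):

```python
import random

# Hypothetical dataset of problem instances; in practice these would
# be the benchmark programs or test subjects of the study.
data = list(range(100))

def sample_split(seed, test_fraction=0.2):
    """One controlled sample: a seeded shuffle, then a holdout split."""
    rng = random.Random(seed)  # seeded => reproducible
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

# Repeat the sampling under distinct, recorded seeds so the variance
# introduced by sampling can be measured and the study replicated.
splits = [sample_split(seed) for seed in range(30)]
```

Recording the seeds makes each sample exactly reproducible, which is what distinguishes "controlled" sampling from ad hoc resampling.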
I tried to find an exemplary paper that would compare SBSE and constraint-based approaches. There have been quite a few constraint-based approaches proposed for test input generation, but I wasn't sure which one to pick. Perhaps others might have suggestions.
Thanks again for all the efforts.
From Guenther Ruhe:
Guenther Ruhe, Optimization in Software Engineering - A Pragmatic Approach. In Felderer, M. and Travassos, G.H. eds., Contemporary Empirical Methods in Software Engineering, Springer, 2020.
From Mark Harman:
Mark Harman, Phil McMinn, Jerffeson Teixeira Souza, and Shin Yoo. Search-Based Software Engineering: Techniques, Taxonomy, Tutorial. Empirical Software Engineering and Verification, Lecture Notes in Computer Science, vol. 7007, pp. 1–59, 2011
From Lionel Briand: "We tackled many of these issues in our IEEE TSE 2010 paper, in an SBST context: Ali et al., 'A Systematic Review of the Application and Empirical Investigation of Search-Based Test-Case Generation'."
(add to recommended reading, see if any other advice from it that we should apply)
From Paul Ralph: "My only suggestion so far is to try to keep the specific attributes more concise, using footnotes if needed."
From Twitter:
I also struggled with “The effects of stochasticity must be understood and accounted for at all levels (e.g., in the use of randomized algorithms, in fitness functions that measure a random variable from the environment, and in data sampling)” which seems a bit broad for a new problem where you don’t understand half of the stochasticity yet.
It may be good to clarify what we mean by "understood and accounted for" and "at all levels" and how this applies to well-studied problems vs new problems or new approaches.
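One concrete reading of "accounted for": repeat the stochastic run under independent, recorded seeds and report a distribution rather than a single result. A sketch with a toy random-search optimizer (the objective and budget are purely illustrative):

```python
import random
import statistics

def randomized_search(rng, budget=200):
    """Stand-in for a stochastic optimizer: random search maximizing
    the toy objective -(x - 0.7)^2 over x in [0, 1)."""
    best = float("-inf")
    for _ in range(budget):
        x = rng.random()
        best = max(best, -(x - 0.7) ** 2)
    return best

# Account for stochasticity: repeat under independent seeds and
# report the distribution, not a single lucky run.
results = [randomized_search(random.Random(seed)) for seed in range(30)]
quartiles = statistics.quantiles(results, n=4)
summary = {"median": statistics.median(results), "iqr": quartiles[2] - quartiles[0]}
```

Even this does not cover every level the standard mentions (e.g., noisy fitness measurements from the environment), which supports the point that "at all levels" needs clarification for new problems.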
From Norbert Siegmund:
There are some important aspects missing, especially criteria for the input space (generation), the description of the search space and why it is truly an exponential problem, and a clear discussion of threats to validity. Please refer to our paper: https://t.co/sNQK4DaaTQ?amp=1
In our paper together with SvenApel and Stefan Sobernig, we analyzed papers optimizing software configurations with variability models. Here, features (or options) of a SW system are modeled together with attributes, such as performance. Goal: Try to find the optimal config.
There are three validity issues common in most papers:
Non-realistic inputs: Attribute values have not been measured, but mostly generated from an arbitrary distribution that has no relation to value distributions in the wild. Optimizations should work with realistic data.
Exponential search space: Finding an optimal configuration among an exponential number of combinations of SW features is an NP-hard problem. However, if there are only a linear number of effects, we can simply compute every configuration's value with a function having a linear number of terms. Hence, the optimization task becomes trivial, because ignoring combination (or interaction) effects reduces the problem to a linear one. We saw this problem simplification, via omitted interaction effects, in all analyzed optimization papers.
Threats to validity: We saw that experimental setups defined in early papers have been reused by others. But the threats to validity of these early setups have not been addressed, and not even mentioned, by the papers reusing them. Hence, make your limitations explicit.
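The "exponential search space" point can be illustrated: if a configuration's value is a purely linear function of its features (no interaction terms), the apparently exponential search collapses to a per-feature decision. A sketch with hypothetical feature weights:

```python
import random

# Hypothetical linear performance model: each feature contributes an
# independent weight and there are no interaction terms.
random.seed(0)
n_features = 30
weights = [random.uniform(-1, 1) for _ in range(n_features)]

def performance(config):
    """Objective value of a 0/1 feature-selection vector."""
    return sum(w for w, on in zip(weights, config) if on)

# Without interactions, the optimum is decided feature by feature:
# enable a feature iff its weight improves the objective. No search
# over the 2^30 configurations is needed.
optimal = tuple(1 if w > 0 else 0 for w in weights)

# Sanity check against 1000 randomly sampled configurations.
sampled = max(
    (tuple(random.randint(0, 1) for _ in range(n_features)) for _ in range(1000)),
    key=performance,
)
assert performance(optimal) >= performance(sampled)
```

A benchmark generated this way is trivially solvable, so reporting gains on it overstates what an optimizer can do on realistic, interaction-heavy configuration spaces.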
A few examples of "effect size tests" from the Antipatterns section would be useful, especially if they can be paired with appropriate statistical tests (and perhaps links to relevant Wiki entries). For example, would you expect rank correlation or just a general effect size? https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test#Effect_sizes
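As one concrete pairing: the Vargha–Delaney A12 statistic is a rank-based effect size commonly reported alongside the Mann–Whitney U test in SBSE studies. A pure-Python sketch:

```python
def a12(xs, ys):
    """Vargha-Delaney A12 effect size: the probability that a value
    drawn from xs exceeds one drawn from ys, counting ties as half.
    0.5 means no effect; values near 0.71 (or 0.29) are conventionally
    considered a 'large' effect."""
    gt = sum(1 for x in xs for y in ys if x > y)
    eq = sum(1 for x in xs for y in ys if x == y)
    return (gt + 0.5 * eq) / (len(xs) * len(ys))
```

For example, comparing two samples of fitness values: `a12([2, 3, 4], [1, 2, 3])` yields about 0.78, a large effect in favor of the first sample.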
The other standards are being converted to Markdown. We should match their style: https://github.com/acmsigsoft/EmpiricalStandards/tree/development/Standards