ai-se / ml-assisted-slr Goto Github PK

Automated Systematic Literature Review

Python 61.93% TeX 38.04% Shell 0.03%

ml-assisted-slr's Introduction

Machine Learning Assisted Systematic Literature Review

The ultimate goal of our research is to reduce the effort of SLR to days of work and make it practical to be conducted frequently.

ml-assisted-slr's People

Stargazers

Watchers

Forkers

joe-nano

ml-assisted-slr's Issues

Need a newer baseline method? On which data?

Jun-02-2016

Things done

Target Problem:
- sampling bias --> imbalance problem in active learning (results delta can be more significant)
Existing literature on solving imbalance problem in active learning scenario
- one paper suggests that active learning outperforms SMOTE on imbalanced data. Learning on the border: active learning in imbalanced data classification 2007
  - SVM only cares about support vectors
  - data near decision boundary are less imbalanced
- Use human resources (filtering and searching) Why label when you can search?: alternatives to active learning for applying human resources to build classification models under extreme class imbalance 2010 (Guided Learning)
- Boosted Disagreement with QBC: Reducing class imbalance during active learning for named entity annotation.
What existing literature lack
- all methods focus on how to select training examples for next generation.
- all methods assume that we already have a initially labeled training set. (Except for Hierarchical sampling for active learning. The problem for this method is that it totally abandoned the good nature of active learning.)
- our assumptions:
  - 1. imbalance in initial training set will affect the active learning performance
  - 1. Hierarchical clustering can balance the initial training set
  - 1. For new stages, we need to consider expert knowledge. e.g. keyword search through elasticsearch first to retrieve a more balanced initial training set.
Negative Results (on multi-classification problem)

The entropy maximization methods do not make a single difference from random sampling!!!

To Do

reduce the problem to binary classification, target class is minority.
if still does not work out, consider keywork search.

baseline candidate

Wahono, Romi Satria. "A systematic literature review of software defect prediction: research trends, datasets, methods and frameworks." Journal of Software Engineering 1, no. 1 (2015): 1-16. (chosen)

Hall, Tracy, Sarah Beecham, David Bowes, David Gray, and Steve Counsell. "A Systematic Literature Review on Fault Prediction Performance in Software Engineering."

Malhotra, Ruchika. "A systematic review of machine learning techniques for software fault prediction." Applied Soft Computing 27 (2015): 504-518.

Radjenović, Danijel, Marjan Heričko, Richard Torkar, and Aleš Živkovič. "Software fault prediction metrics: A systematic literature review." Information and Software Technology 55, no. 8 (2013): 1397-1418.

Unterkalmsteiner, Michael, Tony Gorschek, AKM Moinul Islam, Chow Kian Cheng, Rahadian Bayu Permadi, and Robert Feldt. "Evaluation and measurement of software process improvement—a systematic literature review." IEEE Transactions on Software Engineering 38, no. 2 (2012): 398-424.

BEFORE

Need to decide whether to use UPDATE or REUSE.

Notice:

Hall -> Wahono, UPDATE is better,

Hall -> Abdellatif, REUSE is better,

Wahono -> Hall, REUSE is better,

However:

In cases when REUSE is better than UPDATE, UPDATE still beats REUSE in early stages
Therefore a mixed strategy (UPDATE_REUSE): use UPDATE in early stages, when accuracy goes down, switch to REUSE.

NOW:

Results:
Single repeat results suggest that UPDATE_REUSE will have the merits of both UPDATE and REUSE.

More sophisticated results are expected tomorrow. Will update this issue when results come out.

01/11/2017:

First try, results are not that good. Adjusting criteria for switching from UPDATE to REUSE.

All types of reviews

Systematic Approaches to a Successful Literature Review

Retrieval per Review = the derivative

Test to see which is the cost efficient point to stop reading: at X% (X=80,85,90,95,99...) retrieval rate?

80% or 85% retrieval rate seems most cost efficient.
I would choose 90% since we want more completeness and the sacrifice on efficiency is not much.
Any suggestions?

Want bar chart at 80%, 85%, 90%, 95%, 99% for the above?

Felizardo

Semi-automatic selection of primary studies in systematic literature reviews: is it reasonable? (2015): Score Citation Automatic Selection (SCAS), to automate part of the primary study selection activity. (58.2 % cost reduction, 12.98 % error rate.)
VTM 2013, VTM 2011. (Clustering, citation maps), less than half cost reduction (70.14 min -> 54.5 min).

UPDATE

Using Forward Snowballing to update Systematic Reviews in Software Engineering Snowballing vs. Full search:

	snowballing	full search
recall	92.86%	100%
precision	7.55%	1.42%

A visual analysis approach to update systematic reviews 2014
Formalizing a systematic review updating process first addressed the difficulty of updating an SLR.

Hall Result

Hall, Tracy, Sarah Beecham, David Bowes, David Gray, and Steve Counsell. "A Systematic Literature Review on Fault Prediction Performance in Software Engineering."

	Hall Paper	IEEExplore
Initial Size	2073	8912
Final Size	136	106

In the scenario of

fixed pool
no external knowledge from experts
labels are all correct

Baseline from Biomedical #17

patient_aggressive_undersampling

Baseline from Litigation #16

hasty_continuous_active

Different setups:

patient or hasty: hasty is start to learn as soon as we get one relevant example; patient is wait for more (5 in the graph) relevant examples and then start learning. (hasty is suggested by litigation, patient is suggested by biomedical)
continuous or not: continuous is ignore uncertainty sampling, start with certainty sampling (pick docs with highest prediction score) and goes forever. Methods without continuous use uncertainty sampling until stable (yellow dot in the first graph, big enough margin in SVM model), then stop learning. continuous is suggested by litigation.
aggressive_undersampling or not: aggressive_undersampling is a data balancing method, which throw away irrelevant training data that are close to the SVM decision plane. I also test SMOTE and in this scenario, aggressive_undersampling outperforms SMOTE. aggressive_undersampling is suggested by biomedical.

Danijel

Software fault prediction metrics: A systematic literature review

Biomedical

medline database

Byron C. Wallace

Active learning for biomedical citation screening 2010

co-feature (co-testing) for hasty problem (rule based simple model vs SVM, find disagreements)

Semi-automated screening of biomedical citations for systematic reviews 2010

patient AL (random sampling until enough, then start training and uncertainty sampling. Aggressive under-sampling for final model [found incredibly useful in simple experiment])

Who should label what? instance allocation in multiple expert active learning. 2011

Problem: given a panel of experts, a set of unlabeled examples and a budget, who should label which examples?
Solution: inexpensive experts flag difficult examples encountered during annotation for review by more experienced experts.

Active literature discovery for scoping evidence reviews: How many needles are there? 2013

Problem: how to estimate the prevalence of relative docs.
Solution: active learning as to select informative docs to be reviewed (bias introduced), then inverse-weighting to correct bias. (weighting is the prediction probability)

Deploying an Interactive Machine Learning System in an
Evidence-Based Practice Center: abstrackr 2012

TOOL

Modernizing the systematic review process to inform comparative effectiveness: tools and methods. 2013

Overview

Combining crowd and expert labels using decision theoretic active learning 2015

crowd sourcing + expert labeling + active learning

ask crowd to label an initial set
ask crowd to label an unlabeled doc OR ask expert to label a crowd labeled doc

Non - Wallace

Automatic text classification to support systematic reviews in medicine 2014

Compare some supervised text miners for classification of Medical Systematic Review data.
Classifiers: NB, KNN, SVM, Rocchio
Preprocessing: steming, stopword removal, TF, tfidf
Metrics: Macro-F score
Like our BIGDSE16 paper.

Effort for using MAR

Reviewed 350 docs, 160 of them are "relevant". Cost me 1.5 hours.

Task

Hall.csv without true labels, trying to find studies about "software defect prediction".

Operator

Just me

Result

The output csv file is here

Discussions

Retrieved more than Hall's relevant.
Among 350 coded studies, 250 of them are consistent with Hall.
Among Hall's 106 relevant, 72 of them are corrected coded as relevant in my attempt.
Human errors/ biases do exist.

CSC 510

Propose:

Whole plan

I suggest we form groups of 6 students.
each group, choose their own topics in SE.
each member in one group review only 200 studies (should be able to finish in about 1 hour)
3 parallel reviews per group: A reviews 200, then D starts on top of A, do another 200. Same thing happens to B+E, C+F.
we will have data for both a) disagreements between reviewers, b) concept drift when different reviewers work on top of others.
students will then read studies they selected and learn from those.

Reason:

Students can select their own topics.
As a student, I would rather work 2-3 hours on some topic I personally interested in than 1 hour on something I care less about.
Data generated will be extremely useful for me.

Search strategy

Search Strategy Formulation: A Framework For Learning

Aug-04-2016

Baseline Results and possible improvements

In the scenario of

fixed pool
no external knowledge from experts
labels are all correct

Baseline from Biomedical #17

patient_aggressive_undersampling

Baseline from Litigation #16

hasty_continuous_active

Conclusions drawn:

aggressive undersampling is effective
need patients if aggressive undersample

Two current winners:

hasty aggressive undersampling, if just need around 80% recall, 5% docs reviewed. (Has a stable classifier in the end, but stops learning)
patient continuous active, achieve 100% recall at the cost of reviewing 20% docs. (May be more suitable to handle concept drift as it keeps learning)

Ten repeat result

Future work

Get more data sets to do experiment on. It would be best if one from biomedical, one from litigation.

springer

http://link.springer.com/search?facet-language=%22En%22&query=%28software+OR+applicati*+OR+systems+%29+AND+%28fault*+OR+defect*+OR+quality+OR+error-prone%29+AND+%28predict*+OR+prone*+OR+probability+OR+assess*+OR+detect*+OR+estimat*+OR+classificat*%29&facet-discipline=%22Computer%22&facet-content-type=%22Article%22&facet-discipline=%22Engineering%22&date-facet-mode=between&facet-start-year=2000&previous-start-year=1963&facet-end-year=2013&previous-end-year=2016

Presumptive non-relevant examples

Newest result in e-discovery:
Scalability of Continuous Active Learning for Reliable High-Recall Text Classification mentioned one technique to tackle the problem.

Presumptive non-relevant examples.
Autonomy and reliability of continuous active learning for technology-assisted review

May be useful for REUSE.

Testing.

What

Each round, besides all the labeled examples, randomly sample from the unlabeled examples and treat them as negative training examples.

Then train the model.

Why

E-discovery

why we need this technique:
- start with 1 positive example (either a real one or a synthetic one)
- lack of negative examples at early stage
why it works:
- prevalence are extremely low -> very low chance to have a positive example treated as negative.
- and it will change each round.

SLR

why we need this technique:
- among continuous review, newly discovered negative examples are all in one corner (not representative)
- lack of representative negative examples at early stage when REUSEing
why it works:
- prevalence are extremely low -> very low chance to have a positive example treated as negative.
- and it will change each round.
- even there are positive examples treated as negative, aggressive undersampling will discard those.

Results

FASTREAD, use this tech or not:

Hall:
Wahono:
Abdellatif:

At least as good as not using it. (worst case result depends on pseudo random, not reliable)

Transfer learning result with this tech:

Hall as previous SLR,

on Wahono:
on Abdellatif:

Wahono as previous SLR,

on Hall:
on Abdellatif:

Abdellatif as previous SLR,

on Hall:
on Wahono:

Conclusions

adding this technique will not deteriorate the performance of FASTREAD.
making REUSE better
UPDATE-REUSE now is always among the best solutions (only on Abdellatif, if we all it to retrieve one or two less than target)

New Figure (Oct 6st)

25 repeats, took about 1 day on ncsu hpcc.

Hall Result

Hall, Tracy, Sarah Beecham, David Bowes, David Gray, and Steve Counsell. "A Systematic Literature Review on Fault Prediction Performance in Software Engineering."

	Hall Paper	IEEExplore
Initial Size	2073	8912
Final Size	136	106

Wahono Result

Wahono, Romi Satria. "A systematic literature review of software defect prediction: research trends, datasets, methods and frameworks." Journal of Software Engineering 1, no. 1 (2015): 1-16.

	Wahono Paper	IEEExplore
Initial Size	2117	7002
Final Size	71	62

Question: is the figure clear enough? or make it double with medians and iqrs on different figs?

Comparisons for each code:

P_U_S_A vs. P_U_S_N

P_U_S_A wins.
For last code, A is better than N (aggressive undersampling is useful)

Compare the third code
P_U_S_A vs. P_U_C_A

No clear winner. C is better than S due to its ability to continuous update the model.

Compare the second code
P_U_C_A vs. P_C_C_A

No clear winner, let's keep both.

Compare the first code
P_U_C_A vs. P_C_C_A vs.H_U_C_A vs. H_C_C_A

H is better than P.
H_U_C_A and H_C_C_A are similar.

Compare H_U_C_A and H_C_C_A in terms of stability:

H_C_C_A outperforms H_U_C_A in terms of stability.

Comparing to state-of-the-art:
H_C_C_A vs. H_C_C_N vs. P_U_S_A

H_C_C_A outperforms either of the state-of-the-art methods.

Jul-06-2016

Baseline paper chosen

Wahono, Romi Satria. "A systematic literature review of software defect prediction: research trends, datasets, methods and frameworks." Journal of Software Engineering 1, no. 1 (2015): 1-16.

Reason: transparency

Construct RQs
Define search string, to retrieve defect prediction papers.
- The search string is: "(software OR applicati* OR systems ) AND (fault* OR defect* OR quality OR error-prone) AND (predict* OR prone* OR probability OR assess* OR detect* OR estimat* OR classificat*)"
Search in 5 datasets, get 2117 results:
- ACM Digital Library (dl.acm.org)
- IEEE eXplore (ieeexplore.ieee.org)
- ScienceDirect (sciencedirect.com)
- Springer (springerlink.com)
- Scopus (scopus.com)
Review title and abstract of the 2117 papers, 213 left.
Review full text of the 213 papers, 71 left.

Experiment Design

Include the 2117 papers in elasticsearch database
Tag the 71 as relevant
Compare two ways to detect the 71 papers within 2117
- traditional linear review
- random seed + active learning (random review N papers, then start active learning, uncertainty sampling first, and certainty sampling at some point of time)

Expected result:

traditional linear review will need to review 21170.85 papers to find 710.85 relevant ones.
learning based review only need to review 21170.15 papers to find 710.85 relevant ones.

Current Progress

Include the 2117 papers in elasticsearch database

Injected citemap.csv (provided by George) into elastic search.

Problems:

576 results returned by query the search string (compared to 2117)
within the 576 results, none is in the list of 71 😭
only has abstracts, no full text included

To Do

try to inject the original 300 GB data we have into elasticsearch
If 1 does not work out. Use api to query the 5 databases for constructing the 2117 papers.

Similarity between data and target

Supported by https://github.com/ai-se/ML-assisted-SLR/blob/master/no_ES/src/runner.py

Data Similarity

LDA on 30 topics (number of topics does not matter much)
Topic weighting for the two data sets:

L1 similarity, as default of LDA:

Hall2007- vs Hall2007+: 0.95
Hall vs Wahono: 0.79

L2 similarity, make more sense:

Hall2007- vs Hall2007+: 0.99
Hall vs Wahono: 0.86

Target Similarity

LDA on 30 topics
Topic weighting for the two relevant set:

L1 similarity, as default of LDA:

Hall2007- vs Hall2007+: 0.95
Hall vs Wahono: 0.93

L2 similarity, make more sense:

Hall2007- vs Hall2007+: 1.00
Hall vs Wahono: 0.96

Conclusion:

Target of Hall and Wahono are very similar, which explains why UPDATE works.
Data similarity of Hall and Wahono are not that high, but it does not damage the UPDATE performance much.

Problem:

Target similarity is measured by comparing relevant docs, however, before review, this information is not available.

scopus

https://www.scopus.com/results/results.uri?sort=plf-f&src=s&st1=%28software+OR+applicati*+OR+systems+%29+AND+%28fault*+OR+defect*+OR+quality+OR+error-prone%29+AND+%28predict*+OR+prone*+OR+probability+OR+assess*+OR+detect*+OR+estimat*+OR+classificat*%29&nlo=&nlr=&nls=&sid=89D9176808ACE0FC86A3B9E92691D658.I0QkgbIjGqqLQ4Nw7dqZ4A%3a70&sot=b&sdt=cl&cluster=scosubjabbr%2c%22COMP%22%2ct&sl=350&s=TITLE-ABS-KEY%28%28software+OR+applicati*+OR+systems+%29+AND+%28fault*+OR+defect*+OR+quality+OR+error-prone%29+AND+%28predict*+OR+prone*+OR+probability+OR+assess*+OR+detect*+OR+estimat*+OR+classificat*%29%29+AND+DOCTYPE%28ar+OR+cp%29+AND+SUBJAREA%28MULT+OR+CENG+OR+CHEM+OR+COMP+OR+EART+OR+ENER+OR+ENGI+OR+ENVI+OR+MATE+OR+MATH+OR+PHYS%29+AND+PUBYEAR+%3E+1999+AND+PUBYEAR+%3C+2014&origin=resultslist&zone=leftSideBar&editSaveSearch=&txGid=0

Performance Metrics

F score with training size
What is used in the baseline method?

g-means (sqrt(prec*rec)), AUC and PRBEP

Business Motivation

add to proposal

ACM

http://dl.acm.org/results.cfm?query=%28software%20OR%20applicati%2A%20OR%20systems%20%29%20AND%20%28fault%2A%20OR%20defect%2A%20OR%20quality%20OR%20error-prone%29%20AND%20%28predict%2A%20OR%20prone%2A%20OR%20probability%20OR%20assess%2A%20OR%20detect%2A%20OR%20estimat%2A%20OR%20classificat%2A%29&filtered=resources%2Eft%2EresourceFormat=PDF&within=owners%2Eowner%3DHOSTED&dte=2000&bfr=2013&srt=_score

"query": { (software OR applicati* OR systems ) AND (fault* OR defect* OR quality OR error-prone) AND (predict* OR prone* OR probability OR assess* OR detect* OR estimat* OR classificat*) }

"filter": {"publicationYear":{ "gte":2000, "lte":2013 }},
{owners.owner=HOSTED}

Litigation

relativity analytics (TOOL)
tar for smart people, catalyst (BOOK)
CAL: Evaluation of machine-learning protocols for technology-assisted review in electronic discovery 2014

Autonomy and Reliability of Continuous Active Learning for Technology-Assisted Review (1 relevant seed (hasty)) 2015

Result Summary

Hall Result

Hall, Tracy, Sarah Beecham, David Bowes, David Gray, and Steve Counsell. "A Systematic Literature Review on Fault Prediction Performance in Software Engineering."

	Hall Paper	IEEExplore
Initial Size	2073	8912
Final Size	136	106

Wahono Result

Wahono, Romi Satria. "A systematic literature review of software defect prediction: research trends, datasets, methods and frameworks." Journal of Software Engineering 1, no. 1 (2015): 1-16.

	Wahono Paper	IEEExplore
Initial Size	2117	7002
Final Size	71	62

Method Code

Stage 1: Random sampling

P: patient, random sample and review until N=5 relevant studies retrieved.
H: hasty, random sample and review until 1 relevant studies retrieved.

Stage 2: Build classifier

U: uncertainty sampling, sample from nearest points to SVM decision hyperplane. Use labeled data for training until SVM is stable (margin > X=2.0).
C: certainty sampling, sample from SVM most confident relevant predictions. Use labeled data for training. (What it really means is that this method does not have a Stage 2, goes directly to Stage 3)

Stage 3: Prediction

S: simple, certainty sampling in this stage, but stop training.
C: continuous, same as C in Stage 2, certainty sampling, never stop training.

Data Balancing

A: aggressive undersampling, undersample majority training data, only keep data points furthest to SVM decision hyperplane, equal number to minority training data size. Any stage of certainty sampling will apply aggressive undersampling.
N: no data balancing, stages with certainty sampling will not apply any data balancing method.

Baselines

Baseline from Medicine #17

P_U_S_A (patient, uncertainty sampling, simple, aggressive undersampling)

Baseline from Litigation #16

H_C_C_N (hasty, certainty sampling, continuous, no data balancing)

Winner so far

H_U_C_A (hasty, uncertainty sampling, continuous, aggressive undersampling)
Hasty and continuous suggested by litigation,
Uncertainty sampling and aggressive undersampling suggested by Medicine.

Google Scholar

https://scholar.google.com/scholar?q=%28software+OR+applicati*+OR+systems+%29+AND+%28fault*+OR+defect*+OR+quality+OR+error-prone%29+AND+%28predict*+OR+prone*+OR+probability+OR+assess*+OR+detect*+OR+estimat*+OR+classificat*%29&hl=en&as_sdt=1%2C36&as_vis=1&as_ylo=2000&as_yhi=2013

IEEE

http://ieeexplore.ieee.org/search/searchresult.jsp?queryText=((software%20OR%20applicati%20OR%20systems%20)%20AND%20(fault%20OR%20defect%20OR%20quality%20OR%20error-prone)%20AND%20(predict%20OR%20prone%20OR%20probability%20OR%20assess%20OR%20detect*%20OR%20estimat%20OR%20classificat))&refinements=4291944822&refinements=4291944246&ranges=2000_2013_Year&matchBoolean=true&searchField=Search_All

Meta Data:
http://ieeexplore.ieee.org/search/searchresult.jsp?queryText=(((fault%20OR%20defect%20OR%20quality%20OR%20error-prone)%20AND%20(predict%20OR%20prone%20OR%20probability%20OR%20assess%20OR%20detect%20OR%20estimat%20OR%20classificat)%20AND%20(software)))&refinements=4291944822&refinements=4291944246&refinements=4291944823&refinements=undefined&ranges=2000_2013_Year&matchBoolean=true&rowsPerPage=100&searchField=Search_All

Does REUSE has any value?

When?

Have reviewed Hall. Retrieved 90%
Want to review Wahono. Want to retrieve 90%. Both set are targeting at defect prediction, but different RQs. (Different review protocols)
- Review protocol for Hall:
- Review protocol for Wahono:

What?

REUSE: Import only learned model from Hall, featurize just on Wahono. Use imported Hall model to replace random sampling, then start to learn its own model on Wahono.
UPDATE (Partial): Import only labeled data from Hall, combine with Wahono, and re-featurize. It is partial UPDATE since it can save memory without damaging performance as discussed in #31.

Why?

Target is no longer the same, there are differences between review protocols. Want to build model purely on new data set. Applying UPDATE may be a bad idea.

Result

Conclusion

UPDATE is actually performing better than REUSE.
Is it because of the data sets? Does there exist data sets where UPDATE performs badly but REUSE stays similarly.

Some crowdsourcing supporting SLR

EMBASE

As by the end of April, a total of 232 volunteers have signed up and screened more than 38,000 records identifying 1147 RCTs or quasi-RCTS. Early data on the accuracy of the crowd and the robustness of the algorithm is extremely positive with 99.8% sensitivity and 99.8% specificity.

Wallace in evidence-based medicine

Crowd labels are collected via Amazon Mechanical Turk.

FN costs 10 times more than FP.

A typical (Mechanical Turk) crowd worker might earn ≈ $1.5 / hour, while a trained physician might earn ≈ $150 /hour.

Partial UPDATE?

When?

Have reviewed Hall2007- (<= 2007). Retrieved 90%
Want to review Hall2007+ (> 2007). Want to retrieve 90%
Don't touch Hall2007- anymore.

What?

Whole UPDATE: Import all data from Hall2007-, combine with Hall2007+, and re-featurize. Only retrieve studies from Hall2007+.
Partial UPDATE: Import only labeled data from Hall2007-, combine with Hall2007+, and re-featurize. Only retrieve studies from Hall2007+.

Why?

No need to store all Hall2007- data, only the labeled ones. Huge save on memory.

Result

Conclusion

Partial UPDATE can save memory without damaging performance.

Possible crowdsourcing strategies for primary study selection

1 from Wallace

Combining crowd and expert labels using decision theoretic active learning 2015

Strategy:

Two choices each step:

Pick an unlabeled item and ask the crowd to label it (i.e.,
collect k crowds labels for a cost of k units).
Pick a crowd-labeled item and ask the expert to label it for
a cost of E units.

Decision theory for which action to choose.

Performance is measured by an expected loss vs cost curve. Not very convincing to me.

2

If crowd worker is way cheaper than expert. (Say 1/1000)

Strategy:

ask the crowd to label everything, multiple times (e.g. each item will be labeled by N crowd workers)
rank items by the crowd-labeled score, ask experts to review and label items in the ranked order.
when enough "relevant" retrieved by experts, start training.
re-rank items by a combination of crowd-labeled score and model prediction, ask experts to review and label items in the ranked order. retrain model, repeat 4 until finished.

In 4, can use "true" label (expert label) to adjust weight of crowd labels, e.g. this crowd worker is unreliable...

3

If crowd worker is not that cheap

Strategy:

random sample X items, ask crowd to label
rank the X items by the crowd-labeled score, ask experts to review and label items in the ranked order.
when enough "relevant" retrieved by experts, start training (using only expert labels). Otherwise go to 1.
re-rank items by model prediction, ask crowd workers to label top Y, then ask experts to label top Z crowd scored items among the Y. retrain model with expert labels only. repeat 4 until finished.

Questions

how much cheaper does a crowd worker cost comparing to an expert.
what to compare? the cost needed to retrieve 90% relevant?
how to construct data for experiments.
how to design tasks for crowd workers.

New Figure (Sep 1st)

Hall Result

Hall, Tracy, Sarah Beecham, David Bowes, David Gray, and Steve Counsell. "A Systematic Literature Review on Fault Prediction Performance in Software Engineering."

	Hall Paper	IEEExplore
Initial Size	2073	8912
Final Size	136	106

Wahono Result

Wahono, Romi Satria. "A systematic literature review of software defect prediction: research trends, datasets, methods and frameworks." Journal of Software Engineering 1, no. 1 (2015): 1-16.

	Wahono Paper	IEEExplore
Initial Size	2117	7002
Final Size	71	62

Comparisons for each code:

Start with patient active learning (P_U_S_A)
First compare the last code
P_U_S_A vs. P_U_S_N

P_U_S_A wins.
For last code, A is better than N (aggressive undersampling is useful)

Compare the third code
P_U_S_A vs. P_U_C_A

No clear winner. I would prefer C over S, since continuous learning can handle concept drift better.
But let's keep both

Compare the second code
P_U_S_A vs. P_U_C_A vs. P_C_C_A

No clear winner. I prefer C over U since no need to worry about stop rule for U (margin threshold of SVM)
But let's keep all three

Compare the first code
P_U_S_A vs. P_U_C_A vs. P_C_C_A vs. H_U_S_A vs. H_U_C_A vs. H_C_C_A

H is better than P.
But H_C_C_A is a clear loser.
Start aggressive undersampling with only one "relevant" example is a bad idea.
The final winner would be H_U_S_A and H_U_C_A.
I would prefer H_U_C_A for continuous learning to handle concept drift and updating of SLR.

Email SLR authors

How much effort does it cost for a primary study selection? (N reviewers, T time for each)

Better if details of each step can be provided:
- Exclusion by title and abstract: start from once the initial candidate study list has been collected. end with exclusion by title and abstract.
- Exclusion by full text: start from candidate study list included by title and abstract. end with exclusion by full text.
- Missing studies inclusion: start from final list included by full text. end with inclusion of studies by author, reference...
Is there any effort taken (a hidden step) between applying the search string to databases and the initial candidate study list has been collected?

Why ask: I retrieve much more candidate studies with the same search string provided in the SLR paper.

If there is any effort, what is that? How much does it cost? What is the reason behind?

One reason I guess is to reduce the size of initial candidate study list, thus reduce the review cost of primary study selection. If this is true, learning based primary study selection can remove this hidden step since it can search in much larger candidate study list and retrieve above 90% "relevant" with less effort. This may even improve the overall completeness and save the effort of this hidden step.

Tune similarity

Why?

Target similarity of three data sets: Hall, Wahono, Abdellatif.

LDA with default parameter: alpha=0.1, eta=0.1. Topic number = 30.
Repeat 10 times with different sequence order
Hall vs Wahono is stable, with iqr = 0.002, since the two are very similar.
Abdellatif vs the two are not, with iqr = 0.13, 0.13. Which means target similarity between Abdellatif and Hall can be 0.4 or 0.6 or ...

Need to stabilize the target similarity.

How?

Tune LDA paremeters (Decision = [alpha, eta]). Don't want to change topic number.

Objective = [iqrs].

Differential evolution, 10 candidates per generation, 10 generations max.

Running on NCSU HPC with single node, 10 threads.

Result

Best decisions: [alpha = 0.3636991597795636, eta = 0.9722983748261428]
Best objectives (iqrs): [0.0064311303948402232, 0.039641889335073899, 0.048358360331471784]
iqrs before tuning: [0.002, 0.129, 0.129]
medians of similarities: [0.98309488776481135, 0.45742986887869136, 0.4108420090949999]

Conclusion

Tuning LDA is essential to get a stabilized similarity score.

Snow balling

Implement

Improvements in the StArt tool to better support the systematic review process

Theory

Using Forward Snowballing to update Systematic Reviews in Software Engineering

Jul-13-2016

Numbers

Retrieved from IEEE by search string: 6963
Target: 70
Target in IEEE: 37 (23)
Target in IEEE+acm: 25
Target in other: 8

Injected into Elasticsearch: 7002
Pos in Elasticsearch: 62

Simple result:

An extreme one

To DO:

make the simple experiment more complete:
- include acm results
- inject full text (may not be useful)
- repeat N times? (how to report)
do experiments on other initial set, compare with other baseline approaches:
- several candidates are listed in #14
- shall we do a traditional SLR on SLR ourselves?
some questions want to know
- RQ1: Does primary selection cost a lot in conventional SLR? Can be answered by SLR on SLR, or interview?
- RQ3: will Learning based SLR be not thorough enough compared to conventional SLR? Can be answered by SLR on SLR.
read
- SLuRp – A Tool to Help Large Complex Systematic Literature Reviews Deliver Valid and Rigorous Results
- An empirical assessment of a systematic search process for systematic reviews

Practical study selection now

Evaluating strategies for study selection in
systematic literature studies

Generate Data (In progress)

Generate synthetic data

from LDA topics
to test performance for different target overlap score (L2, cosine distance)
find threshold for using UPDATE, using REUSE, using START

Extract data from SLR

Now we have data in the area of: dis >= UPDATE
Data lies in the area of: REUSE<= dis < UPDATE (defect analysis perhaps)
- Risk evaluation approaches in failure mode and effects analysis: A literature review
(Maybe) in the area of: dis < REUSE
To support the conclusion from synthetic data

Outline

Abstract

Systematic literature review (SLR) is the primary method for aggregating and synthesizing evidence in evidence-based software engineering. Such SLR studies need to be conducted frequently since a) researchers should update their SLR result once one or two years to consider latest publications; b) most researchers are constantly studying different research questions in the same or similar topic area. However, SLR studies cannot be conducted frequently due to its heavy cost. In our previous study, with the help of FASTREAD, we succeed to save 90% of the review cost in sacrifice of 10% recall in primary study selection of systematic literature review (SLR). In this paper, we allow researchers to import knowledge from previously completed SLR studies to boost FASTREAD. With knowledge transfering, review effort can be further reduced to 50% of the review effort of FASTREAD with extremely low variance when updating an SLR study while variance can be greatly reduced in the scenario of conducting an SLR on similar or the same topic.

Assumptions

(same as FASTREAD paper)

one single reviewer who never makes mistakes
no expert knowledge for building up initial seed training set
binary classification, studies will be labeled as "relevant" or "irrelevant" by the reviewer.

UPDATE scenario

(Except for the general assumptions)

previously completed an SLR which
- is of the same topic
- has same or similar review protocols
import the labeled examples from previous SLR to boost current primary study selection

REUSE scenario

(Except for the general assumptions)

previously completed an SLR which
- is of similar topic
import the trained model from previous SLR to boost current primary study selection

Methods

Experiments and results (in progress)

[Hall2007-] -> [Hall2007+] -> [Wahono]

FASTREAD on [Hall2007-]

FASTREAD vs UPDATE on [Hall2007+]

FASTREAD vs UPDATE vs REUSE on [Wahono]

Conclusion from current result

UPDATE can save about 50% review effort comparing to a fresh start with FASTREAD.
REUSE will not save much review effort, but can greatly reduce variance of FASTREAD.

To do

Literature review for background

SLR update
SLR on similar topics (reuse scenario)

Generate synthetic data

from LDA topics (or just term frequency)
to test performance for different target overlap score (L2, cosine distance)
find threshold for using UPDATE, using REUSE, using START

Extract data from SLR

Now we have data in the area of: similarity>= UPDATE
Data lies in the area of: REUSE<= similarity< UPDATE (defect analysis perhaps)
- Risk evaluation approaches in failure mode and effects analysis: A literature review
(Maybe) in the area of: similarity< REUSE
To support the conclusion from synthetic data

PoP

convert bib, ris to csv files.
Publish or Perish

Overall UPDATE and REUSE

Experiment:

START with Hall2007-.csv

UPDATE vs. START on HALL2007+.csv

Based on the result of UPDATE on Hall2007+.csv, UPDATE vs. REUSE vs. START on Wahono.csv.

Discussion

UPDATE is better than REUSE in the case of transfering knowledge from Hall to Wahono. Probably because the targets are very consistent. See #33

Need to find a data set which is also in SE SLR, but has different target.

Variance

Problem

From the following two figures, looks like Hall data set has much larger variance than Wahono.

Hall:

Wahono:

Reason

Hall has some bad luck when random sampling. The prevalence of "relevant" is larger than 0.01, the possibility of not getting a single "relevant" when reviewing the first 200 studies is (1-0.99)^(200)=13%. For our 10 repeat experiments, there should be 1 to 2 out of 10 repeats that stay at 0 "relevant" at 200 studies reviewed, which will not cause a big iqr. However, in the experiments shown above, we got 3, and the iqr at 200 is therefore extremely large. On the other hand, in Wahono, with a little bit luck, we always got more than 1 "relevant" studies retrieved at 200 reviewed, which leads to a low iqr.

The Hall should look more like this:

And Wahono should look more like this:

Conclusion

Probably 10 repeats is not enough.
Should we increase it to 25?

New data set: Abdellatif

Abdellatif.csv

Software Analytics to Software Practice: A Systematic Literature Review

Target: software analysis, 19/1726

Similarity

Hall.csv vs. Abdellatif.csv

data similarity: 0.69

target similarity: 0.57

UPDATE or REUSE

START on Hall.csv, then

START on Abdellatif.csv
UPDATE on Abdellatif.csv
REUSE on Abdellatif.csv

Just median

Conclusion

REUSE > START > UPDATE
UPDATE has the potential to outperform REUSE and START (How? time decay model.)

SE community

About SLR

Evidence-based software engineering

Evidence-based software engineering for practitioners

First brought to SE community

Guidelines for performing systematic literature reviews in software engineering

The famous GOLDEN guideline

A systematic review of systematic review process research in software engineering

An recent review

Overview

Outcomes of a Community Workshop to Identify and Rank Barriers to the Systematic Literature Review Process

Identifying barriers to the systematic literature review process

Same goal. Primary study selection being identified as the top three most difficult, time-consuming, tool support needed phase in SLR.

Tools to support systematic reviews in software engineering: A cross-domain survey using semi-structured interviews

Tools to Support Systematic Reviews in Software Engineering: A Feature Analysis
Best study for tools

Tools to support systematic literature reviews in software engineering: A mapping study

tools

SLuRp: a tool to help large complex systematic literature reviews deliver valid and rigorous results

SESRA: A Web-based Automated Tool to Support the Systematic Literature Review Process
best and latest tool so far

SLR-Tool: A Tool for Performing Systematic Literature Reviews.

Using GQM and TAM to evaluate StArt-a tool that supports Systematic Review

VTM

A Visual Analysis Approach to Update Systematic Reviews

A visual analysis approach to validate the selection review of primary studies in systematic reviews

An approach based on visual text mining to support categorization and classification in the systematic mapping

A visual text mining approach for systematic reviews

Jun-08-2016

Negative result

Hierarchical Clustering does NOT help to balance initial training data.

Possible Reasons:

1. In the data we use, regarding our distance measurement of feature setting, examples within same class do not cluster together. (Means data is wrong or feature is wrong or distance metric is wrong)(Don't quite believe this)
1. Target class is very rare, sampling without any pre-knowledge or say intention is too hard to find these rare class data. (Means totally blind sampling, even with Hierarchical Clustering is wrong)

Fix: (fix 2 first, if results are good, then 1 is not the problem)

1. introduce domain knowledge (ask experts to do this)
1. Experts can use search and filter to better perform this task and achieve a better balanced training data.
1. Hierarchical Clustering can still be applied, but just to assist the filtering and searching process. Also can let expert decide which axis to split on (is this word useful to distinguish target class?).
1. Search: exploration, Uncertainty Sampling: exploitation. Let expert to decide flexibly how to balance these two. (expert can decide to start a search at any time during active learning)
1. Hierarchical Clustering can also help for the visualization of result

Strategy

Basically try things listed in Fix on LN DiscoveryIQ project.
Then map useful techs into Systematic Literature Review in SE.

Systematic Literature Review

Our Task:
Similar task as to LN. How to assist reviews fast retrieve relevant papers by search and active learning.

Can have hierarchical clustering first to guide 1.
Start with a search (filtering)
Review ranked results, label top N as relevant or not. at anytime, go back to a search is possible.
When enough labeled example or enough new labeled example, start a training.
Show user re-ranked results (along with important features and examples, give user handle to change them)
Go back to 2

Checked several 2016 paper conducting Systematic Literature Review, some use CiteSeerX as part of the source.
Souza, Draylson M., Katia R. Felizardo, and Ellen F. Barbosa. "A Systematic Literature Review of Assessment Tools for Programming Assignments." In 2016 IEEE 29th International Conference on Software Engineering Education and Training (CSEET), pp. 147-156. IEEE, 2016.

No learning involved, just searching and filtering.

Marshall, Christopher, Pearl Brereton, and Barbara Kitchenham. "Tools to support systematic reviews in software engineering: a cross-domain survey using semi-structured interviews." In Proceedings of the 19th International Conference on Evaluation and Assessment in Software Engineering, p. 26. ACM, 2015.

Evaluate tools used in Systematic Literature Review.

Literature on Systematic Literature Review itself (instead of conducting one)
Zhou, You, He Zhang, Xin Huang, Song Yang, Muhammad Ali Babar, and Hao Tang. "Quality assessment of systematic reviews in software engineering: a tertiary study." In Proceedings of the 19th International Conference on Evaluation and Assessment in Software Engineering, p. 14. ACM, 2015.

Focuses on how to manage tasks distributed onto several reviewer, how to setup standard rubrics, how to do quality assessment...

Details

CLEF eHealth

Task 2: Technologically Assisted Reviews in Empirical Medicine

Suggested by Manny.
Very related to MAR.
Provide medical data sets.
Possible to apply FASTREAD on.

Start rule and stop rule

Stop rule

elbow test
if highest prediction score < 0.5 (promising)
life point, each iter if no relevant detected, life point -1

Start rule (a bigger problem when reusing model)

random sampling until ONE relevant study retrieved (Currently in use, can be unstable in extreme cases when seed set cannot represent the shape of the data)
Instead of random sampling, try stratified sampling from clusters. Add start rule of: at least one labeled example in each cluster.

Research Questions

Any literature stressing the problem of initial sample of active learning?
- Still have not found any. There are literatures on the imbalance problem, but all the solutions are focusing on how to better sample from the pool when we already have a working classifier.
Does imbalanced initial training set actually affects the active learning result?
(Comparing to random sampling) Can hierarchical clustering helps to sample more from the minority class to build a more balanced initial training set?

Reproducibility

Reproducible research–what, why and how

SLR update

Motivation:

As in RQ6, the need of continuous updating SLR is identified in:

Carver, Jeffrey C., Edgar Hassler, Elis Hernandes, and Nicholas A. Kraft. "Identifying barriers to the systematic literature review process." In 2013 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, pp. 203-212. IEEE, 2013.

Another possible need is to reuse SLR, also identified in the above paper. The goal is to speed up the random selection process for one SLR by applying a model from another similar SLR, e. g., apply the model from Hall to Wahono to speed up the initial seed collection.

Two goals, two solutions

GOAL1: update SLR: review protocal stays the same, data (top N important terms from feature extraction) can be different.
- SOLUTION1: continuously update the previous model and apply it to new data.
GOAL2: reuse SLR: apply model to a similar SLR. Review protocal stays similar but different, data as well.
- SOLUTION2: apply the old model only to replace random selection, once got a initial training set, start a new model, purely based on the new data.

Transfer knowledge in SLR

Abstract

Systematic literature review (SLR) is the primary method for aggregating and synthesizing evidence in evidence-based software engineering. Such SLR studies need to be conducted frequently since a) researchers should update their SLR result once one or two years to consider latest publications; b) most researchers are constantly studying different research questions in the same or similar topic area. However, SLR studies cannot be conducted frequently due to its heavy cost. In our previous study, with the help of FASTREAD, we succeed to save 90% of the review cost in sacrifice of 10% recall in primary study selection of systematic literature review (SLR). In this paper, we allow researchers to import knowledge from previously completed SLR studies to boost FASTREAD. With the appropriate knowledge transfering technique, review cost and variances can be further reduced when updating an SLR or initiating an new SLR on similar or related topics.

Why

In our previous study, FASTREAD has effectively reduced the cost of primary study selection in SLR.

However, in FASTREAD, random sampling costs a large amount of review effort and introduces most of the variances as shown above. To further reduce the cost of primary study selection, random sampling step needs to be replaced.

How

External knowledge needs to be introduced in order to replace random sampling. There are certain scenarios that reviewers are guaranteed to have some knowledge on their SLRs. Such scenarios are when a reviewer has done an SLR using FASTREAD (or has access to all the data of other reviewers conducting an SLR with FASTREAD) and now

he wants to update the SLR result with newly published primary studies;
he wants to initiate a new SLR on topics similar to or related to the previous one.

We call these two scenarios update SLR and transfer SLR respectively. In both of these scenarios, the knowledge of previously conducted SLR can be imported as external knowledge to boost the primary study selection of the new SLR.

The following of this paper will discuss the use of previous knowledge in such scenarios.

Background

Update SLR

Some literature review, existing SLR update examples, full update vs. snowballing...

Assumptions:

one single reviewer who never makes mistakes
binary classification, studies will be labeled as "relevant" or "irrelevant" by the reviewer
previously completed an SLR which
- is of the same topic
- has same or similar review protocols

Transfer SLR

Some literature review, examples.

Assumptions:

one single reviewer who never makes mistakes
binary classification, studies will be labeled as "relevant" or "irrelevant" by the reviewer
previously completed an SLR which
- is of similar or related topics.

Method

UPDATE is designed to transfer knowledge in update SLR scenario where

Whole UPDATE: all data from previous SLR are imported, including labeled and unlabeled;
Partial UPDATE: only labeled data from previous SLR are imported.
Skip random sampling step.

REUSE is designed to transfer knowledge in transfer SLR scenario where

only the learned model is imported from previous SLR;
reuse sampling is utilized to replace random sampling;
after ENOUGH examples retrieved by reuse sampling, train a new model with only examples from current SLR.

Experiment

Update SLR

Partial UPDATE vs. Whole UPDATE: (can be a RQ).

Data: Hall2007- as previous SLR, Hall2007+ as new SLR:

FASTREAD on Hall2007-:

FASTREAD vs. Partial UPDATE vs. Whole UPDATE on Hall2007+:

UPDATE reduces review cost and variances on Hall2007+.
Partial UPDATE has same performance with Whole UPDATE while less data are imported.

Transfer SLR

Depending on the topic similarity of the new SLR and the previous one, different methods might be more suitable:

high: UPDATE may work best;
medium: REUSE may work best;
low: FASTREAD may work best;

Topic similarity

Data sets:

Hall: defect prediction
Wahono: defect prediction
Abdellatif: software analysis

Similarity measurement: 30 topics LDA, L2 normalization, cosine distance.

Data similarity

data_Hall_Wahono: 0.860254
data_Hall_Abdellatif: 0.726351
data_Abdellatif_Wahono: 0.809703

Target similarity

target_Hall_Wahono: 0.995255
target_Hall_Abdellatif: 0.64379
target_Abdellatif_Wahono: 0.649005

(Data similarity does not necessarily reflect target similarity)

Hall and Wahono are both on defect prediction, and these two have very high target similarity (0.995).

Abdellatif is on software analysis, the target similarity between Abdellatif and the other two are about 0.64.

Result

Hall as previous SLR,

on Wahono:
on Abdellatif:

Wahono as previous SLR,

on Hall:
on Abdellatif:

Abdellatif as previous SLR,

on Hall:
on Wahono:

Conclusions:

mostly consistent with Similarity(Hall, Wahono)=high, Similarity(Hall, Abdellatif)=medium, Similarity(Abdellatif, Wahono)=medium.
from Wahono to Hall: REUSE is better than UPDATE when retrieving 90% relevant studies while UPDATE is better than REUSE when retrieving less than 80%. Using UPDATE on transfer SLR scenarios, even when target similarity is high, may sacrifice completeness.

A series of experiments

FASTREAD on Hall2007- => UPDATE on Hall2007+ => UPDATE on Wahono => REUSE on Adbellatif: