The issta-16-ae-6artifacts from softwarekitty

##Artifacts prepared for ISSTA-16-AE-6

The tools in this repo can be used to recreate the results published in the ISSTA-16 paper, Exploring Regular Expression Usage and Context in Python.

######This code does not mine GitHub for Python regexes.

Contents of important folders are described below:

####artifacts

The artifacts folder contains key objects used in recreating the results of the paper:

- merged_report.db is an SQLite3 database file containing all the data mined from GitHub.
- projectInfo.tsv contains a list of the projects mined for regexes (only those that contained regexes are included).
- fullCorpus.tsv is a dump of all the patterns (and the project sets associated with them) from the corpus in the paper.
- patternTracking contains three files, accounting for patterns excluded from analysis due to unicode errors, rare features and other errors, as mentioned in the paper.
- featureStats.tex is the .tex table that displays feature statistics for regexes in the corpus.
- rexStrings is a folder containing all the strings generated by Rex used to build the input for mcl.
- filteredCorpus.tsv is a file containing the regexes supported by Rex, used in the analysis.
- similarityGraph.abc is the input for mcl. It represents the weighted, undirected edges of the similarity matrix.
- clusters.tsv is the cluster specification produced by mcl.
- patternClusterDump.tsv is a more human-readable file showing which patterns are included in which clusters.
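
For a first look at merged_report.db, the table names can be listed without assuming anything about the schema, which this README does not describe. The following is a minimal sketch using Python's standard sqlite3 module; the function name `list_tables` is hypothetical and is not part of this repo.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return the sorted names of all tables in an SQLite database file."""
    # Open the file read-only so the artifact cannot be changed by accident.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Calling `list_tables("artifacts/merged_report.db")` from the repo root should then print whatever tables the database actually contains.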

####src/recreateArtifacts

Readme files containing instructions for how to reproduce the artifacts are in the following paths:

- miningDataSources generates the projectInfo.tsv file from the database.
- corpus generates the fullCorpus.tsv file from the database, and error tracking files in patternTracking.
- featureTable generates featureStats.tex from fullCorpus.tsv. This program is reusable for appropriate inputs.
- similarityMatrix contains a program that transforms filteredCorpus.tsv and the contents of rexStrings into the similarityGraph.abc file.
- clusters contains a program that runs mcl on the similarityGraph.abc input to generate clusters.tsv and patternClusterDump.tsv.
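
To illustrate the file formats exchanged with mcl, here is a rough sketch, not the code in similarityMatrix or clusters. It assumes that patterns are labelled by integer indices and that similarity weights are already computed; the function names are hypothetical. In mcl's "abc" format each line holds two labels and a weight, and in `--abc` mode mcl writes one cluster per line as tab-separated labels. This README does not state which other mcl options were used for the paper, so none are shown.

```python
def format_abc(edges):
    """Format {(label_a, label_b): weight} as mcl 'abc' lines, one edge per line."""
    lines = []
    for (a, b), weight in sorted(edges.items()):
        lines.append("{}\t{}\t{}".format(a, b, weight))
    return "\n".join(lines) + "\n"


def mcl_command(graph_path, clusters_path):
    """Build (but do not run) an mcl invocation that reads abc-format input."""
    return ["mcl", graph_path, "--abc", "-o", clusters_path]


def parse_clusters(text):
    """Parse mcl --abc output: each non-empty line is one cluster of labels."""
    return [line.split("\t") for line in text.splitlines() if line]
```

The command list returned by `mcl_command` could be passed to `subprocess.run` on a machine where mcl is installed.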

######For all programs, inputs are taken from the artifacts folder and outputs are produced in an output folder within the folder for recreating the artifact.

Note that we do not currently have a convenient way to recreate the contents of the rexStrings folder or filteredCorpus.tsv, both of which depend on Rex. The inconvenient way, which we used, is to install .NET 4.5, Rex and Visual Studio on a Windows 7 machine, set up the various path variables in the code in the csharp folder of the tour_de_source repository, and then run Rex incrementally with that code (the batch size is pre-defined).

####src/main and src/test

These folders contain the core code used to reason about regular expressions, and the test suite protecting and specifying that code.

####Input format

A tab-separated-values (tsv) file in which each line holds a Python pattern, a tab, and a comma-separated list of project IDs, like:

"ab*c"  1,2
"(?:\\d+)\.(\\d+)"   2,3,5
u'[^a-zA-Z0-9_]' 1,5
'^[-\\w]+$' 2
'^\\s*\\n'  1,3,4

At this time, all patterns must be followed by a tab and at least one project ID.

Patterns should be valid in Python. Raw Python strings are not supported at this time.

No extra lines or whitespace in input files, please. No duplicate patterns, please.
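
The input rules above can be checked with a short script before running the tools. This is a minimal sketch in Python, not the loader used by the Java code in src/main; the function names are hypothetical, and treating project IDs as integers is an assumption based on the example lines. Note that `ast.literal_eval` also accepts raw strings, which the tools do not support.

```python
import ast


def parse_corpus_line(line):
    """Parse one 'pattern<TAB>id,id,...' line into (pattern, [project ids])."""
    # Split on the last tab: the ID list itself never contains a tab.
    literal, sep, id_field = line.rstrip("\n").rpartition("\t")
    if not sep or not id_field:
        raise ValueError("expected a pattern, a tab, and at least one project ID")
    # Evaluate the quoted Python string literal without executing code.
    pattern = ast.literal_eval(literal)
    if not isinstance(pattern, str):
        raise ValueError("pattern is not a string literal: " + literal)
    project_ids = [int(field) for field in id_field.split(",")]
    return pattern, project_ids


def parse_corpus(lines):
    """Parse all lines, rejecting duplicate patterns."""
    corpus = {}
    for number, line in enumerate(lines, 1):
        pattern, project_ids = parse_corpus_line(line)
        if pattern in corpus:
            raise ValueError("duplicate pattern on line {}".format(number))
        corpus[pattern] = project_ids
    return corpus
```

Duplicates are detected after the literal is evaluated, so `u'a'` and `'a'` count as the same pattern in this sketch.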

##Setup Eclipse

1. Create a new Java Project in Eclipse (use Java 1.7) using this repo as the project directory.
2. Add the jar files in the lib directory to the build path.
3. To run tests of the core code, set up JUnit4.

##F.A.Q.

####why Python?

It was not an arbitrary choice, but it was not the only option, either. JavaScript would have been a reasonable alternative by our rationale. Regular expression languages have different feature sets, and doing this analysis takes some time. To maximize the impact of the research, we wanted a language that includes common features (features shared by other languages) and excludes rare features (features not shared by many other languages). Python fits this description, as can be seen by looking at a comparison of language feature sets from my thesis.

####where is the mining code?

It can be found in the tour_de_source repo, but it is not groomed for public consumption, and is probably not an optimal mining solution.

####why not use formal tools for behavioral analysis?

Because the tools we found cannot handle regexes using certain common features, like '$'.

####how can I submit an error report, bug report or pull request?

Please open an issue if you find any problems or want to be a contributor.
