issta-16-ae-6artifacts's Introduction

##Artifacts prepared for ISSTA-16-AE-6

The tools in this repo can be used to recreate the results published in the ISSTA-16 paper, Exploring Regular Expression Usage and Context in Python.

######This code does not mine GitHub for Python regexes.

Contents of important folders are described below:


####artifacts

The artifacts folder contains key objects used in recreating the results of the paper:

  • merged_report.db is an SQLite3 database file containing all the data mined from GitHub.
  • projectInfo.tsv contains a list of the projects mined for regexes (only those that contained regexes are included).
  • fullCorpus.tsv is a dump of all the patterns (and the project sets associated with them) from the corpus in the paper.
  • patternTracking contains three files, accounting for patterns excluded from analysis due to Unicode errors, rare features, and other errors, as mentioned in the paper.
  • featureStats.tex is the .tex table that displays feature statistics for regexes in the corpus.
  • rexStrings is a folder containing all the strings generated by Rex used to build the input for mcl.
  • filteredCorpus.tsv is a file containing the regexes supported by Rex, used in the analysis.
  • similarityGraph.abc is the input for mcl. It represents the weighted, undirected edges of the similarity matrix.
  • clusters.tsv is the cluster specification produced by mcl.
  • patternClusterDump.tsv is a more human-readable dump showing which patterns are included in which clusters.
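
For readers who have not used mcl's label input before, here is a minimal Python sketch of reading an edge list such as similarityGraph.abc. It assumes mcl's usual `--abc` layout of one edge per line (two node labels followed by a weight, separated by whitespace). The labels and weights in the sample are made up, and `read_abc_edges` is a hypothetical helper that is not part of this repo.

```python
def read_abc_edges(lines):
    """Parse 'labelA labelB weight' lines into an undirected, weighted edge map."""
    edges = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        a, b, weight = line.split()
        # Sort the endpoints so (a, b) and (b, a) refer to the same undirected edge.
        key = tuple(sorted((a, b)))
        edges[key] = float(weight)
    return edges


# Made-up sample data; the real file's labels and weights may differ.
sample = ["0\t1\t0.75", "1\t2\t0.5", "1\t0\t0.75"]
edges = read_abc_edges(sample)
```

Keying each edge on its sorted endpoints is one simple way to treat the graph as undirected, which matches the description of the file above.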

####src/recreateArtifacts

Readme files containing instructions for how to reproduce the artifacts are in the following paths:

  • miningDataSources generates the projectInfo.tsv file from the database.
  • corpus generates the fullCorpus.tsv file from the database, and error tracking files in patternTracking.
  • featureTable generates featureStats.tex from fullCorpus.tsv. This program is reusable for appropriate inputs.
  • similarityMatrix contains a program that transforms filteredCorpus.tsv and the contents of rexStrings into the similarityGraph.abc file.
  • clusters contains a program that runs mcl on the similarityGraph.abc input to generate clusters.tsv and patternClusterDump.tsv.

######For all programs, inputs are taken from the artifacts folder and outputs are produced in an output folder within the folder for recreating the artifact.

Note that at this time, we do not have a convenient way to recreate the contents of the rexStrings folder or filteredCorpus.tsv, both of which depend on Rex. The inconvenient way, which we used, is to install .NET 4.5, Rex, and Visual Studio on a Windows 7 machine, set up the various path variables in the code in the csharp folder of the tour_de_source repository, and then run Rex incrementally with that code (the batch size is predefined).


####src/main and src/test

These folders contain the core code used to reason about regular expressions, and the test suite protecting and specifying that code.


####Input format

A tab-separated-values (TSV) file in which each line holds a Python pattern, a tab, and a comma-separated list of project IDs, like:

"ab*c"  1,2
"(?:\\d+)\.(\\d+)"   2,3,5
u'[^a-zA-Z0-9_]' 1,5
'^[-\\w]+$' 2
'^\\s*\\n'  1,3,4

At this time, all patterns must be followed by a tab and at least one project ID.

Patterns should be valid in Python. Raw Python strings are not supported at this time.

No extra lines or whitespace in input files, please. No duplicate patterns, please.
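
As an illustration of these rules, here is a minimal Python sketch that checks input lines before they are handed to the tools. It is not the loader used in this repo (that code is Java). The names `parse_corpus_line` and `load_corpus` are hypothetical, and splitting on the last tab of each line is an assumption made so that a tab inside a pattern would not break parsing.

```python
def parse_corpus_line(line):
    """Split one input line into (pattern_literal, project_ids)."""
    # Split on the LAST tab, so the pattern text itself is left untouched.
    pattern, sep, id_field = line.rstrip("\n").rpartition("\t")
    if not sep or not pattern:
        raise ValueError("expected '<pattern><TAB><id>[,<id>...]'")
    # int() raises ValueError on an empty or non-numeric ID.
    ids = [int(token) for token in id_field.split(",")]
    return pattern, ids


def load_corpus(lines):
    """Map each pattern literal to its project IDs, rejecting duplicates."""
    corpus = {}
    for line in lines:
        pattern, ids = parse_corpus_line(line)
        if pattern in corpus:
            raise ValueError("duplicate pattern: " + pattern)
        corpus[pattern] = ids
    return corpus


sample = ['"ab*c"\t1,2', "u'[^a-zA-Z0-9_]'\t1,5", "'^[-\\\\w]+$'\t2"]
corpus = load_corpus(sample)
```

Raising on the first problem keeps the sketch short. A friendlier checker could collect every offending line number and report them together.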


##Setup Eclipse

  1. Create a new Java Project in Eclipse (use Java 1.7) using this repo as the project directory.
  2. Add the jar files in the lib directory to the build path.
  3. To run tests of the core code, set up JUnit4.

##F.A.Q.

####why Python?

It was not an arbitrary choice, but it was not the only option, either. JavaScript would have been a reasonable alternative using our rationale. Consider first that regular expression languages have different feature sets, and doing this analysis takes some time. In order to maximize the impact of the research, we wanted a language that includes common features (features shared by other languages) and excludes rare features (features not shared by many other languages). Python fits this description, as can be seen by looking at a comparison of language feature sets from my thesis.

####where is the mining code?

It can be found in the tour_de_source repo, but it is not groomed for public consumption, and is probably not an optimal mining solution.

####why not use formal tools for behavioral analysis?

Because the tools we found cannot handle regexes using certain common features, like '$'.

####how can I submit an error report, bug report or pull request?

Please open an issue if you find any problems or want to be a contributor.

issta-16-ae-6artifacts's People

Contributors

softwarekitty

issta-16-ae-6artifacts's Issues

support raw Python patterns

Luckily, the corpus does not have any raw patterns:

SELECT pattern FROM RegexCitationMerged WHERE pattern LIKE "r%" ;

but they should be handled properly, by simply re-escaping backslashes. This test should help:

@Test
public void test_PythonValidRegex_escape_raw() throws IllegalArgumentException, QuoteRuleException, PythonParsingException {
    // The Java literal below spells the Python raw string r"\\" (an escaped backslash).
    RegexProjectSet rps = new RegexProjectSet("r\"\\\\\"", pidList.get(0));
    assertNotNull(rps);
}

because of this:

>>> import re
>>> x = re.compile("\\")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 194, in compile
    return _compile(pattern, flags)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 251, in _compile
    raise error, v # invalid expression
sre_constants.error: bogus escape (end of line)
>>> x = re.compile(r"\\")
>>>
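
The same distinction can be checked programmatically. The sketch below is written for Python 3 (the session above is Python 2.7) and uses `ast.literal_eval` to evaluate the source text of a string literal the way the interpreter would, including the `r` and `u` prefixes. The helper name `compiles` is hypothetical, and this is only one possible approach, not how the Java code in this repo handles patterns.

```python
import ast
import re


def compiles(literal):
    """Return True if the Python string literal evaluates to a valid regex."""
    pattern = ast.literal_eval(literal)  # honors r"..." and u"..." prefixes
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


plain = '"\\\\"'  # the source text "\\", which is ONE backslash at run time
raw = 'r"\\\\"'   # the source text r"\\", which is TWO backslashes at run time
```

Here `compiles(plain)` is False, because a lone backslash is an incomplete escape, while `compiles(raw)` is True, because the regex engine receives an escaped backslash.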

Enforce pattern uniqueness in loading and in clustering.

Right now, if the same pattern has two different project lists, the RegexProjectSet will compare as different, and both will be added.

In the past, duplicates were dealt with when loading from SQL, using SELECT and other systems found in the tests for LoadUtil, but now we are loading from text files.

This should be an easy fix - just detect duplicate patterns and merge their project lists into one representative, as should be expected. Then the input format could allow users to have one line per project ID.
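
A minimal Python sketch of the proposed merge, assuming rows have already been parsed into (pattern, project ID list) pairs. `merge_duplicates` is a hypothetical name, and the real fix would live in the Java loading code rather than in Python.

```python
def merge_duplicates(rows):
    """Merge (pattern, project_ids) rows so that each pattern appears once."""
    merged = {}
    for pattern, ids in rows:
        # Union the project IDs of every row that shares the same pattern.
        merged.setdefault(pattern, set()).update(ids)
    return {pattern: sorted(ids) for pattern, ids in merged.items()}


# Made-up rows: the first pattern appears twice with different project lists.
rows = [('"ab*c"', [1, 2]), ('"ab*c"', [2, 7]), ("'^\\\\s*\\\\n'", [3])]
merged = merge_duplicates(rows)
```

With a merge like this in place, an input file with one line per project ID would collapse to one entry per pattern.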

Then, in clustering, mcl guarantees unique nodes per cluster, but that guarantee is not checked or enforced in our code, which may concern extensions that use overlapping clustering techniques.
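
One way such a check could look, sketched in Python. It assumes the cluster file has one cluster per line with whitespace-separated node labels, which is how mcl writes label output. Whether clusters.tsv follows exactly that layout is an assumption, and `find_repeated_nodes` is a hypothetical helper.

```python
from collections import Counter


def find_repeated_nodes(cluster_lines):
    """Return labels that occur more than once, within or across clusters."""
    counts = Counter()
    for line in cluster_lines:
        counts.update(line.split())
    return sorted(label for label, n in counts.items() if n > 1)


# Made-up cluster data for illustration.
clean = ["0\t1\t4", "2\t3"]
overlapping = ["0\t1\t4", "2\t3\t1"]
```

An empty result means every node belongs to exactly one cluster. For the `overlapping` sample, node `1` is reported because it appears in both clusters.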
