Giter Site home page Giter Site logo

querybuilder's Introduction

<meta charset="utf-8" lang="en">

                **Project: QueryBuilder**

This is the report of our progress so far in developing a domain specific language known as QueryBuilder.


<!--

General
========

For each of your team's implementations, explain the following (where appropriate and applicable):

- details on calling conventions, input and output data formats, limitations, bugs, and special features.
- negative aspects of your program (limitations, known bugs)
- positive aspects (extensions, special features)
- describe your choice of modularization (abstractions), data structures, and algorithms
- explain anything you did that is likely to be different from what other students may have done
- justify any design decisions for which the rationale isn't immediately clear

Feel free to modify the structure of this `readme.html` file to fit the current assignment and to fit how you wish to present your findings.


Submission
-----------


Create a zip file that contains all of your code, this `readme.html` document, and any additional files of evidence (ex: screenshots).

Use folders to divide up the subparts of your submission.
For example, your zip file could look like the following:

~~~~
readme.html
ada/
    primepartitions.adb
    screenshot.png
go/
    primepartitions.go
    screenshot.png
haskell/
    primepartitions.hs
    screenshot.png
prolog/
    primepartitions.pl
    screenshot.png
python/
    primepartitions.py
    screenshot.png
~~~~
-->


Authors
=============

<!-- Note: wrapping table in div.noheader will hide the table's header -->
<!-- Note: wrapping table in div.firstcol will style the first column different from other columns -->
<div class="noheader firstcol">
                  |             
------------------|-------------
names             | Owen Elliott / Jeff Jewett
computer + OS     | cse21703 (Ubuntu)
time to complete  | 30 hours
</div>


Introduction
================

QueryBuilder is a DSL for composing and executing regex statements

regex is confusing, the DSL seeks to simplify it and allow for testing

useful to write complex regex queries quickly

solves the problem of capture groups not being very robust

Features
================

labels for repeated inputs

simplified common expressions

distinguish between capture groups (format output accordingly)

compile to regex

template functions

test regex against input data (STDIN | INPUT FILE)

Examples
================

File 1: example.qb

```
namespace example

label capture firstname = "[a-zA-z]+"
capture lastname = "[a-zA-z]+"
label fullname = firstname "\s" lastname

function fencepost(x,y) = x "(?>" y x ")*"
label namelist = fencepost(fullname, ",\s")

build output = namelist  #build regex form
test output @STDIN
test output 'inputfile.txt'
```

File 2: test.qb

```
import "previousfile.qb"
namespace test

label capture[] email = "[\w\.]*@\w*.\w*"
label emaillist = example.fencepost(email, ",\s")
build output = emaillist
```

Process
================

namelist

fencepost(fullname,",\s")

fullname(,",\s"fullname)*

firstname"\s"lastname"(?>,\s"firstname"\s"lastname)*

(firstname)"\s"(lastname)"(?>,\s"(firstname)"\s"(lastname))*

([a-zA-z]+)\s([a-zA-z]+)(?>,\s([a-zA-z]+)\s([a-zA-z]+))*

Test
================

input 1
--------------------------------

Joe Smith, Debra Jones

Jeff Jewett, Owen Elliott

output 1
--------------------------------

Match 1: Joe Smith, Debra Jones

Group 1: Joe

Group lastname1: Smith

Group 3: Debra

Group lastname2: Jones


Match 2: Jeff Jewett, Owen Elliott

Match 1: Joe Smith, Debra Jones

Group 1: Jeff

Group lastname1: Jewett

Group 3: Owen

Group lastname2: Elliott


input 2
--------------------------------

[email protected], [email protected]

output 2
--------------------------------

Match 1: [email protected], [email protected]

Group email1: [email protected]

Group email2: [email protected]


Grammar
================

import: appends following file to the top of the script, generates symbol table (like a #include)

label: a named pattern of regex literals or function outputs that can be referenced later, added to the symbol table

capture: a modifier to a label that indicates that it is a capture group (can also name capture groups in a list)

function: a template pattern with placeholder values

build: translates the following pattern into proper regex. can be modified with flags.

test: given a built query and a target, search the target using the given query.

namespace: delcare the title of this file's labels / functions, if included in another file reference with namespace.label


Parser
================

QueryBuilder files are recognized as valid input using the ANTLR grammar system. We ran tests to see if the parser could tell the difference between valid and invalid syntax, and it passed all ten tests.
All ten tests used are in the /test/ subdirectory.

![Parse Tree](asset1.png)

Here are a few sample tests for valid input

test6

```
import './test.qb'
#import the file

label ex = 'ex'
label mod = '?'

function concat (a,b) = a b
#the parser accepts this input
```

test5

```
import 'test.qb'
namespace example

label new = test.foo test.bar('a')
build out = test.prebuilt @GM
test out @STDIN
#the parser accepts this input
```

Here are a few sample tests for invalid input

test 3

```
namespace bad3

label nothingWrongSoFar = 'ok'

build forgotAnEquals nothingWrongSoFar
#the parser rejects this input
```

test 4

```
namespace bad4

label missingQuotes = $hello^

build output = missingQuotes
#the parser rejects this input
```



Translator
==================

We used the ANTLR listener classes as a tool to build a translator for QueryBuilder. We accomplished our goal of making a language that is easy to write in while being functional.

The best way to explain it is to show a few examples. The following QueryBuilder program is an example of the proper way to utilize the language.

```
# declare the namespace
# (optional unless you import this file into a different file)
namespace example

capture firstname = "[a-zA-z]+"
capture[] lastname = "[a-zA-z]+"
label fullname = firstname "\s" lastname

function fencepost(x,y) = x "(" y x ")*"
label namelist = fencepost(fullname, ",\s?")

# build regex form
build output = namelist

# test against STDIN
test output @STDIN
```

![Result (from STDIN)](result1.png)

As you can see, the compiled regex statement successfully matched the user input string. You can also match a different file by replacing `test output @STDIN` with something like `test output 'test/inputfile.txt'`

![Result (from file)](result2.png)

Here are a few of the situations where our translator will catch semantic errors in the code. We created a SymbolTable class that stores all the named labels, functions, builds, etc. and used it to make sure the code doesn't try to declare
more than one of these with the same identifier. The following code tries to declare two labels with the same name:

```
label example = "[^abc]"
label example = "[^def]"

build output = example
test output @STDIN
```

The process ends with the message `Cannot overwrite an existing label or function name: example`.

The previous example also did not have a namespace declared. While that is legal, it would cause a problem if a different program imported it, as its symbols are referenced by that name.

The error message `Imported files must have a namespace: test/testFile1.qb` would appear if testFile1 was about to be imported and it didn't have a declared namespace.

Similarly, if you try to import a file that doesn't exist, the program will halt with the error message `Could not open file filename.txt`.

Another feature of QueryBuilder is the concept of namespaces. By importing other files, their symbols are accessed with the syntax `namespace.symbol`. For example, take a look at the following program.

```
import "test/good4.qb"

label test1 = aNameSpace.ftest

build output = test1
test output @STDIN
```
And this is what `good4.qb` has:

```
namespace aNameSpace

label ex = 'ex'

function concat (a,b) = a "(?>" b a ")*"

label ftest = concat('g', 'h')

build out = ftest
```

![Result (imports)](result3.png)

Non Trivial Examples
=====================

IPv6 Address Matcher
----------------------

This code uses a combination of labels and a function to generate a long regex query that matches
all forms of IPv6 addresses.

```
namespace ipv6

label addrChunk1to4 = '(?>[0-9a-fA-F]{1,4}:)'
function setCountChanger(num1,num2) = addrChunk1to4 '{1,' num1 '}' '(?>:[0-9a-fA-F]{1,4}){1,' num2 '}'

# IPv6 Address Type 1 (1-4 chars per set) (8 sets)
# Example: 2001:db8:85a3:0000:0000:8a2e:370:7334

capture addrType1 = addrChunk1to4 '{7,7}' '[0-9a-fA-F]{1,4}'

# IPv6 Address Type 2 (1-4 chars per set) (1-7 sets :: 0 sets)
# Example: fe42:26a1:370:507::

capture addrType2 = addrChunk1to4 '{1,7}:'

# IPv6 Address Type 3 (1-4 chars per set) (1-6 sets :: 1 set)
# Example: ff06:1d09:1344:cfc2:31::c33d:1313

capture addrType3 = setCountChanger('6','1')

# IPv6 Address Type 4 (1-4 chars per set) (1-5 sets :: 1-2 sets)
# Example: ff06:1d09:1344:cfc2:31::c33d:1313

capture addrType4 = setCountChanger('5','2')

# IPv6 Address Type 5 (1-4 chars per set) (1-4 sets :: 1-3 sets)
# Example: ff06:1d09:1344:cfc2::31:c33d:1313

capture addrType5 = setCountChanger('4','3')

# IPv6 Address Type 6 (1-4 chars per set) (1-3 sets :: 1-4 sets)
# Example: ff06:1d09:1344::cfc2:31:c33d:1313

capture addrType6 = setCountChanger('3','4')

# IPv6 Address Type 7 (1-4 chars per set) (1-2 sets :: 1-5 sets)
# Example: ff06:1d09::1344:cfc2:31:c33d:1313

capture addrType7 = setCountChanger('2','4')

# IPv6 Address Type 8 (1-4 chars per set) (1 set :: 1-6 sets)
# Example: ff06::1d09:1344:cfc2:31:c33d:1313

capture addrType8 = setCountChanger('1','6')

# IPv6 Address Type 9 (1-4 chars per set) (0 sets :: 0-7 sets)
# Example: ::3e13:1335

capture addrType9 = ':((:[0-9a-fA-F]{1,4}){1,7}|:)'

# build and test

label ipv6addr = addrType1 '|' addrType2 '|' addrType3 '|' addrType4 '|' addrType5 '|' addrType6 '|' addrType7 '|' addrType8 '|' addrType9

build query = ipv6addr
test query "ipv6.txt"
test query "badIPV6data.txt"
```
![IPv6 Address Matcher](ipv6.png)

HTML Table Query
----------------------

This code takes an html page as an input and matches the data in any tables on that page (including rows and cols).

```
namespace htmltable

label notbrackets = '[^<>]'
label spacebetween = notbrackets '*'
function dup6(a) = a a a a a a
function opentag(type) = '<' type spacebetween '>'
function closetag(type) = '</' type notbrackets '*>'

label openth = opentag('th')
label closeth = closetag('th')
capture[] thval = notbrackets '*'
label thpair = openth thval closeth

label opentd = opentag('t[dh]')
label closetd = closetag('t[dh]')
capture[] tdval = '.*?'
label tdpair = opentd tdval closetd

label opentr = opentag('tr')
label closetr = closetag('tr')
label thpairspace = thpair spacebetween
label trh6inner = spacebetween dup6(thpairspace) spacebetween
label trhpair = opentr trh6inner closetr

label tdpairspace = tdpair spacebetween
label trd6inner = spacebetween dup6(tdpairspace) spacebetween
label trdpair = opentr trd6inner closetr

build open = trdpair
test open 'htmltable.html'
```
![HTML Table](wikitableoutput.png)

Conclusions
=================

QueryBuilder is a pretty useful tool when trying to conceptualize and build complex queries.
Compared to doing regex by hand, our DSL is vastly superior. For example, `ipv6.qb` generates a *very* complex
query that is essentially a list of possible forms for IPv6 address. Using the function I declared at the start, however,
I was able to create this massive query in less than 20 lines of code (comments not included). In addition, our `test` functionality
can optionally output the results to a .json file for convenience. These features are what makes QueryBuilder a more useful tool
than a generalized language.

![JSON example](json.png)

In our opinion, the code looks very clean and readable, which can help someone who comes accross the code to understand it,
even if they dont know what this means:
```
(?<ipv6addrType1>(?>[0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4})|(?<ipv6addrType2>(?>[0-9a-fA-F]{1,4}:){1,7}:)|(?<ipv6addrType3>(?>[0-9a-fA-F]{1,4}:){1,6}(?>:[0-9a-fA-F]{1,4}){1,1})|(?<ipv6addrType4>(?>[0-9a-fA-F]{1,4}:){1,5}(?>:[0-9a-fA-F]{1,4}){1,2})|(?<ipv6addrType5>(?>[0-9a-fA-F]{1,4}:){1,4}(?>:[0-9a-fA-F]{1,4}){1,3})|(?<ipv6addrType6>(?>[0-9a-fA-F]{1,4}:){1,3}(?>:[0-9a-fA-F]{1,4}){1,4})|(?<ipv6addrType7>(?>[0-9a-fA-F]{1,4}:){1,2}(?>:[0-9a-fA-F]{1,4}){1,4})|(?<ipv6addrType8>(?>[0-9a-fA-F]{1,4}:){1,1}(?>:[0-9a-fA-F]{1,4}){1,6})|(?<ipv6addrType9>:((:[0-9a-fA-F]{1,4}){1,7}|:))
```
We took about equal responsibilty for this project, although Jeff did more coding while Owen did more documenting.

<!--   Feel free to modify the following to fit a theme of your choosing   -->
<link href="https://fonts.googleapis.com/css?family=Open+Sans&display=swap" rel="stylesheet"> <!-- a sans-serif font -->
<style>  /* A TAYLOR-INSPIRED THEME */
    body {font-family:'Open Sans',sans-serif;}
    .md a:link, .md a:visited {color:hsl(252,23.0%,44.3%); font-family:'Open Sans',sans-serif;}
    .md table.table th {background-color:hsl(252,23.0%,44.3%);}
    .md .noheader th {display:none;}
    .md .firstcol td:first-child {white-space:pre;color:white;vertical-align:top;font-weight:bold;border-color:black;background:hsl(252,23.0%,54.3%);}
    .md .firstcol tr:nth-child(even) td:first-child {background:hsl(252,23.0%,44.3%);}
</style>


<!-- ****************************** -->
<!--    Leave the content below     -->
<!-- ****************************** -->

<!-- The script and style below are added for clarity and to workaround a bug -->
<script>
    // this is a hack to workaround a bug in Markdeep+Mathjax, where
    // `&#36;`` is automatically converted to `\(`` and `\)`` too soon.
    // the following code will replace the innerHTML of all elements
    // with class "dollar" with a dollar sign.
    setTimeout(function() {
        var dollars = document.getElementsByClassName('dollar');
        for(var i = 0; i < dollars.length; i++) {
            dollars[i].innerHTML = '&#' + '36;'; // split to prevent conversion to $
        }
    }, 1000);
</script>
<style>
    /* adding some styling to <code> tags (but not <pre><code> coding blocks!) */
    :not(pre) > code {
        background-color: rgba(0,0,0,0.05);
        outline: 1px solid rgba(0,0,0,0.15);
        margin-left: 0.25em;
        margin-right: 0.25em;
    }
    /* fixes table of contents of medium-length document from looking weird if admonitions are behind */
    .md div.mediumTOC { background: white; }
    .md div.admonition { position: initial !important; }
</style>

<!--   Leave the following Markdeep formatting code, as this will format your text above to look nice in a wed browser   -->
<style class="fallback">body{visibility:hidden;white-space:pre;font-family:monospace}</style>
<script src="https://casual-effects.com/markdeep/latest/markdeep.min.js"></script>
<script>window.alreadyProcessedMarkdeep||(document.body.style.visibility="visible");</script>

querybuilder's People

Contributors

jeffjewett27 avatar ozoneislucky avatar

Watchers

James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.