ml4ai / automates


AutoMATES: Automated Model Assembly from Text, Equations, and Software

Home Page: https://ml4ai.github.io/automates

License: Other

Dockerfile 0.08% Makefile 0.10% Python 15.06% Shell 0.05% TeX 0.32% HTML 0.11% CSS 0.27% Scala 2.21% JavaScript 1.56% Jupyter Notebook 13.24% Fortran 65.89% Julia 0.02% C++ 0.15% Forth 0.03% Scheme 0.05% Gnuplot 0.67% Pascal 0.01% C 0.13% CMake 0.06% Pawn 0.01%

automates's People

Contributors

adarshp, alicekwak, aswinchester, beckysharp, cl4yton, dependabot[bot], dpdicken, enoriega, jkadowaki, jobagy, jpfairbanks, marcovzla, maxaalexeeva, pauldhein, pratikbhd, rsulli55, skdebray, skhan1020, titomeister, vincentraymond-ua, zupon


automates's Issues

[PA]: Issue with GrFN generation while accessing in conditional block

This was found while writing the if_statement unit test in #275. In the C file for that test, notice that variable a is accessed in the second if block. In the GrFN, we would expect an arrow from a at the top to the interface node in the second conditional block, but it is missing. Furthermore, the generated GrFN cannot be executed and produces the following error:

Executing GrFN...
Traceback (most recent call last):
  File "/home/ryan/projects/automates/scripts/program_analysis/run_gcc_to_grfn.py", line 213, in <module>
    run_gcc_pipeline()
  File "/home/ryan/projects/automates/scripts/program_analysis/run_gcc_to_grfn.py", line 205, in run_gcc_pipeline
    result = grfn(inputs)
  File "/home/ryan/projects/automates/automates/model_assembly/networks.py", line 1197, in __call__
    self.root_subgraph(self, subgraph_to_hyper_edges, node_to_subgraph, set())
  File "/home/ryan/projects/automates/automates/model_assembly/networks.py", line 621, in __call__
    sugraph_execution_result = subgraph(
  File "/home/ryan/projects/automates/automates/model_assembly/networks.py", line 646, in __call__
    to_execute()
  File "/home/ryan/projects/automates/automates/model_assembly/networks.py", line 390, in __call__
    variable = self.outputs[i]
IndexError: list index out of range

Excerpt of relevant code:

    if (a > b) {
        x = b;
        b = a;
    }

    if (x == 3) {
        a = x;
        b = a;
        x = 10;
    }

Here are the CAST and GrFN pdfs created from the C file:
if_statement--CAST.pdf
if_statement--GrFN.pdf

Pinned Python Dependencies making for a challenging install?

I am having difficulty installing the package and noticed that the number of pinned dependencies in setup.py is fairly substantial (see below). I'm trying to install on an M1 Mac and get stuck on the scipy dep not having a wheel for that version and my architecture. Would it be possible to unpin some of these deps? (A sketch of relaxed pins follows the list.) Thanks!

        "antlr4-python3-runtime==4.8",
        "dill==0.3.4",
        "Flask==1.1.1",
        "flask_codemirror==1.1",
        "flask_wtf==0.14.3",
        "future==0.18.2",
        "matplotlib==3.3.4",
        "networkx==2.5",
        "nltk==3.6.6",
        "notebook==6.4.12",
        "numpy==1.21",
        "pandas==1.2.2",
        "plotly==4.5.4",
        "pygraphviz==1.7",
        "pytest==6.2.2",
        "pytest-cov==2.11.1",
        "python-igraph==0.9.1",
        "Pygments==2.7.4",
        "SALib==1.3.12",
        "seaborn==0.10.0",
        "scikit_learn==0.24.1",
        "SPARQLWrapper==1.8.5",
        "sympy==1.5.1",
        "tqdm==4.29.0",
        "WTForms==2.2.1",
        "flask-codemirror",
        "scipy==1.6.0",
        "ruamel.yaml",
        "pdfminer.six",
        "pdf2image",
        "webcolors",
        "lxml",
        "Pillow",
        "ftfy",
        "fastparquet"
    ],
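For reference, a minimal sketch of what relaxing a few of the tightest pins might look like in setup.py; the bounds below are illustrative only and have not been tested against the codebase:

    # Illustrative only: relaxed specifiers for a few of the packages pinned above.
    install_requires = [
        "scipy>=1.6",          # instead of scipy==1.6.0 (no 1.6.0 wheel for Apple Silicon)
        "numpy>=1.21,<2",      # instead of numpy==1.21
        "matplotlib>=3.3",     # instead of matplotlib==3.3.4
    ]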

Improvement of conjunction handling

Cases currently not handled by untangleConj:

  • _αs and αc are soil evaporation coefficient and crop transpiration coefficient, respectively._ (here, we extract two separate conjDefs with multiple variables attached to one def, and there is no way to know whether the definition should be attached to the first or the second var)
  • where RHmax and RHmin are maximum and minimum relative humidity, respectively (here, maximum is not included in the def)

Possible solutions:

  • allow multiple defs in conjDefinitions and then, if there is an equal number of vars and defs, align them positionally as var1 -> def1, var2 -> def2, etc. (see the sketch below)
  • if there are multiple vars but one def, check if there is a conjoined element to the left (then attach var 2 to this def) or a conjoined element to the right (then attach var 2 to this def); this needs to be generalized somehow to more than two conjoined elements, e.g., by checking the number of conjoined elements to the left and right.
  • to solve the 'maximum' example, we might need to expand on conj_and or include that in the rule.
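A tiny sketch of the first option, using the example sentence above (positional pairing only when the counts match):

    # If the number of conjoined variables equals the number of definitions,
    # pair them positionally: var1 -> def1, var2 -> def2, ...
    variables = ["αs", "αc"]
    definitions = ["soil evaporation coefficient", "crop transpiration coefficient"]
    if len(variables) == len(definitions):
        aligned = dict(zip(variables, definitions))
        # {'αs': 'soil evaporation coefficient', 'αc': 'crop transpiration coefficient'}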

Python Conditional Variable Issue

Given a Python program like the following:

y = 10
if y < 5:
    x = 1
else:
    x = 3
print(x)

This program won't correctly translate to GrFN due to a bug having to do with the variable x. The variable x doesn't appear before the conditional, and this introduces an issue during GrFN generation when creating the appropriate GrFN variables.
There are currently two proposed fixes:

  • Create a dummy variable of sorts at the GrFN level that appears before the conditional starts so that it can generate the GrFN consistently as if a variable really was there.
  • Create a dummy Assign CAST node at the Python2CAST level before the conditional so that when we get to GrFN there is a variable before the conditional. This would make the GrFN generation work.

The current proposal is to implement the second as a solution, though perhaps in the future the first solution should also be implemented.
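To make the second proposal concrete, this is roughly what the inserted dummy Assign amounts to at the source level (the placeholder value is illustrative; the actual node would be created inside Python2CAST, not in the user's code):

    y = 10
    x = None   # dummy assignment inserted before the conditional (illustrative)
    if y < 5:
        x = 1
    else:
        x = 3
    print(x)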

DocumentFilter

When we parse arXiv PDFs with Science Parse, the first page pretty consistently picks up the sideways arXiv watermark, which gets interleaved into the extracted text:

The federated learning learns a shared global model by the aggregation of local models 
on client devices. But in the original paper of federated learning [18] only uses a simple 
average on client models, taking the number of samples in each client device as the 
weight of averaging. In the mobile keyboard applications, the language preference may 
vary from different individuals. There is a relation among the client
ar X
iv :1
81 2.
07 10
8v 1
[ cs
.C L
] 1
7 D
ec 2
01 8
language models, and their contributions to the central server are quite different. 

Since we're primarily using arXiv papers, I think we should remove this, probably with a DocumentFilter.

Only using grobid to find atomic values (not intervals)

@bsharpataz, just another example to support the idea of not using grobid-quant for intervals:

image

{
    "runtime": 59,
    "measurements": [
        {
            "type": "listc",
            "quantities": [
                {
                    "rawValue": "0.9",
                    "parsedValue": {
                        "numeric": 0.9,
                        "structure": {
                            "type": "NUMBER",
                            "formatted": "0.9"
                        },
                        "parsed": "0.9"
                    },
                    "offsetStart": 105,
                    "offsetEnd": 108,
                    "quantified": {
                        "rawName": "< Kcbmax",
                        "normalizedName": "kcbmax",
                        "offsetStart": 109,
                        "offsetEnd": 117
                    }
                }
            ],
            "quantified": {
                "rawName": "< Kcbmax",
                "normalizedName": "kcbmax",
                "offsetStart": 109,
                "offsetEnd": 117
            }
        },
        {
            "type": "interval",
            "quantityLeast": {
                "rawValue": "1.15",
                "parsedValue": {
                    "numeric": 1.15,
                    "structure": {
                        "type": "NUMBER",
                        "formatted": "1.15"
                    },
                    "parsed": "1.15"
                },
                "offsetStart": 177,
                "offsetEnd": 181
            },
            "quantityMost": {
                "rawValue": "1.15",
                "parsedValue": {
                    "numeric": 1.15,
                    "structure": {
                        "type": "NUMBER",
                        "formatted": "1.15"
                    },
                    "parsed": "1.15"
                },
                "offsetStart": 120,
                "offsetEnd": 124
            }
        }
    ]
}

And what we get because of this:
image

TR improvement

  • handle parameter settings expressed in natural language, e.g., where b is a positive number
  • handle compound vars with Greek characters, e.g., _αs and αc are soil evaporation coefficient and crop transpiration coefficient, respectively._ Here alpha c (αc) does not get reduced to one variable even though keepLongestVariable should take care of that.

Param Setting after Align endpoint

  • after align, there should be an option for more than one param setting value for each variable
  • store all associated param settings as a list (strings are OK for now)

buggy with certain sentences

With an RMSE of 22.8%, drastic discrepancies were found in the comparison of Ref-ET ETo and ETpm from DSSAT-CSM version 4.5 for Arizona conditions (fig. 1a).

Processing sentence : With an RMSE of 22.8%, drastic discrepancies were found in the comparison of Ref-ET ETo and ETpm from DSSAT-CSM version 4.5 for Arizona conditions (fig. 1a).
DOC : org.clulab.processors.corenlp.CoreNLPDocument@6992ebed
[error] application -

! @7am4d9lji - Internal server error, for (GET) [/parseSentence?sent=With+an+RMSE+of+22.8%25%2C+drastic+discrepancies+were+found+in+the+comparison+of+Ref-ET+ETo+and+ETpm+from+DSSAT-CSM+version+4.5+for+Arizona+conditions+(fig.+1a).&showEverything=true] ->

play.api.http.HttpErrorHandlerExceptions$$anon$1: Execution exception[[NoSuchElementException: key not found: type]]
	at play.api.http.HttpErrorHandlerExceptions$.throwableToUsefulException(HttpErrorHandler.scala:255)
	at play.api.http.DefaultHttpErrorHandler.onServerError(HttpErrorHandler.scala:182)
	at play.core.server.AkkaHttpServer$$anonfun$$nestedInanonfun$executeHandler$1$1.applyOrElse(AkkaHttpServer.scala:251)
	at play.core.server.AkkaHttpServer$$anonfun$$nestedInanonfun$executeHandler$1$1.applyOrElse(AkkaHttpServer.scala:250)
	at scala.concurrent.Future.$anonfun$recoverWith$1(Future.scala:414)
	at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:37)
	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
	at play.api.libs.streams.Execution$trampoline$.execute(Execution.scala:70)
	at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68)
	at scala.concurrent.impl.Promise$KeptPromise$Kept.onComplete(Promise.scala:368)
Caused by: java.util.NoSuchElementException: key not found: type
	at scala.collection.MapLike.default(MapLike.scala:232)
	at scala.collection.MapLike.default$(MapLike.scala:231)
	at scala.collection.AbstractMap.default(Map.scala:59)
	at scala.collection.MapLike.apply(MapLike.scala:141)
	at scala.collection.MapLike.apply$(MapLike.scala:140)
	at scala.collection.AbstractMap.apply(Map.scala:59)
	at ujson.Value$Selector$StringSelector.apply(Value.scala:97)
	at ujson.Value.apply(Value.scala:62)
	at ujson.Value.apply$(Value.scala:62)
	at ujson.Obj.apply(Value.scala:190)

Adding handling of break and continue in loops and conditionals

Adding this will require keeping track of operation order in at least the base CAST. We (Tito, Ryan, Janalee, Clay) have previously discussed doing this by adding an order number to each CAST node; when a break or continue is encountered, the order number of the parent can then be used to resolve where control should return (see the sketch below).
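A minimal sketch of that idea, assuming nothing about the actual CAST classes (the class and field names below are illustrative):

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class CastNode:
        order: int                                 # position within the parent body
        parent: Optional["CastNode"] = None
        children: List["CastNode"] = field(default_factory=list)

    def resume_order(break_or_continue: CastNode) -> int:
        # a break/continue uses its enclosing parent's order number to decide
        # where control resumes: here, the statement right after the parent
        return break_or_continue.parent.order + 1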

Implement translation from GrFN3 to CAG

Overview

Now that we are switching over to GrFN3 we need to revamp the grfn2cag translation pipeline. @aswinchester this would be a great task for you to get integrated into the core of the AutoMATES repository. Please reach out to @dpdicken and @titomeister for assistance as you accomplish this task. They will be more than happy to pair-program with you and help you along the way!

Everyone

  • This issue pairs with issue #256; this issue needs to be completed before that one can be completed.
  • All tasks for this issue should be carried out on the grfn2gromet branch or a child branch from that branch. DO NOT attempt to do this on master because the setup tests and objects only exist on the grfn2gromet branch.

Useful pointers

  • automates/model_assembly/networks.py: this file contains the class definitions for the GroundedFunctionNetwork and the CausalAnalysisGraph classes. This file should be the one location where implementations need to be made.
  • scripts/model_assembly/py2grfn.py: this is a script that will run our whole pipeline to translate Python source code into a GrFN/CAG. Read the flag definitions carefully. You will need to supply the correct flags to actually run CAG generation and to test whether the CAG is equivalent to one saved/loaded from JSON.
  • tests/data/program_analysis/language_tests/python/<idiom-name>/<test-name>/<test-name>.py: this is a collection of Python code examples that can be used to see if we can generate CAGs for certain programming idioms that we may see in scientific model source code.
  • tests/model_assembly/test_grfn2cag.py: this is a file where tests exist that will test whether the methods that will be defined as part of this issue are working or not. Use this file to determine if your implementations are successful or not after you can successfully run the py2grfn.py script.
  • tests/conftest.py: a helper file for the tests mentioned above that defines pytest fixtures used during the testing process.

Tasks

  • Implement the methods defined in both the CausalAnalysisGraph and CAGContainer classes marked with TODO: @Alex ....
    • CausalAnalysisGraph.from_GrFN(): this method creates a CausalAnalysisGraph from a GroundedFunctionNetwork (a minimal sketch is included after this task list)
    • CausalAnalysisGraph.from_json_file(): this method creates a CausalAnalysisGraph from a CAG.json file
    • CAGContainer.from_func_node(): this method creates a CAGContainer from a GrFN BaseConFuncNode. Containers in a CAG play the role of determining variable node subgraph membership based upon the functions found in the GrFN parent. Ask @cl4yton if you need guidance on what a CAGContainer should look like.
    • CAGContainer.from_dict(): this method creates a CAGContainer from a dictionary of data. This dict is derived from a re-loaded CAG.json.
  • Add a to_igraph_gml() method in the CausalAnalysisGraph class that allows the CAG to be converted to an igraph object that can be used with the identifiability algorithm. An example of this for the old CAG class is in the same file (commented out and below the new class definition).
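A minimal sketch of the from_GrFN idea, assuming only that the GrFN exposes its hyper edges with input and output variable nodes; the attribute names here are illustrative, not the actual networks.py API:

    import networkx as nx

    def cag_from_grfn(grfn) -> nx.DiGraph:
        # Collapse a GrFN into a variable-only causal graph.
        cag = nx.DiGraph()
        for edge in grfn.hyper_edges:          # hypothetical: inputs -> func -> outputs
            for src in edge.inputs:            # input variable nodes
                for dst in edge.outputs:       # output variable nodes
                    cag.add_edge(src.identifier, dst.identifier)
        return cag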

Helpful tips

  • When making the from_json_file and from_dict methods take a look at the already implemented to_json and to_dict methods because those reveal what information is expected to be present in the CAG.json.

[PA]: Issue with GrFN generation while assigning in conditional block

This was found while writing the if_else_statement unit test in #275. In the C file for that test, notice that variable a is assigned in the else branch, but it does not appear as the output of an assignment node in the GrFN. Furthermore, since a was assigned, it should come out of the interface block, but it does not. The CAST correctly has the assignment of a in the else branch of the first conditional.

Excerpt of relevant code:

    if (a < b) {
        x = b;
        b = a;
    }
    // else should be taken
    else {
        x = a;
        a = b;
    }

Here are the CAST and GrFN pdfs created from the C file:
if_else_statement--CAST.pdf
if_else_statement--GrFN.pdf

Prepare paired samples of GrFN2 and GrFN3 JSON for GE

Overview

This issue is much simpler. We should prepare sets of JSONs for GE for the models they have been working with to assist them in making the transition from using GrFN2 + ExprTree JSONs to using GrFN3 JSON.

Note: all references in this issue are for the grfn2gromet branch. I'm not sure whether the references are correct on master.

Tasks

  • Prepare the following three JSON objects for any model of interest to GE:
    • model_name--GrFN.json
    • model_name--GrFN3.json
    • model_name--expr-trees.json

Helpful tips

The code for transforming lambda expressions into expr-trees exists both as a standalone script and as a function in automates/model_assembly/networks.py. The same core library is used, and the calls are made for each expression function in the GrFN, whether we are talking about GrFN 2.0 or 3.0.

  • The script where the expr-trees are formed is located at: scripts/model_assembly/expression_walker.py
  • The method in GrFN3 generation that performs the same function for each ExpressionFuncNode is located at: automates/model_assembly/networks.py::BaseFuncNode::expression_to_expr_nodes()

If necessary, it should be possible to use the second function to generate the information needed to form the expr-tree JSON during GrFN3 translation. I'm not sure if that will make things easier, but I wanted you both to know about that as an option.

[GCC2CAST] Handling of compound conditionals

In the GCC/GIMPLE to CAST pipeline, we see that compound conditionals (e.g., (cond1 && cond2)) get split into sequentially handled conditions. These need to be recognized as compound; otherwise the CAST that gets generated will look like two sequential if-statements (illustrated below).
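Illustration of the semantic difference, shown in Python for brevity (the pipeline itself works on C/GIMPLE):

    cond1, cond2 = False, True

    # original compound conditional
    if cond1 and cond2:
        print("body")

    # a naive "two sequential if-statements" reconstruction changes the meaning:
    if cond1:
        pass
    if cond2:            # this test runs even when cond1 is false
        print("body")    # so "body" executes here, unlike in the original

    # the short-circuit split corresponds to nesting, which preserves the meaning:
    if cond1:
        if cond2:
            print("body")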

Finish GrFN3 test suite generation

Overview

The collection of tasks defined below should be all that is needed to prepare the tests that will determine whether the grfn2gromet branch is ready to merge into master. All tasks should be carried out on the grfn2gromet branch or a branch off that branch.

Useful pointers

  • Python code examples: tests/data/program_analysis/language_tests/python/<idiom-name>/<test-name>/<test-name>.py
  • PyTest fixtures location: tests/conftest.py
  • PyTest mark definitions: ./pytest.ini
  • Full pipeline script: scripts/model_assembly/py2grfn.py

Tasks

  • Write a script that will auto-generate the test stubs for tests/program_analysis/test_python2cag.py, tests/program_analysis/test_cag2air.py, tests/model_assembly/test_air2grfn.py, and tests/model_assembly/test_grfn2cag.py for all of the Python code examples according to the three example stubs shown
  • Create or refactor tests/model_assembly/test_grfn_execution.py to utilize all of the Python code examples
    • Implement the GrFN3 execution framework
    • Run the Python code examples with sample input/output. Save those examples to a JSON file that you can load during testing so that you can verify that the GrFN execution is producing the appropriate output for given input.
  • Run the full pipeline script for each example in the Python code examples with all of the necessary flags to produce the JSON file output for CAST, AIR, GrFN, GrFN dotfiles, and CAG.
    • This may require changes or corrections to be made in the pipeline that generates these items
    • Manually inspect the output to ensure that the output is correct
  • Remove old tests that are no longer necessary for the new pipeline, add any other desired test cases, and mark the sensitivity analysis test cases with pytest.mark.skip so that we keep a record of them but they will not show up as failing for this PR.

Text reading improvement

This will list outstanding text reading grammar issues:

  • Handling conjoined nouns, e.g. "Here r and rP are related to the infection rate of disease A and B respectively, while a, aP and b are the removal rate of individuals in class I, IP and E respectively."
  • Only R is found, although the compound R(t) should be found: "R(t) is the cumulative number of patients admitted to hospitals"

Update GrFN extraction from AIR container types

Overview

@pratikbhd and @hlim1 have finished getting the new container types loop, if-block, and select-block ready to be used by the GrFN extraction module. We now need to refactor networks.py::process_container so that processing is done based on the type of the container.

Example error message from processing PETPT

[screenshot of the error message]

Updating setup wiki with python deps

Check if there are any (non-standard) python dependencies that have to be installed prior to running the code in the repo.
https://github.com/ml4ai/automates/wiki/Setup

@pauldhein, I think we could be the main python users, so this is probably on us. I will check the TR-specific python scripts. Let me know if there is anything you'd like me to add from your side of things.

Also, let me know if you had to go through any other extra steps to set up TR and alignment, and I can add that to the wiki page.

Adding the notion of <ANY> for declared but not assigned values.

In the GCC2CAST pipeline, any declared variables that aren't assigned a value, like
int x;
currently have their default values set to -1. We would like to change that so that it's an <ANY> value of sorts. This way it's clearer that it's an actual declaration but not an assignment.
This will most likely be done at the CAST level by leveraging the LiteralValue node.

PA handling side-effects

We have done some limited handling of side effects on the GCC side. E.g., there should be some handling (within the annCAST passes) of identifying globals being assigned, and of capturing this somewhere in the annCAST (needs review).

But we do not currently have a general approach to handling side-effects / wiring.

For example, for the use of side-effecting functions or the Python assignment-expression Walrus Operator:

z = -5
b = -10
a = -5

print(a)

def temp_a(b):
	global a
	a = max(2, b)
	return a

if ( z - 20 < a or not(z < temp_a(b)) ) and ...:
	print("if_body", z, a)
else:
	print("else_body", z, a)

With the Walrus Operator:

z = -5
b = -10
a = -5

print(a)

if z - 20 < a or not(z < (a := max(2, b))):
	print("if_body", z, a)
else:
	print("else_body", z, a)

These cases need to be handled.

It seems that the general solution requires passing side-effected (updated) variables as part of expression return values, which will likely require the use of packing/unpacking; a sketch follows below.
(GroMEt FNs aim to be monadic, so we need to massage side effects into a monadic framework.)
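A rough sketch of that pack/unpack idea applied to the first example (short-circuit evaluation is ignored here for simplicity, and the rewritten names/structure are illustrative):

    def temp_a(b, a):
        # the global write is replaced by passing 'a' in and returning it updated
        a = max(2, b)
        return a, a          # (call result, side-effected a) packed as a tuple

    z, b, a = -5, -10, -5
    print(a)

    result, a = temp_a(b, a)             # the caller unpacks the updated variable
    if z - 20 < a or not (z < result):
        print("if_body", z, a)
    else:
        print("else_body", z, a)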

Renaming val to name in CAST Var node

To alleviate some confusion in the CAST generation pipeline, we propose renaming the attribute 'val' to 'name' so that it is not confused with the new 'default_val' attribute. We will make this change sometime after Milestone 5 and the demo in order to prevent bug ripple effects in CAST generation.

filtering out units that look like variables

d-1 and h-1 are extracted as variables in the example below, but they shouldn't be (a possible filter is sketched after the example).

ETsz = standardized reference crop evapotranspiration for short (ETos) or tall (ETrs) surfaces (mm d-1 for daily time steps or mm h-1 for hourly time steps),
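One possible (purely illustrative) filter: treat tokens that match a unit pattern as units rather than variable candidates. The unit list and pattern below are just a sketch:

    import re

    # tokens like "d-1" or "h-1" should not become variable candidates
    UNIT_LIKE = re.compile(r"^(mm|cm|m|kg|g|J|d|h|s)(-?\d+)?$")

    for token in ["d-1", "h-1", "ETsz"]:
        print(token, "unit-like" if UNIT_LIKE.match(token) else "keep as variable")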

Wiki grounding wish list / debug

  • return hierarchy info (e.g., subclass of)
  • return groundings over threshold (what to base the threshold on? longest term with wiki concept edit distance?)
  • tests for the grounder ranker
  • cache grounding results to avoid re-grounding
  • writing a query that returns the closest string match, e.g., time for time (are queries deterministic?)
  • document in repo wiki
  • exclude verbs from terms?
  • pass number of queries for Wikidata to return from config
  • pass number of groundings to return per global var from config

webapp: app error when submit blank form

[Not high priority] In Chrome, the web app returns an error when an empty form is submitted. Perhaps just have this be a no-op -- i.e., do nothing and just present a new empty form.

Improve display of Intervals

With the webapp, we aren't displaying Intervals well:
image

Found Entities:

List(ValueAndUnit, Value, Measurement, Entity) => 10 - 20 cm 
     ------------------------------ 
     Rule => GrobidEntityFinder 
     Type => RelationMention 
     ------------------------------ 
     value (Value, Measurement, Entity) => 10 
     unit (Unit, Measurement, Entity) => cm 

     ------------------------------ 

List(Interval, Measurement, Entity) => 10 - 20 cm 
     ------------------------------ 
     Rule => GrobidEntityFinder 
     Type => RelationMention 
     ------------------------------ 
     most (ValueAndUnit, Value, Measurement, Entity) => 20 cm 
     least (ValueAndUnit, Value, Measurement, Entity) => 10 - 20 cm 

     ------------------------------ 

List(Value, Measurement, Entity) => 10 
     ------------------------------ 
     Rule => GrobidEntityFinder 
     Type => TextBoundMention 
     ------------------------------ 
     Value, Measurement, Entity => 10 
     ------------------------------ 

List(Value, Measurement, Entity) => 20 
     ------------------------------ 
     Rule => GrobidEntityFinder 
     Type => TextBoundMention 
     ------------------------------ 
     Value, Measurement, Entity => 20 
     ------------------------------ 

List(Unit, Measurement, Entity) => cm 
     ------------------------------ 
     Rule => GrobidEntityFinder 
     Type => TextBoundMention 
     ------------------------------ 
     Unit, Measurement, Entity => cm 
     ------------------------------ 

Add/Import derived types to python lambdas

Overview

Currently, generation of the lambdas for Mini-PET causes the following error to occur:
[screenshot of the error]
This seems like a simple bug that should be solved by adding the proper import line so that controltype and switchtype are visible in this lambdas file.

NOTE: This issue pertains to the dssat_pet branch in Delphi but we are documenting it here for task tracking purposes.

Finish Implementation of TeX2Py

Overview

@marcovzla I created a very rough and undocumented script to run SymPy's LaTeX parsing pipeline. It is located at scripts/equation_reading/tex2py.py.

What we need to do now is to modify this script to turn it into a callable library routine that takes a tokenized LaTeX equation string as input and outputs a string representation of the equivalent python code.

Once we have that, I can use Python's ast module to turn the mathematical expression code into a parse tree that I will align to a lambda expression extracted from source code.

Some Important Notes

  • This implementation expects chunked, tokenized LaTeX (i.e., the chunking portion needs to have already been done). @marcovzla does this make sense?

TeX2Py High-level Algorithm

At a very high-level, here is what the TeX2Py algorithm tries to accomplish:

(0) Given a string of tokenized LaTeX (call it T)
(1) Split T on = into a left-hand side (LHS) and a right-hand side (RHS)
    (1a) Set aside the LHS, we will return that as-is for now
    (1b) if there are multiple =
        (1bi) count every expression other than the LHS as an RHS
(2) Remove common LaTeX formatting tokens (e.g. ~, \left, \mathrm, etc)
(3) Create a LaTeX variable to simple variable map (call it V)
    (3a) simple variables will be a single letter
    (3b) LaTeX variables can use _{}, _, ^, ^{}, _{}^{}, ^{}_{} in their definition
    (3c) Convert to pythonic form by replacing _{} with _ and ^{} with __
    (3d) Create map of pythonic vars to single-letter vars
(4) Replace all variables in RHS with one-letter vars in V
(5) Perform the translation to python with sympy.parsing.latex.parse_latex
(6) Replace one letter vars with the pythonic vars from V
(7) Return the results
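A bare-bones sketch of steps (1), (2), and (5)-(7) using SymPy's LaTeX parser; the variable-mapping steps (3)-(4) are omitted, and the function name is illustrative:

    from sympy.parsing.latex import parse_latex   # requires antlr4-python3-runtime

    def tex2py(tex: str) -> str:
        lhs, rhs = tex.split("=", 1)               # (1) split on '=' (1b omitted)
        for tok in (r"\left", r"\right", "~"):     # (2) drop common formatting tokens
            rhs = rhs.replace(tok, "")
        expr = parse_latex(rhs)                    # (5) SymPy's LaTeX parsing
        return f"{lhs.strip()} = {expr}"           # (6)-(7) reattach the LHS

    print(tex2py(r"y = \frac{a b}{c}"))            # -> "y = a*b/c"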

@marcovzla if you have any questions about the above algorithm or any ambiguity associated with this task please @ me in this issue.

losing token interval when copying

The mention copy may not be updating the new token interval here (it may be possible to combine this with the previous several lines of unused code that construct new mentions instead of copying---those lines are deleted in the Alice_Functions2 branch):

if (descrAttachments(i).toUJson("charOffsets").arr.length > 1) {

Also, make sure discontinuous char offset attachments are handled properly---if we use indices to get to them, we need to be adding something to the sequence even if there is no attachment.

Output not right in webapp

The output marked with red in the webapp is not correct.

image

The output in the terminal while running the webapp is correct, eg this:

Processing sentence : Under full irrigation, Kcbmax with the ETo-Kcb method had little influence on maize and cotton yield for 0.9 < Kcbmax < 1.15, but simulated yield decreased rapidly for Kcbmax > 1.15 (fig. 6a). 
DOC : org.clulab.processors.corenlp.CoreNLPDocument@12c55fc2
Done extracting the mentions ... 
They are : 1.15, but simulated yield decreased rapidly for Kcbmax > 1.15,       > 1.15 (fig,    1.15,   0.9,    1.15,   Kcbmax, full irrigation,        simulated yield,        0.9 < Kcbmax < 1.15,    little influence,       > 1.15, ETo-Kcb method, cotton yield,  Kcbmax, maize,  fig,    fig,    6a
Sentence returned from processPlaySentence : Under full irrigation , Kcbmax with the ETo-Kcb method had little influence on maize and cotton yield for 0.9 < Kcbmax < 1.15 , but simulated yield decreased rapidly for Kcbmax > 1.15 ( fig .
Found mentions (in mkJson):
List(Concept, Entity) => 6a 
         ------------------------------ 
         Rule => simple-np 
         Type => TextBoundMention 
         ------------------------------ 
         Concept, Entity => 6a 
         ------------------------------ 
 

List(Concept, Entity) => full irrigation 
         ------------------------------ 
         Rule => simple-np 
         Type => TextBoundMention 
         ------------------------------ 
         Concept, Entity => full irrigation 
         ------------------------------ 
 

List(Concept, Entity) => Kcbmax 
         ------------------------------ 
         Rule => simple-np 
         Type => TextBoundMention 
         ------------------------------ 
         Concept, Entity => Kcbmax 
         ------------------------------ 

I'm trying to fix this myself, but I haven't succeeded so far.

We are not handling listc

This sentence:
The height of the chair was 100-150 cm and it was between 2700.5 and 3000 mm in length.

stack trace:

Exception in thread "main" java.lang.RuntimeException: unsupported measurement type 'listc'
	at org.clulab.aske.automates.quantities.GrobidQuantitiesClient.mkMeasurement(GrobidQuantitiesClient.scala:30)
	at org.clulab.aske.automates.quantities.GrobidQuantitiesClient.$anonfun$getMeasurements$1(GrobidQuantitiesClient.scala:24)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:234)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:52)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at scala.collection.TraversableLike.map(TraversableLike.scala:234)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:227)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at org.clulab.aske.automates.quantities.GrobidQuantitiesClient.getMeasurements(GrobidQuantitiesClient.scala:24)
	at org.clulab.aske.automates.entities.GrobidEntityFinder.extract(GrobidEntityFinder.scala:19)
	at org.clulab.aske.automates.entities.TestStuff$.main(GrobidEntityFinder.scala:156)
	at org.clulab.aske.automates.entities.TestStuff.main(GrobidEntityFinder.scala)

Process finished with exit code 1

anything else we're not handling????? (assuredly)

expansion expects a TextBoundMention

Do we need to modify it to handle other mention types? Or should we not expand?

with text: For this research Tx = 10 mm d−1, Lsc = −1100 J kg−1 and Lpwp = −2000 J kg−1.
(variable test t9b)

[info]   java.lang.ClassCastException: org.clulab.odin.RelationMention cannot be cast to org.clulab.odin.TextBoundMention
[info]   at org.clulab.aske.automates.actions.ExpansionHandler.expand(ExpansionHandler.scala:136)
[info]   at org.clulab.aske.automates.actions.ExpansionHandler.expandIfNotAvoid(ExpansionHandler.scala:95)
[info]   at org.clulab.aske.automates.actions.ExpansionHandler.$anonfun$expandArgs$4(ExpansionHandler.scala:67)
[info]   at org.clulab.aske.automates.actions.ExpansionHandler.$anonfun$expandArgs$4$adapted(ExpansionHandler.scala:66)
[info]   at scala.collection.Iterator.foreach(Iterator.scala:929)
[info]   at scala.collection.Iterator.foreach$(Iterator.scala:929)
[info]   at scala.collection.AbstractIterator.foreach(Iterator.scala:1417)
[info]   at scala.collection.IterableLike.foreach(IterableLike.scala:71)
[info]   at scala.collection.IterableLike.foreach$(IterableLike.scala:70)
[info]   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)

need a separate preprocessor for webapp

PassThroughPreprocessor still does some cleaning (e.g., doesn't allow sequences that are less than 60% letters excluding spaces), so the webapp fails on text like this: then: dI (0) > 0.

Combining COSMOS blocks makes mention location information less specific

In PR #263, we combine COSMOS blocks to make sure paragraphs are not split up (that happens at the end of a column in two-column papers and at the end of pages). When we combine blocks, the location of extracted mentions becomes less specific---instead of saying Mention 1 comes from p. 1 block 1, we are saying Mention 1 comes from p. 1 blocks 1-2, and the mention can be located either in block 1, in block 2, or split between the two blocks. Keeping track of the length of each block in characters, together with the character offset of the extraction within the combined block content, can help narrow this down (a small sketch follows).
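A small sketch of that bookkeeping: given the character lengths of the combined blocks and the mention's start offset in the combined text, recover which block it came from (names are illustrative):

    from itertools import accumulate

    def locate_block(block_lengths, mention_start):
        # return the index of the block containing a character offset in the combined text
        for i, end in enumerate(accumulate(block_lengths)):
            if mention_start < end:
                return i
        raise ValueError("offset beyond combined text")

    # a mention starting at offset 150 in a 120-char block followed by a 300-char block
    print(locate_block([120, 300], 150))   # -> 1 (the second block)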

Note: Currently, COSMOS combines some paragraphs into longer blocks. This needs to be discussed with UW.

debug jsonDoc_to_mentions

The endpoint described here (https://github.com/ml4ai/automates/wiki/Text-Reading#mention-extraction-) results in some mentions with null paths, which can't be handled by the align endpoint (JNull error). If the input is supposed to be the processors document, the wiki will need to be updated with an extra step, but it is probably more convenient for users not to have extra steps, so just make sure the endpoint takes a Science Parse doc and produces mentions of the right format. (Also, check whether pdf_to_mentions produces mentions that can be handled by align.)

Refactoring of GroMEt Generation Visitors.

At this point the GroMEt generation has grown extensively, and with that some of its visitors could use refactoring.
Some visitors were initially written in a less-than-general manner, and as new language features get added these visitors become harder to maintain. I think a rewrite of some pieces would help alleviate the issue and make the implementations more general.

The current pieces that could use a rewrite, in order of priority:

  • AnnCAST Call Node: adding primitives and attributes into the mix has made this visitor difficult to understand and maintain.
  • AnnCAST Assignment Node: this visitor is handled on a case-by-case basis. While it doesn't need a serious rewrite, it could definitely use some looking over, as it has been added to a lot.
  • AnnCAST Attribute Node: this visitor is relatively new and doesn't need a rewrite, but it needs more expansion to better fit how it is used by other visitors.

MathJax: Math Processing Error

Encountering a Math Processing Error while running mathjax_mml_conversion.py (https://github.com/ml4ai/automates/blob/gauravs_automates/scripts/equation_reading/mathjax/arxiv_eqn_extraction/mathjax_mml_converter.py):

"Math Processing Error: Maximum call stack size exceeded"

It puts the server on hold: the server stops responding without being killed. Is there a way to make the MathJax node process kill itself? I have tried providing a timeout in requests.post(), but nothing worked.

Some related post that I have found that might be helpful:
https://phabricator.wikimedia.org/T120959
https://moodle.org/mod/forum/discuss.php?d=318488

TR tests needed

  • attachments
  • pairwise alignment of descriptions from texts and comments
  • edit param setting interval test to check value least and value most properly
  • mention serialization

Unicode char issue

Text: where locations are indexed by i, observational periods are indexed by t, b is the parameter of interest, and ∈ is the error.
Webapp:
[screenshot of the webapp output]
The symbol is not processed right:
[screenshot of the mis-processed symbol]

Updated TR alignment should

  • include alignment on bigrams
  • be case-sensitive for identifiers
  • have thresholds for when to return links, e.g., for gvar -> param setting through identifier (for the same paper), the threshold has to be very high
  • have indication of which variables are likely related, e.g., I, I(0), I(t), etc.
  • handle equations that contain mainly text, e.g., PPE ~ use ~ of ~ an ~ item ~ on ~ a ~ particular ~ day = \frac{(\# ~ of ~ patients ~ that ~ day) (\# ~ of ~ daily ~ contacts ~ per ~ patient)}{\# ~ of ~ patient ~ contacts ~ before ~ discarding ~ item}
  • have the option to align text to src directly if comments are not available

TR alignment refactoring

Overview

This issue addresses two requests for the interface between the TR-endpoints and the model analysis pipeline. The two requests are:

  1. Incorporating changes to the GrFN JSON into the delivery of variables and source code comments to the TR alignment endpoints
  2. Refactor /align into new methods where each method focuses on creating a single link type

Along with merging the changes necessary to resolve this issue to master, we should also document the choices made in the AutoMATES Github wiki. That will require a new section that we will call Model Assembly ... @pauldhein will work on that.

All changes implemented for this issue should be done in PR #121

Incorporating changes to GrFN JSON

This section is here just to disambiguate some terms. Previously, the PA and MA teams had been referring to the output from the PA pipeline as GrFN JSON, and previously (as well as currently) we had been sending a path to that JSON file to the /align endpoint. @BeckySharp and/or @maxaalexeeva correct me if I am wrong, but I believe the only fields needed by the /align endpoint from the old GrFN JSON are the variables and source_comments. I propose that I deliver these fields to you in a JSON file without any of the other old or new GrFN components.

Does that work for you both? If so, then you won't have to worry about any further changes to the PA/MA GrFN JSON (or its new cousin, the AIR JSON).
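A strawman of that stripped-down file, purely for illustration (the shape and contents of the fields below are made up, not the actual GrFN JSON schema):

    import json

    payload = {
        "variables": ["PETPT::TMAX", "PETPT::SRAD"],            # made-up identifiers
        "source_comments": {
            "petpt": ["C  TMAX = maximum daily temperature"],   # made-up comment line
        },
    }
    with open("align_input.json", "w") as f:
        json.dump(payload, f, indent=2)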

Proposed new alignment endpoints

Currently we have the following endpoint defined for aligning all sources:

/**
  * Align mentions from text, code, comment. Expected fields in the json obj passed in:
  *  'mentions' : file path to Odin serialized mentions
  *  'equations': path to the decoded equations
  *  'grfn'     : path to the grfn file, already expected to have comments and vars
  * @return decorated grfn with link elems and link hypotheses
  */
def align: Action[AnyContent] = Action { request =>
    ...
}

I would like to change this so that we have separate endpoints for each of the dashed links shown in our overall link diagram:
[screenshot of the link diagram]

Let's review these links:

  • tokenized equation <--> assignment statement: The TR team doesn't need to worry about this one.
  • ontology concept <--> text definition: I will address this link in a separate GitHub issue
  • source code variable <--> source comment variable: what are we doing for this link right now? Is it just a string edit distance? Would it be easier for us to make this link on the python side of AutoMATES?
  • equation variable <--> text variable: This seems like a good candidate for a new endpoint. Perhaps we can name it /alignEquationsAndText?
  • variable docstring <--> text definition: Can we have this new endpoint be /alignDocstringsAndText?

For each of the new endpoints, I'd like to pass in to Scala a single path to a single JSON file that holds an object that has all the fields necessary for that endpoint. Does that sound reasonable? Perhaps we can plan what fields are needed for which endpoints in this issue?

Access to Google Drive

Hello there, first of all, congratulations on the project, I found it very interesting !

However, I could not test/reproduce it because it requires access to Google Drive files such as "ASKE-AutoMATES/Data/equation_decoding/arxiv2018-downsample-aug-model_step_80000.pt". Could you please add the GDrive path to the documentation?

Many thanks
Murilo
