
simplenumericalfactchecker's People

Contributors: andreasvlachos, dhruvghulati


simplenumericalfactchecker's Issues

Linkage between KB values and actual EV pairs found in HTML JSONs and where in code found

Actual (property, region, value) triple: is this obtained from the knowledge base for that sentence, or from the stored values we stripped out of the sentence itself (region, value, property)?

If from the KB (the former), am I correct in saying that, since we obtained the sentence via a specific Bing web search such as "Population, Germany", we can query the knowledge base with that property and region to obtain the third element, the value?
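If the KB really is queried with the search's property and region, a minimal sketch of that lookup (assuming a nested dict layout, which is my guess, not necessarily the repo's actual structure) would be:

```python
# Hypothetical KB layout: property -> region -> value.
kb = {
    "population": {"Germany": 80.6e6, "Ecuador": 12.9e6},
}

def lookup(kb, prop, region):
    """Return the KB value for (property, region), or None if absent."""
    return kb.get(prop, {}).get(region)

value = lookup(kb, "population", "Germany")  # the "third thing"
```

Given the property and region from the Bing query, the value falls out of the lookup.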

Obtaining dependency paths for full sentence (or between NUMBER_SLOT and LOCATION_SLOT).

I was running buildMatrix.py and printing out what your functions return for each set of parsed HTML JSON files and each dict of tokens/dependencies (the dependencies are always empty arrays in the JSON files).

For each dict, my print statements give:

Dependency path is 
Sentence is  Ecuador 's population is estimated at 12.9 million in 2004 ,
Shortest paths are [[0, 2, 4, 7]]
Path strings are [u'LOCATION_SLOT~-poss+population~-nsubjpass+estimated~prep_at+NUMBER_SLOT', u'LOCATION_SLOT~-poss+*extend*is~auxpass+population~-nsubjpass+estimated~prep_at+NUMBER_SLOT', u'*extend*DATE~prep_in+LOCATION_SLOT~-poss+population~-nsubjpass+estimated~prep_at+NUMBER_SLOT']
Surface pattern token sequences [['LOCATION_SLOT', u'"\'s"', u'"population"', u'"is"', u'"estimated"', u'"at"', 'NUMBER_SLOT'], ['LOCATION_SLOT', u'"\'s"', u'"population"', u'"is"', u'"estimated"', u'"at"', 'NUMBER_SLOT', u'"in"'], ['LOCATION_SLOT', u'"\'s"', u'"population"', u'"is"', u'"estimated"', u'"at"', 'NUMBER_SLOT', u'"in"', u'"DATE"']]

What I am trying to do is for each sentence, create bi-grams of the paths between the LOCATION_SLOT and NUMBER_SLOT as additional features of the sentence (beyond the bag of words).

(screenshot: 2016-08-03 at 20:42:52)

So the example Sebastian gave (see screenshot) would turn "Germany's population was 80 million" into the dependency-relation bigrams (<-compound, ->nsubj) (->nsubj, ->nmod:poss) and vectorize the whole thing.

How would I do the above given your code?
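For what it's worth, a minimal sketch of the bigram features I have in mind, assuming the printed path strings are split on '+' into hops and on '~' into token/relation (my reading of the output above, not necessarily how your code stores them):

```python
def relation_bigrams(path):
    """Each hop in a slot-to-slot path looks like "token~relation", joined
    by '+'. Keep the relation part of each hop and pair consecutive
    relations, analogous to the (<-compound, ->nsubj) bigrams in the
    screenshot. (The '+'/'~' split is an assumption from the printed paths.)"""
    rels = [hop.split("~", 1)[1] for hop in path.split("+") if "~" in hop]
    return list(zip(rels, rels[1:]))

p = "LOCATION_SLOT~-poss+population~-nsubjpass+estimated~prep_at+NUMBER_SLOT"
bigram_features = ["(%s, %s)" % ab for ab in relation_bigrams(p)]
```

These strings could then be appended to the bag-of-words features before vectorizing.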

locationTokenIDs and numberTokenIDs being duplicated in certain cases

When creating a new version of my parsed sentences with NUMBER_SLOT and LOCATION_SLOT filled in, I see two main issues:

  1. Not dealing with millions and billions, e.g.:
    "parsedSentence": "Trade with LOCATION_SLOT has increased NUMBER_SLOT % since 2004 -LRB- TIFA -RRB- up t0 NUMBER_SLOT NUMBER_SLOT dollars from NUMBER_SLOT NUMBER_SLOT in 2004 .", 
    "sentence": "Trade with America has increased 1333 % since 2004 -LRB- TIFA -RRB- up t0 2.2 billion dollars from 150 million in 2004 ."
  2. Not collapsing multi-word locations to a single slot, e.g. Ivory Coast, United Kingdom, Spanish Netherlands:
 "parsedSentence": "The LOCATION_SLOT LOCATION_SLOT in 1600 had NUMBER_SLOT NUMBER_SLOT ; in 1650 , NUMBER_SLOT NUMBER_SLOT .", 
    "sentence": "The Spanish Netherlands in 1600 had 1.5 million ; in 1650 , 1.9 million ."

 "parsedSentence": "While LOCATION_SLOT produces and exports heavy crude , it imports NUMBER_SLOT NUMBER_SLOT CFA francs of light crude oil -LRB- which is suitable for its refinery -RRB- from LOCATION_SLOT , LOCATION_SLOT LOCATION_SLOT , LOCATION_SLOT LOCATION_SLOT , LOCATION_SLOT , and LOCATION_SLOT .", 
    "sentence": "While Cameroon produces and exports heavy crude , it imports 73.3 billion CFA francs of light crude oil -LRB- which is suitable for its refinery -RRB- from Nigeria , Equatorial Guinea , Ivory Coast , Angola , and Italy ."`

   "parsedSentence": "Migration from Saint Lucia is primarily to Anglophone countries , with the LOCATION_SLOT LOCATION_SLOT -LRB- see Saint Lucian British -RRB- having almost NUMBER_SLOT Saint Lucian-born citizens , and over NUMBER_SLOT of Saint Lucian heritage .", 
    "sentence": "Migration from Saint Lucia is primarily to Anglophone countries , with the United Kingdom -LRB- see Saint Lucian British -RRB- having almost 10,000 Saint Lucian-born citizens , and over 30,000 of Saint Lucian heritage ."

The claims identified and annotated manually are attached to our submission...

Have you done a qualitative evaluation of the high-MAPE sentences you found, just out of interest?

What is the significance of a high-MAPE sentence that has then been manually labelled as actually being a claim? Did you find a relationship between a sentence having high MAPE and whether it was actually labelled as a claim?

Is this a prediction of purely the value, given the pattern and region, rather than a prediction of the EV pair?

This line seems to only predict a value; will it allow me to obtain the predicted EV pairs per pattern? Are we predicting both region and value for a pattern, or just a value given a pattern and region?

In general, what are the inputs and outputs of the prediction? Are you stripping the actual property, region, and value out and looking only at what is in between, e.g. "The _ of _ is _", then predicting all three values from just the sentence pattern? Or are you predicting the value, given the property and region observed in the slots? In terms of the code, what did you use to make the entity, value pair prediction for a particular property?

Which of the outputs from buildMatrix is used to build theMatrixExtend120TokenFiltered_2_2_0.1_0.5_fixed2.json?

I reran buildMatrix.py. It writes mainMatrix.json (2.3 MB), mainMatrix.json_tmp (1.33 GB), mainMatrixTrain.json_sentences (7.94 GB), and mainMatrixTrain.json_sentences_tmp (7.72 GB). You then have theMatrixExtend120TokenFiltered_2_2_0.1_0.5_fixed2.json, which contains 59584 items, whereas mainMatrix.json, which I would have thought is the file this is built from, contains 16866 items.

Which of these four output files should be used? It doesn't make sense that the filtered JSON file contains more patterns than the unfiltered one.

Linking sentences to original token and dependency set representing sentence

I see from the final lines that I can link each observed pattern to a set of EV pairs and to the original sentence the pattern came from. However, how do I also link each observed pattern to the tokens and dependencies originally observed in each HTML JSON we obtained after step 6, and to the contents of theMatrixExtend120TokenFiltered_2_2_0.1_0.5_fixed2.json?

e.g. LOCATION_SLOT~-appos+*extend*putting~-prep_after+LOCATION~dep+PERCENT~number+NUMBER_SLOT could be linked to:

  • "According to the most recent available data from the World Bank , Zambia\u00c3 cents \u00c2 $ \u00c2 s population growth rate in 2012 was at 3.2 % putting it in eighth position worldwide after Oman -LRB- 9.1 % -RRB- , Qatar -LRB- 7.1 % -RRB- , South Sudan -LRB- 4.3 % -RRB- , Kuwait -LRB- 3.9 % -RRB- , Niger -LRB- 3.8 % -RRB- , Uganda -LRB- 3.4 % -RRB- and Eritrea -LRB- 3.3 % -RRB- .".....(6 instances of these sentences for each EV pair found in mainMatrix.json_sentences
  • "Eritrea": [ 9.1 ], "Kuwait": [ 9.1 ], "Niger": [ 9.1 ], "Qatar": [ 9.1 ], "South Sudan": [ 9.1 ], "Uganda": [ 9.1 ] , which are the EV pairs found in mainMatrix.json
  • What is missing is linking the above to the patterns and values in theMatrixExtend120TokenFiltered_2_2_0.1_0.5_fixed2.json. However, searching by hand, I could not find any patterns present in both mainMatrix.json_sentences/mainMatrix.json and theMatrixExtend120TokenFiltered_2_2_0.1_0.5_fixed2.json from which to then recover EV pairs. An example pattern from the latter file (obviously not one matching what we want) is "LOCATION_SLOT,\"MONEY\",\"MONEY\",\"-lrb-\",NUMBER_SLOT,\"people\",\"-rrb-\",\"NUMBER\"": { "Brazil": 191000000.0, "France": 65800000.0, "Italy": 60600000.0, "Mexico": 112000000.0 }.
  • The ideal would be to also retain the tokens and dependencies for the original sentence, e.g. "dependencies": [ { "dep": 0, "head": 1, "label": "nn" } ], "tokens": [ { "lemma": "World", "ner": "O", "pos": "NNP", "word": "World" }, { "lemma": "fact", "ner": "O", "pos": "NNS", "word": "Facts" } ], noting there may be multiple token/dependency sets comprising a single sentence.

In general, this is useful because if I can rebuild the original text from the HTML web page JSONs you originally generated, and also have the tokens and dependencies, I could extend the work to e.g. LSTMs, where I might need the sentences before and after a particular sentence.

So overall the goal would be: HTML JSON full text > filename of HTML web page JSON > sentences > tokens/dependencies > patterns > actual entity, region, value observed in text > predicted entity, region, value from the MAPE algorithm > Freebase entity, region, value.
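To make that chain concrete, here is a purely hypothetical provenance record (none of these keys exist in the current output files; the filename and field contents are made up for illustration) showing the linkage I am after:

```python
# Hypothetical provenance record sketching the chain above.
record = {
    "html_json_file": "page_0001.json",  # hypothetical filename
    "sentence": "Ecuador 's population is estimated at 12.9 million in 2004 ,",
    "tokens": [{"word": "Ecuador", "pos": "NNP", "ner": "LOCATION"}],
    "dependencies": [{"dep": 0, "head": 2, "label": "poss"}],
    "patterns": ["LOCATION_SLOT~-poss+population~-nsubjpass+estimated~prep_at+NUMBER_SLOT"],
    "ev_pairs": {"Ecuador": 12900000.0},
}

# Index records by pattern so a pattern lookup recovers the sentence,
# tokens/dependencies, and EV pairs in one step.
index_by_pattern = {}
for pat in record["patterns"]:
    index_by_pattern.setdefault(pat, []).append(record)
```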

Why are there multiple entries for the same sentence and the same MAPE_support_scaling?

Looking at the population_claims CSV, I see multiple entries for some sentences: e.g. the sentence <location>Morocco</location> has a population of over <number>33 million</number> and an area of 446,550 km2 -LRB- 172,410 sq mi -RRB- . has 15 duplicate entries, and the overall file has 1583 entries. Is there a reason there are multiple rows for the same sentence and MAPE scaling parameter?
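In case the duplicates are byte-identical, a small order-preserving dedupe over the CSV rows would confirm it and collapse them:

```python
def dedupe_rows(rows):
    """Drop exact duplicate rows while preserving order: a quick way to
    check whether the 15 copies of the Morocco sentence are identical,
    and to collapse them if so. Works on rows from csv.reader."""
    seen, out = set(), []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out
```

If the row count drops from 1583 after deduping, the duplicates carry no extra information.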

Obtaining the real vs. predicted EV pairs for each pattern

It seems like this is where the real vs. predicted EV pairs can be obtained after training, after adding each pattern. Is this correct? My aim is to obtain, for each pattern, the final real vs. predicted EV pairs. It seems that when I use your adjusted MAPE code, you test each pattern with different params (e.g. 0.0625, 32) and then have EV predictions for each param and pattern set, plus a bestParams parameter across all of these predictions. How can I obtain the adjustedMAPE for each pattern using the bestParams you already arrived at?

e.g. when I uncomment these lines, I get:

real: {u'Canada': 52218.99, u'Afghanistan': 619.59, u'Madagascar': 447.44, u'Turkmenistan': 6510.61, u'Liberia': 421.7, u'Guinea': 591.02, u'Vanuatu': 3176.21, u'Cambodia': 945.99, u'Swaziland': 3043.5, u'Laos': 1399.21, u'Seychelles': 11758.04, u'Cameroon': 1151.36, u'Burkina Faso': 634.32, u'Ecuador': 5456.43, u'Bahrain': 18334.17, u'Brunei': 41126.61, u'Saudi Arabia': 20777.67, u'Belarus': 6685.02, u'Algeria': 5404.0, u'Slovenia': 22092.26, u'Venezuela': 12766.72, u'Zambia': 1469.12, u'Montenegro': 6813.04, u'Papua New Guinea': 2184.16, u'Togo': 574.12, u'Zimbabwe': 787.94, u'Kiribati': 1743.39, u'Haiti': 770.95, u'Belize': 4576.64, u'Hong Kong': 36795.82, u'C\xf4te d\u2019Ivoire': 1243.99, u'Ukraine': 3866.99, u'Mauritania': 1106.14, u'Tonga': 4493.87, u'Tanzania': 608.85, u'Israel': 31281.47, u'Mali': 693.98, u'Philippines': 2587.88, u'Sweden': 55244.65, u'Latvia': 14008.51, u'Gabon': 11430.49, u'Guyana': 3583.96, u'Thailand': 5473.75, u'Switzerland': 79052.34, u'Bulgaria': 6986.04, u'Iraq': 6454.62, u'Honduras': 2264.09, u'Macau': 78275.15, u'Chad': 885.11, u'United Arab Emirates': 40363.16, u'United Kingdom': 38514.46, u'Malaysia': 10380.54, u'Vietnam': 1595.81, u'Saint Vincent and the Grenadines': 6515.22, u'Uganda': 547.01, u'South Korea': 23020.0, u'Cyprus': 26315.47, u'Barbados': 13076.46} predicted: {u'Canada': 52147.75, u'Afghanistan': 0, u'Madagascar': 445.5, u'Turkmenistan': 6972.75, u'Mauritania': 0, u'Guinea': 571.75, u'Vanuatu': 0, u'Cambodia': 976.25, u'Swaziland': 0, u'Laos': 0, u'Venezuela': 12726.5, u'Burkina Faso': 0, u'Ecuador': 5743.0, u'Bahrain': 0, u'Brunei': 40499.0, u'Saudi Arabia': 24979.75, u'Belarus': 0, u'Algeria': 5466.25, u'Togo': 0, u'Cameroon': 1206.0, u'Zambia': 0, u'Montenegro': 0, u'Papua New Guinea': 2236.2049999999999, u'Slovenia': 0, u'Zimbabwe': 0, u'Kiribati': 0, u'Haiti': 774.0, u'Belize': 0, u'Hong Kong': 37375.0, u'Tanzania': 0, u'Ukraine': 3864.5, u'Liberia': 0, u'Tonga': 0, u'Iraq': 0, u'C\xf4te d\u2019Ivoire': 0, 
u'Israel': 34305.5, u'Philippines': 0, u'Sweden': 56356.5, u'Latvia': 0, u'Gabon': 0, u'Guyana': 0, u'Mali': 685.25, u'Switzerland': 80024.75, u'Thailand': 5630.75, u'Bulgaria': 7154.5, u'Seychelles': 0, u'Honduras': 0, u'Chad': 0, u'Macau': 81283.333333333328, u'United Arab Emirates': 0, u'United Kingdom': 38797.473333333328, u'Malaysia': 10437.5, u'Vietnam': 0, u'Saint Vincent and the Grenadines': 0, u'Uganda': 0, u'South Korea': 0, u'Cyprus': 0, u'Barbados': 0} printed to console.

Crucially, how do I link these lines of code to each pattern?
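For reference, this is the plain MAPE I would compute from those two dicts; skipping zero real values is my assumption, and the repo's "adjusted" variant may treat the zero predictions (unseen regions) differently:

```python
def mape(real, predicted):
    """Mean absolute percentage error over regions present in both dicts,
    skipping regions whose real value is zero. A plain-vanilla sketch;
    the repo's adjustedMAPE may differ."""
    errs = [abs(real[r] - predicted[r]) / abs(real[r])
            for r in real if r in predicted and real[r] != 0]
    return sum(errs) / len(errs) if errs else 0.0
```

Applied to the real/predicted dicts above, this would give one score per pattern if the dicts were keyed by the pattern that produced them.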

Cannot find propertiesOfInterest.json file

In numberExtraction.py, in `__main__` you refer to:

propertiesOfInterest = {}
    with open(dirName + "../propertiesOfInterest.json") as props:
        propertiesOfInterest = json.loads(props.read())

However, I cannot find this file. Should the step that generates this JSON file be part of the first step, FreeBaseDownload.py? Secondly, in FreeBaseDownload.py you refer to:

api_key = open("/cs/research/intelsys/home1/avlachos/freebaseApiKey").read()

It may be worth supporting a generic API key here so that others can reuse the code?
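A sketch of the environment-variable approach I have in mind (FREEBASE_API_KEY is a name I made up, not something the repo defines):

```python
import os

def load_api_key(env_var="FREEBASE_API_KEY"):
    """Read the Freebase API key from the environment instead of the
    hardcoded /cs/research/... path; returns "" if the variable is unset.
    (The variable name is a suggestion, not one the repo uses.)"""
    return os.environ.get(env_var, "")
```

Callers could then fail with a clear message when the key is missing, instead of crashing on an unreadable path.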

Retraining and running `factChecker.py` on a set of test instances

The pattern-based approach requires training on a set of tokens and dependencies, extracting patterns for each property, and then running again on some test instances to output sentences containing patterns that your model has pre-selected for that property. You then evaluate manually via annotation.

I have a set of pre-labelled sentences, and want to run the factChecker directly on only these sentences after having trained the model. For each of these sentences, I want to output the property that the model predicts.

How can I adapt the code to do this?
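To be concrete about what I mean, here is a toy version of the prediction step, assuming (possibly wrongly) that a trained model boils down to a pattern set per property; factChecker.py's actual scoring may be quite different:

```python
def predict_property(sentence_patterns, patterns_by_property):
    """Assign a pre-labelled sentence the property whose learned pattern
    set overlaps most with the sentence's patterns; None if no overlap.
    A hypothetical stand-in for factChecker.py's real scoring."""
    best, best_overlap = None, 0
    for prop, pats in patterns_by_property.items():
        overlap = len(set(sentence_patterns) & set(pats))
        if overlap > best_overlap:
            best, best_overlap = prop, overlap
    return best
```

The question is then which of the repo's data structures plays the role of patterns_by_property here.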

What are the arguments for the matrixFiltering.py you used for the results?

I noticed this step in the filtering file, and was trying to see what to run on my machine to match the parameters you used.

From the paper I read:

We distinguish between the two by requiring each region-pattern combination to have appeared at least twice and its values to have standard deviation less than 0.1. In this case, then the region-pattern value is set to the mean of the values it is encountered with, otherwise is removed.

What are the values for:

minNumberOfValues = int(sys.argv[4])
minNumberOfLocations = int(sys.argv[5])
maxAllowedDeviation = float(sys.argv[6])
percentageRemoved = float(sys.argv[7])

that I can use? 2, 2, 0.1 for the first three? What about the final argument? Thanks :)
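As a sanity check, the parameters can apparently be read back out of the filtered filename itself; this sketch assumes the underscore-separated fields after "Filtered_" are the four arguments in order, which is my guess:

```python
import re

def filter_args_from_name(fname):
    """Recover the four matrixFiltering.py arguments from a filtered-matrix
    filename. Assumes (my guess) the fields after 'Filtered_' are, in order,
    minNumberOfValues, minNumberOfLocations, maxAllowedDeviation, and
    percentageRemoved; returns None if no such fields are found."""
    m = re.search(r"Filtered_(\d+)_(\d+)_([\d.]+)_([\d.]+)", fname)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2)), float(m.group(3)), float(m.group(4))
```

On theMatrixExtend120TokenFiltered_2_2_0.1_0.5_fixed2.json this would suggest 2, 2, 0.1, 0.5.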

Is this line going to give me the predicted EV pairs per pattern?

Using this line, if I add of.write("Predicted EV pairs are: "+ str(prediction)) I get:

text pattern: LOCATION_SLOT~-nn+*extend*DATE~num+consumer~dep+NUMBER_SLOT MAPE:0.0 {u'Brunei': 107.29, u'Saudi Arabia': 142.05, u'Czech Republic': 121.1, u'South Africa': 154.95, u'Sierra Leone': 214.26, u'United Kingdom': 122.99, u'New Zealand': 121.08, u'Dominican Republic': 153.26} region: Saudi Arabia pattern used: *extend*DATE~appos+LOCATION_SLOT~-nn+consumer~dep+NUMBER_SLOT value: 142.05 region: Saudi Arabia pattern used: LOCATION_SLOT~-nn+*extend*DATE~num+consumer~dep+NUMBER_SLOT value: 142.05 region: Dominican Republic pattern used: *extend*DATE~appos+LOCATION_SLOT~-nn+consumer~dep+NUMBER_SLOT value: 153.26 region: Dominican Republic pattern used: LOCATION_SLOT~-nn+*extend*DATE~num+consumer~dep+NUMBER_SLOT value: 153.26 region: Brunei pattern used: *extend*DATE~appos+LOCATION_SLOT~-nn+consumer~dep+NUMBER_SLOT value: 107.29 region: Brunei pattern used: LOCATION_SLOT~-nn+*extend*DATE~num+consumer~dep+NUMBER_SLOT value: 107.29 region: Sierra Leone pattern used: *extend*DATE~appos+LOCATION_SLOT~-nn+consumer~dep+NUMBER_SLOT value: 214.26 region: Sierra Leone pattern used: LOCATION_SLOT~-nn+*extend*DATE~num+consumer~dep+NUMBER_SLOT value: 214.26 region: United Kingdom pattern used: *extend*DATE~appos+LOCATION_SLOT~-nn+consumer~dep+NUMBER_SLOT value: 122.99 region: United Kingdom pattern used: LOCATION_SLOT~-nn+*extend*DATE~num+consumer~dep+NUMBER_SLOT value: 122.99 Predicted EV pairs are: {u'Canada': 146.12, u'Lithuania': 146.12, u'Cambodia': 146.12, u'Ethiopia': 146.12, u'Swaziland': 146.12, u'Argentina': 146.12, u'Cameroon': 146.12, u'Burkina Faso': 146.12, u'Ghana': 146.12, u'Saudi Arabia': 142.05000000000001, u'Republic of Ireland': 146.12, u'Bosnia and Herzegovina': 146.12, u'Spain': 146.12, u'Liberia': 146.12, u'Maldives': 146.12, u'Tanzania': 146.12, u'Gabon': 146.12, u'Albania': 146.12, u'Samoa': 146.12, u'India': 146.12, u'Azerbaijan': 146.12, u'Lesotho': 146.12, u'Saint Vincent and the Grenadines': 146.12, u'Cyprus': 146.12, u'Tajikistan': 146.12, 
u'Afghanistan': 146.12, u'Bangladesh': 146.12, u'Solomon Islands': 146.12, u'Saint Lucia': 146.12, u'Mongolia': 146.12, u'France': 146.12, u'Slovakia': 146.12, u'Laos': 146.12, u'Malawi': 146.12, u'Singapore': 146.12, u'Montenegro': 146.12, u'Saint Kitts and Nevis': 146.12, u'Armenia': 146.12, u'Dominican Republic': 153.25999999999999, u'Ukraine': 146.12, u'Bahrain': 146.12, u'Tonga': 146.12, u'Libya': 146.12, u'Central African Republic': 146.12, u'Mauritius': 146.12, u'Vietnam': 146.12, u'Mali': 146.12, u'Russia': 146.12, u'Bulgaria': 146.12, u'Romania': 146.12, u'Angola': 146.12, u'Portugal': 146.12, u'Nicaragua': 146.12, u'Malaysia': 146.12, u'Austria': 146.12, u'Mozambique': 146.12, u'Hungary': 146.12, u'Brazil': 146.12, u'Kuwait': 146.12, u'Qatar': 146.12, u'Nigeria': 146.12, u'Brunei': 107.29000000000001, u'Australia': 146.12, u'Algeria': 146.12, u'Belgium': 146.12, u'Haiti': 146.12, u'Iraq': 146.12, u'Sierra Leone': 214.25999999999999, u'Denmark': 146.12, u'Namibia': 146.12, u'Guinea-Bissau': 146.12, u'Switzerland': 146.12, u'Seychelles': 146.12, u'Estonia': 146.12, u'Kosovo': 146.12, u'Timor-Leste': 146.12, u'Dominica': 146.12, u'Colombia': 146.12, u'Burundi': 146.12, u'Fiji': 146.12, u'Barbados': 146.12, u'Madagascar': 146.12, u'Bhutan': 146.12, u'Sudan': 146.12, u'Netherlands': 146.12, u'Suriname': 146.12, u'S\xe3o Tom\xe9 and Pr\xedncipe': 146.12, u'Venezuela': 146.12, u'Israel': 146.12, u'Senegal': 146.12, u'Papua New Guinea': 146.12, u'Germany': 146.12, u'Kazakhstan': 146.12, u'Mauritania': 146.12, u'Kyrgyzstan': 146.12, u'Trinidad and Tobago': 146.12, u'Latvia': 146.12, u'Guyana': 146.12, u'Belarus': 146.12, u'Honduras': 146.12, u'Myanmar': 146.12, u'Tunisia': 146.12, u'Serbia': 146.12, u'Comoros': 146.12, u'United Kingdom': 122.98999999999999, u'Greece': 146.12, u'Sri Lanka': 146.12, u'Croatia': 146.12, u'Botswana': 146.12}MAPE of predictor before adding the pattern:0.17656771874 MAPE of predictor after adding the pattern:0.17656771874

appearing in the adjustedMAPE_consumer_price_index_True_1_TEST file. Is this what I should be storing, so that I can link these predicted EV pairs to the pattern LOCATION_SLOT~-nn+*extend*DATE~num+consumer~dep+NUMBER_SLOT?
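If so, a sketch of what I would store: accumulate each pattern's predicted EV-pair dict in one mapping and dump it to JSON next to the log, instead of interleaving the predictions in the text file:

```python
import json

def record_prediction(store, pattern, prediction):
    """Keep each pattern's predicted EV-pair dict under its pattern key,
    turning the adjustedMAPE log into a queryable JSON structure."""
    store[pattern] = prediction
    return store

store = {}
record_prediction(
    store,
    "LOCATION_SLOT~-nn+*extend*DATE~num+consumer~dep+NUMBER_SLOT",
    {"Brunei": 107.29, "Saudi Arabia": 142.05, "Sierra Leone": 214.26},
)
serialised = json.dumps(store)  # ready to write alongside the log file
```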
