
academic-pdf-scrap's People

Contributors: earlng, paulsedille

academic-pdf-scrap's Issues

Impact statement is split between multiple tags

Describe the bug
After the BIS title, the BIS content is split across multiple tags, but the code only scrapes the first tag, not the entire BIS content (this often happens when the BIS content is split into different paragraphs)

To Reproduce
Papers to look into:

  1. b704ea2c39778f07c617f6b7ce480e9e
  2. 33a854e247155d590883b93bca53848a
  3. b460cf6b09878b00a3e1ad4c72344ccd
  4. 460191c72f67e90150a093b4585e7eb4
  5. 2290a7385ed77cc5592dc2153229f082 (this is an xml tagging error, but scraping the entire section would fix it)

Expected behavior
Grab the entirety of the impact statement, not just the first portion that happens to immediately follow the h1 tag.

Suggested Fix
Review the use of itertext()? Or, instead of scraping only the content after the h1, pull the entire section that contains the h1 (thus also scraping the title into the statement itself)?
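The second option can be sketched as follows. The tag names (section, h1) follow this issue's description, but the function name and the use of the standard-library ElementTree are assumptions; the real scraper may walk the tree differently:

```python
import xml.etree.ElementTree as ET

def scrape_impact_section(xml_bytes):
    """Return the full text of the <section> whose <h1> mentions 'impact'.

    itertext() walks every text node in the section, so content split
    across multiple child tags (e.g. several paragraphs) is captured,
    not just the first chunk after the <h1>.
    """
    root = ET.fromstring(xml_bytes)
    for section in root.iter("section"):
        h1 = section.find("h1")
        if h1 is not None and "impact" in "".join(h1.itertext()).lower():
            # note this also scrapes the title into the statement itself
            return " ".join(t.strip() for t in section.itertext() if t.strip())
    return None
```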

Improve sentence count

Currently, the code counts BIS (broader impact statement) sentences by simply counting the number of final punctuation markers (. ! ?). This is not perfect because strings like "e.g." or "1.5 gallons" incorrectly add to the sentence count.

Ideally, the script would take these exceptional cases into account and reflect this in the final count.

There is an easy fix for the two most common occurrences, "e.g." and "i.e.": subtract 2 from the sentence count for every separate occurrence of either substring in the BIS text. More complex solutions might be (1) to automatically dismiss any sentence shorter than X characters (around 3-10 seems appropriate) and/or (2) to count ".", "!", or "?" only when followed by a blank space (that is, count ". ", "! ", and "? "). This would help exclude rarer false positives, for example tables, lists, or numerical values that include full stops (such as "934.2" or "1. Computation Cost, 2. Training Data", etc.).
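The "terminator followed by a space" heuristic combined with the e.g./i.e. correction might look like this; note that once the followed-by-a-space rule is in place, each "e.g."/"i.e." contributes only one false hit (its final period), so we subtract 1 rather than 2 per occurrence. The function name is hypothetical:

```python
import re

def count_sentences(text):
    """Heuristic BIS sentence count.

    '.', '!' or '?' count only when followed by whitespace or the end of
    the string, which already excludes '1.5', '934.2' and the first dot
    of 'e.g.'; the final dot of 'e.g. ' / 'i.e. ' still matches once, so
    we subtract one per occurrence (not two, as with the naive count).
    """
    hits = len(re.findall(r"[.!?](?:\s|$)", text))
    abbrevs = len(re.findall(r"\b(?:e\.g\.|i\.e\.)(?:\s|$)", text))
    return hits - abbrevs
```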

Double Space

Describe the bug
There are also a lot of discrepancies that are just due to double spaces ("  " vs " "), which I suppose comes from the code adding a space between text chunks even when one is already there.

Additional context
But that's an easy fix with find-and-replace or the TRIM() function.
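On the scraper side, the equivalent of TRIM() is a one-line whitespace collapse; a minimal sketch:

```python
import re

def collapse_spaces(text):
    # collapse any run of whitespace to a single space and trim the ends,
    # like a find-and-replace of "  " -> " " applied repeatedly
    return re.sub(r"\s+", " ", text).strip()
```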

Scrape impact statements by looking for statements with "impact" in the title

Is your feature request related to a problem? Please describe.
Currently the statements scraped are those with "broader impact(s)" in the title; I would like to get all those with "impact" (case-insensitive) in the title.

Describe the solution you'd like
The code scrapes all occurrences of a section that has "impact" in its header.

Describe alternatives you've considered
Using the "if X in Y" membership test?
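In Python the "if X in Y" test is the in operator; combined with str.lower() it becomes case-insensitive. A minimal sketch (function name hypothetical):

```python
def is_impact_header(title):
    # Python's "in" operator does the "if X in Y" test; lower() makes it
    # case-insensitive, so "Impact", "IMPACT" and "impacts" all match
    return title is not None and "impact" in title.lower()
```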

Merge dataframe with second data (authors, institutions, countries)

Is your feature request related to a problem? Please describe.
The PDF formatting makes it difficult to scrape the authors and their institutions from the XML. Fortunately, there is another repository of the articles that makes this easier, and even more fortunately, someone has already done the hard work of scraping it with python, as well as adding the country of affiliation for many institutions, here: https://github.com/nd7141/icml2020

Describe the solution you'd like
Can the authors+institutions+countries data scraped by the above github user be collated into our dataframe, and output in a single csv file?

Describe alternatives you've considered
Will need to look into this!
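One way to sketch the collation step with only the standard library; all column names ("title", "authors", "institutions", "countries") and the join key are assumptions, since the actual schemas of the two datasets would need to be checked:

```python
def merge_author_data(bis_rows, author_rows, key="title"):
    """Left-join author/institution/country columns onto the BIS rows.

    Both inputs are lists of dicts (one per paper); rows that have no
    match in the author data keep empty strings in the added columns.
    """
    lookup = {row[key]: row for row in author_rows}
    merged = []
    for row in bis_rows:
        extra = lookup.get(row[key], {})
        combined = dict(row)
        for col in ("authors", "institutions", "countries"):
            combined[col] = extra.get(col, "")
        merged.append(combined)
    return merged
```

The merged rows could then be written to a single csv with csv.DictWriter.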

Include impact statement title in dataframe

Is your feature request related to a problem? Please describe.
The current dataframe does not include the title of the sections pulled, which would make it easier to eyeball, directly within the csv, whether a section is actually an impact statement or one we do not need.

Describe the solution you'd like
Include in the dataframe not just the text of the impact statement, but its title.

Describe alternatives you've considered
This is straightforward, I suppose: define a variable for the title text (something like BIS_title = child.text when child.tag == "h1" and child.text contains "impact") and append it to impact_dict.
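The pseudocode above can be fleshed out roughly like this, assuming we iterate over a section's child elements with .tag and .text attributes as the existing loop seems to; the function name is hypothetical:

```python
def extract_title_and_text(children):
    """Return (BIS_title, body_text) from a section's child elements.

    'children' is assumed to be an iterable of parsed XML elements with
    .tag and .text attributes, matching the loop sketched above.
    """
    bis_title = None
    chunks = []
    for child in children:
        text = child.text or ""
        if child.tag == "h1" and "impact" in text.lower():
            bis_title = text  # keep the header so it lands in the dataframe
        else:
            chunks.append(text)
    return bis_title, " ".join(chunks).strip()
```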

New code missed some statements

Examples of BIS that are not being pulled under the new code but used to be:

  1. 285baacbdf8fda1de94b19282acd23e2
  2. cdfa4c42f465a5a66871587c69fcfa34
  3. 33a854e247155d590883b93bca53848a (though the original code only pulled it in partially anyway)
  4. 4496bf24afe7fab6f046bf4923da8de6

The Impact Statement is split across pages

Describe the bug
The BIS is split across multiple pages; the code only pulls in the content before the page break because, in the xml, the page break is a <section> break.

To Reproduce
Papers:

  • 285baacbdf8fda1de94b19282acd23e2

Expected behavior
Grab the entire impact statement across pages.

Possible Fix
Pages seem to be marked by an <outsider> tag; it might be possible to write the code so that it continues to scrape the <section> immediately following an <outsider> tag.
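A sketch of this possible fix, under the assumption (from the issue) that the document's top-level elements appear in order and pages are delimited by <outsider> tags; the element interface and function name are assumptions:

```python
def collect_across_pages(elements, start):
    """Gather BIS text that continues past a page break.

    'elements' is the document's top-level elements in order, each with
    .tag and .text; pages are assumed to be delimited by <outsider>
    tags, so a <section> directly following an <outsider> is treated as
    a continuation of the statement.
    """
    parts = [elements[start].text]
    i = start + 1
    while i + 1 < len(elements) and elements[i].tag == "outsider":
        nxt = elements[i + 1]
        if nxt.tag != "section":
            break
        parts.append(nxt.text)
        i += 2
    return " ".join(parts)
```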

Capturing page number

Describe the bug
These are all examples of the new code pulling in a "9" at the end, which appears to be the page number. If there's a way to get the code not to pull in the page number, that would be great.

To Reproduce
Refer to these files:

  1. 0332d694daab22e0e0eaf7a5e88433f9
  2. 0415740eaa4d9decbc8da001d3fd805f
  3. 066f182b787111ed4cb65ed437f0855b

Expected behavior
Ignore the "9" if possible.

Related to #10
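One way to ignore the stray page number is to strip a bare integer at the very end of the scraped text; this assumes a statement never legitimately ends in a bare number:

```python
import re

def strip_trailing_page_number(text):
    # drop a bare integer (plus surrounding whitespace) left at the very
    # end of the scraped statement, e.g. a stray "9" from the page footer;
    # sentences ending in "2030." are untouched because of the final period
    return re.sub(r"\s*\b\d+\s*$", "", text)
```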

Too many Impact Statements

If two different parts of the paper both have a title that includes "impact", the new code will concatenate both and pull the combination in as the BIS:

  1. 196f5641aa9dc87067da4ff90fd81e7b
  2. 274e6fcf4a583de4a81c6376f17673e7
  3. 33a5435d4f945aa6154b31a73bab3b73
  4. 93d9033636450402d67cd55e60b3f926
  5. a03fa30821986dff10fc66647c84c9c3
  6. b607ba543ad05417b8507ee86c54fcb7

Pull no more than one impact statement per paper

Is your feature request related to a problem? Please describe.
Currently, the code pulls any section within a paper whose title includes the word "impact." This includes any section that happens to include impact in its name, even if it is not an "Impact Statement." This also means that if a paper includes an impact statement AND another section that has "impact" in the title, the code will output more than one "impact statement" per article.

Describe the solution you'd like
In order to minimise these problems, I would like it if the code only pulled the last section that includes "impact" in its title in cases where there is more than one such section. By "last" I mean "that appears latest in the body of the xml/paper". This is because impact statements are typically placed at the end of a paper since they do not count for the 8-page limit imposed by NeurIPS; therefore, if more than one section has "impact" in the title, the correct one to pull is most likely the latest one.

Describe alternatives you've considered
Not sure; could "copy over" each old section in the dataframe as a new one is found?
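A minimal sketch of the "keep only the last match" rule described above, assuming sections arrive as (title, text) pairs in document order; the function name is hypothetical:

```python
def last_impact_section(sections):
    """Return the text of the LAST section whose title contains 'impact'.

    Impact statements typically sit at the end of the paper, so when
    several headers match, the latest one is most likely the real
    statement; later matches simply overwrite earlier ones.
    """
    last = None
    for title, text in sections:
        if title and "impact" in title.lower():
            last = text
    return last
```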

Impact statement is not in a header

Describe the bug
“Impact” appears in an <h1> title that is not the BIS, so the code scrapes the wrong content.

To Reproduce
Paper:

  • 55479c55ebd1efd3ff125f1337100388

Possible Fix
Switch the if and elif statements so as to first look for the title names we know are popular, like “Broader Impact”, and only then, if none is found, look for any h1 title that contains “impact”?

Included acknowledgement

For some papers, the new code pulls in the acknowledgement section (which usually follows the BIS), although we don’t want that:

  1. a89b71bb5227c75d463dd82a03115738
  2. d3b1fb02964aa64e257f9f26a31f72cf
  3. d40d35b3063c11244fbf38e9b55074be
  4. f52a7b2610fb4d3f74b4106fb80b233d
  5. fea16e782bc1b1240e4b3c797012e289

XML tagging of PDFs is too faulty

Describe the bug
The xml tagging is too faulty for the statement to be scraped correctly.

To Reproduce

BIS title improperly coded: in the xml it appears after the BIS content and is tagged as <region>

  • 103303dd56a731e377d01f6a37badae3
  • 6271faadeedd7626d661856b7a004e27
  • f5b1b89d98b7286673128a5fb112cb9a
  • f0bda020d2470f2e74990a07a607ebd9

The entire BIS, title and content, is contained within a tag (together with other, non-BIS content), so the BIS is not scraped at all by the code:

  • 201d7288b4c18a679e48b31c72c30ded
  • 8ab70731b1553f17c11a3bbc87e0b605
  • 94d2a3c6dd19337f2511cdf8b4bf907e
  • 2974788b53f73e7950e8aa49f3a306db
  • b139aeda1c2914e3b579aafd3ceeb1bd
  • be23c41621390a448779ee72409e5f49
  • 7a006957be65e608e863301eb98e1808

BIS title and content improperly and arbitrarily coded as <outsider> and <region>

  • 1325cdae3b6f0f91a1b629307bf2d498

Proposed Fix
No possible fix

What if Impact Statement is not a h1?

Describe the bug
The BIS title is not contained in an <h1> tag (e.g. there is only a more general “Conclusion” <h1>, or the xml tagging is just bad), and therefore it is not scraped.

To Reproduce
Papers that this problem occurs in:

  1. 9f1d5659d5880fb427f6e04ae500fc25 (contained in <h2>)
  2. 7a43ed4e82d06a1e6b2e88518fb8c2b0 (contained in <h2>)
  3. 4b29fa4efe4fb7bc667c7b301b74d52d (contained in <h2>)
  4. c589c3a8f99401b24b9380e86d939842 (contained in a <region> tag, no title)

Expected behavior
The code should be able to find the impact statement even if it is not in the h1 tag.

Proposed fix
Look for a BIS through <h2> tags as well, or simply pull any text content that contains “broader impact”?
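The first option could look like this with the standard-library ElementTree, checking <h1> headers before falling back to <h2>; the function name is hypothetical:

```python
import xml.etree.ElementTree as ET

def find_impact_header(root):
    """Find the impact header, trying <h1> tags before <h2>.

    Only falls back to <h2> when no <h1> matches, per the papers above
    where the statement title is tagged one level down.
    """
    for tag in ("h1", "h2"):
        for el in root.iter(tag):
            if el.text and "impact" in el.text.lower():
                return el
    return None
```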

Include paper link in dataframe

Is your feature request related to a problem? Please describe.
The current dataframe only includes the "paper identifier" (an alphanumeric string), but this could easily be used to also provide the url of the paper itself, a useful addition.

Describe the solution you'd like
Add a "paper link" column to the dataframe by adding two strings to the paper identifier, before and after, as follows: "https://proceedings.neurips.cc/paper/2020/file/" + {paper identifier} + "-Paper.pdf" (all paper links are built like this).
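Since all paper links follow the pattern given above, this is a one-liner; the function name is hypothetical:

```python
def paper_link(paper_id):
    # every NeurIPS 2020 paper PDF follows this URL pattern
    return ("https://proceedings.neurips.cc/paper/2020/file/"
            + paper_id + "-Paper.pdf")
```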
