
academic-pdf-scrap's People

Contributors: earlng, paulsedille

academic-pdf-scrap's Issues

Impact statement is split between multiple tags

Describe the bug
After the BIS title, the BIS content is split across multiple tags, but the code only scrapes the first tag, not the entire BIS content (this often happens when the BIS content is split into different paragraphs)

To Reproduce
Papers to look into:

  1. b704ea2c39778f07c617f6b7ce480e9e
  2. 33a854e247155d590883b93bca53848a
  3. b460cf6b09878b00a3e1ad4c72344ccd
  4. 460191c72f67e90150a093b4585e7eb4
  5. 2290a7385ed77cc5592dc2153229f082 (this is an xml tagging error, but scraping the entire section would fix it)

Expected behavior
Grab the entirety of the impact statement, not just the first portion that happens to immediately follow the h1 tag.

Suggested Fix
Review the use of itertext()? Or, instead of scraping only the content after the h1, pull the entire section that contains the h1 (thus also scraping the title into the statement itself)?
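The second option can be sketched as follows. The tag names (section, h1) follow this issue's description, but the function name and the use of the standard-library ElementTree are assumptions; the real scraper may walk the tree differently:

```python
import xml.etree.ElementTree as ET

def scrape_impact_section(xml_bytes):
    """Return the full text of the <section> whose <h1> mentions 'impact'.

    itertext() walks every text node in the section, so content split
    across multiple child tags (e.g. several paragraphs) is captured,
    not just the first chunk after the <h1>.
    """
    root = ET.fromstring(xml_bytes)
    for section in root.iter("section"):
        h1 = section.find("h1")
        if h1 is not None and "impact" in "".join(h1.itertext()).lower():
            # note this also scrapes the title into the statement itself
            return " ".join(t.strip() for t in section.itertext() if t.strip())
    return None
```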

Improve sentence count

Currently, the code counts BIS (broader impact statement) sentences by simply counting the number of final punctuation markers (. ! ?). This is not perfect because strings like "e.g." or "1.5 gallons" incorrectly add to the sentence count.

Ideally, the script would take these exceptional cases into account and reflect this in the final count.

There is an easy fix for the two most common occurrences, "e.g." and "i.e.": subtract 2 from the sentence count for every separate occurrence of either substring in the BIS text. More complex solutions might be (1) to automatically dismiss any sentence shorter than X characters (around 3-10 seems appropriate) and/or (2) to count ".", "!", or "?" only when followed by a blank space (that is, count ". ", "! ", and "? "). This would help exclude rarer false positives, for example tables, lists, or numerical values that include full stops (such as "934.2" or "1. Computation Cost, 2. Training Data", etc.).
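The "terminator followed by a space" heuristic combined with the e.g./i.e. correction might look like this; note that once the followed-by-a-space rule is in place, each "e.g."/"i.e." contributes only one false hit (its final period), so we subtract 1 rather than 2 per occurrence. The function name is hypothetical:

```python
import re

def count_sentences(text):
    """Heuristic BIS sentence count.

    '.', '!' or '?' count only when followed by whitespace or the end of
    the string, which already excludes '1.5', '934.2' and the first dot
    of 'e.g.'; the final dot of 'e.g. ' / 'i.e. ' still matches once, so
    we subtract one per occurrence (not two, as with the naive count).
    """
    hits = len(re.findall(r"[.!?](?:\s|$)", text))
    abbrevs = len(re.findall(r"\b(?:e\.g\.|i\.e\.)(?:\s|$)", text))
    return hits - abbrevs
```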

Double Space

Describe the bug
There are also a lot of discrepancies that are just due to double spaces ("  " vs " "), which I suppose comes from the code adding a space between text chunks even when one is already there.

Additional context
But that's an easy fix with find-and-replace or the TRIM() function.
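On the scraper side, the equivalent of TRIM() is a one-line whitespace collapse; a minimal sketch:

```python
import re

def collapse_spaces(text):
    # collapse any run of whitespace to a single space and trim the ends,
    # like a find-and-replace of "  " -> " " applied repeatedly
    return re.sub(r"\s+", " ", text).strip()
```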

Scrape impact statements by looking for statements with "impact" in the title

Is your feature request related to a problem? Please describe.
Currently the statements scraped are those with "broader impact(s)" in the title; I would like to get all those with "impact" (case-insensitive) in the title.

Describe the solution you'd like
The code scrapes all occurrences of a section that has "impact" in its header.

Describe alternatives you've considered
Using the "if X in Y" membership test?
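In Python the "if X in Y" test is the in operator; combined with str.lower() it becomes case-insensitive. A minimal sketch (function name hypothetical):

```python
def is_impact_header(title):
    # Python's "in" operator does the "if X in Y" test; lower() makes it
    # case-insensitive, so "Impact", "IMPACT" and "impacts" all match
    return title is not None and "impact" in title.lower()
```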

Merge dataframe with second data (authors, institutions, countries)

Is your feature request related to a problem? Please describe.
The PDF formatting makes it difficult to scrape the authors and their institutions from the XML. Fortunately, there is another repository of the articles that makes this easier, and even more fortunately, someone has already done the hard work of scraping it with python, as well as adding the country of affiliation for many institutions, here: https://github.com/nd7141/icml2020

Describe the solution you'd like
Can the authors+institutions+countries data scraped by the above github user be collated into our dataframe, and output in a single csv file?

Describe alternatives you've considered
Will need to look into this!
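One way to sketch the collation step with only the standard library; all column names ("title", "authors", "institutions", "countries") and the join key are assumptions, since the actual schemas of the two datasets would need to be checked:

```python
def merge_author_data(bis_rows, author_rows, key="title"):
    """Left-join author/institution/country columns onto the BIS rows.

    Both inputs are lists of dicts (one per paper); rows that have no
    match in the author data keep empty strings in the added columns.
    """
    lookup = {row[key]: row for row in author_rows}
    merged = []
    for row in bis_rows:
        extra = lookup.get(row[key], {})
        combined = dict(row)
        for col in ("authors", "institutions", "countries"):
            combined[col] = extra.get(col, "")
        merged.append(combined)
    return merged
```

The merged rows could then be written to a single csv with csv.DictWriter.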

Include impact statement title in dataframe

Is your feature request related to a problem? Please describe.
The current dataframe does not include the title of the sections pulled, which would make it easier to eyeball, directly within the csv, whether a section is actually an impact statement or one we do not need.

Describe the solution you'd like
Include in the dataframe not just the text of the impact statement, but its title.

Describe alternatives you've considered
This is straightforward, I suppose: define a variable for the title text (something like BIS_title = child.text when child.tag == "h1" and child.text contains "impact") and append it to impact_dict.
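The pseudocode above can be fleshed out roughly like this, assuming we iterate over a section's child elements with .tag and .text attributes as the existing loop seems to; the function name is hypothetical:

```python
def extract_title_and_text(children):
    """Return (BIS_title, body_text) from a section's child elements.

    'children' is assumed to be an iterable of parsed XML elements with
    .tag and .text attributes, matching the loop sketched above.
    """
    bis_title = None
    chunks = []
    for child in children:
        text = child.text or ""
        if child.tag == "h1" and "impact" in text.lower():
            bis_title = text  # keep the header so it lands in the dataframe
        else:
            chunks.append(text)
    return bis_title, " ".join(chunks).strip()
```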

New code missed some statements

Examples of BIS that are not being pulled under the new code but used to be:

  1. 285baacbdf8fda1de94b19282acd23e2
  2. cdfa4c42f465a5a66871587c69fcfa34
  3. 33a854e247155d590883b93bca53848a (though the original code only pulled it in partially anyway)
  4. 4496bf24afe7fab6f046bf4923da8de6

The Impact Statement is split across pages

Describe the bug
The BIS is split across multiple pages; the code only pulls in the content before the page break because, in the xml, the page break is a <section> break.

To Reproduce
Papers:

  • 285baacbdf8fda1de94b19282acd23e2

Expected behavior
Grab the entire impact statement across pages.

Possible Fix
Pages seem to be marked by an <outsider> tag; it might be possible to write the code so that it continues to scrape the <section> immediately following an <outsider> tag.
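A sketch of this possible fix, under the assumption (from the issue) that the document's top-level elements appear in order and pages are delimited by <outsider> tags; the element interface and function name are assumptions:

```python
def collect_across_pages(elements, start):
    """Gather BIS text that continues past a page break.

    'elements' is the document's top-level elements in order, each with
    .tag and .text; pages are assumed to be delimited by <outsider>
    tags, so a <section> directly following an <outsider> is treated as
    a continuation of the statement.
    """
    parts = [elements[start].text]
    i = start + 1
    while i + 1 < len(elements) and elements[i].tag == "outsider":
        nxt = elements[i + 1]
        if nxt.tag != "section":
            break
        parts.append(nxt.text)
        i += 2
    return " ".join(parts)
```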

Capturing page number

Describe the bug
These are all examples of the new code pulling in a "9" at the end, which appears to be the page number. If there's a way to get the code not to pull in the page number, that would be great.

To Reproduce
Refer to these files:

  1. 0332d694daab22e0e0eaf7a5e88433f9
  2. 0415740eaa4d9decbc8da001d3fd805f
  3. 066f182b787111ed4cb65ed437f0855b

Expected behavior
Ignore the "9" if possible.

Related to #10
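One way to ignore the stray page number is to strip a bare integer at the very end of the scraped text; this assumes a statement never legitimately ends in a bare number:

```python
import re

def strip_trailing_page_number(text):
    # drop a bare integer (plus surrounding whitespace) left at the very
    # end of the scraped statement, e.g. a stray "9" from the page footer;
    # sentences ending in "2030." are untouched because of the final period
    return re.sub(r"\s*\b\d+\s*$", "", text)
```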

Too many Impact Statements

If two different parts of the paper both have a title that includes "impact", the new code will concatenate both and pull the combination in as the BIS:

  1. 196f5641aa9dc87067da4ff90fd81e7b
  2. 274e6fcf4a583de4a81c6376f17673e7
  3. 33a5435d4f945aa6154b31a73bab3b73
  4. 93d9033636450402d67cd55e60b3f926
  5. a03fa30821986dff10fc66647c84c9c3
  6. b607ba543ad05417b8507ee86c54fcb7

Pull no more than one impact statement per paper

Is your feature request related to a problem? Please describe.
Currently, the code pulls any section within a paper whose title includes the word "impact." This includes any section that happens to include impact in its name, even if it is not an "Impact Statement." This also means that if a paper includes an impact statement AND another section that has "impact" in the title, the code will output more than one "impact statement" per article.

Describe the solution you'd like
In order to minimise these problems, I would like it if the code only pulled the last section that includes "impact" in its title in cases where there is more than one such section. By "last" I mean "that appears latest in the body of the xml/paper". This is because impact statements are typically placed at the end of a paper since they do not count for the 8-page limit imposed by NeurIPS; therefore, if more than one section has "impact" in the title, the correct one to pull is most likely the latest one.

Describe alternatives you've considered
Not sure; could "copy over" each old section in the dataframe as a new one is found?
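A minimal sketch of the "keep only the last match" rule described above, assuming sections arrive as (title, text) pairs in document order; the function name is hypothetical:

```python
def last_impact_section(sections):
    """Return the text of the LAST section whose title contains 'impact'.

    Impact statements typically sit at the end of the paper, so when
    several headers match, the latest one is most likely the real
    statement; later matches simply overwrite earlier ones.
    """
    last = None
    for title, text in sections:
        if title and "impact" in title.lower():
            last = text
    return last
```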

Impact statement is not in a header

Describe the bug
“Impact” appears in an <h1> title that is not the BIS, so the code scrapes the wrong content.

To Reproduce
Paper:

  • 55479c55ebd1efd3ff125f1337100388

Possible Fix
Switch the if and elif statements so as to first look for the title names we know are popular, like “Broader Impact”, and only then, if none is found, look for any h1 title that contains “impact”?

Included acknowledgement

For some papers, the new code pulls in the acknowledgement section (which usually follows the BIS), although we don’t want that:

  1. a89b71bb5227c75d463dd82a03115738
  2. d3b1fb02964aa64e257f9f26a31f72cf
  3. d40d35b3063c11244fbf38e9b55074be
  4. f52a7b2610fb4d3f74b4106fb80b233d
  5. fea16e782bc1b1240e4b3c797012e289

XML tagging of PDFs is too faulty

Describe the bug
The xml tagging is too faulty for the statement to be scraped correctly.

To Reproduce

BIS title improperly coded: in the xml it appears after the BIS content and is tagged as <region>

  • 103303dd56a731e377d01f6a37badae3
  • 6271faadeedd7626d661856b7a004e27
  • f5b1b89d98b7286673128a5fb112cb9a
  • f0bda020d2470f2e74990a07a607ebd9

The entire BIS, title and content, is contained within a tag (together with other, non-BIS content), so the BIS is not scraped at all by the code:

  • 201d7288b4c18a679e48b31c72c30ded
  • 8ab70731b1553f17c11a3bbc87e0b605
  • 94d2a3c6dd19337f2511cdf8b4bf907e
  • 2974788b53f73e7950e8aa49f3a306db
  • b139aeda1c2914e3b579aafd3ceeb1bd
  • be23c41621390a448779ee72409e5f49
  • 7a006957be65e608e863301eb98e1808

BIS title and content improperly and arbitrarily coded as <outsider> and <region>

  • 1325cdae3b6f0f91a1b629307bf2d498

Proposed Fix
No possible fix

What if Impact Statement is not a h1?

Describe the bug
The BIS title is not contained in an <h1> tag (e.g. there is only a more general “Conclusion” <h1>, or the xml tagging is just bad), and therefore it is not scraped.

To Reproduce
Papers that this problem occurs in:

  1. 9f1d5659d5880fb427f6e04ae500fc25 (contained in <h2>)
  2. 7a43ed4e82d06a1e6b2e88518fb8c2b0 (contained in <h2>)
  3. 4b29fa4efe4fb7bc667c7b301b74d52d (contained in <h2>)
  4. c589c3a8f99401b24b9380e86d939842 (contained in a <region> tag, no title)

Expected behavior
The code should be able to find the impact statement even if it is not in the h1 tag.

Proposed fix
Look for a BIS through <h2> tags as well, or simply pull any text content that contains “broader impact”?
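The first option could look like this with the standard-library ElementTree, checking <h1> headers before falling back to <h2>; the function name is hypothetical:

```python
import xml.etree.ElementTree as ET

def find_impact_header(root):
    """Find the impact header, trying <h1> tags before <h2>.

    Only falls back to <h2> when no <h1> matches, per the papers above
    where the statement title is tagged one level down.
    """
    for tag in ("h1", "h2"):
        for el in root.iter(tag):
            if el.text and "impact" in el.text.lower():
                return el
    return None
```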

Include paper link in dataframe

Is your feature request related to a problem? Please describe.
The current dataframe only includes the "paper identifier" (an alphanumeric string), but this could easily be used to also provide the url of the paper itself, a useful addition.

Describe the solution you'd like
Add a "paper link" column to the dataframe by adding two strings to the paper identifier, before and after, as follows: "https://proceedings.neurips.cc/paper/2020/file/" + {paper identifier} + "-Paper.pdf" (all paper links are built like this).
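Since all paper links follow the pattern given above, this is a one-liner; the function name is hypothetical:

```python
def paper_link(paper_id):
    # every NeurIPS 2020 paper PDF follows this URL pattern
    return ("https://proceedings.neurips.cc/paper/2020/file/"
            + paper_id + "-Paper.pdf")
```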
