
jpwahle / cs-insights-crawler

This repository implements the interaction with DBLP, information extraction and pre-processing of papers, and a client to store data to the cs-insights-backend.

Home Page: https://aclanthology.org/2022.lrec-1.283.pdf

License: Apache License 2.0

Languages: Python 98.66%, Shell 0.60%, Dockerfile 0.74%
Topics: crawler, dblp, dblp-dataset, nlp, semanticscholar

cs-insights-crawler's Introduction





This is the official crawler implementation for the D3 dataset, written almost entirely in pure Python. The crawler is also used for the cs-insights project.

Starting with version 1.0.2, this project uses semantic versioning and supports SemanticScholar. For more information about the supported features, see the releases.

Installation & Setup

First install the package manager poetry:

pip install poetry

Then run:

poetry install

To start the crawling process, run:

poetry run cli main --s2_use_papers --s2_use_abstracts --s2_filter_dblp

For help run:

poetry run cli main --help

Code quality and tests

To maintain a consistent and well-tested repository, we run unit tests, linting, and type checks with GitHub Actions. We use pytest for testing, pylint for linting, and pyright for typing. Every time code gets pushed to our repository, these checks are executed and have to fulfill certain requirements before you can merge the code into our master branch.

Whenever you create a pull request against the default branch, GitHub actions will create a CI job executing unit tests and linting.

To run all the CI checks locally, run:

poetry run poe alltest

Contributing

Fork the repo, make changes and send a PR. We'll review it together!

Commit messages should follow Angular's conventions.

License

This project is licensed under the terms of the MIT license. For more information, please see the LICENSE file.

Citation

If you use this repository, or use our tool for analysis, please cite our work:


@inproceedings{Wahle2022c,
  title        = {D3: A Massive Dataset of Scholarly Metadata for Analyzing the State of Computer Science Research},
  author       = {Wahle, Jan Philip and Ruas, Terry and Mohammad, Saif M. and Gipp, Bela},
  year         = {2022},
  month        = {July},
  booktitle    = {Proceedings of the 13th Language Resources and Evaluation Conference},
  publisher    = {European Language Resources Association},
  address      = {Marseille, France},
  doi          = {},
}

Also make sure to cite the following papers if you use SemanticScholar data:

@inproceedings{ammar-etal-2018-construction,
    title = "Construction of the Literature Graph in Semantic Scholar",
    author = "Ammar, Waleed  and
      Groeneveld, Dirk  and
      Bhagavatula, Chandra  and
      Beltagy, Iz",
    booktitle = "Proceedings of the 2018 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers)",
    month = jun,
    year = "2018",
    address = "New Orleans - Louisiana",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/N18-3011",
    doi = "10.18653/v1/N18-3011",
    pages = "84--91",
}
@inproceedings{lo-wang-2020-s2orc,
    title = "{S}2{ORC}: The Semantic Scholar Open Research Corpus",
    author = "Lo, Kyle  and Wang, Lucy Lu  and Neumann, Mark  and Kinney, Rodney  and Weld, Daniel",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.acl-main.447",
    doi = "10.18653/v1/2020.acl-main.447",
    pages = "4969--4983"
}

cs-insights-crawler's People

Contributors

dependabot[bot] · jpwahle · trannel


Forkers

icarlous

cs-insights-crawler's Issues

Cleanup & Misc

This issue keeps track of previous TODOs that were in the code.

  • Add the last missing webpage (with only 1 paper) for the paper download
  • Try to vectorize the paper download and improve it by bundling downloads (see the sketch below)
  • Try to vectorize the rule-based abstract extraction
  • Try to improve the rule-based abstract extraction. Look at the "long" abstracts and those where the start can be found but not the end.
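The bundling idea could look roughly like this. A minimal sketch, assuming a plain list of PDF URLs; the names (`download_pdf`, `download_all`) are hypothetical and not taken from the repository:

```python
# Hypothetical sketch: bundle paper downloads with a thread pool instead of
# fetching PDFs one by one. All names here are illustrative.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import requests


def download_pdf(url: str, out_dir: Path) -> Path:
    """Download a single PDF and return the path it was saved to."""
    target = out_dir / url.rsplit("/", 1)[-1]
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    target.write_bytes(response.content)
    return target


def download_all(urls: list[str], out_dir: Path, workers: int = 8) -> list[Path]:
    """Fetch many PDFs concurrently; a thread pool is enough for I/O-bound work."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: download_pdf(u, out_dir), urls))
```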

Add documentation for act and docker environment.

We should have documentation in the README.md file for running the GitHub pipeline locally with act, running the repository in Docker, and checking linting and typing before each commit using pre-commit, to maintain a clean code repository.

  • This should be done after issue #43
  • Documentation extended.

Step 6: Create visualization methods for the embeddings

With the embeddings in place, we can now compare and visualize the papers.

  • Write a function that determines the most similar and dissimilar papers (see the sketch after this list)
    • Check if the results make sense
  • Write a function that visualizes the embeddings in a 2D space using UMAP
    • Check if the results make sense
    • Check if there are clusters and if yes, determine what they could mean manually
    • Figure out a way to name those clusters automatically (topic modelling?)
    • Determine how long one visualization takes and decide further steps based on this
    • Save the visualizations
  • Determine whether the results are usable and how to proceed (is the model good/bad?)
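A minimal sketch of both functions, assuming `embeddings` is an `(n_papers, dim)` NumPy array with a parallel `titles` list; the function names are illustrative, not taken from the repository:

```python
# Illustrative sketch: similarity ranking plus a 2D UMAP projection.
import matplotlib.pyplot as plt
import numpy as np
import umap  # pip install umap-learn
from sklearn.metrics.pairwise import cosine_similarity


def most_and_least_similar(embeddings: np.ndarray, titles: list[str], k: int = 5):
    """Return the k most similar and k most dissimilar paper pairs."""
    sims = cosine_similarity(embeddings)
    rows, cols = np.triu_indices_from(sims, k=1)  # upper triangle, no self-pairs
    order = np.argsort(sims[rows, cols])          # ascending similarity
    pairs = [(titles[rows[i]], titles[cols[i]]) for i in order]
    return pairs[::-1][:k], pairs[:k]             # (most similar, most dissimilar)


def plot_embeddings_2d(embeddings: np.ndarray, out_file: str = "embeddings.png"):
    """Project the embeddings to 2D with UMAP and save a scatter plot."""
    coords = umap.UMAP(n_components=2).fit_transform(embeddings)
    plt.scatter(coords[:, 0], coords[:, 1], s=4)
    plt.savefig(out_file, dpi=200)
```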

Step 3: Apply topic modelling

We want to gain further insights regarding the topics and how they change using topic modelling.

  • Get topic modelling running on our data (https://radimrehurek.com/gensim/); see the sketch after this list
  • Hyperparameter tuning (number of topics, words per topic)
    • Minimum 1 topic per conference
    • Play around with the granularity
  • Visualize it with pyLDAvis for one venue/year
  • Alternatively, check out this post on how to extract and visualize topics using LDA: https://towardsdatascience.com/how-to-extract-labelled-topics-from-natural-language-data-8af121491bfd
  • Implement the presented solution
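A minimal gensim sketch for one venue/year; `docs` is a toy stand-in for the real tokenized abstracts, and `num_topics`/`passes` are the hyperparameters mentioned above:

```python
# Toy LDA sketch with gensim and a pyLDAvis export; inputs are placeholders.
import pyLDAvis
import pyLDAvis.gensim_models
from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["neural", "machine", "translation"], ["topic", "modelling", "survey"]]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# num_topics and passes are the main knobs to tune per venue/year.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10)

# Export an interactive topic visualization to HTML.
vis = pyLDAvis.gensim_models.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(vis, "lda_venue_year.html")
```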

Extend time range from 10 years to longest time series possible.

Is your feature request related to a problem? Please describe.
Currently, we are analyzing the last 10 years of ACL papers. We would like to extend this range back to the earliest paper published at (and available to us from) each conference.

Describe the solution you'd like
OCR the paper information for older publications.

Describe alternatives you've considered
If we see the OCR quality drop significantly for older papers we can limit it to ~30 years.

Step 1: Create a general pipeline to retrieve the complete dataset

The NLP Scholar dataset only seems to contain the titles and no other textual content of the papers (i.e., abstracts). We want to complete the dataset by adding the abstract for each paper.

  • Check the author's GitHub for a bigger dataset and code. He might have useful information there.
  • Download the offline AA dataset (https://www.aclweb.org/anthology/info/development/)
    • Analyse the offline AA dataset for differences compared to the web version
  • Analyse how we can access the abstracts for papers of different venues and how they are referenced
  • Decide on the format and categorisation (e.g. by year) to save our new dataset
    • Expand the dataset with abstracts
  • Write the code to extract the abstracts and create the new dataset (see the sketch after this list)
  • Save the PDFs of the accessed papers in a structured way (year/venue/...)
    • Get a disk for long-term storage
    • Save the PDFs on the disk
  • Pull abstracts directly from AA
  • Try out GROBID instead of tika
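For the tika route, the rule-based extraction could start roughly like this; the regex is a simplistic stand-in for the real rules (different venues will likely need different patterns):

```python
# Rough sketch: extract the abstract from a PDF with tika-python and a
# simple rule. The pattern here is deliberately naive.
import re

from tika import parser  # pip install tika


def extract_abstract(pdf_path: str):
    """Return the text between 'Abstract' and '1 Introduction', if found."""
    text = parser.from_file(pdf_path).get("content") or ""
    match = re.search(r"Abstract\s*(.+?)\n\s*1\s+Introduction", text, re.S)
    return match.group(1).strip() if match else None
```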

Comparison to NLPScholar using title + abstract

Description of the Epic's Goal
The goal of this epic has two parts:

  • to compare our analysis using titles and abstracts to NLPScholar, which uses titles only;
  • to formulate this as an interesting research question: What do abstracts provide in addition to titles alone? Which questions can we now answer that were not possible without abstracts? How can we categorize abstracts?

Issues to resolve the Epic

  • Title Unigrams
  • Title Bigrams
  • Abstracts Unigrams
  • Abstracts Bigrams
  • Title+Abstract Unigrams - tf-idf?
  • Title+Abstract Bigrams - tf-idf?
  • LDA between NLPS and NLPS2 over time

Improve test pipeline compatibility.

Is your feature request related to a problem? Please describe.
Right now, the tests and the dependency installation run in the same step.
Also, since we are using the self-hosted runner from our group (dke01), the tests no longer run locally with act. See 0f0ea13.

act -j typing --container-architecture linux/amd64
ERRO[0000] 'runs-on' key not defined in Tests/typing    
ERRO[0000] 'runs-on' key not defined in Tests/typing    
ERRO[0000] 'runs-on' key not defined in Tests/typing    

Describe the solution you'd like

  • Split dependency installation into a separate step.
  • Adjust the "runs-on" flag so that it runs with act locally again.
  • Add dependency caching so pipelines are faster.

Describe alternatives you've considered
Other ways of running the tests locally in Docker are welcome as suggestions in the comments. Our current solution should reproduce exactly the same tests that we run on GitHub.

Implement a database module that stores paper information to a database

Is your feature request related to a problem? Please describe.
The aggregated data used for our backend needs to be stored in a MongoDB. Currently, we store the papers as PDFs and other information in TXT or CSV files, which is not compatible.

Describe the solution you'd like
Implement a module in this repository that connects to a local MongoDB and stores the data for papers the same way as defined here.

The collection and document models of MongoDB should always match the backend repository.

If there are features that are not yet implemented in this repository (e.g., institutions), ignore them for now, but provide a general interface such that they can be included later (e.g., via a callback function).

There should be an environment variable to connect the CLI to an online MongoDB for later deployment. We won't use it now, but it will be important later.

Additional context
For example, there could be a file created under nlpland/modules/database.py where we connect to a local database, and store the paper information. For testing purposes, we can store a few papers from each conference and year.
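A possible shape for such a module; `MONGO_URI` and the database/collection names are assumptions for illustration, and the real document model has to match the backend repository:

```python
# Hypothetical sketch for nlpland/modules/database.py using pymongo.
import os

from pymongo import MongoClient


def get_collection():
    """Connect to the MongoDB given by MONGO_URI (defaults to a local instance)."""
    uri = os.environ.get("MONGO_URI", "mongodb://localhost:27017")
    return MongoClient(uri)["cs_insights"]["papers"]


def store_paper(paper: dict) -> None:
    """Upsert one paper document; keying by title is a placeholder choice."""
    get_collection().update_one(
        {"title": paper["title"]}, {"$set": paper}, upsert=True
    )
```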

Upgrade code quality to 10/10

Is your feature request related to a problem? Please describe.
The code quality after running pylint (see issue #11) is below 7/10.

Describe the solution you'd like
We need to upgrade the code quality to 10/10.

Extended paper information extraction

Description of the Epic's Goal
The goal of this epic is to extract missing core features for our following analysis such as the institution, location, etc.

Issues to resolve the Epic

Step 5: Create the first embeddings

Now that we have the dataset and have done some analysis on it, we want to look into the semantic side. For this, we first want to create some embeddings; a sketch follows the list below.

  • Decide on model for our first analysis
    • fastText, (GloVe or word2vec)
  • Train the model
  • Decide how we want to save the embeddings (in the dataset)
  • Create the embeddings for the papers (of one venue in one year, only using abstracts)
  • Save the embeddings
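A minimal gensim sketch of this step, training fastText on toy abstracts and averaging word vectors into per-paper embeddings; all inputs and file names are placeholders:

```python
# Toy fastText sketch with gensim; `abstracts` stands in for real tokenized text.
import numpy as np
from gensim.models import FastText

abstracts = [["we", "propose", "a", "model"], ["a", "survey", "of", "parsing"]]

model = FastText(sentences=abstracts, vector_size=100, window=5, min_count=1)


def paper_embedding(tokens):
    """Average the word vectors of one paper's tokens."""
    return np.mean([model.wv[t] for t in tokens], axis=0)


# One simple way to save the embeddings alongside the dataset.
embeddings = np.stack([paper_embedding(a) for a in abstracts])
np.save("embeddings.npy", embeddings)
```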

Outline a full overview for the project.

Before you develop anything, outline a plan (as detailed as possible) for everything you want to do in the project.
When you need our input or want to brainstorm, don't hesitate to make a small PowerPoint that we can discuss.

Step 4: Core analysis

These are the things we want to analyse using the expanded dataset. They might change over time.
We might generally want to look into the things NLP Scholar looked into, for comparison; a toy sketch of the word-frequency analysis follows the list below.

  • Most used words (see Step 2)
    • By conference
    • By author (first author or in author list)
      • By institution
    • Over time (10 years, percentage-based plot)
    • Also check for bigrams
  • Most studied topics (see Step 3)
    • By conference
    • By author (first author or in author list)
      • By institution
    • Over time (10 years)
    • Measured by citations
  • Number of publications of top-k, mid-k, and low-k authors (idea 💡)
    • (first author/in author list) comparison
    • By conference (which authors are publishing where?)
  • Map different identifiers for conferences together
  • Compare our results with those of NLP Scholar
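As a toy illustration of the "most used words over time" item, assuming the dataset were loaded into a pandas dataframe with `year` and `tokens` columns (an assumed format, not the repository's actual one):

```python
# Toy sketch: percentage of papers per year that contain a given word.
import pandas as pd

df = pd.DataFrame({
    "year": [2019, 2019, 2020],
    "tokens": [["neural", "parsing"], ["neural", "survey"], ["prompt", "survey"]],
})


def word_share_per_year(frame: pd.DataFrame, word: str) -> pd.Series:
    """Share (in %) of papers per year whose tokens contain `word`."""
    return frame.groupby("year")["tokens"].apply(
        lambda col: 100 * sum(word in tokens for tokens in col) / len(col)
    )


print(word_share_per_year(df, "neural"))  # 2019: 100.0, 2020: 0.0
```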

(Dis)similarity of features

Description of the Epic's Goal
The goal of this epic is to understand how (dis)similar authors, topics, institutions, and venues are w.r.t. statistical and semantic features.

Issues to resolve the Epic

  • #issuenumber

Write unit tests for 95%+ coverage

Is your feature request related to a problem? Please describe.
We are lacking unit tests for all code written so far. After issue #11 the tests obviously fail with 2% coverage.

Describe the solution you'd like
Design unit tests that test all functions. Do not ignore code. We need a coverage of 95%+. If you think the code can't be tested, talk to @jpelhaW or @truas.

Describe alternatives you've considered
Well, some code simply cannot be tested in a closed Docker environment (either it takes too long, requires too many resources, or the response times for requests are too long). Please note down in the comments what we did not test here, as documentation.
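For orientation, a minimal pytest example; `count_unigrams` is a hypothetical function standing in for any repository code that needs coverage:

```python
# Hypothetical unit under test plus two pytest test cases.
def count_unigrams(tokens):
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    return counts


def test_count_unigrams():
    assert count_unigrams(["a", "b", "a"]) == {"a": 2, "b": 1}


def test_count_unigrams_empty():
    assert count_unigrams([]) == {}
```

Coverage itself can then be measured with pytest-cov, e.g. `poetry run pytest --cov`.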

Step 7: Semantic analysis of the papers

Now that we have usable results from the previous step, we want to analyse the papers in more depth.

  • Create the remaining embeddings for all top-tier conference papers published (2010-2020)
  • Run the visualizations for each top-tier conference for each year (2010-2020)
  • Compare how the topics/clusters shift over time and in between venues
    • Do the clusters differ between venues?
    • Are there trend-setters among the venues?
    • How do they shift over time?
  • Expand the function using UMAP to color-code different venues (all venues in one visualization); see the sketch after this list
    • Run the visualizations again for 2010-2020
  • Compare results with previous analysis and NLP scholar
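A sketch of the color-coded variant, continuing the Step 6 sketch; `embeddings` and the parallel `venues` list are assumed inputs:

```python
# Illustrative sketch: one UMAP projection with a color per venue.
import matplotlib.pyplot as plt
import numpy as np
import umap


def plot_venues(embeddings: np.ndarray, venues: list[str], out_file: str) -> None:
    """Scatter all papers in one 2D plot, color-coded by venue."""
    coords = umap.UMAP(n_components=2).fit_transform(embeddings)
    for venue in sorted(set(venues)):
        mask = np.array([v == venue for v in venues])
        plt.scatter(coords[mask, 0], coords[mask, 1], s=4, label=venue)
    plt.legend()
    plt.savefig(out_file, dpi=200)
```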

Step 2: First look into the data

First, we should take a look at the data by analysing keywords and using tf-idf; a small tf-idf sketch follows the list below.

  • Determine the top 20 words (unigrams) per conference
  • Determine the top 20 bigrams per conference
  • Do the same with tf-idf
  • Implement scattertext for comparisons
  • Compare our results with those of NLP Scholar
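A small tf-idf sketch for the list above, assuming each conference's titles/abstracts were concatenated into one text (a hypothetical input format):

```python
# Toy sketch: top tf-idf terms (unigrams and bigrams) per conference.
from sklearn.feature_extraction.text import TfidfVectorizer

texts_by_conf = {
    "ACL": "neural parsing neural translation",
    "LREC": "corpus annotation corpus resources",
}

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
matrix = vectorizer.fit_transform(texts_by_conf.values()).toarray()
terms = vectorizer.get_feature_names_out()

for conf, row in zip(texts_by_conf, matrix):
    top = row.argsort()[::-1][:20]
    print(conf, [terms[i] for i in top if row[i] > 0])
```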

Consistent static typing

Is your feature request related to a problem? Please describe.
After issue #11, our type checker pyright has many complaints.

Describe the solution you'd like
We need to fix all issues that pyright detected for a repo with consistent typing.

Describe alternatives you've considered
Some of the things pyright detects are unnecessary. Please talk to @jpelhaW or @truas to see whether we need to fix or ignore them.
