
inspirehep / hepcrawl


Scrapy project for feeds into INSPIRE-HEP

Home Page: http://inspirehep.net

License: Other

Languages: Python 94.98%, HTML 4.27%, Shell 0.40%, C 0.36%
Topics: crawler, harvest-data, publishing, python

hepcrawl's Introduction

Inspirehep

Prerequisites

Python

Python 3.9

You can also use pyenv to manage your Python installations. Simply follow its instructions and set the global version to 3.9.
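
For example, once pyenv itself is installed (the exact patch release does not matter):

$ pyenv install 3.9.18
$ pyenv global 3.9.18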

Debian / Ubuntu

$ sudo apt-get install python3 build-essential python3-dev

MacOS

$ brew install postgresql@14 libmagic openssl@3 openblas python

nodejs & npm using nvm

Please follow the instructions https://github.com/nvm-sh/nvm#installing-and-updating

We're using v20.0.0 (the first version installed becomes the default)

$ nvm install 20.0.0
$ nvm use 20.0.0

yarn

Debian / Ubuntu

Please follow the instructions https://classic.yarnpkg.com/en/docs/install/#debian-stable

MacOS

$ brew install yarn

poetry

Install Poetry following https://python-poetry.org/docs/

$ curl -sSL https://install.python-poetry.org | python3 -

pre-commit

Install pre-commit https://pre-commit.com/

$ curl https://pre-commit.com/install-local.py | python -

And run

$ pre-commit install

Docker & Docker Compose

The topology of docker-compose is shown in a diagram in the repository.

Follow the guide https://docs.docker.com/compose/install/

For MacOS users

General

Turn off the AirPlay Receiver under System Preferences -> Sharing -> AirPlay Receiver. Otherwise, you will run into problems with port 5000 already being in use. See this for more information.
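
If in doubt, you can check which process is holding the port, e.g.:

$ lsof -i :5000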

M1 users

Install Homebrew-file https://homebrew-file.readthedocs.io/en/latest/installation.html

$ brew install rcmdnk/file/brew-file

And run

$ brew file install

Run with docker

Make

This will prepare the whole INSPIRE development setup with demo records:

make run
make setup

You can stop it by simply running

make stop

Alternatively you can follow the steps:

Step 1: In a terminal run

docker-compose up

Step 2: In another terminal run

docker-compose exec hep-web ./scripts/setup

Step 3: Import records

docker-compose exec hep-web inspirehep importer demo-records

Usage

inspirehep should now be available under http://localhost:8080


Run locally

Backend

$ cd backend
$ poetry install

UI

$ cd ui
$ yarn install

Editor

$ cd record-editor
$ yarn install

Setup

First you need to start all the services (PostgreSQL, Redis, Elasticsearch, RabbitMQ):

$ docker-compose -f docker-compose.services.yml up es mq db cache

And initialize the database, Elasticsearch, RabbitMQ, Redis and S3:

$ cd backend
$ ./scripts/setup

Note that the S3 configuration requires the default region to be set to us-east-1. If you have a different default in your AWS config (~/.aws/config), you need to update it!

Also, to enable fulltext indexing & highlighting the following feature flags must be set to true:

FEATURE_FLAG_ENABLE_FULLTEXT = True
FEATURE_FLAG_ENABLE_FILES = True

Run

Backend

You can visit the backend at http://localhost:8000

$ cd backend
$ ./scripts/server

UI

You can visit the UI at http://localhost:3000

$ cd ui
$ yarn start

Editor

$ cd record-editor
$ yarn start

You can also connect the UI to another environment by changing the proxy in ui/setupProxy.js:

proxy({
  target: 'http://A_PROXY_SERVER',
  ...
});

How to test

Backend

The backend tests locally use testmon to only run tests that depend on code that has changed (after the first run) by default:

$ cd backend
$ poetry run ./run-tests.sh

If you pass the --all flag to the run-tests.sh script, all tests will be run (this is equivalent to the --testmon-noselect flag). All other flags passed to the script are transferred to py.test, so you can do things like

$ poetry run ./run-tests.sh --pdb -k test_failing

You'll need to run all tests or force test selection (e.g. with -k) in a few cases:

  • an external dependency has changed, and you want to make sure that it doesn't break the tests (as testmon doesn't track external deps)
  • you manually change a test fixture in a non-python file (as testmon only tracks python imports, not external data)

If you want to invoke py.test directly but still want to use testmon, you'll need to use the --testmon --no-cov flags:

$ poetry run py.test tests/integration/records --testmon --no-cov

If you want to disable testmon test selection but still perform collection (to update test dependencies), use --testmon-noselect --no-cov instead.
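
For example:

$ poetry run py.test tests/integration/records --testmon-noselect --no-cov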

Note that testmon is only used locally to speed up tests; it is not used in the CI, in order to be completely sure that all tests pass before merging a commit.

SNow integration tests

If you wish to modify the SNow integration tests, you have to set the following variables in the SNow config file:

 SNOW_CLIENT_ID
 SNOW_CLIENT_SECRET
 SNOW_AUTH_URL

The secrets can be found in the inspirehep QA or PROD sealed secrets. After setting the variables, run the tests, so the cassettes get generated.

Before you push, don't forget to delete the secrets from the config file!

UI

$ cd ui
$ yarn test # runs everything (lint, bundlesize etc.), identical to CI
$ yarn test:unit # will open jest on watch mode

Note that Jest automatically runs the tests affected by changed (unstaged) files.

cypress (e2e)

Runs everything from scratch, identical to CI

$ sh cypress-tests-chrome.sh
$ sh cypress-tests-firefox.sh

Opens the Cypress runner GUI and runs the tests against the local dev server (localhost:8080)

$ cd e2e
$ yarn test:dev
$ yarn test:dev --env inspirehep_url=<any url that serves inspirehep ui>

visual tests

Visual tests are run only in headless mode, so yarn test:dev, which uses the headed browser, will ignore them. Running existing visual tests and updating/creating snapshots requires the cypress-tests.sh script.

For continuous runs (when the local DB is running and has the required records, etc.), the script can be reduced to only its last part: sh cypress-tests-run.sh.

If required, tests can run against localhost:3000 by simply modifying the --host option in cypress-tests-run.sh.

working with (visual) tests more efficiently

You may not always need to run tests exactly like on the CI environment.

  • To run a specific suite, temporarily change the test script in e2e/package.json to cypress run --spec cypress/integration/<spec.test.js>

How to import records

First make sure that you are running:

$ cd backend
$ ./scripts/server

There is a command inspirehep importer records which accepts a url (-u), a directory of JSON files (-d) and individual JSON files (-f). A selection of demo records can be found in the data directory; they are structured based on the record type (i.e. literature). Examples:

With url

# Local
$ poetry run inspirehep importer records -u https://inspirehep.net/api/literature/20 -u https://inspirehep.net/api/literature/1726642
# Docker
$ docker-compose exec hep-web inspirehep importer records -u https://inspirehep.net/api/literature/20 -u https://inspirehep.net/api/literature/1726642

# `--save` will save the imported record also to the data folder
$ <...> inspirehep importer records -u https://inspirehep.net/api/literature/20 --save

A valid --token or backend/inspirehep/config.py:AUTHENTICATION_TOKEN is required.
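
For example (placeholder token value; the exact placement of the --token flag may differ):

$ poetry run inspirehep importer records -u https://inspirehep.net/api/literature/20 --token <YOUR_TOKEN>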

With directory

# Local
$ poetry run inspirehep importer records -d data/records/literature
# Docker
$ docker-compose exec hep-web inspirehep importer records -d data/records/literature

With files

# Local
$ poetry run inspirehep importer records -f data/records/literature/374836.json -f data/records/authors/999108.json
# Docker
$ docker-compose exec hep-web inspirehep importer records -f data/records/literature/374836.json -f data/records/authors/999108.json

All records

# Local
$ poetry run inspirehep importer demo-records
# Docker
$ docker-compose exec hep-web inspirehep importer demo-records

hepcrawl's People

Contributors

ammirate, bittirousku, chris-asl, david-caro, drjova, eamonnmag, fschwenn, glignos, jacquerie, jalavik, kaplun, ksachs, lilykos, michamos, miguelgrc, mihaibivol, mjedr, nooraangelva, oguzdemirbasci, pazembrz, rikirenz, spirosdelviniotis, szymonlopaciuk, tsgit, turtle321, vbalbp


hepcrawl's Issues

All spiders: jobs

Getting random errors related to spider.state and requests.queue in jobs/.

Examples:

  • PicklingError: Can't pickle <type 'instancemethod'>: attribute lookup __builtin__.instancemethod failed
  • CRITICAL: Unhandled Error for twisted.
    Example log when running APS spider:
    APS_log.txt

Errors go away when you delete the jobs/ directory. I'll add more when I see them.

I'll also try to reproduce these, but so far they come when you least expect them...

APS: also scrape references

The APS crawler is not scraping references, as the JSON API does not seem to provide them:
http://harvest.aps.org/docs/harvest-api#example

So in this case the XML output might be a better source, as it seems to have references. See the fulltext XML example here: http://harvest.aps.org/docs/harvest-api#retrieve-all-open-access-articles

Since the crawler already works well with the JSON format, we can adjust it to yield a secondary request per record to the XML endpoint (basically the fulltext with Accept: text/xml).
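
Sketched in Scrapy terms inside the existing APS spider, this could look roughly like the following (the helper and callback names are made up for illustration; the Accept: text/xml follow-up request is the point here):

import scrapy


class APSSpider(scrapy.Spider):
    name = 'APS'

    def parse(self, response):
        # Existing JSON-based parsing (hypothetical helper name).
        record = self.build_record_from_json(response)
        # Secondary request per record: re-fetch the article as XML,
        # which carries the references the JSON API does not expose.
        yield scrapy.Request(
            response.url,
            headers={'Accept': 'text/xml'},
            callback=self.parse_references,
            meta={'record': record},
            dont_filter=True,  # same URL, different representation
        )

    def parse_references(self, response):
        record = response.meta['record']
        # ... extract references from the XML body and attach them to the record ...
        yield record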

pipelines.py: JsonWriterPipeline

In process_item() on line 57:
line = json.dumps(dict(item), indent=4) + ",\n"

",\n" prints commas between records, but also after the last record. Result is not valid JSON.

OSTI spider

Expected Behavior

We should be able to harvest single records from OSTI, which often contains interesting information.

Current Behavior

We cannot harvest from OSTI.

Tests: move reusable code into testlib

In order to re-use existing code for the tests, we have to move some functions under the hepcrawl/testlib/fixtures.py module.

More specifically the functions that can be moved under hepcrawl/testlib module are:

  • tests/functional/WSP/test_wsp.py:get_crawler_instance.
  • tests/functional/WSP/test_wsp.py:expected_results.

In addition, those functions could be made more generic, not touching only the unit test folders.

  • hepcrawl/testlib/fixtures.py:get_responses_path to hepcrawl/testlib/fixtures.py:get_test_suite_path

Using refextract for unstructured references

When the metadata for an article includes references, but only in an unstructured way, refextract should be used in the workflow after the individual spider (pipeline.py?).

At the moment refextract is only called if a fulltext is attached. But this won't be the case for all records. And in some cases, even with a fulltext, it's better to start from a list of individual unstructured references than from the complete PDF, where refextract first has to find such a list.
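
A rough sketch of what that could look like, assuming the raw reference strings have already been collected from the publisher metadata (the helper name and input format are hypothetical; extract_references_from_string is the relevant refextract entry point):

from refextract import extract_references_from_string


def extract_structured_references(raw_references):
    """Turn plain-text reference strings into structured reference dicts."""
    structured = []
    for raw in raw_references:
        # refextract returns a list of dicts (journal, volume, page, DOI, ...)
        # for each block of unstructured reference text.
        structured.extend(extract_references_from_string(raw))
    return structured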

Unit tests: mock the connection to external services

  • tests/unit/test_alpha.py
  • tests/unit/test_aps.py
  • tests/unit/test_arxiv_all.py
  • tests/unit/test_arxiv_single.py
  • tests/unit/test_base.py
  • tests/unit/test_brown.py
  • tests/unit/test_dnb.py
  • tests/unit/test_edp.py
  • tests/unit/test_elsevier.py
  • tests/unit/test_hindawi.py
  • tests/unit/test_infn.py
  • tests/unit/test_magic.py
  • tests/unit/test_mit.py
  • tests/unit/test_phenix.py
  • tests/unit/test_phil.py
  • tests/unit/test_pipelines.py
  • tests/unit/test_pos.py
  • tests/unit/test_t2k.py
  • tests/unit/test_utils.py
  • tests/unit/test_world_scientific.py

unit tests: create environment handler fixture

Currently the unit tests that depend on the pipeline module to generate the records are implicitly depending on the test tests/unit/test_pipelines.py::test_prepare_payload to set some env variables for them before running (meaning that if you try to run them without running that one first, they will fail).

We should refactor it and add a fixture that properly sets up and cleans up the env variables (we might want mock.patch).
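
A possible shape for such a fixture, using mock.patch.dict as suggested above (the variable names and values are placeholders; use whatever test_prepare_payload currently sets):

import os
from unittest import mock

import pytest


@pytest.fixture
def crawler_env():
    """Set the env variables the pipeline needs and restore the environment afterwards."""
    env = {
        'SCRAPY_JOB': 'dummy-job-id',            # placeholder value
        'SCRAPY_FEED_URI': 'file:///tmp/feed',   # placeholder value
    }
    with mock.patch.dict(os.environ, env):
        yield env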

Deploy docker `hepcrawl_base` image to dockerhub

In order to speed up the CI (Travis) process we could implement a mechanism for deploying the Docker hepcrawl_base image to Docker Hub.

As a result, we will be able to stop building the hepcrawl_base image, which takes around 3 minutes.

identify the project in the user agent

The current INSPIRE user agent for FFT downloads is Invenio-1.1.2.1260-aa76f (+http://inspirehep.net; "HEP"). It might be good to have some variation of that format for hepcrawl so that publishers can whitelist us based on user agent.
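
In Scrapy this would be a one-line change in the project settings; the exact string below is only a suggestion:

# hepcrawl/settings.py -- version string and contact URL to be agreed upon
USER_AGENT = 'hepcrawl/0.1 (+http://inspirehep.net; "HEP")'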

WSP: fix local package crawling

The mechanism of the WSP spider that crawls locally stored packages (given a path) works for paths pointing to a single file, but not for paths pointing to a folder of files.

E.g: /path/to/file.zip works but /path/to/folder_with_zip_files/ doesn't.

Use material whenever possible

The arXiv spider harvests only preprints, so there should be material: preprint in the output JSON whenever there is a material field.
For the publishers, if we know for sure whether we are harvesting an article/erratum/addendum/reprint, the material should be set appropriately as well.

DESY FTP

During the INSPIRE Week it was agreed that DESY would make available through FTP the different feeds that are then loaded into INSPIRE.

I'd propose that the FTP be divided into one directory per feed.

@ksachs @fschwenn can you detail which feeds you would actually put there? I guess a spider per feed will need to be written, correct?

Use the full pipeline output on the wsp tests

As we do for arXiv, we should be using the full pipeline output instead of just the spider output; that way all the tests will really be checking that the produced data is schema compliant.

Add missing crawler2hep unit tests

The responses are there, but they don't seem to be used anywhere:

dcaro@pcrcssis001$ ll tests/unit/responses/crawler2hep/
total 32
drwxrwxr-x.  2 dcaro dcaro 4096 May 22 17:47 .
drwxrwxr-x. 21 dcaro dcaro 4096 May 22 17:47 ..
-rw-rw-r--.  1 dcaro dcaro 4690 May 22 17:47 in_generic_crawler_record.yaml
-rw-rw-r--.  1 dcaro dcaro 4615 May 22 17:47 in_no_document_type.yaml
-rw-rw-r--.  1 dcaro dcaro 2538 May 22 17:47 out_generic_crawler_record.yaml
-rw-rw-r--.  1 dcaro dcaro 2490 May 22 17:47 out_no_document_type.yaml

Add mechanism for crawling only once

We have to find a way not to crawl the same records many times.

Expected Behavior

We are going to extend the scrapy-crawl-once plug-in.

Current Behavior

Hepcrawl re-crawls records generated from previous executions.

Steps to Reproduce (for bugs)

  1. Adapt scrapy-crawl-once plug-in to Hepcrawl.
  2. Extend the scrapy-crawl-once plug-in in a way that stores in the DB a key-value record for every request. As key we have the unique file name (FTP-FILE requests) or the unique id in the parameters (HTTP-HTTPS requests). As value we store the last-modified time stamp (FTP-FILE requests) or the crawling time stamp (HTTP-HTTPS requests).

Context

We are trying to crawl every record only once.
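
For reference, wiring up the stock scrapy-crawl-once plug-in looks roughly like this (the middleware paths and the crawl_once meta key come from the plug-in's documentation; the custom key/value storage described in the steps above would be built on top of it):

# settings.py: enable the scrapy-crawl-once middlewares
SPIDER_MIDDLEWARES = {
    'scrapy_crawl_once.CrawlOnceMiddleware': 100,
}
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawl_once.CrawlOnceMiddleware': 50,
}

# In a spider (scrapy already imported), mark the requests that should
# only ever be crawled once:
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, meta={'crawl_once': True})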


Conference proceedings

For certain feeds, the actual content is a conference proceeding. When this is known:

  • the proper document type should be set
  • the name of the conference (or any other information) that is usually available at the upper package level must be stored inside each record.

Introduce functional tests for WSP spider

  • disable passive ftp mode for WSP spider.
  • dockerize environment for functional tests (Dockerfile for hepcrawl, docker-compose files).
  • dockerize FTPServer with needed fixtures for the WSP's functional tests.
  • 'mock' celery tasks to catch outgoing tasks to Inspire.
  • create WSP functional test.
  • WSP functional test to travis.
  • dockerize execution on travis for unit tests and docs.
