
inspirehep / hepcrawl


Scrapy project for feeds into INSPIRE-HEP

Home Page: http://inspirehep.net

License: Other

Languages: Python 94.98%, HTML 4.27%, Shell 0.40%, C 0.36%
Topics: crawler, harvest-data, publishing, python

hepcrawl's Introduction

Inspirehep

Prerequisites

Python

Python 3.9

You can also use pyenv to manage your Python installations. Simply follow its instructions and set the global version to 3.9.
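
For example, once pyenv itself is installed (the exact patch release does not matter):

$ pyenv install 3.9.18
$ pyenv global 3.9.18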

Debian / Ubuntu

$ sudo apt-get install python3 build-essential python3-dev

MacOS

$ brew install postgresql@14 libmagic openssl@3 openblas python

nodejs & npm using nvm

Please follow the instructions https://github.com/nvm-sh/nvm#installing-and-updating

We're using v20.0.0 (the first version installed becomes the default)

$ nvm install 20.0.0
$ nvm use 20.0.0

yarn

Debian / Ubuntu

Please follow the instructions https://classic.yarnpkg.com/en/docs/install/#debian-stable

MacOS

$ brew install yarn

poetry

Install Poetry following https://python-poetry.org/docs/

$ curl -sSL https://install.python-poetry.org | python3 -

pre-commit

Install pre-commit https://pre-commit.com/

$ curl https://pre-commit.com/install-local.py | python -

And run

$ pre-commit install

Docker & Docker Compose

The topology of docker-compose is shown in a diagram in the repository.

Follow the guide https://docs.docker.com/compose/install/

For MacOS users

General

Turn off the AirPlay Receiver under System Preferences -> Sharing -> AirPlay Receiver. Otherwise, you will run into problems with port 5000 already being in use. See this for more information.
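
If in doubt, you can check which process is holding the port, e.g.:

$ lsof -i :5000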

M1 users

Install Homebrew-file https://homebrew-file.readthedocs.io/en/latest/installation.html

$ brew install rcmdnk/file/brew-file

And run

$ brew file install

Run with docker

Make

This will prepare the whole INSPIRE development setup with demo records:

make run
make setup

You can stop it by simply running

make stop

Alternatively you can follow the steps:

Step 1: In a terminal run

docker-compose up

Step 2: In another terminal run

docker-compose exec hep-web ./scripts/setup

Step 3: Import records

docker-compose exec hep-web inspirehep importer demo-records

Usage

inspirehep should now be available under http://localhost:8080


Run locally

Backend

$ cd backend
$ poetry install

UI

$ cd ui
$ yarn install

Editor

$ cd record-editor
$ yarn install

Setup

First you need to start all the services (PostgreSQL, Redis, Elasticsearch, RabbitMQ):

$ docker-compose -f docker-compose.services.yml up es mq db cache

And initialize the database, Elasticsearch, RabbitMQ, Redis and S3:

$ cd backend
$ ./scripts/setup

Note that the S3 configuration requires the default region to be set to us-east-1. If you have a different default in your AWS config (~/.aws/config), you need to update it!

Also, to enable fulltext indexing & highlighting the following feature flags must be set to true:

FEATURE_FLAG_ENABLE_FULLTEXT = True
FEATURE_FLAG_ENABLE_FILES = True

Run

Backend

You can visit the backend at http://localhost:8000

$ cd backend
$ ./scripts/server

UI

You can visit the UI at http://localhost:3000

$ cd ui
$ yarn start

Editor

$ cd record-editor
$ yarn start

You can also connect the UI to another environment by changing the proxy in ui/setupProxy.js:

proxy({
  target: 'http://A_PROXY_SERVER',
  ...
});

How to test

Backend

The backend tests locally use testmon to only run tests that depend on code that has changed (after the first run) by default:

$ cd backend
$ poetry run ./run-tests.sh

If you pass the --all flag to the run-tests.sh script, all tests will be run (this is equivalent to the --testmon-noselect flag). All other flags passed to the script are transferred to py.test, so you can do things like

$ poetry run ./run-tests.sh --pdb -k test_failing

You'll need to run all tests or force test selection (e.g. with -k) in a few cases:

  • an external dependency has changed, and you want to make sure that it doesn't break the tests (as testmon doesn't track external deps)
  • you manually change a test fixture in a non-python file (as testmon only tracks python imports, not external data)

If you want to invoke py.test directly but still want to use testmon, you'll need to use the --testmon --no-cov flags:

$ poetry run py.test tests/integration/records --testmon --no-cov

If you want to disable testmon test selection but still perform collection (to update test dependencies), use --testmon-noselect --no-cov instead.
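
For example:

$ poetry run py.test tests/integration/records --testmon-noselect --no-cov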

Note that testmon is only used locally to speed up tests; it is not used in the CI, in order to be completely sure that all tests pass before merging a commit.

SNow integration tests

If you wish to modify the SNow integration tests, you have to set the following variables in the SNow config file:

 SNOW_CLIENT_ID
 SNOW_CLIENT_SECRET
 SNOW_AUTH_URL

The secrets can be found in the inspirehep QA or PROD sealed secrets. After setting the variables, run the tests, so the cassettes get generated.

Before you push, don't forget to delete the secrets from the config file!

UI

$ cd ui
$ yarn test # runs everything (lint, bundlesize etc.), identical to CI
$ yarn test:unit # will open jest on watch mode

Note that Jest automatically runs the tests affected by changed (unstaged) files.

cypress (e2e)

Runs everything from scratch, identical to CI

$ sh cypress-tests-chrome.sh
$ sh cypress-tests-firefox.sh

Opens the Cypress runner GUI and runs the tests against the local dev server (localhost:8080)

$ cd e2e
$ yarn test:dev
$ yarn test:dev --env inspirehep_url=<any url that serves inspirehep ui>

visual tests

Visual tests are run only in headless mode, so yarn test:dev, which uses the headed browser, will ignore them. Running existing visual tests and updating/creating snapshots requires the cypress-tests.sh script.

For continuous runs (when the local DB is running and has the required records, etc.), the script can be reduced to only its last part: sh cypress-tests-run.sh.

If required, tests can run against localhost:3000 by simply modifying the --host option in cypress-tests-run.sh.

working with (visual) tests more efficiently

You may not always need to run tests exactly like on the CI environment.

  • To run a specific suite, temporarily change the test script in e2e/package.json to cypress run --spec cypress/integration/<spec.test.js>

How to import records

First make sure that you are running:

$ cd backend
$ ./scripts/server

There is a command inspirehep importer records which accepts a url (-u), a directory of JSON files (-d) and individual JSON files (-f). A selection of demo records can be found in the data directory; they are structured based on the record type (i.e. literature). Examples:

With url

# Local
$ poetry run inspirehep importer records -u https://inspirehep.net/api/literature/20 -u https://inspirehep.net/api/literature/1726642
# Docker
$ docker-compose exec hep-web inspirehep importer records -u https://inspirehep.net/api/literature/20 -u https://inspirehep.net/api/literature/1726642

# `--save` will save the imported record also to the data folder
$ <...> inspirehep importer records -u https://inspirehep.net/api/literature/20 --save

A valid --token or backend/inspirehep/config.py:AUTHENTICATION_TOKEN is required.
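
For example (placeholder token value; the exact placement of the --token flag may differ):

$ poetry run inspirehep importer records -u https://inspirehep.net/api/literature/20 --token <YOUR_TOKEN>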

With directory

# Local
$ poetry run inspirehep importer records -d data/records/literature
# Docker
$ docker-compose exec hep-web inspirehep importer records -d data/records/literature

With files

# Local
$ poetry run inspirehep importer records -f data/records/literature/374836.json -f data/records/authors/999108.json
# Docker
$ docker-compose exec hep-web inspirehep importer records -f data/records/literature/374836.json -f data/records/authors/999108.json

All records

# Local
$ poetry run inspirehep importer demo-records
# Docker
$ docker-compose exec hep-web inspirehep importer demo-records

hepcrawl's People

Contributors

ammirate, bittirousku, chris-asl, david-caro, drjova, eamonnmag, fschwenn, glignos, jacquerie, jalavik, kaplun, ksachs, lilykos, michamos, miguelgrc, mihaibivol, mjedr, nooraangelva, oguzdemirbasci, pazembrz, rikirenz, spirosdelviniotis, szymonlopaciuk, tsgit, turtle321, vbalbp


hepcrawl's Issues

All spiders: jobs

Getting random errors related to spider.state and requests.queue in jobs/.

Examples:

  • PicklingError: Can't pickle <type 'instancemethod'>: attribute lookup __builtin__.instancemethod failed
  • CRITICAL: Unhandled Error for twisted.
    Example log when running APS spider:
    APS_log.txt

Errors go away when you delete the jobs/ directory. I'll add more when I see them.

I'll also try to reproduce these, but so far they come when you least expect them...

APS: also scrape references

The APS crawler is not scraping references, as the JSON API does not seem to provide them:
http://harvest.aps.org/docs/harvest-api#example

So in this case the XML output might be a better source, as it seems to have references. See the fulltext XML example here: http://harvest.aps.org/docs/harvest-api#retrieve-all-open-access-articles

Since the crawler already works well with the JSON format, we can adjust it to yield a secondary request per record to the XML endpoint (basically the fulltext with Accept: text/xml).
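
Sketched in Scrapy terms inside the existing APS spider, this could look roughly like the following (the helper and callback names are made up for illustration; the Accept: text/xml follow-up request is the point here):

import scrapy


class APSSpider(scrapy.Spider):
    name = 'APS'

    def parse(self, response):
        # Existing JSON-based parsing (hypothetical helper name).
        record = self.build_record_from_json(response)
        # Secondary request per record: re-fetch the article as XML,
        # which carries the references the JSON API does not expose.
        yield scrapy.Request(
            response.url,
            headers={'Accept': 'text/xml'},
            callback=self.parse_references,
            meta={'record': record},
            dont_filter=True,  # same URL, different representation
        )

    def parse_references(self, response):
        record = response.meta['record']
        # ... extract references from the XML body and attach them to the record ...
        yield record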

pipelines.py: JsonWriterPipeline

In process_item() on line 57:
line = json.dumps(dict(item), indent=4) + ",\n"

",\n" prints commas between records, but also after the last record. Result is not valid JSON.

OSTI spider

Expected Behavior

We should be able to harvest single records from OSTI, which often contains interesting information.

Current Behavior

We cannot harvest from OSTI.

Tests: move reusable code into testlib

In order to re-use existing code for the tests, we have to move some functions under the hepcrawl/testlib/fixtures.py module.

More specifically the functions that can be moved under hepcrawl/testlib module are:

  • tests/functional/WSP/test_wsp.py:get_crawler_instance.
  • tests/functional/WSP/test_wsp.py:expected_results.

In addition, those functions could be made more generic, not touching only the unit test folders.

  • hepcrawl/testlib/fixtures.py:get_responses_path to hepcrawl/testlib/fixtures.py:get_test_suite_path

Using refextract for unstructured references

When the metadata for an article includes references, but only in an unstructured way, refextract should be used in the workflow after the individual spider (pipeline.py?).

At the moment refextract is only called if a fulltext is attached. But this won't be the case for all records. And in some cases, even with a fulltext, it's better to start from a list of individual unstructured references than from the complete PDF, where refextract first has to find such a list.
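
A rough sketch of what that could look like, assuming the raw reference strings have already been collected from the publisher metadata (the helper name and input format are hypothetical; extract_references_from_string is the relevant refextract entry point):

from refextract import extract_references_from_string


def extract_structured_references(raw_references):
    """Turn plain-text reference strings into structured reference dicts."""
    structured = []
    for raw in raw_references:
        # refextract returns a list of dicts (journal, volume, page, DOI, ...)
        # for each block of unstructured reference text.
        structured.extend(extract_references_from_string(raw))
    return structured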

Unit tests: mock the connection to external services

  • tests/unit/test_alpha.py
  • tests/unit/test_aps.py
  • tests/unit/test_arxiv_all.py
  • tests/unit/test_arxiv_single.py
  • tests/unit/test_base.py
  • tests/unit/test_brown.py
  • tests/unit/test_dnb.py
  • tests/unit/test_edp.py
  • tests/unit/test_elsevier.py
  • tests/unit/test_hindawi.py
  • tests/unit/test_infn.py
  • tests/unit/test_magic.py
  • tests/unit/test_mit.py
  • tests/unit/test_phenix.py
  • tests/unit/test_phil.py
  • tests/unit/test_pipelines.py
  • tests/unit/test_pos.py
  • tests/unit/test_t2k.py
  • tests/unit/test_utils.py
  • tests/unit/test_world_scientific.py

unit tests: create environment handler fixture

Currently the unit tests that depend on the pipeline module to generate the records are implicitly depending on the test tests/unit/test_pipelines.py::test_prepare_payload to set some env variables for them before running (meaning that if you try to run them without running that one first, they will fail).

We should refactor it and add a fixture that properly sets up and cleans up the env variables (we might want mock.patch).
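
A possible shape for such a fixture, using mock.patch.dict as suggested above (the variable names and values are placeholders; use whatever test_prepare_payload currently sets):

import os
from unittest import mock

import pytest


@pytest.fixture
def crawler_env():
    """Set the env variables the pipeline needs and restore the environment afterwards."""
    env = {
        'SCRAPY_JOB': 'dummy-job-id',            # placeholder value
        'SCRAPY_FEED_URI': 'file:///tmp/feed',   # placeholder value
    }
    with mock.patch.dict(os.environ, env):
        yield env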

Deploy docker `hepcrawl_base` image to dockerhub

In order to speed up the CI (Travis) process we could implement a mechanism for deploying the Docker hepcrawl_base image to Docker Hub.

As a result, we will be able to stop building the hepcrawl_base image, which takes around 3 minutes.

identify the project in the user agent

The current INSPIRE user agent for FFT downloads is Invenio-1.1.2.1260-aa76f (+http://inspirehep.net; "HEP"). It might be good to have some variation of that format for hepcrawl so that publishers can whitelist us based on user agent.
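
In Scrapy this would be a one-line change in the project settings; the exact string below is only a suggestion:

# hepcrawl/settings.py -- version string and contact URL to be agreed upon
USER_AGENT = 'hepcrawl/0.1 (+http://inspirehep.net; "HEP")'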

WSP: fix local package crawling

The mechanism of the WSP spider that crawls locally stored packages (given a path) works for paths pointing to a single file, but not for paths pointing to a folder of files.

E.g: /path/to/file.zip works but /path/to/folder_with_zip_files/ doesn't.

Use material whenever possible

The arXiv spider harvests only preprints, so there should be material: preprint in the output JSON whenever there is a material field.
For the publishers, if we know for sure whether we are harvesting an article/erratum/addendum/reprint, the material should be set appropriately as well.

DESY FTP

During the INSPIRE Week it was agreed that DESY would make available through FTP the different feeds that are then loaded into INSPIRE.

I'd propose that the FTP be divided into one directory per feed.

@ksachs @fschwenn can you detail which feeds you would actually put there? I guess a spider per feed will need to be written, correct?

Use the full pipeline output on the wsp tests

As we do for arXiv, we should be using the full pipeline output instead of just the spider output; that way all the tests will really be checking that the produced data is schema compliant.

Add missing crawler2hep unit tests

The responses are there, but they don't seem to be used anywhere:

dcaro@pcrcssis001$ ll tests/unit/responses/crawler2hep/
total 32
drwxrwxr-x.  2 dcaro dcaro 4096 May 22 17:47 .
drwxrwxr-x. 21 dcaro dcaro 4096 May 22 17:47 ..
-rw-rw-r--.  1 dcaro dcaro 4690 May 22 17:47 in_generic_crawler_record.yaml
-rw-rw-r--.  1 dcaro dcaro 4615 May 22 17:47 in_no_document_type.yaml
-rw-rw-r--.  1 dcaro dcaro 2538 May 22 17:47 out_generic_crawler_record.yaml
-rw-rw-r--.  1 dcaro dcaro 2490 May 22 17:47 out_no_document_type.yaml

Add mechanism for crawling only once

We have to find a way not to crawl the same records many times.

Expected Behavior

We are going to extend the scrapy-crawl-once plug-in.

Current Behavior

Hepcrawl re-crawls records generated from previous executions.

Steps to Reproduce (for bugs)

  1. Adapt scrapy-crawl-once plug-in to Hepcrawl.
  2. Extend the scrapy-crawl-once plug-in in a way that stores in the DB a key-value record for every request. As key we have the unique file name (FTP-FILE requests) or the unique id in the parameters (HTTP-HTTPS requests). As value we store the last-modified time stamp (FTP-FILE requests) or the crawling time stamp (HTTP-HTTPS requests).

Context

We are trying to crawl every record only once.
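
For reference, wiring up the stock scrapy-crawl-once plug-in looks roughly like this (the middleware paths and the crawl_once meta key come from the plug-in's documentation; the custom key/value storage described in the steps above would be built on top of it):

# settings.py: enable the scrapy-crawl-once middlewares
SPIDER_MIDDLEWARES = {
    'scrapy_crawl_once.CrawlOnceMiddleware': 100,
}
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawl_once.CrawlOnceMiddleware': 50,
}

# In a spider (scrapy already imported), mark the requests that should
# only ever be crawled once:
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, meta={'crawl_once': True})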


Conference proceedings

For certain feeds, the actual content is a conference proceeding. When this is known:

  • the proper document type should be set
  • the name of the conference (or any other information) that is usually available at the upper package level must be stored inside each record.

Introduce functional tests for WSP spider

  • disable passive ftp mode for WSP spider.
  • dockerize environment for functional tests (Dockerfile for hepcrawl, docker-compose files).
  • dockerize FTPServer with needed fixtures for the WSP's functional tests.
  • 'mock' celery tasks to catch outgoing tasks to Inspire.
  • create WSP functional test.
  • WSP functional test to travis.
  • dockerize execution on travis for unit tests and docs.
