internetarchive / fatcat
Perpetual Access To The Scholarly Record
Home Page: https://guide.fatcat.wiki
License: Other
As detected by fatcat datacite importer ("missed potential license"):
cc: @miku if you have time to add these, or I'll get around to adding them eventually
It would be nice to have a way for volunteer editors to find particular entities that need review or editing. The specific user request was for "a prioritized list of 'to be fixed' items that users can see."
Off the top of my head, one way to do this would be adding a numerical score or text flag to elasticsearch documents. When the full entity schema is transformed into this document, an automated sniff of "how good is this metadata" would happen and set this field appropriately. Editors can then search for entities to work on using the full faceting power of the index (eg, only specific journals, or fields, or languages, or time spans).
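As a rough sketch (field names and thresholds here are all made up, not the actual schema):

```python
# Hypothetical "metadata quality" sniff, run during the entity-to-
# elasticsearch transform. Field names and thresholds are illustrative.
def metadata_quality_flag(release: dict) -> str:
    score = 0
    score += 1 if release.get("title") else 0
    score += 1 if release.get("contribs") else 0
    score += 1 if release.get("container_id") else 0
    score += 1 if release.get("release_year") else 0
    score += 1 if release.get("abstracts") else 0
    if score >= 4:
        return "good"
    if score >= 2:
        return "fair"
    return "needs_review"   # editors could facet/search on this value
```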
The two clearest examples are creator entities (not showing releases) and release entities (not showing files/filesets/webcaptures).
The reason for this is that these preview pages are re-using the generic revision view pages, and revisions can't have inbound graph relations (only full entity identifiers do).
This could be worked around for, eg, the specific case of edit previews in the web interface, where there is a specific entity identifier in play, by doing the extra API calls to fetch inbound relations based on the identifier and "enriching" the revision with those fields.
An alternative, with less code complexity but poorer user experience, would be to show a warning or similar in the preview.
There should be a clear way to represent retractions (and/or "withdrawn") of works in fatcat.
My current proposal is to represent "retractions" as a separate release under the same work as the published release (with `release_status` "retraction"). All releases would gain a new `withdrawn_date` field (a la `release_date`), and this field would get updated for the published work, but the status would not be changed from `published`. Retractions are often published in the form of a note or letter, and given their own external identifiers, which is why they become a new full release. Un-published releases (eg, pre-prints, drafts) are sometimes yanked from repositories or websites; we would represent this with a `withdrawn_date` but no "retraction" release (unless there was a posted notice/release of some kind). An update/correction to a work would be a separate release entity with a `release_status` of `update`.
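To make this concrete, a sketch with python dicts standing in for entity JSON (all identifiers and dates below are fake):

```python
# The original release keeps release_status "published" but gains a
# withdrawn_date; the retraction notice becomes a sibling release under
# the same work entity. All values here are fake/illustrative.
published = {
    "work_id": "aaaaaaaaaaaaaaaaaaaaaaaaaa",
    "release_status": "published",
    "release_date": "2015-06-01",
    "withdrawn_date": "2019-03-01",   # proposed new field
}
retraction_notice = {
    "work_id": "aaaaaaaaaaaaaaaaaaaaaaaaaa",  # same work
    "release_status": "retraction",
    "release_date": "2019-03-01",
    # retraction notices often get their own DOI etc
}
```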
Presumably withdrawal/retraction/update status would get aggregated at the work level (eg, in the elasticsearch index), and this could be used in both search results as a facet ("don't show retracted works") and to display warning labels in the webface.
Open issues:
- a `withdrawn_year` as well? a `withdrawn` status/flag field?
- `retraction` releases for published works, even if they aren't papers, and don't have a retraction notice? eg, datasets.

The rust macaroons library that fatcat uses is unmaintained. I intend to take over maintainership and improve quality and test coverage, but if somebody wanted to help with that it would be great!
The production and QA fatcat instances are deployed on Internet Archive VMs, which currently all run Ubuntu xenial (16.04 LTS) and ship OpenSSL 1.0. In Ubuntu bionic (18.04), and most modern linux distributions (including Debian buster / stable and testing), OpenSSL 1.1 is the default.
The fatcat rust API server depends on an older `iron`-oriented Swagger codegen version, which generates code with an old `hyper` release that supports only OpenSSL 1.0. At some point it would be good to switch to the current upstream codegen version; in particular, supporting OpenAPI 3.0 instead of OpenAPI 2.0 would be nice. However, switching away from Iron would be a large refactor. In the meantime, it would be worthwhile to try to patch (and/or upstream) the dependency tree to be compatible with newer OpenSSL.
As a work-around, on Ubuntu bionic, it is possible to install the older OpenSSL 1.0 libraries:
# note that libssl1.0-dev conflicts with libssl-dev
sudo apt install openssl1.0 libssl1.0-dev
and build fatcat:
cargo clean
OPENSSL_LIB_DIR="/usr/lib/x86_64-linux-gnu" OPENSSL_INCLUDE_DIR="/usr/include/openssl" cargo build --release
There are many PDFs and datasets on archive.org in items and collections (distinct from wayback web archives). It should be relatively straightforward to get these imported into fatcat in a one-time batch.
Specific files to include:
It would also be great to include datasets, such as those from academictorrents, but metadata records may need to be created first; for the works above, metadata records should already exist.
This would be a fun script/webface/API feature: a form to paste a URL to a PDF on the public web, which would be archived by wayback (via API) on-demand, then processed by GROBID (with metadata extracted), then have a fatcat file entity created and linked to the given release. Various checks would happen in between (eg, that the file is accessible; that the file (by hash) isn't already in fatcat, in which case the URL could just be added; etc).
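A hedged sketch of the flow, with error handling and the final editgroup submission elided (the GROBID endpoint assumes a locally-running service):

```python
import hashlib
import requests

def ingest_pdf(url: str, release_ident: str) -> dict:
    # check that the file is accessible
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    sha1 = hashlib.sha1(resp.content).hexdigest()

    # skip if this file (by hash) is already in fatcat; in that case we
    # could just add the new URL to the existing entity instead
    lookup = requests.get("https://api.fatcat.wiki/v0/file/lookup",
                          params={"sha1": sha1})
    if lookup.status_code == 200:
        return lookup.json()

    # archive the URL on-demand via wayback (Save Page Now)
    requests.get("https://web.archive.org/save/" + url, timeout=120)

    # extract header metadata with GROBID (assumes a local instance)
    requests.post("http://localhost:8070/api/processHeaderDocument",
                  files={"input": resp.content})

    # then create a file entity linked to the given release (submission
    # via an editgroup elided here)
    return {
        "sha1": sha1,
        "size": len(resp.content),
        "urls": [{"url": url, "rel": "web"}],
        "release_ids": [release_ident],
    }
```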
Basic (only partial schema) creation and updating of at least releases and containers via the web interface.
It is important that entities "round trip" through the update process cleanly: fields which are not edited should not be mutated (eg, no fields dropped, or lists re-ordered).
Crossref and PubMed release metadata includes citation linkage by identifier. It wasn't possible to make these linkages explicit during the early import (because the "target" releases mostly did not exist), but it is possible now.
Feedback from QA UI experiments (https://qa.fatcat.wiki):
- `release_stage`, not `status`

Thoughts:
Creator entities have three fields for human names: display, given, and sur. Release contrib rows currently have only a single field ('raw'), along with an optional reference to a creator entity. The proposed change is to also include full given and sur name fields in the contrib row.
As some context, the overall intent is for fatcat to avoid going down the path of representing "true" human names, and to only represent bibliographic names (aka, what is listed on a published work or in a citation for attribution).
A number of large metadata sources (eg, PubMed) include structured (given/sur) names, and we currently throw this structure away when importing. In the mid-term future, we are only going to have creator entities for a very small fraction of contributions, and this structure may make disambiguation easier in the future, so we should be capturing this full metadata for now.
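Eg, a PubMed import could keep the structure instead of flattening it; `given_name` and `surname` below are the proposed new contrib fields:

```python
# Sketch: build a contrib row from PubMed's structured author fields
# (LastName/ForeName), keeping the proposed given_name/surname fields
# alongside the existing raw_name.
def contrib_from_pubmed_author(author: dict) -> dict:
    given = author.get("ForeName")
    surname = author.get("LastName")
    raw = " ".join(n for n in (given, surname) if n)
    return {
        "raw_name": raw,
        "given_name": given,   # proposed new field
        "surname": surname,    # proposed new field
    }
```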
I think this clarifies that the field describes the current release as part of a larger work. However, this might be a gratuitously disruptive change (not worth the disruption).
As a mitigation, could include the same value in reads (GET) under both key names for an iteration (eg, from v0.3 through v0.4 or v1.0).
When auth fails, currently the user just gets a 4xx error. We should give a proper error message; in general we should probably be passing 400 error messages through?
API error (from logs) was:
27278130-Jun 28 04:40:47 wbgrp-svc502.us.archive.org fatcat-web[25198]: HTTP response body: {"success":false,"error":"InvalidCredentials","message":"auth token was missing, expired, revoked, or corrupt: auth token (macaroon) not valid (signature and/or caveats failed)"}
In particular this impacts login expiration.
JURN is "An organised links directory for the arts & humanities, listing selected open access or otherwise free ejournals." They list 3000-4000 such journals by name, URL, and category at http://www.jurn.org/directory/, and an additional 800 ecology titles at https://jurnsearch.wordpress.com/titles-indexed-ecology-related/.
It would be great to include these in fatcat (probably via chocula first, though could go direct via API as well), and mark them as open so they will be included in broad IA crawls for preservation. However, JURN doesn't link any persistent identifiers (eg, wikidata QID or ISSN/ISSN-L), which makes it hard to reference them anywhere without duplication.
Some brainstorms of how to go about this:
Our arxiv harvester receives author metadata as a single string, with individual author names separated by commas and "and".
Here is the function: https://github.com/internetarchive/fatcat/blob/master/python/fatcat_tools/importers/arxiv.py#L24
In some cases, as discovered by Sawood, this doesn't work and all author names come through as a single string. For example:
https://fatcat.wiki/release/c5s6d7f7w5b3himgditfbiu5nq
https://fatcat.wiki/release/f7j4lf4aqfeqlaqfrtayt62rwe
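A more tolerant splitter could look like the sketch below (names that themselves contain "and" would still need special care):

```python
import re

# Sketch: split a combined author string on commas and a connecting
# "and", collapsing stray newlines first.
def split_arxiv_authors(authors: str) -> list:
    authors = authors.replace("\n", " ")
    parts = re.split(r",\s*|\s+and\s+", authors)
    return [p.strip() for p in parts if p.strip()]

split_arxiv_authors("Jane Doe, John Smith and Ada Lovelace")
# => ['Jane Doe', 'John Smith', 'Ada Lovelace']
```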
Fixing this will include:
Over 60k release entities have a `release_year` greater than 2030: https://fatcat.wiki/release/search?q=year%3A%3E2030
These are clearly bogus, due to errors in GROBID header metadata extraction, and should be cleaned up.
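One way to enumerate candidates for cleanup is a range query against the public release search index (endpoint and index names hedged; they may differ in deployment):

```python
import requests

# Sketch: find releases with an implausible release_year via the
# search index, then feed the idents into a cleanup bot.
resp = requests.get(
    "https://search.fatcat.wiki/fatcat_release/_search",
    json={"query": {"range": {"release_year": {"gt": 2030}}}, "size": 50},
)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"]["ident"], hit["_source"].get("release_year"))
```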
Currently release and container search truncates to the most recent N results. We should have a mechanism in the web interface to see additional results.
This will require some care around performance. The preferred mechanism for bulk exporting metadata should not be paginating through elasticsearch queries.
See also: #29
I have noticed that the order of author names is incorrect in https://fatcat.wiki/release/tcmnksnbabdfxl7fdrws6nkq4e, as shown in the screenshot below. The list of contributors is in the correct order, but the authors listed right below the title show the first author last.
When importing content from Datacite bulk dump, we seem to have some duplicated adjacent (or near-adjacent?) rows, which resulted in the same DOI getting imported multiple times in the same editgroup. This resulted in at least 8000 duplicate DOIs.
Cleanup is to merge releases (redirecting one to the other). Presumably using common tooling with pubmed, container, and other cleanups.
cc: @miku
Two components of this:
Currently, when a file entity is created or updated, the entire release entity database needs to be dumped and bulk loaded into elasticsearch to update things like the per-container coverage stats. Periodic (weekly? monthly? daily?) stats updates are always going to be necessary to fold container updates into release metadata (eg, "journal for this paper is in DOAJ"), but files updates are a much more common case.
Currently need to do, eg:
https://api.fatcat.wiki/v0/release/lookup?doi=10.1215/10407391-3621721&hide=abstract,refs,contribs
... then, using 'ident': hdfx4g4vkjdjvhsbz7br2oqida
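In code, the current two-step dance looks roughly like:

```python
import requests

# step 1: lookup by DOI, hiding the large fields (as in the URL above)
rel = requests.get(
    "https://api.fatcat.wiki/v0/release/lookup",
    params={"doi": "10.1215/10407391-3621721",
            "hide": "abstract,refs,contribs"},
).json()
ident = rel["ident"]   # eg 'hdfx4g4vkjdjvhsbz7br2oqida'

# step 2: a second request using the ident
full = requests.get("https://api.fatcat.wiki/v0/release/" + ident).json()
```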
Would rather just:
cc: @Samwalton9
The web interface should pass through API errors better, and be more helpful.
Eg, currently if one looks up ISSN `2575-7849`, you get a blunt 404 page. Instead it should probably mention the ISSN/ISSN-L distinction and link to https://portal.issn.org/resource/ISSN/2575-7849. If you enter `25757849`, you get a 500 error; this should be either a 400 error with a description of the problem (missing the '-'), or auto-corrected (we don't want to be clever in the API, but in the web UI it is probably appropriate). Relatedly, there should be a lookup view page for each entity type, showing all the lookup options with descriptions of the identifiers. This view could also contain the error descriptions.
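On the web UI side, normalization before the API call could be as simple as this sketch (checksum validation could layer on top of it):

```python
# Sketch: web-UI-side ISSN normalization so a bare "25757849" becomes
# "2575-7849", and malformed input gets a 400-style message, not a 500.
def normalize_issn_input(raw: str):
    s = raw.strip().upper().replace("-", "")
    if len(s) != 8 or not s[:7].isdigit() or s[7] not in "0123456789X":
        return None, "ISSN should be 8 characters, eg 2575-7849"
    return s[:4] + "-" + s[4:], None

normalize_issn_input("25757849")   # => ('2575-7849', None)
normalize_issn_input("bogus")      # => (None, 'ISSN should be ...')
```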
Using title/author metadata, same as the current sandcrawler file/release matching, identical releases should be clustered under the same work entity, and the unused work entities marked as deleted.
Future work related to this merging would be to sort out exact matches (eg, duplicate records for same published version and same external identifiers) as opposed to preprint/published (aka, green OA) versions (with the former releases merged, and the latter left distinct), and sorting out file/release linkage.
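The grouping key could be as simple as a normalized title plus first-author slug, along these lines (real matching, as in sandcrawler, would be fuzzier):

```python
import re
import unicodedata

# Sketch: a crude match key for clustering releases under one work.
def work_match_key(title: str, first_author: str) -> str:
    def slug(s: str) -> str:
        s = unicodedata.normalize("NFKD", s or "")
        s = s.encode("ascii", "ignore").decode()
        return re.sub(r"[^a-z0-9]", "", s.lower())
    return slug(title) + "::" + slug(first_author)

work_match_key("The Origin of Species", "Darwin")
# => 'theoriginofspecies::darwin'
```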
Eg, if you go to the bottom of: https://fatcat.wiki/container/search?q=bmj
The "next" link points to release search, not container search.
Pages like: https://fatcat.wiki/editor/4vmpwdwxxneitkonvgm2pk6kya/editgroups
Can be very slow to look up. PostgreSQL logs slow queries like:
2020-09-14 12:46:09.401 UTC [3277] fatcat@fatcat_prod LOG: duration: 7019.246 ms execute __diesel_stmt_21: SELECT "editgroup"."id", "editgroup"."editor_id", "editgroup"."created", "editgroup"."submitted", "editgroup"."is_accepted", "editgroup"."description", "editgroup"."extra_json", "changelog"."id", "changelog"."editgroup_id", "changelog"."timestamp" FROM ("editgroup" LEFT OUTER JOIN "changelog" ON "changelog"."editgroup_id" = "editgroup"."id") WHERE "editgroup"."editor_id" = $1 ORDER BY "editgroup"."created" DESC LIMIT $2
2020-09-14 12:46:09.401 UTC [3277] fatcat@fatcat_prod DETAIL: parameters: $1 = 'f7e3a22c-1712-4cf5-9783-2dfb354a5ec1', $2 = '50'
The current indices on the `editgroup` table are:
CREATE INDEX editgroup_submitted_idx ON editgroup(is_accepted, submitted);
CREATE INDEX editgroup_editor_idx ON editgroup(is_accepted, editor_id);
There might have been a good reason for making the editor index of the form `(is_accepted, editor_id)`, but for this particular query it should at least be `(editor_id, is_accepted)`, if not `(editor_id, created)`. Worth checking whether the sort order of the `created` part should be specified in the index.
Should be possible to re-create this index with an "online" SQL migration with no other code changes and minimal service impact (only these already-slow views?).
I made a typo when copy/pasting an API token (removing a trailing `=` character) and the API responded with a 500 error to an auth_check call:
{"success":false,"error":"InternalError","message":"unexpected internal error: invalid length at 196"}
This should be a 400 error (not 500), and the message should indicate the actual problem.
Currently only releases and containers are represented in elasticsearch. It would be great to have works (a summary of sub-releases) and files (as well as filesets and webcaptures) as well.
The first step is to define schemas (see `./extra/elasticsearch/`), then to write transforms (with tests!), and last to wire up CLI tooling and workers to continuously update the index.
Python 3.6+ seems to be a reasonable ecosystem baseline: it brings improvements, and some packages/tools already don't support older python3 versions.
Older platforms without python3.6 (including ubuntu xenial in production) can be supported either by using pyenv for development, or more likely by installing a python3.6 system package from alternate .deb repositories.
Among the things upgrading would unlock is the 'black' code formatting tool.
What it says! API support already exists.
The PostgreSQL database setup should be configured for warm failover to a secondary machine in a different data center. This will mostly be a matter of testing and documenting operational procedures; we already have the hardware resources allocated.
HAProxy (or equivalent) should be deployed in front of the API, webface, and search API, with appropriate rate-limits and monitoring. Health-checking should enable relatively seamless failover between servers.
Nagios and other alerts should be in place to notify about disk space, SSL certificate expiration, and other basic system monitoring.
A public status page (lambstatus?) and offsite alerting service (cabot?) should be deployed.
The "help wanted" aspect of this issue is advice arounnd postgres best practices/monitoring, Kafka monitoring (eg, how to integrate kafka topic lag metrics in statsd/grafana), simple log management (eg, retain error lines longer than regular logs), and basic alerting on systemd daemon status (eg, if a python worker daemon crashes, send email with last few log lines).
I suspect it will be common to need to update release metadata in-place pretty often. It would be great to have tooling for this, where a snippet of python code can be submitted, run over small cases, then scaled up to run over the entire corpus, with progress monitoring, error handling, etc.
At this point mostly looking for design inspiration and existing work.
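The rough shape I have in mind, as a sketch (actual API submission elided):

```python
# Sketch of a cleanup harness: run a user-supplied mutate() over
# entities, dry-run first on a small sample, with progress and error
# accounting. The actual editgroup submission is elided.
def run_cleanup(entities, mutate, dry_run=True, limit=None):
    updated = errors = 0
    for i, entity in enumerate(entities):
        if limit is not None and i >= limit:
            break
        try:
            changed = mutate(entity)   # the submitted python snippet
        except Exception as exc:
            errors += 1
            print("error on {}: {}".format(entity.get("ident"), exc))
            continue
        if changed:
            updated += 1
            if not dry_run:
                pass   # submit the update, batched into editgroups
        if i and i % 1000 == 0:
            print("progress: {} seen, {} updated, {} errors".format(
                i, updated, errors))
    return updated, errors
```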
The JALC import bot was poorly tested and had a race condition resulting in many duplicate containers being created (because it was run with many threads).
Cleanup is a bit tricky because there are so many release entities pointing at all the variants, but if the duplicate containers are merged (instead of deleted), then container transclusion into releases should work well enough for, eg, stats and release lookups. We can clean up the actual release entities later/eventually.
Sometimes Datacite metadata includes the same people/entities as both "creators" and "contributors", and we end up duplicating them in fatcat metadata. Eg:
I think the behavior should probably be to only add the contributors if they are not already in the author list by string check. Not sure if this should be a fuzzy string check; an exact check is a good start.
Will need to do cleanup as well.
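The exact-match version could be as simple as this sketch:

```python
# Sketch: append Datacite "contributors" only when not already present
# in the creator/author list, by exact (normalized) string comparison.
def merge_contribs(creators: list, contributors: list) -> list:
    seen = {c["raw_name"].strip().lower()
            for c in creators if c.get("raw_name")}
    merged = list(creators)
    for contrib in contributors:
        name = (contrib.get("raw_name") or "").strip().lower()
        if name and name not in seen:
            merged.append(contrib)
            seen.add(name)
    return merged
```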
This is an issue to track known metadata search queries that, eg, should return a result but don't.
- the word `in` is interpreted as a field filter/facet instead of (eg) part of a title

I made good progress on expanding code coverage of the fatcat API using python tests, but more work is needed. A rough guide is the fraction of `python_client` endpoint wrappers that get called at all; this can be seen by browsing the `pipenv run pytest --cov --cov-report html` HTML output.
Known endpoints needing coverage at this time:
Python tests for these sorts of endpoints live under `./python/tests/api_*.py`.
Separately, it would be great to track Rust test coverage. I experimented with the most popular cargo-integrated coverage tools, but they didn't work for me. It would also be important to measure coverage of Rust code from the python API tests, which at this time are more complete (and much easier to write/maintain).
I think we can safely mark the `release_stage` as a pre-print (`submitted`) when the datacite-reported `resourceType` is `Preprint`. Here is an example entity (JSON):

In this particular case the stage should maybe be `draft` (the paper does not seem to actually have been submitted anywhere, and is on researchgate, not in a pre-print repository), but we should probably just trust the label from datacite.
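The importer change itself would be a small mapping, something like this sketch (other entries are illustrative placeholders):

```python
# Sketch: map datacite resourceType values to a release_stage.
# "Preprint" -> "submitted" is the proposal here.
RESOURCE_TYPE_STAGE = {
    "Preprint": "submitted",
}

def datacite_release_stage(resource_type: str):
    return RESOURCE_TYPE_STAGE.get(resource_type)   # None if unmapped
```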
Existing python workers talk to Kafka using the 'pykafka' library. This has worked well enough for early development, but has some operational and performance issues: consumer group conflicts are common, it takes a relatively long time (tens of seconds) for workers to startup in some cases, and it would often be better to process entire batches of messages instead of one-at-a-time.
Having looked at a few other python libraries, it seems like with our Kafka broker version (2.0), we should definitely take advantage of a `librdkafka`-based library, and the Confluent-maintained one is probably the best bet.
The primary (and first) change is to refactor all workers to use the new library. The second is to refactor some workers (particularly the elasticsearch inserters) to poll() and process entire batches at a time (as opposed to single messages at a time).
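A minimal batch-oriented worker on the confluent library could look like this sketch; topic/group names are illustrative, and transform() and bulk_index() are hypothetical helpers:

```python
from confluent_kafka import Consumer

# Sketch of a batch-consuming elasticsearch inserter. Topic and group
# names are illustrative; transform() and bulk_index() are hypothetical.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "fatcat-es-release-worker",
    "enable.auto.commit": False,
})
consumer.subscribe(["fatcat-prod.release-updates"])

while True:
    # consume() returns up to a full batch, instead of one message at a time
    batch = consumer.consume(num_messages=100, timeout=5.0)
    msgs = [m for m in batch if m is not None and not m.error()]
    if not msgs:
        continue
    bulk_index([transform(m.value()) for m in msgs])
    consumer.commit()   # only commit offsets after a successful bulk insert
```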
The web interface entity pages could expose a button that would allow anybody to instantly re-index the entity into elasticsearch. This would make use of all the same back-end code, and would make quick fixes of corner-cases or operational failures simple. Also useful for manual testing.
The main motivation would be an easier to read and navigate theme. We could write or port a theme to mdbook, but the docs are "just markdown" so it would also be easy to switch to mkdocs, which has a larger community and optional features/plugins.
Here's an example theme that is particularly readable; we'd want to host fonts ourselves on (eg) archive.org:
https://squidfunk.github.io/mkdocs-material/getting-started/
When importing from PubMed/MEDLINE, I trusted that listed DOIs would be valid and registered. This may not have been a good idea! Consider these release entities:
https://fatcat.wiki/release/search?q=Failure+to+use+ECT+in+treatment+of+catatonia
https://fatcat.wiki/release/geyxegazqvdfzinxpk6oafx7ju
https://fatcat.wiki/release/66jfjjdlc5alvjy2onwuk4c7fq
https://fatcat.wiki/release/4iuwwkpylnfdlies3hal2shpsm
https://fatcat.wiki/release/5nlwujzn7na4hmx3eg2xhiq4vq
There seem to be two actual letters written, in the same journal issue on partially overlapping pages by different authors. There are valid PMIDs and DOIs for both, but the MEDLINE entries both have incorrect (invalid) DOIs linked, so we end up with pairs of fatcat releases with (PMID, bad DOI) and (good DOI).
These come from, eg, PubMed imports. The rewrite should only apply to double slashes after the first numbers (the publisher identifier), not elsewhere in the DOI.
I tried several examples using the doi.org resolver and crossref API lookups, and the slash-removed version always seems to be correct.
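The conservative rewrite would anchor on the DOI prefix, eg:

```python
import re

# Sketch: collapse a doubled slash only immediately after the "10.NNNN"
# prefix (the registrant part), leaving any '//' later in the suffix.
def fix_double_slash_doi(doi: str) -> str:
    return re.sub(r"^(10\.\d+)//", r"\1/", doi)

fix_double_slash_doi("10.1234//example.suffix")
# => '10.1234/example.suffix'
```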
There are a number of fatcat entities with DOIs in both forms that should be merged. Eg:
https://fatcat.wiki/release/62i5rntupfazhg5jpkwxr6c7na
https://fatcat.wiki/release/hcpjjnfxdfhizcqsom3karcvx4
Something in python that looks at submitted editgroups and annotates them (pass/fail) if appropriate.
API support should already exist; part of this task is to flush out any missing features.
I was a bit sloppy during initial import, and have a bit over 100 journals (containers) with invalid ISSN-Ls (eg, checksums don't match). Here's a TSV file:
https://gist.github.com/bnewbold/1767fd0e32449b3380d5de7fe39c9359
The majority of these seem to be single-character typos that have filtered through upstream sources. As an example:
https://fatcat.wiki/container/lyfqyt3klnfsbewvhlspmxoyu4 (1761-7227, correct)
https://fatcat.wiki/container/uynmrtrspngoxdjdihrwx3x6oi (1761-7727, typo)
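For reference, the standard mod-11 check that flags these rows:

```python
# Standard ISSN checksum: weights 8..2 over the first seven digits,
# mod 11; a check value of 10 is written as 'X'.
def issn_checksum_ok(issn: str) -> bool:
    s = issn.replace("-", "").upper()
    if len(s) != 8 or not s[:7].isdigit():
        return False
    check = (11 - sum(int(d) * w
                      for d, w in zip(s[:7], range(8, 1, -1))) % 11) % 11
    return s[7] == ("X" if check == 10 else str(check))

issn_checksum_ok("1761-7227")   # => True (the corrected record)
issn_checksum_ok("1761-7727")   # => False (the typo)
```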
It would be great if somebody could go through this TSV, looking up titles on portal.issn.org, and adding the corrected ISSN-L if there is one. In some cases it is important to click through on ISSN records and ensure it is the ISSN-L ("Linking") that is getting added.
With this updated file we can fix the metadata in fatcat, as well as send correction notes to our upstream partners (Pubmed, Crossref, DOAJ, etc).
There is a larger set of 2000+ ISSNs which are not in the most recent issn.org ISSN-L mapping file, but the majority of these seem to be either obscure corner cases (known to issn.org, but not listed publicly in the portal, eg due to mis-registration) or to have some status issue (shows up in portal.issn.org but not in the ISSN-L listing file).
Work is in progress on the `more-importers` branch to bulk-import release metadata from a few more large corpuses:
These initial imports will be one-offs, but the code will be reusable for continuous imports (like the Crossref importer works currently).
Some of these importers will depend on schema changes in the v0.3 milestone.
There is a database of retracted papers at: http://retractiondatabase.org/RetractionSearch.aspx?&AspxAutoDetectCookieSupport=1
It would be good to have a bot which periodically fetches updates, and then updates article metadata in fatcat appropriately.
When 1+ million long-tail releases were imported during bootstrapping, their associated container wasn't linked. We have this metadata (with reasonably high confidence) from the crawl logs. The number of releases isn't very large, so it should be reasonable to go through and update container links for each release; this will improve container coverage analytics.
This processing could happen after releases are grouped into works.
It would be cool to have a zotero plugin that could query fatcat for full text when I save a new paper reference to Zotero.
on mobile, will add detail soon.