internetarchive / fatcat
Perpetual Access To The Scholarly Record
Home Page: https://guide.fatcat.wiki
License: Other
As detected by fatcat datacite importer ("missed potential license"):
cc: @miku if you have time to add these, or I'll get around to adding them eventually
It would be nice to have a way for volunteer editors to find particular entities that need review or editing. The specific user request was for "a prioritized list of 'to be fixed' items that users can see."
Off the top of my head, one way to do this would be adding a numerical score or text flag to elasticsearch documents. When the full entity schema is transformed into this document, an automated sniff of "how good is this metadata" would happen and set this field appropriately. Editors can then search for entities to work on using the full faceting power of the index (eg, only specific journals, or fields, or languages, or time spans).
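As a rough sketch (field names and thresholds here are all made up, not the actual schema):

```python
# Hypothetical "metadata quality" sniff, run during the entity-to-
# elasticsearch transform. Field names and thresholds are illustrative.
def metadata_quality_flag(release: dict) -> str:
    score = 0
    score += 1 if release.get("title") else 0
    score += 1 if release.get("contribs") else 0
    score += 1 if release.get("container_id") else 0
    score += 1 if release.get("release_year") else 0
    score += 1 if release.get("abstracts") else 0
    if score >= 4:
        return "good"
    if score >= 2:
        return "fair"
    return "needs_review"   # editors could facet/search on this value
```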
The two clearest examples are creator entities (not showing releases) and release entities (not showing files/filesets/webcaptures).
The reason for this is that these preview pages are re-using the generic revision view pages, and revisions can't have inbound graph relations (only full entity identifiers do).
This could be worked around for, eg, the specific case of edit previews in the web interface, where there is a specific entity identifier in play, by doing the extra API calls to fetch inbound relations based on the identifier and "enriching" the revision with those fields.
An alternative, with less code complexity but poorer user experience, would be to show a warning or similar in the preview.
There should be a clear way to represent retractions (and/or "withdrawn") of works in fatcat.
My current proposal is to represent "retractions" as a separate release under the same work as the published release (with `release_status` "retraction"). All releases would gain a new `withdrawn_date` field (a la `release_date`), and this field would get updated for the published work, but the status would not be changed from `published`. Retractions are often published in the form of a note or letter, and given their own external identifiers, which is why they become a new full release. Un-published releases (eg, pre-prints, drafts) are sometimes yanked from repositories or websites; we would represent this with a `withdrawn_date` but no "retraction" release (unless there was a posted notice/release of some kind). An update/correction to a work would be a separate release entity with a `release_status` of `update`.
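To make this concrete, a sketch with python dicts standing in for entity JSON (all identifiers and dates below are fake):

```python
# The original release keeps release_status "published" but gains a
# withdrawn_date; the retraction notice becomes a sibling release under
# the same work entity. All values here are fake/illustrative.
published = {
    "work_id": "aaaaaaaaaaaaaaaaaaaaaaaaaa",
    "release_status": "published",
    "release_date": "2015-06-01",
    "withdrawn_date": "2019-03-01",   # proposed new field
}
retraction_notice = {
    "work_id": "aaaaaaaaaaaaaaaaaaaaaaaaaa",  # same work
    "release_status": "retraction",
    "release_date": "2019-03-01",
    # retraction notices often get their own DOI etc
}
```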
Presumably withdrawal/retraction/update status would get aggregated at the work level (eg, in the elasticsearch index), and this could be used in both search results as a facet ("don't show retracted works") and to display warning labels in the webface.
Open issues:
- a `withdrawn_year` as well? a `withdrawn` status/flag field?
- `retraction` releases for published works, even if they aren't papers, and don't have a retraction notice? eg, datasets.

The rust macaroons library that fatcat uses is unmaintained. I intend to take over maintainership and improve quality and test coverage, but if somebody wanted to help with that it would be great!
The production and QA fatcat instances are deployed on Internet Archive VMs, which currently all run Ubuntu xenial (16.04 LTS) and ship OpenSSL 1.0. In Ubuntu bionic (18.04), and most modern linux distributions (including Debian buster / stable and testing), OpenSSL 1.1 is the default.
The fatcat rust API server depends on an older `iron`-oriented Swagger codegen version, which generates code with an old `hyper` release that supports only OpenSSL 1.0. At some point it would be good to switch to the current upstream codegen version; in particular, supporting OpenAPI 3.0 instead of OpenAPI 2.0 would be nice. However, switching away from Iron would be a large refactor. In the meantime, it would be worthwhile to try to patch (and/or upstream) the dependency tree to be compatible with newer OpenSSL.
As a work-around, on Ubuntu bionic, it is possible to install the older OpenSSL 1.0 libraries:
# note that libssl1.0-dev conflicts with libssl-dev
sudo apt install openssl1.0 libssl1.0-dev
and build fatcat:
cargo clean
OPENSSL_LIB_DIR="/usr/lib/x86_64-linux-gnu" OPENSSL_INCLUDE_DIR="/usr/include/openssl" cargo build --release
There are many PDFs and datasets on archive.org in items and collections (distinct from wayback web archives). It should be relatively straightforward to get these imported into fatcat in a one-time batch.
Specific files to include:
It would also be great to include datasets, such as those from academictorrents, but metadata records may need to be created first; for the works above, metadata records should already exist.
This would be a fun script/webface/API feature: a form to paste a URL to a PDF on the public web, which would be archived by wayback (via API) on-demand, then processed by GROBID (with metadata extracted), then have a fatcat file entity created and linked to the given release. Various checks would happen in between (eg, that the file is accessible; that the file (by hash) isn't already in fatcat, in which case the URL could just be added; etc).
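A hedged sketch of the flow, with error handling and the final editgroup submission elided (the GROBID endpoint assumes a locally-running service):

```python
import hashlib
import requests

def ingest_pdf(url: str, release_ident: str) -> dict:
    # check that the file is accessible
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    sha1 = hashlib.sha1(resp.content).hexdigest()

    # skip if this file (by hash) is already in fatcat; in that case we
    # could just add the new URL to the existing entity instead
    lookup = requests.get("https://api.fatcat.wiki/v0/file/lookup",
                          params={"sha1": sha1})
    if lookup.status_code == 200:
        return lookup.json()

    # archive the URL on-demand via wayback (Save Page Now)
    requests.get("https://web.archive.org/save/" + url, timeout=120)

    # extract header metadata with GROBID (assumes a local instance)
    requests.post("http://localhost:8070/api/processHeaderDocument",
                  files={"input": resp.content})

    # then create a file entity linked to the given release (submission
    # via an editgroup elided here)
    return {
        "sha1": sha1,
        "size": len(resp.content),
        "urls": [{"url": url, "rel": "web"}],
        "release_ids": [release_ident],
    }
```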
Basic (only partial schema) creation and updating of at least releases and containers via the web interface.
It is important that entities "round trip" through the update process cleanly: fields which are not edited should not be mutated (eg, no fields dropped, or lists re-ordered).
Crossref and PubMed release metadata includes citation linkage by identifier. It wasn't possible to make these linkages explicit during the early import (because the "target" releases mostly did not exist), but it is possible now.
Feedback from QA UI experiments (https://qa.fatcat.wiki):
- `release_stage`, not `status`

Thoughts:
Creator entities have three fields for human names: display, given, and sur. Release contrib rows currently have only a single field ('raw'), along with an optional reference to a creator entity. The proposed change is to also include full given and sur name fields in the contrib row.
As some context, the overall intent is for fatcat to avoid going down the path of representing "true" human names, and to only represent bibliographic names (aka, what is listed on a published work or in a citation for attribution).
A number of large metadata sources (eg, PubMed) include structured (given/sur) names, and we currently throw this structure away when importing. In the mid-term future, we are only going to have creator entities for a very small fraction of contributions, and this structure may make disambiguation easier in the future, so we should be capturing this full metadata for now.
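Eg, a PubMed import could keep the structure instead of flattening it; `given_name` and `surname` below are the proposed new contrib fields:

```python
# Sketch: build a contrib row from PubMed's structured author fields
# (LastName/ForeName), keeping the proposed given_name/surname fields
# alongside the existing raw_name.
def contrib_from_pubmed_author(author: dict) -> dict:
    given = author.get("ForeName")
    surname = author.get("LastName")
    raw = " ".join(n for n in (given, surname) if n)
    return {
        "raw_name": raw,
        "given_name": given,   # proposed new field
        "surname": surname,    # proposed new field
    }
```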
I think this clarifies that the field describes the current release as part of a larger work. However, this might be a gratuitously disruptive change (not worth the disruption).
As a mitigation, could include the same value in reads (GET) under both key names for an iteration (eg, from v0.3 through v0.4 or v1.0).
When auth fails, currently the user just gets a 4xx error. We should give a proper error message; in general we should probably be passing 400 error messages through?
API error (from logs) was:
27278130-Jun 28 04:40:47 wbgrp-svc502.us.archive.org fatcat-web[25198]: HTTP response body: {"success":false,"error":"InvalidCredentials","message":"auth token was missing, expired, revoked, or corrupt: auth token (macaroon) not valid (signature and/or caveats failed)"}
In particular this impacts login expiration.
JURN is "An organised links directory for the arts & humanities, listing selected open access or otherwise free ejournals." They list 3000-4000 such journals by name, URL, and category at http://www.jurn.org/directory/, and an additional 800 ecology titles at https://jurnsearch.wordpress.com/titles-indexed-ecology-related/.
It would be great to include these in fatcat (probably via chocula first, though could go direct via API as well), and mark them as open so they will be included in broad IA crawls for preservation. However, JURN doesn't link any persistent identifiers (eg, wikidata QID or ISSN/ISSN-L), which makes it hard to reference them anywhere without duplication.
Some brainstorms of how to go about this:
Our arxiv harvester receives author metadata as a single string, with individual author names separated by commas and "and".
Here is the function: https://github.com/internetarchive/fatcat/blob/master/python/fatcat_tools/importers/arxiv.py#L24
In some cases, as discovered by Sawood, this doesn't work and all author names come through as a single string. For example:
https://fatcat.wiki/release/c5s6d7f7w5b3himgditfbiu5nq
https://fatcat.wiki/release/f7j4lf4aqfeqlaqfrtayt62rwe
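A more tolerant splitter could look like the sketch below (names that themselves contain "and" would still need special care):

```python
import re

# Sketch: split a combined author string on commas and a connecting
# "and", collapsing stray newlines first.
def split_arxiv_authors(authors: str) -> list:
    authors = authors.replace("\n", " ")
    parts = re.split(r",\s*|\s+and\s+", authors)
    return [p.strip() for p in parts if p.strip()]

split_arxiv_authors("Jane Doe, John Smith and Ada Lovelace")
# => ['Jane Doe', 'John Smith', 'Ada Lovelace']
```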
Fixing this will include:
Over 60k release entities have a `release_year` greater than 2030: https://fatcat.wiki/release/search?q=year%3A%3E2030
These are clearly bogus, due to errors in GROBID header metadata extraction, and should be cleaned up.
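One way to enumerate candidates for cleanup is a range query against the public release search index (endpoint and index names hedged; they may differ in deployment):

```python
import requests

# Sketch: find releases with an implausible release_year via the
# search index, then feed the idents into a cleanup bot.
resp = requests.get(
    "https://search.fatcat.wiki/fatcat_release/_search",
    json={"query": {"range": {"release_year": {"gt": 2030}}}, "size": 50},
)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"]["ident"], hit["_source"].get("release_year"))
```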
Currently release and container search truncates to the most recent N results. We should have a mechanism in the web interface to see additional results.
This will require some care around performance. The preferred mechanism for bulk exporting metadata should not be paginating through elasticsearch queries.
See also: #29
I have noticed that the order of author names is incorrect in https://fatcat.wiki/release/tcmnksnbabdfxl7fdrws6nkq4e, as shown in the screenshot below. The list of contributors is in the correct order, but the authors listed right below the title show the first author last.
When importing content from Datacite bulk dump, we seem to have some duplicated adjacent (or near-adjacent?) rows, which resulted in the same DOI getting imported multiple times in the same editgroup. This resulted in at least 8000 duplicate DOIs.
Cleanup is to merge releases (redirecting one to the other). Presumably using common tooling with pubmed, container, and other cleanups.
cc: @miku
Two components of this:
Currently, when a file entity is created or updated, the entire release entity database needs to be dumped and bulk loaded into elasticsearch to update things like the per-container coverage stats. Periodic (weekly? monthly? daily?) stats updates are always going to be necessary to fold container updates into release metadata (eg, "journal for this paper is in DOAJ"), but files updates are a much more common case.
Currently need to do, eg:
https://api.fatcat.wiki/v0/release/lookup?doi=10.1215/10407391-3621721&hide=abstract,refs,contribs
... then, using 'ident': hdfx4g4vkjdjvhsbz7br2oqida
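In code, the current two-step dance looks roughly like:

```python
import requests

# step 1: lookup by DOI, hiding the large fields (as in the URL above)
rel = requests.get(
    "https://api.fatcat.wiki/v0/release/lookup",
    params={"doi": "10.1215/10407391-3621721",
            "hide": "abstract,refs,contribs"},
).json()
ident = rel["ident"]   # eg 'hdfx4g4vkjdjvhsbz7br2oqida'

# step 2: a second request using the ident
full = requests.get("https://api.fatcat.wiki/v0/release/" + ident).json()
```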
Would rather just:
cc: @Samwalton9
The web interface should pass through API errors better, and be more helpful.
Eg, currently if one looks up ISSN `2575-7849`, you get a blunt 404 page. Instead it should probably mention the ISSN/ISSN-L distinction and link to https://portal.issn.org/resource/ISSN/2575-7849. If you enter `25757849`, you get a 500 error; this should be either a 400 error with a description of the problem (missing the '-'), or auto-corrected (we don't want to be clever in the API, but in the web UI it is probably appropriate). Relatedly, there should be a lookup view page for each entity type, showing all the lookup options with descriptions of the identifiers. This view could also contain the error descriptions.
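On the web UI side, normalization before the API call could be as simple as this sketch (checksum validation could layer on top of it):

```python
# Sketch: web-UI-side ISSN normalization so a bare "25757849" becomes
# "2575-7849", and malformed input gets a 400-style message, not a 500.
def normalize_issn_input(raw: str):
    s = raw.strip().upper().replace("-", "")
    if len(s) != 8 or not s[:7].isdigit() or s[7] not in "0123456789X":
        return None, "ISSN should be 8 characters, eg 2575-7849"
    return s[:4] + "-" + s[4:], None

normalize_issn_input("25757849")   # => ('2575-7849', None)
normalize_issn_input("bogus")      # => (None, 'ISSN should be ...')
```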
Using title/author metadata, same as the current sandcrawler file/release matching, identical releases should be clustered under the same work entity, and the unused work entities marked as deleted.
Future work related to this merging would be to sort out exact matches (eg, duplicate records for same published version and same external identifiers) as opposed to preprint/published (aka, green OA) versions (with the former releases merged, and the latter left distinct), and sorting out file/release linkage.
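The grouping key could be as simple as a normalized title plus first-author slug, along these lines (real matching, as in sandcrawler, would be fuzzier):

```python
import re
import unicodedata

# Sketch: a crude match key for clustering releases under one work.
def work_match_key(title: str, first_author: str) -> str:
    def slug(s: str) -> str:
        s = unicodedata.normalize("NFKD", s or "")
        s = s.encode("ascii", "ignore").decode()
        return re.sub(r"[^a-z0-9]", "", s.lower())
    return slug(title) + "::" + slug(first_author)

work_match_key("The Origin of Species", "Darwin")
# => 'theoriginofspecies::darwin'
```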
Eg, if you go to the bottom of: https://fatcat.wiki/container/search?q=bmj
The "next" link points to release search, not container search.
Pages like: https://fatcat.wiki/editor/4vmpwdwxxneitkonvgm2pk6kya/editgroups
Can be very slow to look up. PostgreSQL logs slow queries like:
2020-09-14 12:46:09.401 UTC [3277] fatcat@fatcat_prod LOG: duration: 7019.246 ms execute __diesel_stmt_21: SELECT "editgroup"."id", "editgroup"."editor_id", "editgroup"."created", "editgroup"."submitted", "editgroup"."is_accepted", "editgroup"."description", "editgroup"."extra_json", "changelog"."id", "changelog"."editgroup_id", "changelog"."timestamp" FROM ("editgroup" LEFT OUTER JOIN "changelog" ON "changelog"."editgroup_id" = "editgroup"."id") WHERE "editgroup"."editor_id" = $1 ORDER BY "editgroup"."created" DESC LIMIT $2
2020-09-14 12:46:09.401 UTC [3277] fatcat@fatcat_prod DETAIL: parameters: $1 = 'f7e3a22c-1712-4cf5-9783-2dfb354a5ec1', $2 = '50'
The current indices on the `editgroup` table are:
CREATE INDEX editgroup_submitted_idx ON editgroup(is_accepted, submitted);
CREATE INDEX editgroup_editor_idx ON editgroup(is_accepted, editor_id);
There might have been a good reason for making the editor index of the form `(is_accepted, editor_id)`, but for this particular query it should at least be `(editor_id, is_accepted)`, if not `(editor_id, created)`. Worth checking whether the sort order of the `created` part should be specified in the index.
Should be possible to re-create this index with an "online" SQL migration with no other code changes and minimal service impact (only these already-slow views?).
I made a typo when copy/pasting an API token (removing a trailing `=` character) and the API responded with a 500 error to an auth_check call:
{"success":false,"error":"InternalError","message":"unexpected internal error: invalid length at 196"}
This should be a 400 error (not 500), and the message should indicate the actual problem.
Currently only releases and containers are represented in elasticsearch. It would be great to have works (a summary of sub-releases) and files (as well as filesets and webcaptures) as well.
The first step is to define schemas (see `./extra/elasticsearch/`), then to write transforms (with tests!), and last to wire up CLI tooling and workers to continuously update the index.
Python 3.6+ seems to be a reasonable ecosystem baseline: it brings improvements, and some packages/tools already don't support older python3 versions.
Older platforms without python3.6 (including ubuntu xenial in production) can be supported either by using pyenv for development, or more likely by installing a python3.6 system package from alternate .deb repositories.
Among the things upgrading would unlock is the 'black' code formatting tool.
What it says! API support already exists.
The PostgreSQL database setup should be configured for warm failover to a secondary machine in a different data center. This will mostly be a matter of testing and documenting operational procedures; we already have the hardware resources allocated.
HAProxy (or equivalent) should be deployed in front of the API, webface, and search API, with appropriate rate-limits and monitoring. Health-checking should enable relatively seamless failover between servers.
Nagios and other alerts should be in place to notify about disk space, SSL certificate expiration, and other basic system monitoring.
A public status page (lambstatus?) and offsite alerting service (cabot?) should be deployed.
The "help wanted" aspect of this issue is advice arounnd postgres best practices/monitoring, Kafka monitoring (eg, how to integrate kafka topic lag metrics in statsd/grafana), simple log management (eg, retain error lines longer than regular logs), and basic alerting on systemd daemon status (eg, if a python worker daemon crashes, send email with last few log lines).
I suspect it will be common to need to update release metadata in-place pretty often. It would be great to have tooling for this, where a snippet of python code can be submitted, run over small cases, then scaled up to run over the entire corpus, with progress monitoring, error handling, etc.
At this point mostly looking for design inspiration and existing work.
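The rough shape I have in mind, as a sketch (actual API submission elided):

```python
# Sketch of a cleanup harness: run a user-supplied mutate() over
# entities, dry-run first on a small sample, with progress and error
# accounting. The actual editgroup submission is elided.
def run_cleanup(entities, mutate, dry_run=True, limit=None):
    updated = errors = 0
    for i, entity in enumerate(entities):
        if limit is not None and i >= limit:
            break
        try:
            changed = mutate(entity)   # the submitted python snippet
        except Exception as exc:
            errors += 1
            print("error on {}: {}".format(entity.get("ident"), exc))
            continue
        if changed:
            updated += 1
            if not dry_run:
                pass   # submit the update, batched into editgroups
        if i and i % 1000 == 0:
            print("progress: {} seen, {} updated, {} errors".format(
                i, updated, errors))
    return updated, errors
```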
The JALC import bot was poorly tested and had a race condition resulting in many duplicate containers being created (because it was run with many threads).
Cleanup is a bit tricky because there are so many release entities pointing at all the variants, but if the duplicate containers are merged (instead of deleted), then container transclusion into releases should work well enough for, eg, stats and release lookups. We can clean up the actual release entities later/eventually.
Sometimes Datacite metadata includes the same people/entities as both "creators" and "contributors", and we end up duplicating them in fatcat metadata. Eg:
I think the behavior should probably be to only add the contributors if they are not already in the author list by string check. Not sure if this should be a fuzzy string check; an exact check is a good start.
Will need to do cleanup as well.
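The exact-match version could be as simple as this sketch:

```python
# Sketch: append Datacite "contributors" only when not already present
# in the creator/author list, by exact (normalized) string comparison.
def merge_contribs(creators: list, contributors: list) -> list:
    seen = {c["raw_name"].strip().lower()
            for c in creators if c.get("raw_name")}
    merged = list(creators)
    for contrib in contributors:
        name = (contrib.get("raw_name") or "").strip().lower()
        if name and name not in seen:
            merged.append(contrib)
            seen.add(name)
    return merged
```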
This is an issue to track known metadata search queries that, eg, should return a result but don't.
- the word `in` is interpreted as a field filter/facet instead of (eg) part of a title

I made good progress on expanding code coverage of the fatcat API using python tests, but more work is needed. A rough guide is the fraction of `python_client` endpoint wrappers that get called at all; this can be seen by browsing the `pipenv run pytest --cov --cov-report html` HTML output.
Known endpoints needing coverage at this time:
Python tests for these sorts of endpoints live under `./python/tests/api_*.py`.
Separately, it would be great to track Rust test coverage. I experimented with the most popular cargo-integrated coverage tools, but they didn't work for me. It would also be important to measure coverage of Rust code from the python API tests, which at this time are more complete (and much easier to write/maintain).
I think we can safely mark the `release_stage` as a pre-print (`submitted`) when the datacite-reported `resourceType` is `Preprint`. Here is an example entity (JSON):

In this particular case the stage should maybe be `draft` (the paper does not seem to actually have been submitted anywhere, and is on researchgate, not in a pre-print repository), but we should probably just trust the label from datacite.
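The importer change itself would be a small mapping, something like this sketch (other entries are illustrative placeholders):

```python
# Sketch: map datacite resourceType values to a release_stage.
# "Preprint" -> "submitted" is the proposal here.
RESOURCE_TYPE_STAGE = {
    "Preprint": "submitted",
}

def datacite_release_stage(resource_type: str):
    return RESOURCE_TYPE_STAGE.get(resource_type)   # None if unmapped
```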
Existing python workers talk to Kafka using the 'pykafka' library. This has worked well enough for early development, but has some operational and performance issues: consumer group conflicts are common, it takes a relatively long time (tens of seconds) for workers to startup in some cases, and it would often be better to process entire batches of messages instead of one-at-a-time.
Having looked at a few other python libraries, it seems like with our Kafka broker version (2.0), we should definitely take advantage of a `librdkafka`-based library, and the Confluent-maintained one is probably the best bet.
The primary (and first) change is to refactor all workers to use the new library. The second is to refactor some workers (particularly the elasticsearch inserters) to poll() and process entire batches at a time (as opposed to single messages at a time).
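A minimal batch-oriented worker on the confluent library could look like this sketch; topic/group names are illustrative, and transform() and bulk_index() are hypothetical helpers:

```python
from confluent_kafka import Consumer

# Sketch of a batch-consuming elasticsearch inserter. Topic and group
# names are illustrative; transform() and bulk_index() are hypothetical.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "fatcat-es-release-worker",
    "enable.auto.commit": False,
})
consumer.subscribe(["fatcat-prod.release-updates"])

while True:
    # consume() returns up to a full batch, instead of one message at a time
    batch = consumer.consume(num_messages=100, timeout=5.0)
    msgs = [m for m in batch if m is not None and not m.error()]
    if not msgs:
        continue
    bulk_index([transform(m.value()) for m in msgs])
    consumer.commit()   # only commit offsets after a successful bulk insert
```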
The web interface entity pages could expose a button that would allow anybody to instantly re-index the entity into elasticsearch. This would make use of all the same back-end code, and would make quick fixes of corner-cases or operational failures simple. Also useful for manual testing.
The main motivation would be an easier to read and navigate theme. We could write or port a theme to mdbook, but the docs are "just markdown" so it would also be easy to switch to mkdocs, which has a larger community and optional features/plugins.
Here's an example theme that is particularly readable; we'd want to host fonts ourselves on (eg) archive.org:
https://squidfunk.github.io/mkdocs-material/getting-started/
When importing from PubMed/MEDLINE, I trusted that listed DOIs would be valid and registered. This may not have been a good idea! Consider these release entities:
https://fatcat.wiki/release/search?q=Failure+to+use+ECT+in+treatment+of+catatonia
https://fatcat.wiki/release/geyxegazqvdfzinxpk6oafx7ju
https://fatcat.wiki/release/66jfjjdlc5alvjy2onwuk4c7fq
https://fatcat.wiki/release/4iuwwkpylnfdlies3hal2shpsm
https://fatcat.wiki/release/5nlwujzn7na4hmx3eg2xhiq4vq
There seem to be two actual letters written, in the same journal issue on partially overlapping pages by different authors. There are valid PMIDs and DOIs for both, but the MEDLINE entries both have incorrect (invalid) DOIs linked, so we end up with pairs of fatcat releases with (PMID, bad DOI) and (good DOI).
These come from, eg, PubMed imports. The rewrite should only apply to double slashes after the first numbers (the publisher identifier), not elsewhere in the DOI.
I tried several examples using the doi.org resolver and crossref API lookups, and the slash-removed version always seems to be correct.
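The conservative rewrite would anchor on the DOI prefix, eg:

```python
import re

# Sketch: collapse a doubled slash only immediately after the "10.NNNN"
# prefix (the registrant part), leaving any '//' later in the suffix.
def fix_double_slash_doi(doi: str) -> str:
    return re.sub(r"^(10\.\d+)//", r"\1/", doi)

fix_double_slash_doi("10.1234//example.suffix")
# => '10.1234/example.suffix'
```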
There are a number of fatcat entities with DOIs in both forms that should be merged. Eg:
https://fatcat.wiki/release/62i5rntupfazhg5jpkwxr6c7na
https://fatcat.wiki/release/hcpjjnfxdfhizcqsom3karcvx4
Something in python that looks at submitted editgroups and annotates them (pass/fail) if appropriate.
API support should already exist; part of this task is to flush out any missing features.
I was a bit sloppy during initial import, and have a bit over 100 journals (containers) with invalid ISSN-Ls (eg, checksums don't match). Here's a TSV file:
https://gist.github.com/bnewbold/1767fd0e32449b3380d5de7fe39c9359
The majority of these seem to be single-character typos that have filtered through upstream sources. As an example:
https://fatcat.wiki/container/lyfqyt3klnfsbewvhlspmxoyu4 (1761-7227, correct)
https://fatcat.wiki/container/uynmrtrspngoxdjdihrwx3x6oi (1761-7727, typo)
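For reference, the standard mod-11 check that flags these rows:

```python
# Standard ISSN checksum: weights 8..2 over the first seven digits,
# mod 11; a check value of 10 is written as 'X'.
def issn_checksum_ok(issn: str) -> bool:
    s = issn.replace("-", "").upper()
    if len(s) != 8 or not s[:7].isdigit():
        return False
    check = (11 - sum(int(d) * w
                      for d, w in zip(s[:7], range(8, 1, -1))) % 11) % 11
    return s[7] == ("X" if check == 10 else str(check))

issn_checksum_ok("1761-7227")   # => True (the corrected record)
issn_checksum_ok("1761-7727")   # => False (the typo)
```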
It would be great if somebody could go through this TSV, looking up titles on portal.issn.org, and adding the corrected ISSN-L if there is one. In some cases it is important to click through on ISSN records and ensure it is the ISSN-L ("Linking") that is getting added.
With this updated file we can fix the metadata in fatcat, as well as send correction notes to our upstream partners (Pubmed, Crossref, DOAJ, etc).
There is a larger set of 2000+ ISSNs which are not in the most recent issn.org ISSN-L mapping file, but the majority of these seem to be either obscure corner cases (known to issn.org, but not listed publicly in the portal, eg due to mis-registration) or to have some status issue (shows up in portal.issn.org but not in the ISSN-L listing file).
Work is in progress on the `more-importers` branch to bulk-import release metadata from a few more large corpuses:
These initial imports will be one-offs, but the code will be reusable for continuous imports (like the Crossref importer works currently).
Some of these importers will depend on schema changes in the v0.3 milestone.
There is a database of retracted papers at: http://retractiondatabase.org/RetractionSearch.aspx?&AspxAutoDetectCookieSupport=1
It would be good to have a bot which periodically fetches updates, and then updates article metadata in fatcat appropriately.
When 1+ million long-tail releases were imported during bootstrapping, their associated container wasn't linked. We have this metadata (with reasonably high confidence) from the crawl logs. The number of releases isn't very large, so it should be reasonable to go through and update container links for each release; this will improve container coverage analytics.
This processing could happen after releases are grouped into works.
It would be cool to have a zotero plugin that could query fatcat for full text when I save a new paper reference to Zotero.
on mobile, will add detail soon.