Comments (4)
Thanks for the catch, and for filing an issue! These are all the same version of the same paper and should be merged into a single entity. Even if they were different versions, they would still need to be grouped under the same "work" entity.
Here are the three release entities and the search query:
- https://fatcat.wiki/release/k7stofweanfabfnhkm3wcjbl3u
- https://fatcat.wiki/release/j6bzeablenaahmrdfnjzz7eyxy
- https://fatcat.wiki/release/dnulvnijqbfdhawqmgsp557mhy
- https://fatcat.wiki/release/search?q=plastic+factory
Some more background and details:
What happened in this particular case is that I crawled a number of "long-tail" open access journals and inserted about 1.5 million release entities from that crawl without matching on an identifier (like a DOI), because most of these works don't have DOIs or other identifiers. Here's what Semantic Scholar and Google Scholar know about this paper (note the lack of an identifier):
- https://www.semanticscholar.org/paper/Evaluation-of-noise-pollution-and-the-efficiency-of-Hosseini-Khorashad/ababf5ac11aa1f79dc3d645a13bb20ac6c34866e
- https://scholar.google.com/scholar?cluster=6126366077149053969&hl=en&as_sdt=0,48
In this case, I crawled 3 near-identical PDFs, and created new release entities for each, so there are three copies.
I wasn't aware of this category of problem from this import, but I am aware of two related problems with the long-tail import: we don't have linked "container" (journal) metadata for these 1.5 million papers, and many of the papers are actually from larger OA publishers (eg, PLOS), but got mixed in with smaller publishers on repository domains that got crawled. Here's an example of the latter category of error:
- https://fatcat.wiki/release/g3txnvdnqfdu5l6y5ljl2zpvpi
- https://fatcat.wiki/release/6vwznavjhrf6dine5zq6ztuzxe
There are a few solutions to these categories of problems:
- releases will be auto-grouped into works based on metadata (title, authors, year). This is primarily to group pre-prints with published versions, but will also group these near-duplicates as a partial resolution of duplicate entries
- future creation of release entities lacking a persistent identifier (eg, DOI) will be much more conservative. For works with identifiers, we can do a fast lookup to see if something with the same ID already exists; for works without identifiers, we need to do a fuzzy match to see if something very similar already exists and should be merged. biblio-glutton is the tool we'll use for this fuzzy matching
- targeted cleanups of the earlier 1.5 million long-tail work import are needed; at a minimum container metadata is needed. I've been working on this in the past couple weeks but haven't come up with a robust solution yet
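As a rough illustration of the first two points, here is a minimal sketch of grouping releases into works on normalized metadata, plus a simple fuzzy title check. This is not fatcat's actual code, and it does not use biblio-glutton; the dict fields (`ident`, `title`, `authors`, `year`), the function names, and the 0.9 threshold are all assumptions made for the example, with `difflib` from the Python standard library standing in for a real matcher.

```python
import re
import unicodedata
from collections import defaultdict
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def work_key(release):
    """Exact-match grouping key: normalized title, first author surname, year."""
    first_author = release["authors"][0] if release["authors"] else ""
    surname = normalize(first_author).split(" ")[-1]
    return (normalize(release["title"]), surname, release["year"])


def group_into_works(releases):
    """Return lists of release idents that share the same grouping key."""
    groups = defaultdict(list)
    for release in releases:
        groups[work_key(release)].append(release["ident"])
    return list(groups.values())


def is_fuzzy_match(a, b, threshold=0.9):
    """Hypothetical pre-insert check: same year and very similar titles."""
    if a["year"] != b["year"]:
        return False
    ratio = SequenceMatcher(
        None, normalize(a["title"]), normalize(b["title"])
    ).ratio()
    return ratio >= threshold
```

An exact key like this only catches near-duplicates whose titles differ in case, accents, or punctuation; the fuzzy check is what would catch small wording differences before a new release without an identifier gets inserted.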
from fatcat.
For this specific case of three duplicates, I merged the entities in https://fatcat.wiki/editgroup/shf64rgvgreqbm4dqekjx5d4cq
from fatcat.
Thanks for the detailed explanation and links! That really helps me visualize how changes propagate. (I still need to figure out grouping other than redirects.) If the PDFs were completely identical, might the duping still have happened?
from fatcat.
If the PDFs are identical (usually checked by comparing SHA1 hashes), these failure modes shouldn't happen: import scripts do a lookup before insert.
As a fine-print detail, there are something like 20 duplicate file entities (duplicates of the same file) that slipped through due to a race condition during early bulk imports. I haven't cleaned these up (by merging the entities) yet.
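To make the lookup-before-insert idea concrete, here is a minimal sketch that uses an in-memory dict as a stand-in for the catalog's file table. The `FileIndex` class and its method are hypothetical and are not fatcat's import code; only `hashlib.sha1` is a real library call.

```python
import hashlib


class FileIndex:
    """Toy in-memory stand-in for a file-entity table (hypothetical)."""

    def __init__(self):
        self._by_sha1 = {}

    def lookup_or_insert(self, pdf_bytes, url):
        """Return (entity, created). Reuse the entity if the SHA1 is known."""
        sha1 = hashlib.sha1(pdf_bytes).hexdigest()
        existing = self._by_sha1.get(sha1)
        if existing is not None:
            # Same bytes seen before: record the extra URL, create nothing new.
            if url not in existing["urls"]:
                existing["urls"].append(url)
            return existing, False
        entity = {"sha1": sha1, "urls": [url]}
        self._by_sha1[sha1] = entity
        return entity, True
```

Byte-identical PDFs map to one entity, while near-identical PDFs (even a one-byte difference) hash differently and still produce separate entities, which is why the three copies above were not caught. In this sketch the lookup and the insert are separate steps, so in a real database with concurrent importers something like a uniqueness constraint or a transaction would presumably be needed to avoid the kind of race described above.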
from fatcat.