duplicates w/ fulltext

internetarchive commented on May 24, 2024
duplicates w/ fulltext

from fatcat.

Comments (4)

bnewbold commented on May 24, 2024

Thanks for the catch, and filing an issue! These are all the same version of the same paper and should be merged into a single entity. If they were different versions they would still need to be merged under the same "work" entity.

Here are the three release entities and the search query:

Some more background and details:

What happened in this particular case is that I crawled a number of "long-tail" open access journals and inserted about 1.5 million release entities from that crawl without matching to an identifier (like DOI), because most of these works don't have DOIs or other identifiers. Here's what Semantic Scholar and Google Scholar know about this paper (note no identifier):

In this case, I crawled 3 near-identical PDFs, and created new release entities for each, so there are three copies.

I wasn't aware of this category of problem from this import, but I am aware of two related problems with the long-tail import: we don't have linked "container" (journal) metadata for these 1.5 million papers, and many of the papers are actually from larger OA publishers (eg, PLOS), but got mixed in with smaller publishers on repository domains that got crawled. Here's an example of the latter category of error:

There are a few solutions to these categories of problems:

  • releases will be auto-grouped into works based on metadata (title, authors, year). This is primarily to group pre-prints with published versions, but will also group these near-duplicates as a partial resolution of duplicate entries
  • future creation of release entities lacking a persistent identifier (eg, DOI) will be much more conservative. For works with identifiers, we can do a fast lookup to see if something with the same ID already exists; for works without identifiers, we need to do a fuzzy match to see if something very similar already exists and should be merged. biblio-glutton is the tool we'll use for this fuzzy matching
  • targeted cleanups of the earlier 1.5 million long-tail work import are needed; at a minimum container metadata is needed. I've been working on this in the past couple weeks but haven't come up with a robust solution yet
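To make the first two bullets concrete, here is a rough sketch of grouping releases by normalized metadata and of a lookup-before-insert check. It is not fatcat's actual code: the dict fields (`id`, `title`, `authors`, `year`, `doi`), the function names, and the 0.9 threshold are all assumptions for illustration, and Python's `difflib` stands in for a real fuzzy matcher such as biblio-glutton.

```python
import re
import unicodedata
from collections import defaultdict
from difflib import SequenceMatcher


def normalize_title(title):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def work_key(release):
    """Hypothetical grouping key: normalized title, first author surname, year."""
    authors = release.get("authors") or [""]
    parts = authors[0].split()
    surname = parts[-1].lower() if parts else ""
    return (normalize_title(release["title"]), surname, release.get("year"))


def group_into_works(releases):
    """Bucket releases that share a key; each bucket is a candidate work."""
    groups = defaultdict(list)
    for release in releases:
        groups[work_key(release)].append(release["id"])
    return dict(groups)


def find_existing(candidate, existing, threshold=0.9):
    """Lookup before insert: exact identifier match first, then fuzzy title match."""
    doi = candidate.get("doi")
    if doi:
        for release in existing:
            if release.get("doi") == doi:
                return release
    # No identifier hit: fall back to comparing normalized titles within the same year.
    title = normalize_title(candidate["title"])
    for release in existing:
        same_year = release.get("year") == candidate.get("year")
        ratio = SequenceMatcher(None, title, normalize_title(release["title"])).ratio()
        if same_year and ratio >= threshold:
            return release
    return None
```

An importer following this pattern would call `find_existing` before creating a release and only insert when it returns `None`. An exact-key grouping like `work_key` only catches titles that normalize to the same string, so a real pipeline would likely need something more tolerant.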

bnewbold commented on May 24, 2024

For this specific case of three duplicates, I merged the entities in https://fatcat.wiki/editgroup/shf64rgvgreqbm4dqekjx5d4cq

metasj commented on May 24, 2024

Thanks for the detailed explanation and links! That really helps me visualize how changes propagate. (I still need to figure out grouping other than redirects.) If the PDFs were completely identical, might the duping still have happened?

bnewbold commented on May 24, 2024

If the PDFs are byte-identical (we usually check by SHA1 hash), this failure mode shouldn't happen: import scripts do a lookup before insert.
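As an illustration of that lookup-before-insert step, here is a minimal sketch using an in-memory index keyed by SHA1. The `FileIndex` class and its fields are hypothetical stand-ins, not the real import scripts or the catalog API.

```python
import hashlib


def sha1_hex(data):
    """Hex SHA1 digest of the raw file bytes."""
    return hashlib.sha1(data).hexdigest()


class FileIndex:
    """Hypothetical in-memory stand-in for a file entity lookup."""

    def __init__(self):
        self._by_sha1 = {}

    def insert_if_new(self, data, url):
        """Return (entity, created); reuse the existing entity on a hash hit."""
        digest = sha1_hex(data)
        existing = self._by_sha1.get(digest)
        if existing is not None:
            # Same bytes seen before: record the extra URL, create nothing new.
            existing["urls"].add(url)
            return existing, False
        entity = {"sha1": digest, "urls": {url}}
        self._by_sha1[digest] = entity
        return entity, True
```

A hash lookup only catches byte-identical files. Near-identical PDFs, like the three in this issue, hash differently and need the metadata-level matching described earlier.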

One caveat: roughly 20 duplicate file entities (duplicates of the same file) slipped through due to a race condition during early bulk imports, and I haven't cleaned these up (merged the entities) yet.
