
Comments (13)

atomrab commented on June 14, 2024

Also Li Tzu chʻeng Rebellion, 1628-1645, even though this appears in the same form as an alternate label for http://n2t.net/ark:/99152/p06c6g36v7k.

from periodo-reconciler.

rybesh commented on June 14, 2024

Can you share the Refine project you've created? Just export the project and attach the .tar.gz file to a comment on this issue. I can't reproduce this with the following tests:

test('identical labels should match', t => {
  const docs = [
    {id: 'exactmatch', label: 'Bourbons, 1700-'},
  ]
  const results = label.index(docs).search('Bourbons, 1700-')
  t.plan(1)
  t.same(results.map(({ref}) => ref), ['exactmatch'])
})

test('identical alternate labels should match', t => {
  const docs = [
    {id: 'exactmatch', label: 'Li Zicheng Rebellion, 1628-1645',
     localizedLabels: "Li Tzu ch'eng Rebellion, 1628-1645"},
  ]
  const results = label.index(docs).search("Li Tzu ch'eng Rebellion, 1628-1645")
  t.plan(1)
  t.same(results.map(({ref}) => ref), ['exactmatch'])
})

But it might be a more subtle character encoding issue in the Refine project (and presumably in the original file from which you created it).


atomrab commented on June 14, 2024

Here's the project. I wonder if I needed to re-install the reconciler after the last set of major changes you made?

LCSH_cleaning_test.openrefine.tar.zip


atomrab commented on June 14, 2024

Maybe in the encoding of the dash in the first case and the encoding of the apostrophe in the second? Several of the instances I saw were in the form "Label, XXXX-", with no stop date.
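
One way to check this guess is to print the code points of the non-ASCII characters in a failing label. The sketch below is only illustrative: `nonAsciiCodePoints` and `foldPunctuation` are hypothetical helpers, not part of periodo-reconciler, and the set of characters folded to ASCII is an assumption about which look-alikes occur in the data.

```javascript
// List the code point of every non-ASCII character in a string, so that
// look-alike apostrophes and dashes become visible.
function nonAsciiCodePoints(text) {
  return Array.from(text)
    .filter(ch => ch.codePointAt(0) > 0x7f)
    .map(ch => 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'))
}

// Hypothetical normalization: fold a few common apostrophe and dash variants
// to their ASCII equivalents before comparing labels.
function foldPunctuation(text) {
  return text
    .replace(/[\u02BB\u02BC\u2018\u2019]/g, "'") // modifier letters and curly quotes
    .replace(/[\u2010-\u2015\u2212]/g, '-')      // hyphens, dashes, minus sign
}

console.log(nonAsciiCodePoints('Li Tzu ch\u02BBeng Rebellion, 1628\u20131645'))
console.log(foldPunctuation('Li Tzu ch\u02BBeng Rebellion, 1628\u20131645'))
```

If the label in the Refine project reports a code point such as U+02BB where the test string uses a plain ASCII apostrophe, that would explain why the tests pass while the project data does not match.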


rybesh commented on June 14, 2024

Oh yes, you definitely need to reinstall the reconciler. Can you try that before I dig further?



atomrab commented on June 14, 2024

Ok, now we've got everything, but it's a disaster for automatic reconciliation: the date strings are being matched as well as the label strings, which means that "Abdul Aziz, 1861-1876" returns not only all the Abduls but also a hundred Civil War entries (because of the "1861"). I've attached this reconciliation attempt below. Can we adjust the algorithm to match only non-numeric characters in the reconciliation field? I think this might simplify things substantially: we'd get all the Abduls, but none of the Civil Wars. And there are few or no cases where the numeric values included in the label are going to be necessary for reconciliation.
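
A minimal sketch of the idea, assuming the digits are stripped from the query label before it is sent for matching. `stripNumeric` is a hypothetical helper written for illustration and is not the reconciler's own code.

```javascript
// Remove date ranges and any other digits from a label, then tidy the
// punctuation and whitespace left behind.
function stripNumeric(label) {
  return label
    .replace(/\d+\s*-\s*\d*/g, ' ') // "1861-1876" or an open-ended "1700-"
    .replace(/\d+/g, ' ')           // any remaining standalone digits
    .replace(/[\s,]+$/g, '')        // trailing commas and spaces
    .replace(/\s+/g, ' ')           // collapse runs of whitespace
    .trim()
}

console.log(stripNumeric('Abdul Aziz, 1861-1876'))
console.log(stripNumeric('Bourbons, 1700-'))
```

With the years removed, "Abdul Aziz, 1861-1876" is matched on "Abdul Aziz" alone, so entries that share only the "1861" no longer come back as candidates.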

LCSH_cleaning_test.openrefine-2.tar.zip


rybesh commented on June 14, 2024

OK, there is a new version of the reconciler (3.1.0) which ignores years in labels. Please run npm install -g periodo-reconciler and try it out.


atomrab commented on June 14, 2024

This is tighter, but the one-to-one matching is still being thrown off by common words and characters ("I" or "II", "War", etc.). Only 209 of 1637 matched; the rest would have to be done manually.

This is probably a particular issue for LCSH, and won't be as problematic for other datasets, which are unlikely to have all these wars and reigns. One solution is to maintain different reconciliation services with different resolutions. Another is to make it possible (or maybe this isn't possible in Refine?) to carry out a manual lookup for non-matches with the first version, which only missed a few items (I think you're right that this was a special-character issue).


rybesh commented on June 14, 2024

Let me see how many of these I can deal with through a combination of stopwords and special handling of Roman numerals. We still may need a high-recall/high-precision toggle if that doesn't work.
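
As a rough illustration of what such handling could look like (not the approach actually taken in the reconciler, which this thread does not show), the sketch below drops stopwords and attaches a Roman numeral to the token before it, so that "II" never matches on its own. The stopword list, the regular expression, and the name `tokenizeWithNumerals` are all assumptions.

```javascript
// Hypothetical stopword list; a real one would be tuned against the data.
const NUMERAL_STOPWORDS = new Set(['war', 'of', 'the'])

// Matches well-formed lowercase Roman numerals. Note that a few ordinary
// words (e.g. "mix") are also valid numerals and would be treated as such.
const ROMAN_NUMERAL = /^(?=[ivxlcdm]+$)m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})$/

function tokenizeWithNumerals(label) {
  const words = label.toLowerCase().split(/[^\p{L}]+/u).filter(Boolean)
  const tokens = []
  for (const word of words) {
    if (ROMAN_NUMERAL.test(word) && tokens.length > 0) {
      // Glue the numeral onto the previous token instead of indexing it alone.
      tokens[tokens.length - 1] += ' ' + word
    } else if (!NUMERAL_STOPWORDS.has(word)) {
      tokens.push(word)
    }
  }
  return tokens
}

console.log(tokenizeWithNumerals('Charles II, 1660-1685'))
console.log(tokenizeWithNumerals('Punic War'))
```

Under these assumptions a query for "Charles II" can only match entries that contain the combined token, rather than every label with a "II" in it.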


rybesh commented on June 14, 2024

Also: I am going to work with the LCSH data by pre-processing the labels (in Refine) to pull out the years. That seems like a more realistic test—it's a bit unfair to expect the reconciler to work well with labels alone, when we do have information about the temporal coverage.
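
The same pre-processing can be sketched outside Refine. The helper below is hypothetical (`extractYears` is not a Refine or reconciler function) and assumes the LCSH labels end in "YYYY-YYYY" or an open-ended "YYYY-", as in the examples earlier in the thread.

```javascript
// Split a label such as "Abdul Aziz, 1861-1876" into its text and its
// start/stop years. An open-ended range like "1700-" yields stop: null.
function extractYears(label) {
  const match = label.match(/,?\s*(\d{3,4})\s*-\s*(\d{3,4})?\s*$/)
  if (!match) {
    return {label: label.trim(), start: null, stop: null}
  }
  return {
    label: label.slice(0, match.index).trim(),
    start: Number(match[1]),
    stop: match[2] === undefined ? null : Number(match[2]),
  }
}

console.log(extractYears('Abdul Aziz, 1861-1876'))
console.log(extractYears('Bourbons, 1700-'))
```

The resulting start and stop values can then go into their own columns, leaving the label column free of digits.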


rybesh commented on June 14, 2024

OK, I've just published a new version of the reconciler that performs considerably better on the LCSH dataset. The basic strategy was to identify modifying prefixes (e.g. "Anglo", "pre") and period category terms ("Empire", "Revolts") and not tokenize these separately.
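
To make the strategy concrete, here is a minimal sketch of tokenizing so that prefixes and category terms stay attached to a neighbouring word. It is an illustration only: the two word lists contain just the examples quoted above, and `tokenizePeriodLabel` is a hypothetical name, not the function in the published reconciler.

```javascript
// Only the examples mentioned in the comment; the real lists are not shown here.
const MODIFYING_PREFIXES = new Set(['anglo', 'pre'])
const CATEGORY_TERMS = new Set(['empire', 'revolts'])

function tokenizePeriodLabel(label) {
  const words = label.toLowerCase().split(/[^\p{L}]+/u).filter(Boolean)
  const tokens = []
  let pendingPrefix = null
  for (const word of words) {
    if (MODIFYING_PREFIXES.has(word)) {
      // Hold the prefix back and join it to the word that follows.
      pendingPrefix = pendingPrefix ? pendingPrefix + '-' + word : word
    } else if (CATEGORY_TERMS.has(word) && tokens.length > 0 && pendingPrefix === null) {
      // Keep the category term with the word it qualifies.
      tokens[tokens.length - 1] += ' ' + word
    } else {
      tokens.push(pendingPrefix ? pendingPrefix + '-' + word : word)
      pendingPrefix = null
    }
  }
  if (pendingPrefix) tokens.push(pendingPrefix)
  return tokens
}

console.log(tokenizePeriodLabel('Anglo-Saxon period'))
console.log(tokenizePeriodLabel('Roman Empire'))
```

Because "empire" is never indexed as a token of its own in this sketch, a query for "Roman Empire" would not pull in every other empire as a candidate.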

When I reconcile the LCSH data using this version, after first adding start and stop columns extracted from the labels, I get zero candidates for 463 rows, immediate matches (i.e. only a single matching candidate) for 667 rows, 2–5 candidates for 349 rows, and 6+ candidates for 127 rows (these are things like “Bombardment,” “Civil War,” or “Modern”). I've attached the Refine (version 2.8) project: LCSH_cleaning_test.openrefine.tar.gz


rybesh commented on June 14, 2024

Closing this now, as I think the major issues have been addressed.

