Comments (13)
Also Li Tzu chʻeng Rebellion, 1628-1645, even though this appears in the same form as an alternate label for http://n2t.net/ark:/99152/p06c6g36v7k.
from periodo-reconciler.
Can you share the Refine project you've created? Just export the project and attach the .tar.gz file to a comment on this issue.
I can't reproduce this with the following tests:
test('identical labels should match', t => {
  const docs = [
    {id: 'exactmatch', label: 'Bourbons, 1700-'},
  ]
  const results = label.index(docs).search('Bourbons, 1700-')
  t.plan(1)
  t.same(results.map(({ref}) => ref), ['exactmatch'])
})

test('identical alternate labels should match', t => {
  const docs = [
    {id: 'exactmatch', label: 'Li Zicheng Rebellion, 1628-1645',
      localizedLabels: "Li Tzu ch'eng Rebellion, 1628-1645"},
  ]
  const results = label.index(docs).search("Li Tzu ch'eng Rebellion, 1628-1645")
  t.plan(1)
  t.same(results.map(({ref}) => ref), ['exactmatch'])
})
But it might be a more subtle character encoding issue in the Refine project (and presumably in the original file from which you created it).
Here's the project. I wonder if I needed to re-install the reconciler after the last set of major changes you made?
LCSH_cleaning_test.openrefine.tar.zip
Maybe in the encoding of the dash in the first case and the encoding of the apostrophe in the second? Several of the instances I saw were in the form "Label, XXXX-", with no stop date.
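One way to check this kind of suspicion is to compare the two strings code point by code point. The following is a minimal sketch; `codePoints` and `diffCodePoints` are hypothetical helpers and not part of periodo-reconciler. The apostrophe example assumes the label uses the modifier letter U+02BB, and the en dash in the second example is only a guess at what a non-ASCII dash might be.

```javascript
// Hypothetical helper (not part of periodo-reconciler): list the Unicode
// code points of a string so that look-alike characters become visible.
function codePoints(s) {
  return Array.from(s, ch =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'))
}

// Return every position at which two strings differ, with both code points.
function diffCodePoints(a, b) {
  const as = codePoints(a)
  const bs = codePoints(b)
  const diffs = []
  for (let i = 0; i < Math.max(as.length, bs.length); i++) {
    if (as[i] !== bs[i]) diffs.push({ index: i, a: as[i], b: bs[i] })
  }
  return diffs
}

// ASCII apostrophe (U+0027) versus modifier letter turned comma (U+02BB)
console.log(diffCodePoints("Li Tzu ch'eng", 'Li Tzu ch\u02BBeng'))

// ASCII hyphen-minus (U+002D) versus en dash (U+2013), an assumed example
console.log(diffCodePoints('Bourbons, 1700-', 'Bourbons, 1700\u2013'))
```

An empty result means the two strings are identical at the code point level, in which case the mismatch would have to come from somewhere else.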
Oh yes, you definitely need to reinstall the reconciler. Can you try that before I dig further?
OK, now we've got everything, but it's a disaster for automatic reconciliation: the date strings are matched as well as the label strings, which means that "Abdul Aziz, 1861-1876" returns not only all the Abduls but also a hundred Civil War entries (because of the "1861"). I've attached this reconciliation attempt below. Can we adjust the algorithm to match only non-numeric characters in the reconciliation field? I think this might simplify things substantially: we'd get all the Abduls, but none of the Civil Wars. And there are few or no cases where the numeric values included in the label are going to be necessary for reconciliation.
LCSH_cleaning_test.openrefine-2.tar.zip
OK, there is a new version of the reconciler (3.1.0) which ignores years in labels. Please run npm install -g periodo-reconciler and try it out.
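To illustrate what ignoring years in labels could look like, here is a rough sketch. It is not the code in version 3.1.0; `stripYears`, `tokens`, `sharedTokens` and both regular expressions are assumptions made for this example.

```javascript
// Hypothetical sketch, not the reconciler's actual implementation.
// A year range such as ", 1861-1876" or an open range such as ", 1700-".
const YEAR_RANGE = /,?\s*\b\d{1,4}\s*[-\u2013]\s*(?:\d{1,4})?/g
// A single three- or four-digit year such as ", 1848".
const SINGLE_YEAR = /,?\s*\b\d{3,4}\b/g

// Remove year-like parts of a label before it is indexed or searched.
function stripYears(label) {
  return label
    .replace(YEAR_RANGE, '')
    .replace(SINGLE_YEAR, '')
    .replace(/\s+/g, ' ')
    .trim()
}

// Lower-cased, whitespace-separated tokens of a label with years removed.
function tokens(label) {
  return stripYears(label).toLowerCase().split(/\s+/).filter(Boolean)
}

// Tokens that two labels have in common once years are ignored.
function sharedTokens(a, b) {
  const other = new Set(tokens(b))
  return tokens(a).filter(t => other.has(t))
}

console.log(stripYears('Abdul Aziz, 1861-1876'))
console.log(sharedTokens('Abdul Aziz, 1861-1876', 'Civil War, 1861-1865'))
```

With the years removed, the two labels in the last line no longer share the "1861" token, which is the effect asked for earlier in the thread.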
This is tighter, but the one-to-one matching is still being thrown off by common words and characters ("I" or "II", "War", etc.). Only 209 of 1637 matched; the rest would have to be done manually.
This is probably a particular issue for LCSH, and won't be as problematic for other datasets, which are unlikely to have all these wars and reigns. One solution is to maintain different reconciliation services with different resolutions. Another is to make it possible (or maybe this isn't possible in Refine?) to carry out a manual lookup for non-matches with the first version, which only missed a few items (I think you're right that this was a special-character issue).
Let me see how many of these I can deal with through a combination of stopwords and special handling of Roman numerals. We still may need a high-recall/high-precision toggle if that doesn't work.
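As a sketch of what that combination might look like: the stopword list below is illustrative only, and attaching a Roman numeral to the preceding token is just one possible form of "special handling". Neither is taken from the reconciler's source.

```javascript
// Illustrative stopword list; the reconciler's real list is not shown
// in this thread.
const STOPWORDS = new Set(['the', 'of', 'and', 'war'])

// Upper-case Roman numerals only, so that a name such as "Li" is not
// mistaken for the numeral LI (51).
const ROMAN = /^(?=[IVXLCDM]+$)M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$/

function tokenize(label) {
  const words = label.split(/[^\p{L}\p{N}]+/u).filter(Boolean)
  const tokens = []
  for (const word of words) {
    if (ROMAN.test(word) && tokens.length > 0) {
      // Attach the numeral to the previous kept token ("henry" + "ii"),
      // so "II" never matches on its own.
      tokens[tokens.length - 1] += ' ' + word.toLowerCase()
    } else if (!STOPWORDS.has(word.toLowerCase())) {
      tokens.push(word.toLowerCase())
    }
  }
  return tokens
}

console.log(tokenize('Henry II'))
console.log(tokenize('Henry III'))
console.log(tokenize('Li Zicheng Rebellion'))
```

Under this scheme "Henry II" and "Henry III" produce different tokens, while a bare "II" or "War" can no longer pull in unrelated candidates.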
Also: I am going to work with the LCSH data by pre-processing the labels (in Refine) to pull out the years. That seems like a more realistic test—it's a bit unfair to expect the reconciler to work well with labels alone, when we do have information about the temporal coverage.
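The pre-processing itself was done in Refine, and the expressions used there are not shown in this thread. Purely as an illustration of the idea, a JavaScript sketch of pulling years out of a label could look like this; `splitLabel` and its regular expression are hypothetical and only cover the "Label, XXXX-YYYY" and "Label, XXXX-" forms mentioned above.

```javascript
// Hypothetical helper: split "Label, START-STOP" (or the open-ended
// "Label, START-") into a bare label plus numeric start and stop years.
function splitLabel(raw) {
  const m = raw.match(/^(.*?),?\s*(\d{1,4})\s*[-\u2013]\s*(\d{1,4})?\s*$/)
  if (!m) return { label: raw.trim(), start: null, stop: null }
  return {
    label: m[1].trim(),
    start: Number(m[2]),
    stop: m[3] === undefined ? null : Number(m[3])
  }
}

console.log(splitLabel('Abdul Aziz, 1861-1876'))
console.log(splitLabel('Bourbons, 1700-'))
console.log(splitLabel('Modern'))
```

Labels without a trailing year range are returned unchanged with empty start and stop values. Dates in other forms (for example B.C. dates or century names) would need additional rules.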
OK, I've just published a new version of the reconciler that performs considerably better on the LCSH dataset. The basic strategy was to identify modifying prefixes (e.g. "Anglo", "pre") and period category terms ("Empire", "Revolts") and not tokenize these separately.
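A minimal sketch of that strategy, under the assumption that "not tokenizing these separately" means gluing a prefix to the word that follows it and a category term to the word that precedes it. The two word lists contain only the examples given above, and `tokenizePeriodLabel` is a hypothetical name, not a function from the reconciler.

```javascript
// Only the examples from the comment above; the real lists are longer
// and are not reproduced here.
const PREFIXES = new Set(['anglo', 'pre'])
const CATEGORY_TERMS = new Set(['empire', 'revolts'])

function tokenizePeriodLabel(label) {
  const words = label.toLowerCase().split(/[^\p{L}\p{N}]+/u).filter(Boolean)
  const tokens = []
  let pendingPrefix = ''
  for (const word of words) {
    if (PREFIXES.has(word)) {
      // Hold the prefix and glue it onto the next word.
      pendingPrefix += word + ' '
    } else if (CATEGORY_TERMS.has(word) && tokens.length > 0 && !pendingPrefix) {
      // Glue a category term onto the previous token.
      tokens[tokens.length - 1] += ' ' + word
    } else {
      tokens.push(pendingPrefix + word)
      pendingPrefix = ''
    }
  }
  // A trailing prefix with nothing after it becomes its own token.
  if (pendingPrefix) tokens.push(pendingPrefix.trim())
  return tokens
}

console.log(tokenizePeriodLabel('Anglo-Saxon period'))
console.log(tokenizePeriodLabel('Roman Empire'))
console.log(tokenizePeriodLabel('Peasant Revolts'))
```

Because "empire" is never indexed on its own here, a query for "Roman Empire" would not match every other label that merely contains "Empire".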
When I reconcile the LCSH data using this version, after first adding start and stop columns extracted from the labels, I get zero candidates for 463 rows, immediate matches (i.e. only a single matching candidate) for 667 rows, 2–5 candidates for 349 rows, and 6+ candidates for 127 rows (these are things like “Bombardment,” “Civil War,” or “Modern”). I've attached the Refine (version 2.8) project: LCSH_cleaning_test.openrefine.tar.gz
Closing this now, as I think the major issues have been addressed.