digitalscriptorium / ds-open-refine

Digital Scriptorium OpenRefine documentation and JSON recipes for data reconciliation

authorities controlled-vocabularies data-reconciliation digital-humanities json metadata metadata-management openrefine semantic-enrichment

ds-open-refine's Introduction

Digital Scriptorium Data Reconciliation Process through OpenRefine

Digital Scriptorium OpenRefine documentation and JSON recipes for data reconciliation and management

General instructions

To apply the JSON instructions (also known as recipes) found in this repository to DS data in OpenRefine, go to the left column, select the Undo/Redo tab, select Apply, paste the JSON code, and then select Perform operations. This executes the prewritten commands, which perform the actions needed for the reconciliation process and for merging new datasets with previous ones.

Facets and filters can also be applied to the data from the drop-down menu on each column header. Active facets and filters are displayed in the left column under the Facet/Filter tab.

The following conventions apply when editing the file name variables found in the instructions in this repository (use all lowercase letters where applicable):

  • DATE = the date the file/dataset was generated/created/extracted in YYYYMMDD format
  • VALUE = the type of metadata values or metadata element extracted and enriched, such as genres or languages or names
  • INSTITUTION = the code for the name of the institutional source for the data, such as penn or kansas or csl
  • DATATYPE = the type of encoding standard or technical format of the metadata source, such as marcxml or mets or csv
  • One or more DIFFERENTIATORS may also be added to the file name to disambiguate files, using the source names of collections or databases, such as bibliophilly or muslimworld, or batch numbers, such as batch-1, batch-2, etc.

Examples of correctly formatted file names:

  • 20230518-materials-rome-mets-legacy-enriched.csv
  • 20230630-genres-penn-marcxml-bibliophilly-enriched.csv
  • 20230715-names-kansas-marc-enriched.csv
  • 20230816-languages-princeton-marcxml-batch-3-enriched.csv
  • 20230901-places-hrc-csv-fragments-batch-1-enriched.csv
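The naming convention above can be sketched as a small helper. This is only an illustration, not part of the repository's recipes: the function name `build_file_name` and its parameters are hypothetical, and the trailing `enriched` suffix is an assumption taken from the example file names rather than from the variable list.

```python
from datetime import date


def build_file_name(created, value, institution, datatype,
                    differentiators=(), suffix="enriched", extension="csv"):
    """Assemble DATE-VALUE-INSTITUTION-DATATYPE[-DIFFERENTIATORS]-suffix.extension.

    The "enriched" suffix is assumed from the example file names.
    """
    parts = [created.strftime("%Y%m%d"), value, institution, datatype,
             *differentiators, suffix]
    # The conventions ask for all lowercase letters where applicable.
    cleaned = [part.strip().lower() for part in parts]
    if any(not part for part in cleaned):
        raise ValueError("every file name component must be non-empty")
    return "-".join(cleaned) + "." + extension


print(build_file_name(date(2023, 6, 30), "genres", "penn", "marcxml",
                      differentiators=["bibliophilly"]))
```

Passing the differentiators as a sequence keeps batch numbers such as batch-3 intact as single components.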

Reconciliation instructions by metadata element / authority type

Genres

Genre reconciliation instructions

Languages

Language reconciliation instructions

Materials

Material reconciliation instructions

Names

Name reconciliation instructions

Places

Place reconciliation instructions

Subjects

Subject reconciliation instructions

Titles

Title reconciliation instructions

Instructions for integrating new reconciliations with previously reconciled data

Merging newly enriched data with data dictionaries

ds-open-refine's People

Contributors

emeryr-upenn, lpcoladangelo, marsassi, reord-berend, rosemccandless

ds-open-refine's Issues

Bug: reconciliation and cell merging

Bug 1:

@lpcoladangelo discovered that reconciliation is more reliable if a recon column is created from a multi-valued cell and the recon column is then split (rather than splitting the source column and then creating the recon column). This mitigates a problem in OpenRefine 3.5.1, where names in the recon column are not reconciled and no option is available to search for a matching value.

@demery found that in Safari, OpenRefine 3.5.1 still returned unreconciled values without the "Search for a match" button, but the problem did not occur in Chrome.

This change, splitting the "recon" column rather than the source column, should be applied to former owners and authors.

Bug 2:

The current method for merging qid-human with qid-organization and instance of-human with instance of-organization doesn't work. We want to grab the first non-blank value, but we're using coalesce which returns the first non-null value; because blank values (the empty string) aren't null, coalesce won't work:

coalesce("", "a value") // => ""

We want a value to be returned. To find non-blank values we need to use if() and isBlank().
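The difference between the two behaviors can be modeled outside OpenRefine. The following is a minimal Python sketch of the logic only, not GREL: the functions `coalesce` and `first_non_blank` are illustrative stand-ins, and "blank" is assumed here to mean null or the empty string.

```python
def coalesce(*values):
    """Model of a null-only check: return the first value that is not None."""
    for value in values:
        if value is not None:
            return value
    return None


def first_non_blank(*values):
    """Return the first value that is neither None nor the empty string."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


# The empty string is not null, so the null-only check returns it unchanged.
print(repr(coalesce("", "a value")))
# Skipping blanks as well yields the value we actually want.
print(repr(first_non_blank("", "a value")))
```

In the merge step, the same check would be applied per row to the two QID columns, taking the human value when it is non-blank and the organization value otherwise.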

Feature/Conversion from `material_as_recorded` to `material`

The input CSV has two columns for support: material and material_as_recorded. The idea is that an authorized value (or values, for MSS with mixed support media such as paper and parchment) would be used for material.

There are a number of tasks:

  • Determine whether the model makes this distinction. It does!
  • If yes, identify a vocabulary/authority for material, and
  • Come up with a method to convert from material_as_recorded to material.

Rename working `qid` column to source-column specific name

During the working operations the QID column will have names like qid, human_qid, and organization_qid. When the reconciliation process is complete, the final name should be something like former_owner_qid, author_qid, and so forth.

Alternately: We may use the QID for the authority column: former_owner, author, etc. This will only be the case if we're not using the DS-internal IDs in these spots.
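A rename step of this kind could be generated rather than hand-edited for each source column. The sketch below builds a one-step recipe as JSON. The operation name `core/column-rename` and the keys `oldColumnName` and `newColumnName` reflect what OpenRefine typically exports in its operation history, but treat them as assumptions and compare against a recipe extracted from your own OpenRefine version. The helper `rename_operation` is hypothetical.

```python
import json


def rename_operation(old_name, new_name):
    """Build one rename step in the shape of an OpenRefine operation history entry.

    The "op" value and key names are assumed; verify them against an
    extracted operation history before relying on this.
    """
    return {
        "op": "core/column-rename",
        "oldColumnName": old_name,
        "newColumnName": new_name,
        "description": f"Rename column {old_name} to {new_name}",
    }


# Rename the working qid column for the former_owner source column.
recipe = json.dumps([rename_operation("qid", "former_owner_qid")], indent=2)
print(recipe)
```

The printed JSON is what would be pasted into the Apply dialog under the Undo/Redo tab.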

Add ability to reconcile human and non-human names: owners, authors, etc.

Wikidata name reconciliation is different for humans and organizations. These require separate steps. Here's a proposed/possible workflow:

  1. create recon column
  2. reconcile recon column to Q5 human
  3. add human_qid column
  4. reconcile recon column to Q43229 organization
  5. add organization_qid column
  6. merge/coalesce columns human_qid + organization_qid > qid
  7. rename qid column to SOURCE_COLUMN_qid

JSON should be modified with new column names. New templates should be created for these JSON files as well.

Use separate columns for human and organization reconciliation types

To prevent data corruption and to make behavior clear to data cleaners, separate columns should be created for reconciling human and organization names against Wikidata; e.g.,

  • recon-human
  • recon-organization

We discovered a bug (?) in OpenRefine: when the Wikidata reconciliation type in a column is changed from human (Q5) to organization (Q43229), not all matches are cleared. So, for example, the entry "Morris Jastrow" will be reconciled both as a human and as an organization, and will have its QID in both the human and organization columns. The steps expect that human and organization QIDs will be distinct.
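Rows affected by this problem can be found by checking an exported dataset for entries with a QID in both columns. This is a minimal sketch, assuming the rows have been loaded as dictionaries and that the columns are named human_qid and organization_qid. The function name and the placeholder QID values are hypothetical, not real Wikidata identifiers.

```python
def rows_with_conflicting_qids(rows, human_col="human_qid",
                               org_col="organization_qid"):
    """Return rows that carry a non-blank QID in both the human and organization columns."""
    def filled(value):
        return value is not None and value != ""

    return [row for row in rows
            if filled(row.get(human_col)) and filled(row.get(org_col))]


# Placeholder QIDs for illustration only.
sample = [
    {"name": "Morris Jastrow", "human_qid": "Q_PLACEHOLDER_1",
     "organization_qid": "Q_PLACEHOLDER_1"},
    {"name": "Example Library", "human_qid": "",
     "organization_qid": "Q_PLACEHOLDER_2"},
]

for row in rows_with_conflicting_qids(sample):
    print(row["name"])
```

Any row this check reports should be reviewed by hand before the human and organization QID columns are merged.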
