wdmapper's Introduction

wdmapper

Wikidata authority file mapping tool

See https://wdmapper.readthedocs.io/ or the docs folder in the source for full documentation.

Description

wdmapper is a command line application and Python library to manage mappings between authority files in Wikidata. The current draft is limited to simple 1-to-1 mappings, which only exist for concepts of obviously unique identity such as people.

Installation

$ pip install wdmapper

Add option --user to install as non-root and option --upgrade to update an already installed version.

Usage

The general calling syntax is

wdmapper [OPTIONS] COMMAND [SOURCE] TARGET

where COMMAND is a wdmapper command, TARGET is a Wikidata property, and SOURCE is an optional Wikidata property for indirect links. TARGET can also be omitted when read from a BEACON file. Depending on the command wdmapper reads input mappings from a file or standard input and/or Wikidata and writes them to standard output or a file.

Run wdmapper without command line arguments (or with option --help|-h) for a list of command line options.

License

The source code is available at https://github.com/gbv/wdmapper and licensed under the terms of the MIT license.

wdmapper's People

Contributors

nichtich

wdmapper's Issues

Output format quicks

Not sure if "quicks" output format is supposed to survive - if not, the following is not relevant.

I would expect that quicks output gives me input statements for https://tools.wmflabs.org/wikidata-todo/quick_statements.php, that is, valid and executable lines according to the specification, perhaps intermingled with other information marked as comments.

The following does not look like commands (in that the lines do not start with an item identifier):

$ wdmapper check P227 P2428 -i /opt/repec-ras/var/ras/example2/map/gnd_ras_mapping.test.csv -t quicks
# skipping indirect link 124825109 -> Q? -> psn9
Christian von Hirschhausen (GND PRESENT BUT NOT LINKED) P227    "170947386"
Christian von Hirschhausen (GND PRESENT BUT NOT LINKED) P2428   "phi58"
-Q1082503       P227    "170947386"

Since the wmflabs tool requires TAB (\t) separators and input by copy & paste, it could make sense to mask the TAB delimiter, because tabs may get replaced by blanks when copied from a terminal window. In such cases I have used something printable (e.g. "___") and replaced it with actual TABs in an intermediate step.
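The masking step described above could look roughly like the following. This is only a sketch: the helper names mask_tabs and unmask_tabs are made up for illustration and are not part of wdmapper, and the "___" marker is assumed not to occur in the data itself.

```python
MASK = "___"  # printable stand-in for TAB; assumed not to occur in the data


def mask_tabs(line, mask=MASK):
    """Replace TABs with a printable marker so the line survives copy & paste."""
    return line.replace("\t", mask)


def unmask_tabs(line, mask=MASK):
    """Restore real TABs before the line is pasted into the target tool."""
    return line.replace(mask, "\t")


# Illustrative statement line: item, property, quoted value
command = 'Q1082503\tP227\t"170947386"'
masked = mask_tabs(command)
print(masked)
print(unmask_tabs(masked) == command)
```

The round trip is only lossless as long as the marker never appears in an identifier or annotation, so a rarer marker may be a safer choice for free-text fields.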

TypeError in check -t quicks

Not sure if "quicks" output format is supposed to survive - if not, the following is not relevant.

The following command gives a TypeError:

$ wdmapper check P227 P2428 -i /opt/repec-ras/var/ras/example2/map/gnd_ras_mapping.test.csv -t quicks
# skipping indirect link 124825109 -> Q? -> psn9
Christian von Hirschhausen (GND PRESENT BUT NOT LINKED) P227    "170947386"
Christian von Hirschhausen (GND PRESENT BUT NOT LINKED) P2428   "phi58"
-Q1082503       P227    "170947386"
Traceback (most recent call last):
  File "/usr/bin/wdmapper", line 11, in <module>
    sys.exit(main())
  File "/usr/lib/python2.7/site-packages/wdmapper/cli.py", line 113, in main
    run(*args)
  File "/usr/lib/python2.7/site-packages/wdmapper/cli.py", line 98, in run
    wdmapper.wdmapper(command, **vars(args))
  File "/usr/lib/python2.7/site-packages/wdmapper/wdmapper.py", line 255, in wdmapper
    args.writer.write_delta(delta)
  File "/usr/lib/python2.7/site-packages/wdmapper/format/quicks.py", line 43, in write_delta
    self.edit_link(link,'-')
  File "/usr/lib/python2.7/site-packages/wdmapper/format/quicks.py", line 23, in edit_link
    self.write_command((qid, prop, '"' + link.target + '"'))
TypeError: coercing to Unicode: need string or buffer, NoneType found

LC_ALL and LANG are set according to #22.
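The last frame of the traceback concatenates link.target with quote characters, which suggests that link.target was None for the link being written. A defensive formatter could skip such links instead of failing. The following is a minimal sketch under that assumption; quick_statement is a hypothetical helper, not the actual code in quicks.py.

```python
def quick_statement(qid, prop, value):
    """Format one TAB-separated statement line.

    Returns None when any part is missing, so the caller can skip the
    link or emit a comment line instead of raising a TypeError.
    """
    if qid is None or prop is None or value is None:
        return None
    return "\t".join((qid, prop, '"' + value + '"'))


print(quick_statement("Q1082503", "P227", "170947386"))
print(quick_statement("Q1082503", "P227", None))  # incomplete link
```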

Output missing with diff command / missing utf-8 locale

With a mapping file created externally, I get no output from the wdmapper diff command:

$ wdmapper diff P227 P2428 < /opt/repec_ras/var/pers_examples/example2/map/gnd_ras_mapping.example.csv
$ head  /opt/repec_ras/var/pers_examples/example2/map/gnd_ras_mapping.example.csv
source, target, annotation
171255054, phe112, "anthony giles heyes"
114008787, pwy2, "Charles Wyplosz"
137272472, pli522, "Stephen Littlechild"
171244133, pka87, "Sandeep Kapur"
120931761, pba2, "Söhnke M. Bartram"
171128656, pde16, "Rommert Dekker"
171110358, par215, "Anil Arya"
160912946, pqu66, "Agnes Reynes Quisumbing"
171827759, phy7, "Ari Hyytinen"

The complete input file is available on github

I've checked this on the current master and dev branch.

#SOURCESET / #TARGETSET beacon header lines

I have noticed that in v 0.0.9 the result header of a check command may include lines such as:

#DESCRIPTION: Mapping from GND IDs to RePEc Short-IDs
#PREFIX: http://d-nb.info/gnd/
#TARGET: https://authors.repec.org/pro/
#SOURCESET: http://www.wikidata.org/entity/Q36578
#TARGETSET: http://www.wikidata.org/entity/Q206316

I am not sure if that's a good idea, because these may not be clearly defined.

Targetset http://www.wikidata.org/entity/Q206316 designates "Research Papers in Economics" as a bibliographic database, whereas P2428 has humans as its intended range (here, authors identified in Repec's Author Service). Furthermore, http://www.wikidata.org/entity/Q206316 lists as "Wikidata property" (http://www.wikidata.org/entity/P1687) "Research Papers in Economics Series handle" (http://www.wikidata.org/entity/P2761).

P1687 and P1629 are intended to be inverse properties, but obviously that is not enforced. The data may be messy here, and more opportunities to mess it up lurk just around the corner (Repec institutions EDIRC, e.g.).

On both the source and the target side, only an explicit subset (RAS) or an implicit one (the former PND part of GND) may constitute the set addressed in the mapping. If there is no explicit requirement to return the sets in the beacon header, it would perhaps be better to drop them instead of giving misleading hints.

wdmapper get not working with two properties

The example command wdmapper get P214 P2428 --limit 10 results in

source, target, annotation
sourceId, targetId, Q18430
sourceId, targetId, Q100689
sourceId, targetId, Q131112
sourceId, targetId, Q173994
sourceId, targetId, Q191020
sourceId, targetId, Q434509
sourceId, targetId, Q502557
sourceId, targetId, Q925174
sourceId, targetId, Q1153459
sourceId, targetId, Q1189225

Python 2.7.5 and d9e46a0

wdmapper binary not installed

When following the installation instruction, I get

# pip install wdmapper
Requirement already satisfied: wdmapper in /usr/lib/python2.7/site-packages/wdmapper-0.0.0-py2.7.egg
Requirement already satisfied: pywikibot in /usr/lib/python2.7/site-packages/pywikibot-2.0rc5-py2.7.egg (from wdmapper)
Requirement already satisfied: httplib2>=0.9 in /usr/lib/python2.7/site-packages/httplib2-0.9.2-py2.7.egg (from pywikibot->wdmapper)
Requirement already satisfied: ipaddress in /usr/lib/python2.7/site-packages/ipaddress-1.0.17-py2.7.egg (from pywikibot->wdmapper)

# wdmapper
-bash: /usr/bin/wdmapper: No such file or directory

# which wdmapper
/usr/bin/which: no wdmapper in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin)

Add BEACON input format

With the BEACON input format, the source and target arguments do not need to be specified on the command line.
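To illustrate why the header can stand in for command line arguments, here is a rough sketch of reading the meta fields and links from BEACON text. It is not wdmapper's reader: parse_beacon is a hypothetical function, the field name WDTARGETPROPERTY is taken from an example file quoted in another issue, and treating a two-field line as source|annotation is a simplifying assumption.

```python
def parse_beacon(lines):
    """Split BEACON text into a dict of meta fields and a list of links.

    Each link is a (source, annotation, target) tuple; missing fields
    are None. Two-field lines are read as source|annotation (assumption).
    """
    meta, links = {}, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Header line of the form "#NAME: value"
            name, _, value = line[1:].partition(":")
            meta[name.strip().upper()] = value.strip()
        else:
            tokens = line.split("|")
            tokens += [None] * (3 - len(tokens))
            links.append(tuple(tokens[:3]))
    return meta, links


sample = """#PREFIX: http://www.wikidata.org/entity/
#TARGET: https://authors.repec.org/pro/
#WDTARGETPROPERTY: P2428

Q353915|David D. Friedman|pfr16
Q312561|James Heckman|phe22
"""
meta, links = parse_beacon(sample.splitlines())
print(meta.get("WDTARGETPROPERTY"))
print(links[0])
```

With the property available in meta, a caller would only need to fall back to the command line arguments when the header does not name it.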

Output of the check command

With the following input

source, target, annotation
124825109, psn9, "Dennis Snower (GND PRESENT AND LINKED)"
170947386, phi58, "Christian von Hirschhausen (GND PRESENT BUT NOT LINKED)"
114008787, pwy2, "Charles Wyplosz (GND NOT PRESENT)"
124825109, pxx999, "Dennis SNOWER (GND PRESENT BUT LINKED TO DIFFERENT TARGET)"
123292182, psn9, "Leo H. Klaassen (GND PRESENT AND NOT LINKED BUT LINK TARGET EXISTS ALREADY)"

the current version (0.0.7) gives the output

$ wdmapper check P227 P2428 -i  tests/gnd_repec_test.csv
#FORMAT: BEACON
#NAME: RePEc Short-ID
#DESCRIPTION: Mapping from GND IDs to RePEc Short-IDs
#PREFIX: http://d-nb.info/gnd/
#TARGET: https://authors.repec.org/pro/

~ 124825109|Q1189225|psn9
+ 170947386|Christian von Hirschhausen (GND PRESENT BUT NOT LINKED)|phi58
- 170947386|Q1082503
+ 114008787|Charles Wyplosz (GND NOT PRESENT)|pwy2
+ 124825109|Dennis SNOWER (GND PRESENT BUT LINKED TO DIFFERENT TARGET)|pxx999
- 124825109|Q1189225|psn9
+ 123292182|Leo H. Klaassen (GND PRESENT AND NOT LINKED BUT LINK TARGET EXISTS ALREADY)|psn9
- 124825109|Q1189225|psn9
- 123292182|Q18817490

Feature suggestion: command add --dry-run

When I have an externally crafted mapping file, I currently cannot see a way to evaluate which of these mappings would "work" for wikidata, meaning: have a matching property value for the WDSOURCEPROPERTY column. The diff command does not provide this, and the description of the not-yet-implemented check command seems to aim at a different use case.

An add --dry-run could provide the information where a WDTARGETPROPERTY would be added (plus warnings, where a different WDTARGETPROPERTY value is already attached, plus perhaps - with --verbose - messages where the actual property value is already in place).

In production use, the function would be very helpful to check input files in advance and avoid costly mistakes. I suppose it could be useful in the development of wdmapper, too, because it can be run on live data again and again.

Check source and target for duplicates before making changes

The limitation to 1-to-1 mappings (which makes sense in my eyes) requires that the source property value and the target property value are unique all over wikidata.

In modes which change wikidata, for each input record, the program should check the following conditions before executing actual changes:

  1. more than one item exists for the source property value: alter nothing, log "WARN: Duplicate source items Qn, Qm, ... for Psource {value}"
  2. the item identified by the source property already has a target property value, which differs from value of the input: alter nothing, log "WARN: Qn already has Ptarget {existing value}, not changed to {value}" (optionally enhance log message by annotation, if present in the input file)
  3. the item identified by the source property already has a target property value, which is identical to the value of the input: alter nothing, log "DEBUG: Qn Ptarget {value} already exists"

Cases 1) and 2) require human judgement. The log message should be formatted in a way which makes it easy to assess.
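The three conditions could be expressed as a single decision function, sketched below. Everything here is hypothetical: check_record is not a wdmapper function, and items stands for a lookup result that the caller would have to fetch from Wikidata. The "no item found" and "add" branches are not among the listed conditions; they are included only so that the function covers every input.

```python
def check_record(source_value, target_value, items):
    """Decide what to do with one input record.

    items: list of (qid, existing_target) pairs for all items carrying
    source_value; existing_target is None if no target value is set.
    Returns an (action, log message) tuple.
    """
    if len(items) > 1:  # condition 1: duplicate source items
        qids = ", ".join(qid for qid, _ in items)
        return "skip", "WARN: Duplicate source items %s for Psource %s" % (
            qids, source_value)
    if not items:  # not in the list above: nothing to attach the value to
        return "skip", "WARN: No item found for Psource %s" % source_value
    qid, existing = items[0]
    if existing is None:  # safe to add
        return "add", "INFO: %s Ptarget %s to be added" % (qid, target_value)
    if existing != target_value:  # condition 2: conflicting value
        return "skip", "WARN: %s already has Ptarget %s, not changed to %s" % (
            qid, existing, target_value)
    # condition 3: identical value already in place
    return "skip", "DEBUG: %s Ptarget %s already exists" % (qid, target_value)


print(check_record("124825109", "pxx999", [("Q1189225", "psn9")]))
print(check_record("124825109", "psn9", [("Q1189225", "psn9")]))
```

Returning the action separately from the message would let the same function drive both a real run and a dry run that only prints the messages.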

Fail when checking CSV input without property

When submitting a CSV input file, I get:

# ./wdmapper.py check -f csv -i tests/gnd_repec_test.csv
#FORMAT: BEACON

Traceback (most recent call last):
  File "./wdmapper.py", line 8, in <module>
    main()
  File "/opt/wdmapper/wdmapper/cli.py", line 113, in main
    run(*args)
  File "/opt/wdmapper/wdmapper/cli.py", line 98, in run
    wdmapper.wdmapper(command, **vars(args))
  File "/opt/wdmapper/wdmapper/wdmapper.py", line 254, in wdmapper
    for delta in deltas:
  File "/opt/wdmapper/wdmapper/wikidata.py", line 147, in get_deltas
    p_target=target.id,
AttributeError: 'NoneType' object has no attribute 'id'

Input was identical to that in #25 (comment)

Include references

References are not typical in authority file statements but they exist. See http://tinyurl.com/gvlk95y for a sample query (examples) and http://tinyurl.com/zy5clfv for an example aggregate. The most used reference is "imported from" but some other properties are also used.

When adding mappings to Wikidata, should a reference be added too?

Feature suggestion: command add-reverse

As I understand it, the add command adds mappings from the WDSOURCEPROPERTY to the WDTARGETPROPERTY. If both properties are inverse functional, it should be possible with an add-reverse command to look up the WDTARGETPROPERTY values and add the WDSOURCEPROPERTY value, if not already there.

That could also be done by reformatting the input file, but that would add an additional burden for the maintainer. On the other hand, adding properties in the reverse direction should not be default functionality of add, but should require a deliberate decision.
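For comparison, the manual reformatting mentioned above amounts to swapping the first two columns of the CSV mapping. A minimal sketch, assuming the "source, target, annotation" layout shown in the other issues; reverse_mapping is a hypothetical helper and not part of wdmapper.

```python
import csv
import io


def reverse_mapping(text):
    """Swap the source and target columns of a CSV mapping.

    The header row is kept as it is, because the first column is still
    the source of the reversed mapping. Annotations are carried over.
    """
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(next(reader))  # header
    for row in reader:
        if not row:
            continue  # skip empty lines
        writer.writerow([row[1], row[0]] + row[2:])
    return out.getvalue()


mapping = 'source, target, annotation\n114008787, pwy2, "Charles Wyplosz"\n'
print(reverse_mapping(mapping), end="")
```

The output is written without the blank after each comma, and the property arguments on the command line would have to be swapped as well.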

Trouble installing pywikibot?

Tested on Ubuntu 14.04 LTS:

$ pip install wdmapper

Could not find a version that satisfies the requirement pywikibot (from wdmapper) (from versions: 2.0rc5, 2.0rc5, 2.0rc3, 2.0rc4, 2.0rc1.post2, 2.0rc1.post1)

Solved via installing pywikibot like this:

$ pip install pywikibot==2.0rc5
$ pip install wdmapper

Encoding issue at STDOUT

When following the hints in #21, wdmapper worked fine when the output was redirected via > or -o.

However, with output to the console, a Unicode error occurs at the first non-ASCII character:

$ ./wdmapper.py check P227 -i gnds.csv
#FORMAT: BEACON
#NAME: GND ID
#DESCRIPTION: Mapping from Wikidata IDs to GND IDs
#PREFIX: http://www.wikidata.org/entity/
#TARGET: http://d-nb.info/gnd/

~ Q152835|Hans von Aachen|118643525
~ Q116268|Johannes Aal|118500015
~ Q26931|Evaristo Felice Dall'Abaco|104198273
~ Q518333|Jakob Abbadie|100002307
~ Q76359|Ernst Abbe|118646419
~ Q60582|Thomas Abbt|118500074
~ Q123557|Emil Abderhalden|118643576
~ Traceback (most recent call last):
  File "./wdmapper.py", line 8, in <module>
    main()
  File "/opt/wdmapper/wdmapper/cli.py", line 105, in main
    run(*args)
  File "/opt/wdmapper/wdmapper/cli.py", line 90, in run
    wdmapper.wdmapper(command, **vars(args))
  File "/opt/wdmapper/wdmapper/wdmapper.py", line 228, in wdmapper
    args.writer.write_delta(delta)
  File "/opt/wdmapper/wdmapper/writer.py", line 20, in write_delta
    self.write_link(link)
  File "/opt/wdmapper/wdmapper/format/beacon.py", line 68, in write_link
    self.print('|'.join(token))
  File "/opt/wdmapper/wdmapper/writer.py", line 11, in print
    print(s, file=self.stream)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 19: ordinal not in range(128)

May still be related to my particular environment, though the settings described in #8 (comment) are in place.
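One way to make console output independent of the locale is to write through a stream that always encodes as UTF-8. The sketch below is written for Python 3 and uses an in-memory buffer instead of the real console; utf8_stream is a hypothetical helper and the sample string is made up. On Python 2.7, as in the traceback, the usual counterpart would be codecs.getwriter("utf-8")(sys.stdout).

```python
import io


def utf8_stream(binary_stream):
    """Wrap a binary stream so that text is always encoded as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8", newline="\n")


# Demonstration on an in-memory buffer; for the console one would wrap
# sys.stdout.buffer instead (Python 3).
buffer = io.BytesIO()
out = utf8_stream(buffer)
print("caf\u00e9", file=out)  # contains the character from the error message
out.flush()
print(buffer.getvalue())
```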

How to discover WD items from a simple Beacon file?

I just tried to use a simple Beacon file, which contains plain GND ids (no second or third field), in order to discover for which of these ids Wikidata items exist, but could not figure out which command should do this.

Is that a use case which is supported already?

Cheers, Joachim

Installation issue

I had installed the version from this noon. The script was executable, but gave an error when I tried to apply it.

Now I removed the installed files (python setup.py install --record files.txt), and installed cc4c990 again from scratch. Now I don't even get the help info:

[root@nbt4 bin]# wdmapper -h
Traceback (most recent call last):
  File "/usr/bin/wdmapper", line 9, in <module>
    load_entry_point('wdmapper==0.0.0', 'console_scripts', 'wdmapper')()
TypeError: main() takes exactly 1 argument (0 given)

Python version is 2.7.5

Feature suggestion: Add properties to wikidata items directly

Perhaps the following feature is planned anyway, then just take it as a confirmation that there is interest :)

Sometimes, authority properties for wikidata can be derived from external sources. (E.g., the Repec-ShortID property can be extracted from infoboxes in the English Wikipedia via dbpedia.)

I've not found a program/bot which can process a file for adding these properties. The error checking logic is very similar to wdmapper, so it could make sense to extend the tool for this purpose, too.

Example input (in beacon format):

$ head -20 /opt/repec_ras/var/ras/latest/beacon/dbpedia_repec_wd.txt
#DESCRIPTION: RePEc-ShortID properties for wikidata from en.wikipedia via dbpedia
#CREATOR: ZBW - Leibniz Information Centre for Economics
#CONTACT: [email protected]
#HOMEPAGE: http://zbw.eu
#TIMESTAMP: 2016-12-22
#PREFIX: http://www.wikidata.org/entity/
#ANNOTATION: RePEC author name
#TARGET: https://authors.repec.org/pro/
#WDTARGETPROPERTY: P2428

Q353915|David D. Friedman|pfr16
Q312561|James Heckman|phe22
Q192592|Kenneth Arrow|par7
Q132489|Amartya Sen|pse23
Q107264|Robert Lucas|plu15
Q295647|Myron Scholes|psc29
Q219721|Robert Mundell|pmu18
Q157268|Robert Solow|pso18
Q295717|Vernon L. Smith|psm12
Q222541|George Akerlof|pak7

Avoid query timeout for large mapping sets

Very large result sets result in a query timeout, e.g.

wdmapper get P268 -o bnf.txt
wdmapper get P269 -o sudoc.txt
wdmapper get P646 -o freebase.txt

Maybe a simplified query mode could help.
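What a simplified query could look like is sketched below: only the item and the property value are requested, without labels or other joins, and the result can optionally be fetched in pages. This is an assumption about a possible approach, not wdmapper's actual query; simple_query is a hypothetical helper, and wdt: is the direct-property prefix of the Wikidata query service.

```python
def simple_query(prop, limit=None, offset=0):
    """Build a minimal SPARQL query for all values of one property."""
    query = "SELECT ?item ?value WHERE { ?item wdt:%s ?value }" % prop
    if limit is not None:
        # Paging; without ORDER BY the page boundaries are not guaranteed
        # to be stable between requests.
        query += " LIMIT %d OFFSET %d" % (limit, offset)
    return query


print(simple_query("P268"))
print(simple_query("P268", limit=10000, offset=20000))
```

Whether paging actually avoids the timeout would need to be tested against the live service.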
