gbv / wdmapper
Wikidata authority file mapping tool
Home Page: https://wdmapper.readthedocs.io/
License: MIT License
Not sure if "quicks" output format is supposed to survive - if not, the following is not relevant.
I would expect that quicks output gives me input statements for https://tools.wmflabs.org/wikidata-todo/quick_statements.php, that is, valid and executable lines according to the specification, perhaps intermingled with other information marked as comments.
The following does not look like commands, in that the lines do not start with an item identifier:
$ wdmapper check P227 P2428 -i /opt/repec-ras/var/ras/example2/map/gnd_ras_mapping.test.csv -t quicks
# skipping indirect link 124825109 -> Q? -> psn9
Christian von Hirschhausen (GND PRESENT BUT NOT LINKED) P227 "170947386"
Christian von Hirschhausen (GND PRESENT BUT NOT LINKED) P2428 "phi58"
-Q1082503 P227 "170947386"
Since the wmflabs tool requires TAB (\t) separators and input by copy & paste, it could make sense to mask the TAB delimiter, because tabs may get replaced by blanks when copied from a terminal window. In such cases I have used something printable (e.g. "___") and replaced it with actual TABs in an intermediate step.
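The intermediate step could be as small as the following sketch. The placeholder "___" is the one mentioned above; the function name unmask_tabs is made up for illustration and is not part of wdmapper.

```python
def unmask_tabs(text, placeholder="___"):
    """Replace the printable placeholder with real TAB characters."""
    return text.replace(placeholder, "\t")


# A masked statement line as it might be copied from a terminal window
masked = '-Q1082503___P227___"170947386"'
print(unmask_tabs(masked))
```

This only works if the placeholder never occurs inside an identifier or label, so a less common marker may be safer in practice.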
Tested on Ubuntu 14.04 LTS:
$ pip install wdmapper
Could not find a version that satisfies the requirement pywikibot (from wdmapper) (from versions: 2.0rc5, 2.0rc5, 2.0rc3, 2.0rc4, 2.0rc1.post2, 2.0rc1.post1)
Solved via installing pywikibot like this:
$ pip install pywikibot==2.0rc5
$ pip install wdmapper
I just tried to use a simple Beacon file which contains plain GND ids (no second or third field) in order to discover for which of these ids Wikidata items exist, but could not figure out which command should do this.
Is that a use case which is supported already?
Cheers, Joachim
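For illustration, a rough sketch of how source-only BEACON lines could be read. This is a simplified reading of the format and not the wdmapper parser; the function names are hypothetical, and the rule that a URL-like second token counts as a target is an assumption based on the BEACON draft.

```python
def parse_beacon_link(line):
    """Split a BEACON link line into (source, annotation, target).

    Missing fields are returned as None.
    """
    tokens = [t.strip() for t in line.rstrip("\n").split("|")]
    source = tokens[0] or None
    annotation = target = None
    if len(tokens) == 2:
        # Assumption: a second token that looks like a URL is the target
        if tokens[1].startswith(("http:", "https:")):
            target = tokens[1]
        else:
            annotation = tokens[1] or None
    elif len(tokens) >= 3:
        annotation = tokens[1] or None
        target = tokens[2] or None
    return source, annotation, target


def read_sources(lines):
    """Yield source ids from link lines, skipping meta lines and blanks."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        source, _, _ = parse_beacon_link(line)
        if source:
            yield source
```

The resulting list of source ids would then still have to be looked up in Wikidata, which is the part the question above is about.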
References are not typical in authority file statements but they exist. See http://tinyurl.com/gvlk95y for a sample query (examples) and http://tinyurl.com/zy5clfv for an example aggregate. The most used reference is "imported from", but some other properties are also used.
When adding mappings to Wikidata, should a reference be added too?
See subclasses of https://www.wikidata.org/wiki/Q21502402
Not sure if "quicks" output format is supposed to survive - if not, the following is not relevant.
The following command gives a TypeError:
$ wdmapper check P227 P2428 -i /opt/repec-ras/var/ras/example2/map/gnd_ras_mapping.test.csv -t quicks
# skipping indirect link 124825109 -> Q? -> psn9
Christian von Hirschhausen (GND PRESENT BUT NOT LINKED) P227 "170947386"
Christian von Hirschhausen (GND PRESENT BUT NOT LINKED) P2428 "phi58"
-Q1082503 P227 "170947386"
Traceback (most recent call last):
File "/usr/bin/wdmapper", line 11, in <module>
sys.exit(main())
File "/usr/lib/python2.7/site-packages/wdmapper/cli.py", line 113, in main
run(*args)
File "/usr/lib/python2.7/site-packages/wdmapper/cli.py", line 98, in run
wdmapper.wdmapper(command, **vars(args))
File "/usr/lib/python2.7/site-packages/wdmapper/wdmapper.py", line 255, in wdmapper
args.writer.write_delta(delta)
File "/usr/lib/python2.7/site-packages/wdmapper/format/quicks.py", line 43, in write_delta
self.edit_link(link,'-')
File "/usr/lib/python2.7/site-packages/wdmapper/format/quicks.py", line 23, in edit_link
self.write_command((qid, prop, '"' + link.target + '"'))
TypeError: coercing to Unicode: need string or buffer, NoneType found
LC_ALL and LANG are set according to #22.
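The traceback shows a string concatenation with a target that is None. A minimal sketch of a guard against that case follows; the function format_quicks_command is hypothetical and does not reproduce the code in wdmapper/format/quicks.py.

```python
def format_quicks_command(qid, prop, value):
    """Return a TAB-separated statement line, or None if a part is missing."""
    if qid is None or prop is None or value is None:
        # Nothing sensible can be written without all three parts
        return None
    return "\t".join((qid, prop, '"' + value + '"'))


print(format_quicks_command("Q1082503", "P227", "170947386"))
print(format_quicks_command("Q1082503", "P2428", None))
```

Whether a link without a target should be skipped silently, reported as a comment line, or treated as an error is a design decision this sketch leaves open.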
See http://ericholscher.com/blog/2016/jul/1/sphinx-and-rtd-for-writers/ for an introduction.
With a mapping file created externally, I get no output from the wdmapper diff command:
$ wdmapper diff P227 P2428 < /opt/repec_ras/var/pers_examples/example2/map/gnd_ras_mapping.example.csv
$ head /opt/repec_ras/var/pers_examples/example2/map/gnd_ras_mapping.example.csv
source, target, annotation
171255054, phe112, "anthony giles heyes"
114008787, pwy2, "Charles Wyplosz"
137272472, pli522, "Stephen Littlechild"
171244133, pka87, "Sandeep Kapur"
120931761, pba2, "Söhnke M. Bartram"
171128656, pde16, "Rommert Dekker"
171110358, par215, "Anil Arya"
160912946, pqu66, "Agnes Reynes Quisumbing"
171827759, phy7, "Ari Hyytinen"
The complete input file is available on github
I've checked this on the current master and dev branch.
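One detail of the file above is that every comma is followed by a blank. Whether this is related to the empty output is not established; the sketch below only shows how such a file can be read with Python's csv module so that the blanks do not end up in the field names and values. The function name read_mappings is hypothetical.

```python
import csv
import io

sample = u'''source, target, annotation
171255054, phe112, "anthony giles heyes"
114008787, pwy2, "Charles Wyplosz"
'''


def read_mappings(stream):
    """Read mapping rows from CSV, tolerating blanks after the commas."""
    reader = csv.DictReader(stream, skipinitialspace=True)
    return [(row["source"], row["target"], row["annotation"]) for row in reader]


rows = read_mappings(io.StringIO(sample))
print(rows[0])
```

Without skipinitialspace=True the second header field would be read as " target" (with a leading blank), and the quoted annotations would keep their quote characters.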
With the following input
source, target, annotation
124825109, psn9, "Dennis Snower (GND PRESENT AND LINKED)"
170947386, phi58, "Christian von Hirschhausen (GND PRESENT BUT NOT LINKED)"
114008787, pwy2, "Charles Wyplosz (GND NOT PRESENT)"
124825109, pxx999, "Dennis SNOWER (GND PRESENT BUT LINKED TO DIFFERENT TARGET)"
123292182, psn9, "Leo H. Klaassen (GND PRESENT AND NOT LINKED BUT LINK TARGET EXISTS ALREADY)"
the current version (0.0.7) gives the output
$ wdmapper check P227 P2428 -i tests/gnd_repec_test.csv
#FORMAT: BEACON
#NAME: RePEc Short-ID
#DESCRIPTION: Mapping from GND IDs to RePEc Short-IDs
#PREFIX: http://d-nb.info/gnd/
#TARGET: https://authors.repec.org/pro/
~ 124825109|Q1189225|psn9
+ 170947386|Christian von Hirschhausen (GND PRESENT BUT NOT LINKED)|phi58
- 170947386|Q1082503
+ 114008787|Charles Wyplosz (GND NOT PRESENT)|pwy2
+ 124825109|Dennis SNOWER (GND PRESENT BUT LINKED TO DIFFERENT TARGET)|pxx999
- 124825109|Q1189225|psn9
+ 123292182|Leo H. Klaassen (GND PRESENT AND NOT LINKED BUT LINK TARGET EXISTS ALREADY)|psn9
- 124825109|Q1189225|psn9
- 123292182|Q18817490
Try wdmapper -d P1331 and see https://www.wikidata.org/wiki/Property:P1331 for an example. P1630 can be used with several qualifiers, such as "language of work or name" and "applies to territorial jurisdiction". At the least, the wdmapper error message should be improved in these cases.
I had installed the version from this noon. The script was executable, but gave an error when I tried to apply it.
Now I removed the installed files (python setup.py install --record files.txt), and installed cc4c990 again from scratch. Now I don't even get the help info:
[root@nbt4 bin]# wdmapper -h
Traceback (most recent call last):
File "/usr/bin/wdmapper", line 9, in <module>
load_entry_point('wdmapper==0.0.0', 'console_scripts', 'wdmapper')()
TypeError: main() takes exactly 1 argument (0 given)
Python version is 2.7.5
I've downloaded http://www.historische-kommission-muenchen-editionen.de/beacon_ndb.txt and wanted to know how many of the persons referenced there by GND ID are present in Wikidata. I tried
wdmapper check P227 -i beacon_ndb.txt
but only got the message "missing source in line 2".
See http://tinyurl.com/hjhgnqh and similar queries. Qualifiers are not common in authority file statements but they exist, so they should at least be shown.
At https://wdmapper.readthedocs.io/en/latest/mappings.html a general introduction with examples is needed to illustrate the idea behind authority files and mappings.
As I understand it, the add command adds mappings from the WDSOURCEPROPERTY to the WDTARGETPROPERTY. If both properties are inverse functional, it should be possible with an add-reverse command to look up the WDTARGETPROPERTY values and add the WDSOURCEPROPERTY value, if not already there.
That could also be done by reformatting the input file, but that would put an additional burden on the maintainer. On the other hand, adding properties in the reverse direction should not be default functionality of add, but should require a deliberate decision.
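The reformatting of the input file amounts to swapping the first two columns. A minimal sketch, assuming rows are (source, target, annotation) tuples; the function name reverse_mappings is made up and is not a wdmapper feature.

```python
def reverse_mappings(rows):
    """Swap source and target in (source, target, annotation) rows."""
    return [(target, source, annotation) for source, target, annotation in rows]


rows = [("171255054", "phe112", "anthony giles heyes")]
print(reverse_mappings(rows))
```

The swapped file would also need the two property arguments exchanged on the command line, which is exactly the extra bookkeeping an add-reverse command would take off the maintainer.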
Very large result sets result in a query timeout, e.g.
wdmapper get P268 -o bnf.txt
wdmapper get P269 -o sudoc.txt
wdmapper get P646 -o freebase.txt
Maybe a simplified query mode could help.
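What a simplified query mode would look like is not specified here. One guess is a query that fetches only item and value, without labels or other extras. The function below is a hypothetical sketch that builds such a query string for the Wikidata SPARQL endpoint; whether it avoids the timeout for P268, P269 or P646 is untested.

```python
def simple_property_query(prop, limit=None):
    """Build a minimal SPARQL query for all values of one property."""
    query = "SELECT ?item ?value WHERE { ?item wdt:%s ?value }" % prop
    if limit is not None:
        query += " LIMIT %d" % limit
    return query


print(simple_property_query("P268"))
print(simple_property_query("P268", limit=10))
```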
When submitting a CSV input file, I get:
# ./wdmapper.py check -f csv -i tests/gnd_repec_test.csv
#FORMAT: BEACON
Traceback (most recent call last):
File "./wdmapper.py", line 8, in <module>
main()
File "/opt/wdmapper/wdmapper/cli.py", line 113, in main
run(*args)
File "/opt/wdmapper/wdmapper/cli.py", line 98, in run
wdmapper.wdmapper(command, **vars(args))
File "/opt/wdmapper/wdmapper/wdmapper.py", line 254, in wdmapper
for delta in deltas:
File "/opt/wdmapper/wdmapper/wikidata.py", line 147, in get_deltas
p_target=target.id,
AttributeError: 'NoneType' object has no attribute 'id'
Input was identical to that in #25 (comment)
When I have an externally crafted mapping file, I currently cannot see a way to evaluate which of these mappings would "work" for Wikidata, meaning: have a matching property value for the WDSOURCEPROPERTY column. The diff command does not provide this, and the description of the not-yet-implemented check command seems to aim at a different use case.
An add --dry-run could provide the information where a WDTARGETPROPERTY would be added (plus warnings where a different WDTARGETPROPERTY value is already attached, plus perhaps, with --verbose, messages where the actual property value is already in place).
In production use, the function would be very helpful to check input files in advance and avoid costly mistakes. I suppose it could be useful in the development of wdmapper, too, because it can be run on live data again and again.
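The cases such a dry run would have to distinguish can be sketched as follows. The dictionary stands in for the current Wikidata state and uses identifiers from the test input quoted elsewhere in this thread; the function name and the status strings are made up for illustration.

```python
def classify_mapping(source, target, existing):
    """Classify one mapping against the known state.

    `existing` maps a source value to the target value currently attached
    to the same item, or to None if the item has no target value yet.
    """
    if source not in existing:
        return "source not found"
    current = existing[source]
    if current is None:
        return "would add"
    if current == target:
        return "already present"
    return "conflict"


# Toy stand-in for a lookup in Wikidata
existing = {"124825109": "psn9", "170947386": None}
for source, target in [("124825109", "psn9"), ("170947386", "phi58"),
                       ("114008787", "pwy2"), ("124825109", "pxx999")]:
    print(source, target, classify_mapping(source, target, existing))
```

A real implementation would also have to check whether the target value is already attached to a different item, which this sketch ignores.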
When following the hints in #21, wdmapper worked fine when the output was redirected via > or -o.
However, with output to the console, a Unicode error occurs at the first non-ASCII character:
$ ./wdmapper.py check P227 -i gnds.csv
#FORMAT: BEACON
#NAME: GND ID
#DESCRIPTION: Mapping from Wikidata IDs to GND IDs
#PREFIX: http://www.wikidata.org/entity/
#TARGET: http://d-nb.info/gnd/
~ Q152835|Hans von Aachen|118643525
~ Q116268|Johannes Aal|118500015
~ Q26931|Evaristo Felice Dall'Abaco|104198273
~ Q518333|Jakob Abbadie|100002307
~ Q76359|Ernst Abbe|118646419
~ Q60582|Thomas Abbt|118500074
~ Q123557|Emil Abderhalden|118643576
~ Traceback (most recent call last):
File "./wdmapper.py", line 8, in <module>
main()
File "/opt/wdmapper/wdmapper/cli.py", line 105, in main
run(*args)
File "/opt/wdmapper/wdmapper/cli.py", line 90, in run
wdmapper.wdmapper(command, **vars(args))
File "/opt/wdmapper/wdmapper/wdmapper.py", line 228, in wdmapper
args.writer.write_delta(delta)
File "/opt/wdmapper/wdmapper/writer.py", line 20, in write_delta
self.write_link(link)
File "/opt/wdmapper/wdmapper/format/beacon.py", line 68, in write_link
self.print('|'.join(token))
File "/opt/wdmapper/wdmapper/writer.py", line 11, in print
print(s, file=self.stream)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 19: ordinal not in range(128)
May still be related to my particular environment, though the settings as described in #8 (comment) are in place.
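Independent of the locale settings, one common way to avoid the ascii codec is to wrap the byte stream in an explicit UTF-8 writer. The sketch below demonstrates this with an in-memory buffer; the function name utf8_writer is hypothetical and this is not necessarily how wdmapper handles its output stream.

```python
import codecs
import io


def utf8_writer(binary_stream):
    """Wrap a byte stream so that unicode text is encoded as UTF-8."""
    return codecs.getwriter("utf-8")(binary_stream)


buf = io.BytesIO()
out = utf8_writer(buf)
out.write(u"caf\xe9\n")  # contains the character from the error message
```

On Python 2 the same wrapper could be applied to sys.stdout, on Python 3 to sys.stdout.buffer.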
Perhaps the following feature is planned anyway, then just take it as a confirmation that there is interest :)
Sometimes, authority properties for Wikidata can be derived from external sources. (E.g., the RePEc Short-ID property can be extracted from infoboxes in the English Wikipedia via DBpedia.)
I've not found a program/bot which can process a file for adding these properties. The error checking logic is very similar to wdmapper, so it could make sense to extend the tool for this purpose, too.
Example input (in beacon format):
$ head -20 /opt/repec_ras/var/ras/latest/beacon/dbpedia_repec_wd.txt
#DESCRIPTION: RePEc-ShortID properties for wikidata from en.wikipedia via dbpedia
#CREATOR: ZBW - Leibniz Information Centre for Economics
#CONTACT: [email protected]
#HOMEPAGE: http://zbw.eu
#TIMESTAMP: 2016-12-22
#PREFIX: http://www.wikidata.org/entity/
#ANNOTATION: RePEC author name
#TARGET: https://authors.repec.org/pro/
#WDTARGETPROPERTY: P2428
Q353915|David D. Friedman|pfr16
Q312561|James Heckman|phe22
Q192592|Kenneth Arrow|par7
Q132489|Amartya Sen|pse23
Q107264|Robert Lucas|plu15
Q295647|Myron Scholes|psc29
Q219721|Robert Mundell|pmu18
Q157268|Robert Solow|pso18
Q295717|Vernon L. Smith|psm12
Q222541|George Akerlof|pak7
Some properties of type string or item can also be used for mapping.
The example command wdmapper get P214 P2428 --limit 10 results in
source, target, annotation
sourceId, targetId, Q18430
sourceId, targetId, Q100689
sourceId, targetId, Q131112
sourceId, targetId, Q173994
sourceId, targetId, Q191020
sourceId, targetId, Q434509
sourceId, targetId, Q502557
sourceId, targetId, Q925174
sourceId, targetId, Q1153459
sourceId, targetId, Q1189225
Python 2.7.5 and d9e46a0
$ pip install wdmapper --user
$ wdmapper get P2428
The limitation to 1-to-1 mappings (which makes sense in my eyes) requires that the source property value and the target property value are unique all over Wikidata.
In modes which change Wikidata, for each input record, the program should check the following conditions before executing actual changes:
Cases 1) and 2) require human judgement. The log message should be formatted in a way which makes it easy to assess.
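Uniqueness within the input file itself can be checked before anything is looked up in Wikidata. A minimal sketch, using pairs from the test input quoted elsewhere in this thread; the function name is hypothetical, and uniqueness across Wikidata would still require queries that are not shown here.

```python
from collections import Counter


def non_unique_values(rows):
    """Return source and target values that occur in more than one row."""
    sources = Counter(source for source, _ in rows)
    targets = Counter(target for _, target in rows)
    dup_sources = sorted(v for v, n in sources.items() if n > 1)
    dup_targets = sorted(v for v, n in targets.items() if n > 1)
    return dup_sources, dup_targets


rows = [("124825109", "psn9"), ("170947386", "phi58"), ("114008787", "pwy2"),
        ("124825109", "pxx999"), ("123292182", "psn9")]
print(non_unique_values(rows))
```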
When following the installation instruction, I get
# pip install wdmapper
Requirement already satisfied: wdmapper in /usr/lib/python2.7/site-packages/wdmapper-0.0.0-py2.7.egg
Requirement already satisfied: pywikibot in /usr/lib/python2.7/site-packages/pywikibot-2.0rc5-py2.7.egg (from wdmapper)
Requirement already satisfied: httplib2>=0.9 in /usr/lib/python2.7/site-packages/httplib2-0.9.2-py2.7.egg (from pywikibot->wdmapper)
Requirement already satisfied: ipaddress in /usr/lib/python2.7/site-packages/ipaddress-1.0.17-py2.7.egg (from pywikibot->wdmapper)
# wdmapper
-bash: /usr/bin/wdmapper: No such file or directory
# which wdmapper
/usr/bin/which: no wdmapper in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin)
With BEACON input format, the source and target arguments do not need to be specified on the command line.
I have noticed that in v0.0.9 the result header of a check command may include lines such as:
#DESCRIPTION: Mapping from GND IDs to RePEc Short-IDs
#PREFIX: http://d-nb.info/gnd/
#TARGET: https://authors.repec.org/pro/
#SOURCESET: http://www.wikidata.org/entity/Q36578
#TARGETSET: http://www.wikidata.org/entity/Q206316
I am not sure if that's a good idea, because these may not be clearly defined.
Targetset http://www.wikidata.org/entity/Q206316 designates "Research Papers in Economics" as a bibliographic database, whereas P2428 has humans as its intended range (here authors identified in RePEc's Author Service). Furthermore, http://www.wikidata.org/entity/Q206316 lists as "Wikidata property" (http://www.wikidata.org/entity/P1687) "Research Papers in Economics Series handle" (http://www.wikidata.org/entity/P2761).
P1687 and P1629 are intended to be inverse properties, but obviously that is not enforced. The data may be messy here, and more opportunities to mess it up lurk just around the corner (RePEc institutions EDIRC, e.g.).
On the source and target side, only an explicit subset (RAS) or an implicit subset (former PND part of GND) may constitute the set addressed in the mapping. If there is no explicit requirement to return the sets in the beacon header, it would perhaps be better to drop them instead of giving misleading hints.