

wikibase-dump-filter

Filter and format a newline-delimited JSON stream of Wikibase entities.

Typically useful to create a formatted subset of a Wikibase JSON dump.

Some context: this tool was formerly known as wikidata-filter. Wikidata is an instance of Wikibase. This tool was primarily designed with Wikidata in mind, but should be usable for any Wikibase instance.

This project received a Wikimedia Project Grant.


Install

This tool requires NodeJS to be installed.

# Install globally
npm install -g wikibase-dump-filter
# Or install just to be used in the scripts of the current project
npm install wikibase-dump-filter

Changelog

See CHANGELOG.md for version info

Download dump

Wikidata dumps

Wikidata provides a bunch of database dumps, among which the desired JSON dump. As a Wikidata dump is a very large file (April 2020: 75 GB compressed), it is recommended to download that file first before doing any operation on it, so that if anything crashes you don't have to restart the download from zero (the download time usually being the bottleneck).

wget --continue https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz
cat latest-all.json.gz | gzip -d | wikibase-dump-filter --claim P31:Q5 > humans.ndjson

Your own Wikibase instance dump

You can generate a JSON dump using the script dumpJson.php. If you are running Wikibase with wikibase-docker, you could use the following command:

cd wikibase-docker
docker-compose exec wikibase /bin/sh -c "php ./extensions/Wikibase/repo/maintenance/dumpJson.php --log /dev/null" > dump.json
cat dump.json | wikibase-dump-filter --claim P1:Q1 > entities_with_claim_P1_Q1.ndjson

How-to

This package can be used both as a command-line tool (CLI) and as a NodeJS module. Those two uses have their own documentation pages, but the options stay the same and are documented in the CLI section.
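As a hedged example of a typical CLI invocation (options as referenced elsewhere in this README and its issues; adjust to your needs):

# Keep humans, keep only English and Spanish terms, and drop sitelinks
cat latest-all.json.gz | gzip -d \
  | wikibase-dump-filter --claim P31:Q5 --languages en,es --omit sitelinks \
  > humans.ndjson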


You may also like


Do you know Inventaire? It's a web app to share books with your friends, built on top of Wikidata! And it's libre software too.

License

MIT


wikibase-dump-filter's Issues

Troubles with old wikidump

Hi! I'm working with old Wikidata dumps. The dump was downloaded from the official source.
I'm trying to filter my unpacked dump with different parameters and I get the following error.

daniillevchenko@danil ~/w/P/RaiseWikibase (main) [SIGPIPE|1]> cat entities.json | wikibase-dump-filter --claim 'P31:Q5' > another_my_test.json
    parsed | total average parse time | recent average parse time |       kept | % of total |   last kept | last kept time | elapsed time
         0 |                      0ms |                       0ms |          0 |         0% |             |              0 |     00:00:00
/usr/lib/node_modules/wikibase-dump-filter/lib/valid_claims.js:20
  let propClaims = claims[P]
                         ^

TypeError: Cannot read properties of undefined (reading 'P31')
    at /usr/lib/node_modules/wikibase-dump-filter/lib/valid_claims.js:20:26
    at arraySome (/usr/lib/node_modules/wikibase-dump-filter/node_modules/lodash.some/index.js:140:9)
    at some (/usr/lib/node_modules/wikibase-dump-filter/node_modules/lodash.some/index.js:1838:10)
    at /usr/lib/node_modules/wikibase-dump-filter/lib/valid_claims.js:14:10
    at arrayEvery (/usr/lib/node_modules/wikibase-dump-filter/node_modules/lodash.every/index.js:140:10)
    at every (/usr/lib/node_modules/wikibase-dump-filter/node_modules/lodash.every/index.js:1865:10)
    at module.exports (/usr/lib/node_modules/wikibase-dump-filter/lib/valid_claims.js:10:10)
    at /usr/lib/node_modules/wikibase-dump-filter/lib/filter_entity.js:13:10
    at /usr/lib/node_modules/wikibase-dump-filter/lib/filter_format_and_serialize_entity.js:16:9
    at Stream.<anonymous> (/usr/lib/node_modules/wikibase-dump-filter/lib/stream_utils.js:24:22)

Node.js v17.3.0

I tried this yesterday with the latest Wikidata dump and it worked fine, but with the old dump I have this problem that I cannot fix.
Please give me a direction to think in: is the trouble with the old dump or with my usage?

Thanks in advance for your reply!

Filter formatting

Is it possible to format the following query into a valid filter somehow?

SELECT ?work ?workLabel
WHERE
{
  ?work wdt:P31/wdt:P279* wd:Q43229. # instance of any subclass of organization
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}

I want to get all entities that are instances of a subclass (of a subclass, etc.) of organizations.
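The claim filter itself does not follow P279 paths, but one hedged two-step approach (a sketch, assuming curl and jq are available, and noting that a long value list runs into the argument-length and performance issues reported further down) is to resolve the subclass tree with a SPARQL query first, then pass the resulting Q ids as claim values:

# 1. Fetch Q43229 and all its (transitive) subclasses from the Wikidata Query Service
curl -sG 'https://query.wikidata.org/sparql' \
  --header 'Accept: application/sparql-results+json' \
  --data-urlencode 'query=SELECT ?class WHERE { ?class wdt:P279* wd:Q43229 }' \
  | jq -r '.results.bindings[].class.value | split("/") | last' \
  | paste -sd, - > class_ids.txt
# 2. Keep entities that are an instance of any of those classes
cat latest-all.json.gz | gzip -d \
  | wikibase-dump-filter --claim "P31:$(cat class_ids.txt)" > organizations.ndjson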

Filter by multiple claims

Hi,

I think it would be nice if we could filter by multiple claims at once (either their disjunction or conjunction). This could look something like the & and | syntax the sitelink option uses:

P42:Q8,Q13&~P666, P45:Q10|P32:Q10, etc.

Support for multicore systems

I use the filter in 2 ramdisks, each around 100 GB large, to speed up processing. Still, my 32-core machine idles at around 5% and will take 12-16 hours to filter all entries (0.5 ms average time).

As I don't know NodeJS well, I'm not sure I can add multi-threading to this myself, but Node.js can definitely spawn child processes - is there an easy 2-3 line addition to spawn more workers? See https://nodejs.org/docs/latest/api/cluster.html

I'm using server boards, but I guess lots of people doing this will be on a Ryzen system or similar.

Multicore unpacking of the archive is doable with 'pbzip2 -d -c /mnt/ramdisk/latest-all.json.bz2 | wikibase-dump-filter', which shows node at exactly 100% and the unzipping at ~110%, so node is still the bottleneck. This halves the average to 0.25 ms for me, but with "just" 64 GB RAM on, say, a rented hosted machine with lots of cores, you could get the filter time down to under 30 minutes with multicore processing, greatly reducing costs for weekly updates.

Thanks for your great work, really sparing me days of processing,
R
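A hedged sketch of one way to use more cores, assuming GNU parallel is installed and that each worker tolerates the dump's wrapper lines and trailing commas in its chunk (which the filter needs to handle for the official dump anyway):

# Decompress on several cores with pbzip2 and spread the filtering over
# multiple wikibase-dump-filter processes with GNU parallel.
# --pipe splits the stream on line boundaries, so each worker sees whole entity lines.
pbzip2 -d -c /mnt/ramdisk/latest-all.json.bz2 \
  | parallel --pipe --block 50M 'wikibase-dump-filter --claim P31:Q5' \
  > humans.ndjson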

Keep selected languages only

Keeping all languages of labels, descriptions, and aliases takes a lot of space. Maybe you are only interested, for instance, in English and Spanish:

wikidata-filter --keep-language en,es

There is probably less use for omit-language, but who knows?

Make --claim optional

The script fails if no claim is given. It should be possible to filter an already filtered dump, for instance with --keep only.
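A hedged example of that second-pass use case, assuming --keep accepts a comma-separated list of top-level attributes:

# Second pass over an already filtered subset: keep only labels and claims
cat humans.ndjson | wikibase-dump-filter --keep labels,claims > humans_light.ndjson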

Unexpected token error.

cat latest-all.json | wikidata-filter --claim P31:Q5 > humans.ndjson

/usr/local/lib/node_modules/wikidata-filter/lib/entity_parser.js:8
.pipe(filter(line => {
^
SyntaxError: Unexpected token >
at Module._compile (module.js:439:25)

"not"-operator doesn't seem to be working properly

I tried to get all humans without a death date and used this claim: --claim 'P31:Q5|~P570'

It seems to be writing all entries from the overall Wikidata dump to the file and not just the requested ones.
I also tried the claim from the How-To examples but I got the same result. Is there something wrong with the claim I'm using?
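For what it's worth, if | is read as a disjunction, 'P31:Q5|~P570' keeps every entity that is either a human or has no death date, which matches almost the whole dump; the intersection would presumably be written with & (a hedged sketch, assuming the conjunction syntax proposed in the multiple-claims issue above is supported):

# Hypothetical: humans (P31:Q5) AND no death date (~P570)
cat latest-all.json.gz | gzip -d \
  | wikibase-dump-filter --claim 'P31:Q5&~P570' > living_humans.ndjson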

filtering items with deprecated claims

Applying the command bzcat latest-all.json.bz2 |wikibase-dump-filter --simplify --claim 'P698' |jq '[.id,.claims.P698,.claims.P921]' -c >PMID.ndjson results in >30M lines like this:

["Q94880466",["19484558"],null]
["Q17485067",["21609473"],["Q18123741","Q12156","Q193430"]]

where the first case is an item with P698 claim but without P921 claims, and the second has P698 and P921 claims. However out of these 30M there are at least six (6) that are different:
ralf@ark:~/wikidata> grep '[]' PMID.ndjson

["Q30573040",["23057853"],[]]
["Q30523792",["22888462"],[]]
["Q48835971",[],null]
["Q50125628",[],null]
["Q58616403",[],null]
["Q31128925",["27613570"],[]]

Note that 3 don't have P698 (which should not happen given the filter), and 3 have [] instead of null for no P921.

I'm not claiming there is a bug in wikibase-dump-filter, just that this needs investigating, and the ticket is a start. But maybe you have seen this and have an immediate explanation?

certain entities aren't filtered for unknown reason

I tried this claim: --claim 'P31:Q515,Q7930989,Q15284' to get all cities and municipalities from around the world. I found that "Frankfurt am Main" isn't in the resulting file, even though it is an instance of "city" (Q515). "Frankfurt am Main" also has other items in the "instance of" property, but they shouldn't affect the outcome, right? Similar entities like "Munich", which also have multiple items in that property next to the "city" item, are in the resulting file.
I noticed that the filter shows this after finishing: in: 1736 | total: 9762074 | last entity in: Q84908318. If I understand it correctly, this means 1736 entities have been filtered from 9.7 million. However, the resulting file has over 14,000 lines, each of which is an entity, right? How does this fit together?

filter all entities that are instances of a subclass of an item

Hey there :)
So I'd like to get all cities and municipalities of the world. But for example some only have "municipality of [country]". The "municipality of [...]" item is a subclass of the "municipality" item. Would it be possible to only filter by the overall item and still get all entities that are instances of the different subclasses?

Filter purely by Q-Number

Hello,

I have a large list of Wikidata ids (Q numbers) and I'd like to filter out just these entities. Does this already exist / is this possible to implement?

Thank you!

Filter by Language Excludes Many Entities

After a little debugging of the results of filtering by en, I found that the filter discards an entity if en is not found in the labels, even though en is present in the aliases, and vice versa.

I think entities should only be removed if the language is found in neither the labels nor the aliases.

What do you think?

Differing item numbers, issue with tool?

Hello!

I'm working on macOS 12.2; the script is run on a server at our research institute (in the Su Lab). I've been using WDF and getting different results from @andrawaag (https://github.com/andrawaag/wdsub) and @seyedahbr (https://github.com/seyedahbr/wdumper), and we're not entirely sure why this is the case.

We are using the same Wikidata dump (Jan 3 2022), which takes between 5 and 7 hours to download. Here is the script I then run using WDF, with and without a prefilter.

The output is the same in both cases (with or without the prefilter), so we're wondering if it may be something with the tool, or perhaps you can advise?

One example: we expect to see ~1.2 million item pages for compounds. This is true for both Andra and Seyed using the same identifiers I have (P31:Q11173). However, I get ~18,000, not 1.2 million.

Thoughts on why this is the case?
Many thanks!

Simplified requires arguments?

Can't use simplify with default options (no arguments) as specified in the examples on the "How-to" page. I get the following error:

error: option '-s, --simplify <boolean|options>' argument missing

long claim filter makes the process extremely slow

@hsaif comment:

Something I noticed is that passing 25K claims to the filter renders the process extremely slow.
For example, my process has been running for 19 hours now and has only managed to process ~8M entities!

I didn't have that kind of extreme use case in mind when I wrote the claim filter, so for every entity claim, it iterates over the 25K Q ids to determine if there is a match, which with that amount of ids is crazy inefficient.

Replacing the array lookup with a hash should already be a huge boost, and then we could also play with some multi-process load balancing.
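A minimal JavaScript sketch of the suggested optimization (illustrative only, not the project's actual code): build the value set once and test membership in constant time instead of scanning the 25K-element array for every claim.

// Illustrative sketch: replace repeated array scans with a Set lookup.
const qIdsFromCliArgument = ['Q5', 'Q95074'] // hypothetical: in practice the ~25K Q ids parsed once from --claim
const targetValues = new Set(qIdsFromCliArgument)

const hasMatchingClaim = (claims = {}, property) => {
  const propClaims = claims[property] || []
  return propClaims.some(claim => {
    const datavalue = claim.mainsnak && claim.mainsnak.datavalue
    const valueId = datavalue && datavalue.value && datavalue.value.id
    return targetValues.has(valueId)
  })
}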

Add entity stream method: grep

Like the filter method introduced with cca67d8, to support writing

parser(process.stdin)
.grep( item => item.claims && item.claims.P356 )

instead of

parser(process.stdin)
.filter( item => (item.claims && item.claims.P356) ? item : null )

Filter by non-existing claim

All entities that are not instance of something (also requires #2):

wikidata-filter --claim ~P31

Better use ~ instead of ! because the latter requires escaping, at least in bash. ~ is only expanded to the home directory if given as part of a directory specification but not if given as above.

Filter by type

Only interested in items? Option --type item should help.
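For example, a minimal invocation (assuming the option takes the entity type as its value) could be:

cat dump.json | wikibase-dump-filter --type item > items.ndjson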

Error when using both language filter and omitting sitelinks

Hi! Found a bug when trying to use both the --languages filter and omitting 'sitelinks'.

  • wikibase-dump-filter version: 5
  • NodeJS version: v14.3.0
  • OS version: MacOS 10.15

Reproduction:
cat latest-all.json.bz2 | bzcat | wikibase-dump-filter --languages en,de --omit sitelinks > all.ndjson

Error:
/usr/local/lib/node_modules/wikibase-dump-filter/lib/keep_matching_sitelinks.js:6
Object.keys(sitelinks).forEach(sitelinkName => {
^

TypeError: Cannot convert undefined or null to object
at Function.keys (<anonymous>)
at module.exports (/usr/local/lib/node_modules/wikibase-dump-filter/lib/keep_matching_sitelinks.js:6:10)
at /usr/local/lib/node_modules/wikibase-dump-filter/lib/format_entity.js:21:26
at /usr/local/lib/node_modules/wikibase-dump-filter/lib/filter_format_and_serialize_entity.js:18:30
at Stream.<anonymous> (/usr/local/lib/node_modules/wikibase-dump-filter/lib/stream_utils.js:24:22)
at Stream.stream.write (/usr/local/lib/node_modules/wikibase-dump-filter/node_modules/through/index.js:26:11)
at Stream.ondata (internal/streams/legacy.js:19:31)
at Stream.emit (events.js:315:20)
at drain (/usr/local/lib/node_modules/wikibase-dump-filter/node_modules/through/index.js:36:16)
at Stream.stream.queue.stream.push (/usr/local/lib/node_modules/wikibase-dump-filter/node_modules/through/index.js:45:5)

The error is because we try to filter the sitelinks by language even if sitelinks field has been omitted by the user.
I have raised a PR for the fix here: #29
Great library, btw :)

Claims should have multiple possible values

In addition to just one value (P31:Q5) and any value (#2), it should be possible to filter by a set of possible values, e.g. to keep only entities of humans and fictional characters: wikidata-filter -c P31:Q5,Q95074

progress bar

Would be nice to have a progress bar, as you cannot find out whether the process has failed.

Long List of Claims

I'm trying to pass a long list of claims (10k+) to the --claim option as follows:

cat latest-all.json.gz | gzip -d | wdfilter --languages en \
		--claim P31:Q5,Q811979,Q294414,Q6256,Q486972,Q5107,Q15916867,Q484652,Q43229,... >> output.json

But I'm getting the following error:

Argument list too long

I was wondering if there is a smarter way to do it?

Thanks in advance
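One hedged workaround for the shell's argument-length limit (a sketch only; note that each chunk re-reads the whole dump, and the long-claim-filter performance issue above still applies):

# Split the value list into shell-sized chunks, run one pass per chunk,
# then drop duplicate lines (an entity matching values in two chunks appears twice).
# Assumes ids.txt holds one Q id per line.
split -l 2000 ids.txt chunk_
for f in chunk_*; do
  values=$(paste -sd, "$f")
  cat latest-all.json.gz | gzip -d \
    | wikibase-dump-filter --languages en --claim "P31:$values" >> output.ndjson
done
sort -u output.ndjson > output.dedup.ndjson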
