

wikibase-dump-filter

Filter and format a newline-delimited JSON stream of Wikibase entities.

Typically useful to create a formatted subset of a Wikibase JSON dump.

Some context: this tool was formerly known as wikidata-filter. Wikidata is an instance of Wikibase. This tool was primarily designed with Wikidata in mind, but should be usable for any Wikibase instance.

This project received a Wikimedia Project Grant.


Install

This tool requires NodeJS to be installed.

# Install globally
npm install -g wikibase-dump-filter
# Or install just to be used in the scripts of the current project
npm install wikibase-dump-filter

Changelog

See CHANGELOG.md for version info

Download dump

Wikidata dumps

Wikidata provides a bunch of database dumps, among which the desired JSON dump. As a Wikidata dump is a very large file (April 2020: 75 GB compressed), it is recommended to download that file first before doing any operation on it, so that if anything crashes you don't have to restart the download from zero (the download time usually being the bottleneck).

wget --continue https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz
cat latest-all.json.gz | gzip -d | wikibase-dump-filter --claim P31:Q5 > humans.ndjson

Your own Wikibase instance dump

You can generate a JSON dump using the script dumpJson.php. If you are running Wikibase with wikibase-docker, you could use the following command:

cd wikibase-docker
docker-compose exec wikibase /bin/sh -c "php ./extensions/Wikibase/repo/maintenance/dumpJson.php --log /dev/null" > dump.json
cat dump.json | wikibase-dump-filter --claim P1:Q1 > entities_with_claim_P1_Q1.ndjson

How-to

This package can be used both as a command-line tool (CLI) and as a NodeJS module. Those two uses have their own documentation pages, but the options stay the same and are documented in the CLI section.
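As a hedged example of a typical CLI invocation (options as referenced elsewhere in this README and its issues; adjust to your needs):

# Keep humans, keep only English and Spanish terms, and drop sitelinks
cat latest-all.json.gz | gzip -d \
  | wikibase-dump-filter --claim P31:Q5 --languages en,es --omit sitelinks \
  > humans.ndjson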


You may also like


Do you know Inventaire? It's a web app to share books with your friends, built on top of Wikidata! And it's libre software too.

License

MIT


wikibase-dump-filter's Issues

Troubles with old wikidump

Hi! I'm working with old Wikidata dumps. The dump was downloaded from the official source.
I'm trying to filter my unpacked dump with different parameters and I get the following error.

daniillevchenko@danil ~/w/P/RaiseWikibase (main) [SIGPIPE|1]> cat entities.json | wikibase-dump-filter --claim 'P31:Q5' > another_my_test.json
    parsed | total average parse time | recent average parse time |       kept | % of total |   last kept | last kept time | elapsed time
         0 |                      0ms |                       0ms |          0 |         0% |             |              0 |     00:00:00
/usr/lib/node_modules/wikibase-dump-filter/lib/valid_claims.js:20
  let propClaims = claims[P]
                         ^

TypeError: Cannot read properties of undefined (reading 'P31')
    at /usr/lib/node_modules/wikibase-dump-filter/lib/valid_claims.js:20:26
    at arraySome (/usr/lib/node_modules/wikibase-dump-filter/node_modules/lodash.some/index.js:140:9)
    at some (/usr/lib/node_modules/wikibase-dump-filter/node_modules/lodash.some/index.js:1838:10)
    at /usr/lib/node_modules/wikibase-dump-filter/lib/valid_claims.js:14:10
    at arrayEvery (/usr/lib/node_modules/wikibase-dump-filter/node_modules/lodash.every/index.js:140:10)
    at every (/usr/lib/node_modules/wikibase-dump-filter/node_modules/lodash.every/index.js:1865:10)
    at module.exports (/usr/lib/node_modules/wikibase-dump-filter/lib/valid_claims.js:10:10)
    at /usr/lib/node_modules/wikibase-dump-filter/lib/filter_entity.js:13:10
    at /usr/lib/node_modules/wikibase-dump-filter/lib/filter_format_and_serialize_entity.js:16:9
    at Stream.<anonymous> (/usr/lib/node_modules/wikibase-dump-filter/lib/stream_utils.js:24:22)

Node.js v17.3.0

I tried this yesterday with the latest Wikidata dump and it worked fine, but with the old dump I have this problem that I cannot fix.
Please give me a direction to think in: is the trouble with the old dump or with my usage?

Thanks in advance for your reply!

Filter formatting

Is it possible to format the following query into a valid filter somehow?

SELECT ?work ?workLabel
WHERE
{
  ?work wdt:P31/wdt:P279* wd:Q43229. # instance of any subclass of organization
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}

I want to get all entities that are instances of a subclass (of a subclass, etc.) of organizations.
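The claim filter itself does not follow P279 paths, but one hedged two-step approach (a sketch, assuming curl and jq are available, and noting that a long value list runs into the argument-length and performance issues reported further down) is to resolve the subclass tree with a SPARQL query first, then pass the resulting Q ids as claim values:

# 1. Fetch Q43229 and all its (transitive) subclasses from the Wikidata Query Service
curl -sG 'https://query.wikidata.org/sparql' \
  --header 'Accept: application/sparql-results+json' \
  --data-urlencode 'query=SELECT ?class WHERE { ?class wdt:P279* wd:Q43229 }' \
  | jq -r '.results.bindings[].class.value | split("/") | last' \
  | paste -sd, - > class_ids.txt
# 2. Keep entities that are an instance of any of those classes
cat latest-all.json.gz | gzip -d \
  | wikibase-dump-filter --claim "P31:$(cat class_ids.txt)" > organizations.ndjson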

Filter by multiple claims

Hi,

I think it would be nice if we could filter by multiple claims at once (either their disjunction or conjunction). This could look something like the & and | syntax the sitelink option uses:

P42:Q8,Q13&~P666, P45:Q10|P32:Q10, etc.

Support for multicore systems

I use the filter in 2 ramdisks, each around 100 GB large, to speed up processing. Still, my 32-core machine idles at around 5% and will take 12-16 hours to filter all entries (0.5 ms average time).

As I don't know NodeJS well, I'm not sure I can add multi-threading to this myself, but Node.js can definitely spawn child processes - is there an easy 2-3 line addition to spawn more workers? See https://nodejs.org/docs/latest/api/cluster.html

I'm using server boards, but I guess lots of people doing this will be on a Ryzen system or similar.

Multicore unpacking of the archive is doable with 'pbzip2 -d -c /mnt/ramdisk/latest-all.json.bz2 | wikibase-dump-filter', which shows node at exactly 100% and the unzipping at ~110%, so node is still the bottleneck. This halves the average to 0.25 ms for me, but with "just" 64 GB RAM on, say, a rented hosted machine with lots of cores, you could get the filter time down to under 30 minutes with multicore processing, greatly reducing costs for weekly updates.

Thanks for your great work, really sparing me days of processing,
R
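A hedged sketch of one way to use more cores, assuming GNU parallel is installed and that each worker tolerates the dump's wrapper lines and trailing commas in its chunk (which the filter needs to handle for the official dump anyway):

# Decompress on several cores with pbzip2 and spread the filtering over
# multiple wikibase-dump-filter processes with GNU parallel.
# --pipe splits the stream on line boundaries, so each worker sees whole entity lines.
pbzip2 -d -c /mnt/ramdisk/latest-all.json.bz2 \
  | parallel --pipe --block 50M 'wikibase-dump-filter --claim P31:Q5' \
  > humans.ndjson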

Keep selected languages only

Keeping all languages of labels, descriptions, and aliases takes a lot of space. Maybe you are only interested, for instance, in English and Spanish:

wikidata-filter --keep-language en,es

There is probably less use for omit-language, but who knows?

Make --claim optional

The script fails if no claim is given. It should be possible to filter an already filtered dump, for instance with --keep only.
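A hedged example of that second-pass use case, assuming --keep accepts a comma-separated list of top-level attributes:

# Second pass over an already filtered subset: keep only labels and claims
cat humans.ndjson | wikibase-dump-filter --keep labels,claims > humans_light.ndjson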

Unexpected token error.

cat latest-all.json | wikidata-filter --claim P31:Q5 > humans.ndjson

/usr/local/lib/node_modules/wikidata-filter/lib/entity_parser.js:8
.pipe(filter(line => {
^
SyntaxError: Unexpected token >
at Module._compile (module.js:439:25)

"not"-operator doesn't seem to be working properly

I tried to get all humans without a death date and used this claim: --claim 'P31:Q5|~P570'

It seems to be writing all entries from the overall Wikidata dump to the file and not just the requested ones.
I also tried the claim from the How-To examples but I got the same result. Is there something wrong with the claim I'm using?
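For what it's worth, if | is read as a disjunction, 'P31:Q5|~P570' keeps every entity that is either a human or has no death date, which matches almost the whole dump; the intersection would presumably be written with & (a hedged sketch, assuming the conjunction syntax proposed in the multiple-claims issue above is supported):

# Hypothetical: humans (P31:Q5) AND no death date (~P570)
cat latest-all.json.gz | gzip -d \
  | wikibase-dump-filter --claim 'P31:Q5&~P570' > living_humans.ndjson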

filtering items with deprecated claims

Applying the command bzcat latest-all.json.bz2 |wikibase-dump-filter --simplify --claim 'P698' |jq '[.id,.claims.P698,.claims.P921]' -c >PMID.ndjson results in >30M lines like this:

["Q94880466",["19484558"],null]
["Q17485067",["21609473"],["Q18123741","Q12156","Q193430"]]

where the first case is an item with P698 claim but without P921 claims, and the second has P698 and P921 claims. However out of these 30M there are at least six (6) that are different:
ralf@ark:~/wikidata> grep '[]' PMID.ndjson

["Q30573040",["23057853"],[]]
["Q30523792",["22888462"],[]]
["Q48835971",[],null]
["Q50125628",[],null]
["Q58616403",[],null]
["Q31128925",["27613570"],[]]

Note that 3 don't have P698 (which should not happen given the filter), and 3 have [] instead of null for no P921.

I'm not claiming there is a bug in wikibase-dump-filter, just that this needs investigating, and the ticket is a start. But maybe you have seen this and have an immediate explanation?

certain entities aren't filtered for unknown reason

I tried this claim: --claim 'P31:Q515,Q7930989,Q15284' to get all cities and municipalities from around the world. I found that "Frankfurt am Main" isn't in the resulting file, even though it is an instance of "city" (Q515). "Frankfurt am Main" also has other items in the "instance of" property, but they shouldn't affect the outcome, right? Similar entities like "Munich", which also have multiple items in that property next to the "city" item, are in the resulting file.
I noticed that the filter shows this after finishing: in: 1736 | total: 9762074 | last entity in: Q84908318. If I understand it correctly, this means 1736 entities have been filtered from 9.7 million. However, the resulting file has over 14,000 lines, each of which is an entity, right? How does this fit together?

filter all entities that are instances of a subclass of an item

Hey there :)
So I'd like to get all cities and municipalities of the world. But for example some only have "municipality of [country]". The "municipality of [...]" item is a subclass of the "municipality" item. Would it be possible to only filter by the overall item and still get all entities that are instances of the different subclasses?

Filter purely by Q-Number

Hello,

I have a large list of Wikidata ids (Q numbers) and I'd like to filter out just these entities. Does this already exist / is this possible to implement?

Thank you!

Filter by Language Excludes Many Entities

After a little debugging of the results of filtering by en, I found that the filter discards an entity if en is not found in the labels, even though en is present in the aliases, and vice versa.

I think entities should only be removed if the language is found in neither the labels nor the aliases.

What do you think?

Differing item numbers, issue with tool?

Hello!

I'm working on macOS 12.2; the script is run on a server at our research institute (in the Su Lab). I've been using WDF and getting different results from @andrawaag (https://github.com/andrawaag/wdsub) and @seyedahbr (https://github.com/seyedahbr/wdumper), and we're not entirely sure why this is the case.

We are using the same Wikidata dump (Jan 3 2022), which takes between 5 and 7 hours to download. Here is the script I then run using WDF, with and without a prefilter.

The output is the same in both cases (with or without the prefilter), so we're wondering if it may be something with the tool, or perhaps you can advise?

One example: we expect to see ~1.2 million item pages for compounds. This is true for both Andra and Seyed using the same identifiers I have (P31:Q11173). However, I get ~18,000, not 1.2 million.

Thoughts on why this is the case?
Many thanks!

Simplified requires arguments?

Can't use simplify with default options (no arguments) as specified in the examples on the "How-to" page. I get the following error:

error: option '-s, --simplify <boolean|options>' argument missing

long claim filter makes the process extremely slow

@hsaif comment:

Something I noticed is that passing 25K claims to the filter renders the process extremely slow.
For example, my process has been running for 19 hours now and has only managed to process ~8M entities!

I didn't have that kind of extreme use case in mind when I wrote the claim filter, so for every entity claim, it iterates over the 25K Q ids to determine if there is a match, which with that amount of ids is crazy inefficient.

Replacing the array lookup with a hash should already be a huge boost, and then we could also play with some multi-process load balancing.
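A minimal JavaScript sketch of the suggested optimization (illustrative only, not the project's actual code): build the value set once and test membership in constant time instead of scanning the 25K-element array for every claim.

// Illustrative sketch: replace repeated array scans with a Set lookup.
const qIdsFromCliArgument = ['Q5', 'Q95074'] // hypothetical: in practice the ~25K Q ids parsed once from --claim
const targetValues = new Set(qIdsFromCliArgument)

const hasMatchingClaim = (claims = {}, property) => {
  const propClaims = claims[property] || []
  return propClaims.some(claim => {
    const datavalue = claim.mainsnak && claim.mainsnak.datavalue
    const valueId = datavalue && datavalue.value && datavalue.value.id
    return targetValues.has(valueId)
  })
}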

Add entity stream method: grep

Like the filter method introduced with cca67d8, to support writing

parser(process.stdin)
.grep( item => item.claims && item.claims.P356 )

instead of

parser(process.stdin)
.filter( item => (item.claims && item.claims.P356) ? item : null )

Filter by non-existing claim

All entities that are not instance of something (also requires #2):

wikidata-filter --claim ~P31

Better use ~ instead of ! because the latter requires escaping, at least in bash. ~ is only expanded to the home directory if given as part of a directory specification but not if given as above.

Filter by type

Only interested in items? Option --type item should help.
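For example, a minimal invocation (assuming the option takes the entity type as its value) could be:

cat dump.json | wikibase-dump-filter --type item > items.ndjson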

Error when using both language filter and omitting sitelinks

Hi! Found a bug when trying to use both the --languages filter and omitting 'sitelinks'.

  • wikibase-dump-filter version: 5
  • NodeJS version: v14.3.0
  • OS version: MacOS 10.15

Reproduction:
cat latest-all.json.bz2 | bzcat | wikibase-dump-filter --languages en,de --omit sitelinks > all.ndjson

Error:
/usr/local/lib/node_modules/wikibase-dump-filter/lib/keep_matching_sitelinks.js:6
Object.keys(sitelinks).forEach(sitelinkName => {
^

TypeError: Cannot convert undefined or null to object
at Function.keys (<anonymous>)
at module.exports (/usr/local/lib/node_modules/wikibase-dump-filter/lib/keep_matching_sitelinks.js:6:10)
at /usr/local/lib/node_modules/wikibase-dump-filter/lib/format_entity.js:21:26
at /usr/local/lib/node_modules/wikibase-dump-filter/lib/filter_format_and_serialize_entity.js:18:30
at Stream.<anonymous> (/usr/local/lib/node_modules/wikibase-dump-filter/lib/stream_utils.js:24:22)
at Stream.stream.write (/usr/local/lib/node_modules/wikibase-dump-filter/node_modules/through/index.js:26:11)
at Stream.ondata (internal/streams/legacy.js:19:31)
at Stream.emit (events.js:315:20)
at drain (/usr/local/lib/node_modules/wikibase-dump-filter/node_modules/through/index.js:36:16)
at Stream.stream.queue.stream.push (/usr/local/lib/node_modules/wikibase-dump-filter/node_modules/through/index.js:45:5)

The error is because we try to filter the sitelinks by language even if sitelinks field has been omitted by the user.
I have raised a PR for the fix here: #29
Great library, btw :)

Claims should have multiple possible values

In addition to just one value (P31:Q5) and any value (#2), it should be possible to filter by a set of possible values, e.g. to keep only entities of humans and fictional characters: wikidata-filter -c P31:Q5,Q95074

progress bar

Would be nice to have a progress bar, as you cannot find out whether the process has failed.

Long List of Claims

I'm trying to pass a long list of claims (10k+) to the --claim option as follows:

cat latest-all.json.gz | gzip -d | wdfilter --languages en \
		--claim P31:Q5,Q811979,Q294414,Q6256,Q486972,Q5107,Q15916867,Q484652,Q43229,... >> output.json

But I'm getting the following error:

Argument list too long

I was wondering if there is a smarter way to do it?

Thanks in advance
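One hedged workaround for the shell's argument-length limit (a sketch only; note that each chunk re-reads the whole dump, and the long-claim-filter performance issue above still applies):

# Split the value list into shell-sized chunks, run one pass per chunk,
# then drop duplicate lines (an entity matching values in two chunks appears twice).
# Assumes ids.txt holds one Q id per line.
split -l 2000 ids.txt chunk_
for f in chunk_*; do
  values=$(paste -sd, "$f")
  cat latest-all.json.gz | gzip -d \
    | wikibase-dump-filter --languages en --claim "P31:$values" >> output.ndjson
done
sort -u output.ndjson > output.dedup.ndjson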
