gambolputty / wikitable2csv
A web tool to convert Wiki tables to CSV 📈
Home Page: https://wikitable2csv.ggor.de
License: MIT License
Thanks for the good tool, it's very useful!
Sometimes the tables on a page share the same format, so a button to copy the contents of all tables at once would save a lot of time.
After a successful installation I am now stuck.
Please add info how to run the app to the readme.
If you pass a URL to a Wikipedia page's history, this tool outputs the current version of the page, not the historical version.
Seems to be broken atm?
I only get a white page when trying to fetch tables, for example:
https://de.wikipedia.org/wiki/Raumschiff_Enterprise/Episodenliste
Tried different browsers.
correction: Actually the tool seems to have an issue with that specific wiki page. Others work fine.
edit: same here
https://de.wikipedia.org/wiki/Star_Trek:_Enterprise/Episodenliste
Cannot find any table on https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita
Additionally, is there a possibility to select only one table? For example, I'm interested only in the IMF table.
Thanks in advance.
I would like to use this tool with MediaWiki sites other than Wikipedia, so I would install it on my own server and change the source accordingly.
Line 193 in de8898e
How could I edit this so that the title of the Wikipedia article is also added in front of blockId?
I want to convert MediaWiki data to CSV, using either pasted text or a file as input. Is this possible?
Hello,
I used your website and ran into a rather unexpected behavior.
I tried to parse the table at https://de.wikipedia.org/wiki/Liste_traditioneller_Radikale, which, for the most part, resulted in a great csv table.
Only the lines numbered 64 and 147 contained an unwanted .mw-parser-output .Hant{font-size:110%}:
Nr.,Zeichen (Varianten),Pīnyīn,Bedeutung und Anmerkungen,Häufig-keit,Kurz-zeichen,Beispiele
1,一,yī,eins,42,,七三不世
2,丨,gǔn,Vertikalstrich,21,,中
3,丶,zhǔ,Tropfstrich,10,,丸主
[...]
64,"手 (.mw-parser-output .Hans{font-size:110%}才,扌 links)",shǒu,"Hand, in der Hand halten",1.203,,手打持掛挙
[...]
147,.mw-parser-output .Hant{font-size:110%}見,jiàn,sehen,161,见[2],規親覺觀
[...]
When I inspected the source code of the wiki page, I saw that this text is indeed embedded in the HTML table itself (though only for these two lines):
<td>
<link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r184932623">
<span lang="zh-Hani" class="Hani">手</span> (
<style data-mw-deduplicate="TemplateStyles:r184932629">.mw-parser-output .Hans{font-size:110%}</style>
<span lang="zh-Hans" class="Hans">才</span>,
<link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r184932623">
<span lang="zh-Hani" class="Hani">扌</span> <small>links</small>)
</td>
<td>
<style data-mw-deduplicate="TemplateStyles:r184932626">.mw-parser-output .Hant{font-size:110%}</style>
<span lang="zh-Hant" class="Hant">見</span>
</td>
Can the CSS code inside any <style></style> tag, or the style tag itself, be removed when generating the CSV table?
Thanks!
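The cleanup asked for above could be sketched like this (a minimal regex sketch for illustration, not the tool's actual code; a real implementation would more robustly remove the nodes from the parsed DOM):

```javascript
// Drop <style>…</style> blocks and self-closing <link> tags from a cell's
// HTML before extracting its text, so TemplateStyles CSS never reaches the CSV.
function stripTemplateStyles(cellHtml) {
  return cellHtml
    .replace(/<style\b[^>]*>[\s\S]*?<\/style>/gi, "") // remove inline CSS blocks
    .replace(/<link\b[^>]*>/gi, "");                  // remove deduplicated-style links
}

const cell =
  '<style data-mw-deduplicate="TemplateStyles:r184932626">' +
  '.mw-parser-output .Hant{font-size:110%}</style>' +
  '<span lang="zh-Hant" class="Hant">見</span>';
console.log(stripTemplateStyles(cell));
// -> <span lang="zh-Hant" class="Hant">見</span>
```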
Sometimes we do not need all the columns of a Wikipedia table.
I imagine this feature as an extra text field in which the user can define the columns to be skipped.
What is your take?
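One possible shape for the requested option (the function name and the 1-based column numbering here are assumptions for illustration): the user supplies the column numbers to drop, and they are filtered out of every parsed row before the CSV is serialized:

```javascript
// Drop the given 1-based column numbers from every row.
function skipColumns(rows, skipList) {
  const skip = new Set(skipList);
  return rows.map(row => row.filter((_, i) => !skip.has(i + 1)));
}

const rows = [
  ["Nr.", "Zeichen", "Pīnyīn", "Bedeutung"],
  ["1", "一", "yī", "eins"],
];
console.log(skipColumns(rows, [2, 3]));
// -> [["Nr.", "Bedeutung"], ["1", "eins"]]
```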
It would be really useful to me to be able to choose whether the output is returned in the current way or as a downloadable file named after the page URL.
Would it be possible to use Node on the command line to generate CSV from wiki tables?
Thanks for a great project nonetheless.
Steps to reproduce
I tried to convert the tables at the following URL:
The 2nd table has complex rowspan settings, and the result is messed up: content from the leading lines reappears in the following lines. For example, ",耶律亿,阿保机,872年-926年,神册、天赞、天显,907年-926年" in the 4th line comes from the 1st line.
1 太祖,升天皇帝 (926年太宗初谥),耶律亿,阿保机,872年-926年,神册、天赞、天显,907年-926年
2 太祖,大圣大明天皇帝 (1008年圣宗加谥)
3 太祖,大圣大明神烈天皇帝 (1052年兴宗加谥)
4 ,贞烈皇后 (953年穆宗初谥),耶律亿,阿保机,872年-926年,神册、天赞、天显,907年-926年,述律平 (称制),月理朵,879年-953年,天显,926年-927年
5 ,淳钦皇后 (1052年兴宗改谥)
Hi, I found that URLs in the following format are not supported:
For languages which have several writing variants and convert between them, this format is normal, so please try to support it. Thanks!
On my Ubuntu 18.04 I started with npm install, and this seemed to work at first, but after some minutes I got this error:
npm WARN optional Skipping failed optional dependency /chokidar/fsevents:
npm WARN notsup Not compatible with your operating system or architecture: [email protected]
npm WARN [email protected] requires a peer of popper.js@^1.14.3 but none was installed.
npm ERR! Linux 4.16.0-041600-generic
npm ERR! argv "/usr/bin/node" "/usr/bin/npm" "install"
npm ERR! node v8.10.0
npm ERR! npm v3.5.2
npm ERR! code ELIFECYCLE
npm ERR! [email protected] install: `node install.js`
npm ERR! Exit status 1
npm ERR!
npm ERR! Failed at the [email protected] install script 'node install.js'.
npm ERR! Make sure you have the latest version of node.js and npm installed.
npm ERR! If you do, this is most likely a problem with the phantomjs-prebuilt package,
npm ERR! not with npm itself.
npm ERR! Tell the author that this fails on your system:
npm ERR! node install.js
npm ERR! You can get information on how to open an issue for this project with:
npm ERR! npm bugs phantomjs-prebuilt
npm ERR! Or if that isn't available, you can get their info via:
npm ERR! npm owner ls phantomjs-prebuilt
npm ERR! There is likely additional logging output above.
npm ERR! Please include the following file with any support request:
npm ERR! /var/www/wikitable2csv/npm-debug.log
$ nodejs --version
v8.10.0
$ npm --version
3.5.2
Trying to parse this:
https://en.wikipedia.org/wiki/Workweek_and_weekend#Around_the_world
It gets tripped up on the row "Congo, Democratic Republic of".
Well, it doesn't fail outright, but the output isn't any good.
To fix it, you could do one or more of the following:
I want to get the table that is present on https://meta.wikimedia.org/wiki/2017_Community_Wishlist_Survey/Results, but as it's not WP, I can't :(
Hi, I found that the tool automatically repeats cells the appropriate number of times when the original row has a rowspan attribute, but if the row has cells with a colspan attribute, the cell content is only output in the first associated row. I wonder if this is by design or a bug.
For tables with rowspan and colspan cells I sometimes have to do manual combining and cleaning work, so a flag indicating that a row was expanded from a colspan and/or rowspan would help a lot.
Thank you for this good tool, it really helps!
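A sketch of how such a flag could work, assuming the tool expands rowspan cells by copying them down into later rows (the cell shape and field names here are made up for illustration, not taken from the tool's source):

```javascript
// Expand rowspan cells across rows, marking each copied-down cell with
// fromRowspan: true so downstream cleanup can recognize expanded rows.
// rows: arrays of { text, rowspan } objects as parsed from the HTML;
// nCols: the table's column count.
function expandRowspans(rows, nCols) {
  const out = [];
  const pending = new Array(nCols).fill(null); // open rowspans per column
  for (const row of rows) {
    const expanded = [];
    let src = 0;
    for (let col = 0; col < nCols; col++) {
      if (pending[col] && pending[col].remaining > 0) {
        // This column is still covered by a rowspan from an earlier row.
        expanded.push({ text: pending[col].text, fromRowspan: true });
        pending[col].remaining--;
      } else {
        const cell = row[src++] || { text: "", rowspan: 1 };
        expanded.push({ text: cell.text, fromRowspan: false });
        if (cell.rowspan > 1) {
          pending[col] = { text: cell.text, remaining: cell.rowspan - 1 };
        }
      }
    }
    out.push(expanded);
  }
  return out;
}

const result = expandRowspans(
  [
    [{ text: "A", rowspan: 2 }, { text: "B", rowspan: 1 }],
    [{ text: "C", rowspan: 1 }],
  ],
  2
);
console.log(result[1]);
// -> [{ text: "A", fromRowspan: true }, { text: "C", fromRowspan: false }]
```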
Hello,
when the wiki table contains a citation (e.g. [2]), the generated CSV will include it as plain text. This is probably not desired.
Example: https://de.wikipedia.org/wiki/Liste_traditioneller_Radikale#Tabelle_der_Radikale
Output:
Nr.,Zeichen (Varianten),Pīnyīn,Bedeutung und Anmerkungen,Häufig-keit,Kurz-zeichen,Beispiele
147,.mw-parser-output .Hant{font-size:110%}見,jiàn,sehen,161,见[2],規親覺觀
148,角,jiǎo,"Horn, Ecke",158,,觚解觕觥觸
149,言 (訁 links),yán,"sprechen, Wort",861,讠[2]links,誁詋詔評詗詥試詧
(The [2] is the undesired text, because it is useless by itself.)
The HTML responsible for this is:
<td>
<link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r184932629">
<span lang="zh-Hans" class="Hans">见</span>
<sup id="cite_ref-s_2-1" class="reference">
<a href="#cite_note-s-2">[2]</a>
</sup>
</td>
Can the citation links (hyperlinks with square brackets) be removed when generating the CSV? So basically all the <a> tags that are surrounded by a <sup> tag with class="reference".
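The cleanup described above could look roughly like this (a regex sketch for illustration only; a DOM-based implementation would be more robust against attribute ordering):

```javascript
// Drop whole <sup class="reference">…</sup> elements (including the
// [2]-style link inside them) before extracting a cell's text.
function stripCitations(cellHtml) {
  return cellHtml.replace(/<sup[^>]*class="reference"[^>]*>[\s\S]*?<\/sup>/gi, "");
}

const cell =
  '<span lang="zh-Hans" class="Hans">见</span>' +
  '<sup id="cite_ref-s_2-1" class="reference"><a href="#cite_note-s-2">[2]</a></sup>';
console.log(stripCitations(cell));
// -> <span lang="zh-Hans" class="Hans">见</span>
```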
I try to convert this url:
https://zh.wikipedia.org/wiki/%E5%85%83%E6%9C%9D%E8%A1%8C%E6%94%BF%E5%8C%BA%E5%88%92
All cell data in columns other than the first three is missing in the result.
Wherever the data contains wiki links, the output includes the raw markup, like "[[India]]" or "[[M. K. Stalin|Stalin]]".
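If raw wikitext link markup like this reaches the output, it could be reduced to its display text in a small post-processing step (a sketch, not part of the tool, which normally works on rendered HTML):

```javascript
// Reduce wikitext links to their display text:
// [[India]] -> India, [[M. K. Stalin|Stalin]] -> Stalin.
function stripWikiLinks(text) {
  return text.replace(
    /\[\[([^\]|]*)(?:\|([^\]]*))?\]\]/g,
    (_, target, label) => (label !== undefined ? label : target)
  );
}

console.log(stripWikiLinks("[[M. K. Stalin|Stalin]] of [[India]]"));
// -> Stalin of India
```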
Try with this link: https://fr.wikipedia.org/wiki/Wikip%C3%A9dia:Rapports/Nombre_de_pages_par_namespace
The tool will call https://fr.wikipedia.org/api/rest_v1/page/html/Wikip%C3%A9dia:Rapports/Nombre_de_pages_par_namespace
But this is the correct link: https://fr.wikipedia.org/api/rest_v1/page/html/Wikip%C3%A9dia%3ARapports%2FNombre_de_pages_par_namespace
Perhaps it is a problem on the API server side, too.
Thanks
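For reference, JavaScript's built-in encodeURIComponent produces exactly this escaped form, since it encodes ":" and "/" as well, turning the title into a single percent-encoded path segment as the REST API expects:

```javascript
// Build the REST API URL with the page title as one encoded path segment.
const title = "Wikipédia:Rapports/Nombre_de_pages_par_namespace";
const url =
  "https://fr.wikipedia.org/api/rest_v1/page/html/" + encodeURIComponent(title);
console.log(url);
// -> https://fr.wikipedia.org/api/rest_v1/page/html/Wikip%C3%A9dia%3ARapports%2FNombre_de_pages_par_namespace
```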