gambolputty / wikitable2csv
A web tool to convert Wiki tables to CSV 📈
Home Page: https://wikitable2csv.ggor.de
License: MIT License
Thanks for the good tool, it's very useful!
Sometimes the tables on a page share the same format, so a button to copy the contents of all tables at once would save a lot of time.
After a successful installation I am now stuck.
Please add info how to run the app to the readme.
If you pass a URL to a Wikipedia page's history, this tool outputs the current version of the page, not the historical version.
Seems to be broken atm?
I only get a white page when trying to fetch tables, for example:
https://de.wikipedia.org/wiki/Raumschiff_Enterprise/Episodenliste
Tried different browsers.
correction: Actually the tool seems to have an issue with that specific wiki page. Others work fine.
edit: same here
https://de.wikipedia.org/wiki/Star_Trek:_Enterprise/Episodenliste
Cannot find any table on https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita
Additionally, is there a possibility to select only one table? For example, I'm interested only in the IMF table.
Thanks in advance.
I would like to use this tool with MediaWiki sites other than Wikipedia, so I would install it on my own server and change the source accordingly.
Line 193 in de8898e
How could I edit this so that the title of the Wikipedia article is also added in front of blockId?
I want to convert MediaWiki data to CSV, using either pasted text or a file as input. Is this possible?
Hello,
I used your website and ran into a rather unexpected behavior.
I tried to parse the table at https://de.wikipedia.org/wiki/Liste_traditioneller_Radikale, which, for the most part, resulted in a great csv table.
Only the lines numbered 64 and 147 contained an unwanted .mw-parser-output .Hant{font-size:110%}:
Nr.,Zeichen (Varianten),Pīnyīn,Bedeutung und Anmerkungen,Häufig-keit,Kurz-zeichen,Beispiele
1,一,yī,eins,42,,七三不世
2,丨,gǔn,Vertikalstrich,21,,中
3,丶,zhǔ,Tropfstrich,10,,丸主
[...]
64,"手 (.mw-parser-output .Hans{font-size:110%}才,扌 links)",shǒu,"Hand, in der Hand halten",1.203,,手打持掛挙
[...]
147,.mw-parser-output .Hant{font-size:110%}見,jiàn,sehen,161,见[2],規親覺觀
[...]
When I inspected the source code of the wiki page, I saw that this text is indeed embedded in the HTML table itself (though only for these two lines):
<td>
<link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r184932623">
<span lang="zh-Hani" class="Hani">手</span> (
<style data-mw-deduplicate="TemplateStyles:r184932629">.mw-parser-output .Hans{font-size:110%}</style>
<span lang="zh-Hans" class="Hans">才</span>,
<link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r184932623">
<span lang="zh-Hani" class="Hani">扌</span> <small>links</small>)
</td>
<td>
<style data-mw-deduplicate="TemplateStyles:r184932626">.mw-parser-output .Hant{font-size:110%}</style>
<span lang="zh-Hant" class="Hant">見</span>
</td>
Can the CSS code inside any <style></style> tag, or the style tag itself, be removed when generating the CSV table?
Thanks!
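The cleanup asked for above could be sketched like this (a minimal regex sketch for illustration, not the tool's actual code; a real implementation would more robustly remove the nodes from the parsed DOM):

```javascript
// Drop <style>…</style> blocks and self-closing <link> tags from a cell's
// HTML before extracting its text, so TemplateStyles CSS never reaches the CSV.
function stripTemplateStyles(cellHtml) {
  return cellHtml
    .replace(/<style\b[^>]*>[\s\S]*?<\/style>/gi, "") // remove inline CSS blocks
    .replace(/<link\b[^>]*>/gi, "");                  // remove deduplicated-style links
}

const cell =
  '<style data-mw-deduplicate="TemplateStyles:r184932626">' +
  '.mw-parser-output .Hant{font-size:110%}</style>' +
  '<span lang="zh-Hant" class="Hant">見</span>';
console.log(stripTemplateStyles(cell));
// -> <span lang="zh-Hant" class="Hant">見</span>
```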
Sometimes we do not need all the columns of a Wikipedia table.
I imagine this feature as an extra text field in which the user can define the columns to be skipped.
What is your take?
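One possible shape for the requested option (the function name and the 1-based column numbering here are assumptions for illustration): the user supplies the column numbers to drop, and they are filtered out of every parsed row before the CSV is serialized:

```javascript
// Drop the given 1-based column numbers from every row.
function skipColumns(rows, skipList) {
  const skip = new Set(skipList);
  return rows.map(row => row.filter((_, i) => !skip.has(i + 1)));
}

const rows = [
  ["Nr.", "Zeichen", "Pīnyīn", "Bedeutung"],
  ["1", "一", "yī", "eins"],
];
console.log(skipColumns(rows, [2, 3]));
// -> [["Nr.", "Bedeutung"], ["1", "eins"]]
```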
It would be really useful to me to be able to choose whether the output is returned in the current way or as a downloadable file named after the page URL.
Would it be possible to use Node on the command line to generate CSV from wiki tables?
Thanks for a great project nonetheless.
Steps to reproduce
I tried to convert the tables at the following URL:
The 2nd table has complex rowspan settings, and the result is messed up: content from the leading lines reappears in the following lines. For example, ",耶律亿,阿保机,872年-926年,神册、天赞、天显,907年-926年" in the 4th line comes from the 1st line.
1 太祖,升天皇帝 (926年太宗初谥),耶律亿,阿保机,872年-926年,神册、天赞、天显,907年-926年
2 太祖,大圣大明天皇帝 (1008年圣宗加谥)
3 太祖,大圣大明神烈天皇帝 (1052年兴宗加谥)
4 ,贞烈皇后 (953年穆宗初谥),耶律亿,阿保机,872年-926年,神册、天赞、天显,907年-926年,述律平 (称制),月理朵,879年-953年,天显,926年-927年
5 ,淳钦皇后 (1052年兴宗改谥)
Hi, I found that URLs in the following format are not supported:
For languages which have several writing variants and convert between them, this format is normal, so please try to support it. Thanks!
On my Ubuntu 18.04 I started with npm install, and this seemed to work at first, but after some minutes I got this error:
npm WARN optional Skipping failed optional dependency /chokidar/fsevents:
npm WARN notsup Not compatible with your operating system or architecture: [email protected]
npm WARN [email protected] requires a peer of popper.js@^1.14.3 but none was installed.
npm ERR! Linux 4.16.0-041600-generic
npm ERR! argv "/usr/bin/node" "/usr/bin/npm" "install"
npm ERR! node v8.10.0
npm ERR! npm v3.5.2
npm ERR! code ELIFECYCLE
npm ERR! [email protected] install: `node install.js`
npm ERR! Exit status 1
npm ERR!
npm ERR! Failed at the [email protected] install script 'node install.js'.
npm ERR! Make sure you have the latest version of node.js and npm installed.
npm ERR! If you do, this is most likely a problem with the phantomjs-prebuilt package,
npm ERR! not with npm itself.
npm ERR! Tell the author that this fails on your system:
npm ERR! node install.js
npm ERR! You can get information on how to open an issue for this project with:
npm ERR! npm bugs phantomjs-prebuilt
npm ERR! Or if that isn't available, you can get their info via:
npm ERR! npm owner ls phantomjs-prebuilt
npm ERR! There is likely additional logging output above.
npm ERR! Please include the following file with any support request:
npm ERR! /var/www/wikitable2csv/npm-debug.log
$ nodejs --version
v8.10.0
$ npm --version
3.5.2
Trying to parse this:
https://en.wikipedia.org/wiki/Workweek_and_weekend#Around_the_world
It gets tripped up on the row "Congo, Democratic Republic of".
Well, it doesn't fail outright, but the output isn't any good.
To fix it, you could do one or more of the following:
I want to get the table that is present on https://meta.wikimedia.org/wiki/2017_Community_Wishlist_Survey/Results, but as it's not WP, I can't :(
Hi, I found that the tool automatically repeats cells the appropriate number of times when the original row has a rowspan attribute, but if the row has cells with a colspan attribute, the cell content is only output in the first associated row. I wonder if this is by design or a bug.
For tables with rowspan and colspan cells I sometimes have to do manual combining and cleaning work, so a flag indicating that a row was expanded from a colspan and/or rowspan would help a lot.
Thank you for this good tool, it really helps!
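A sketch of how such a flag could work, assuming the tool expands rowspan cells by copying them down into later rows (the cell shape and field names here are made up for illustration, not taken from the tool's source):

```javascript
// Expand rowspan cells across rows, marking each copied-down cell with
// fromRowspan: true so downstream cleanup can recognize expanded rows.
// rows: arrays of { text, rowspan } objects as parsed from the HTML;
// nCols: the table's column count.
function expandRowspans(rows, nCols) {
  const out = [];
  const pending = new Array(nCols).fill(null); // open rowspans per column
  for (const row of rows) {
    const expanded = [];
    let src = 0;
    for (let col = 0; col < nCols; col++) {
      if (pending[col] && pending[col].remaining > 0) {
        // This column is still covered by a rowspan from an earlier row.
        expanded.push({ text: pending[col].text, fromRowspan: true });
        pending[col].remaining--;
      } else {
        const cell = row[src++] || { text: "", rowspan: 1 };
        expanded.push({ text: cell.text, fromRowspan: false });
        if (cell.rowspan > 1) {
          pending[col] = { text: cell.text, remaining: cell.rowspan - 1 };
        }
      }
    }
    out.push(expanded);
  }
  return out;
}

const result = expandRowspans(
  [
    [{ text: "A", rowspan: 2 }, { text: "B", rowspan: 1 }],
    [{ text: "C", rowspan: 1 }],
  ],
  2
);
console.log(result[1]);
// -> [{ text: "A", fromRowspan: true }, { text: "C", fromRowspan: false }]
```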
Hello,
when the wiki table contains a citation (e.g. [2]), the generated CSV will include it as plain text. This is probably not desired.
Example: https://de.wikipedia.org/wiki/Liste_traditioneller_Radikale#Tabelle_der_Radikale
Output:
Nr.,Zeichen (Varianten),Pīnyīn,Bedeutung und Anmerkungen,Häufig-keit,Kurz-zeichen,Beispiele
147,.mw-parser-output .Hant{font-size:110%}見,jiàn,sehen,161,见[2],規親覺觀
148,角,jiǎo,"Horn, Ecke",158,,觚解觕觥觸
149,言 (訁 links),yán,"sprechen, Wort",861,讠[2]links,誁詋詔評詗詥試詧
(The [2] is the undesired text, because it is useless by itself.)
The HTML responsible for this is:
<td>
<link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r184932629">
<span lang="zh-Hans" class="Hans">见</span>
<sup id="cite_ref-s_2-1" class="reference">
<a href="#cite_note-s-2">[2]</a>
</sup>
</td>
Can the citation links (hyperlinks with square brackets) be removed when generating the CSV? So basically all the <a> tags that are surrounded by a <sup> tag with class="reference".
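The cleanup described above could look roughly like this (a regex sketch for illustration only; a DOM-based implementation would be more robust against attribute ordering):

```javascript
// Drop whole <sup class="reference">…</sup> elements (including the
// [2]-style link inside them) before extracting a cell's text.
function stripCitations(cellHtml) {
  return cellHtml.replace(/<sup[^>]*class="reference"[^>]*>[\s\S]*?<\/sup>/gi, "");
}

const cell =
  '<span lang="zh-Hans" class="Hans">见</span>' +
  '<sup id="cite_ref-s_2-1" class="reference"><a href="#cite_note-s-2">[2]</a></sup>';
console.log(stripCitations(cell));
// -> <span lang="zh-Hans" class="Hans">见</span>
```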
I try to convert this url:
https://zh.wikipedia.org/wiki/%E5%85%83%E6%9C%9D%E8%A1%8C%E6%94%BF%E5%8C%BA%E5%88%92
All cell data in columns other than the first three is missing in the result.
Wherever the data contains wiki links, the output includes the raw markup, like "[[India]]" or "[[M. K. Stalin|Stalin]]".
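If raw wikitext link markup like this reaches the output, it could be reduced to its display text in a small post-processing step (a sketch, not part of the tool, which normally works on rendered HTML):

```javascript
// Reduce wikitext links to their display text:
// [[India]] -> India, [[M. K. Stalin|Stalin]] -> Stalin.
function stripWikiLinks(text) {
  return text.replace(
    /\[\[([^\]|]*)(?:\|([^\]]*))?\]\]/g,
    (_, target, label) => (label !== undefined ? label : target)
  );
}

console.log(stripWikiLinks("[[M. K. Stalin|Stalin]] of [[India]]"));
// -> Stalin of India
```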
Try with this link: https://fr.wikipedia.org/wiki/Wikip%C3%A9dia:Rapports/Nombre_de_pages_par_namespace
The tool will call https://fr.wikipedia.org/api/rest_v1/page/html/Wikip%C3%A9dia:Rapports/Nombre_de_pages_par_namespace
But this is the correct link: https://fr.wikipedia.org/api/rest_v1/page/html/Wikip%C3%A9dia%3ARapports%2FNombre_de_pages_par_namespace
Perhaps it is a problem on the API server side, too.
Thanks
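For reference, JavaScript's built-in encodeURIComponent produces exactly this escaped form, since it encodes ":" and "/" as well, turning the title into a single percent-encoded path segment as the REST API expects:

```javascript
// Build the REST API URL with the page title as one encoded path segment.
const title = "Wikipédia:Rapports/Nombre_de_pages_par_namespace";
const url =
  "https://fr.wikipedia.org/api/rest_v1/page/html/" + encodeURIComponent(title);
console.log(url);
// -> https://fr.wikipedia.org/api/rest_v1/page/html/Wikip%C3%A9dia%3ARapports%2FNombre_de_pages_par_namespace
```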