gambolputty / wikitable2csv

A web tool to convert Wiki tables to CSV 📈

Home Page: https://wikitable2csv.ggor.de

License: MIT License

JavaScript 1.63% HTML 5.02% CSS 0.41% Shell 0.04% TypeScript 92.90%
converter csv data table wikipedia

wikitable2csv's Issues

make filename more specific

'<button class="table2csv-output__download-btn btn btn-secondary mr-2" data-download-target="table-' + blockId + '">Download</button>' +

How could I edit this so that the title of the Wikipedia article is also added in front of blockId?
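
One possible direction, as a minimal sketch rather than the project's actual code: derive a slug from the article URL and prepend it to the block id. The names `articleSlug` and `downloadTarget` are hypothetical, and it is an assumption that the value of `data-download-target` is what ends up in the file name.

```typescript
// Hypothetical helper: derives a file-name-safe slug from a Wikipedia URL.
function articleSlug(pageUrl: string): string {
  const url = new URL(pageUrl);
  // Article paths look like /wiki/<Title>; take the last path segment.
  const last = url.pathname.split("/").filter(Boolean).pop() ?? "table";
  // Decode percent-escapes, then replace characters that are unsafe in file names.
  return decodeURIComponent(last).replace(/[\\/:*?"<>|\s]+/g, "_");
}

// Hypothetical helper: builds the value for data-download-target.
function downloadTarget(pageUrl: string, blockId: number): string {
  return `${articleSlug(pageUrl)}-table-${blockId}`;
}
```

The template string in the snippet above could then use `downloadTarget(url, blockId)` in place of `'table-' + blockId`, provided the code that reads the attribute accepts the longer value.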

CSV output contains CSS code lines from style tag

Hello,

I used your website and ran into some unexpected behavior.
I tried to parse the table at https://de.wikipedia.org/wiki/Liste_traditioneller_Radikale, which, for the most part, resulted in a great CSV table.

Only rows 64 and 147 contained unwanted CSS such as .mw-parser-output .Hant{font-size:110%}:

Nr.,Zeichen (Varianten),Pīnyīn,Bedeutung und Anmerkungen,Häufig-keit,Kurz-zeichen,Beispiele
1,一,yī,eins,42,,七三不世
2,丨,gǔn,Vertikalstrich,21,,中
3,丶,zhǔ,Tropfstrich,10,,丸主
[...]
64,"手 (.mw-parser-output .Hans{font-size:110%}才,扌 links)",shǒu,"Hand, in der Hand halten",1.203,,手打持掛挙
[...]
147,.mw-parser-output .Hant{font-size:110%}見,jiàn,sehen,161,见[2],規親覺觀
[...]

When I inspected the source code of the wiki page, I saw that this text is indeed embedded in the HTML table itself (only for these two rows, though):

<td>
   <link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r184932623">
   <span lang="zh-Hani" class="Hani"></span> (
   <style data-mw-deduplicate="TemplateStyles:r184932629">.mw-parser-output .Hans{font-size:110%}</style>
   <span lang="zh-Hans" class="Hans"></span>,
   <link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r184932623">
   <span lang="zh-Hani" class="Hani"></span> <small>links</small>)
</td>
<td>
   <style data-mw-deduplicate="TemplateStyles:r184932626">.mw-parser-output .Hant{font-size:110%}</style>
   <span lang="zh-Hant" class="Hant"></span>
</td>

Can the CSS code inside any <style></style> tag, or the style tag itself, be removed when generating the CSV table?

Thanks!
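
A minimal sketch of the requested behavior, assuming the cell is available as an HTML string: drop every `<style>` element before the remaining tags are stripped. The function names are hypothetical and this is not the tool's actual extraction code; the regular expressions are a simplification that does not handle every HTML edge case.

```typescript
// Hypothetical helper: removes <style> elements together with their CSS content.
function stripStyleElements(cellHtml: string): string {
  return cellHtml.replace(/<style\b[^>]*>[\s\S]*?<\/style\s*>/gi, "");
}

// Hypothetical helper: reduces a cell's HTML to plain text for the CSV.
function cellText(cellHtml: string): string {
  return stripStyleElements(cellHtml)
    .replace(/<[^>]+>/g, "") // drop the remaining tags, keep their text
    .replace(/\s+/g, " ")    // collapse whitespace
    .trim();
}
```

In a browser, the same idea can be expressed on the DOM by calling `remove()` on each element returned by `cell.querySelectorAll("style")` before reading `textContent`.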

[FEATURE] Restrict columns to be extracted

Sometimes we do not need every column of a Wikipedia table.

I imagine this feature as an extra text field in which the user can define the columns to be skipped.

What is your take?
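
As a rough sketch of how such a field could work, assuming the table has already been converted to rows of strings: parse a comma-separated list of 1-based column numbers and filter those columns out of every row. Both function names and the input format are assumptions made for illustration.

```typescript
// Hypothetical helper: parses input such as "2, 4" into zero-based column indexes.
function parseSkipList(input: string): Set<number> {
  const skip = new Set<number>();
  for (const part of input.split(",")) {
    const n = Number.parseInt(part.trim(), 10);
    // Ignore empty or invalid entries; column numbers are 1-based for the user.
    if (Number.isInteger(n) && n >= 1) skip.add(n - 1);
  }
  return skip;
}

// Hypothetical helper: removes the skipped columns from every row.
function dropColumns(rows: string[][], skip: Set<number>): string[][] {
  return rows.map(row => row.filter((_, index) => !skip.has(index)));
}
```

Filtering after the table has been flattened keeps the feature independent of how rowspan and colspan cells are handled.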

Enable csv file as response

It would be really useful to me to be able to choose whether the output is returned as it is now or as a file named after the page URL.

Use from command-line

Would it be possible to use Node on the command line to generate CSV from wiki tables?

Thanks for a great project nonetheless.

table rows messed

I tried to convert the tables at the following URL:

https://zh.wikipedia.org/wiki/%E8%BE%BD%E6%9C%9D%E5%90%9B%E4%B8%BB%E5%88%97%E8%A1%A8#.E8.A5.BF.E8.BE.BD

The 2nd table has complex rowspan settings, and the result is messed up. Some content from earlier lines reappears in later lines. For example, ",耶律亿,阿保机,872年-926年,神册、天赞、天显,907年-926年" in the 4th line comes from the 1st line.

1 太祖,升天皇帝 (926年太宗初谥),耶律亿,阿保机,872年-926年,神册、天赞、天显,907年-926年
2 太祖,大圣大明天皇帝 (1008年圣宗加谥)
3 太祖,大圣大明神烈天皇帝 (1052年兴宗加谥)
4 ,贞烈皇后 (953年穆宗初谥),耶律亿,阿保机,872年-926年,神册、天赞、天显,907年-926年,述律平 (称制),月理朵,879年-953年,天显,926年-927年
5 ,淳钦皇后 (1052年兴宗改谥)
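
For reference, a common way to flatten rowspan and colspan is to fill a grid in which every spanned cell occupies all the slots it covers, so that later rows skip the slots already taken. The sketch below illustrates that technique on a simplified, hypothetical `Cell` model; it is not the tool's current implementation and has not been checked against the table in question.

```typescript
// Hypothetical cell model: text plus optional span attributes.
interface Cell {
  text: string;
  rowspan?: number;
  colspan?: number;
}

// Expands spans into a rectangular grid of strings.
function expandSpans(rows: Cell[][]): string[][] {
  const grid: string[][] = rows.map(() => []);
  rows.forEach((row, r) => {
    let c = 0;
    for (const cell of row) {
      // Skip slots already filled by a rowspan from an earlier row.
      while (grid[r][c] !== undefined) c++;
      const rowspan = cell.rowspan ?? 1;
      const colspan = cell.colspan ?? 1;
      // Copy the text into every slot the cell covers, clipped to the table height.
      for (let dr = 0; dr < rowspan && r + dr < rows.length; dr++) {
        for (let dc = 0; dc < colspan; dc++) {
          grid[r + dr][c + dc] = cell.text;
        }
      }
      c += colspan;
    }
  });
  // Replace any slots that were never filled with empty strings.
  return grid.map(row => Array.from(row, value => value ?? ""));
}
```

With this approach a cell that spans three rows appears in the same column of all three output rows, instead of shifting into a later row.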

phantomjs install error and Errorpage on start

On Ubuntu 18.04 I ran npm install. It seemed to work at first, but after a few minutes I got this error:


npm WARN optional Skipping failed optional dependency /chokidar/fsevents:
npm WARN notsup Not compatible with your operating system or architecture: [email protected]
npm WARN [email protected] requires a peer of popper.js@^1.14.3 but none was installed.
npm ERR! Linux 4.16.0-041600-generic
npm ERR! argv "/usr/bin/node" "/usr/bin/npm" "install"
npm ERR! node v8.10.0
npm ERR! npm  v3.5.2
npm ERR! code ELIFECYCLE

npm ERR! [email protected] install: `node install.js`
npm ERR! Exit status 1
npm ERR! 
npm ERR! Failed at the [email protected] install script 'node install.js'.
npm ERR! Make sure you have the latest version of node.js and npm installed.
npm ERR! If you do, this is most likely a problem with the phantomjs-prebuilt package,
npm ERR! not with npm itself.
npm ERR! Tell the author that this fails on your system:
npm ERR!     node install.js
npm ERR! You can get information on how to open an issue for this project with:
npm ERR!     npm bugs phantomjs-prebuilt
npm ERR! Or if that isn't available, you can get their info via:
npm ERR!     npm owner ls phantomjs-prebuilt
npm ERR! There is likely additional logging output above.

npm ERR! Please include the following file with any support request:
npm ERR!     /var/www/wikitable2csv/npm-debug.log

$ nodejs --version
v8.10.0
$ npm --version
3.5.2

Add a flag for rows expanded from a rowspan row

Hi, I found that the tool automatically repeats a cell's content across rows when the cell has a rowspan attribute. If a row has cells with a colspan attribute, however, their content is only output in the first associated row. Is this by design or a bug?

For tables with rowspan and colspan cells, I sometimes have to do manual combining and cleanup, so a flag indicating that a row was expanded from a colspan and/or rowspan row would help a lot.

Thank you for this good tool, it really helps!
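
One way the requested flag could look, as a sketch only: if the expansion step records for each output cell whether it was copied from a span, an extra column can mark the affected rows. The `ExpandedCell` shape, the `fromSpan` field, and the "1"/"0" encoding are all assumptions for illustration.

```typescript
// Hypothetical output cell: fromSpan is true when the text was copied
// into this slot because of a rowspan or colspan on another slot.
interface ExpandedCell {
  text: string;
  fromSpan: boolean;
}

// Appends a flag column: "1" if any cell in the row came from a span, else "0".
function appendSpanFlag(rows: ExpandedCell[][]): string[][] {
  return rows.map(row => {
    const expanded = row.some(cell => cell.fromSpan);
    return [...row.map(cell => cell.text), expanded ? "1" : "0"];
  });
}
```

Rows flagged with "1" are the ones that would need manual review or merging after export.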

CSV contains citation link text

Hello,

When the wiki table contains a citation (e.g. [2]), the generated CSV includes it as plain text. This is probably not desired.

Example: https://de.wikipedia.org/wiki/Liste_traditioneller_Radikale#Tabelle_der_Radikale

Output:

Nr.,Zeichen (Varianten),Pīnyīn,Bedeutung und Anmerkungen,Häufig-keit,Kurz-zeichen,Beispiele
147,.mw-parser-output .Hant{font-size:110%}見,jiàn,sehen,161,见[2],規親覺觀
148,角,jiǎo,"Horn, Ecke",158,,觚解觕觥觸
149,言 (訁 links),yán,"sprechen, Wort",861,讠[2]links,誁詋詔評詗詥試詧

(The [2] is the undesired text, because it is useless by itself)

The HTML responsible for this is:

<td>
   <link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r184932629">
   <span lang="zh-Hans" class="Hans"></span>
   <sup id="cite_ref-s_2-1" class="reference">
      <a href="#cite_note-s-2">[2]</a>
   </sup>
</td>

Can the citation links (hyperlinks with square brackets) be removed when generating the CSV?
Basically, this means all <a> tags wrapped in a <sup> tag with class="reference".
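
A minimal sketch of that rule, assuming the cell is available as an HTML string: remove each `<sup>` element whose class list contains `reference`, then strip the remaining tags. The function names are hypothetical, and the regular expression is a simplification that assumes citation markers do not contain nested `<sup>` elements.

```typescript
// Hypothetical helper: removes <sup class="reference">…</sup> citation markers.
function stripReferences(cellHtml: string): string {
  const referenceSup =
    /<sup\b[^>]*\bclass="[^"]*\breference\b[^"]*"[^>]*>[\s\S]*?<\/sup\s*>/gi;
  return cellHtml.replace(referenceSup, "");
}

// Hypothetical helper: plain text of a cell without citation markers.
function cellTextWithoutReferences(cellHtml: string): string {
  return stripReferences(cellHtml)
    .replace(/<[^>]+>/g, "") // drop the remaining tags, keep their text
    .replace(/\s+/g, " ")
    .trim();
}
```

Other `<sup>` elements, such as exponents, are left untouched because they lack the `reference` class.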

Add wiki tags

Where the data contains wiki links, output it with the wiki link markup, like "[[India]]" or "[[M. K. Stalin|Stalin]]".
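
As an illustration of the requested output format, the sketch below builds wiki link markup from a link's target title and its visible label. The function name `toWikiLink` is hypothetical, and it assumes the target title has already been extracted from the link's URL.

```typescript
// Hypothetical helper: formats a link as [[Title]] or [[Title|Label]].
function toWikiLink(targetTitle: string, label: string): string {
  // Titles taken from URLs use underscores where the page title has spaces.
  const title = targetTitle.replace(/_/g, " ");
  return title === label ? `[[${title}]]` : `[[${title}|${label}]]`;
}
```

The piped form is only used when the visible text differs from the page title, which matches the two examples in the request.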
