scholarpedia.org · ipfs-inactive/archives · open · 11 comments

ipfs-inactive commented on July 19, 2024
scholarpedia.org

Comments (11)

davidar commented on July 19, 2024

SGTM! We can do this once #20 is resolved.

vitzli commented on July 19, 2024

There is a newer version now available at https://archive.org/details/wiki-scholarpediaorg-20151102

davidar commented on July 19, 2024

πŸ“Œ /ipfs/Qmaskk1Egq5zmZsGTd7dwNiiK1cwfmx7k1StG1WJQjwGDm

The articles are here.

@DataWraith Feel like converting these to HTML? :)

It's quite a bit smaller than Wikipedia, so it should hopefully be less problematic.
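
To pull it down locally, something like this should work (assuming a running ipfs daemon; the output directory name is arbitrary):

    # fetch the archived dump from the pinned hash
    ipfs get /ipfs/Qmaskk1Egq5zmZsGTd7dwNiiK1cwfmx7k1StG1WJQjwGDm -o scholarpedia-dump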

DataWraith commented on July 19, 2024

Heh. Eventually I'd like to write a program that converts a MediaWiki dump to HTML (probably by running it through pandoc), but right now I'm fairly busy, sorry.
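
For simple pages, pandoc's MediaWiki reader can already handle the markup-to-HTML step by itself; something like this (filenames made up):

    # convert one article's wikitext to standalone HTML using pandoc's mediawiki reader
    pandoc -f mediawiki -t html5 -s article.wiki -o article.html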

I could only do the Wikipedia dump because a third party provided a dump in the OpenZIM format, and an easy-to-use library was available for reading and converting that.

With a raw XML dump, I'd have to roll my own solution, which would take more time than I currently have.

rht commented on July 19, 2024

(@vitzli thanks for updating the archive in archive.org)

davidar commented on July 19, 2024

@DataWraith No worries. I might have a go at getting it to render with https://github.com/davidar/markup.rocks

@vitzli didn't realise you were the one who pushed the updated copy - thanks :)

DataWraith commented on July 19, 2024

I took another look at this, and wanted to share what I found, in case it is useful to the next person.

Extracting the article markup from the XML dump is pretty easy, actually. But just having the article markup doesn't really gain you much. Simple articles can be rendered through pandoc, but more complicated elements (Images, Math, Templates) tend to break things.
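
For reference, the extraction step can be done with stock tools; a rough sketch (assuming xmlstarlet is installed, and guessing at the dump's export-schema version):

    # pull the title and raw wikitext out of every <page> in the dump
    # (the namespace URI varies with the export schema version)
    bzcat scholarpedia.xml.bz2 | xmlstarlet sel \
      -N mw="http://www.mediawiki.org/xml/export-0.10/" \
      -t -m '//mw:page' -v 'mw:title' -n -v 'mw:revision/mw:text' -n -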

I think our best bet is for someone to actually set up a MediaWiki instance, load the dump into it with MWDumper, and then export to HTML with mwoffliner. From what I can tell, this is the workflow that was used to create the HTML content for the ZIM files I used to dump Wikipedia.

The entire process is pretty convoluted though (Database, MediaWiki, Redis, Node...), so I'm currently not willing to tackle it.

If I were to do it, I'd probably try to set up everything in Docker containers with Docker Compose, so that the process is repeatable and applicable to other wiki dumps.

Edit: Okay, so I couldn't resist fiddling around with this, despite my earlier words. Took much less time than I estimated too, because I could draw on pre-made docker images. The hard part (MWDumper) is yet to come, but I'm confident I'll have this figured out soonish, maybe even this weekend.

DataWraith commented on July 19, 2024

sigh

This is much harder than it looked in the beginning. I realize I'm flip-flopping on this a lot -- should've kept my mouth shut from the beginning. Anyway. This post is as much for venting as for information's sake, so feel free to ignore it.

I wanted the process of creating HTML dumps from XML dumps to be repeatable, so I set up everything in Docker containers. It turns out that the pre-made Docker images I could find for the necessary software are mostly outdated, so after running into version incompatibilities I had to build my own from scratch.

I managed to set up a local MediaWiki instance with a MySQL database and import the Scholarpedia dump using MWDumper, in an automated and repeatable fashion. But getting MediaWiki to render mathematical equations took the better part of the weekend: TeX didn't work at all, no matter what I tried, so I had to switch to Mathoid, which meant getting yet another web service up and running. Even now it's not working to my satisfaction (it occasionally returns HTTP 400 -- Bad Request). It doesn't help that the documentation on any of this is extremely sparse.
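
For the record, the Mathoid check is just a plain HTTP POST; roughly this (the port and endpoint are how I have it configured, so treat the details as assumptions):

    # ask Mathoid to render one TeX expression;
    # the intermittent HTTP 400s mentioned above show up here
    curl -s --data-urlencode 'q=\frac{a}{b}' --data 'type=tex' \
      http://localhost:10044/complete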

The entire process looks like this:

  1. Start MySQL and create the wiki database skeleton
  2. Run MWDumper to fill the database with the Scholarpedia articles
  3. Start the MediaWiki container
  4. Start the Mathoid container (for equation rendering)
  5. Start the Parsoid container (for HTML extraction)
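
In shell terms, the setup is roughly the following (image names, ports, and the password are placeholders from my containers, not anything canonical):

    # 1. database
    docker network create wikinet
    docker run -d --name db --network wikinet \
      -e MYSQL_ROOT_PASSWORD=secret -e MYSQL_DATABASE=wiki mysql:5.5
    # 2. stream the XML dump into MySQL as SQL statements
    #    (the sql:N.N flag must match the target MediaWiki schema version)
    java -jar mwdumper.jar --format=sql:1.5 scholarpedia.xml.gz \
      | docker exec -i db mysql -uroot -psecret wiki
    # 3-5. the web services, all on the same network
    docker run -d --name wiki --network wikinet -p 8080:80 mediawiki
    docker run -d --name mathoid --network wikinet -p 10044:10044 mathoid
    docker run -d --name parsoid --network wikinet -p 8000:8000 parsoid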

Remaining work

  • Images need to be imported.

    There is a PHP script included with MediaWiki that should do that (importImages.php); a rough invocation is sketched after this list. But I'm not expecting it to be easy.

  • The Main_Page has custom CSS templates that MediaWiki isn't parsing out of the box, displaying them verbatim instead.

  • Actually creating static HTML files

    As I mentioned in the previous entry, mwoffliner should be able to use Parsoid to extract HTML via the MediaWiki API. However, it looks non-trivial to set up. It should be possible to create Docker containers for it, but that will take a while yet, so don't hold your breath. :/ A rough mwoffliner invocation is also sketched after this list.
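
For whoever picks this up, rough invocations for those two items (paths, URLs, and the email are placeholders; the flags are taken from the respective docs and untested here):

    # import an image directory into the wiki
    # (maintenance script shipped with MediaWiki)
    php maintenance/importImages.php --search-recursively /path/to/images

    # scrape the wiki through Parsoid into an offline copy
    mwoffliner --mwUrl=http://localhost:8080/ --adminEmail=someone@example.org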

rht commented on July 19, 2024

(sounds more doable, as in less of a headache, than LaTeX->HTML)
@DataWraith is the conversion using Parsoid lossless?

DataWraith commented on July 19, 2024

Parsoid is intended to be able to convert from MediaWiki markup to HTML and back in a lossless fashion (they do 'round trip testing'). I haven't noticed any mistakes with the conversion, but from what I gather from the limited documentation available, the conversion process isn't 100% perfect yet.

The fact that they need to be able to make round trips also bloats the generated HTML somewhat. The files use absolute links too, so the additional step of using mwoffliner is necessary to produce an IPFS-suitable folder of files. I'll try to get that working next weekend (so that I have something to show even if the equations don't work quite right yet), but given my over-optimism so far, I don't want to promise anything.
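
For completeness, pulling the HTML for a single page out of Parsoid looks roughly like this (the v3 API path, with the wiki domain as configured in Parsoid's settings; details assumed from the docs):

    # fetch Parsoid's (round-trippable) HTML rendering of one page
    curl http://localhost:8000/localhost/v3/page/html/Main_Page > Main_Page.html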

davidar commented on July 19, 2024

Hrm, it's unfortunate that MediaWiki is such a beast.

I've also converted it to a GitHub Wiki (example). It's somewhat passable, but definitely not perfect.
