Comments (11)
SGTM! We can do this once #20 is resolved.
from archives.
There is a newer version now available at https://archive.org/details/wiki-scholarpediaorg-20151102
The articles are here.
@DataWraith Feel like converting these to HTML? :)
It's quite a bit smaller than wikipedia, so should hopefully be less problematic.
Heh. Eventually I'd like to write a program that converts a MediaWiki dump to HTML (probably by running it through pandoc), but right now I'm fairly busy, sorry.
I could only do the Wikipedia dump because a third party provided a dump in the OpenZIM format, and an easy-to-use library was available for reading and converting that.
With a raw XML dump, I'd have to roll my own solution, which would take more time than I currently have.
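For what it's worth, a hand-rolled solution could lean on pandoc's built-in MediaWiki reader. A minimal sketch, assuming pandoc is on `PATH`; the function names are my own, only the `-f mediawiki -t html` and `--mathjax` flags are pandoc's:

```python
import shutil
import subprocess

def pandoc_command(src: str, dest: str) -> list:
    """Build the pandoc invocation for a single article.

    -f mediawiki / -t html select pandoc's MediaWiki reader and
    HTML writer; --mathjax keeps <math> markup renderable.
    """
    return ["pandoc", "-f", "mediawiki", "-t", "html",
            "--mathjax", "-o", dest, src]

def convert(src: str, dest: str) -> None:
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH")
    subprocess.run(pandoc_command(src, dest), check=True)
```

This only covers plain wikitext; as noted below, templates and images would still need real MediaWiki machinery.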
(@vitzli thanks for updating the archive in archive.org)
@DataWraith No worries. I might have a go at getting it to render with https://github.com/davidar/markup.rocks
@vitzli didn't realise you were the one who pushed the updated copy - thanks :)
I took another look at this, and wanted to share what I found, in case it is useful to the next person.
Extracting the article markup from the XML dump is pretty easy, actually. But just having the article markup doesn't really gain you much. Simple articles can be rendered through pandoc, but more complicated elements (images, math, templates) tend to break things.
I think our best bet is for someone to actually set up a MediaWiki instance, load the dump with MWDumper, and then export to HTML with mwoffliner. From what I can tell, this is the workflow that was used to create the HTML content for the ZIM files I used to dump Wikipedia.
The entire process is pretty convoluted though (Database, MediaWiki, Redis, Node...), so I'm currently not willing to tackle it.
If I were to do it, I'd probably try to set up everything in Docker containers with Docker Compose, so that it is repeatable and applicable to other wiki dumps.
Edit: Okay, so I couldn't resist fiddling around with this, despite my earlier words. It took much less time than I estimated too, because I could draw on pre-made Docker images. The hard part (MWDumper) is yet to come, but I'm confident I'll have this figured out soonish, maybe even this weekend.
sigh
This is much harder than it looked in the beginning. I realize I'm flip-flopping on this a lot -- should've kept my mouth shut from the beginning. Anyway. This post is as much for venting as for information's sake, so feel free to ignore it.
I wanted the process of creating HTML dumps from XML dumps to be repeatable, so I set up everything in Docker containers. It turns out the pre-made Docker images I could find for the necessary software are mostly outdated, so after running into version incompatibilities I had to build my own from scratch.
I managed to set up a local MediaWiki instance with a MySQL database and import the Scholarpedia dump using MWDumper in an automated and repeatable fashion. But getting MediaWiki to render mathematical equations took the better part of the weekend: TeX didn't work at all, no matter what I tried, so I had to switch to Mathoid, which meant getting yet another web service up and running. It's still not working to my satisfaction (occasionally returns HTTP 400 Bad Request). It doesn't help that the documentation on any of this is extremely sparse.
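For reference, this is roughly what a request to Mathoid looks like. A sketch only: the `q`/`type` form fields are from Mathoid's README as I understand it, and the host/port are assumptions for a local container (10044 being Mathoid's usual default):

```python
import json
import urllib.parse
import urllib.request

MATHOID_URL = "http://localhost:10044/"  # assumed local Mathoid container

def mathoid_payload(tex: str, inline: bool = False) -> dict:
    # Mathoid takes the TeX source in 'q' and a 'type' of
    # 'tex' (display math) or 'inline-tex'.
    return {"q": tex, "type": "inline-tex" if inline else "tex"}

def render(tex: str) -> dict:
    """POST one equation to Mathoid; raises on HTTP 400 (bad TeX)."""
    data = urllib.parse.urlencode(mathoid_payload(tex)).encode()
    with urllib.request.urlopen(MATHOID_URL, data=data) as resp:
        return json.load(resp)
```

The HTTP 400s I'm seeing come back exactly from this kind of call when Mathoid rejects the TeX.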
The entire process looks like this:
- Start MySQL and create the wiki database skeleton
- Run MWDumper to fill the database with the Scholarpedia articles
- Start the MediaWiki container
- Start the Mathoid container (for equation rendering)
- Start the Parsoid container (for HTML extraction)
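The five steps above map fairly directly onto a Compose file. A sketch only; the image names, build contexts, and credentials are placeholders, not a tested configuration:

```yaml
version: "2"
services:
  db:
    image: mysql:5.6            # placeholder version
    environment:
      MYSQL_DATABASE: wiki
      MYSQL_ROOT_PASSWORD: change-me
  mediawiki:
    build: ./mediawiki          # custom image; pre-made ones were outdated
    depends_on: [db]
  mathoid:
    build: ./mathoid            # equation rendering service
  parsoid:
    build: ./parsoid            # wikitext -> HTML
    depends_on: [mediawiki]
```

The MWDumper import step would run as a one-off container against `db` before `mediawiki` starts.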
Remaining work:
- Images need to be imported. There is a PHP script included with MediaWiki that should do that, but I'm not expecting it to be easy.
- The Main_Page has custom CSS templates that MediaWiki isn't parsing out of the box, displaying them verbatim instead.
- Actually creating static HTML files.
As I mentioned in the previous entry, mwoffliner should be able to use Parsoid to extract HTML via the MediaWiki API. However, it looks non-trivial to set up. It should be possible to create Docker containers for it, but that will take a while yet, so don't hold your breath. :/
(sounds more doable, as in, less headache than latex->html)
@DataWraith is the conversion using parsoid lossless?
Parsoid is intended to be able to convert from MediaWiki markup to HTML and back in a lossless fashion (they do 'round trip testing'). I haven't noticed any mistakes with the conversion, but from what I gather from the limited documentation available, the conversion process isn't 100% perfect yet.
The fact that they need to be able to make round trips also bloats the generated HTML somewhat. The files use absolute links too, so the additional step of using mwoffliner is necessary to produce an IPFS-suitable folder of files. I'll try to get that working next weekend (so that I have something to show even if the equations don't work quite right yet), but given my over-optimism so far, I don't want to promise anything.
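The per-page fetch itself is simple. Parsoid's v3 REST API serves rendered HTML at `/{domain}/v3/page/html/{title}`; a small URL-building helper (the route is Parsoid's, but the host and domain values shown are assumptions for a local setup):

```python
from urllib.parse import quote

def parsoid_html_url(base: str, domain: str, title: str) -> str:
    """Build the Parsoid v3 endpoint that returns rendered HTML
    for one page: GET {base}/{domain}/v3/page/html/{title}."""
    return "%s/%s/v3/page/html/%s" % (base, domain, quote(title, safe=""))
```

mwoffliner's job is essentially to walk every title through this endpoint and rewrite the absolute links afterwards.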
Hrm, it's unfortunate that MediaWiki is such a beast.
I've also converted it to a GitHub Wiki (example). It's somewhat passable, but definitely not perfect.