kimbauters / zimply
An easy to use offline reader for ZIM files right in your browser!
License: Other
My PC specs:
Intel i5-2320
SSD 120 GB (Running OS)
HDD 1TB (wikipedia dump is here)
RAM 4GB
Swap 2 GB
Ubuntu 20.04.1 LTS
I'm running the full English Wikipedia with pictures. After starting the server I can browse several articles, but then the python3 process takes more than 4 GB of RAM and the computer freezes indefinitely, so I have to unplug it. Is this how it should behave, or am I doing something wrong?
I have both .py file and dump in the same folder. Size of dump file is 92 GB.
The .py file contains the following:
from zimply import ZIMServer
ZIMServer("$PATH/wikipedia_en_all_maxi_2020-06.zim")
When I try to run ZIMServer("wiki.zim") I get the error invalid argument on file.seek(offset) in unpack_from_file.
I am using zim_core from the version2 branch (as used in https://pypi.org/project/zimply-core/) to read a ZIM file from Kiwix that contains wikiHow content. Here is an example:
I am attempting to read it with the following code:
from zimply_core import zim_core
zim = zim_core.ZIMClient("wikihow_en_endless_cars-and-other-vehicles_2021-12.zim")
count = zim.get_namespace_count("C")
print(f"Articles count: {count}")
On a Linux system with an ext4 filesystem, it fails with the following traceback:
Traceback (most recent call last):
File "test.py", line 2, in <module>
zim = zim_core.ZIMClient("wikihow_en_endless_cars-and-other-vehicles_2021-12.zim")
File "/home/dylan/.local/share/virtualenvs/dylan-iKdQhdzw/lib/python3.7/site-packages/zimply_core/zim_core.py", line 1152, in __init__
self.language + " (ISO639-1), articles: " + str(len(self._zim_file)))
File "/home/dylan/.local/share/virtualenvs/dylan-iKdQhdzw/lib/python3.7/site-packages/zimply_core/zim_core.py", line 734, in __len__
result = self.get_namespace_range("A" if self.version <= (6, 0) else "C")
File "/home/dylan/.local/share/virtualenvs/dylan-iKdQhdzw/lib/python3.7/site-packages/zimply_core/zim_core.py", line 764, in get_namespace_range
before = self.read_directory_entry_by_index(start_mid - 1)
File "/home/dylan/.local/share/virtualenvs/dylan-iKdQhdzw/lib/python3.7/site-packages/zimply_core/zim_core.py", line 590, in read_directory_entry_by_index
directory_values = self._read_directory_entry(offset)
File "/home/dylan/.local/share/virtualenvs/dylan-iKdQhdzw/lib/python3.7/site-packages/zimply_core/zim_core.py", line 565, in _read_directory_entry
self.file.seek(offset) # move to the desired offset
OSError: [Errno 22] Invalid argument
I poked at this for a while, adding some messages to keep track of what it's doing.
get_namespace_range M
get_namespace_range: start_low 0
get_namespace_range: start_mid 795
get_namespace_range: start_high 1590
read_directory_entry_by_index 795
_read_directory_entry 8655750
read_directory_entry_by_index 794
_read_directory_entry 8655676
[…]
get_namespace_range: start_low 0
get_namespace_range: start_mid 2
get_namespace_range: start_high 4
read_directory_entry_by_index 2
_read_directory_entry 8605156
read_directory_entry_by_index 1
_read_directory_entry 8605116
get_namespace_range: start_low 0
get_namespace_range: start_mid 0
get_namespace_range: start_high 1
read_directory_entry_by_index 0
_read_directory_entry 8605062
read_directory_entry_by_index -1
_read_directory_entry 121364659855736
[Crash]
From the looks of it, Zimply is getting a file offset for a negative index, which isn't expected:
Lines 530 to 539 in e1077d0
In this instance, it ends up reading something from self.header_fields["Q"] + int(8 * -1), which as it turns out is a very big number.
Note that whether file.seek(121364659855736) results in an exception depends on the underlying filesystem, which makes this a little tricky to reproduce :)
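One way to avoid touching index -1 is a binary search that only ever inspects indices inside [0, entry_count). The following is a minimal sketch and not Zimply's actual code: namespace_range, read_namespace and entry_count are hypothetical names, and read_namespace(i) is assumed to return the namespace character of the i-th directory entry, with entries sorted by namespace.

```python
def namespace_range(read_namespace, entry_count, namespace):
    """Return (start, end) so that entries start..end-1 are in `namespace`.

    Hypothetical helper: read_namespace(i) is assumed to return the
    namespace of entry i, with entries sorted by namespace.
    """
    def first_index_where(predicate):
        # Lower-bound search; mid always stays within [0, entry_count).
        low, high = 0, entry_count
        while low < high:
            mid = (low + high) // 2
            if predicate(read_namespace(mid)):
                high = mid
            else:
                low = mid + 1
        return low

    start = first_index_where(lambda ns: ns >= namespace)
    end = first_index_where(lambda ns: ns > namespace)
    return start, end


# Made-up namespace layout, for illustration only.
namespaces = ["C"] * 5 + ["M"] * 3 + ["W"] * 2 + ["X"]
print(namespace_range(lambda i: namespaces[i], len(namespaces), "M"))
```

Because the search never computes mid - 1, an empty or first-position namespace does not lead to a negative index being turned into a file offset.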
With the 2017-01 dump of the English Wikipedia (nopic), the main page cannot be found: 0 bytes are read at the index returned from header_fields. Searching does work, though. I do not know enough about the ZIM format to work this out.
ZIMServer("wikipedia_de_all_nopic_2018-11.zim")
Traceback (most recent call last):
File "", line 1, in
File "C:\Python\anaconda35\lib\site-packages\zimply\zimply.py", line 845, in __init__
ZIMRequestHandler.reverse_index = self._bootstrap(index_file)
File "C:\Python\anaconda35\lib\site-packages\zimply\zimply.py", line 864, in _bootstrap
"this can take quite some time! \u2013 " + time.strftime('%X %x'))
File "C:\Python\anaconda35\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position 68: character maps to <undefined>
ZIMServer(r"wikipedia_de_all_nopic_2018-11.zim")
Traceback (most recent call last):
File "", line 1, in
File "C:\Python\anaconda35\lib\site-packages\zimply\zimply.py", line 845, in __init__
ZIMRequestHandler.reverse_index = self._bootstrap(index_file)
File "C:\Python\anaconda35\lib\site-packages\zimply\zimply.py", line 864, in _bootstrap
"this can take quite some time! \u2013 " + time.strftime('%X %x'))
File "C:\Python\anaconda35\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position 68: character maps to <undefined>
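The traceback shows the en dash (\u2013) in the status message failing to encode for a cp437 console. A possible workaround, sketched here under the assumption that replacing unsupported characters with "?" is acceptable, is to sanitize the text before printing. The helper name console_safe is hypothetical and not part of zimply.

```python
import sys


def console_safe(text, encoding=None):
    """Replace characters the console encoding cannot represent with '?'."""
    # sys.stdout.encoding can be None when output is redirected.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(encoding, errors="replace").decode(encoding)


message = "this can take quite some time! \u2013 12:00:00"
print(console_safe(message, "cp437"))
```

Setting the PYTHONIOENCODING environment variable before starting Python is another option that avoids changing any code.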
'''
F:\Wiki>python.exe zim-wp.py
Traceback (most recent call last):
File "zim-wp.py", line 1, in <module>
from zimply import ZIMServer
File "F:\Wiki\lib\site-packages\zimply\__init__.py", line 1, in <module>
from .zimply import ZIMServer
File "F:\Wiki\lib\site-packages\zimply\zimply.py", line 793
entries = [*entries, *redirects]
^
SyntaxError: can use starred expression only as assignment target
'''
I run Windows XP with Python 3.4 and zimply 1.1.3. Would you backport it? I want multi-word search, and it would be a pity if educational software dropped XP, which is still used in underdeveloped countries (for whatever reason).
last working version: 1.0.x
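The SyntaxError comes from unpacking inside a list display, which Python only accepts from version 3.5 onward. A version-independent spelling of the same merge could look like the sketch below; the sample values are made up for illustration, and entries and redirects are assumed to be plain iterables.

```python
# Made-up sample data standing in for the real directory entries.
entries = ["Pancake", "Shrove_Tuesday"]
redirects = ("Pancake_Day",)

# Python 3.5+ only: [*entries, *redirects]
# Equivalent that also runs on Python 3.4:
merged = list(entries) + list(redirects)
print(merged)
```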
from zimply import ZIMServer
C:\Python\anaconda35\lib\site-packages\zimply\zimply.py:58: MonkeyPatchWarning: Monkey-patching ssl after ssl has already been imported may lead to errors, including RecursionError on Python 3.6. Please monkey-patch earlier. See gevent/gevent#1016
monkey.patch_all()
Is it possible for Zimply to run multiple ZIMS at once on the same page, or even different ports?
Hi, I recently reinstalled zimply on another computer with pip install zimply, and found that the code is still the old version, in which the IP address can't be configured.
OS: Windows and Arch Linux.
By the way, the previous installation's site-packages/zimply/zimply.py had been manually edited, so I didn't notice this before.
I found the following error while running my server python file:
sqlite3.OperationalError: no such module: fts4
As a result of the error, the searching doesn't work anymore, as mentioned in your README note.
I also found that recent versions of sqlite3 ship with fts5, and that fts4 support requires recompiling and other complex steps.
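Whether a given full-text module is compiled into the local SQLite build can be probed at runtime. The sketch below tries to create a virtual table in an in-memory database; fts_module_available is a hypothetical helper name, not something zimply provides.

```python
import sqlite3


def fts_module_available(module):
    """Return True if the SQLite build behind sqlite3 provides `module`."""
    if not module.isidentifier():
        raise ValueError("unexpected module name: {!r}".format(module))
    connection = sqlite3.connect(":memory:")
    try:
        connection.execute(
            "CREATE VIRTUAL TABLE probe USING {}(content)".format(module))
        return True
    except sqlite3.OperationalError:
        # Raised as "no such module: ..." when the module is missing.
        return False
    finally:
        connection.close()


for name in ("fts4", "fts5"):
    print(name, fts_module_available(name))
```

A check like this could be used to pick whichever module is present, or to disable search with a clear message instead of failing later.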
With new-style zim files, both articles and their assets appear in the same "C" namespace:
https://www.openzim.org/wiki/ZIM_file_format#Namespaces
ZIMClient.random_article chooses a random index from the "C" namespace, assuming that entries in that namespace are all articles. This means that it will often return images, for example, instead of articles.
The issue is particularly prominent with this zim file, for example, which contains a very large number of images: https://download.kiwix.org/zim/.hidden/endless/wikihow_en_endless_holidays-and-traditions_2021-12.zim.
For reference, here is how this is implemented in libzim: https://github.com/openzim/libzim/blob/master/src/archive.cpp#L267-L284. It looks like we would need to make use of the title index.
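As a simpler stopgap than the title index that libzim uses, one could keep drawing random indices until an HTML entry turns up. This is rejection sampling, not the libzim approach, and everything named here is an assumption: read_entry(i) is taken to return a dict with a numeric "mimetype" field as in the directory entries, and mimetype_list is taken to be the archive's list of MIME type strings.

```python
import random


def random_article_index(read_entry, start, end, mimetype_list,
                         rng=random, max_tries=1000):
    """Pick a random index in [start, end) whose entry looks like an article."""
    for _ in range(max_tries):
        index = rng.randrange(start, end)
        mimetype_id = read_entry(index)["mimetype"]
        if mimetype_id >= len(mimetype_list):
            continue  # e.g. redirect entries, which carry a special value
        if mimetype_list[mimetype_id].startswith("text/html"):
            return index
    return None  # give up rather than loop forever on image-only ranges


# Made-up archive: nine images and a single HTML page.
mimetypes = ["text/html", "image/png"]
fake_entries = [{"mimetype": 1}] * 9 + [{"mimetype": 0}]
print(random_article_index(fake_entries.__getitem__, 0, len(fake_entries),
                           mimetypes, rng=random.Random(0)))
```

On archives that are mostly images this needs many reads per call, which is one reason the title index is the better long-term fix.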
Hi,
I installed ZIMply with pip, using the default template.
The default template displays incorrectly. The navigation bar is displayed on top, but it also covers the text, which appears faded (maybe 5% opacity). I tried the latest Firefox and Chromium versions.
Changing the z-index or the min-height value of the nav element seems to fix the issue but I'm not sure it's the correct way to fix it.
Line 902 in a260f16 is not guarded by if __name__ == "__main__":, therefore it's launched on import and looks for ../wiki.zim to be served on port 8080, no matter if we specify another zim file or another port.
This line needs to be commented out.
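An alternative to commenting the line out is to guard it so that it only runs when the file is executed directly and not when it is imported. The sketch below uses a stand-in function, start_server, which is hypothetical and merely takes the place of the real server call.

```python
def start_server(path, port):
    """Stand-in for the real server start-up; returns a description only."""
    return "serving {} on port {}".format(path, port)


if __name__ == "__main__":
    # Runs for `python thisfile.py`, but not for `import thisfile`.
    print(start_server("../wiki.zim", 8080))
```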
I encountered the following with a Wikipedia zim file and the version2 branch of Zimply:
>>> zim = zimply_core.ZIMFile("wikipedia_simple_endless-kidzsearch_maxi_2022-03.zim", 'utf-8')
# The "Pancake" article is at 187257
>>> zim.read_directory_entry_by_index(187257)
{'mimetype': 8,
'parameterLen': 0,
'namespace': 'A',
'revision': 0,
'clusterNumber': 1403,
'blobNumber': 105,
'url': 'Pancake',
'title': '',
'index': 187257}
# We get its offset in the file...
>>> pancake_offset = zim._read_url_offset(187257)
# Start reading metadata from the file
>>> super(zimply_core.DirectoryBlock, zim.articleEntryBlock)._unpack_from_file(zim.file, pancake_offset)
{'mimetype': 8,
'parameterLen': 0,
'namespace': b'A',
'revision': 0,
'clusterNumber': 1403,
'blobNumber': 105}
# The next item is the URL:
>>> read_zero_terminated(zim.file, "utf-8")
'Pancake'
# And then the title:
>>> read_zero_terminated(zim.file, "utf-8")
''
# The "Pancake Day" article redirects to 224476
>>> zim.read_directory_entry_by_index(224476)
{'mimetype': 8,
'parameterLen': 0,
'namespace': 'A',
'revision': 0,
'clusterNumber': 896,
'blobNumber': 50,
'url': 'Shrove_Tuesday',
'title': 'Shrove Tuesday',
'index': 224476}
# Get the offset for the article
>>> pancake_day_offset = zim._read_url_offset(224476)
# Start reading metadata from the file
>>> super(zimply_core.DirectoryBlock, zim.articleEntryBlock)._unpack_from_file(zim.file, pancake_day_offset)
{'mimetype': 8,
'parameterLen': 0,
'namespace': b'A',
'revision': 0,
'clusterNumber': 896,
'blobNumber': 50}
# The next item is the URL:
>>> read_zero_terminated(zim.file, "utf-8")
'Shrove_Tuesday'
# And then the title:
>>> read_zero_terminated(zim.file, "utf-8")
'Shrove Tuesday'
I learned that there is an optimization where, if an article's URL is the same as its title, the title field is left blank, and Zimply does the right thing here inside ZIMFile.get_article_by_index. However, that is not the case when we are building the search index. The code in CreateFTSProcess which goes through articles uses ZIMFileIterator, which calls unpack_from_file directly. That code leaves the (sometimes blank) title field as-is.
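The fallback described above amounts to using the URL whenever the title is blank. A minimal sketch of that rule, applied to the two entries from the session (effective_title is a hypothetical helper name, and where it would be called during indexing is left open):

```python
def effective_title(entry):
    """Return the entry's title, falling back to its URL when blank."""
    return entry["title"] or entry["url"]


pancake = {"url": "Pancake", "title": ""}
shrove = {"url": "Shrove_Tuesday", "title": "Shrove Tuesday"}
print(effective_title(pancake))
print(effective_title(shrove))
```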
I get this error when I try to start the script:
wiki.py
from zimply import ZIMServer
ZIMServer("wiki.zim")
dennis@KODI:/media/dennis/931GB-btrfs/kiwix$ python wiki.py
Traceback (most recent call last):
File "wiki.py", line 1, in <module>
from zimply import ZIMServer
ImportError: No module named zimply
I installed zimply, and when importing the library in a Jupyter notebook it throws this error in the console (the import doesn't complete):
[IPKernelApp] ERROR | Exception in message handler:
......
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/threading.py", line 293, in wait
waiter.acquire()
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/gevent/thread.py", line 84, in acquire
return BoundedSemaphore.acquire(self, blocking, timeout)
File "src/gevent/_semaphore.py", line 198, in gevent._semaphore.Semaphore.acquire (src/gevent/gevent._semaphore.c:4541)
File "src/gevent/_semaphore.py", line 226, in gevent._semaphore.Semaphore.acquire (src/gevent/gevent._semaphore.c:4367)
File "src/gevent/_semaphore.py", line 166, in gevent._semaphore.Semaphore._do_wait (src/gevent/gevent._semaphore.c:3562)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/gevent/hub.py", line 630, in switch
return RawGreenlet.switch(self)
gevent.hub.LoopExit: ('This operation would block forever', <Hub at 0x104a81b90 select default pending=0 ref=0>)
ERROR:tornado.general:Uncaught exception, closing connection.
How do I resolve this error? Any input is appreciated.
I had this error thrown for seemingly random URLs when trying different ZIM files. To reproduce with one sample file:
/w/index.php page.
2022-09-26 01:08:07 [FALCON] [ERROR] GET /w/index.php => Traceback (most recent call last):
File "falcon\app.py", line 365, in falcon.app.App.__call__
File "D:\dev\zimply-demo\venv\lib\site-packages\zimply\zimply.py", line 717, in on_get
article = ZIMRequestHandler.zim.get_article_by_url(namespace, url)
File "D:\dev\zimply-demo\venv\lib\site-packages\zimply\zimply.py", line 537, in get_article_by_url
entry, idx = self._get_entry_by_url(namespace, url) # get the entry
File "D:\dev\zimply-demo\venv\lib\site-packages\zimply\zimply.py", line 516, in _get_entry_by_url
entry = self.read_directory_entry_by_index(middle)
File "D:\dev\zimply-demo\venv\lib\site-packages\zimply\zimply.py", line 454, in read_directory_entry_by_index
directory_values = self._read_directory_entry(offset)
File "D:\dev\zimply-demo\venv\lib\site-packages\zimply\zimply.py", line 435, in _read_directory_entry
fields = unpack("<H", self.file.read(2))
struct.error: unpack requires a buffer of 2 bytes
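The struct.error means file.read(2) returned fewer than two bytes, which typically happens when the read position is at or past the end of the file. The sketch below shows a length check before unpacking; read_u16 is a hypothetical helper, and it only turns the crash into a detectable condition without addressing why the lookup reached such an offset.

```python
import io
import struct


def read_u16(stream):
    """Read a little-endian unsigned 16-bit value, or None on a short read."""
    data = stream.read(2)
    if len(data) != 2:
        return None  # end of file or truncated entry
    return struct.unpack("<H", data)[0]


print(read_u16(io.BytesIO(b"\x08\x00")))
print(read_u16(io.BytesIO(b"")))
```

A caller could map the None result to a "not found" response instead of an internal server error.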
A requirement for version2 is zstandard. It's in setup.py but not in the readme.
Also, from my experience with version 2, you should look at moving it onto the master branch; it appears quite stable. Great work.
A feature request to load multiple ZIM files. People interested in hosting a single ZIM archive are likely to want to host multiple ZIM archives.
This might look like:
ZIMServerCollection([{filename, index_file, template}, {..}, ..], ip_address, port, encoding)
Each ZIM archive would be served from a different directory (localhost:port/wiki1/.., localhost:port/wiki2/..). Each would be a separate client. You would then have a simple wiki page on the index to choose each ZIM archive.
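The directory-per-archive idea boils down to peeling the first path segment off each request and using it to select an archive. The sketch below illustrates only that routing step; split_archive_path and the prefix-to-file mapping are hypothetical and not part of zimply.

```python
def split_archive_path(path, archives):
    """Split '/wiki1/A/Page' into (archive for 'wiki1', 'A/Page').

    Returns (None, remaining_path) when the first segment is not a
    known archive prefix, e.g. for the index page.
    """
    stripped = path.lstrip("/")
    prefix, _, rest = stripped.partition("/")
    if prefix in archives:
        return archives[prefix], rest
    return None, stripped


# Made-up mapping of URL prefixes to ZIM files.
archives = {"wiki1": "wikipedia.zim", "wiki2": "wikihow.zim"}
print(split_archive_path("/wiki2/A/Pancake", archives))
print(split_archive_path("/", archives))
```

A request that matches no prefix would be the natural place to render the index page listing the available archives.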
To deal with the new standard compression, this module needs some slight changes. Since I cannot contribute directly, feel free to apply the patch below:
--- a/zimply/zimply.py
+++ b/zimply/zimply.py
@@ -56,6 +56,7 @@ from struct import Struct, pack, unpack
# non-standard required packages are gevent and falcon (for its web server),
# as well as and make (for templating)
from mako.template import Template
+import zstandard
import falcon
@@ -243,11 +244,11 @@ class ClusterData(object):
self.offset = offset # store the offset
cluster_info = ClusterBlock(encoding).unpack_from_file(
self.file, self.offset) # Get the cluster fields.
- # Verify whether the cluster has LZMA2 compression
- self.compressed = cluster_info['compressionType'] == 4
+ # Verify whether the cluster has compression
+ self.compression = {4: "lzma", 5: "zstd"}.get(cluster_info['compressionType'], False)
# at the moment, we don't have any uncompressed data
self.uncompressed = None
- self._decompress() # decompress the contents as needed
+ self._decompress() # decompress the contents as needed
# Prepare storage to keep track of the offsets
# of the blobs in the cluster.
self._offsets = []
@@ -255,24 +256,39 @@ class ClusterData(object):
self._read_offsets()
def _decompress(self, chunk_size=32768):
- if self.compressed:
+ if self.compression == "lzma":
# create a bytes stream to store the uncompressed cluster data
self.buffer = io.BytesIO()
decompressor = lzma.LZMADecompressor() # prepare the decompressor
# move the file pointer to the start of the blobs as long as we
# don't reach the end of the stream.
self.file.seek(self.offset + 1)
-
while not decompressor.eof:
chunk = self.file.read(chunk_size) # read in a chunk
data = decompressor.decompress(chunk) # decompress the chunk
self.buffer.write(data) # and store it in the buffer area
+
+ elif self.compression == "zstd":
+ # create a bytes stream to store the uncompressed cluster data
+ self.buffer = io.BytesIO()
+ decompressor = zstandard.ZstdDecompressor().decompressobj() # prepare the decompressor
+ # move the file pointer to the start of the blobs as long as we
+ # don't reach the end of the stream.
+ self.file.seek(self.offset+1)
+ while True:
+ chunk = self.file.read(chunk_size) # read in a chunk
+ try:
+ data = decompressor.decompress(chunk) # decompress the chunk
+ self.buffer.write(data) # and store it in the buffer area
+ except zstandard.ZstdError as e:
+ break
+
def _source_buffer(self):
# get the file buffer or the decompressed buffer
- buffer = self.buffer if self.compressed else self.file
+ buffer = self.buffer if self.compression else self.file
# move the buffer to the starting position
- buffer.seek(0 if self.compressed else self.offset + 1)
+ buffer.seek(0 if self.compression else self.offset + 1)
return buffer
def _read_offsets(self):
This adds zstandard as a new dependency.
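The zstd branch in the patch leaves its read loop only when decompress raises ZstdError. A loop that stops on the decompressor's eof flag, or when the file runs out of data, can be shared by both formats. The sketch below demonstrates the loop with the standard-library lzma module so that it runs without extra packages; decompress_stream is a hypothetical helper, and whether the zstandard decompression object exposes a compatible eof attribute depends on the installed version, which is an assumption to verify.

```python
import io
import lzma


def decompress_stream(stream, decompressor, chunk_size=32768):
    """Feed chunks from `stream` to `decompressor` until it reports eof."""
    output = io.BytesIO()
    while not decompressor.eof:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # input ended early; avoid looping forever
        output.write(decompressor.decompress(chunk))
    return output.getvalue()


# Made-up cluster payload followed by unrelated trailing bytes.
payload = b"cluster data " * 1000
source = io.BytesIO(lzma.compress(payload) + b"next cluster")
result = decompress_stream(source, lzma.LZMADecompressor(), chunk_size=64)
print(len(result))
```

Stopping on eof also means bytes belonging to the following cluster are not mistaken for part of the current one.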
Hello!
Can you please release an update for ZIMply on PyPI that includes pull request #7? Currently pip install zimply gives an unfixed version, which is not usable with MWoffliner-made ZIM files in most browsers (except for text-based ones, such as Links 2).
Just did an upgrade on a Raspberry Pi and ran
pip3 install zimply
and got .... Successfully installed zimply-1.1.4
But creating server.py:
from zimply import ZIMServer
ZIMServer("fas-military-medicine_2022-03.zim")
and running it with
python3 server.py
yields:
Traceback (most recent call last):
File "server.py", line 1, in <module>
from zimply import ZIMServer
ImportError: bad magic number in 'zimply': b'\x03\xf3\r\n'
so maybe it didn't install correctly after all?