kimbauters / zimply

An easy to use offline reader for ZIM files right in your browser!

License: Other

HTML 9.39% Python 90.61%

zimply's Issues

Takes too much RAM

My PC specs:
Intel i5-2320
SSD 120 GB (Running OS)
HDD 1TB (wikipedia dump is here)
RAM 4GB
Swap 2 GB
Ubuntu 20.04.1 LTS

I'm running the full English Wikipedia with pictures. After starting the server I can browse several articles, but then the python3 process takes more than 4 GB of RAM and the computer freezes indefinitely, so I have to unplug it. Is this how it should behave, or am I doing something wrong?

I have both .py file and dump in the same folder. Size of dump file is 92 GB.
The .py file contains the following:

from zimply import ZIMServer
ZIMServer("$PATH/wikipedia_en_all_maxi_2020-06.zim")

Error Invalid Argument

When I try to run ZIMServer("wiki.zim"), I get an "Invalid argument" error on file.seek(offset) in unpack_from_file.

Zimply tries to seek to an invalid offset with some zim files

I am using the zim_core from the version2 branch (as used in https://pypi.org/project/zimply-core/) to read a Zim file from Kiwix that contains Wikihow content. Here is an example:

https://download.kiwix.org/zim/.hidden/endless/wikihow_en_endless_cars-and-other-vehicles_2021-12.zim

I am attempting to read it with the following code:

from zimply_core import zim_core
zim = zim_core.ZIMClient("wikihow_en_endless_cars-and-other-vehicles_2021-12.zim")
count = zim.get_namespace_count("C")
print(f"Articles count: {count}")

On a Linux system with an ext4 filesystem, it fails with the following traceback:

Traceback (most recent call last):
  File "test.py", line 2, in <module>
    zim = zim_core.ZIMClient("wikihow_en_endless_cars-and-other-vehicles_2021-12.zim")
  File "/home/dylan/.local/share/virtualenvs/dylan-iKdQhdzw/lib/python3.7/site-packages/zimply_core/zim_core.py", line 1152, in __init__
    self.language + " (ISO639-1), articles: " + str(len(self._zim_file)))
  File "/home/dylan/.local/share/virtualenvs/dylan-iKdQhdzw/lib/python3.7/site-packages/zimply_core/zim_core.py", line 734, in __len__
    result = self.get_namespace_range("A" if self.version <= (6, 0) else "C")
  File "/home/dylan/.local/share/virtualenvs/dylan-iKdQhdzw/lib/python3.7/site-packages/zimply_core/zim_core.py", line 764, in get_namespace_range
    before = self.read_directory_entry_by_index(start_mid - 1)
  File "/home/dylan/.local/share/virtualenvs/dylan-iKdQhdzw/lib/python3.7/site-packages/zimply_core/zim_core.py", line 590, in read_directory_entry_by_index
    directory_values = self._read_directory_entry(offset)
  File "/home/dylan/.local/share/virtualenvs/dylan-iKdQhdzw/lib/python3.7/site-packages/zimply_core/zim_core.py", line 565, in _read_directory_entry
    self.file.seek(offset)  # move to the desired offset
OSError: [Errno 22] Invalid argument

I poked at this for a while, adding some messages to keep track of what it's doing.

get_namespace_range M
get_namespace_range: start_low 0
get_namespace_range: start_mid 795
get_namespace_range: start_high 1590
read_directory_entry_by_index 795
_read_directory_entry 8655750
read_directory_entry_by_index 794
_read_directory_entry 8655676
[…]
get_namespace_range: start_low 0
get_namespace_range: start_mid 2
get_namespace_range: start_high 4
read_directory_entry_by_index 2
_read_directory_entry 8605156
read_directory_entry_by_index 1
_read_directory_entry 8605116
get_namespace_range: start_low 0
get_namespace_range: start_mid 0
get_namespace_range: start_high 1
read_directory_entry_by_index 0
_read_directory_entry 8605062
read_directory_entry_by_index -1
_read_directory_entry 121364659855736
[Crash]

From the looks of it, Zimply is getting a file offset for a negative index, which isn't expected:

ZIMply/zimply/zim_core.py

Lines 530 to 539 in e1077d0

    def _read_offset(self, index, field_name, field_format, length):
        # move to the desired position in the file
        if index != 0xffffffff:
            self.file.seek(self.header_fields[field_name] + int(length * index))
            # and read and return the particular format
            read = self.file.read(length)
            # return unpack("<" + field_format, self.file.read(length))[0]
            return unpack("<" + field_format, read)[0]
        return None

In this instance, it seeks to self.header_fields[field_name] + int(8 * -1), which is 8 bytes before the start of the table, and the value it reads there turns out to be a very big number.

Note that whether file.seek(121364659855736) results in an exception depends on the underlying filesystem, which makes this a little tricky to reproduce :)
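
To make the failure mode above concrete, here is a minimal sketch on an in-memory toy file rather than a real ZIM file. The read_offset helper, its count parameter and the toy layout are all assumptions made for illustration; they are not part of Zimply. The stray value is simply the number from the log above, reused so the effect is recognisable.

```python
import io
from struct import pack, unpack


def read_offset(file, table_start, index, count, field_format="Q", length=8):
    """Read entry `index` from a table of fixed-width little-endian values."""
    # Reject indexes outside the table instead of seeking to arbitrary bytes.
    if not 0 <= index < count:
        raise IndexError("index %d is outside 0..%d" % (index, count - 1))
    file.seek(table_start + length * index)
    return unpack("<" + field_format, file.read(length))[0]


# Toy file: 8 unrelated bytes, followed by a table holding three offsets.
unrelated = pack("<Q", 121364659855736)  # illustrative value copied from the log
table = pack("<3Q", 100, 200, 300)
toy_file = io.BytesIO(unrelated + table)
table_start = len(unrelated)

# Without a bounds check, index -1 lands on the 8 bytes before the table
# and those bytes are interpreted as if they were a file offset.
toy_file.seek(table_start + 8 * -1)
stray_value = unpack("<Q", toy_file.read(8))[0]

# With the bounds check, valid indexes still work as before.
first_offset = read_offset(toy_file, table_start, 0, count=3)
```

A guard like this could equally live in the binary search that produces start_mid - 1; where it belongs in Zimply itself is a design decision this sketch does not settle.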

Main page cannot be found in the english nopic wikipedia dump from 2017-01

With the 2017-01 dump of the english wikipedia (nopic), the main page cannot be found and 0 bytes are read by the index returned from the header_fields. Searching does work though. I do not know enough about the ZIM format to try and work this out.

Unicode Error in Python 3.6

ZIMServer("wikipedia_de_all_nopic_2018-11.zim")
Traceback (most recent call last):
  File "", line 1, in <module>
  File "C:\Python\anaconda35\lib\site-packages\zimply\zimply.py", line 845, in __init__
    ZIMRequestHandler.reverse_index = self._bootstrap(index_file)
  File "C:\Python\anaconda35\lib\site-packages\zimply\zimply.py", line 864, in _bootstrap
    "this can take quite some time! \u2013 " + time.strftime('%X %x'))
  File "C:\Python\anaconda35\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position 68: character maps to <undefined>
ZIMServer(r"wikipedia_de_all_nopic_2018-11.zim")
Traceback (most recent call last):
  File "", line 1, in <module>
  File "C:\Python\anaconda35\lib\site-packages\zimply\zimply.py", line 845, in __init__
    ZIMRequestHandler.reverse_index = self._bootstrap(index_file)
  File "C:\Python\anaconda35\lib\site-packages\zimply\zimply.py", line 864, in _bootstrap
    "this can take quite some time! \u2013 " + time.strftime('%X %x'))
  File "C:\Python\anaconda35\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position 68: character maps to <undefined>
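
The traceback shows the cp437 console codec rejecting the en dash (\u2013) in the status message. One possible workaround, sketched below, is to replace unencodable characters before printing. The console_safe helper is a hypothetical name, not something ZIMply provides, and the time string is made up.

```python
def console_safe(text, encoding):
    """Return `text` with characters `encoding` cannot represent replaced by '?'."""
    # errors="replace" substitutes '?' for anything outside the code page.
    return text.encode(encoding, errors="replace").decode(encoding)


message = "this can take quite some time! \u2013 12:00:00"
safe = console_safe(message, "cp437")
```

Replacing the en dash in the message with a plain ASCII hyphen would avoid the problem at the source, which is arguably the simpler fix.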

ZIMply claims to require Python 3.4 but actually needs Python 3.5

'''
F:\Wiki>python.exe zim-wp.py
Traceback (most recent call last):
  File "zim-wp.py", line 1, in <module>
    from zimply import ZIMServer
  File "F:\Wiki\lib\site-packages\zimply\__init__.py", line 1, in <module>
    from .zimply import ZIMServer
  File "F:\Wiki\lib\site-packages\zimply\zimply.py", line 793
    entries = [*entries, *redirects]
              ^
SyntaxError: can use starred expression only as assignment target
'''
I am running Windows XP, Python 3.4, and zimply 1.1.3. Would you backport it? I want multi-word search, and it would be a pity for educational software to drop XP, which is still used in underdeveloped countries (for whatever reason).

last working version: 1.0.x
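
For context, the failing line uses the unpacking generalizations that arrived in Python 3.5 (PEP 448). A sketch of an equivalent that also parses on Python 3.4 is below; the combine function name is only for illustration and does not exist in ZIMply.

```python
def combine(entries, redirects):
    """Python 3.4-compatible equivalent of [*entries, *redirects]."""
    # list() accepts any iterable, just like the starred form does.
    return list(entries) + list(redirects)


merged = combine(("a", "b"), ["c"])
```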

MonkeyPatchWarning

from zimply import ZIMServer
C:\Python\anaconda35\lib\site-packages\zimply\zimply.py:58: MonkeyPatchWarning: Monkey-patching ssl after ssl has already been imported may lead to errors, including RecursionError on Python 3.6. Please monkey-patch earlier. See gevent/gevent#1016
monkey.patch_all()

Can Zimply run multiple ZIMS at once

Is it possible for Zimply to run multiple ZIMS at once on the same page, or even different ports?

pip version is still old

Hi, I recently reinstalled zimply on another computer with pip install zimply and found that the code is still the old version, in which the IP address can't be configured.
OS: Windows and Arch Linux.
By the way, the previous installation's site-packages/zimply/zimply.py had been manually edited, so I didn't notice this earlier.

Error while index creation during first time server start-up

I found the following error while running my server python file:

sqlite3.OperationalError: no such module: fts4

As a result of the error, the searching doesn't work anymore, as mentioned in your README note.
I also found that sqlite3, in the recent versions, ships with fts5, and requires recompiling and other complex stuff for fts4 support.
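
To check which full-text modules a given Python build's sqlite3 actually supports, a small probe along these lines may help. It is only a sketch: fts_module_available is a made-up helper name, and the module name is interpolated into the SQL, so it should only ever be called with trusted constants.

```python
import sqlite3


def fts_module_available(module):
    """Return True if this sqlite3 build can create a virtual table with `module`."""
    conn = sqlite3.connect(":memory:")
    try:
        # An unsupported module raises OperationalError ("no such module: ...").
        conn.execute("CREATE VIRTUAL TABLE probe USING %s(content)" % module)
        return True
    except sqlite3.OperationalError:
        return False
    finally:
        conn.close()


support = {name: fts_module_available(name) for name in ("fts4", "fts5")}
```

Printing support on the affected machine would show whether falling back to another FTS version is even an option there.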

The `ZIMClient.random_article` function also returns media files with new zim files

With new-style zim files, both articles and their assets appear in the same "C" namespace:

https://www.openzim.org/wiki/ZIM_file_format#Namespaces

ZIMClient.random_article chooses a random index from the "C" namespace, assuming that entries in that namespace are all articles. This means that it will often return images, for example, instead of articles.

The issue is particularly prominent with this zim file, for example, which contains a very large number of images: https://download.kiwix.org/zim/.hidden/endless/wikihow_en_endless_holidays-and-traditions_2021-12.zim.

For reference, here is how this is implemented in libzim: https://github.com/openzim/libzim/blob/master/src/archive.cpp#L267-L284. It looks like we would need to make use of the title index.
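
As a stopgap that does not need the title index, one could keep drawing random entries until an article turns up. The sketch below shows that idea on plain dictionaries. It is not how libzim does it, and the entry layout, the mimetype_name key and the random_article signature are all assumptions for illustration.

```python
import random


def random_article(entries, is_article, rng=random, max_tries=1000):
    """Pick random entries until one satisfies `is_article`, or give up."""
    for _ in range(max_tries):
        entry = rng.choice(entries)
        if is_article(entry):
            return entry
    return None  # no article found within max_tries draws


# Hypothetical "C" namespace that is mostly images, as in the file above.
entries = [{"url": "img%d.png" % i, "mimetype_name": "image/png"} for i in range(50)]
entries.append({"url": "Pancake", "mimetype_name": "text/html"})

picked = random_article(
    entries,
    is_article=lambda entry: entry["mimetype_name"] == "text/html",
    rng=random.Random(0),  # seeded only to keep the example repeatable
)
```

The cost grows with the share of non-article entries, which is presumably why libzim relies on the title index instead.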

html navigation bar hides text

Hi,

I installed ZIMply with pip, using the default template.

The default template displays incorrectly. The navigation bar is displayed at the top, but it also hides the text beneath it, which appears faded (at maybe 5% opacity). I tried the latest Firefox and Chromium versions.

Changing the z-index or the min-height value of the nav element seems to fix the issue but I'm not sure it's the correct way to fix it.

Zimply example line needs to be commented

ZIMply/zimply/zimply.py

Line 902 in a260f16

server = ZIMServer("../wiki.zim", port=8080)

is uncommented and not protected by any if __name__ == "__main__":, therefore it's launched and looks for ../wiki.zim to be served on port 8080, no matter if we specify another zim file or another port.

This line needs to be commented.

Search index is missing articles with single-word titles

I encountered the following with a Wikipedia zim file and the version2 branch of Zimply:

>>> zim = zimply_core.ZIMFile("wikipedia_simple_endless-kidzsearch_maxi_2022-03.zim", 'utf-8')

# The "Pancake" article is at 187257
>>> zim.read_directory_entry_by_index(187257)
{'mimetype': 8,
 'parameterLen': 0,
 'namespace': 'A',
 'revision': 0,
 'clusterNumber': 1403,
 'blobNumber': 105,
 'url': 'Pancake',
 'title': '',
 'index': 187257}

# We get its offset in the file...
>>> pancake_offset = zim._read_url_offset(187257)

# Start reading metadata from the file
>>> super(zimply_core.DirectoryBlock, zim.articleEntryBlock)._unpack_from_file(zim.file, pancake_offset)
{'mimetype': 8,
 'parameterLen': 0,
 'namespace': b'A',
 'revision': 0,
 'clusterNumber': 1403,
 'blobNumber': 105}

# The next item is the URL:
>>> read_zero_terminated(zim.file, "utf-8")
'Pancake'

# And then the title:
>>> read_zero_terminated(zim.file, "utf-8")
''

# The "Pancake Day" article redirects to 224476)
>>> zim.read_directory_entry_by_index(224476)
{'mimetype': 8,
 'parameterLen': 0,
 'namespace': 'A',
 'revision': 0,
 'clusterNumber': 896,
 'blobNumber': 50,
 'url': 'Shrove_Tuesday',
 'title': 'Shrove Tuesday',
 'index': 224476}

# Get the offset for the article
>>> pancake_day_offset = zim._read_url_offset(224476)

# Start reading metadata from the file
>>> super(zimply_core.DirectoryBlock, zim.articleEntryBlock)._unpack_from_file(zim.file, pancake_day_offset)
{'mimetype': 8,
 'parameterLen': 0,
 'namespace': b'A',
 'revision': 0,
 'clusterNumber': 896,
 'blobNumber': 50}

# The next item is the URL:
>>> read_zero_terminated(zim.file, "utf-8")
'Shrove_Tuesday'

# And then the title:
>>> read_zero_terminated(zim.file, "utf-8")
'Shrove Tuesday'

I learned that there is an optimization where, if an article's URL is the same as its title, the title field is left blank, and Zimply does the right thing here inside ZIMFile.get_article_by_index. However, that is not the case when we are building the search index. The code in CreateFTSProcess which goes through articles uses ZIMFileIterator, which calls unpack_from_file directly. That code leaves the (sometimes blank) title field as-is.
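
The fallback itself is tiny. Here is a sketch on plain dictionaries shaped like the entries printed above; effective_title is a hypothetical helper, not an existing Zimply function, and where exactly it would be applied in the indexing code is left open.

```python
def effective_title(entry):
    """Return the entry's title, falling back to its URL when the title is blank."""
    return entry["title"] or entry["url"]


pancake = {"url": "Pancake", "title": ""}
shrove = {"url": "Shrove_Tuesday", "title": "Shrove Tuesday"}

indexed_titles = [effective_title(entry) for entry in (pancake, shrove)]
```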

No module named zimply

i get this error when i try to start the script:
wiki.py

from zimply import ZIMServer
ZIMServer("wiki.zim")

dennis@KODI:/media/dennis/931GB-btrfs/kiwix$ python wiki.py
Traceback (most recent call last):
File "wiki.py", line 1, in
from zimply import ZIMServer
ImportError: No module named zimply

gevent.hub.LoopExit: ('This operation would block forever')

I installed zimply and while importing the library through jupyter notebook it throws this error in the console (the importing doesn't complete) :
[IPKernelApp] ERROR | Exception in message handler:
......
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/threading.py", line 293, in wait
waiter.acquire()
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/gevent/thread.py", line 84, in acquire
return BoundedSemaphore.acquire(self, blocking, timeout)
File "src/gevent/_semaphore.py", line 198, in gevent._semaphore.Semaphore.acquire (src/gevent/gevent._semaphore.c:4541)
File "src/gevent/_semaphore.py", line 226, in gevent._semaphore.Semaphore.acquire (src/gevent/gevent._semaphore.c:4367)
File "src/gevent/_semaphore.py", line 166, in gevent._semaphore.Semaphore._do_wait (src/gevent/gevent._semaphore.c:3562)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/gevent/hub.py", line 630, in switch
return RawGreenlet.switch(self)
gevent.hub.LoopExit: ('This operation would block forever', <Hub at 0x104a81b90 select default pending=0 ref=0>)
ERROR:tornado.general:Uncaught exception, closing connection.

How do I resolve this error? Any input would be appreciated.

Zimply does not work with Kiwix wikibooks dump

`struct.error: unpack requires a buffer of 2 bytes` thrown when accessing some files

I had this error thrown for seemingly random URLs when trying different ZIM files. To reproduce with one sample file:

Download the wikipedia_ar_computer_maxi.zim file from the Kiwix collection (direct link: https://download.kiwix.org/zim/wikipedia_ar_computer_maxi.zim)
Serve the file using ZIMply then navigate to the /w/index.php page.
An error like this is thrown:

2022-09-26 01:08:07 [FALCON] [ERROR] GET /w/index.php => Traceback (most recent call last):
  File "falcon\app.py", line 365, in falcon.app.App.__call__
  File "D:\dev\zimply-demo\venv\lib\site-packages\zimply\zimply.py", line 717, in on_get
    article = ZIMRequestHandler.zim.get_article_by_url(namespace, url)
  File "D:\dev\zimply-demo\venv\lib\site-packages\zimply\zimply.py", line 537, in get_article_by_url
    entry, idx = self._get_entry_by_url(namespace, url)  # get the entry
  File "D:\dev\zimply-demo\venv\lib\site-packages\zimply\zimply.py", line 516, in _get_entry_by_url
    entry = self.read_directory_entry_by_index(middle)
  File "D:\dev\zimply-demo\venv\lib\site-packages\zimply\zimply.py", line 454, in read_directory_entry_by_index
    directory_values = self._read_directory_entry(offset)
  File "D:\dev\zimply-demo\venv\lib\site-packages\zimply\zimply.py", line 435, in _read_directory_entry        
    fields = unpack("<H", self.file.read(2))
struct.error: unpack requires a buffer of 2 bytes

Missing Dependency (`version2`)

A requirement for version2 is zstandard. It's in setup.py but not in the readme.

Also, from my experience with version 2, you should consider moving it onto the master branch; it appears quite stable. Great work.

Multiple ZIM Archives (`version2`)

A feature request to load multiple ZIM files. People interested in hosting a single ZIM archive are likely to want to host multiple ZIM archives.

This might look like:

ZIMServerCollection([{filename, index_file, template}, {..}, ..], ip_address, port, encoding)

Each ZIM archive would be served under a different directory (localhost:port/wiki1/.., localhost:port/wiki2/..). Each would be a separate client. You would then have a simple page at the index to choose a ZIM archive.
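
The per-directory routing could look roughly like the sketch below, which only shows how a request path might be split into an archive prefix and the remaining article path. The function names, the archives mapping and the file names are all hypothetical, and no web framework is involved.

```python
def split_archive_path(path):
    """Split '/wiki1/A/Page' into ('wiki1', 'A/Page')."""
    prefix, _, rest = path.lstrip("/").partition("/")
    return prefix, rest


# Hypothetical mapping from URL prefix to ZIM file.
archives = {"wiki1": "first.zim", "wiki2": "second.zim"}


def resolve(path):
    """Return (zim_file, inner_path), or None when no archive matches."""
    prefix, rest = split_archive_path(path)
    if prefix not in archives:
        return None  # e.g. '/' would show the archive chooser instead
    return archives[prefix], rest


resolved = resolve("/wiki1/A/Pancake")
```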

New Wikipedia and Wiktionary dumps use zstd compression instead of lzma

To deal with the new standard compression, this module needs some slight changes. Since I cannot contribute directly, feel free to apply the patch below:

--- a/zimply/zimply.py
+++ b/zimply/zimply.py
@@ -56,6 +56,7 @@ from struct import Struct, pack, unpack
 # non-standard required packages are gevent and falcon (for its web server),
 # as well as and make (for templating)
 from mako.template import Template
+import zstandard

 import falcon

@@ -243,11 +244,11 @@ class ClusterData(object):
         self.offset = offset  # store the offset
         cluster_info = ClusterBlock(encoding).unpack_from_file(
             self.file, self.offset)  # Get the cluster fields.
-        # Verify whether the cluster has LZMA2 compression
-        self.compressed = cluster_info['compressionType'] == 4
+        # Verify whether the cluster has compression
+        self.compression = {4: "lzma", 5: "zstd"}.get(cluster_info['compressionType'], False)
         # at the moment, we don't have any uncompressed data
         self.uncompressed = None
-        self._decompress()  # decompress the contents as needed
+        self._decompress()  # decompress the contents as needed
         # Prepare storage to keep track of the offsets
         # of the blobs in the cluster.
         self._offsets = []
@@ -255,24 +256,39 @@ class ClusterData(object):
         self._read_offsets()

     def _decompress(self, chunk_size=32768):
-        if self.compressed:
+        if self.compression == "lzma":
             # create a bytes stream to store the uncompressed cluster data
             self.buffer = io.BytesIO()
             decompressor = lzma.LZMADecompressor()  # prepare the decompressor
             # move the file pointer to the start of the blobs as long as we
             # don't reach the end of the stream.
             self.file.seek(self.offset + 1)
-
             while not decompressor.eof:
                 chunk = self.file.read(chunk_size)  # read in a chunk
                 data = decompressor.decompress(chunk)  # decompress the chunk
                 self.buffer.write(data)  # and store it in the buffer area
+
+        elif self.compression == "zstd":
+            # create a bytes stream to store the uncompressed cluster data
+            self.buffer = io.BytesIO()
+            decompressor = zstandard.ZstdDecompressor().decompressobj()  # prepare the decompressor
+            # move the file pointer to the start of the blobs as long as we
+            # don't reach the end of the stream.
+            self.file.seek(self.offset+1)
+            while True:
+                chunk = self.file.read(chunk_size)  # read in a chunk
+                try:
+                    data = decompressor.decompress(chunk)  # decompress the chunk
+                    self.buffer.write(data)  # and store it in the buffer area
+                except zstandard.ZstdError as e:
+                    break
+

     def _source_buffer(self):
         # get the file buffer or the decompressed buffer
-        buffer = self.buffer if self.compressed else self.file
+        buffer = self.buffer if self.compression else self.file
         # move the buffer to the starting position
-        buffer.seek(0 if self.compressed else self.offset + 1)
+        buffer.seek(0 if self.compression else self.offset + 1)
         return buffer

     def _read_offsets(self):

This adds zstandard as a new dependency.
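
For anyone who wants to exercise the chunked-read loop from the LZMA branch in isolation, here is a standalone sketch that uses only the standard library and an in-memory stream. It is not ZIMply code: decompress_stream and the sample data are assumptions, and the zstd branch is deliberately left out because it needs the third-party zstandard package.

```python
import io
import lzma


def decompress_stream(source, chunk_size=32768):
    """Decompress one LZMA/XZ stream from `source`, reading fixed-size chunks."""
    decompressor = lzma.LZMADecompressor()
    out = io.BytesIO()
    # Stop as soon as the decompressor sees the end-of-stream marker, even if
    # the last chunk also contained bytes that belong to whatever follows.
    while not decompressor.eof:
        chunk = source.read(chunk_size)
        if not chunk:
            raise EOFError("compressed stream ended before its end marker")
        out.write(decompressor.decompress(chunk))
    return out.getvalue()


payload = b"hello zim " * 1000
# Trailing bytes stand in for the next cluster that follows in the file.
stream = io.BytesIO(lzma.compress(payload) + b"start of the next cluster")
restored = decompress_stream(stream, chunk_size=64)
```

The explicit check for an empty chunk is an addition over the loop in the patch; it turns a truncated file into an error rather than an endless loop.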

PyPI release update

Hello!

Can you please release an update for ZIMply on PyPI that includes pull request #7? Currently,

pip install zimply

installs an unfixed version, which is not usable with MWoffliner-made ZIM files in most browsers (except for text-based ones, such as Links 2).

Zim file can be other content than wikipedia

Remove the Wikipedia logo; maybe we can just use the favicon instead
Change "Search Wikipedia" to "Search [zim name]"

ImportError: bad magic number in 'zimply': b'\x03\xf3\r\n'

Just did an upgrade on a raspberry pi and did
pip3 install zimply
and got .... Successfully installed zimply-1.1.4

But creating server.py:

from zimply import ZIMServer
ZIMServer("fas-military-medicine_2022-03.zim")

and running it with
python3 server.py
yields:

Traceback (most recent call last):
  File "server.py", line 1, in <module>
    from zimply import ZIMServer
ImportError: bad magic number in 'zimply': b'\x03\xf3\r\n'

so maybe it didn't install correctly after all?
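
A note that may help with diagnosis: the four bytes in the error are the header of a compiled .pyc file, and they can be compared with what the running interpreter expects. The sketch below does that with the standard library. As far as I know, the value 62211 is the one CPython 2.7 wrote, which would suggest the package contains bytecode compiled by Python 2, but treat that reading as an assumption to verify.

```python
import importlib.util

reported_magic = b"\x03\xf3\r\n"  # bytes quoted in the ImportError above
running_magic = importlib.util.MAGIC_NUMBER  # what this interpreter expects

# The first two bytes are a little-endian version number; the rest is "\r\n".
version_field = int.from_bytes(reported_magic[:2], "little")
matches_running_interpreter = reported_magic == running_magic
```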
