We can close this now, as you helped us even more than I asked for!
Thanks so much!
from node-yencode.
I remember having a quick look at it in the past. The main blocker was that I don't know how the Python build system supports setting different compiler flags for different files. I'm guessing this might help; perhaps the code has to be split up into separate modules to get different compiler flags.
Some sort of compiler detection could be useful (so that it still builds on those using older compilers); at the very least, different flags need to be given to x86/ARM builds, though if it's a Python script, I suppose detection can easily be done there.
Essentially, translating binding.gyp to the Python equivalent (setuptools?) would be the main thing. After that, it's just writing wrappers to whatever decode flavour you're after (whether you need dot unstuffing or end-of-article detection).
Indeed, I understand that those things need to happen, but I feel lost in the compile errors I'm getting. They're probably obvious to you, since you've dealt with compiling a lot, but for a Python person it's a whole different universe.
So I'm hoping for your help!
If it matters in any way, we could also donate some of our advertising income to you for the effort! We would be very willing to do so, because this will be a core part of our application.
Thanks for the offer, but I don't accept those.
What are you ultimately looking for? A simple modification to your existing module, or something new?
How do you intend to process the data? Do you need dot unstuffing? Scanning for end of article? Does the module need to be redistributable in binary form?
I presumed you'd have more knowledge around creating a Python module? That's something I have no experience with yet.
I want to be able to call your do_decode function on a raw Python yEnc string that I got from Usenet. So it's not in chunks, but the whole string is present already. It does need all the processing indeed, but that's already in the function as far as I could tell.
My knowledge is not much beyond using Python C extension template projects found on Github.
As you may have figured already, it's not really as simple as that. Or rather, calling the function is fairly straightforward (call do_decode_end with the source data buffer, destination data buffer, length of source data, and current yEnc state). The difficulty is getting it to compile, as that requires the appropriate flags to be set for each file.
(you said the 'whole string' is present already, which I'm assuming doesn't represent the whole article, otherwise you'd need something which scans for where the article ends)
Do you already have some code/setup which tries to call the function?
I will gather my failed attempts and create 1 project to share so maybe you can see the problem.
I see - so this is a fresh project. I thought you may have been trying to adopt SABYenc.
I can take a look at a Python port of this module at some point.
Eventually it's intended to replace or be incorporated into Sabyenc, but trying to implement it in there directly was giving me unrelated errors that were too hard for me to separate.
As such I wanted to get back to basics and first make it compile at all for Python.
Eventually it's intended to replace or be incorporated into Sabyenc
If that's the case, do you intend to keep it working as it currently does? As in, there's something else which identifies the end of article, assembles an array of strings, then passes it to the decoder? Or would you rather have the yEnc decoder itself detect the end of article, which means the network receiver would pass data directly to it without aggregating into an array?
(Sorry if I'm approaching this differently to how you would - I mostly try to see the end goal and see how to get there instead of taking small steps)
Currently we can only have 1 thread that handles all connections, because the Python GIL only allows 1 active thread at a time. So we use a select to see which sockets have waiting data, read that with the 1 thread and check for end of article. If it's not the end of the article, we append the new raw data to an array of chunks. If we find the end of the article, we pass all the chunks of data to the decoder threads. In the decoder thread we call the C extension, which can release the GIL and thus decode multiple articles at once.
It's a bit of a long introduction to say: decoding while receiving data (to also detect end of article) would probably hit performance, since we then can decode only 1 article simultaneously.
To have multiple sockets being read at the same time, we would also have to move that to C extension. The benefit of Python is that we don't have to care about any of the platform specific socket and IO/OpenSSL stuff, so it's very convenient and safe to have it like this.
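The receive loop described here could be sketched roughly like this (a simplified, hypothetical stand-in: it uses a local socketpair instead of real NNTP connections, stops reading a socket after one article, and the end-of-article check is reduced to a naive substring test):

```python
import select
import socket

END_MARKER = b"\r\n.\r\n"  # NNTP end-of-article sequence

def receive_articles(socks):
    """Single-threaded select loop: append raw chunks per socket until the
    end-of-article marker is seen, then hand the chunk list over (in the
    real setup, to a pool of decoder threads)."""
    chunks = {s: [] for s in socks}
    articles = []
    while socks:
        readable, _, _ = select.select(socks, [], [], 1.0)
        for s in readable:
            data = s.recv(16384)
            if not data:            # connection closed
                socks.remove(s)
                continue
            chunks[s].append(data)
            # naive end-of-article test: misses a marker split over chunks
            if END_MARKER in data:
                articles.append(chunks[s])
                chunks[s] = []
                socks.remove(s)     # simplified: one article per socket
    return articles

a, b = socket.socketpair()
b.sendall(b"=ybegin ...=yend\r\n.\r\n")
result = receive_articles([a])
```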
NodeJS is also single threaded - the typical approach taken is that tasks can be processed asynchronously.
I believe Python also has async support, but am not sure of requirements or whether it's being used. But a possible approach would be to receive data on a socket, then send it to be asynchronously processed, whilst data from another socket is received. The async part can operate outside of the GIL because it doesn't need to interact with Python objects.
From the sounds of it, it's not the approach currently being used, but I don't know if there's any intent to adopt something like this.
The only downside of the current method is that the end-of-article scan is completely single threaded (and blocks the main thread), but presumably not much of an issue to really matter.
Am I safe to assume that we just need to replace the existing yEnc decode function in SABYenc as it is right now?
I did some digging last year and found that "async" code in Python is not really asynchronous; it only is if the work is written in C and the GIL is specifically released there. And there's almost no documentation on how to properly implement the C part so it can be used with the async keywords.
End of article is not really that hard to scan for in the chunked approach: https://github.com/sabnzbd/sabnzbd/blob/c3d826df8c41221e0bd57a5e978756e1e0e81e7e/sabnzbd/newswrapper.py#L204
As the data is delivered in chunks, we have to do some logic to jump between chunks: https://github.com/sabnzbd/sabyenc/blob/89ef55dd6d9175589d9dec25acdedd541ea62d75/src/sabyenc3.c#L239
But I think I can implement that, once there is a base project that compiles and can run on a whole string that's not in chunks. Didn't want to bother you with our specific implementation details :)
I see. It also looks like asyncio requires most of the project to be async as well, which might be a big change.
It can probably be done without native async, but probably not worth the effort.
I'll find some time to take a look at SABYenc to get a better idea of how to approach this.
End of article is not really that hard to scan for in the chunked approach
Quick scan of the code, so could be wrong here, but does it mean that the end sequence won't be detected if it's sent in three or more chunks? Also, if the server sends the end sequence, followed by something else (say, a quit message), it could get missed?
If I'm correct, and it's never really been an issue, I guess it doesn't matter.
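For what it's worth, catching a marker split across any number of chunks only needs a few bytes of carried-over state; a rough Python sketch (the helper name is made up):

```python
END_MARKER = b"\r\n.\r\n"

def find_end(prev_tail: bytes, chunk: bytes):
    """Check whether the end-of-article marker appears in this chunk,
    including when it straddles a chunk boundary. Returns (found, new_tail);
    carry new_tail into the next call."""
    window = prev_tail + chunk
    found = END_MARKER in window
    # keep the last len(marker)-1 bytes so a split marker is still caught,
    # no matter how many chunks it is spread over
    return found, window[-(len(END_MARKER) - 1):]

# marker split over three chunks, followed by more data (e.g. a quit message)
tail = b""
results = []
for chunk in (b"data\r\n", b".", b"\r\nQUIT\r\n"):
    found, tail = find_end(tail, chunk)
    results.append(found)
```

Because the scan runs over the carried tail plus the new chunk, it also finds a marker that is immediately followed by other data in the same chunk.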
As the data is delivered in chunks, we have to do some logic to jump between chunks
The decoder supports incremental processing - you just have to maintain a parsing state between calls (or chunks, in this case).
The decoder doesn't scan for null bytes though, but it looks like PyBytes_AsStringAndSize can be used to get the size.
Quick scan of the code, so could be wrong here, but does it mean that the end sequence won't be detected if it's sent in three or more chunks? Also, if the server sends the end sequence, followed by something else (say, a quit message), it could get missed?
Indeed, it has never happened that it's split between more than 2 chunks.
Currently the unit tests of sabyenc are broken, would it help if I fixed them?
The unit tests seem to run fine for me?
The only thing I found was the speed test fails on Python 3, but that's an easy fix.
I've been looking at implementing this, but a few things I noticed:
- it looks like a search for crc32= is done as opposed to pcrc32= (or probably more accurately, [space]pcrc32=). If the article contains both 'pcrc32' and 'crc32', wouldn't it be possible for this to be confused?
- I'm not sure where the article size is determined, but is there any reason to trust what's specified, and not just use the actual data size all the time? Decoding yEnc can never generate output larger than the input, so it seems like a sensible allocation strategy.
- might be related to above, but what's with the 10% extra reservation?
- the attempt to correct a bad part size doesn't make much sense to me - if it's bad, shouldn't it just simply be rejected?
- bit flipping the CRC here is weird (usually you do it on the calculated CRC, not the declared one) - do the upper layers require knowing the CRC specified by the article? Because if they do, they'd be getting an incorrect value (or 0 if CRC_CHECK isn't defined, or crc32= was never found in the line)
The unit tests seem to run fine for me?
They fail on some platforms when I tried to run them on GitHub Actions, but good thing they work on yours 👍
it looks like a search for crc32= is done as opposed to pcrc32= (or probably more accurately, [space]pcrc32=). If the article contains both 'pcrc32' and 'crc32', wouldn't it be possible for this to be confused?
Probably, I don't think we ever had those cases where both were present.
I'm not sure where the article size is determined, but is there any reason to trust what's specified, and not just use the actual data size all the time? Decoding yEnc can never generate output larger than the input, so it seems like a sensible allocation strategy.
might be related to above, but what's with the 10% extra reservation?
Since we have a per-article-size-indication from the NZB, we use that to save the extra looping over all chunks to find the "real" length. Think it made a few % performance difference to do this.
the attempt to correct a bad part size doesn't make much sense to me - if it's bad, shouldn't it just simply be rejected?
Good question, probably just a safe-guard "to be sure".
bit flipping the CRC here is weird (usually you do it on the calculated CRC, not the declared one) - do the upper layers require knowing the CRC specified by the article? Because if they do, they'd be getting an incorrect value (or 0 if CRC_CHECK isn't defined, or crc32= was never found in the line)
The upper layers only care if the CRC is correct, not what it was really.
As you can see, many things in this code are based more on real-world tweaks to squeeze out some extra performance than on strictly following the yEnc standard per se. Combine that with my fairly limited C knowledge and the result is not-so-nice code.
I wanted to note something specifically on the CRC: after much testing with real users, we found that it was never a single flipped bit; it was always whole chunks of data missing. So calculating the CRC was just an expensive extra step relative to just comparing the sizes.
So for a number of years already, the production release hasn't calculated the CRC and just uses the size comparison.
Probably, I don't think we ever had those cases where both were present.
Nyuu definitely writes both, but puts the pcrc32 first, so it'd work because of that ordering. The yEnc spec example also puts pcrc32 before crc32, so not likely an issue in reality, but I'll change it to pcrc32 as it should be more accurate with basically no downside.
I recall most uploaders not ever including 'crc32', so your experience makes sense.
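As an illustration of that pcrc32-first search with a crc32 fallback (a hypothetical helper, not the module's actual parser; note how the leading space keeps " crc32=" from matching inside "pcrc32="):

```python
import re

def parse_crc(yend_line: bytes):
    """Extract the declared part CRC from an =yend line: prefer pcrc32=,
    fall back to crc32= if no pcrc32 is present. Returns an int or None.
    Illustrative sketch only."""
    m = re.search(rb" pcrc32=([0-9a-fA-F]{1,8})", yend_line)
    if m is None:
        # the space before "crc32=" prevents matching inside "pcrc32="
        m = re.search(rb" crc32=([0-9a-fA-F]{1,8})", yend_line)
    return int(m.group(1), 16) if m else None

line = b"=yend size=128 part=1 pcrc32=abcdef12 crc32=12345678"
```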
Since we have a per-article-size-indication from the NZB, we use that to save the extra looping over all chunks to find the "real" length. Think it made a few % performance difference to do this.
Ah I see.
I'm surprised it'd make that much of a difference though, particularly since I'd generally expect most articles to arrive in a few chunks only.
Whilst it may seem slower on the surface, it actually allows you to remove this check, which would have a much greater impact on performance, since that's checked ~700K times per article.
This module similarly skips such a check (for performance reasons), requiring having sufficiently large buffers, so I'd need to remove trusting the NZB value here.
The upper layers only care if the CRC is correct, not what it was really.
I can see it being useful if you want to verify crc32 in addition to pcrc32.
But otherwise, I guess I can remove it then - I'll keep the interface the same, but always return 0 for the CRC (which already happens since CRC is disabled by default).
I wanted to note specifically on the CRC: after much testing with real users, it was found that never some char-bit was flipped but always just chunks of data was missing. So calculating the CRC was just an expensive extra step relative to just comparing the sizes.
That sounds very reasonable, though personally I feel uneasy about skipping the CRC check. What you decide to do is your decision in the end, though the CRC32 performance of this module should be better, so hopefully will encourage you to reconsider.
A few more things:
- it looks like this could fail to advance far enough, if there happens to be a chunk boundary in the middle. It's unlikely for ypart to ever include more than a begin/end property, so I suppose it's not an issue in reality
- this assumes the input is a bytearray, but everywhere else assumes bytes object. I assume the latter is correct?
- the CRC checked in the encoder test is bit-inverted. This actually means the encoder is generating the wrong CRC, though I doubt Sabnzbd really uses an encoder much. I did find this case though, so I don't know if fixing the CRC would break something
For handling multiple chunks, I'd have made a small flat buffer (say, around 4KB as the ybegin/ypart lines should fit within that) and filled that with data from the chunks, before processing. This saves you from having to implement complex logic to deal with data being split across chunks every time you try to parse something, and it also works in the rare case that something spans across three or more chunks.
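The flat-buffer idea might look something like this in Python (sizes and the example header lines are illustrative):

```python
def fill_header(chunks, size=4096):
    """Copy data from the front of a chunk list into one small flat buffer,
    so the =ybegin/=ypart lines can be parsed without any cross-chunk
    logic - even if a line spans three or more chunks. Sketch only."""
    buf = bytearray(size)
    pos = 0
    for chunk in chunks:
        take = min(len(chunk), size - pos)
        buf[pos:pos + take] = chunk[:take]
        pos += take
        if pos == size:
            break
    return bytes(buf[:pos])

chunks = [
    b"=ybegin part=1 line=128 ",
    b"size=500000 name=file.bin\r\n",
    b"=ypart begin=1 end=500000\r\n",
]
header = fill_header(chunks)
```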
I'm surprised it'd make that much of a difference though, particularly since I'd generally expect most articles to arrive in a few chunks only.
Here's another fun Python fact: when SSL is enabled for the Usenet connection, Python delivers each chunk at the TLS fragment size of 16k, resulting in a minimum of around 40 chunks for SSL connections. This is something that will hopefully be addressed in a future version of Python: python/cpython#25478
Whilst it may seem slower on the surface, it actually allows you to remove this check, which would have a much greater impact on performance, since that's checked ~700K times per article.
This module similarly skips such a check (for performance reasons), requiring having sufficiently large buffers, so I'd need to remove trusting the NZB value here.
👍
this assumes the input is a bytearray, but everywhere else assumes bytes object. I assume the latter is correct?
Indeed, should be bytes.
the CRC checked in the encoder test is bit-inverted. This actually means the encoder is generating the wrong CRC, though I doubt Sabnzbd really uses an encoder much. I did find this case though, so I don't know if fixing the CRC would break something
I should fix that then!
For handling multiple chunks, I'd have made a small flat buffer (say, around 4KB as the ybegin/ypart lines should fit within that) and filled that with data from the chunks, before processing. This saves you from having to implement complex logic to deal with data being split across chunks every time you try to parse something, and it also works in the rare case that something spans across three or more chunks.
Great! The current code is so clunky..
I've got a version of sabyenc with yencode integrated here, if you want to play around with it: https://github.com/animetosho/sabyenc/tree/integrate_yencode
There's one major issue remaining: I can't see how to define different compile options for particular files, or the ability to define dependent C libraries in setuptools. I've worked around this by invoking the compiler manually, which seems to work, though probably isn't the nicest solution.
In particular, it currently executes before setup, so compiling dependencies will always run whenever setup.py is invoked (even if not building). It may be possible to avoid this with the cmdclass directive.
I also can't find a reliable way to determine the current build architecture. The closest thing I can find in Python is platform.machine(), but this has some problems:
- it can return a bunch of different values for the same architecture
- it is based off the kernel, which may not be the same as the Python interpreter (e.g. 32-bit Python running on a 64-bit kernel)
Do you know of a way to determine Python's build architecture? Or better, the target architecture, if setuptools allows cross-arch compilation.
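Absent a better answer, a best-effort normalization might look like this (a sketch: it normalizes common platform.machine() spellings and uses the pointer size for bitness, since that reflects the interpreter rather than the kernel; it still wouldn't cover cross-compilation):

```python
import platform
import struct

def target_arch():
    """Best-effort guess of the interpreter's architecture. The returned
    labels are arbitrary choices for this sketch."""
    machine = platform.machine().lower()
    bits64 = struct.calcsize("P") * 8 == 64   # interpreter, not kernel
    if machine in ("x86_64", "amd64", "i386", "i686", "i86pc", "x86"):
        return "x86_64" if bits64 else "x86"
    if machine.startswith(("arm", "aarch64")):
        return "aarch64" if bits64 else "arm"
    return machine
```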
In my changes, I've removed the 10% overallocation, since the NZB size is now ignored, and the size calculation can be relied upon.
I've left the bad part size correction - you may wish to consider rejecting the article immediately, as I'm not sure it does anything useful.
The changes introduce three test failures:
- test_single_part: the test case contains a dot at the start of a line, which is invalid according to yEnc specs. SabYenc originally strips doubled dots only, whilst this decoder strips any dot following a new line. Since it's invalid, the spec doesn't really say which is better, but I felt the latter was, which is why I went with that.
- test_partial: the old decoder actually tries to decode the =yend segment as yEnc encoded data, so gives a longer output. I don't get the point of the partial test - does the decoder ever get fed *part* of an article?
- test_false_bytes: the new code completely ignores the passed in length, so passing in an invalid length won't stop the decoder
Enabling CRC check also affects the following cases:
- test_single_part: the supplied article doesn't have a pcrc32 value, which is mandatory under the yEnc specification, so it is immediately marked as invalid. crc32 is given, but 'crc32' is optional whilst 'pcrc32' is not. If you still want to allow such articles, you could fall back to searching for crc32 when pcrc32 is not found.
- test_bad_filename_pickle: fails on the old code. I'm guessing the CRC is correct, but the length is wrong.
- test_crc_pickles: likely same issue as with test_single_part
Here's another fun Python fact: when SSL is enabled for the usenet-connection, Python delivers each chunk in the TLS fragment size of 16k, resulting in a minimum of around 40 chunks in case of SSL-connections
Does Python have the ability to receive into a pre-allocated buffer/bytes object, rather than accept a new bytes object every receive?
Amazing! It works 😃
I had to do 1 tweak though: the crcutil-1.0\examples directory is not included, but the files in there are required to compile, so I manually copied them in from this repository.
Does Python have the ability to receive into a pre-allocated buffer/bytes object, rather than accept a new bytes object every receive?
Yes, but it still only does that 16k at a time.
Ah, excluded by .gitignore. Missed that, thanks.
Yes, but it still only does that 16k at a time.
Why not pre-allocate, say, a 1MB buffer, then receive into that? There'll still be some overhead with small receives, but everything subsequent will only have to deal with one buffer.
(it would also have other benefits, such as enabling a buffer pool to relieve allocation/GC pressure)
Taking a quick look, Python does have socket.recv_into, which should do the trick.
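A minimal sketch of that approach (assuming a blocking socket and the plain NNTP terminator; real code would need error handling and a growth strategy for oversized articles):

```python
import socket

def recv_article(sock, bufsize=1024 * 1024):
    """Receive into a single pre-allocated buffer with recv_into: each call
    may still return ~16k at a time over TLS, but everything lands in one
    bytearray instead of many small bytes objects."""
    buf = bytearray(bufsize)
    view = memoryview(buf)          # slicing a memoryview copies nothing
    pos = 0
    while pos < bufsize:
        n = sock.recv_into(view[pos:])
        if n == 0:                  # connection closed
            break
        pos += n
        if buf[:pos].endswith(b"\r\n.\r\n"):   # end of article
            break
    return bytes(view[:pos])

a, b = socket.socketpair()
b.sendall(b"line1\r\nline2\r\n.\r\n")
article = recv_article(a)
```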
Deleted my previous post, because it turns out my script wasn't optimal. Allocating the bytearray takes a lot of time, so picking a proper size (like we could get from the NZB) makes a big difference. Updated the linked gist below.
Now buffers are always faster.
@animetosho I looked into recv_into and it looks promising. When I checked it out a long time ago, it was slower than expected.
So just to be sure I made a speed test, using nserv (NZBGet's super-fast NNTP test server) and end-of-article checks.
https://gist.github.com/Safihre/12f9826631c0aaf47491c118511679c4
(lower = better)
EDIT: Some data updated, wrong copy-paste!
SSL:
RECV SIZE: 262144
Chunks: 828.125
Buffer: 781.25
RECV SIZE: 16384
Chunks: 640.625
Buffer: 609.375
NO SSL:
RECV SIZE: 262144
Chunks: 250.0
Buffer: 171.875
RECV SIZE: 16384
Chunks: 437.5
Buffer: 375.0
Allocating the bytearray takes a lot of time, so picking a proper size (like we could get from the NZB) makes a big difference
I'd persist the buffer in that case. Perhaps assign one to each connection, and always recv into that buffer.
This saves you from having to reallocate it for every article. You can use the NZB size as a hint (or use some hard-coded value) and perhaps add a 10% margin on top. If a recv would exceed the buffer size, resize it (say, make it 10% larger). Most of the time, a resize should never be needed, so you're basically just writing into a fixed buffer.
Also, mentioned in the pull request, but it would also make sense for the decoded output to occupy a fixed buffer, instead of creating a new one each time.
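The persistent-buffer idea could be sketched like so (the growth factor, margin, and API are illustrative choices, not a prescribed design):

```python
class ConnectionBuffer:
    """Persistent receive buffer per connection: sized from a hint (e.g.
    the NZB article size) plus 10%, grown by ~10% only when a write would
    overflow, so most articles are written into a fixed buffer with no
    reallocation. Sketch only."""
    def __init__(self, size_hint):
        self.buf = bytearray(int(size_hint * 1.10))
        self.pos = 0

    def write(self, data):
        needed = self.pos + len(data)
        while needed > len(self.buf):
            # grow by ~10% (at least 1 byte) until the data fits
            self.buf.extend(bytearray(int(len(self.buf) * 0.10) + 1))
        self.buf[self.pos:needed] = data
        self.pos = needed

    def reset(self):
        self.pos = 0            # reuse for the next article, no realloc

cb = ConnectionBuffer(100)
cb.write(b"x" * 100)    # within the hinted size
cb.write(b"y" * 50)     # exceeds it, triggers growth
```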
Thanks for testing it!