UnicodeDecodeError about torrent_parser CLOSED

7sdream commented on July 2, 2024
UnicodeDecodeError

from torrent_parser.

Comments (7)

7sDream commented on July 2, 2024

I have already turned the string decoding error handler into an argument on the dev branch, in commit ee3128b.

7sDream commented on July 2, 2024

Thanks for your ideas.

I will finish the custom hash fields API tomorrow and release a new version.

Because of the breaking change and the many things being added, it will be 0.3.0.

(And yes, in 0.x.x a breaking change doesn't need a major version bump... I'm still considering when to reach 1.0 ⌛)

7sDream commented on July 2, 2024

v0.3.0 has just been released.

In this version, there are several ways to deal with this problem:

import torrent_parser as tp

file = 'tests/test_files/utf8.encoding.error.torrent'

# way 1

data = tp.parse_torrent_file(file, errors='ignore')
print(data['magnet-info']['info_hash'])

data = tp.parse_torrent_file(file, errors='replace')
print(data['magnet-info']['info_hash'])

# way 2

data = tp.parse_torrent_file(file, hash_fields={'info_hash': (20, False)})
print(data['magnet-info']['info_hash'])

# way 3

data = tp.parse_torrent_file(file, hash_fields={'info_hash': (20, False)}, hash_raw=True)
print(data['magnet-info']['info_hash'])

# If you use none of the options above, parsing raises an exception

try:
    data = tp.parse_torrent_file(file)
except tp.InvalidTorrentDataException as e:
    print(e)

The output:

jysL
�j��y�sL�
36fd06b595119b380df46ab2f2a0b579b1734ca8
b'6\xfd\x06\xb5\x95\x11\x9b8\r\xf4j\xb2\xf2\xa0\xb5y\xb1sL\xa8'
Fail to decode string at pos 16436 using encoding utf-8 when parser field "info_hash", maybe it is an hash field. You can use self.hash_field("info_hash") to let it be treated as hash value, so this error may disappear
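
The first four lines of this output can be reproduced with the standard library alone. The following is a minimal sketch, not torrent_parser's own code; it assumes the field's raw value is exactly the 20 bytes printed by way 3, and the variable names are made up for illustration.

```python
# Raw bytes of magnet-info.info_hash, copied from the hash_raw=True output.
raw = bytes.fromhex("36fd06b595119b380df46ab2f2a0b579b1734ca8")

# 'strict' (the default handler) raises, because 0xfd can never start
# a valid UTF-8 sequence.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 'ignore' silently drops undecodable bytes; 'replace' substitutes U+FFFD.
ignored = raw.decode("utf-8", errors="ignore")
replaced = raw.decode("utf-8", errors="replace")

# Treating the value as a hash skips text decoding entirely.
hex_digest = raw.hex()

print(strict_failed)
print(repr(ignored))
print(repr(replaced))
print(hex_digest)
```

Both lenient results contain a carriage return (the byte 0x0d), which is probably why a terminal shows only the characters that follow it.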

The hash_field("info_hash") method mentioned in the error message has been added to the parser classes:

with open(file, 'rb') as f:
    data = tp.TorrentFileParser(f).hash_field('info_hash').parse()
    print(data['magnet-info']['info_hash'])
    # 36fd06b595119b380df46ab2f2a0b579b1734ca8

with open(file, 'rb') as f:
    data = tp.BDecoder(f.read()).hash_field('info_hash').decode()
    print(data['magnet-info']['info_hash'])
    # 36fd06b595119b380df46ab2f2a0b579b1734ca8

7sDream commented on July 2, 2024

I just merged your PR #5, but I have since come up with some ideas that I want to discuss with you (and others).

  1. I noticed that the error happens because there is a field, magnet-info.info_hash, which doesn't seem to be a string; instead, it's a hash value. I'm wondering whether I should add it to the list of fields whose values are treated as hashes automatically. (see line 108 and 189)

  2. The decoding error handler will become an option of the TorrentFileParser class and the parse_torrent_file shortcut function. Its default behavior will not change, that is, its default value will be strict. You can use ignore or replace to avoid the exception if you wish. But if I add info_hash to that list, your error will disappear automatically. So I think using strict as the error handler and adding a try/except to skip REALLY invalid torrents is the best way.

  3. I can add a method to TorrentFileParser and TorrentFileCreator to let users add their own hash fields to that list. At the same time, the error message for a string decoding error will suggest using this method to add a custom hash field to the list. But I'm wondering whether it is worth doing. And if I decide to do this, your magnet-info.info_hash will not be added to the list by default.
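
Idea 3, a user-supplied list of hash fields, can be illustrated with a small bencode decoder. This is a hypothetical sketch, not torrent_parser's implementation: the bdecode function and its hash_fields set are invented here, malformed input is not validated, and the library's real option may take a different shape.

```python
def bdecode(data: bytes, hash_fields=frozenset(), errors="strict"):
    """Decode bencoded bytes; byte strings stored under a key listed in
    hash_fields are returned as hex text instead of being UTF-8 decoded."""

    def parse(pos, as_hash=False):
        lead = data[pos:pos + 1]
        if lead == b"i":                       # integer: i<digits>e
            end = data.index(b"e", pos)
            return int(data[pos + 1:end]), end + 1
        if lead == b"l":                       # list: l<items>e
            pos += 1
            items = []
            while data[pos:pos + 1] != b"e":
                item, pos = parse(pos)
                items.append(item)
            return items, pos + 1
        if lead == b"d":                       # dict: d<key><value>...e
            pos += 1
            result = {}
            while data[pos:pos + 1] != b"e":
                key, pos = parse(pos)
                value, pos = parse(pos, as_hash=key in hash_fields)
                result[key] = value
            return result, pos + 1
        # Byte string: <length>:<bytes>
        colon = data.index(b":", pos)
        length = int(data[pos:colon])
        start = colon + 1
        raw = data[start:start + length]
        if as_hash:
            return raw.hex(), start + length
        return raw.decode("utf-8", errors=errors), start + length

    value, _ = parse(0)
    return value


# A tiny document shaped like the problematic torrent: a 20-byte binary
# value stored under magnet-info.info_hash.
digest = bytes.fromhex("36fd06b595119b380df46ab2f2a0b579b1734ca8")
blob = b"d11:magnet-infod9:info_hash20:" + digest + b"ee"

print(bdecode(blob, hash_fields={"info_hash"}))
```

Calling bdecode(blob) without hash_fields raises UnicodeDecodeError in this sketch, which mirrors the situation that started the issue.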

Waiting for your ideas. (Only 1 day, then I will do it the way I like.)

yasuotakei commented on July 2, 2024

  1. Yes. Maybe the field was created by an obscure client or a private torrent index.

  2. For general use, I think your suggestion of passing 'strict' as the errors argument of .decode() is okay.
    But for my use case, giving me the option to pass my own argument would be perfect. Fault tolerance is a desirable quality in a crawler. I need the 'ignore' or 'replace' handler, as I wish to collect as many files as possible.
    Given the scale of my operation, such errors are bound to happen, and I might lose out on thousands of potentially working torrents. I have 87 torrents with the same magnet-info.info_hash issue right now. As long as the torrent works at a minimum, I add it.

  3. Yes, it might be useful to a small percentage of users. If it is not too much work, add it and document it.
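
The crawler use case in point 2 could be handled by trying the strict handler first and retrying leniently only on failure. The sketch below is one possible shape for that, under stated assumptions: parse_with_fallback and fake_parse are hypothetical names, and the parser is passed in as a callable that accepts an errors keyword so the example runs without torrent_parser installed.

```python
def parse_with_fallback(path, parse, handlers=("strict", "replace"),
                        recoverable=(ValueError,)):
    """Try each decoding error handler in order.

    Returns (data, handler_used), or (None, None) if every attempt fails.
    UnicodeDecodeError is a subclass of ValueError, so it is covered by
    the default `recoverable` tuple.
    """
    for handler in handlers:
        try:
            return parse(path, errors=handler), handler
        except recoverable:
            continue
    return None, None


def fake_parse(path, errors="strict"):
    # Stand-in for a real parser: decodes a value that is not valid UTF-8.
    return {"name": b"ok \xfd".decode("utf-8", errors=errors)}


data, used = parse_with_fallback("example.torrent", fake_parse)
print(used, data)
```

With the real library, the callable would presumably be tp.parse_torrent_file and recoverable would be (tp.InvalidTorrentDataException,), assuming the errors option works as described for v0.3.0 in this thread. Recording which handler succeeded lets a crawler flag the files that needed lenient decoding.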

To conclude, I think that if you add different options for achieving different goals (as long as you write good tests and documentation), your library will appeal to a broader audience.

Don't lock out a subset of users. If you need help, tag items on your Kanban board with help wanted.

Thank you.

yasuotakei commented on July 2, 2024

No problem, take your time.

yasuotakei commented on July 2, 2024

Very good and much appreciated. I think we can go ahead and close the issue.
