UnicodeDecodeError: 'gbk' codec can't decode byte 0x82 in position 548: illegal multibyte sequence

danerlt commented on July 30, 2024

UnicodeDecodeError: 'gbk' codec can't decode byte 0x82 in position 548: illegal multibyte sequence

Comments (8)

uranusjr commented on July 30, 2024

Ah, right, I forgot about paths. Falling back with a deprecation warning sounds like the way to go.

matthewhughes934 commented on July 30, 2024

Are you able to share the contents of the requirements.txt file you were using?

danerlt commented on July 30, 2024

@matthewhughes934
The contents of my requirements.txt are as follows:

# server
supervisor==4.2.5
gunicorn==21.2.0
gevent==23.9.1

# web
Werkzeug==2.3.7
celery==5.2.7
click==8.1.7
dataclasses_json==0.6.4
Flask==2.3.3
Flask_Cors==3.0.10
Flask_Login==0.6.2
Flask_Migrate==4.0.5
Flask_RESTful==0.3.9
flask_sqlalchemy==3.0.5
SQLAlchemy==2.0.0
minio==7.2.4
psycopg2-binary==2.9.9
python-dotenv==1.0.1
redis==5.0.2
requests==2.31.0

# rag
langchain==0.1.16
llama-index==0.10.30
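llama-index-core==0.10.30  # 这个必须手指定，不然构建的时候会去获取最新的版本，可能会有bug。 (This must be pinned manually; otherwise the build fetches the latest version, which may have bugs.)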
llama-index-retrievers-bm25==0.1.3
llama-index-storage-index-store-redis==0.1.2
llama-index-storage-kvstore-redis==0.1.3
llama-index-storage-docstore-mongodb==0.1.3
llama-index-vector-stores-milvus==0.1.10
llama-index-vector-stores-qdrant==0.2.5
llama-parse==0.4.1
rank-bm25==0.2.2
ragas==0.1.1
qdrant-client==1.9.0
pymongo==4.6.3
motor==3.4.0
asyncpg==0.29.0
spacy==3.7.4
jieba==0.42.1
./zh_core_web_sm-3.7.0-py3-none-any.whl
scikit-learn==1.4.2


# data loader 相关依赖 (data loader dependencies)
pypdf==4.2.0
pdfminer-six==20231228
PyMuPDF==1.24.2
docx2txt==0.8
python-docx==1.1.0
openpyxl==3.1.2

# 评估相关 (evaluation-related)
dashscope==1.19.2
zhipuai==2.1.0

danerlt commented on July 30, 2024

@matthewhughes934
I modified pip\_internal\utils\encoding.py and added the ignore parameter to its data.decode method, which resolved the issue.

uranusjr commented on July 30, 2024

It’s probably best to always use ascii with replace. We only allow ASCII in requirements, and anything else (e.g. comments) is ignored by the parser anyway.

A PR would be much welcomed.

matthewhughes934 commented on July 30, 2024

> I modified pip\_internal\utils\encoding.py and added the ignore parameter to its data.decode method, which resolved the issue.

I guess the underlying issue is that the file appears to be UTF-8 encoded, but you're working in an environment with a Simplified Chinese locale, which uses GBK for decoding. An alternative solution would be to run Python in UTF-8 mode (https://docs.python.org/3/using/windows.html#utf-8-mode).
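
The mismatch described above can be reproduced in isolation. The following is a minimal sketch, not pip's code: the requirement line and the package name `example` are made up for illustration, and it only assumes that the file bytes are UTF-8 while the decoder is GBK.

```python
# Hypothetical requirement line; the comment ends with a full-width period (U+3002),
# whose UTF-8 encoding is the byte sequence E3 80 82.
line = "example==1.0  # 注释。\n"
raw = line.encode("utf-8")  # the bytes as stored in a UTF-8 file

try:
    raw.decode("gbk")  # what decoding with a GBK locale encoding attempts
    gbk_error = None
except UnicodeDecodeError as exc:
    gbk_error = exc  # strict decoding rejects the byte sequence

# Decoding with the encoding the file was actually written in succeeds.
utf8_text = raw.decode("utf-8")

# The workaround mentioned in the thread: drop undecodable bytes instead of failing.
# The ASCII requirement survives, but the comment may turn into mojibake.
ignored_text = raw.decode("gbk", errors="ignore")

print(repr(gbk_error))
print(utf8_text, end="")
```

For UTF-8 mode itself, Python accepts `-X utf8` on the command line or `PYTHONUTF8=1` in the environment (Python 3.7+), for example `python -X utf8 -m pip install -r requirements.txt`. This should make the locale-derived encoding UTF-8 without patching pip.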

matthewhughes934 commented on July 30, 2024

> It’s probably best to always use ascii with replace. We only allow ASCII in requirements, and anything else (e.g. comments) is ignored by the parser anyway.
>
> A PR would be much welcomed.

👍 happy to get a PR up. I'm wondering two things:

1. If I change auto_decode: are there places where we want decoding to fail (per errors="strict"), or would it be OK to always replace? Or is there code elsewhere that should be changed?
2. 🤔 Is there any potential for issues with multi-byte/non-ASCII-extended encodings? I have no idea how common these might be, but I guess a consequence could be that, instead of getting a 'failed to decode' error, you get an error about pip failing to install a package named "��".
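
To make the second concern concrete, here is a small sketch of what decoding as ASCII with `errors="replace"` does to non-ASCII content. The file contents and the wheel path are invented for illustration; this is not how pip currently decodes requirement files.

```python
# Hypothetical requirements file with a non-ASCII directory name and comment.
content = "./包/example-1.0-py3-none-any.whl\n# 注释\nexample==1.0\n"
raw = content.encode("utf-8")

# Every byte outside the ASCII range becomes U+FFFD (the replacement character).
text = raw.decode("ascii", errors="replace")
lines = text.splitlines()

path_line, comment_line, requirement_line = lines

# The plain-ASCII requirement is untouched, and the damaged comment is harmless,
# but the local path no longer names the directory that exists on disk.
print(repr(path_line))
print(repr(requirement_line))
```

Decoding never fails under this scheme, so a problem that used to surface as a UnicodeDecodeError would instead surface later, as a path or name containing replacement characters.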

pfmoore commented on July 30, 2024

> We only allow ASCII in requirements, and anything else (e.g. comments) is ignored by the parser anyway.

Unfortunately, requirements aren't the only things in a requirement file. --requirement <path to file to include> could include arbitrary Unicode characters, and for that matter a simple local pathname is valid (and could be Unicode).

However, the documentation states that requirement files should be UTF-8 by default, so this seems like a simple bug in auto_decode - https://github.com/pypa/pip/blob/main/src/pip/_internal/utils/encoding.py#L35 should be using UTF-8. (And arguably the BOM detection in there is in violation of the spec, but IMO it's not worth changing).

Of course, even though this is technically a bug fix, it is still potentially a breaking change, so we need to consider how we handle that. (We could fall back to the system encoding if UTF-8 fails, with a deprecation warning. This won't avoid mojibake, but it will catch outright encoding failures.)
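
The fallback idea in the previous paragraph could look roughly like the sketch below. It is only an illustration under stated assumptions: `decode_requirements` and its `fallback_encoding` parameter are hypothetical names, not pip's API, and the BOM detection that auto_decode performs is left out.

```python
import locale
import warnings
from typing import Optional


def decode_requirements(data: bytes, fallback_encoding: Optional[str] = None) -> str:
    """Decode requirements-file bytes as UTF-8, falling back to a legacy encoding.

    Hypothetical helper; `fallback_encoding` defaults to the locale encoding.
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        encoding = fallback_encoding or locale.getpreferredencoding(False)
        warnings.warn(
            f"Requirements file is not valid UTF-8; falling back to {encoding!r}. "
            "This fallback is deprecated.",
            DeprecationWarning,
            stacklevel=2,
        )
        # This can still raise if the bytes are invalid in the fallback encoding too.
        return data.decode(encoding)


# UTF-8 input decodes directly, with no warning.
print(decode_requirements("# 注释\n".encode("utf-8")), end="")

# GBK-encoded input is not valid UTF-8, so the fallback path is taken.
print(decode_requirements("中文".encode("gbk"), fallback_encoding="gbk"))
```

As noted above, this only catches bytes that are outright invalid UTF-8. Bytes that happen to be valid in both encodings would still decode to the wrong characters.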
