[Bug]: prefix for CN works well, but not postfix about milvus HOT 19 OPEN

SpiritedAwayLab commented on July 30, 2024

[Bug]: prefix for CN works well, but not postfix

from milvus.

Comments (19)

yanliang567 commented on July 30, 2024

/assign @xiaocai2333
please help to take a look.
/unassign

xiaofan-luan commented on July 30, 2024

A Chinese character is usually not a single byte in UTF-8. Even MySQL has had a similar problem: https://www.cnblogs.com/daxuejia/p/4558853.html

This might be a bug, but it is something we need to discuss how to improve.

xiaofan-luan commented on July 30, 2024

/assign @longjiquan

SpiritedAwayLab commented on July 30, 2024

But that was in 2015. In 2024 we should no longer run into this issue. In MySQL, if you create a new column with utf8mb3_general_ci, you can easily use like '%说话%'. I have also heard that the new version 2.4.0 has already fixed this issue (match operators for Chinese characters). So I just want to confirm: is this issue known to the community and not fixed yet? If so, I will find an alternative way to make it work.

xiaofan-luan commented on July 30, 2024

That's because Milvus doesn't support any charset right now. We treat all strings as binary data.
If you use == and range comparisons, that will work perfectly. _ and % might not work because Milvus defines one character as one byte, but Chinese characters can be 1-4 bytes.
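
To make the byte-versus-character point concrete, here is a minimal sketch in plain Python. It does not call Milvus; the sample string, the pattern 中_人, and the use of `re` to model "one `_` equals one byte" are assumptions chosen for illustration only.

```python
import re

text = "中国人"
encoded = text.encode("utf-8")

# Three characters, but nine bytes: each of these characters takes 3 bytes in UTF-8.
print(len(text), len(encoded))

# Character-aware reading of LIKE '中_人': '_' consumes exactly one character.
char_match = re.fullmatch("中.人", text) is not None

# Byte-wise reading (a hypothetical model of the behaviour described above):
# '_' consumes exactly one byte, so it cannot cover the 3-byte character in the middle.
byte_pattern = (
    re.escape("中".encode("utf-8")) + b"." + re.escape("人".encode("utf-8"))
)
byte_match = re.fullmatch(byte_pattern, encoded, flags=re.DOTALL) is not None

print(char_match, byte_match)
```

Under these assumptions the character-aware match succeeds and the byte-wise match fails, which is the kind of mismatch the comment above warns about for `_`.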

SpiritedAwayLab commented on July 30, 2024

OK I see. Thanks

xiaofan-luan commented on July 30, 2024

I thought you were trying to run something similar to ES match and match phrase.

That needs the support of an analyzer, which splits a chunk of text into multiple tokens.
We will try to deliver this in 3.0.

xiaofan-luan commented on July 30, 2024

@longjiquan can help with the investigation.

Is there an existing issue for this?

I have searched the existing issues

Environment
- Milvus version:2.4.0
Current Behavior

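It seems that fuzzy string matching with LIKE still only works for prefixes? (I mean for Chinese strings.) With English everything seems to work, but with Chinese only a prefix search works, for example: LIKE '电动车%'. I read that 2.4 added infix and postfix matching; does that apply only to English?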

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

Also, if you can give a case where it failed, that would be very helpful.

longjiquan commented on July 30, 2024

Yes, this issue may exist under the current implementation. We currently use UTF-8 to encode and store Chinese text, but do the pattern matching on the byte stream.

For example, suppose "**人" is encoded as:
[b1, b2, b3, b4, b5, b6], in which "中" is represented by [b1, b2], "国" by [b3, b4], and "人" by [b5, b6].

However, the combination [b2, b3] may represent a different Chinese character, for example "汉", so "**人" may be matched even if you use "汉" to do the matching.
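
Whether such a straddling false match can occur depends on the encoding. The sketch below is plain Python, not Milvus code, and the helper name `straddling_chars` is hypothetical. It checks every byte slice that starts inside one character and ends inside the next. In a double-byte encoding such as GBK the [b2, b3] slice can decode to a valid character, whereas in UTF-8 lead bytes and continuation bytes use disjoint ranges, so no such slice decodes.

```python
def straddling_chars(text: str, encoding: str) -> list[str]:
    """Return characters that decode from bytes spanning two adjacent characters."""
    found = []
    for left, right in zip(text, text[1:]):
        left_bytes = left.encode(encoding)
        joined = left_bytes + right.encode(encoding)
        # Try every slice that starts inside `left` and ends inside `right`.
        for start in range(1, len(left_bytes)):
            for end in range(len(left_bytes) + 1, len(joined)):
                try:
                    decoded = joined[start:end].decode(encoding)
                except UnicodeDecodeError:
                    continue
                if len(decoded) == 1:
                    found.append(decoded)
    return found


gbk_hits = straddling_chars("中国", "gbk")
utf8_hits = straddling_chars("中国", "utf-8")
print(gbk_hits, utf8_hits)
```

With GBK the middle two bytes of "中国" form another valid character, so a byte-level substring search for that character would hit. With UTF-8 the list is empty, so a literal byte-level substring search should not produce this particular kind of false positive; the single-byte `_` wildcard is a separate matter.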

cc, @SpiritedAwayLab @xiaofan-luan

SpiritedAwayLab commented on July 30, 2024

utf-8

Well, in that case, if I use the SQL filter LIKE '%[b3,b4]%', I should get results containing "**人", right?
However, the truth is that I get nothing. After testing all cases, I found that Milvus only supports prefix matching (LIKE '[b1,b2]%'); postfix (LIKE '%[b5,b6]') and infix (LIKE '%[b3,b4]%') do not work.

longjiquan commented on July 30, 2024

@SpiritedAwayLab I think it will be very helpful if you can share the dataset you tested.

xiaofan-luan commented on July 30, 2024

Is there a chance we can follow user-specified encoding types?

SpiritedAwayLab commented on July 30, 2024

There is nothing special about my dataset, I think. But anyway, I have picked some columns, including the text column on which I can't do the match operation, and printed out 10 records. Please check the attachment to see if there is anything wrong with my dataset.
dataset.json

xiaofan-luan commented on July 30, 2024

Which query is not working in your case?

SpiritedAwayLab commented on July 30, 2024

I am using pymilvus now. The query:

client.query(collection_name="test_collection",
    filter="text like '%预期%'",
    limit=10)

Although two records should meet the filter criteria:
"id": "1-33116a6a041991769f74493fceca450c",
"id": "1-67e80fbe177c3f574955aa4352b3b989"
I got nothing with this query.

However, I can get one result if I only check prefix:

client.query(collection_name="test_collection",
    filter="text like '预期%'",
    limit=10)

I get the record with id: "1-33116a6a041991769f74493fceca450c".

xiaofan-luan commented on July 30, 2024

@longjiquan that seems like a bug, if % can match an empty string?

SpiritedAwayLab commented on July 30, 2024

client.query(collection_name="test_collection",
    filter="text like '%%'",
    limit=10)

With this query, the filter behaves as if there were no filter: I get all results.

longjiquan commented on July 30, 2024

I'll take a look.

longjiquan commented on July 30, 2024

Thanks for the use case. The older regex query engine can't handle text containing newlines; this is already fixed in #32569. @SpiritedAwayLab
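
To illustrate how a newline can defeat a regex-based LIKE, here is a minimal sketch using Python's `re` module. It is not the Milvus implementation; the helper names `like_to_regex` and `like`, the translation rules, and the sample text are all assumptions made for illustration.

```python
import re


def like_to_regex(pattern: str) -> str:
    """Translate a LIKE pattern to a regex: % -> .*, _ -> ., everything else literal."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


def like(text: str, pattern: str, dotall: bool) -> bool:
    """Match the whole text against the LIKE pattern, optionally letting '.' match newlines."""
    flags = re.DOTALL if dotall else 0
    return re.fullmatch(like_to_regex(pattern), text, flags) is not None


# Hypothetical record whose text contains a newline before the keyword.
text = "第一行\n符合预期的结果"

without_dotall = like(text, "%预期%", dotall=False)  # '.' stops at the newline
with_dotall = like(text, "%预期%", dotall=True)      # '.' also matches the newline
print(without_dotall, with_dotall)
```

In this sketch the infix match fails only because `.` does not cross the newline by default; enabling `re.DOTALL` makes the same pattern match, which is consistent with the symptom reported in this thread for text containing newlines.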
