Comments (19)

yanliang567 commented on July 30, 2024

/assign @xiaocai2333
Please help take a look.
/unassign

from milvus.

xiaofan-luan commented on July 30, 2024

A Chinese character is usually not a single byte in UTF-8. Even MySQL has a similar problem: https://www.cnblogs.com/daxuejia/p/4558853.html

This might be a bug, but it is something we need to discuss how to improve.

xiaofan-luan commented on July 30, 2024

/assign @longjiquan

SpiritedAwayLab commented on July 30, 2024

> A Chinese character is usually not a single byte in UTF-8. Even MySQL has a similar problem: https://www.cnblogs.com/daxuejia/p/4558853.html
>
> This might be a bug, but it is something we need to discuss how to improve.

But that was in 2015. In 2024 we should not be running into this issue. In MySQL, if you create a new column with utf8mb3_general_ci, you can simply use LIKE '%说话%'. I also heard somewhere that the new version 2.4.0 has already fixed this issue (the match operation on Chinese characters). So I just want to confirm: is this issue known to the community and not fixed yet? If so, I will look for an alternative way to make it work.

xiaofan-luan commented on July 30, 2024

> > A Chinese character is usually not a single byte in UTF-8. Even MySQL has a similar problem: https://www.cnblogs.com/daxuejia/p/4558853.html
> > This might be a bug, but it is something we need to discuss how to improve.
>
> But that was in 2015. In 2024 we should not be running into this issue. In MySQL, if you create a new column with utf8mb3_general_ci, you can simply use LIKE '%说话%'. I also heard somewhere that the new version 2.4.0 has already fixed this issue (the match operation on Chinese characters). So I just want to confirm: is this issue known to the community and not fixed yet? If so, I will look for an alternative way to make it work.

That's because Milvus doesn't support any charset right now. We treat all strings as binary data.
== and range comparisons will work perfectly. _ and % might not work, because Milvus defines one character as one byte, but Chinese characters can take 1-4 bytes.
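
To make the byte-versus-character point concrete, here is a minimal sketch that uses Python's `re` module as a stand-in matcher. It is not how Milvus evaluates LIKE. It only shows what happens when a single-character wildcard is applied to bytes instead of characters, with `.` playing the role of `_`.

```python
import re

text = "说话"
encoded = text.encode("utf-8")

# Each of these two characters takes 3 bytes in UTF-8.
print(len(text), len(encoded))  # 2 6

# Character-level matching: one character followed by "话" matches.
char_match = re.fullmatch(r".话", text) is not None

# Byte-level matching: one byte followed by the bytes of "话" does not
# match, because "说" occupies three bytes rather than one.
byte_pattern = b"." + re.escape("话".encode("utf-8"))
byte_match = re.fullmatch(byte_pattern, encoded) is not None

print(char_match, byte_match)  # True False
```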

SpiritedAwayLab commented on July 30, 2024

xiaofan-luan commented on July 30, 2024

I thought you were trying to run something similar to ES match and match phrase.

That needs analyzer support, which splits a chunk of text into multiple tokens.
We will try to deliver this in 3.0.

xiaofan-luan commented on July 30, 2024

@longjiquan can help with the investigation.

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.4.0

Current Behavior

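It seems that fuzzy string matching with LIKE still only works for prefixes? (I mean for Chinese strings.) With English everything seems to work, but with Chinese only prefix searches work, for example: LIKE '电动车%'. I read that 2.4 added infix and suffix matching; does that apply only to English?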

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

Also, if you can give a case where it fails, that would be very helpful.

longjiquan commented on July 30, 2024

Yes, this issue may exist in the current implementation. We currently use UTF-8 to encode and store Chinese text, but do the pattern matching on the byte stream.

For example, suppose "中国人" is encoded as
[b1, b2, b3, b4, b5, b6], where "中" is represented by [b1, b2], "国" by [b3, b4], and "人" by [b5, b6].

However, the combination [b2, b3] may represent a different Chinese character, for example "汉", so "中国人" may be matched even if you search for "汉".
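
The byte counts above are hypothetical. As a quick check against real UTF-8, the sketch below prints the actual byte layout of "中国人" and compares a character-level substring test with a byte-level one. It is plain Python and says nothing about Milvus internals. With valid UTF-8 input I would expect the two tests to agree, because lead bytes and continuation bytes use disjoint ranges; a match that straddles a character boundary is more typical of two-byte encodings such as GBK.

```python
text = "中国人"
pattern = "汉"

text_bytes = text.encode("utf-8")
pattern_bytes = pattern.encode("utf-8")

# Show the bytes each character occupies in UTF-8 (3 bytes each here).
for ch in text:
    print(ch, ch.encode("utf-8").hex(" "))

char_level = pattern in text              # compares code points
byte_level = pattern_bytes in text_bytes  # compares raw bytes
print(char_level, byte_level)
```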

cc, @SpiritedAwayLab @xiaofan-luan

SpiritedAwayLab commented on July 30, 2024

utf-8

Well, in that case, if I use the filter LIKE '%[b3,b4]%', I should get results containing "中国人", right?
However, the truth is that I get nothing. After testing all cases, I found that Milvus only supports prefix matching (LIKE '[b1,b2]%'); suffix (LIKE '%[b5,b6]') and infix (LIKE '%[b3,b4]%') matching do not work.

longjiquan commented on July 30, 2024

@SpiritedAwayLab I think it will be very helpful if you can share the dataset you tested.

xiaofan-luan commented on July 30, 2024

Is there a chance we can follow user-specified encoding types?

SpiritedAwayLab commented on July 30, 2024

> @SpiritedAwayLab I think it will be very helpful if you can share the dataset you tested.

There is nothing special about my dataset, I think. Anyway, I have selected some columns, including the text column on which I can't do the match operation, and printed out 10 rows. Please check the attachment to see if there is anything wrong with my dataset.
dataset.json

xiaofan-luan commented on July 30, 2024

Which query is not working in your case?

SpiritedAwayLab commented on July 30, 2024

I am using pymilvus. The query:

client.query(collection_name="test_collection",
    filter="text like '%预期%'",
    limit=10)

There should be 2 records that meet the filter criteria:
"id": "1-33116a6a041991769f74493fceca450c",
"id": "1-67e80fbe177c3f574955aa4352b3b989"
but I get nothing from this query.

However, I get one result if I only match the prefix:

client.query(collection_name="test_collection",
    filter="text like '预期%'",
    limit=10)

I get the record with id "1-33116a6a041991769f74493fceca450c".

xiaofan-luan commented on July 30, 2024

@longjiquan that seems like a bug, given that % can match an empty string?

SpiritedAwayLab commented on July 30, 2024

client.query(collection_name="test_collection",
    filter="text like '%%'",
    limit=10)

With this query, the filter behaves as if there were no filter; I get all results.

longjiquan commented on July 30, 2024

> @longjiquan that seems like bug? if % can match with nothing

I'll take a look.

longjiquan commented on July 30, 2024

Thanks for the use case. The older regex query engine can't handle text containing newlines; this is already fixed in #32569. @SpiritedAwayLab
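
For readers wondering how a newline can break an infix match while a prefix match still works, here is a minimal sketch of that general class of bug, using Python's `re` module. It is an illustration only, not the Milvus implementation or the change made in the fix. The `like_to_regex` helper and the sample text are made up for the example. When `%` is translated to `.*`, the `.` does not match a newline unless the regex is compiled in "dot matches all" mode.

```python
import re


def like_to_regex(pattern: str) -> str:
    """Translate a SQL-style LIKE pattern into a regex (hypothetical helper)."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # any run of characters
        elif ch == "_":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


# Made-up sample: the searched phrase sits after a newline.
text = "第一行\n符合预期的结果"
regex = like_to_regex("%预期%")

# Without DOTALL, ".*" stops at the newline, so the full match fails.
without_dotall = re.fullmatch(regex, text) is not None
# With DOTALL, "." also matches "\n", so the infix pattern matches.
with_dotall = re.fullmatch(regex, text, flags=re.DOTALL) is not None

print(without_dotall, with_dotall)  # False True
```

This would also be consistent with prefix queries still returning some rows: a row is affected only when a wildcard has to span a newline.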
