Comments (19)
/assign @xiaocai2333
please help to take a look.
/unassign
from milvus.
A Chinese character is usually not a single byte in UTF-8. Even MySQL has had a similar problem: https://www.cnblogs.com/daxuejia/p/4558853.html
This might be a bug, but how to improve it is something we need to discuss.
from milvus.
/assign @longjiquan
from milvus.
But that was in 2015. In 2024 we should no longer run into this issue. In MySQL, if you create a new column with utf8mb3_general_ci, you can simply use LIKE '%说话%'. I have also heard that the new version 2.4.0 has already fixed this issue (for the Chinese character match op). So I just want to confirm: is this issue known to the community and not fixed yet? If so, I will find an alternative way to make it work.
from milvus.
That's because Milvus doesn't support any charset right now. We treat all strings as binary data.
== and range comparisons will work perfectly. _ and % might not work because Milvus defines one character as one byte, whereas a UTF-8 character can take 1-4 bytes (Chinese characters typically take 3).
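To make the byte-versus-character point concrete, here is a minimal sketch in plain Python. It does not use Milvus; the names `char_level` and `byte_level` are illustrative only, and the regexes are just a stand-in for how a `_` wildcard behaves when "one character" means one code point versus one byte.

```python
import re

text = "说话"
raw = text.encode("utf-8")

# Two characters, but six bytes: each of these hanzi takes 3 bytes in UTF-8.
n_chars, n_bytes = len(text), len(raw)

tail = "话"
tail_raw = re.escape(tail.encode("utf-8"))

# '_' interpreted as "one character": a single wildcard covers "说".
char_level = re.fullmatch("." + re.escape(tail), text, re.DOTALL) is not None

# '_' interpreted as "one byte": a single wildcard cannot cover the 3 bytes of "说".
byte_level = re.fullmatch(b"." + tail_raw, raw, re.DOTALL) is not None

# Three byte-level wildcards are needed to skip the same character.
byte_level_x3 = re.fullmatch(b"..." + tail_raw, raw, re.DOTALL) is not None

print(n_chars, n_bytes, char_level, byte_level, byte_level_x3)
```

Equality and range comparisons are unaffected by this difference because comparing the raw bytes of two UTF-8 strings gives the same equality result as comparing the strings themselves.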
from milvus.
I thought you were trying to run something similar to ES match and match phrase.
That needs analyzer support, which splits a chunk of text into multiple tokens.
We will try to deliver this in 3.0.
from milvus.
@longjiquan can help with the investigation.
Is there an existing issue for this?
- I have searched the existing issues
Environment
- Milvus version: 2.4.0
Current Behavior
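It seems that fuzzy string queries with LIKE still only work for prefixes? (I mean for Chinese strings.) English strings all seem to work, but for Chinese only a prefix search works, for example: LIKE '电动车%'. I read that 2.4 added infix and suffix queries; does that apply only to English?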
Expected Behavior
No response
Steps To Reproduce
No response
Milvus Log
No response
Anything else?
No response
Also, if you can give a case where it failed, that would be very helpful.
from milvus.
Yes, this issue may exist under the current implementation. We currently use UTF-8 to encode and store Chinese text, but do the pattern matching on the byte stream.
For example, suppose "中国人" is encoded as:
[b1, b2, b3, b4, b5, b6], where "中" is represented by [b1, b2], "国" by [b3, b4], and "人" by [b5, b6].
However, the combination [b2, b3] may represent another Chinese character, for example "汉", so "中国人" may be matched even if you use "汉" to do the matching.
cc, @SpiritedAwayLab @xiaofan-luan
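The straddling-bytes scenario above can be sketched in plain Python. One assumption to flag: the two-bytes-per-character layout in the example resembles a GBK-style encoding, so the sketch uses GBK to show the false positive. For comparison it also checks the same kind of slice in UTF-8, where lead bytes and continuation bytes occupy disjoint ranges, so a slice that straddles two characters is not a valid encoding of any character. This is only an illustration of byte-level matching, not a reproduction of Milvus internals.

```python
text = "中国"

# GBK: each of these hanzi is two bytes, so bytes 1..2 straddle both characters.
gbk = text.encode("gbk")
straddle = gbk[1:3].decode("gbk")  # decodes to some other, unrelated hanzi

# The straddling character is not in the text, yet its bytes are in the byte stream.
gbk_false_positive = (straddle not in text) and (straddle.encode("gbk") in gbk)

# UTF-8: each of these hanzi is three bytes; take a slice that straddles them.
utf8 = text.encode("utf-8")
try:
    utf8[1:4].decode("utf-8")
    utf8_straddle_decodes = True
except UnicodeDecodeError:
    utf8_straddle_decodes = False

print(gbk_false_positive, utf8_straddle_decodes)
```

If the stored bytes are valid UTF-8 and the search pattern is valid UTF-8 too, a plain byte-level substring search should therefore only hit at character boundaries; the single-character wildcard `_` remains the case where byte and character semantics differ.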
from milvus.
Well, in that situation, does it mean that if I use the filter LIKE '%[b3,b4]%', I will get results containing "中国人"?
However, the truth is that I get nothing. After testing all cases, I found that Milvus only supports prefix matching (LIKE '[b1,b2]%'); postfix (LIKE '%[b5,b6]') and infix (LIKE '%[b3,b4]%') matching do not work.
from milvus.
@SpiritedAwayLab I think it will be very helpful if you can share the dataset you tested.
from milvus.
Is there a chance we can follow user-specified encoding types?
from milvus.
There is nothing special about my dataset, I think. Anyway, I have selected some columns, including the text column on which the match op does not work, and printed out 10 records. Please check the attachment to see whether anything is wrong with my dataset.
dataset.json
from milvus.
What is the query that is not working in your case?
from milvus.
I am using pymilvus. The query is:
client.query(collection_name="test_collection",
             filter="text like '%预期%'",
             limit=10)
Two records in the dataset should meet the filter criteria:
"id": "1-33116a6a041991769f74493fceca450c",
"id": "1-67e80fbe177c3f574955aa4352b3b989"
but this query returns nothing.
However, I do get one result if I only check the prefix:
client.query(collection_name="test_collection",
             filter="text like '预期%'",
             limit=10)
This returns the record with id "1-33116a6a041991769f74493fceca450c".
from milvus.
@longjiquan that seems like a bug? If % can match with nothing.
from milvus.
client.query(collection_name="test_collection",
             filter="text like '%%'",
             limit=10)
With this query, the filter behaves as if there were no filter: I get all results.
from milvus.
I'll take a look.
from milvus.
Thanks for the use case. The older regex query engine can't handle text containing newlines; this is already fixed in #32569. @SpiritedAwayLab
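For readers wondering how a newline can break an infix LIKE, here is a minimal sketch of that class of bug in plain Python. It is an assumption-laden illustration, not the Milvus implementation or the change made in #32569: `like_to_regex` and `like_match` are hypothetical helpers that translate a LIKE pattern to a regex, and the sample text is made up.

```python
import re


def like_to_regex(pattern: str) -> str:
    """Translate a LIKE pattern to a regex (hypothetical helper, no escape-char support)."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")  # any run of characters
        elif ch == "_":
            parts.append(".")   # exactly one character
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


def like_match(pattern: str, text: str, dotall: bool) -> bool:
    """Match the whole text against the LIKE pattern."""
    flags = re.DOTALL if dotall else 0
    return re.fullmatch(like_to_regex(pattern), text, flags) is not None


# Made-up multi-line text containing the search term on the second line.
text = "第一行\n结果符合预期\n第三行"

# Without DOTALL, '.' does not match '\n', so '%' cannot span lines.
without_dotall = like_match("%预期%", text, dotall=False)

# With DOTALL, '%' can cross newlines and the infix match succeeds.
with_dotall = like_match("%预期%", text, dotall=True)

# Single-line text matches either way.
single_line = like_match("%预期%", "结果符合预期", dotall=False)

print(without_dotall, with_dotall, single_line)
```

Under these assumptions the failure depends on the stored text containing newlines rather than on the characters being Chinese, which is consistent with the diagnosis above.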
from milvus.