Comments (13)
There are many timeout failures when querycoordv2 calls datacoord.
Suggestions:
- Stop calling flush manually, if you do.
- Stop creating more collections; if you are doing multi-tenancy, consider using partition keys.
- If the cluster still cannot recover after that, please take a pprof of datacoord (port 9091), for example with go tool pprof http://<datacoord-host>:9091/debug/pprof/profile, to see what its current performance bottleneck is.
- Increase queryCoord.brokerTimeout (5s by default). This might be a bad idea, but your cluster might be able to recover.
- Check how many segments you have with birdwatcher: https://github.com/milvus-io/birdwatcher
This cluster is a bit messed up, but overall it looks like datacoord is bottlenecked by too many collections and segments. If possible, I would recommend rebuilding a new cluster and rethinking your schema.
from milvus.
Did you load the partition before inserting?
Yes. A partition is a chat in our CVP, and follow-up questions on a chat eventually lead to insert and manual flush operations.
/assign @ganderaj
Please try the suggestions in the comments above.
I think for some reason you have too many segments in your cluster. This could happen because:
- you have too many small collections -> a Milvus cluster should not exceed 10000 collections
- you call flush every time you do a small insertion -> from the log, each segment has only tens of entities, for example:
[2024/06/21 15:50:01.531 +00:00] [INFO] [datacoord/index_service.go:605] ["completeIndexInfo success"] [collectionID=449705667927213932] [indexID=449705667927213961] [totalRows=3] [indexRows=3] [pendingIndexRows=0] [state=Finished] [failReason=]
[2024/06/21 15:50:01.531 +00:00] [INFO] [datacoord/index_service.go:732] ["DescribeIndex success"] [traceID=1c70deb1a01aab13fb497332856d7b39] [collectionID=449705667927213932] [indexName=]
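To avoid the tiny-segment pattern above, one option is to batch writes client-side and issue one insert per batch, relying on Milvus's automatic flushing instead of calling flush() after every small insert. An illustrative plain-Python pattern (not a Milvus API; insert_fn stands in for e.g. collection.insert in pymilvus):

```python
# Illustrative client-side batching pattern (plain Python, not a Milvus API):
# buffer small writes and issue one insert per batch, letting Milvus seal and
# flush segments automatically instead of calling flush() per tiny insert.
class BatchedWriter:
    def __init__(self, insert_fn, batch_size=1000):
        self.insert_fn = insert_fn  # e.g. collection.insert in pymilvus
        self.batch_size = batch_size
        self.buffer = []

    def add(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self.drain()

    def drain(self):
        if self.buffer:
            self.insert_fn(list(self.buffer))  # one batched insert, no flush()
            self.buffer.clear()

# Demo with a stub list in place of a real collection:
batches = []
writer = BatchedWriter(batches.append, batch_size=3)
for i in range(7):
    writer.add({"id": i})
writer.drain()
print([len(b) for b in batches])  # [3, 3, 1]
```

Seven one-row writes become three inserts instead of seven flushed segments; real batch sizes would be much larger.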
The immediate reason your cluster crashed is:
[2024/06/21 15:50:38.313 +00:00] [WARN] [meta/coordinator_broker.go:148] ["get recovery info failed"] [collectionID=449660350148322962] [partitionIDis="[]"] [error="stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:556 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:570 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:107 github.com/milvus-io/milvus/internal/distributed/datacoord/client.wrapGrpcCall[...]\n/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:339 github.com/milvus-io/milvus/internal/distributed/datacoord/client.(*Client).GetRecoveryInfoV2\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/coordinator_broker.go:145 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*CoordinatorBroker).GetRecoveryInfoV2\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:211 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).PullNextTargetV2.func1\n/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44 github.com/milvus-io/milvus/pkg/util/retry.Do\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:249 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).PullNextTargetV2\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:121 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).UpdateCollectionNextTarget: rpc error: code = DeadlineExceeded desc = context deadline exceeded"] [errorVerbose="stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 
github.com/milvus-io/milvus/pkg/tracer.StackTrace: rpc error: code = DeadlineExceeded desc = context deadline exceeded\n(1) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n | \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:556\n | github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n | \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:570\n | github.com/milvus-io/milvus/internal/distributed/datacoord/client.wrapGrpcCall[...]\n | \t/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:107\n | github.com/milvus-io/milvus/internal/distributed/datacoord/client.(*Client).GetRecoveryInfoV2\n | \t/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:339\n | github.com/milvus-io/milvus/internal/querycoordv2/meta.(*CoordinatorBroker).GetRecoveryInfoV2\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/coordinator_broker.go:145\n | github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).PullNextTargetV2.func1\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:211\n | github.com/milvus-io/milvus/pkg/util/retry.Do\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44\n | github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).PullNextTargetV2\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:249\n | github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).UpdateCollectionNextTarget\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:121\n | github.com/milvus-io/milvus/internal/querycoordv2/observers.(*TargetObserver).updateNextTarget\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/observers/target_observer.go:283\n | 
github.com/milvus-io/milvus/internal/querycoordv2/observers.(*TargetObserver).check\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/observers/target_observer.go:201\n | github.com/milvus-io/milvus/internal/querycoordv2/observers.(*taskDispatcher[...]).schedule.func1.1\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/observers/task_dispatcher.go:101\n | github.com/milvus-io/milvus/pkg/util/conc.(*Pool[...]).Submit.func1\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/conc/pool.go:81\n | github.com/panjf2000/ants/v2.(*goWorker).run.func1\n | \t/go/pkg/mod/github.com/panjf2000/ants/[email protected]/worker.go:67\n | runtime.goexit\n | \t/usr/local/go/src/runtime/asm_amd64.s:1598\nWraps: (2) stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace\n | /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:556 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n | /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:570 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n | /go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:107 github.com/milvus-io/milvus/internal/distributed/datacoord/client.wrapGrpcCall[...]\n | /go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:339 github.com/milvus-io/milvus/internal/distributed/datacoord/client.(*Client).GetRecoveryInfoV2\n | /go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/coordinator_broker.go:145 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*CoordinatorBroker).GetRecoveryInfoV2\n | /go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:211 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).PullNextTargetV2.func1\n | /go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44 
github.com/milvus-io/milvus/pkg/util/retry.Do\n | /go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:249 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).PullNextTargetV2\n | /go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:121 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).UpdateCollectionNextTarget\nWraps: (3) rpc error: code = DeadlineExceeded desc = context deadline exceeded\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *status.Error"]
Hello @xiaofan-luan
Thank you for your review. Per our current design, we have around 3600 vectors spread across 630 collections and 1715 partitions. We acknowledge that it is not ideal to work with this many partitions in Milvus, and we are working on a new design based on a partition-key multi-tenancy strategy.
However, in the meantime we want to better understand the challenges in Milvus, since the behavior is only intermittent. Can you please help me with the below:
- Is queryCoord.brokerTimeout an allowed parameter in extraConfigFiles, which goes into the Helm values file?
- I have rebuilt a new test cluster and attempted to restore this database onto it, which failed with "context deadline exceeded" error messages. Is this expected behavior?
Restore command:
nohup /PATH/TO/milvus-backup restore --databases docqa_prod --name backup_None_2406221600 --restore_index true --config sbx_restore_prod03.yaml &
Restore log, which includes the error messages: milvus_backup-restore_error.txt
- Yes, you can change brokerTimeout with extra config.
- Some calls in the cluster are really slow (GetRecoveryInfoV2). This is an interface that iterates over metadata. I don't have info on why it is slow (need pprof), but it is likely related to the partition/collection count. We've optimized some key paths in later 2.4 versions,
so you can try to upgrade and see if it works. If not, I will need a pprof of datacoord to check which execution path eats your CPU.
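For the extraConfigFiles question above, a sketch of the Helm values override, assuming the standard milvus-helm extraConfigFiles mechanism and that brokerTimeout is expressed in milliseconds (verify both against the milvus.yaml shipped with your version):

```yaml
# Helm values sketch (keys assumed from the milvus-helm chart; verify per version)
extraConfigFiles:
  user.yaml: |+
    queryCoord:
      brokerTimeout: 30000  # assumed milliseconds; default 5000 (5s)
```

This only buys headroom for the slow GetRecoveryInfoV2 calls; it does not fix the underlying datacoord bottleneck.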
I have an interesting observation to share. We have different Milvus implementations at our organization, and I chose a database (10 collections, 10 partitions, 73K vectors) from our Dev environment.
I tried to restore this database on the newly created Milvus (v2.4.1) instance, which failed with a similar error message: test_backup_error.txt. To validate the integrity of the backup, I attempted to restore it onto the same Milvus instance where the backup was taken, which completed successfully.
Considering the restore fails consistently with 3 different databases from 2 Milvus instances, I assume something is wrong with my newly created Milvus instance. Can you please review the error message and suggest a potential cause?
/assign @wayblink
/unassign
This is not a backup issue.
The problem is that requests to datacoord time out.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.