Comments (13)

xiaofan-luan commented on September 15, 2024

There are many timeout failures when querycoordv2 calls datacoord.

Suggestions:

  1. Stop calling flush manually, if you do so anywhere.
  2. Stop creating more collections; if you are doing multi-tenancy, consider using partition keys instead.
  3. If the cluster still cannot recover after that, please take a pprof of datacoord (port 9091) to see what its current performance bottleneck is (see the command sketch after this list).
  4. Increase queryCoord.brokerTimeout (5s by default). This might be a bad idea, but your cluster might be able to recover.
  5. Check how many segments you have with birdwatcher (also sketched below): https://github.com/milvus-io/birdwatcher
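
For suggestions 3 and 5, a minimal command sketch, assuming a Kubernetes deployment where the datacoord pod serves pprof on its metrics port 9091. The pod and etcd names below are hypothetical, and birdwatcher command names can differ across versions, so verify against your setup:

    # Port-forward datacoord's metrics port locally (pod name is hypothetical).
    kubectl port-forward pod/my-release-milvus-datacoord-0 9091:9091 &

    # Capture a 30-second CPU profile from the pprof endpoint and print the top consumers.
    go tool pprof -top "http://localhost:9091/debug/pprof/profile?seconds=30"

    # Inspect segments with birdwatcher (etcd endpoint and root path are hypothetical).
    birdwatcher
    # Inside the birdwatcher shell:
    #   connect --etcd my-release-etcd:2379 --rootPath by-dev
    #   show segment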

This cluster is a bit messed up, but overall it looks like datacoord is bottlenecked by too many collections and segments. If possible, I would recommend building a new cluster and rethinking your schema.

xiaofan-luan commented on September 15, 2024

Did you load the partition before insert?

ganderaj commented on September 15, 2024

Did you load the partition before insert?

Yes. A partition is a chat on our CVP, and follow-up questions in a chat eventually lead to insert and manual flush operations.

yanliang567 commented on September 15, 2024

/assign @ganderaj
please try the suggestions in the comments above

xiaofan-luan commented on September 15, 2024

I think for some reason you have too many segments in your cluster. This could happen because:

  1. you have too many small collections -> a Milvus cluster should not exceed 10,000 collections
  2. you call flush every time you do a small insertion -> from the logs, each segment holds only tens of entities, for example:
    [2024/06/21 15:50:01.531 +00:00] [INFO] [datacoord/index_service.go:605] ["completeIndexInfo success"] [collectionID=449705667927213932] [indexID=449705667927213961] [totalRows=3] [indexRows=3] [pendingIndexRows=0] [state=Finished] [failReason=]
    [2024/06/21 15:50:01.531 +00:00] [INFO] [datacoord/index_service.go:732] ["DescribeIndex success"] [traceID=1c70deb1a01aab13fb497332856d7b39] [collectionID=449705667927213932] [indexName=]

The immediate reason your cluster crashed is:

[2024/06/21 15:50:38.313 +00:00] [WARN] [meta/coordinator_broker.go:148] ["get recovery info failed"] [collectionID=449660350148322962] [partitionIDis="[]"] [error="stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:556 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:570 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:107 github.com/milvus-io/milvus/internal/distributed/datacoord/client.wrapGrpcCall[...]\n/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:339 github.com/milvus-io/milvus/internal/distributed/datacoord/client.(*Client).GetRecoveryInfoV2\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/coordinator_broker.go:145 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*CoordinatorBroker).GetRecoveryInfoV2\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:211 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).PullNextTargetV2.func1\n/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44 github.com/milvus-io/milvus/pkg/util/retry.Do\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:249 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).PullNextTargetV2\n/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:121 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).UpdateCollectionNextTarget: rpc error: code = DeadlineExceeded desc = context deadline exceeded"] [errorVerbose="stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace: rpc error: code = DeadlineExceeded desc = context deadline exceeded\n(1) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n | \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:556\n | github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n | \t/go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:570\n | github.com/milvus-io/milvus/internal/distributed/datacoord/client.wrapGrpcCall[...]\n | \t/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:107\n | github.com/milvus-io/milvus/internal/distributed/datacoord/client.(*Client).GetRecoveryInfoV2\n | \t/go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:339\n | github.com/milvus-io/milvus/internal/querycoordv2/meta.(*CoordinatorBroker).GetRecoveryInfoV2\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/coordinator_broker.go:145\n | github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).PullNextTargetV2.func1\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:211\n | github.com/milvus-io/milvus/pkg/util/retry.Do\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44\n | github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).PullNextTargetV2\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:249\n | 
github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).UpdateCollectionNextTarget\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:121\n | github.com/milvus-io/milvus/internal/querycoordv2/observers.(*TargetObserver).updateNextTarget\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/observers/target_observer.go:283\n | github.com/milvus-io/milvus/internal/querycoordv2/observers.(*TargetObserver).check\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/observers/target_observer.go:201\n | github.com/milvus-io/milvus/internal/querycoordv2/observers.(*taskDispatcher[...]).schedule.func1.1\n | \t/go/src/github.com/milvus-io/milvus/internal/querycoordv2/observers/task_dispatcher.go:101\n | github.com/milvus-io/milvus/pkg/util/conc.(*Pool[...]).Submit.func1\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/conc/pool.go:81\n | github.com/panjf2000/ants/v2.(*goWorker).run.func1\n | \t/go/pkg/mod/github.com/panjf2000/ants/[email protected]/worker.go:67\n | runtime.goexit\n | \t/usr/local/go/src/runtime/asm_amd64.s:1598\nWraps: (2) stack trace: /go/src/github.com/milvus-io/milvus/pkg/tracer/stack_trace.go:51 github.com/milvus-io/milvus/pkg/tracer.StackTrace\n | /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:556 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).Call\n | /go/src/github.com/milvus-io/milvus/internal/util/grpcclient/client.go:570 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n | /go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:107 github.com/milvus-io/milvus/internal/distributed/datacoord/client.wrapGrpcCall[...]\n | /go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/client/client.go:339 github.com/milvus-io/milvus/internal/distributed/datacoord/client.(*Client).GetRecoveryInfoV2\n | /go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/coordinator_broker.go:145 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*CoordinatorBroker).GetRecoveryInfoV2\n | /go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:211 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).PullNextTargetV2.func1\n | /go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44 github.com/milvus-io/milvus/pkg/util/retry.Do\n | /go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:249 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).PullNextTargetV2\n | /go/src/github.com/milvus-io/milvus/internal/querycoordv2/meta/target_manager.go:121 github.com/milvus-io/milvus/internal/querycoordv2/meta.(*TargetManager).UpdateCollectionNextTarget\nWraps: (3) rpc error: code = DeadlineExceeded desc = context deadline exceeded\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *status.Error"]

ganderaj commented on September 15, 2024

Hello @xiaofan-luan

Thank you for your review. Per our current design, we have around 3600 vectors spread across 630 collections and 1715 partitions. We acknowledge that it is not ideal to work with so many partitions in Milvus, and we are working on a new design based on a partition-key multi-tenancy strategy.

However, in the meantime we wanted to better understand the challenges in Milvus, since the behavior is only intermittent. Can you please help me with the below:

  • Is queryCoord.brokerTimeout an allowed parameter in extraConfigFiles, which goes into the Helm values file?
  • I have built a new test cluster and attempted to restore this database onto it, which failed with "context deadline exceeded" error messages. Is this expected behavior?

ganderaj commented on September 15, 2024

Restore command:
nohup /PATH/TO/milvus-backup restore --databases docqa_prod --name backup_None_2406221600 --restore_index true --config sbx_restore_prod03.yaml &

Restore log, which includes the error messages: milvus_backup-restore_error.txt

xiaofan-luan commented on September 15, 2024

Hello @xiaofan-luan

Thank you for your review. Per our current design, we have around 3600 vectors spread across 630 collections and 1715 partitions. We acknowledge that it is not ideal to work with so many partitions in Milvus, and we are working on a new design based on a partition-key multi-tenancy strategy.

However, in the meantime we wanted to better understand the challenges in Milvus, since the behavior is only intermittent. Can you please help me with the below:

  • Is queryCoord.brokerTimeout an allowed parameter in extraConfigFiles, which goes into the Helm values file?
  • I have built a new test cluster and attempted to restore this database onto it, which failed with "context deadline exceeded" error messages. Is this expected behavior?
  1. Yes, you can change brokerTimeout with the extra config (see the sketch after this list).
  2. Some calls in the cluster are really slow (GetRecoveryInfoV2). This is an interface that iterates over metadata. I don't have information on why it is slow (we would need pprof), but it is likely related to the partition/collection count. We've optimized some key paths in later 2.4.x versions.
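
A minimal sketch of that override via the Helm chart's extraConfigFiles, assuming a release named my-release, the milvus/milvus chart, and that brokerTimeout is read in milliseconds (all assumptions to verify against your chart and Milvus version):

    # Append an override block to the Helm values file (file and release names are hypothetical).
    cat >> values.yaml <<'EOF'
    extraConfigFiles:
      user.yaml: |+
        queryCoord:
          brokerTimeout: 30000   # assumed milliseconds; default 5000 (5s)
    EOF

    # Roll the change out to the running cluster.
    helm upgrade my-release milvus/milvus -f values.yaml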

xiaofan-luan commented on September 15, 2024

So you can try to upgrade and see if it works. If not, I will need a pprof of datacoord to check which execution path is eating your CPU.

ganderaj commented on September 15, 2024

@xiaofan-luan

I have an interesting observation to share. We have different Milvus implementations in our organization, and I have chosen a database (10 collections, 10 partitions, and 73K vectors) from our Dev environment.

I tried to restore this database onto the newly created Milvus (v2.4.1) instance, which ended in a similar failure message: test_backup_error.txt. To validate the integrity of the backup, I attempted to restore it onto the same Milvus instance where the backup was taken, which completed successfully.

Considering that restore fails consistently with 3 different databases from 2 Milvus instances, I assume something is wrong with my newly created Milvus instance. Can you please review the error message and suggest a potential cause?

yanliang567 commented on September 15, 2024

/assign @wayblink
/unassign

xiaofan-luan commented on September 15, 2024

/assign @wayblink /unassign

This is not a backup issue.
The problem is that requests to datacoord time out.
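
To confirm you are hitting the same timeout path, a quick scan of the querycoord logs for the error strings seen above (the deployment name is hypothetical; adjust to your release):

    # Look for DeadlineExceeded on the querycoord -> datacoord broker path.
    kubectl logs deploy/my-release-milvus-querycoord --since=1h \
      | grep -E "DeadlineExceeded|get recovery info failed"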

stale commented on September 15, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.
