Comments (10)
As per my understanding, our main issue is this:
When the Postgres DB is down and we create a bucket or account, the operation fails. Once the database is up and running again and we try to create the same bucket/account, it returns "BUCKET_ALREADY_OWNED_BY_YOU" even though the bucket does not show up in list_buckets. Eventually, however, the bucket does appear in the list.
Findings -
With the latest master branch, I could not reproduce the issue.
I thought that #7167 could have fixed it, but removing that change had the same effect and I still could not reproduce the issue.
So when the database is completely down, creating a bucket fails, and once the DB is up again we can create the same bucket without failure.
Having said that, we have the following observation, which may or may not be problematic depending on what kind of experience we want customers to have.
When we bring the DB down by editing the replica count using
"kubectl edit statefulset noobaa-db-pg", the DB pod starts terminating.
This termination happens in steps, which means that even while the DB pod is in the Terminating state, it might still be accepting some requests from core.
If we try to create buckets/accounts while the DB is terminating and going down, the user may receive an error message indicating that the bucket creation failed.
However, if we bring the DB up and run list_buckets a few seconds/minutes later, it shows the bucket name. That means the changes (done by system_store.make_changes) did reach the DB; it is only the info fetch in a later step of bucket creation that fails.
This gives the impression that the bucket creation failed when it actually succeeded.
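The flow above can be sketched as follows. This is a minimal illustration of the failure mode, assuming a DB that goes down between the write and the follow-up read; all names (createBucketSketch, makeFlakyDb, etc.) are simplified stand-ins, not the real noobaa-core API.

```javascript
// Sketch: the DB write succeeds, but the follow-up read fails because the
// DB pod is mid-termination, so the caller sees an error even though the
// bucket was persisted. Simplified stand-ins for the real noobaa-core flow.
async function createBucketSketch(db, name) {
    // Step 1: persist the new bucket (the system_store.make_changes step).
    await db.insert(name);        // succeeds while the DB still accepts writes
    // Step 2: reload/verify state (the load_system_store step).
    const info = await db.load(); // throws: "the database system is shutting down"
    return info.buckets.includes(name);
}

// A fake DB that goes down between the write and the read,
// mimicking a pod caught in the Terminating state.
function makeFlakyDb() {
    const buckets = [];
    return {
        buckets,
        insert: async name => { buckets.push(name); },
        load: async () => { throw new Error('the database system is shutting down'); },
    };
}

async function demo() {
    const db = makeFlakyDb();
    let userSawError = false;
    try {
        await createBucketSketch(db, 'banana-1');
    } catch (err) {
        userSawError = true;      // the user is told creation failed...
    }
    // ...yet the entry is in the DB, so a later list_buckets shows the bucket.
    return { userSawError, persisted: db.buckets.includes('banana-1') };
}
```

The key point is that the error surfaces after the write is already durable, so the user-facing result and the DB state disagree.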
Example:
Just after changing the replica count of the DB to 0 and saving the file, I tried to create bucket "banana-1":
[host1@host1 noobaa-operator]$ nb api bucket create_bucket '{"name":"banana-1"}'
INFO[0000]
INFO[0000]
INFO[0000]
INFO[0000]
INFO[0000]
WARN[0000] RPC: GetConnection creating connection to wss://localhost:34115/rpc/ 0xc000c49b00
INFO[0000] RPC: Connecting websocket (0xc000c49b00) &{RPC:0xc0005a1040 Address:wss://localhost:34115/rpc/ State:init WS: PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:0s cancelPings:}
INFO[0000] RPC: Connected websocket (0xc000c49b00) &{RPC:0xc0005a1040 Address:wss://localhost:34115/rpc/ State:init WS: PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:0s cancelPings:}
ERRO[0000]
FATA[0000]
Here it looks like the bucket creation has failed.
=================================================
Looking at the logs, we can see that the make_changes call to the DB was successful.
Feb-20 18:22:05.136 [WebServer/40] [L3] core.rpc.rpc:: RPC._on_request: ENTER srv bucket_api.create_bucket reqid wss://localhost:34115/rpc/-0 connid ws://[::ffff:127.0.0.1]:50818/(9qag2ejid)
Feb-20 18:22:05.138 [WebServer/40] [L1] core.server.common_services.auth_server:: load_auth: { system: 'admin' } { account_id: '63f34f9c7cd54d002814d204', system_id: '63f34f8d7cd54d002814d1f4', role: 'operator', authorized_by: 'noobaa', iat: 1676890013 }
Feb-20 18:22:05.138 [WebServer/40] [L3] core.server.common_services.auth_server:: load auth system: 63f34f8d7cd54d002814d1f4
Feb-20 18:22:05.138 [WebServer/40] [L0] core.server.system_services.bucket_server:: validate_non_nsfs_bucket_create_allowed: undefined undefined
Feb-20 18:22:05.139 [WebServer/40] [L0] core.server.notifications.dispatcher:: Adding ActivityLog entry { event: 'bucket.create', level: 'info', system: 63f34f8d7cd54d002814d1f4, actor: 63f34f9c7cd54d002814d204, bucket: 63f3ba4d7cd54d002814d2f3, desc: SENSITIVE-f4b6955d5488d167 }
Feb-20 18:22:05.141 [WebServer/40] [LOG] core.server.system_services.system_store:: SystemStore.make_changes: {
  insert: {
    tieringpolicies: [ { _id: 63f3ba4d7cd54d002814d2f2, name: 'banana-1#led5a15e', system: 63f34f8d7cd54d002814d1f4, tiers: [ { tier: 63f3ba4d7cd54d002814d2f1, order: 0, spillover: false, disabled: false } ], chunk_split_config: { avg_chunk: 4194304, delta_chunk: 1048576 } } ],
    tiers: [
      { _id: 63f3ba4d7cd54d002814d2f1, name: 'banana-1#led5a15e', system: 63f34f8d7cd54d002814d1f4, chunk_config: 63f34f8d7cd54d002814d1f7, data_placement: 'SPREAD', mirrors: [ { _id: 63f3ba4d7cd54d002814d2f0, spread_pools: [Array] } ] }
    ],
    buckets: [ { _id: 63f3ba4d7cd54d002814d2f3, name: SENSITIVE-e1623878a99dce31, tag: '', system: 63f34f8d7cd54d002814d1f4, owner_account: 63f34f9c7cd54d002814d204, tiering: 63f3ba4d7cd54d002814d2f2, storage_stats: { chunks_capacity: 0, blocks_size: 0, pools: {}, objects_size: 0, objects_count: 0, objects_hist: [], last_update: 1676917140000 }, lambda_triggers: [], versioning: 'DISABLED', object_lock_configuration: undefined, master_key_id: 63f3ba4d7cd54d002814d2f4 } ],
    master_keys: [ { _id: 63f3ba4d7cd54d002814d2f4, description: 'master key of 63f3ba4d7cd54d002814d2f3 bucket', master_key_id: 63f34f8d7cd54d002814d1f5, cipher_type: 'aes-256-gcm', cipher_key: <Buffer a5 27 d4 8a bc 27 d9 c2 ec ec 01 35 29 ab 0e 98 00 66 ed 0d b5 a7 80 26 2a 29 c6 fb ec 9b 8e 94>, cipher_iv: <Buffer 4d d8 94 e6 9b 00 48 f9 9a 26 58 3c 4c 70 93 b7>, disabled: false } ]
  },
  update: {}
}
=====================================================
However, in the next step, probably while fetching info about the bucket, we get the error message.
Feb-20 18:22:05.469 [WebServer/40] [L1] core.rpc.rpc:: RPC._on_response: reqid 2@ws://[::ffff:172.17.0.1]:7244/(9q6or7sjm) connid ws://[::ffff:172.17.0.1]:7244/(9q6or7sjm)
Feb-20 18:22:05.536 [WebServer/40] [ERROR] core.rpc.rpc:: RPC._request: response ERROR srv server_inter_process_api.load_system_store reqid 2@ws://[::ffff:172.17.0.1]:7244/(9q6or7sjm) connid ws://[::ffff:172.17.0.1]:7244/(9q6or7sjm) params { since: 1676917325139 } took [133.4+225.8=359.2] [RpcError: the database system is shutting down] { rpc_code: 'INTERNAL', rpc_data: { retryable: true } }
Feb-20 18:22:05.539 [WebServer/40] [L0] core.rpc.rpc_base_conn:: RPC CONNECTION CLOSED. got event from connection: ws://[::ffff:172.17.0.1]:7244/(9q6or7sjm) Error: publish_to_cluster: disconnect on error the database system is shutting down ws://[::ffff:172.17.0.1]:7244/(9q6or7sjm)
After bringing the DB up, list_buckets shows the buckets we created:
- name: banana-1
- name: banana-2
- name: orange-1
- name: first.bucket
The current behavior will not bring the system into an unstable state, but it certainly creates confusion for the user.
In my opinion, we should either return a success message for bucket creation when it has actually succeeded, or fail it completely and also undo the change in the DB.
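The "fail it completely and undo the DB change" option could look roughly like the sketch below: if any step after the insert fails, issue a compensating delete so the DB state matches the error the user receives. The helpers (insert, loadInfo, remove) are hypothetical, not the actual noobaa-core implementation, and the rollback is best-effort since the DB may be down at that point too.

```javascript
// Sketch of a compensating-delete pattern: on a post-insert failure,
// remove the just-created entry so a retry will not hit
// BUCKET_ALREADY_OWNED_BY_YOU. All helper names are hypothetical.
async function createBucketAtomic(db, name) {
    await db.insert(name);
    try {
        await db.loadInfo(name);  // the step that fails while the DB is going down
    } catch (err) {
        // Best-effort rollback; if the DB is already unreachable this will
        // also fail and would need asynchronous cleanup later.
        try { await db.remove(name); } catch (_) { /* needs deferred cleanup */ }
        throw err;                // the user-facing failure now matches DB state
    }
}
```

With this shape, the error the user sees and the persisted state agree whenever the rollback can run.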
from noobaa-core.
@aspandey @rkomandu Have you tried it with NSFS buckets? I think that the symptom we saw was that the DB bucket entry exists but the equivalent folder in the file system was missing
No, I have not tried it with NSFS buckets. When I started looking at the bug, I could see an issue without NSFS as well, so I thought it might be enough to debug without NSFS.
I will try with an NSFS bucket as well.
from noobaa-core.
Quoting from a triage meeting:
A potential problem with the DB not responding, or responding slowly, plus no atomicity between DB entry creation and folder creation on the FS: the entry was created but the folder was not.
We need to handle errors between DB entry creation and folder creation better.
from noobaa-core.
Talking to Danny, he mentioned he saw something similar with account creation, where the creation in the DB failed but we got an "account already exists" error.
This might be an issue with the system store not evicting an entry in cases where the DB update itself failed. This needs to be investigated.
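If that suspicion is right, the bug would look something like the sketch below: the in-memory cache is updated optimistically, and when the DB write fails the cached entry is never evicted, so a retry is rejected with "already exists" even though nothing was persisted. This is a hypothetical illustration of the suspected mechanism, not the actual system_store code.

```javascript
// Sketch of the suspected system-store bug and a possible fix:
// keep the in-memory cache consistent with the DB by evicting the
// optimistic entry when the DB write fails. Names are illustrative.
class SystemStoreSketch {
    constructor(db) { this.db = db; this.cache = new Set(); }

    // Suspected buggy flow: cache first, never undo on DB failure.
    async createAccountBuggy(name) {
        if (this.cache.has(name)) throw new Error('ACCOUNT_ALREADY_EXISTS');
        this.cache.add(name);
        await this.db.insert(name);   // if this throws, the cache keeps the entry
    }

    // Fixed flow: evict the cached entry when the DB write fails.
    async createAccountFixed(name) {
        if (this.cache.has(name)) throw new Error('ACCOUNT_ALREADY_EXISTS');
        this.cache.add(name);
        try {
            await this.db.insert(name);
        } catch (err) {
            this.cache.delete(name);  // cache stays consistent with the DB
            throw err;
        }
    }
}
```

In the buggy flow a retry fails with the spurious "already exists" error; in the fixed flow the retry reaches the DB again.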
from noobaa-core.
I tried reproducing this issue and also tried to come up with a solution:
1 - Started all the pods and then killed noobaa-db-pg-0 by modifying the replica count.
2 - Tried to create an account/bucket. It failed with "this.begin() must be called before sending queries on this transaction".
3 - Restarted the DB and tried to create the same account/bucket again.
It failed at first but eventually succeeded. It sometimes takes a long time before we can create the same account/bucket again.
We suspected that the in-memory cache is not being refreshed at the right time, so I tried to force a refresh using a variable, which did not work.
The next thing I changed was the refresh thresholds,
from
this.START_REFRESH_THRESHOLD = 10 * 60 * 1000;
this.FORCE_REFRESH_THRESHOLD = 60 * 60 * 1000;
to
this.START_REFRESH_THRESHOLD = 2 * 60 * 1000;
this.FORCE_REFRESH_THRESHOLD = 4 * 60 * 1000;
I observed that after changing these values, account/bucket creation succeeds much sooner after bringing the DB up, compared to the previous values.
This is just one observation. I need to investigate more before reaching a conclusion.
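This observation is consistent with time-based refresh gating: a cache load is only re-triggered once its age crosses a threshold, so smaller thresholds mean the stale post-outage cache is reloaded sooner. The decision logic below is a simplified illustration built around the experimental values above, not the actual system_store code.

```javascript
// Sketch of threshold-gated cache refresh, using the lowered values
// from the experiment above. The three-way decision is illustrative.
const START_REFRESH_THRESHOLD = 2 * 60 * 1000;  // 2 min: start async refresh
const FORCE_REFRESH_THRESHOLD = 4 * 60 * 1000;  // 4 min: refresh inline

function refreshAction(lastLoadTime, now) {
    const age = now - lastLoadTime;
    if (age > FORCE_REFRESH_THRESHOLD) return 'force';       // too stale to serve
    if (age > START_REFRESH_THRESHOLD) return 'background';  // serve stale, refresh async
    return 'none';                                           // fresh enough
}
```

With the original 10/60-minute values, a cache loaded just before the outage could be served unchanged for a long time after the DB comes back, which would explain the long delay before creations succeed again.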
from noobaa-core.
Talking to Danny, he mentioned he saw something similar with account creation where the creation in DB failed but we got an error of account already exists.
This might be an issue with the system store that does not evacuate an entry in some cases where the db update itself failed.... need to be investigated
@nimrod-becker are you sure it's not an outdated-cache issue?
from noobaa-core.
@aspandey @rkomandu
Have you tried it with NSFS buckets? I think that the symptom we saw was that the DB bucket entry exists but the equivalent folder in the file system was missing
from noobaa-core.
@romayalon, our buckets are NSFS-only for the SS.
from noobaa-core.
@aspandey
There is one more scenario here, as per my steps mentioned in #7103 (comment) (first comment).
If the DB sits in the "ContainerCreationError" state for a long time because the backend (FS), which we term the Storage Cluster, is down for whatever reason, then once the backend is up again the noobaa-db pod comes back to the Running state. However, per my observation/post above, the connection between noobaa-core and noobaa-db is not re-established. So is there handshake code in place between noobaa-core and noobaa-db for the case where the DB recovery takes a long time?
from noobaa-core.
@rkomandu Apologies for the confusion, I wanted to ask Ashish regarding the NSFS buckets
from noobaa-core.