Comments (10)
As per my understanding, our main issue is this:
When the Postgres DB is down and we create a bucket or account, the operation fails. Once the database is up and running again and we try to create the same bucket/account, it returns "BUCKET_ALREADY_OWNED_BY_YOU" even though the bucket does not show up in list_buckets. Eventually, however, the bucket does appear in the list.
Findings -
With the latest master branch, I could not reproduce the issue.
I thought that #7167 could have fixed it, but removing that change had the same effect and I still could not reproduce the issue.
So when the database is completely down, creating a bucket fails, and once the DB is up again we can create the same bucket without failure.
Having said that, we have the following observation, which may or may not be problematic depending on what kind of experience we want customers to have.
When we bring the DB down by editing the replica count using
"kubectl edit statefulset noobaa-db-pg", the DB pod starts terminating.
This termination happens in steps, which means that even while the DB pod is in the Terminating state, it might still be accepting some requests from core.
If we try to create buckets/accounts while the DB is terminating and going down, the user may receive an error message indicating that the bucket creation failed.
However, if we bring the DB up and run list_buckets a few seconds/minutes later, it shows the bucket name. That means the changes (done by system_store.make_changes) did reach the DB; it is only the info fetch in a later step of bucket creation that fails.
This gives the impression that the bucket creation failed when it actually succeeded.
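The flow above can be sketched as follows. This is a minimal illustration of the failure mode, assuming a DB that goes down between the write and the follow-up read; all names (createBucketSketch, makeFlakyDb, etc.) are simplified stand-ins, not the real noobaa-core API.

```javascript
// Sketch: the DB write succeeds, but the follow-up read fails because the
// DB pod is mid-termination, so the caller sees an error even though the
// bucket was persisted. Simplified stand-ins for the real noobaa-core flow.
async function createBucketSketch(db, name) {
    // Step 1: persist the new bucket (the system_store.make_changes step).
    await db.insert(name);        // succeeds while the DB still accepts writes
    // Step 2: reload/verify state (the load_system_store step).
    const info = await db.load(); // throws: "the database system is shutting down"
    return info.buckets.includes(name);
}

// A fake DB that goes down between the write and the read,
// mimicking a pod caught in the Terminating state.
function makeFlakyDb() {
    const buckets = [];
    return {
        buckets,
        insert: async name => { buckets.push(name); },
        load: async () => { throw new Error('the database system is shutting down'); },
    };
}

async function demo() {
    const db = makeFlakyDb();
    let userSawError = false;
    try {
        await createBucketSketch(db, 'banana-1');
    } catch (err) {
        userSawError = true;      // the user is told creation failed...
    }
    // ...yet the entry is in the DB, so a later list_buckets shows the bucket.
    return { userSawError, persisted: db.buckets.includes('banana-1') };
}
```

The key point is that the error surfaces after the write is already durable, so the user-facing result and the DB state disagree.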
Example:
Just after changing the replica count of the DB to 0 and saving the file, I tried to create bucket "banana-1":
[host1@host1 noobaa-operator]$ nb api bucket create_bucket '{"name":"banana-1"}'
INFO[0000]
INFO[0000]
INFO[0000]
INFO[0000]
INFO[0000]
WARN[0000] RPC: GetConnection creating connection to wss://localhost:34115/rpc/ 0xc000c49b00
INFO[0000] RPC: Connecting websocket (0xc000c49b00) &{RPC:0xc0005a1040 Address:wss://localhost:34115/rpc/ State:init WS: PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:0s cancelPings:}
INFO[0000] RPC: Connected websocket (0xc000c49b00) &{RPC:0xc0005a1040 Address:wss://localhost:34115/rpc/ State:init WS: PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:0s cancelPings:}
ERRO[0000]
FATA[0000]
Here it looks like the bucket creation has failed.
=================================================
Looking at the logs, we can see that the make_changes call to the DB was successful.
Feb-20 18:22:05.136 [WebServer/40] [L3] core.rpc.rpc:: RPC._on_request: ENTER srv bucket_api.create_bucket reqid wss://localhost:34115/rpc/-0 connid ws://[::ffff:127.0.0.1]:50818/(9qag2ejid)
Feb-20 18:22:05.138 [WebServer/40] [L1] core.server.common_services.auth_server:: load_auth: { system: 'admin' } { account_id: '63f34f9c7cd54d002814d204', system_id: '63f34f8d7cd54d002814d1f4', role: 'operator', authorized_by: 'noobaa', iat: 1676890013 }
Feb-20 18:22:05.138 [WebServer/40] [L3] core.server.common_services.auth_server:: load auth system: 63f34f8d7cd54d002814d1f4
Feb-20 18:22:05.138 [WebServer/40] [L0] core.server.system_services.bucket_server:: validate_non_nsfs_bucket_create_allowed: undefined undefined
Feb-20 18:22:05.139 [WebServer/40] [L0] core.server.notifications.dispatcher:: Adding ActivityLog entry { event: 'bucket.create', level: 'info', system: 63f34f8d7cd54d002814d1f4, actor: 63f34f9c7cd54d002814d204, bucket: 63f3ba4d7cd54d002814d2f3, desc: SENSITIVE-f4b6955d5488d167 }
Feb-20 18:22:05.141 [WebServer/40] [LOG] core.server.system_services.system_store:: SystemStore.make_changes: {
  insert: {
    tieringpolicies: [ { _id: 63f3ba4d7cd54d002814d2f2, name: 'banana-1#led5a15e', system: 63f34f8d7cd54d002814d1f4, tiers: [ { tier: 63f3ba4d7cd54d002814d2f1, order: 0, spillover: false, disabled: false } ], chunk_split_config: { avg_chunk: 4194304, delta_chunk: 1048576 } } ],
    tiers: [
      { _id: 63f3ba4d7cd54d002814d2f1, name: 'banana-1#led5a15e', system: 63f34f8d7cd54d002814d1f4, chunk_config: 63f34f8d7cd54d002814d1f7, data_placement: 'SPREAD', mirrors: [ { _id: 63f3ba4d7cd54d002814d2f0, spread_pools: [Array] } ] }
    ],
    buckets: [ { _id: 63f3ba4d7cd54d002814d2f3, name: SENSITIVE-e1623878a99dce31, tag: '', system: 63f34f8d7cd54d002814d1f4, owner_account: 63f34f9c7cd54d002814d204, tiering: 63f3ba4d7cd54d002814d2f2, storage_stats: { chunks_capacity: 0, blocks_size: 0, pools: {}, objects_size: 0, objects_count: 0, objects_hist: [], last_update: 1676917140000 }, lambda_triggers: [], versioning: 'DISABLED', object_lock_configuration: undefined, master_key_id: 63f3ba4d7cd54d002814d2f4 } ],
    master_keys: [ { _id: 63f3ba4d7cd54d002814d2f4, description: 'master key of 63f3ba4d7cd54d002814d2f3 bucket', master_key_id: 63f34f8d7cd54d002814d1f5, cipher_type: 'aes-256-gcm', cipher_key: <Buffer a5 27 d4 8a bc 27 d9 c2 ec ec 01 35 29 ab 0e 98 00 66 ed 0d b5 a7 80 26 2a 29 c6 fb ec 9b 8e 94>, cipher_iv: <Buffer 4d d8 94 e6 9b 00 48 f9 9a 26 58 3c 4c 70 93 b7>, disabled: false } ]
  },
  update: {}
}
=====================================================
However, in the next step, probably while fetching info about the bucket, we get the error message.
Feb-20 18:22:05.469 [WebServer/40] [L1] core.rpc.rpc:: RPC._on_response: reqid 2@ws://[::ffff:172.17.0.1]:7244/(9q6or7sjm) connid ws://[::ffff:172.17.0.1]:7244/(9q6or7sjm)
Feb-20 18:22:05.536 [WebServer/40] [ERROR] core.rpc.rpc:: RPC._request: response ERROR srv server_inter_process_api.load_system_store reqid 2@ws://[::ffff:172.17.0.1]:7244/(9q6or7sjm) connid ws://[::ffff:172.17.0.1]:7244/(9q6or7sjm) params { since: 1676917325139 } took [133.4+225.8=359.2] [RpcError: the database system is shutting down] { rpc_code: 'INTERNAL', rpc_data: { retryable: true } }
Feb-20 18:22:05.539 [WebServer/40] [L0] core.rpc.rpc_base_conn:: RPC CONNECTION CLOSED. got event from connection: ws://[::ffff:172.17.0.1]:7244/(9q6or7sjm) Error: publish_to_cluster: disconnect on error the database system is shutting down ws://[::ffff:172.17.0.1]:7244/(9q6or7sjm)
After bringing the DB up, list_buckets shows the buckets we created:
- name: banana-1
- name: banana-2
- name: orange-1
- name: first.bucket
The current behavior will not bring the system into an unstable state, but it certainly creates confusion for the user.
In my opinion, we should either return a success message for bucket creation when it has actually succeeded, or fail it completely and also undo the change in the DB.
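The "fail it completely and undo the DB change" option could look roughly like the sketch below: if any step after the insert fails, issue a compensating delete so the DB state matches the error the user receives. The helpers (insert, loadInfo, remove) are hypothetical, not the actual noobaa-core implementation, and the rollback is best-effort since the DB may be down at that point too.

```javascript
// Sketch of a compensating-delete pattern: on a post-insert failure,
// remove the just-created entry so a retry will not hit
// BUCKET_ALREADY_OWNED_BY_YOU. All helper names are hypothetical.
async function createBucketAtomic(db, name) {
    await db.insert(name);
    try {
        await db.loadInfo(name);  // the step that fails while the DB is going down
    } catch (err) {
        // Best-effort rollback; if the DB is already unreachable this will
        // also fail and would need asynchronous cleanup later.
        try { await db.remove(name); } catch (_) { /* needs deferred cleanup */ }
        throw err;                // the user-facing failure now matches DB state
    }
}
```

With this shape, the error the user sees and the persisted state agree whenever the rollback can run.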
from noobaa-core.
@aspandey @rkomandu Have you tried it with NSFS buckets? I think that the symptom we saw was that the DB bucket entry exists but the equivalent folder in the file system was missing
No, I have not tried it with NSFS buckets. When I started looking at the bug, I could see an issue without NSFS as well, so I thought it might be enough to debug without NSFS.
I will try with an NSFS bucket as well.
from noobaa-core.
Quoting from a triage meeting:
A potential problem with the DB not responding, or responding slowly, plus no atomicity between DB entry creation and folder creation on the FS: the entry was created but the folder was not.
We need to handle errors between DB entry creation and folder creation better.
from noobaa-core.
Talking to Danny, he mentioned he saw something similar with account creation, where the creation in the DB failed but we got an "account already exists" error.
This might be an issue with the system store not evicting an entry in cases where the DB update itself failed. This needs to be investigated.
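If that suspicion is right, the bug would look something like the sketch below: the in-memory cache is updated optimistically, and when the DB write fails the cached entry is never evicted, so a retry is rejected with "already exists" even though nothing was persisted. This is a hypothetical illustration of the suspected mechanism, not the actual system_store code.

```javascript
// Sketch of the suspected system-store bug and a possible fix:
// keep the in-memory cache consistent with the DB by evicting the
// optimistic entry when the DB write fails. Names are illustrative.
class SystemStoreSketch {
    constructor(db) { this.db = db; this.cache = new Set(); }

    // Suspected buggy flow: cache first, never undo on DB failure.
    async createAccountBuggy(name) {
        if (this.cache.has(name)) throw new Error('ACCOUNT_ALREADY_EXISTS');
        this.cache.add(name);
        await this.db.insert(name);   // if this throws, the cache keeps the entry
    }

    // Fixed flow: evict the cached entry when the DB write fails.
    async createAccountFixed(name) {
        if (this.cache.has(name)) throw new Error('ACCOUNT_ALREADY_EXISTS');
        this.cache.add(name);
        try {
            await this.db.insert(name);
        } catch (err) {
            this.cache.delete(name);  // cache stays consistent with the DB
            throw err;
        }
    }
}
```

In the buggy flow a retry fails with the spurious "already exists" error; in the fixed flow the retry reaches the DB again.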
from noobaa-core.
I tried reproducing this issue and also tried to come up with a solution:
1 - Started all the pods and then killed noobaa-db-pg-0 by modifying the replica count.
2 - Tried to create an account/bucket. It failed with "this.begin() must be called before sending queries on this transaction".
3 - Restarted the DB and tried to create the same account/bucket again.
It failed at first but eventually succeeded. It sometimes takes a long time before we can create the same account/bucket again.
We suspected that the in-memory cache is not being refreshed at the right time, so I tried to force a refresh using a variable, which did not work.
The next thing I changed was the refresh thresholds,
from
this.START_REFRESH_THRESHOLD = 10 * 60 * 1000;
this.FORCE_REFRESH_THRESHOLD = 60 * 60 * 1000;
to
this.START_REFRESH_THRESHOLD = 2 * 60 * 1000;
this.FORCE_REFRESH_THRESHOLD = 4 * 60 * 1000;
I observed that after changing these values, account/bucket creation succeeds much sooner after bringing the DB up, compared to the previous values.
This is just one observation. I need to investigate more before reaching a conclusion.
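This observation is consistent with time-based refresh gating: a cache load is only re-triggered once its age crosses a threshold, so smaller thresholds mean the stale post-outage cache is reloaded sooner. The decision logic below is a simplified illustration built around the experimental values above, not the actual system_store code.

```javascript
// Sketch of threshold-gated cache refresh, using the lowered values
// from the experiment above. The three-way decision is illustrative.
const START_REFRESH_THRESHOLD = 2 * 60 * 1000;  // 2 min: start async refresh
const FORCE_REFRESH_THRESHOLD = 4 * 60 * 1000;  // 4 min: refresh inline

function refreshAction(lastLoadTime, now) {
    const age = now - lastLoadTime;
    if (age > FORCE_REFRESH_THRESHOLD) return 'force';       // too stale to serve
    if (age > START_REFRESH_THRESHOLD) return 'background';  // serve stale, refresh async
    return 'none';                                           // fresh enough
}
```

With the original 10/60-minute values, a cache loaded just before the outage could be served unchanged for a long time after the DB comes back, which would explain the long delay before creations succeed again.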
from noobaa-core.
Talking to Danny, he mentioned he saw something similar with account creation where the creation in DB failed but we got an error of account already exists.
This might be an issue with the system store that does not evacuate an entry in some cases where the db update itself failed.... need to be investigated
@nimrod-becker are you sure it's not an outdated-cache issue?
from noobaa-core.
@aspandey @rkomandu
Have you tried it with NSFS buckets? I think that the symptom we saw was that the DB bucket entry exists but the equivalent folder in the file system was missing
from noobaa-core.
@romayalon, our buckets are NSFS-only for the SS.
from noobaa-core.
@aspandey
There is one more scenario here, as per my steps mentioned in #7103 (comment) (first comment).
If the DB sits in the "ContainerCreationError" state for a long time because the backend (FS), which we term the Storage Cluster, is down for whatever reason, then once the backend is up again the noobaa-db pod comes back to the Running state. However, per my observation/post above, the connection between noobaa-core and noobaa-db is not re-established. So is there handshake code in place between noobaa-core and noobaa-db for the case where the DB recovery takes a long time?
from noobaa-core.
@rkomandu Apologies for the confusion, I wanted to ask Ashish regarding the NSFS buckets
from noobaa-core.