We are seeing frequent, intermittent segmentation faults with aws-c-s3 v0.1.44 and aws-c-http v0.6.19; the problem is also present on master as of today (c1198ae):
[DEBUG] 2022-08-20 01:03:31.497 connection-manager [139657729799936] id=(nil): Acquire connection
FATAL: Received signal 11 (Segmentation fault)
Backtrace (most recent call first)
#10 <?> at 0x7f04b9f75980 in __restore_rt
#9 <?> at 0x55afd6f0535b in s_s3_client_acquired_retry_token
#8 <?> at 0x55afd6f60ff2 in s_exponential_retry_task
#7 <?> at 0x55afd6fceb39 in aws_task_run
#6 <?> at 0x55afd6fced90 in s_run_all
#5 <?> at 0x55afd6fcf1db in aws_task_scheduler_run_all
#4 <?> at 0x55afd6f6a215 in s_main_loop
#3 <?> at 0x55afd6fd3100 in thread_fn
#2 <?> at 0x55afd3f35304 in thread_metrics_start_routine
#1 <?> at 0x7f04b9f6a6db in start_thread
This happens on the first attempt, not on retry. It turns out that the connection_manager is NULL when it is needed:
// aws-c-http/source/connection_manager.c
void aws_http_connection_manager_acquire_connection(
    struct aws_http_connection_manager *manager,
    aws_http_connection_manager_on_connection_setup_fn *callback,
    void *user_data) {

    AWS_LOGF_DEBUG(AWS_LS_HTTP_CONNECTION_MANAGER, "id=%p: Acquire connection", (void *)manager); // <=== HERE
The above function is used as the .acquire_http_connection function pointer of s_s3_client_default_vtable, and is invoked from s_s3_client_acquired_retry_token like this:
// aws-c-s3/source/s3_client.c
static void s_s3_client_acquired_retry_token(
    struct aws_retry_strategy *retry_strategy,
    int error_code,
    struct aws_retry_token *token,
    void *user_data) {

    // ....

    client->vtable->acquire_http_connection(
        endpoint->http_connection_manager, s_s3_client_on_acquire_http_connection, connection);
}
So endpoint->http_connection_manager is NULL when it should not be.
Adding more logging
At debug level, the logs do not reveal much, hence we used this patch to log more information:
[INFO] 2022-08-20 01:03:31.496 AuthSigning [139657322956544] (id=0x7f047f82c050) Http request successfully built final authorization value via algorithm SigV4, with contents
AWS4-HMAC-SHA256 Credential=ASIAYO6OYNH5WYXVUNEK/20220820/us-east-1/s3/aws4_request, SignedHeaders=content-length;content-type;host;x-amz-acl;x-amz-content-sha256;x-amz-date;x-amz-security-token, Signature=0da7d7a8a6257b602c2234599874dd5dba60f43ec6f51143efdc67a78269aced
[DEBUG] 2022-08-20 01:03:31.496 task-scheduler [139657528473344] id=0x7f04898cf158: Scheduling s3_client_process_work_task task for immediate execution
[DEBUG] 2022-08-20 01:03:31.496 task-scheduler [139657528473344] id=0x7f04898cf158: Running s3_client_process_work_task task with <Running> status
[DEBUG] 2022-08-20 01:03:31.496 S3Client [139657528473344] id=0x7f04898cf000 s_s3_client_process_work_default - Moving relevant synced_data into threaded_data.
[DEBUG] 2022-08-20 01:03:31.496 S3Client [139657528473344] id=0x7f04898cf000 s_s3_client_process_work_default - Processing any new meta requests.
[DEBUG] 2022-08-20 01:03:31.496 S3Client [139657528473344] id=0x7f04898cf000 Updating meta requests.
[DEBUG] 2022-08-20 01:03:31.496 S3Client [139657528473344] id=0x7f04898cf000 Updating connections, assigning requests where possible.
[ERROR] 2022-08-20 01:03:31.496 S3Client [139657528473344] id=0x7f04898cf000 0x7f04a4fd1ca0 s_s3_client_create_connection_for_request_default: (nil)
[DEBUG] 2022-08-20 01:03:31.497 exp-backoff-strategy [139657528473344] id=0x7f049a6c64d0: Initializing retry token 0x7f048bc12100
[DEBUG] 2022-08-20 01:03:31.497 task-scheduler [139657729799936] id=0x7f048bc12160: Scheduling aws_exponential_backoff_retry_task task for immediate execution
[INFO] 2022-08-20 01:03:31.497 S3ClientStats [139657528473344] id=0x7f04898cf000 Requests-in-flight(approx/exact):1/1 Requests-preparing:0 Requests-queued:0 Requests-network(get/put/default/total):0/1/0/1 Requests-streaming-waiting:0 Requests-streaming:0 Endpoints(in-table/allocated):1/0
[DEBUG] 2022-08-20 01:03:31.497 task-scheduler [139657729799936] id=0x7f048bc12160: Running aws_exponential_backoff_retry_task task with <Running> status
[DEBUG] 2022-08-20 01:03:31.497 exp-backoff-strategy [139657729799936] id=0x7f049a6c64d0: Vending retry_token 0x7f048bc12100
[DEBUG] 2022-08-20 01:03:31.497 connection-manager [139657729799936] id=(nil): Acquire connection
FATAL: Received signal 11 (Segmentation fault)
Backtrace (most recent call first)
#10 <?> at 0x7f04b9f75980 in __restore_rt
#9 <?> at 0x55afd6f0535b in s_s3_client_acquired_retry_token
#8 <?> at 0x55afd6f60ff2 in s_exponential_retry_task
#7 <?> at 0x55afd6fceb39 in aws_task_run
#6 <?> at 0x55afd6fced90 in s_run_all
#5 <?> at 0x55afd6fcf1db in aws_task_scheduler_run_all
#4 <?> at 0x55afd6f6a215 in s_main_loop
#3 <?> at 0x55afd6fd3100 in thread_fn
#2 <?> at 0x55afd3f35304 in thread_metrics_start_routine
#1 <?> at 0x7f04b9f6a6db in start_thread
The line
[ERROR] 2022-08-20 01:03:31.496 S3Client [139657528473344] id=0x7f04898cf000 0x7f04a4fd1ca0 s_s3_client_create_connection_for_request_default: (nil)
shows that client 0x7f04898cf000 uses endpoint 0x7f04a4fd1ca0 whose http_connection_manager is (nil). This is the bug that produces the subsequent crash.
Here is the trace for that endpoint from the log:
[DEBUG] 2022-08-20 01:03:31.393 S3Endpoint [139658686979648] id=0x7f04a4fd1ca0: Created connection manager 0x7f0430ea0fc0 for endpoint
[ERROR] 2022-08-20 01:03:31.393 S3Client [139658686979648] id=0x7f04898cf000 0x7f04a4fd1ca0 aws_s3_client_make_meta_request: aurora-simulation-prod-logs.s3.us-east-1.amazonaws.com ADDED 1
[ERROR] 2022-08-20 01:03:31.394 S3Client [139657528473344] id=0x7f04898cf000 0x7f04a4fd1ca0 s_s3_client_create_connection_for_request_default: 0x7f0430ea0fc0
[ERROR] 2022-08-20 01:03:31.493 S3Endpoint [139657616553728] id=0x7f04a4fd1ca0: 0x7f04898cf000 aws_s3_client_endpoint_release, count = 2
[ERROR] 2022-08-20 01:03:31.493 S3Endpoint [139657528473344] id=0x7f04a4fd1ca0: 0x7f04898cf000 aws_s3_client_endpoint_release, count = 2
[ERROR] 2022-08-20 01:03:31.493 S3Endpoint [139657528473344] id=0x7f04a4fd1ca0: s_s3_endpoint_ref_count_zero
[ERROR] 2022-08-20 01:03:31.493 S3Endpoint [139657528473344] id=0x7f04a4fd1ca0: s_s3_endpoint_ref_count_zero - removing connection_manager
[ERROR] 2022-08-20 01:03:31.495 S3Client [139658686979648] id=0x7f04898cf000 0x7f04a4fd1ca0 aws_s3_client_make_meta_request: aurora-simulation-prod-logs.s3.us-east-1.amazonaws.com REF conman: (nil)
[ERROR] 2022-08-20 01:03:31.496 S3Client [139657528473344] id=0x7f04898cf000 0x7f04a4fd1ca0 s_s3_client_create_connection_for_request_default: (nil)
Pay close attention to the thread IDs in this part:
[ERROR] 2022-08-20 01:03:31.493 S3Endpoint [139657616553728] id=0x7f04a4fd1ca0: 0x7f04898cf000 aws_s3_client_endpoint_release, count = 2
[ERROR] 2022-08-20 01:03:31.493 S3Endpoint [139657528473344] id=0x7f04a4fd1ca0: 0x7f04898cf000 aws_s3_client_endpoint_release, count = 2
That means that aws_s3_client_endpoint_release was invoked from two different threads at the same time (possibly synchronized via the mutex), and both ended up calling aws_ref_count_release(&endpoint->ref_count), decrementing the count from 2 -> 1 -> 0. Both initially saw an endpoint->ref_count.ref_count of 2, hence neither removed the endpoint from the hash table.
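For context, aws_ref_count_release in aws-c-common is essentially an atomic fetch-sub that fires the on-zero callback only for the releaser that takes the count from 1 to 0 — a sketch of the semantics from memory, so check the aws-c-common source for the exact details:

// Approximate sketch of aws-c-common's aws_ref_count_release
size_t aws_ref_count_release(struct aws_ref_count *ref_count) {
    /* Atomically decrement; old_value is the count BEFORE the decrement. */
    size_t old_value = aws_atomic_fetch_sub(&ref_count->ref_count, 1);
    AWS_ASSERT(old_value > 0 && "refcount has gone negative");
    if (old_value == 1) {
        /* Only the releaser that moves the count 1 -> 0 runs the destructor;
         * for an endpoint that is s_s3_endpoint_ref_count_zero. */
        ref_count->on_zero_fn(ref_count->object);
    }
    return old_value - 1;
}

This is why only the second release in the log triggers s_s3_endpoint_ref_count_zero, which tears down the http_connection_manager while the endpoint is still reachable through the hash table.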
Here is the function with above logging added:
// source/s3_endpoint.c
void aws_s3_client_endpoint_release(struct aws_s3_client *client, struct aws_s3_endpoint *endpoint) {
    AWS_PRECONDITION(endpoint);
    AWS_PRECONDITION(client);
    AWS_PRECONDITION(endpoint->handled_by_client);

    AWS_LOGF_ERROR(
        AWS_LS_S3_ENDPOINT,
        "id=%p: %p aws_s3_client_endpoint_release, count = %zu",
        (void *)endpoint,
        (void *)client,
        aws_atomic_load_int(&endpoint->ref_count.ref_count)); // <== BOTH SEE 2 HERE

    /* BEGIN CRITICAL SECTION */
    {
        aws_s3_client_lock_synced_data(client);

        /* The last refcount to release */
        if (aws_atomic_load_int(&endpoint->ref_count.ref_count) == 1) { // <== BOTH SEE 2 HERE
            AWS_LOGF_ERROR(
                AWS_LS_S3_ENDPOINT,
                "id=%p: aws_s3_client_endpoint_release - removing from hashtable",
                (void *)endpoint);

            aws_hash_table_remove(&client->synced_data.endpoints, endpoint->host_name, NULL, NULL);
        }

        aws_s3_client_unlock_synced_data(client);
    }
    /* END CRITICAL SECTION */

    aws_ref_count_release(&endpoint->ref_count); // <== BOTH CALL THIS
}
Hence the expected "aws_s3_client_endpoint_release - removing from hashtable" message does not appear in the log. Instead, since endpoint->ref_count now reaches 0, the http_connection_manager of endpoint 0x7f04a4fd1ca0 is removed, released and set to NULL:
[ERROR] 2022-08-20 01:03:31.493 S3Endpoint [139657616553728] id=0x7f04a4fd1ca0: 0x7f04898cf000 aws_s3_client_endpoint_release, count = 2
[ERROR] 2022-08-20 01:03:31.493 S3Endpoint [139657528473344] id=0x7f04a4fd1ca0: 0x7f04898cf000 aws_s3_client_endpoint_release, count = 2
[ERROR] 2022-08-20 01:03:31.493 S3Endpoint [139657528473344] id=0x7f04a4fd1ca0: s_s3_endpoint_ref_count_zero
[ERROR] 2022-08-20 01:03:31.493 S3Endpoint [139657528473344] id=0x7f04a4fd1ca0: s_s3_endpoint_ref_count_zero - removing connection_manager
The next time the s3_client wants to make a request, it finds the stale entry in the hash table, uses it, and logs that the http_connection_manager field of that endpoint is (nil):
[DEBUG] 2022-08-20 01:03:31.495 S3Client [139657528473344] id=0x7f04898cf000 Updating connections, assigning requests where possible.
[DEBUG] 2022-08-20 01:03:31.495 S3MetaRequest [139658686979648] id=0x7f04a3dd0200 Created new Default Meta Request.
[ERROR] 2022-08-20 01:03:31.495 S3Client [139658686979648] id=0x7f04898cf000 0x7f04a4fd1ca0 aws_s3_client_make_meta_request: aurora-simulation-prod-logs.s3.us-east-1.amazonaws.com REF conman: (nil) <=== HERE
[INFO] 2022-08-20 01:03:31.495 S3Client [139658686979648] id=0x7f04898cf000: Created meta request 0x7f04a3dd0200
[INFO] 2022-08-20 01:03:31.495 S3ClientStats [139657528473344] id=0x7f04898cf000 Requests-in-flight(approx/exact):0/0 Requests-preparing:0 Requests-queued:0 Requests-network(get/put/default/total):0/0/0/0 Requests-streaming-waiting:0 Requests-streaming:0 Endpoints(in-table/allocated):1/0
It also reports that it has 1 endpoint (0x7f04a4fd1ca0) in-table, and 0 allocated.
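For illustration, the lookup that picks up the stale entry happens in aws_s3_client_make_meta_request; here is a simplified, hypothetical sketch of that path (names and signatures approximate, inferred from the log messages, not the exact aws-c-s3 source):

/* Hypothetical sketch of the endpoint lookup in aws_s3_client_make_meta_request */
aws_s3_client_lock_synced_data(client);

struct aws_hash_element *element = NULL;
aws_hash_table_find(&client->synced_data.endpoints, host_name, &element);

if (element != NULL) {
    /* Stale entry: the endpoint's ref count already hit zero and its
     * http_connection_manager is NULL, but it was never removed here. */
    endpoint = aws_s3_endpoint_acquire(element->value);
} else {
    /* This branch would have created a fresh endpoint with a valid manager. */
    endpoint = aws_s3_endpoint_new(allocator, &endpoint_options);
}

aws_s3_client_unlock_synced_data(client);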
So the cause of the problem is an inconsistency:
- when the s3_client loads an s3_endpoint from the hash table, it expects its http_connection_manager to not be NULL,
- but the race condition in aws_s3_client_endpoint_release prevented the required removal of the endpoint from the hash table.
Cause of the race condition
Here is a possible sequence of two threads, T1 and T2, where .Lock() stands for taking the synced_data.lock:
T1.Lock()
// blocks T2, which also needs to go through this section
// reference count is 2, hence do nothing
T1.Unlock()
T2.Lock() // now grabs the lock
// reference count is 2, hence do nothing
T2.Unlock()
T1.aws_ref_count_release(&endpoint->ref_count); // Atomic, happens in sequence
T2.aws_ref_count_release(&endpoint->ref_count);
// Value of endpoint->ref_count is now 0
The same would happen if T2 were allowed to decrement endpoint->ref_count first.
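The pattern is easy to reproduce outside the library. Below is a minimal, standalone sketch (hypothetical code, not from aws-c-s3) of the same check-then-act window — depending on scheduling, both threads observe a count of 2 under the lock and then decrement outside it:

// race_repro.c - hypothetical standalone repro of the check-then-act race
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_size_t ref_count = 2; /* both threads hold one reference */
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static bool removed_from_table = false;

static void *release(void *arg) {
    (void)arg;
    pthread_mutex_lock(&table_lock);
    /* Check under the lock: with the bad interleaving, BOTH threads see 2. */
    if (atomic_load(&ref_count) == 1) {
        removed_from_table = true;
    }
    pthread_mutex_unlock(&table_lock);

    /* ...but decrement OUTSIDE the lock: this is the race window. */
    atomic_fetch_sub(&ref_count, 1);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, release, NULL);
    pthread_create(&t2, NULL, release, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    /* Bad outcome: ref_count == 0, yet the table entry was never removed. */
    printf("ref_count=%zu removed=%s\n",
           atomic_load(&ref_count),
           removed_from_table ? "yes" : "no");
    return 0;
}

Whether the bad interleaving occurs depends on thread scheduling, which matches the intermittent nature of the crash.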
Fix
Since ref_count.ref_count is read within the critical section but modified outside of it, the race condition can occur. It can be avoided by pulling aws_ref_count_release(&endpoint->ref_count); inside the critical section, so that the check and the decrement happen atomically with respect to other releasers.
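A minimal sketch of what that could look like (an illustration of the idea, not a tested upstream patch; note that with this change the on-zero callback runs while the client lock is held, which needs care if the destruction path ever re-enters that lock):

// source/s3_endpoint.c - sketch of the proposed fix
void aws_s3_client_endpoint_release(struct aws_s3_client *client, struct aws_s3_endpoint *endpoint) {
    AWS_PRECONDITION(endpoint);
    AWS_PRECONDITION(client);
    AWS_PRECONDITION(endpoint->handled_by_client);

    /* BEGIN CRITICAL SECTION */
    {
        aws_s3_client_lock_synced_data(client);

        /* Check and decrement now happen under the same lock, so exactly one
         * releaser can observe the count at 1 and remove the endpoint from
         * the hash table before the last reference is dropped. */
        if (aws_atomic_load_int(&endpoint->ref_count.ref_count) == 1) {
            aws_hash_table_remove(&client->synced_data.endpoints, endpoint->host_name, NULL, NULL);
        }

        aws_ref_count_release(&endpoint->ref_count); /* moved inside the lock */

        aws_s3_client_unlock_synced_data(client);
    }
    /* END CRITICAL SECTION */
}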