Describe the bug Test fails frequently on Gradle: <div class="

With 100x runs, I saw 19 failures: testLoggingInterceptor_make

So the ultimate problem for both tests seems to be this line: <div class="highligh

Ah, I think I figured it out. Per <a href="https://kotlinlang.org/api/kotlinx.coroutin

[BUG]: Flaky Test: testLoggingInterceptor_makeNetworkCall_succeeds about oppia-android HOT 5 CLOSED

adhiamboperes commented on August 16, 2024

[BUG]: Flaky Test: testLoggingInterceptor_makeNetworkCall_succeeds

from oppia-android.

Comments (5)

BenHenning commented on August 16, 2024

I could be wrong, but is there actually a way to mark this test as flaky in code? Bazel does technically support a way by auto-rerunning flaky tests, but I think it'd be better just to fix the test, instead, since we don't want to encourage checking in flaky code (where we can avoid it).

Currently running 100x runs to see how flaky the test is.

from oppia-android.

BenHenning commented on August 16, 2024

With 100x runs, I saw 19 failures:

testLoggingInterceptor_makeNetworkCall_succeeds 14/19 times (14% failure rate)
testLoggingInterceptor_makeNetworkCallWithInvalidUrl_failsAndCompletes 5/19 times (5% failure rate)

Edit: Also, here's the command run:

bazel test --runs_per_test=100 //data:src/test/java/org/oppia/android/data/backends/gae/NetworkLoggingInterceptorTest

Unfortunately, running the tests above in isolation doesn't seem to actually catch any flakes, so these flakes are seemingly caused by state interaction between the tests of the suite.

Specific failures that we're seeing for each:

testLoggingInterceptor_makeNetworkCall_succeeds:

testLoggingInterceptor_makeNetworkCall_succeeds(org.oppia.android.data.backends.gae.NetworkLoggingInterceptorTest)
java.lang.IllegalStateException: This job has not completed yet
        at kotlinx.coroutines.JobSupport.getCompletedInternal$kotlinx_coroutines_core(JobSupport.kt:1199)
        at kotlinx.coroutines.DeferredCoroutine.getCompleted(Builders.common.kt:100)
        at org.oppia.android.data.backends.gae.NetworkLoggingInterceptorTest.testLoggingInterceptor_makeNetworkCall_succeeds(NetworkLoggingInterceptorTest.kt:99)

testLoggingInterceptor_makeNetworkCallWithInvalidUrl_failsAndCompletes:

testLoggingInterceptor_makeNetworkCallWithInvalidUrl_failsAndCompletes(org.oppia.android.data.backends.gae.NetworkLoggingInterceptorTest)
java.lang.IllegalStateException: This job has not completed yet
        at kotlinx.coroutines.JobSupport.getCompletedInternal$kotlinx_coroutines_core(JobSupport.kt:1199)
        at kotlinx.coroutines.DeferredCoroutine.getCompleted(Builders.common.kt:100)
        at org.oppia.android.data.backends.gae.NetworkLoggingInterceptorTest.testLoggingInterceptor_makeNetworkCallWithInvalidUrl_failsAndCompletes(NetworkLoggingInterceptorTest.kt:123)

from oppia-android.

BenHenning commented on August 16, 2024

So the ultimate problem for both tests seems to be this line:

    val firstRequest = firstRequestsDeferred.getCompleted().single()

This was caused indirectly by #5402 since it fixed the tests to properly verify what they're trying to verify.

Note that #5402 also made similar changes to ConsoleLoggerTest. I verified that this test, fortunately, doesn't have any flakes per the following run:

bazel test --runs_per_test=100 //utility/src/test/java/org/oppia/android/util/logging:ConsoleLoggerTest

However, this is probably a false result since, per the previous comment, it seems that multiple tests need to cooperate in order to trigger the failure.

from oppia-android.

BenHenning commented on August 16, 2024

I've added a bunch of println instrumentation lines to better understand what's happening, and disabled the testLoggingInterceptor_makeCrashingNetworkCall_failsAndCompletes test just to make sure it isn't impacting the flakiness (and to reduce noise).

See https://gist.github.com/BenHenning/473e76df316acdfc99c609c363db155f for the instrumentation used for these results.

Here's the debug output when testLoggingInterceptor_makeNetworkCall_succeeds fails:

@@@@@@ [B] pre mockWebServer.enqueue(mockResponse)
@@@@@@ [B] pre request execute
@@@@@@ NetworkLoggingInterceptor.intercept()
@@@@@@ intercept() pre-launch success emit launch
@@@@@@ intercept() pre-launch fail 1 emit launch
@@@@@@ intercept() top-level is done
@@@@@@ [B] pre coroutines sync
@@@@@@ [B] pre networkLoggingInterceptor.logNetworkCallFlow.take(1).toList()
@@@@@@ [B] pre networkLoggingInterceptor.logFailedNetworkCallFlow.take(1).toList()
@@@@@@ intercept() pre actual success emit()
@@@@@@ intercept() pre actual fail 1 emit()
@@@@@@ [B] post networkLoggingInterceptor.logNetworkCallFlow.take(1).toList()
@@@@@@ [B] post networkLoggingInterceptor.logFailedNetworkCallFlow.take(1).toList()
@@@@@@ intercept() post actual fail 1 emit()
@@@@@@ intercept() post actual success emit()
@@@@@@ [B] pre test coroutine sync
@@@@@@ [B] pre firstRequestsDeferred get completed
@@@@@@ [B] pre firstFailingRequestsDeferred get completed
@@@@@@ [B] post all getCompleted()s
.@@@@@@ [A] pre mockWebServer.enqueue(MockResponse().setBody(testResponseBody))
@@@@@@ [A] pre request execute
@@@@@@ NetworkLoggingInterceptor.intercept()
@@@@@@ intercept() pre-launch success emit launch
@@@@@@ intercept() top-level is done
@@@@@@ [A] pre coroutines sync
@@@@@@ intercept() pre actual success emit()
@@@@@@ [A] pre networkLoggingInterceptor.logNetworkCallFlow.take(1).toList()
@@@@@@ intercept() post actual success emit()
@@@@@@ [A] pre test coroutine sync
@@@@@@ [A] pre get completed
E
Time: 14.058
There was 1 failure:
1) testLoggingInterceptor_makeNetworkCall_succeeds(org.oppia.android.data.backends.gae.NetworkLoggingInterceptorTest)
java.lang.IllegalStateException: This job has not completed yet
        at kotlinx.coroutines.JobSupport.getCompletedInternal$kotlinx_coroutines_core(JobSupport.kt:1199)
        at kotlinx.coroutines.DeferredCoroutine.getCompleted(Builders.common.kt:100)
        at org.oppia.android.data.backends.gae.NetworkLoggingInterceptorTest.testLoggingInterceptor_makeNetworkCall_succeeds(NetworkLoggingInterceptorTest.kt:108)

FAILURES!!!
Tests run: 2,  Failures: 1

Here's the debug output when testLoggingInterceptor_makeNetworkCall_succeeds passes:

@@@@@@ [B] pre mockWebServer.enqueue(mockResponse)
@@@@@@ [B] pre request execute
@@@@@@ NetworkLoggingInterceptor.intercept()
@@@@@@ intercept() pre-launch success emit launch
@@@@@@ intercept() pre-launch fail 1 emit launch
@@@@@@ intercept() top-level is done
@@@@@@ [B] pre coroutines sync
@@@@@@ [B] pre networkLoggingInterceptor.logNetworkCallFlow.take(1).toList()
@@@@@@ [B] pre networkLoggingInterceptor.logFailedNetworkCallFlow.take(1).toList()
@@@@@@ intercept() pre actual success emit()
@@@@@@ intercept() pre actual fail 1 emit()
@@@@@@ [B] post networkLoggingInterceptor.logFailedNetworkCallFlow.take(1).toList()
@@@@@@ [B] post networkLoggingInterceptor.logNetworkCallFlow.take(1).toList()
@@@@@@ intercept() post actual fail 1 emit()
@@@@@@ intercept() post actual success emit()
@@@@@@ [B] pre test coroutine sync
@@@@@@ [B] pre firstRequestsDeferred get completed
@@@@@@ [B] pre firstFailingRequestsDeferred get completed
@@@@@@ [B] post all getCompleted()s
.@@@@@@ [A] pre mockWebServer.enqueue(MockResponse().setBody(testResponseBody))
@@@@@@ [A] pre request execute
@@@@@@ NetworkLoggingInterceptor.intercept()
@@@@@@ intercept() pre-launch success emit launch
@@@@@@ intercept() top-level is done
@@@@@@ [A] pre coroutines sync
@@@@@@ intercept() pre actual success emit()
@@@@@@ [A] pre networkLoggingInterceptor.logNetworkCallFlow.take(1).toList()
@@@@@@ [A] post networkLoggingInterceptor.logNetworkCallFlow.take(1).toList()
@@@@@@ intercept() post actual success emit()
@@@@@@ [A] pre test coroutine sync
@@@@@@ [A] pre get completed
@@@@@@ [A] post get completed

Time: 4.972

OK (2 tests)

The diff is interesting:

Specific takeaways:

logNetworkCallFlow and logFailedNetworkCallFlow are processed in reverse order in testLoggingInterceptor_makeNetworkCallWithInvalidUrl_failsAndCompletes in the passing case.
Fetching networkLoggingInterceptor.logNetworkCallFlow in testLoggingInterceptor_makeNetworkCall_succeeds actually finishes (plus getCompleted finishes--obviously that fails for the failure case since it's the actual exception being thrown).

I'm not certain the order difference matters here, and we're otherwise not getting interesting information on why the flow isn't completing.

from oppia-android.

BenHenning commented on August 16, 2024

Ah, I think I figured it out. Per https://kotlinlang.org/api/kotlinx.coroutines/kotlinx-coroutines-core/kotlinx.coroutines.flow/-mutable-shared-flow.html MutableSharedFlow has no replay caching by default which makes it time sensitive to subscriptions. We can see in the above logs that emit() is actually being called before we try to consume the flow which means there's a data race between starting to observe the flow and a value actually being delivered (events delivered will only block if there's another event being delivered, not if there zero subscribers--those are simply lost).

Adding another additional synchronization barrier seems to fix the problem by ensuring flow consumption happens before emit() is called.

from oppia-android.

[BUG]: Flaky Test: testLoggingInterceptor_makeNetworkCall_succeeds about oppia-android HOT 5 CLOSED

Comments (5)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent