Type: Bug
Resolution: Gone away
Priority: Major - P3
Affects Version/s: 3.6.5
Component/s: Replication
In this test, we start a batch insert on a background thread and block it with a failpoint, then use mongobridge to temporarily partition off the primary. We expect the primary to notice the partition and step down, causing the insert to fail with either a network error (the original behavior, when the primary closed connections during stepdown) or, post-SERVER-38516, an InterruptedDueToStepdown error.
Meanwhile, we wait for a new node to be elected, then partition it off so that it steps down in turn, and finally unpartition the old primary and wait for it to become primary again.
Only at this point do we join the background thread, which should have received the expected error by now.
There is a race condition, however: we never wait to confirm that the original primary actually steps down. We could partition it off, wait for a new primary to be elected while the old one still believes it is primary (split brain), then unpartition the old primary quickly enough that it never steps down at all. In that case the insert is never interrupted, and the insert thread fails because it does not get the error it expects.
A "waitForState" that ensures the original primary steps down should fix the rare failure.
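To make the ordering problem concrete, here is a toy, self-contained model of the race in plain JavaScript. It is only a sketch: it does not use the real ReplSetTest or mongobridge helpers, and `runScenario`, `ticksBeforeHeal`, and `STEPDOWN_AFTER_TICKS` are hypothetical names invented for illustration. The partitioned primary steps down only after a few "ticks" of detection latency, and the blocked insert is interrupted only if that stepdown actually happens.

```javascript
// Toy model of the race described above. Not the real test harness.
// All names here are hypothetical and exist only for illustration.
function runScenario({ waitForStepdown, ticksBeforeHeal }) {
    // Assumed detection latency: the partitioned primary needs this many
    // ticks before it notices it has lost its majority and steps down.
    const STEPDOWN_AFTER_TICKS = 3;

    let state = "PRIMARY";
    let ticksPartitioned = 0;
    let insertResult = null; // result seen by the blocked background insert

    const tick = () => {
        ticksPartitioned += 1;
        if (state === "PRIMARY" && ticksPartitioned >= STEPDOWN_AFTER_TICKS) {
            state = "SECONDARY";
            insertResult = "InterruptedDueToStepdown";
        }
    };

    // The partition is in place while a new primary is elected elsewhere.
    for (let i = 0; i < ticksBeforeHeal; i++) {
        tick();
    }

    // Proposed fix: keep the partition up until the old primary steps down.
    if (waitForStepdown) {
        while (state !== "SECONDARY") {
            tick();
        }
    }

    // The partition is healed and the background thread is joined. If no
    // stepdown ever happened, the insert simply completes successfully.
    if (insertResult === null) {
        insertResult = "ok";
    }
    return insertResult;
}

// Healing too early without the wait: the insert is never interrupted.
console.log(runScenario({ waitForStepdown: false, ticksBeforeHeal: 1 }));
// prints "ok"

// Same timing with the wait: the insert sees the expected error.
console.log(runScenario({ waitForStepdown: true, ticksBeforeHeal: 1 }));
// prints "InterruptedDueToStepdown"
```

In the real test, the wait would presumably be a call to the replica set test fixture's waitForState on the old primary, made before the partition is removed; the exact arguments are not given in this ticket.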