Core Server / SERVER-39755

Race in interrupted_batch_insert.js

    • Type: Bug
    • Resolution: Gone away
    • Priority: Major - P3
    • Fix Version/s: None
    • Affects Version/s: 3.6.5
    • Component/s: Replication
    • Labels: None
    • Operating System: ALL
    • 7

      In this test, we start a batch insert on a background thread, block the batch with a failpoint, and then use mongobridge to temporarily partition off the primary. We expect the primary to notice the partition and step down, which makes the insert fail either with a network error (originally, because the primary closed its connections when it stepped down) or, since SERVER-38516, with an InterruptedDueToStepdown error.

      Meanwhile, we wait for another node to be elected, partition that node off so that it steps down in turn, and finally reconnect the old primary and wait for it to become primary again.

      Only at this point do we join the background thread, which by now should have received the expected error.

      There is a race, however: we never wait to confirm that the original primary actually steps down. We could partition it off, see a new primary elected while the old one still considers itself primary (split brain), and then reconnect the old primary quickly enough that it never steps down at all. In that case the insert is never interrupted, and the background thread fails because it does not receive the error it expects.

      Adding a "waitForState" call that confirms the original primary has stepped down, before it is reconnected, should fix this rare failure.
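      To make the race concrete, here is a toy, single-threaded model of it in JavaScript. It is not the real test or server code: the logical clock, the node object, the fixed STEPDOWN_AFTER_TICKS threshold, and every helper name (including this waitForState, which only mimics the idea of a wait-for-state helper) are hypothetical. Real stepdown timing depends on election timeouts and heartbeats, and the new primary is not modeled at all.

```javascript
// Toy model of the race in interrupted_batch_insert.js. Everything here is
// hypothetical: a logical clock stands in for real time, and a fixed threshold
// stands in for the timeouts that make a partitioned primary step down.

// Hypothetical: a partitioned primary steps down after this many ticks.
const STEPDOWN_AFTER_TICKS = 10;

function makeOldPrimary() {
  return { state: "PRIMARY", partitioned: false, partitionedFor: 0, steppedDown: false };
}

// Advance the logical clock by one tick for this node.
function tick(node) {
  if (!node.partitioned) return;
  node.partitionedFor += 1;
  if (node.state === "PRIMARY" && node.partitionedFor >= STEPDOWN_AFTER_TICKS) {
    node.state = "SECONDARY";
    node.steppedDown = true; // in the real test, the stepdown interrupts the insert
  }
}

// Hypothetical stand-in for a wait-for-state helper: keep advancing the clock
// until the node reports the wanted state, or throw after timeoutTicks.
function waitForState(node, wanted, timeoutTicks) {
  for (let t = 0; t <= timeoutTicks; t++) {
    if (node.state === wanted) return t;
    tick(node);
  }
  throw new Error(`timed out waiting for ${wanted}`);
}

// partitionTicks: how long the old primary stays partitioned when we do NOT
// wait for it (roughly, however long the rest of the test takes before it is
// reconnected). Returns the error the blocked insert would see, or
// "no error" if it is never interrupted.
function runScenario(partitionTicks, waitForStepdown) {
  const oldPrimary = makeOldPrimary();
  oldPrimary.partitioned = true;
  if (waitForStepdown) {
    waitForState(oldPrimary, "SECONDARY", 1000); // the proposed fix
  } else {
    for (let t = 0; t < partitionTicks; t++) tick(oldPrimary);
  }
  oldPrimary.partitioned = false; // reconnect the old primary
  return oldPrimary.steppedDown ? "InterruptedDueToStepdown" : "no error";
}

console.log(runScenario(5, false));  // reconnected too soon: "no error"
console.log(runScenario(20, false)); // partitioned long enough: "InterruptedDueToStepdown"
console.log(runScenario(5, true));   // with the wait: "InterruptedDueToStepdown"
```

      The point of the fix, in this model, is that it replaces an assumption about timing with an explicit check: whether the insert is interrupted no longer depends on how long the intervening election happens to take.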

            Assignee: A. Jesse Jiryu Davis (jesse@mongodb.com)
            Reporter: A. Jesse Jiryu Davis (jesse@mongodb.com)
            Votes: 0
            Watchers: 2
