Core Server / SERVER-20814

'majority' write stuck waiting for replication after both secondaries fell off the end of the oplog

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Replication
    • Labels: None
    • Operating System: ALL

      This happened during execution of jstests/sharding/conf_server_write_concern.js. A write using writeConcern: 'majority' got stuck in the awaitReplication call while both secondaries entered the RECOVERING state.
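
      For reference, a minimal mongo shell sketch of the kind of operation that hangs this way (the namespace and update are illustrative, not taken from the test itself): a w:'majority' write issued without a wtimeout blocks in awaitReplication until a majority of the set has replicated it, which can never happen once both secondaries are RECOVERING.

      // Illustrative sketch, not the actual test code: a majority write with
      // no wtimeout waits until two of the three members have replicated it.
      var res = db.getSiblingDB('config').settings.update(
          {_id: 'chunksize'},              // hypothetical document
          {$set: {value: 32}},
          {writeConcern: {w: 'majority'}}  // no wtimeout => waits indefinitely
      );
      printjson(res);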

      Below are the relevant lines from the logs; the call stacks are attached.

      This is the Evergreen task (it is from a patch build with only test changes).

      ...
      [js_test:conf_server_write_concern] 2015-10-07T19:12:57.365+0000 d20018| 2015-10-07T19:12:57.364+0000 I REPL     [ReplicationExecutor] could not find member to sync from
      [js_test:conf_server_write_concern] 2015-10-07T19:12:57.365+0000 d20018| 2015-10-07T19:12:57.365+0000 I REPL     [ReplicationExecutor] Member ip-10-168-64-199:20017 is now in state SECONDARY
      
      ...
      [js_test:conf_server_write_concern] 2015-10-07T19:13:05.470+0000 d20016| 2015-10-07T19:13:05.469+0000 I REPL     [ReplicationExecutor] transition to PRIMARY
      ...
      [js_test:conf_server_write_concern] 2015-10-07T19:13:06.373+0000 d20017| 2015-10-07T19:13:06.372+0000 I REPL     [ReplicationExecutor] Member ip-10-168-64-199:20016 is now in state PRIMARY
      [js_test:conf_server_write_concern] 2015-10-07T19:13:06.444+0000 d20016| 2015-10-07T19:13:06.444+0000 I REPL     [rsSync] transition to primary complete; database writes are now permitted
      ...
      [js_test:conf_server_write_concern] 2015-10-07T19:13:07.405+0000 d20018| 2015-10-07T19:13:07.404+0000 I REPL     [ReplicationExecutor] Member ip-10-168-64-199:20016 is now in state PRIMARY
      ...
      [js_test:conf_server_write_concern] 2015-10-07T19:13:08.408+0000 d20018| 2015-10-07T19:13:08.408+0000 I REPL     [ReplicationExecutor] syncing from: ip-10-168-64-199:20016
      ...
      [js_test:conf_server_write_concern] 2015-10-07T19:13:08.450+0000 d20018| 2015-10-07T19:13:08.449+0000 I REPL     [ReplicationExecutor] could not find member to sync from
      ...
      [js_test:conf_server_write_concern] 2015-10-07T19:13:09.481+0000 d20016| 2015-10-07T19:13:09.480+0000 I REPL     [ReplicationExecutor] Member ip-10-168-64-199:20017 is now in state RECOVERING
      ...
      [js_test:conf_server_write_concern] 2015-10-07T19:14:08.686+0000 d20018| 2015-10-07T19:14:08.686+0000 I REPL     [ReplicationExecutor] syncing from: ip-10-168-64-199:20016
      ...
      [js_test:conf_server_write_concern] 2015-10-07T19:14:08.688+0000 d20018| 2015-10-07T19:14:08.687+0000 W REPL     [rsBackgroundSync] we are too stale to use ip-10-168-64-199:20016 as a sync source
      ...
      [js_test:conf_server_write_concern] 2015-10-07T19:14:08.689+0000 d20018| 2015-10-07T19:14:08.688+0000 I REPL     [ReplicationExecutor] could not find member to sync from
      [js_test:conf_server_write_concern] 2015-10-07T19:14:08.689+0000 d20018| 2015-10-07T19:14:08.688+0000 E REPL     [rsBackgroundSync] too stale to catch up
      [js_test:conf_server_write_concern] 2015-10-07T19:14:08.689+0000 d20018| 2015-10-07T19:14:08.688+0000 I REPL     [rsBackgroundSync] our last optime : (term: 0, timestamp: Oct  7 19:12:55:1)
      [js_test:conf_server_write_concern] 2015-10-07T19:14:08.690+0000 d20018| 2015-10-07T19:14:08.688+0000 I REPL     [rsBackgroundSync] oldest available is (term: 1, timestamp: Oct  7 19:13:08:11)
      [js_test:conf_server_write_concern] 2015-10-07T19:14:08.690+0000 d20018| 2015-10-07T19:14:08.688+0000 I REPL     [rsBackgroundSync] See http://dochub.mongodb.org/core/resyncingaverystalereplicasetmember
      [js_test:conf_server_write_concern] 2015-10-07T19:14:08.690+0000 d20018| 2015-10-07T19:14:08.688+0000 I REPL     [ReplicationExecutor] transition to RECOVERING
      ...
      [js_test:conf_server_write_concern] 2015-10-07T19:14:09.625+0000 d20017| 2015-10-07T19:14:09.623+0000 I REPL     [ReplicationExecutor] could not find member to sync from
      [js_test:conf_server_write_concern] 2015-10-07T19:14:09.625+0000 d20017| 2015-10-07T19:14:09.624+0000 E REPL     [rsBackgroundSync] too stale to catch up
      [js_test:conf_server_write_concern] 2015-10-07T19:14:09.625+0000 d20017| 2015-10-07T19:14:09.624+0000 I REPL     [rsBackgroundSync] our last optime : (term: 0, timestamp: Oct  7 19:12:55:1)
      [js_test:conf_server_write_concern] 2015-10-07T19:14:09.625+0000 d20017| 2015-10-07T19:14:09.624+0000 I REPL     [rsBackgroundSync] oldest available is (term: 1, timestamp: Oct  7 19:13:08:11)
      [js_test:conf_server_write_concern] 2015-10-07T19:14:09.626+0000 d20017| 2015-10-07T19:14:09.624+0000 I REPL     [ReplicationExecutor] Member ip-10-168-64-199:20018 is now in state RECOVERING
      [js_test:conf_server_write_concern] 2015-10-07T19:14:09.626+0000 d20017| 2015-10-07T19:14:09.624+0000 I REPL     [rsBackgroundSync] See http://dochub.mongodb.org/core/resyncingaverystalereplicasetmember
      ...
      [js_test:conf_server_write_concern] 2015-10-07T19:14:09.618+0000 d20016| 2015-10-07T19:14:09.618+0000 I REPL     [ReplicationExecutor] Member ip-10-168-64-199:20018 is now in state RECOVERING
      ...
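
      The "too stale to catch up" messages above mean each secondary's last applied optime (term: 0, timestamp Oct 7 19:12:55:1) is now older than the oldest entry remaining in the primary's oplog (term: 1, timestamp Oct 7 19:13:08:11), so neither can resume syncing. A rough way to confirm that condition by hand, run against the primary (a sketch; field access assumes the 3.2-era rs.status() layout):

      // Sketch, run on the current primary (d20016 above): compare each
      // member's reported optime with the oldest entry still in this oplog.
      var oldest = db.getSiblingDB('local').oplog.rs
                       .find().sort({$natural: 1}).limit(1).next().ts;
      rs.status().members.forEach(function(m) {
          // On 3.2-era builds m.optime is {ts: Timestamp, t: NumberLong}.
          var ts = (m.optime && m.optime.ts) ? m.optime.ts : m.optime;
          var tooStale = (ts.t < oldest.t) ||
                         (ts.t === oldest.t && ts.i < oldest.i);
          print(m.name + ' too stale to sync from here: ' + tooStale);
      });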
      

            Assignee:
            milkie@mongodb.com Eric Milkie
            Reporter:
            kaloian.manassiev@mongodb.com Kaloian Manassiev
            Votes:
            0
            Watchers:
            3
