Core Server / SERVER-13351

Slaves halt replication if oplog getmore returns $err

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Replication
    • Operating System: ALL

      In a master-slave configuration, slaves (fatally) halt replication if the oplog getmore returns an $err response. Instead, slaves should retry the oplog query.

      The master will return an $err response to an oplog getmore under a number of conditions, most notably if a closeAllDatabases command was just issued (as seen in dropdb_race.js test failures under the small_oplog passthrough).

      This issue does not affect replica set secondaries (secondaries correctly retry the oplog query when this condition occurs).
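
      For contrast, the following is a rough sketch (not part of the original report) of the equivalent scenario against a two-node replica set, using the standard ReplSetTest shell helper (getPrimary() here; getMaster() on older shells); the set and collection names are arbitrary. The final w:2 insert is expected to succeed, because the secondary simply re-establishes its oplog cursor after receiving the $err.

      var rst = new ReplSetTest({name: "errRetry", nodes: 2});
      rst.startSet();
      rst.initiate();
      var pdb = rst.getPrimary().getDB("test");
      
      for (var i = 0; i < 1000; i++) {
          pdb.foo.insert({});
      }
      // Interrupt the secondary's tailing getmore on the replica set oplog, forcing an $err reply.
      var ops = pdb.currentOp({ns: "local.oplog.rs", op: "getmore"}).inprog;
      if (ops.length) {
          pdb.killOp(ops[0].opid);
      }
      // Unlike the master-slave case, the secondary retries the oplog query, so this does not time out.
      assert.writeOK(pdb.foo.insert({}, {writeConcern: {w: 2, wtimeout: 20000}}));
      rst.stopSet();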

      Reproduce with the following script, which forces an $err response via killOp; the failure is observed at line 14 (the w:2 insert), after replication halts.

      var rt = new ReplTest("foo");
      var m = rt.start(true);    // master
      var s = rt.start(false);   // slave
      var db = m.getDB("test");
      
      for (var i = 0; i < 5; i++) {
          for (var j = 0; j < 1000; j++) {
              db.foo.insert({});
          }
          var x = db.currentOp({ns: 'local.oplog.$main', op: 'getmore'}).inprog;
          if (x.length) {
              db.killOp(x[0].opid);  // interrupt the slave's oplog getmore, forcing an $err reply
          }
          assert.writeOK(db.foo.insert({}, {writeConcern: {w: 2, wtimeout: 20000}}));  // line 14: blocks, then times out once replication halts
      }
      
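      To confirm the halt independently of the 20-second write concern timeout, the slave's sync bookmark can be inspected from a second shell while the final insert is waiting. A rough diagnostic sketch (ports 31000/31001 match the master and slave in the log output below):

      var master = new Mongo("localhost:31000");
      var slave = new Mongo("localhost:31001");
      // On the slave, local.sources records the sync source; its syncedTo
      // optime stops advancing once replication halts.
      printjson(slave.getDB("local").sources.find().toArray());
      // On the master, this shell helper reports how far behind each slave is.
      master.getDB("test").printSlaveReplicationInfo();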

      Log output for the above script, showing the failure:

       m31001| 2014-03-25T18:09:48.560-0400 [replslave] resync: cloning database test to get an initial copy
       m31000| 2014-03-25T18:09:48.560-0400 [initandlisten] connection accepted from 127.0.0.1:57289 #3 (3 connections now open)
       m31001| 2014-03-25T18:09:48.560-0400 [FileAllocator] allocating new datafile /data/db/foo-slave/test.ns, filling with zeroes...
       m31001| 2014-03-25T18:09:48.627-0400 [FileAllocator] done allocating datafile /data/db/foo-slave/test.ns, size: 16MB,  took 0.066 secs
       m31001| 2014-03-25T18:09:48.652-0400 [FileAllocator] allocating new datafile /data/db/foo-slave/test.0, filling with zeroes...
       m31001| 2014-03-25T18:09:48.713-0400 [FileAllocator] done allocating datafile /data/db/foo-slave/test.0, size: 16MB,  took 0.06 secs
       m31001| 2014-03-25T18:09:48.740-0400 [replslave] build index on: test.foo properties: { v: 1, key: { _id: 1 }, name: "_id_", ns: "test.foo" }
       m31001| 2014-03-25T18:09:48.741-0400 [replslave] 	 building index using bulk method
       m31001| 2014-03-25T18:09:48.741-0400 [replslave] build index done.  scanned 114 total records. 0.001 secs
       m31001| 2014-03-25T18:09:48.741-0400 [replslave] resync: done with initial clone for db: test
       m31000| 2014-03-25T18:09:48.741-0400 [conn3] end connection 127.0.0.1:57289 (2 connections now open)
       m31000| 2014-03-25T18:09:49.746-0400 [slaveTracking] build index on: local.slaves properties: { v: 1, key: { _id: 1 }, name: "_id_", ns: "local.slaves" }
       m31000| 2014-03-25T18:09:49.747-0400 [slaveTracking] 	 added index to empty collection
       m31000| 2014-03-25T18:09:55.100-0400 [conn1] going to kill op: op: 3916.0
       m31000| 2014-03-25T18:09:55.105-0400 [conn2] getmore local.oplog.$main cursorid:19256671971 ntoreturn:0 keyUpdates:0 exception: operation was interrupted code:11601 numYields:0 locks(micros) r:81 nreturned:1 reslen:71 20ms
       m31001| 2014-03-25T18:09:55.105-0400 [replslave] sync error: problem querying remote oplog record
       m31001| 2014-03-25T18:09:55.105-0400 [replslave] op: { $err: "operation was interrupted", code: 11601 }
       m31001| 2014-03-25T18:09:55.105-0400 [replslave] halting replication
       m31001| 2014-03-25T18:09:55.105-0400 [replslave] caught SyncException
       m31001| 2014-03-25T18:09:55.105-0400 [replslave] repl: sleep 10 sec before next pass
       m31001| 2014-03-25T18:10:05.107-0400 [replslave] all sources dead: sync error: no ts found querying remote oplog record, sleeping for 5 seconds
       m31000| 2014-03-25T18:10:05.107-0400 [conn2] killcursors: found 0 of 1
       m31000| 2014-03-25T18:10:05.107-0400 [conn2] end connection 127.0.0.1:57288 (1 connection now open)
       m31001| 2014-03-25T18:10:10.108-0400 [replslave] all sources dead: sync error: no ts found querying remote oplog record, sleeping for 5 seconds
       m31000| 2014-03-25T18:10:15.106-0400 [conn1] command test.$cmd command: insert { insert: "foo", documents: [ { _id: ObjectId('5331feb3cd9f37898e100a9f') } ], ordered: true, writeConcern: { w: 2.0, wtimeout: 20000.0 } } keyUpdates:0 numYields:0 locks(micros) w:3 reslen:161 20001ms
       m31001| 2014-03-25T18:10:15.108-0400 [replslave] all sources dead: sync error: no ts found querying remote oplog record, sleeping for 5 seconds
      assert: write concern failed with errors: {
      	"nInserted" : 1,
      	"writeConcernError" : {
      		"code" : 64,
      		"errInfo" : {
      			"wtimeout" : true
      		},
      		"errmsg" : "waiting for replication timed out"
      	}
      }
      Error: write concern failed with errors: {
      	"nInserted" : 1,
      	"writeConcernError" : {
      		"code" : 64,
      		"errInfo" : {
      			"wtimeout" : true
      		},
      		"errmsg" : "waiting for replication timed out"
      	}
      }
          at Error (<anonymous>)
          at doassert (src/mongo/shell/assert.js:11:14)
          at Function.assert.writeOK (src/mongo/shell/assert.js:363:13)
          at foo.js:14:10
      2014-03-25T18:10:15.154-0400 Error: write concern failed with errors: {
      	"nInserted" : 1,
      	"writeConcernError" : {
      		"code" : 64,
      		"errInfo" : {
      			"wtimeout" : true
      		},
      		"errmsg" : "waiting for replication timed out"
      	}
      } at src/mongo/shell/assert.js:13
      

            Assignee: backlog-server-repl ([DO NOT USE] Backlog - Replication Team)
            Reporter: rassi (J Rassi)
            Votes: 0
            Watchers: 4
