[SERVER-13351] Slaves halt replication if oplog getmore returns $err Created: 25/Mar/14  Updated: 06/Dec/22  Resolved: 12/Feb/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: J Rassi Assignee: Backlog - Replication Team
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-13435 dropdb_race.js in small_oplog (causes... Closed
Assigned Teams:
Replication
Operating System: ALL
Participants:

 Description   

In a master-slave configuration, slaves (fatally) halt replication if the oplog getmore returns an $err response. Instead, slaves should retry the oplog query.

The master will return an $err response to an oplog getmore under a number of conditions, most notably if a closeAllDatabases command was just issued (as seen in dropdb_race.js test failures under the small_oplog passthrough).
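
One such trigger can be exercised directly from the shell; a rough sketch, assuming the master connection m from the repro script below:

// Sketch only: per the description above, closing all databases on the master
// is one condition after which the slave's next oplog getmore returns $err.
// Assumes `m` is a connection to the master, as in the script below.
m.getDB("admin").runCommand({ closeAllDatabases: 1 });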

This issue does not affect replica set secondaries (secondaries correctly retry the oplog query when this condition occurs).
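
For comparison, the retry behavior expected of the slave can be sketched at the shell level; this is an illustration only (not the server's sync code), assuming a connection to the master and a last-applied optime lastTs:

// Illustration only: tail the master's oplog and, on a transient $err (e.g.
// "operation was interrupted"), re-issue the query from the adjusted optime
// instead of halting replication; roughly what replica set secondaries do.
function tailOplogWithRetry(master, lastTs) {
  while (true) {
    try {
      var cursor = master.getDB("local").getCollection("oplog.$main")
                         .find({ts: {$gte: lastTs}})
                         .addOption(DBQuery.Option.tailable | DBQuery.Option.awaitData);
      while (cursor.hasNext()) {
        var op = cursor.next();
        lastTs = op.ts;          // adjust the optime as ops are applied
        // ... apply op on the slave ...
      }
    } catch (e) {
      // Transient failure: retry the oplog query rather than halting.
      print("oplog query failed, retrying: " + e);
      sleep(1000);
    }
  }
}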

Reproduce with the following script, which forces an $err response via killOp (the failure is observed on line 14 of the script, after replication halts).

var rt = new ReplTest("foo");
var m = rt.start(true);   // master
var s = rt.start(false);  // slave
var db = m.getDB("test");

for (var i = 0; i < 5; i++) {
  for (var j = 0; j < 1000; j++) {
    db.foo.insert({});
  }
  var x = db.currentOp({ns: 'local.oplog.$main', op: 'getmore'}).inprog;  // find the slave's oplog getmore, if any
  if (x.length) {
    db.killOp(x[0].opid);  // interrupt it, forcing an $err response to the slave
  }
  assert.writeOK(db.foo.insert({}, {writeConcern: {w: 2, wtimeout: 20000}}));  // fails with wtimeout once replication halts
}

Log output for the above script, showing the failure:

 m31001| 2014-03-25T18:09:48.560-0400 [replslave] resync: cloning database test to get an initial copy
 m31000| 2014-03-25T18:09:48.560-0400 [initandlisten] connection accepted from 127.0.0.1:57289 #3 (3 connections now open)
 m31001| 2014-03-25T18:09:48.560-0400 [FileAllocator] allocating new datafile /data/db/foo-slave/test.ns, filling with zeroes...
 m31001| 2014-03-25T18:09:48.627-0400 [FileAllocator] done allocating datafile /data/db/foo-slave/test.ns, size: 16MB,  took 0.066 secs
 m31001| 2014-03-25T18:09:48.652-0400 [FileAllocator] allocating new datafile /data/db/foo-slave/test.0, filling with zeroes...
 m31001| 2014-03-25T18:09:48.713-0400 [FileAllocator] done allocating datafile /data/db/foo-slave/test.0, size: 16MB,  took 0.06 secs
 m31001| 2014-03-25T18:09:48.740-0400 [replslave] build index on: test.foo properties: { v: 1, key: { _id: 1 }, name: "_id_", ns: "test.foo" }
 m31001| 2014-03-25T18:09:48.741-0400 [replslave] 	 building index using bulk method
 m31001| 2014-03-25T18:09:48.741-0400 [replslave] build index done.  scanned 114 total records. 0.001 secs
 m31001| 2014-03-25T18:09:48.741-0400 [replslave] resync: done with initial clone for db: test
 m31000| 2014-03-25T18:09:48.741-0400 [conn3] end connection 127.0.0.1:57289 (2 connections now open)
 m31000| 2014-03-25T18:09:49.746-0400 [slaveTracking] build index on: local.slaves properties: { v: 1, key: { _id: 1 }, name: "_id_", ns: "local.slaves" }
 m31000| 2014-03-25T18:09:49.747-0400 [slaveTracking] 	 added index to empty collection
 m31000| 2014-03-25T18:09:55.100-0400 [conn1] going to kill op: op: 3916.0
 m31000| 2014-03-25T18:09:55.105-0400 [conn2] getmore local.oplog.$main cursorid:19256671971 ntoreturn:0 keyUpdates:0 exception: operation was interrupted code:11601 numYields:0 locks(micros) r:81 nreturned:1 reslen:71 20ms
 m31001| 2014-03-25T18:09:55.105-0400 [replslave] sync error: problem querying remote oplog record
 m31001| 2014-03-25T18:09:55.105-0400 [replslave] op: { $err: "operation was interrupted", code: 11601 }
 m31001| 2014-03-25T18:09:55.105-0400 [replslave] halting replication
 m31001| 2014-03-25T18:09:55.105-0400 [replslave] caught SyncException
 m31001| 2014-03-25T18:09:55.105-0400 [replslave] repl: sleep 10 sec before next pass
 m31001| 2014-03-25T18:10:05.107-0400 [replslave] all sources dead: sync error: no ts found querying remote oplog record, sleeping for 5 seconds
 m31000| 2014-03-25T18:10:05.107-0400 [conn2] killcursors: found 0 of 1
 m31000| 2014-03-25T18:10:05.107-0400 [conn2] end connection 127.0.0.1:57288 (1 connection now open)
 m31001| 2014-03-25T18:10:10.108-0400 [replslave] all sources dead: sync error: no ts found querying remote oplog record, sleeping for 5 seconds
 m31000| 2014-03-25T18:10:15.106-0400 [conn1] command test.$cmd command: insert { insert: "foo", documents: [ { _id: ObjectId('5331feb3cd9f37898e100a9f') } ], ordered: true, writeConcern: { w: 2.0, wtimeout: 20000.0 } } keyUpdates:0 numYields:0 locks(micros) w:3 reslen:161 20001ms
 m31001| 2014-03-25T18:10:15.108-0400 [replslave] all sources dead: sync error: no ts found querying remote oplog record, sleeping for 5 seconds
assert: write concern failed with errors: {
	"nInserted" : 1,
	"writeConcernError" : {
		"code" : 64,
		"errInfo" : {
			"wtimeout" : true
		},
		"errmsg" : "waiting for replication timed out"
	}
}
Error: write concern failed with errors: {
	"nInserted" : 1,
	"writeConcernError" : {
		"code" : 64,
		"errInfo" : {
			"wtimeout" : true
		},
		"errmsg" : "waiting for replication timed out"
	}
}
    at Error (<anonymous>)
    at doassert (src/mongo/shell/assert.js:11:14)
    at Function.assert.writeOK (src/mongo/shell/assert.js:363:13)
    at foo.js:14:10
2014-03-25T18:10:15.154-0400 Error: write concern failed with errors: {
	"nInserted" : 1,
	"writeConcernError" : {
		"code" : 64,
		"errInfo" : {
			"wtimeout" : true
		},
		"errmsg" : "waiting for replication timed out"
	}
} at src/mongo/shell/assert.js:13



 Comments   
Comment by Eric Milkie [ 12/Feb/18 ]

master/slave has gone away

Comment by J Rassi [ 26/Mar/14 ]

--autoresync will force a full resync after an oplog getmore fails (and will not resync more than once every ten minutes), but a full resync shouldn't be required to recover from this error. From discussion with eliot, I've come to understand that this failure is expected and transient (it will also be triggered by closeAllDatabases), and that the slave should retry the oplog query with the adjusted optime instead of performing a full resync.
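
For reference, the full resync in question can also be forced by hand with the resync command; a rough sketch, assuming the slave connection s from the repro script above:

// Sketch: manually trigger the heavyweight full resync on the slave; this is
// the operation --autoresync performs automatically, at most once every ten minutes.
// Assumes `s` is the slave connection from the repro script above.
printjson(s.getDB("admin").runCommand({resync: 1}));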

Comment by Eric Milkie [ 26/Mar/14 ]

Won't --autoresync avoid this problem?
