Core Server / SERVER-13351

Slaves halt replication if oplog getmore returns $err

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Replication
    • Operating System: ALL

      In a master-slave configuration, slaves (fatally) halt replication if the oplog getmore returns an $err response. Instead, slaves should retry the oplog query.

      The master will return an $err response to an oplog getmore under a number of conditions, most notably if a closeAllDatabases command was just issued (as seen in dropdb_race.js test failures under the small_oplog passthrough).

      This issue does not affect replica set secondaries (secondaries correctly retry the oplog query when this condition occurs).
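
      For contrast, the following is a rough sketch (not part of the original report) of the equivalent scenario against a two-node replica set, using the standard ReplSetTest shell helper (getPrimary() here; getMaster() on older shells); the set and collection names are arbitrary. The final w:2 insert is expected to succeed, because the secondary simply re-establishes its oplog cursor after receiving the $err.

      var rst = new ReplSetTest({name: "errRetry", nodes: 2});
      rst.startSet();
      rst.initiate();
      var pdb = rst.getPrimary().getDB("test");
      
      for (var i = 0; i < 1000; i++) {
          pdb.foo.insert({});
      }
      // Interrupt the secondary's tailing getmore on the replica set oplog, forcing an $err reply.
      var ops = pdb.currentOp({ns: "local.oplog.rs", op: "getmore"}).inprog;
      if (ops.length) {
          pdb.killOp(ops[0].opid);
      }
      // Unlike the master-slave case, the secondary retries the oplog query, so this does not time out.
      assert.writeOK(pdb.foo.insert({}, {writeConcern: {w: 2, wtimeout: 20000}}));
      rst.stopSet();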

      Reproduce with the following script, which forces an $err response via killOp; the failure is observed at line 14 (the w:2 insert), after replication halts.

      var rt = new ReplTest("foo");
      var m = rt.start(true);    // master
      var s = rt.start(false);   // slave
      var db = m.getDB("test");
      
      for (var i = 0; i < 5; i++) {
          for (var j = 0; j < 1000; j++) {
              db.foo.insert({});
          }
          var x = db.currentOp({ns: 'local.oplog.$main', op: 'getmore'}).inprog;
          if (x.length) {
              db.killOp(x[0].opid);  // interrupt the slave's oplog getmore, forcing an $err reply
          }
          assert.writeOK(db.foo.insert({}, {writeConcern: {w: 2, wtimeout: 20000}}));  // line 14: blocks, then times out once replication halts
      }
      
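      To confirm the halt independently of the 20-second write concern timeout, the slave's sync bookmark can be inspected from a second shell while the final insert is waiting. A rough diagnostic sketch (ports 31000/31001 match the master and slave in the log output below):

      var master = new Mongo("localhost:31000");
      var slave = new Mongo("localhost:31001");
      // On the slave, local.sources records the sync source; its syncedTo
      // optime stops advancing once replication halts.
      printjson(slave.getDB("local").sources.find().toArray());
      // On the master, this shell helper reports how far behind each slave is.
      master.getDB("test").printSlaveReplicationInfo();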

      Log output for the above script, showing the failure:

       m31001| 2014-03-25T18:09:48.560-0400 [replslave] resync: cloning database test to get an initial copy
       m31000| 2014-03-25T18:09:48.560-0400 [initandlisten] connection accepted from 127.0.0.1:57289 #3 (3 connections now open)
       m31001| 2014-03-25T18:09:48.560-0400 [FileAllocator] allocating new datafile /data/db/foo-slave/test.ns, filling with zeroes...
       m31001| 2014-03-25T18:09:48.627-0400 [FileAllocator] done allocating datafile /data/db/foo-slave/test.ns, size: 16MB,  took 0.066 secs
       m31001| 2014-03-25T18:09:48.652-0400 [FileAllocator] allocating new datafile /data/db/foo-slave/test.0, filling with zeroes...
       m31001| 2014-03-25T18:09:48.713-0400 [FileAllocator] done allocating datafile /data/db/foo-slave/test.0, size: 16MB,  took 0.06 secs
       m31001| 2014-03-25T18:09:48.740-0400 [replslave] build index on: test.foo properties: { v: 1, key: { _id: 1 }, name: "_id_", ns: "test.foo" }
       m31001| 2014-03-25T18:09:48.741-0400 [replslave] 	 building index using bulk method
       m31001| 2014-03-25T18:09:48.741-0400 [replslave] build index done.  scanned 114 total records. 0.001 secs
       m31001| 2014-03-25T18:09:48.741-0400 [replslave] resync: done with initial clone for db: test
       m31000| 2014-03-25T18:09:48.741-0400 [conn3] end connection 127.0.0.1:57289 (2 connections now open)
       m31000| 2014-03-25T18:09:49.746-0400 [slaveTracking] build index on: local.slaves properties: { v: 1, key: { _id: 1 }, name: "_id_", ns: "local.slaves" }
       m31000| 2014-03-25T18:09:49.747-0400 [slaveTracking] 	 added index to empty collection
       m31000| 2014-03-25T18:09:55.100-0400 [conn1] going to kill op: op: 3916.0
       m31000| 2014-03-25T18:09:55.105-0400 [conn2] getmore local.oplog.$main cursorid:19256671971 ntoreturn:0 keyUpdates:0 exception: operation was interrupted code:11601 numYields:0 locks(micros) r:81 nreturned:1 reslen:71 20ms
       m31001| 2014-03-25T18:09:55.105-0400 [replslave] sync error: problem querying remote oplog record
       m31001| 2014-03-25T18:09:55.105-0400 [replslave] op: { $err: "operation was interrupted", code: 11601 }
       m31001| 2014-03-25T18:09:55.105-0400 [replslave] halting replication
       m31001| 2014-03-25T18:09:55.105-0400 [replslave] caught SyncException
       m31001| 2014-03-25T18:09:55.105-0400 [replslave] repl: sleep 10 sec before next pass
       m31001| 2014-03-25T18:10:05.107-0400 [replslave] all sources dead: sync error: no ts found querying remote oplog record, sleeping for 5 seconds
       m31000| 2014-03-25T18:10:05.107-0400 [conn2] killcursors: found 0 of 1
       m31000| 2014-03-25T18:10:05.107-0400 [conn2] end connection 127.0.0.1:57288 (1 connection now open)
       m31001| 2014-03-25T18:10:10.108-0400 [replslave] all sources dead: sync error: no ts found querying remote oplog record, sleeping for 5 seconds
       m31000| 2014-03-25T18:10:15.106-0400 [conn1] command test.$cmd command: insert { insert: "foo", documents: [ { _id: ObjectId('5331feb3cd9f37898e100a9f') } ], ordered: true, writeConcern: { w: 2.0, wtimeout: 20000.0 } } keyUpdates:0 numYields:0 locks(micros) w:3 reslen:161 20001ms
       m31001| 2014-03-25T18:10:15.108-0400 [replslave] all sources dead: sync error: no ts found querying remote oplog record, sleeping for 5 seconds
      assert: write concern failed with errors: {
      	"nInserted" : 1,
      	"writeConcernError" : {
      		"code" : 64,
      		"errInfo" : {
      			"wtimeout" : true
      		},
      		"errmsg" : "waiting for replication timed out"
      	}
      }
      Error: write concern failed with errors: {
      	"nInserted" : 1,
      	"writeConcernError" : {
      		"code" : 64,
      		"errInfo" : {
      			"wtimeout" : true
      		},
      		"errmsg" : "waiting for replication timed out"
      	}
      }
          at Error (<anonymous>)
          at doassert (src/mongo/shell/assert.js:11:14)
          at Function.assert.writeOK (src/mongo/shell/assert.js:363:13)
          at foo.js:14:10
      2014-03-25T18:10:15.154-0400 Error: write concern failed with errors: {
      	"nInserted" : 1,
      	"writeConcernError" : {
      		"code" : 64,
      		"errInfo" : {
      			"wtimeout" : true
      		},
      		"errmsg" : "waiting for replication timed out"
      	}
      } at src/mongo/shell/assert.js:13
      

            Assignee: backlog-server-repl ([DO NOT USE] Backlog - Replication Team)
            Reporter: rassi (J Rassi)
            Votes: 0
            Watchers: 4
