Details
- Type: Bug
- Resolution: Done
- Priority: Major - P3
- Component/s: Replication
- Affects Version/s: ALL
Description
In a master-slave configuration, slaves (fatally) halt replication if the oplog getmore returns an $err response. Instead, slaves should retry the oplog query.
The master will return an $err response to an oplog getmore under a number of conditions, most notably if a closeAllDatabases command was just issued (this can be evidenced during dropdb_race.js test failures under the small_oplog passthrough).
This issue does not affect replica set secondaries (secondaries correctly retry the oplog query when this condition occurs).
Reproduce with the following script, which forces an $err response with a killOp (failure observed on line 14, after replication halts).
var rt = new ReplTest("foo");
var m = rt.start(true);
var s = rt.start(false);
var db = m.getDB("test");

for (var i = 0; i < 5; i++) {
    for (var j = 0; j < 1000; j++) {
        db.foo.insert({});
    }
    var x = db.currentOp({ns: 'local.oplog.$main', op: 'getmore'}).inprog;
    if (x.length) {
        db.killOp(x[0].opid);
    }
    assert.writeOK(db.foo.insert({}, {writeConcern: {w: 2, wtimeout: 20000}}));
}
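The behavior this ticket asks for is that the slave treat an $err getmore response as transient and re-issue the oplog query, as replica set secondaries already do, rather than halting replication. A minimal sketch of that retry loop, in shell-style JavaScript (the function name and structure are hypothetical illustrations, not the server's actual C++ implementation):

```javascript
// Hypothetical sketch of retry-on-$err behavior. The real fix belongs in the
// server's replication code; the names and shape here are illustrative only.
function queryOplogWithRetry(runGetMore, maxRetries) {
    var lastErr = null;
    for (var attempt = 0; attempt < maxRetries; attempt++) {
        var res = runGetMore();
        // A normal oplog batch carries no $err field; hand it to the applier.
        if (res && !res.hasOwnProperty("$err")) {
            return res;
        }
        // Transient $err (e.g. a killOp or closeAllDatabases on the master):
        // re-issue the oplog query instead of treating the error as fatal.
        lastErr = res;
    }
    throw new Error("oplog query still failing after " + maxRetries +
                    " attempts: " + (lastErr && lastErr["$err"]));
}
```

In the repro above, the killOp produces exactly one such transient $err ({ $err: "operation was interrupted", code: 11601 }); a retrying slave would re-query and catch back up well within the 20-second wtimeout.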
Log output for the above script, showing the failure:
m31001| 2014-03-25T18:09:48.560-0400 [replslave] resync: cloning database test to get an initial copy
m31000| 2014-03-25T18:09:48.560-0400 [initandlisten] connection accepted from 127.0.0.1:57289 #3 (3 connections now open)
m31001| 2014-03-25T18:09:48.560-0400 [FileAllocator] allocating new datafile /data/db/foo-slave/test.ns, filling with zeroes...
m31001| 2014-03-25T18:09:48.627-0400 [FileAllocator] done allocating datafile /data/db/foo-slave/test.ns, size: 16MB, took 0.066 secs
m31001| 2014-03-25T18:09:48.652-0400 [FileAllocator] allocating new datafile /data/db/foo-slave/test.0, filling with zeroes...
m31001| 2014-03-25T18:09:48.713-0400 [FileAllocator] done allocating datafile /data/db/foo-slave/test.0, size: 16MB, took 0.06 secs
m31001| 2014-03-25T18:09:48.740-0400 [replslave] build index on: test.foo properties: { v: 1, key: { _id: 1 }, name: "_id_", ns: "test.foo" }
m31001| 2014-03-25T18:09:48.741-0400 [replslave] building index using bulk method
m31001| 2014-03-25T18:09:48.741-0400 [replslave] build index done. scanned 114 total records. 0.001 secs
m31001| 2014-03-25T18:09:48.741-0400 [replslave] resync: done with initial clone for db: test
m31000| 2014-03-25T18:09:48.741-0400 [conn3] end connection 127.0.0.1:57289 (2 connections now open)
m31000| 2014-03-25T18:09:49.746-0400 [slaveTracking] build index on: local.slaves properties: { v: 1, key: { _id: 1 }, name: "_id_", ns: "local.slaves" }
m31000| 2014-03-25T18:09:49.747-0400 [slaveTracking] added index to empty collection
m31000| 2014-03-25T18:09:55.100-0400 [conn1] going to kill op: op: 3916.0
m31000| 2014-03-25T18:09:55.105-0400 [conn2] getmore local.oplog.$main cursorid:19256671971 ntoreturn:0 keyUpdates:0 exception: operation was interrupted code:11601 numYields:0 locks(micros) r:81 nreturned:1 reslen:71 20ms
m31001| 2014-03-25T18:09:55.105-0400 [replslave] sync error: problem querying remote oplog record
m31001| 2014-03-25T18:09:55.105-0400 [replslave] op: { $err: "operation was interrupted", code: 11601 }
m31001| 2014-03-25T18:09:55.105-0400 [replslave] halting replication
m31001| 2014-03-25T18:09:55.105-0400 [replslave] caught SyncException
m31001| 2014-03-25T18:09:55.105-0400 [replslave] repl: sleep 10 sec before next pass
m31001| 2014-03-25T18:10:05.107-0400 [replslave] all sources dead: sync error: no ts found querying remote oplog record, sleeping for 5 seconds
m31000| 2014-03-25T18:10:05.107-0400 [conn2] killcursors: found 0 of 1
m31000| 2014-03-25T18:10:05.107-0400 [conn2] end connection 127.0.0.1:57288 (1 connection now open)
m31001| 2014-03-25T18:10:10.108-0400 [replslave] all sources dead: sync error: no ts found querying remote oplog record, sleeping for 5 seconds
m31000| 2014-03-25T18:10:15.106-0400 [conn1] command test.$cmd command: insert { insert: "foo", documents: [ { _id: ObjectId('5331feb3cd9f37898e100a9f') } ], ordered: true, writeConcern: { w: 2.0, wtimeout: 20000.0 } } keyUpdates:0 numYields:0 locks(micros) w:3 reslen:161 20001ms
m31001| 2014-03-25T18:10:15.108-0400 [replslave] all sources dead: sync error: no ts found querying remote oplog record, sleeping for 5 seconds
|
assert: write concern failed with errors: {
    "nInserted" : 1,
    "writeConcernError" : {
        "code" : 64,
        "errInfo" : {
            "wtimeout" : true
        },
        "errmsg" : "waiting for replication timed out"
    }
}
Error: write concern failed with errors: {
    "nInserted" : 1,
    "writeConcernError" : {
        "code" : 64,
        "errInfo" : {
            "wtimeout" : true
        },
        "errmsg" : "waiting for replication timed out"
    }
}
    at Error (<anonymous>)
    at doassert (src/mongo/shell/assert.js:11:14)
    at Function.assert.writeOK (src/mongo/shell/assert.js:363:13)
    at foo.js:14:10
2014-03-25T18:10:15.154-0400 Error: write concern failed with errors: {
    "nInserted" : 1,
    "writeConcernError" : {
        "code" : 64,
        "errInfo" : {
            "wtimeout" : true
        },
        "errmsg" : "waiting for replication timed out"
    }
} at src/mongo/shell/assert.js:13
|
Attachments
Issue Links
- related to: SERVER-13435 dropdb_race.js in small_oplog (causes replication to halt on slave) (Closed)