Having a capped collection whose documents have no _id field (and which has no _id index) causes a 2.2.2 secondary to replicate much more slowly than a 2.0.4 secondary.
Symptoms are:
- slow response time
- long write locks (several seconds)
- replication falls behind
Steps to reproduce:
- start replica set with 3 nodes on 2.0.4
- create capped collection
- create a unique index on {uuid:1}
- in a loop, insert docs that don't have an _id:
use test
db.coll.drop()
db.createCollection("coll", {capped: true, size: 100000});
// note: the unique option belongs in the second argument;
// ensureIndex({uuid: 1, unique: true}) would instead create a compound index
db.coll.ensureIndex({uuid: 1}, {unique: true});
var i = 0;
while (true) {
    // the low-level insert bypasses the shell's automatic _id generation,
    // so the documents are stored without an _id field
    db._mongo.insert("test.coll", {value: i, uuid: ObjectId()});
    i += 1;
}
- while docs are being inserted, kill one secondary, upgrade it to 2.2.2, and restart it
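To confirm that the collection really has no _id index (a quick sanity check, not part of the original report; the expected output is sketched in the comment), list its indexes:
db.coll.getIndexes()
// should show a single entry for the {uuid: 1} index and no "_id_" index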
The upgraded secondary is very unresponsive:
mongotop shows write locks held for long periods (> 5 seconds):
                    ns    total    read    write    2013-01-18T05:51:47
             test.coll   5560ms    70ms   5490ms
local.replset.minvalid     15ms     0ms     15ms
                admin.      0ms     0ms      0ms
  admin.system.indexes      0ms     0ms      0ms
               config.      0ms     0ms      0ms
       config.settings      0ms     0ms      0ms

                    ns    total    read    write    2013-01-18T05:51:54
             test.coll   6810ms    65ms   6745ms
local.replset.minvalid     13ms     0ms     13ms
                admin.      0ms     0ms      0ms
  admin.system.indexes      0ms     0ms      0ms
               config.      0ms     0ms      0ms
       config.settings      0ms     0ms      0ms
serverStatus commands are very slow:
Fri Jan 18 16:58:45 [conn4] serverStatus was very slow: { after basic: 0, middle of mem: 0, after mem: 0, after connections: 0, after extra info: 0, after counters: 0, after repl: 0, after asserts: 0, after dur: 1890, at end: 1890 }
Fri Jan 18 16:58:45 [conn5] serverStatus was very slow: { after basic: 0, middle of mem: 0, after mem: 0, after connections: 0, after extra info: 0, after counters: 0, after repl: 0, after asserts: 0, after dur: 1850, at end: 1850 }
Fri Jan 18 16:58:45 [conn4] command admin.$cmd command: { serverStatus: 1 } ntoreturn:1 keyUpdates:0 locks(micros) r:17 reslen:2791 2025ms
Fri Jan 18 16:58:45 [conn5] command admin.$cmd command: { serverStatus: 1 } ntoreturn:1 keyUpdates:0 locks(micros) r:26 reslen:2791 1977ms
Large replication batches:
Fri Jan 18 16:59:17 [rsSync] replication batch size is 5001
The 2.2.2 secondary is falling more and more behind on replication:
replset:PRIMARY> db.printSlaveReplicationInfo()
source: capslock.local:27018
    syncedTo: Fri Jan 18 2013 17:07:40 GMT+1100 (EST)
        = 81 secs ago (0.02hrs)
source: capslock.local:27019
    syncedTo: Fri Jan 18 2013 17:09:00 GMT+1100 (EST)
        = 1 secs ago (0hrs)
replset:PRIMARY> rs.status()
{
    "set" : "replset",
    "date" : ISODate("2013-01-18T06:09:09Z"),
    "myState" : 1,
    "members" : [
        {
            "_id" : 0,
            "name" : "capslock.local:27017",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "optime" : Timestamp(1358489349000, 253),
            "optimeDate" : ISODate("2013-01-18T06:09:09Z"),
            "self" : true
        },
        {
            "_id" : 1,
            "name" : "capslock.local:27018",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 136,
            "optime" : Timestamp(1358489263000, 1426),
            "optimeDate" : ISODate("2013-01-18T06:07:43Z"),
            "lastHeartbeat" : ISODate("2013-01-18T06:09:07Z"),
            "pingMs" : 8
        },
        {
            "_id" : 2,
            "name" : "capslock.local:27019",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 3623,
            "optime" : Timestamp(1358489348000, 971),
            "optimeDate" : ISODate("2013-01-18T06:09:08Z"),
            "lastHeartbeat" : ISODate("2013-01-18T06:09:08Z"),
            "pingMs" : 5
        }
    ],
    "ok" : 1
}
Turning off prefetching with --replIndexPrefetch none did not make any difference.
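For reference, turning prefetching off amounts to restarting the secondary with this option (port and dbpath here are illustrative, not taken from the report):
mongod --replSet replset --port 27018 --dbpath /data/rs2 --replIndexPrefetch none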
Steps to fix:
- dump the capped collection
- kill all mongod processes
- upgrade to 2.2.2
- bring all (upgraded) nodes back online
- restore the capped collection with --drop (example commands below)
Since then, replication has been fine and both secondaries keep up with the primary under the same load.
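For completeness, the dump and restore steps correspond roughly to the following (host, port, and paths are illustrative, not from the original report):
mongodump --host capslock.local --port 27017 --db test --collection coll --out /backup/dump
mongorestore --host capslock.local --port 27017 --drop --db test --collection coll /backup/dump/test/coll.bson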
Logs are attached. Secondary "rs2" was the one that was upgraded first and showed the slow response.