- Type: Bug
- Resolution: Won't Fix
- Priority: Major - P3
- Fix Version/s: None
- Affects Version/s: 2.2.2
- Component/s: None
- Labels: None
- Operating System: ALL
Having a capped collection whose documents lack an _id field (and that has no _id index) causes a 2.2.2 secondary to replicate much more slowly than a 2.0.4 secondary.
Symptoms are:
- slow response time
- long write locks (several seconds)
- replication falls behind
Steps to reproduce:
- start a replica set with 3 nodes on 2.0.4
- create a capped collection
- create a unique index on {uuid:1}
- in a loop, insert docs that have no _id (script below; a verification sketch follows these steps)

use test
db.coll.drop()
db.createCollection("coll", {capped: true, size: 100000})
db.coll.ensureIndex({uuid: 1}, {unique: true})  // index options go in the second argument
var i = 0;
while (true) {
    // insert through the raw connection so the shell does not add an _id client-side
    db._mongo.insert("test.coll", {value: i, uuid: ObjectId()});
    i += 1;
}
- while docs are being inserted, kill one secondary, upgrade it to 2.2.2, and restart it
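To confirm the setup matches the bug, a minimal check from the shell (a sketch; collection and field names are the ones from the repro script above):

use test
// only the {uuid: 1} index should be listed; there should be no _id_ index
db.coll.getIndexes().forEach(function (idx) { printjson(idx); })
// documents should come back without an _id field
db.coll.findOne()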
The upgraded secondary is very unresponsive:
mongotop shows write locks held for a long time (> 5 seconds):
                    ns    total   read   write           2013-01-18T05:51:47
             test.coll   5560ms   70ms   5490ms
local.replset.minvalid     15ms    0ms     15ms
                admin.      0ms    0ms      0ms
  admin.system.indexes      0ms    0ms      0ms
               config.      0ms    0ms      0ms
       config.settings      0ms    0ms      0ms

                    ns    total   read   write           2013-01-18T05:51:54
             test.coll   6810ms   65ms   6745ms
local.replset.minvalid     13ms    0ms     13ms
                admin.      0ms    0ms      0ms
  admin.system.indexes      0ms    0ms      0ms
               config.      0ms    0ms      0ms
       config.settings      0ms    0ms      0ms
serverStatus commands are very slow:
Fri Jan 18 16:58:45 [conn4] serverStatus was very slow: { after basic: 0, middle of mem: 0, after mem: 0, after connections: 0, after extra info: 0, after counters: 0, after repl: 0, after asserts: 0, after dur: 1890, at end: 1890 }
Fri Jan 18 16:58:45 [conn5] serverStatus was very slow: { after basic: 0, middle of mem: 0, after mem: 0, after connections: 0, after extra info: 0, after counters: 0, after repl: 0, after asserts: 0, after dur: 1850, at end: 1850 }
Fri Jan 18 16:58:45 [conn4] command admin.$cmd command: { serverStatus: 1 } ntoreturn:1 keyUpdates:0 locks(micros) r:17 reslen:2791 2025ms
Fri Jan 18 16:58:45 [conn5] command admin.$cmd command: { serverStatus: 1 } ntoreturn:1 keyUpdates:0 locks(micros) r:26 reslen:2791 1977ms
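The measurement is easy to repeat from the shell; a minimal sketch (any command works, serverStatus is just the one timed above):

// time one serverStatus round trip from the mongo shell
var start = new Date();
db.serverStatus();
print("serverStatus took " + (new Date() - start) + " ms");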
Large replication batches:
Fri Jan 18 16:59:17 [rsSync] replication batch size is 5001
The 2.2.2 secondary falls further and further behind on replication:
replset:PRIMARY> db.printSlaveReplicationInfo()
source:   capslock.local:27018
     syncedTo: Fri Jan 18 2013 17:07:40 GMT+1100 (EST)
         = 81 secs ago (0.02hrs)
source:   capslock.local:27019
     syncedTo: Fri Jan 18 2013 17:09:00 GMT+1100 (EST)
         = 1 secs ago (0hrs)
replset:PRIMARY> rs.status()
{
    "set" : "replset",
    "date" : ISODate("2013-01-18T06:09:09Z"),
    "myState" : 1,
    "members" : [
        {
            "_id" : 0,
            "name" : "capslock.local:27017",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "optime" : Timestamp(1358489349000, 253),
            "optimeDate" : ISODate("2013-01-18T06:09:09Z"),
            "self" : true
        },
        {
            "_id" : 1,
            "name" : "capslock.local:27018",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 136,
            "optime" : Timestamp(1358489263000, 1426),
            "optimeDate" : ISODate("2013-01-18T06:07:43Z"),
            "lastHeartbeat" : ISODate("2013-01-18T06:09:07Z"),
            "pingMs" : 8
        },
        {
            "_id" : 2,
            "name" : "capslock.local:27019",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 3623,
            "optime" : Timestamp(1358489348000, 971),
            "optimeDate" : ISODate("2013-01-18T06:09:08Z"),
            "lastHeartbeat" : ISODate("2013-01-18T06:09:08Z"),
            "pingMs" : 5
        }
    ],
    "ok" : 1
}
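The lag can also be computed directly from the optimeDate fields shown above; a minimal sketch, assuming it is run on a member that currently sees a primary:

var s = rs.status();
var primary = s.members.filter(function (m) { return m.stateStr === "PRIMARY"; })[0];
s.members.forEach(function (m) {
    // subtracting Dates yields milliseconds
    print(m.name + " lags the primary by " + (primary.optimeDate - m.optimeDate) / 1000 + "s");
});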
Turning off prefetching with --replIndexPrefetch none did not make any difference.
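For reference, the same setting can be inspected and toggled at runtime on a secondary (a minimal sketch; assumes a 2.2 mongod, where replIndexPrefetch is exposed as a server parameter):

// read the current value
db.adminCommand({ getParameter: 1, replIndexPrefetch: 1 })
// disable index prefetching without a restart
db.adminCommand({ setParameter: 1, replIndexPrefetch: "none" })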
Steps to fix:
- dump the capped collection
- kill all mongod processes
- upgrade to 2.2.2
- bring all (upgraded) nodes back online
- restore the capped collection with mongorestore --drop
Since then, replication has been fine and both secondaries keep up with the primary under the same load.
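The dump/restore is also a chance to prevent a recurrence. A minimal sketch of recreating the collection so that it carries an _id index and every document gets an _id (assumes the 2.2 createCollection autoIndexId option and the names from the repro script; this is an illustration, not the exact procedure used):

use test
// recreate the capped collection with an _id index
db.createCollection("coll", {capped: true, size: 100000, autoIndexId: true})
db.coll.ensureIndex({uuid: 1}, {unique: true})
// the plain shell helper adds an _id to every document client-side
db.coll.insert({value: 0, uuid: ObjectId()})
db.coll.getIndexes()  // should now list _id_ as well as uuid_1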
Logs are attached. Secondary "rs2" was the one that was upgraded first and showed the slow response.
Is related to:
- SERVER-4907 sparse indexes break replication of array based operators (Closed)
- SERVER-12071 updates from oplog must use _id or do a table scan (Closed)