Core Server / SERVER-8225

slow replication and long write locks caused by capped collection without _id field and index, affects 2.2.2 but not 2.0.4

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Major - P3
    • Fix Version/s: None
    • Affects Version/s: 2.2.2
    • Component/s: None
    • Labels: None
    • Operating System: ALL

      Having a capped collection without an _id field (and without an _id index) causes a 2.2.2 secondary to replicate much more slowly than a 2.0.4 secondary.
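
      A quick way to confirm that state from the mongo shell (a sketch; the collection name matches the repro script below) is to check the collection's options, indexes, and a sample document:

        use test
        // should return true for the capped collection
        db.coll.isCapped()
        // should list only the {uuid: 1} index; there is no {_id: 1} entry
        db.coll.getIndexes()
        // documents inserted via the raw connection carry no _id field
        db.coll.findOne()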

      Symptoms are:
      • slow response time
      • long write locks (several seconds)
      • replication falls behind

      Steps to reproduce:
      1. start a replica set with 3 nodes on 2.0.4
      2. create a capped collection
      3. create a unique index on {uuid:1}
      4. in a loop, insert docs that have no _id field (script below)
        use test
        db.coll.drop()
        // on 2.0.4 this creates the capped collection without an _id index
        db.createCollection("coll", {capped: true, size: 100000});
        // the unique option belongs in ensureIndex's second argument
        db.coll.ensureIndex({uuid: 1}, {unique: true})
        var i = 0;
        while (true) {
            // insert through the raw connection so the shell helper does not add an _id field
            db._mongo.insert("test.coll", {value: i, uuid: ObjectId()});
            i += 1;
        }
        
      5. while docs are inserted, kill one secondary, upgrade to 2.2.2, restart

      The upgraded secondary is very unresponsive:

      mongotop shows write locks held for a long time (> 5 seconds):

                                  ns       total        read       write		2013-01-18T05:51:47
                           test.coll      5560ms        70ms      5490ms
              local.replset.minvalid        15ms         0ms        15ms
                              admin.         0ms         0ms         0ms
                admin.system.indexes         0ms         0ms         0ms
                             config.         0ms         0ms         0ms
                     config.settings         0ms         0ms         0ms
      
                                  ns       total        read       write		2013-01-18T05:51:54
                           test.coll      6810ms        65ms      6745ms
              local.replset.minvalid        13ms         0ms        13ms
                              admin.         0ms         0ms         0ms
                admin.system.indexes         0ms         0ms         0ms
                             config.         0ms         0ms         0ms
                     config.settings         0ms         0ms         0ms
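
      For reference, output like the above can be collected by pointing mongotop at the affected secondary; the host and polling interval here are assumptions:

        mongotop --host capslock.local:27018 5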
      

      serverStatus commands are very slow:

      Fri Jan 18 16:58:45 [conn4] serverStatus was very slow: { after basic: 0, middle of mem: 0, after mem: 0, after connections: 0, after extra info: 0, after counters: 0, after repl: 0, after asserts: 0, after dur: 1890, at end: 1890 }
      Fri Jan 18 16:58:45 [conn5] serverStatus was very slow: { after basic: 0, middle of mem: 0, after mem: 0, after connections: 0, after extra info: 0, after counters: 0, after repl: 0, after asserts: 0, after dur: 1850, at end: 1850 }
      Fri Jan 18 16:58:45 [conn4] command admin.$cmd command: { serverStatus: 1 } ntoreturn:1 keyUpdates:0 locks(micros) r:17 reslen:2791 2025ms
      Fri Jan 18 16:58:45 [conn5] command admin.$cmd command: { serverStatus: 1 } ntoreturn:1 keyUpdates:0 locks(micros) r:26 reslen:2791 1977ms
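
      The jump at "after dur" indicates the time is spent in the journaling ("dur") section of the report. A hedged way to reproduce the measurement from the shell:

        // time a serverStatus call and print the dur section, mirroring the slow-command log lines above
        var t0 = new Date();
        var status = db.serverStatus();
        print("serverStatus took " + (new Date() - t0) + "ms");
        printjson(status.dur);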
      

      Large replication batches:

      Fri Jan 18 16:59:17 [rsSync] replication batch size is 5001
      

      The 2.2.2 secondary is falling more and more behind on replication:

      replset:PRIMARY> db.printSlaveReplicationInfo()
      source:   capslock.local:27018
      	 syncedTo: Fri Jan 18 2013 17:07:40 GMT+1100 (EST)
      		 = 81 secs ago (0.02hrs)
      source:   capslock.local:27019
      	 syncedTo: Fri Jan 18 2013 17:09:00 GMT+1100 (EST)
      		 = 1 secs ago (0hrs)
      replset:PRIMARY> rs.status()
      {
      	"set" : "replset",
      	"date" : ISODate("2013-01-18T06:09:09Z"),
      	"myState" : 1,
      	"members" : [
      		{
      			"_id" : 0,
      			"name" : "capslock.local:27017",
      			"health" : 1,
      			"state" : 1,
      			"stateStr" : "PRIMARY",
      			"optime" : Timestamp(1358489349000, 253),
      			"optimeDate" : ISODate("2013-01-18T06:09:09Z"),
      			"self" : true
      		},
      		{
      			"_id" : 1,
      			"name" : "capslock.local:27018",
      			"health" : 1,
      			"state" : 2,
      			"stateStr" : "SECONDARY",
      			"uptime" : 136,
      			"optime" : Timestamp(1358489263000, 1426),
      			"optimeDate" : ISODate("2013-01-18T06:07:43Z"),
      			"lastHeartbeat" : ISODate("2013-01-18T06:09:07Z"),
      			"pingMs" : 8
      		},
      		{
      			"_id" : 2,
      			"name" : "capslock.local:27019",
      			"health" : 1,
      			"state" : 2,
      			"stateStr" : "SECONDARY",
      			"uptime" : 3623,
      			"optime" : Timestamp(1358489348000, 971),
      			"optimeDate" : ISODate("2013-01-18T06:09:08Z"),
      			"lastHeartbeat" : ISODate("2013-01-18T06:09:08Z"),
      			"pingMs" : 5
      		}
      	],
      	"ok" : 1
      }
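
      For reference, the lag figures above follow directly from the optimeDate fields; a minimal sketch, run on the primary:

        // compute per-secondary replication lag in seconds from rs.status()
        var s = rs.status();
        var primary = s.members.filter(function (m) { return m.state === 1; })[0];
        s.members.forEach(function (m) {
            if (m.state === 2) {
                print(m.name + " lags the primary by " + (primary.optimeDate - m.optimeDate) / 1000 + "s");
            }
        });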
      

      Turning off prefetching with --replIndexPrefetch none did not make any difference.
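
      --replIndexPrefetch is a mongod startup option; a sketch of how it would be applied to the upgraded secondary (replica set name and port are taken from the output above, the dbpath is an assumption):

        mongod --replSet replset --port 27018 --dbpath /data/rs2 --replIndexPrefetch none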

      Steps to fix:
      1. dump the capped collection
      2. kill all mongod processes
      3. upgrade to 2.2.2
      4. bring all (upgraded) nodes back online
      5. restore the capped collection with --drop (see the sketch below)
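
      A sketch of steps 1 and 5 using mongodump/mongorestore; the host, dump path, and namespace are taken from the repro above or assumed:

        # step 1: dump the capped collection from the primary
        mongodump --host capslock.local:27017 --db test --collection coll --out /tmp/capped-dump

        # step 5: after the upgrade, restore it, dropping the existing collection first
        mongorestore --host capslock.local:27017 --db test --collection coll --drop /tmp/capped-dump/test/coll.bson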

      Since then, replication has been fine and both secondaries are keeping up with the primary under the same load.

      Logs are attached. Secondary "rs2" was the one that was upgraded first and showed the slow response.

            Assignee: Eric Milkie (milkie@mongodb.com)
            Reporter: Thomas Rueckstiess (thomas.rueckstiess@mongodb.com)
            Votes: 0
            Watchers: 7
