Core Server / SERVER-8225

slow replication and long write locks caused by capped collection without _id field and index, affects 2.2.2 but not 2.0.4

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Major - P3
    • Affects Version/s: 2.2.2
    • Component/s: None
    • Operating System: ALL

      Having a capped collection whose documents lack an _id field (and which has no _id index) causes a 2.2.2 secondary to replicate much more slowly than a 2.0.4 secondary.
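
      A quick way to check whether a collection is in this state from the mongo shell (a minimal sketch; "test.coll" is the collection name used in the repro below):

        use test
        db.coll.isCapped()    // true for a capped collection
        db.coll.getIndexes()  // in the problematic state there is no { _id: 1 } index in this list
        db.coll.findOne()     // documents inserted without an _id show no _id field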

      Symptoms are:
      • slow response time
      • long write locks (several seconds)
      • replication falls behind
      Steps to reproduce:
      1. start replica set with 3 nodes on 2.0.4
      2. create capped collection
      3. create index on {uuid:1}
      4. in a loop, insert docs that don't have an _id field:
        use test
        db.coll.drop()
        // capped collection, created without an _id index
        db.createCollection("coll", {capped: true, size: 100000});
        // unique index on uuid (index options go in the second argument)
        db.coll.ensureIndex({uuid: 1}, {unique: true});
        var i = 0;
        while (true) {
            // insert via the raw connection so the shell does not add an _id field
            db._mongo.insert("test.coll", {value: i, uuid: ObjectId()});
            i += 1;
        }
        
      5. while docs are inserted, kill one secondary, upgrade to 2.2.2, restart

      The upgraded secondary is very unresponsive:

      mongotop shows write locks held for a long time (> 5 seconds)

                                  ns       total        read       write		2013-01-18T05:51:47
                           test.coll      5560ms        70ms      5490ms
              local.replset.minvalid        15ms         0ms        15ms
                              admin.         0ms         0ms         0ms
                admin.system.indexes         0ms         0ms         0ms
                             config.         0ms         0ms         0ms
                     config.settings         0ms         0ms         0ms
      
                                  ns       total        read       write		2013-01-18T05:51:54
                           test.coll      6810ms        65ms      6745ms
              local.replset.minvalid        13ms         0ms        13ms
                              admin.         0ms         0ms         0ms
                admin.system.indexes         0ms         0ms         0ms
                             config.         0ms         0ms         0ms
                     config.settings         0ms         0ms         0ms
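
      For reference, the per-namespace totals that mongotop reports can also be pulled directly from the shell via the top command (a minimal sketch; the exact set of fields returned varies by server version):

        // per-namespace read/write lock time totals, in microseconds
        var totals = db.adminCommand({top: 1}).totals;
        printjson(totals["test.coll"].writeLock);  // { time: ..., count: ... }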
      

      serverStatus commands are very slow:

      Fri Jan 18 16:58:45 [conn4] serverStatus was very slow: { after basic: 0, middle of mem: 0, after mem: 0, after connections: 0, after extra info: 0, after counters: 0, after repl: 0, after asserts: 0, after dur: 1890, at end: 1890 }
      Fri Jan 18 16:58:45 [conn5] serverStatus was very slow: { after basic: 0, middle of mem: 0, after mem: 0, after connections: 0, after extra info: 0, after counters: 0, after repl: 0, after asserts: 0, after dur: 1850, at end: 1850 }
      Fri Jan 18 16:58:45 [conn4] command admin.$cmd command: { serverStatus: 1 } ntoreturn:1 keyUpdates:0 locks(micros) r:17 reslen:2791 2025ms
      Fri Jan 18 16:58:45 [conn5] command admin.$cmd command: { serverStatus: 1 } ntoreturn:1 keyUpdates:0 locks(micros) r:26 reslen:2791 1977ms
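
      The latency is easy to confirm by timing the command directly from the shell (a minimal sketch):

        // time a serverStatus call; multi-second results match the log lines above
        var start = new Date();
        db.adminCommand({serverStatus: 1});
        print("serverStatus took " + (new Date() - start) + " ms");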
      

      Large replication batches:

      Fri Jan 18 16:59:17 [rsSync] replication batch size is 5001
      

      The 2.2.2 secondary is falling more and more behind on replication:

      replset:PRIMARY> db.printSlaveReplicationInfo()
      source:   capslock.local:27018
      	 syncedTo: Fri Jan 18 2013 17:07:40 GMT+1100 (EST)
      		 = 81 secs ago (0.02hrs)
      source:   capslock.local:27019
      	 syncedTo: Fri Jan 18 2013 17:09:00 GMT+1100 (EST)
      		 = 1 secs ago (0hrs)
      replset:PRIMARY> rs.status()
      {
      	"set" : "replset",
      	"date" : ISODate("2013-01-18T06:09:09Z"),
      	"myState" : 1,
      	"members" : [
      		{
      			"_id" : 0,
      			"name" : "capslock.local:27017",
      			"health" : 1,
      			"state" : 1,
      			"stateStr" : "PRIMARY",
      			"optime" : Timestamp(1358489349000, 253),
      			"optimeDate" : ISODate("2013-01-18T06:09:09Z"),
      			"self" : true
      		},
      		{
      			"_id" : 1,
      			"name" : "capslock.local:27018",
      			"health" : 1,
      			"state" : 2,
      			"stateStr" : "SECONDARY",
      			"uptime" : 136,
      			"optime" : Timestamp(1358489263000, 1426),
      			"optimeDate" : ISODate("2013-01-18T06:07:43Z"),
      			"lastHeartbeat" : ISODate("2013-01-18T06:09:07Z"),
      			"pingMs" : 8
      		},
      		{
      			"_id" : 2,
      			"name" : "capslock.local:27019",
      			"health" : 1,
      			"state" : 2,
      			"stateStr" : "SECONDARY",
      			"uptime" : 3623,
      			"optime" : Timestamp(1358489348000, 971),
      			"optimeDate" : ISODate("2013-01-18T06:09:08Z"),
      			"lastHeartbeat" : ISODate("2013-01-18T06:09:08Z"),
      			"pingMs" : 5
      		}
      	],
      	"ok" : 1
      }
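
      The lag can also be computed directly from rs.status() on the primary (a minimal sketch based on the optimeDate fields shown above):

        // print each secondary's lag behind the primary, in seconds
        var status = rs.status();
        var primary = status.members.filter(function (m) { return m.state === 1; })[0];
        status.members.forEach(function (m) {
            if (m.state === 2) {
                print(m.name + " lags " + ((primary.optimeDate - m.optimeDate) / 1000) + " s behind " + primary.name);
            }
        });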
      

      Turning off prefetching with --replIndexPrefetch none did not make any difference.
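
      The prefetch setting was changed via the mongod startup option; it can presumably also be toggled at runtime on the secondary with setParameter (a sketch, assuming replIndexPrefetch is a settable parameter on this build):

        // disable index prefetching for replication on this secondary
        db.adminCommand({setParameter: 1, replIndexPrefetch: "none"})
        // valid values: "none", "_id_only", "all" (default)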

      Steps to fix:
      1. dump the capped collection
      2. kill all mongod processes
      3. upgrade to 2.2.2
      4. bring all (upgraded) nodes back online
      5. restore the capped collection with mongorestore --drop

      Since then, replication has been fine and both secondaries are keeping up with the primary under the same load.

      Logs are attached. Secondary "rs2" was the one that was upgraded first and showed the slow response.

            Assignee: Eric Milkie (milkie@mongodb.com)
            Reporter: Thomas Rueckstiess (thomas.rueckstiess@mongodb.com)
            Votes: 0
            Watchers: 7
