Similar to SERVER-17140 but much more insidious: this issue prevented a downed replica from completing an initial sync twice in a row, and we had to resort to tcp_tw_reuse to get the replica back online.
Scenario: a replica set with, among other things, a large collection of long-lived documents (10 million objects / 190GB of data) and a small collection of highly transient documents (1,000 objects / 5MB of data).
During initial sync it takes about an hour for the bulk of the data to be replicated and the indices to be built for the big collection – during this time a large number of the documents in the small collection are updated and deleted. Once the sync completed, the log exploded into a fire-hose of:
I REPL [repl writer worker 12] replication failed to apply update: { ts: Timestamp 1432762108000|244, h: -8874966953786363575, v: 2, op: "u", ns: "lotus_tem
I REPL [repl writer worker 4] replication failed to apply update: { ts: Timestamp 1432762108000|232, h: -2423412449985218813, v: 2, op: "u", ns: "lotus_temp
I REPL [repl writer worker 10] replication failed to apply update: { ts: Timestamp 1432762108000|259, h: 6507151402806398582, v: 2, op: "u", ns: "lotus_temp
I REPL [repl writer worker 13] replication failed to apply update: { ts: Timestamp 1432762108000|251, h: 8927018345590464930, v: 2, op: "u", ns: "lotus_temp
I REPL [repl writer worker 8] replication failed to apply update: { ts: Timestamp 1432762108000|242, h: 7518410875297535456, v: 2, op: "u", ns: "lotus_temp.
I REPL [repl writer worker 12] replication info adding missing object
I REPL [repl writer worker 12] replication missing object not found on source. presumably deleted later in oplog
I REPL [repl writer worker 12] replication o2: { _id: ObjectId('5566368006c56f080dfc5033') }
I REPL [repl writer worker 12] replication o firstfield: $set
I REPL [repl writer worker 4] replication info adding missing object
I REPL [repl writer worker 4] replication missing object not found on source. presumably deleted later in oplog
I REPL [repl writer worker 4] replication o2: { _id: ObjectId('556636ec06c56f330efc4fc1') }
I REPL [repl writer worker 4] replication o firstfield: $set
I REPL [repl writer worker 10] replication info adding missing object
I REPL [repl writer worker 10] replication missing object not found on source. presumably deleted later in oplog
I REPL [repl writer worker 10] replication o2: { _id: ObjectId('556636ca07c56f642fa8f0a5') }
I REPL [repl writer worker 10] replication o firstfield: $set
I REPL [repl writer worker 13] replication info adding missing object
I REPL [repl writer worker 13] replication missing object not found on source. presumably deleted later in oplog
I REPL [repl writer worker 13] replication o2: { _id: ObjectId('5566366c06c56f340dfc4fdc') }
I REPL [repl writer worker 13] replication o firstfield: $set
I REPL [repl writer worker 8] replication info adding missing object
I REPL [repl writer worker 8] replication missing object not found on source. presumably deleted later in oplog
I REPL [repl writer worker 8] replication o2: { _id: ObjectId('556634a508c56ffa267b23cd') }
I REPL [repl writer worker 8] replication o firstfield: $set
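To put a number on the flood, a couple of greps along these lines do the job (the log path is illustrative, not our actual layout):

# count the failed oplog updates and the resulting missing-document fetches on the new replica
grep -c 'replication failed to apply update' /var/log/mongodb/mongod.log
grep -c 'adding missing object' /var/log/mongodb/mongod.log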
This goes on for about 140,000 lines of log in about a minute, during which time the new replica seems to be making a whole lot of short-lived connections; we can see this happening in the log of the server it's syncing from:
2015-05-27T19:18:43.379-0400 I NETWORK [conn100064] end connection x.x.x.x:35686 (1114 connections now open)
2015-05-27T19:18:43.379-0400 I NETWORK [initandlisten] connection accepted from x.x.x.x:35687 #100065 (1115 connections now open)
2015-05-27T19:18:43.380-0400 I NETWORK [conn100065] end connection x.x.x.x:35687 (1114 connections now open)
2015-05-27T19:18:43.380-0400 I NETWORK [initandlisten] connection accepted from x.x.x.x:35688 #100066 (1115 connections now open)
2015-05-27T19:18:43.380-0400 I NETWORK [conn100066] end connection x.x.x.x:35688 (1114 connections now open)
2015-05-27T19:18:43.381-0400 I NETWORK [initandlisten] connection accepted from x.x.x.x:35689 #100067 (1115 connections now open)
2015-05-27T19:18:43.381-0400 I NETWORK [conn100067] end connection x.x.x.x:35689 (1114 connections now open)
2015-05-27T19:18:43.382-0400 I NETWORK [initandlisten] connection accepted from x.x.x.x:35690 #100068 (1115 connections now open)
2015-05-27T19:18:43.382-0400 I NETWORK [conn100068] end connection x.x.x.x:35690 (1114 connections now open)
2015-05-27T19:18:43.383-0400 I NETWORK [initandlisten] connection accepted from x.x.x.x:35691 #100069 (1115 connections now open)
2015-05-27T19:18:43.383-0400 I NETWORK [conn100069] end connection x.x.x.x:35691 (1114 connections now open)
2015-05-27T19:18:43.384-0400 I NETWORK [initandlisten] connection accepted from x.x.x.x:35692 #100070 (1115 connections now open)
2015-05-27T19:18:43.384-0400 I NETWORK [conn100070] end connection x.x.x.x:35692 (1114 connections now open)
Note the freakishly short lifetimes! This of course isn't healthy for the primary, but it gets worse than that: on the new replica, while this firehose was blasting the primary, I ran a quick check of the TIME_WAIT reservations:
# netstat -an | grep TIME_WAIT | wc -l
25246
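For context, and assuming this box was still on stock Linux defaults (we had not tuned anything at that point), that count is most of the usable ephemeral port range:

# outbound connections draw their source port from this range; a stock default is ~28k ports
cat /proc/sys/net/ipv4/ip_local_port_range    # e.g. "32768 61000"
# a closed connection parks its local port in TIME_WAIT for 60 seconds (TCP_TIMEWAIT_LEN is a
# compile-time constant), so a sustained ~470 short-lived connections per second is enough to
# pin the entire range; with ~25,000 sockets in TIME_WAIT almost no local ports are left free
watch -n 1 'netstat -an | grep -c TIME_WAIT'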
This means we’re exhausting network ports at a furious rate and of course we end up choking:
I REPL [repl writer worker 2] replication info adding missing object
W NETWORK [repl writer worker 2] Failed to connect to x.x.x.x:27017, reason: errno:99 Cannot assign requested address
I REPL [repl writer worker 2] repl: couldn't connect to server war-mongo2:27017 (x.x.x.x), connection attempt failed
W REPL [repl writer worker 2] network problem detected while connecting to the sync source, attempt 1 of 3
W NETWORK [repl writer worker 2] Failed to connect to x.x.x.x:27017, reason: errno:99 Cannot assign requested address
I REPL [repl writer worker 2] repl: couldn't connect to server war-mongo2:27017 (x.x.x.x), connection attempt failed
W REPL [repl writer worker 2] network problem detected while connecting to the sync source, attempt 2 of 3
W NETWORK [repl writer worker 2] Failed to connect to x.x.x.x:27017, reason: errno:99 Cannot assign requested address
I REPL [repl writer worker 2] repl: couldn't connect to server war-mongo2:27017 (x.x.x.x), connection attempt failed
W REPL [repl writer worker 2] network problem detected while connecting to the sync source, attempt 3 of 3
I - [repl writer worker 2] Assertion: 15916:Can no longer connect to initial sync source: war-mongo2:27017
I CONTROL [repl writer worker 2]
0xf6d639 0xf0d371 0xef25cf 0xef267c 0xcaced2 0xcac176 0xcb2b5c 0xf0418b 0xfbab94 0x7f0eb110ce9a 0x7f0eafbe338d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"B6D639"},{"b":"400000","o":"B0D371"},{"b":"400000","o":"AF25CF"},{"b":"400000","o":"AF267C"},{"b":"400000","o":"8ACED2"},{"b":"400000","o":"8AC176"},{"b":"4
mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf6d639]
mongod(_ZN5mongo10logContextEPKc+0xE1) [0xf0d371]
mongod(_ZN5mongo11msgassertedEiPKc+0xAF) [0xef25cf]
mongod(+0xAF267C) [0xef267c]
mongod(_ZN5mongo4repl4Sync13getMissingDocEPNS_16OperationContextEPNS_8DatabaseERKNS_7BSONObjE+0x622) [0xcaced2]
mongod(_ZN5mongo4repl4Sync11shouldRetryEPNS_16OperationContextERKNS_7BSONObjE+0x196) [0xcac176]
mongod(_ZN5mongo4repl21multiInitialSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x8C) [0xcb2b5c]
mongod(_ZN5mongo10threadpool6Worker4loopERKSs+0x2FB) [0xf0418b]
mongod(+0xBBAB94) [0xfbab94]
libpthread.so.0(+0x7E9A) [0x7f0eb110ce9a]
libc.so.6(clone+0x6D) [0x7f0eafbe338d]
----- END BACKTRACE -----
2015-05-27T19:18:56.663-0400 E REPL [repl writer worker 2] writer worker caught exception: :: caused by :: 15916 Can no longer connect to initial sync source: war-mongo2:27017 on: { ts
2015-05-27T19:18:56.663-0400 I - [repl writer worker 2] Fatal Assertion 16361
2015-05-27T19:18:56.663-0400 I - [repl writer worker 2]
***aborting after fassert() failure
We had this happen twice before we started messing with kernel parameters (sketched below), which ultimately got us up and running.
Since we also rebuilt larger "mostly write, no delete" replica sets the same day and never saw this, it seems likely that our highly transient collection is what drives this sort of connection abuse.
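For completeness, the kernel-side workaround boiled down to something like the following; tcp_tw_reuse is the knob we actually leaned on, the port-range widening is the obvious companion, and the exact values shown here are illustrative rather than a record of our configuration:

# let the kernel reuse sockets sitting in TIME_WAIT for new outbound connections
sysctl -w net.ipv4.tcp_tw_reuse=1
# optionally widen the ephemeral port range for extra headroom
sysctl -w net.ipv4.ip_local_port_range="15000 61000"
# persist both in /etc/sysctl.conf so they survive a reboot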