[SERVER-18721] Initial-sync can exhaust available ports with rapid short-lived connections Created: 28/May/15  Updated: 07/Apr/23  Resolved: 08/Jun/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.0.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Glen Miner Assignee: Alan Zheng
Resolution: Done Votes: 3
Labels: PM248
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux 3.13.0-53-generic #89~precise1-Ubuntu SMP Wed May 20 17:42:16 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

32-core / 128GB RAM, all SSD, monster servers.


Issue Links:
Duplicate
is duplicated by SERVER-19176 Mongo instance dies while creating a ... Closed
Related
related to SERVER-42078 mongodb is unable to replay the oplog... Closed
Operating System: Linux
Steps To Reproduce:

1. Have a large collection that will take about an hour to sync
2. Have a script create, update, and delete objects in another collection at a high rate (10/second, I'd guess); a rough sketch of such a churn script is shown after this list
3. Add a replica to the set
4. Once the indices are built, watch for connection spam on the primary and for TIME_WAIT sockets on the replica; if the churn runs fast enough you can exhaust the available ephemeral ports.
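
A minimal sketch of the step-2 churn workload, assuming pymongo and hypothetical host/database/collection names; adjust the names, document shape, and rate to match your own transient collection:

import time
from pymongo import MongoClient

client = MongoClient("mongodb://primary-host:27017")  # hypothetical host
coll = client["lotus_temp_db"]["transient"]           # hypothetical names

while True:
    # one create/update/delete cycle per iteration, roughly 10 per second
    _id = coll.insert_one({"state": "new", "ts": time.time()}).inserted_id
    coll.update_one({"_id": _id}, {"$set": {"state": "done"}})
    coll.delete_one({"_id": _id})
    time.sleep(0.1)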

Our workaround:

sudo vi /etc/sysctl.conf
net.ipv4.tcp_tw_reuse = 1
net.ipv4.ip_local_port_range = 10000 61000
sudo sysctl -p
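
To watch whether the workaround is holding, the TIME_WAIT count on the syncing replica can be sampled periodically. The following is an illustrative Python sketch, equivalent to "netstat -an | grep TIME_WAIT | wc -l"; it reads /proc/net/tcp directly (state code 06 is TIME_WAIT on Linux):

TIME_WAIT = "06"  # Linux kernel state code for TIME_WAIT

def count_time_wait():
    total = 0
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        try:
            with open(path) as f:
                next(f)  # skip the header line
                total += sum(1 for line in f if line.split()[3] == TIME_WAIT)
        except FileNotFoundError:
            pass
    return total

if __name__ == "__main__":
    print(count_time_wait())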


 Description   

Similar to SERVER-17140 but much more insidious; this issue prevented a downed replica from completing an initial sync twice in a row, and we had to resort to tcp_tw_reuse to get the replica back online.

Scenario: a replica set with, among other things, a large collection of long-term documents (10 million objects / 190GB of data) and a small collection of highly transient documents (1,000 objects / 5MB of data).

During initial sync it takes about an hour for the bulk of the data to be replicated and the indices to be built for the big collection; during this time a large number of the documents in the small collection are updated and deleted. Once the sync completed, the log exploded into a fire-hose of:

I REPL     [repl writer worker 12] replication failed to apply update: { ts: Timestamp 1432762108000|244, h: -8874966953786363575, v: 2, op: "u", ns: "lotus_tem
I REPL     [repl writer worker 4] replication failed to apply update: { ts: Timestamp 1432762108000|232, h: -2423412449985218813, v: 2, op: "u", ns: "lotus_temp
I REPL     [repl writer worker 10] replication failed to apply update: { ts: Timestamp 1432762108000|259, h: 6507151402806398582, v: 2, op: "u", ns: "lotus_temp
I REPL     [repl writer worker 13] replication failed to apply update: { ts: Timestamp 1432762108000|251, h: 8927018345590464930, v: 2, op: "u", ns: "lotus_temp
I REPL     [repl writer worker 8] replication failed to apply update: { ts: Timestamp 1432762108000|242, h: 7518410875297535456, v: 2, op: "u", ns: "lotus_temp.
I REPL     [repl writer worker 12] replication info adding missing object
I REPL     [repl writer worker 12] replication missing object not found on source. presumably deleted later in oplog
I REPL     [repl writer worker 12] replication o2: { _id: ObjectId('5566368006c56f080dfc5033') }
I REPL     [repl writer worker 12] replication o firstfield: $set
I REPL     [repl writer worker 4] replication info adding missing object
I REPL     [repl writer worker 4] replication missing object not found on source. presumably deleted later in oplog
I REPL     [repl writer worker 4] replication o2: { _id: ObjectId('556636ec06c56f330efc4fc1') }
I REPL     [repl writer worker 4] replication o firstfield: $set
I REPL     [repl writer worker 10] replication info adding missing object
I REPL     [repl writer worker 10] replication missing object not found on source. presumably deleted later in oplog
I REPL     [repl writer worker 10] replication o2: { _id: ObjectId('556636ca07c56f642fa8f0a5') }
I REPL     [repl writer worker 10] replication o firstfield: $set
I REPL     [repl writer worker 13] replication info adding missing object
I REPL     [repl writer worker 13] replication missing object not found on source. presumably deleted later in oplog
I REPL     [repl writer worker 13] replication o2: { _id: ObjectId('5566366c06c56f340dfc4fdc') }
I REPL     [repl writer worker 13] replication o firstfield: $set
I REPL     [repl writer worker 8] replication info adding missing object
I REPL     [repl writer worker 8] replication missing object not found on source. presumably deleted later in oplog
I REPL     [repl writer worker 8] replication o2: { _id: ObjectId('556634a508c56ffa267b23cd') }
I REPL     [repl writer worker 8] replication o firstfield: $set

This went on for about 140,000 lines of log in about a minute, during which time the syncing node was making a whole lot of short-lived connections; we can see this happening in the log of the server it is syncing from:

2015-05-27T19:18:43.379-0400 I NETWORK  [conn100064] end connection x.x.x.x:35686 (1114 connections now open)
2015-05-27T19:18:43.379-0400 I NETWORK  [initandlisten] connection accepted from x.x.x.x:35687 #100065 (1115 connections now open)
2015-05-27T19:18:43.380-0400 I NETWORK  [conn100065] end connection x.x.x.x:35687 (1114 connections now open)
2015-05-27T19:18:43.380-0400 I NETWORK  [initandlisten] connection accepted from x.x.x.x:35688 #100066 (1115 connections now open)
2015-05-27T19:18:43.380-0400 I NETWORK  [conn100066] end connection x.x.x.x:35688 (1114 connections now open)
2015-05-27T19:18:43.381-0400 I NETWORK  [initandlisten] connection accepted from x.x.x.x:35689 #100067 (1115 connections now open)
2015-05-27T19:18:43.381-0400 I NETWORK  [conn100067] end connection x.x.x.x:35689 (1114 connections now open)
2015-05-27T19:18:43.382-0400 I NETWORK  [initandlisten] connection accepted from x.x.x.x:35690 #100068 (1115 connections now open)
2015-05-27T19:18:43.382-0400 I NETWORK  [conn100068] end connection x.x.x.x:35690 (1114 connections now open)
2015-05-27T19:18:43.383-0400 I NETWORK  [initandlisten] connection accepted from x.x.x.x:35691 #100069 (1115 connections now open)
2015-05-27T19:18:43.383-0400 I NETWORK  [conn100069] end connection x.x.x.x:35691 (1114 connections now open)
2015-05-27T19:18:43.384-0400 I NETWORK  [initandlisten] connection accepted from x.x.x.x:35692 #100070 (1115 connections now open)
2015-05-27T19:18:43.384-0400 I NETWORK  [conn100070] end connection x.x.x.x:35692 (1114 connections now open)

Note the freakishly short lifetimes! This of course isn’t healthy for the primary but it gets worse than that; on the new replica, while this firehose was blasting the primary, I ran a quick check of the TIME_WAIT reservations:

# netstat -an | grep TIME_WAIT | wc -l
25246

This means we’re exhausting network ports at a furious rate (with a default Linux ephemeral port range of roughly 28,000 ports and a 60-second TIME_WAIT, anything much above ~470 new connections per second to the same host will run out of ports), and of course we end up choking:

I REPL     [repl writer worker 2] replication info adding missing object
W NETWORK  [repl writer worker 2] Failed to connect to x.x.x.x:27017, reason: errno:99 Cannot assign requested address
I REPL     [repl writer worker 2] repl: couldn't connect to server war-mongo2:27017 (x.x.x.x), connection attempt failed
W REPL     [repl writer worker 2] network problem detected while connecting to the sync source, attempt 1 of 3
W NETWORK  [repl writer worker 2] Failed to connect to x.x.x.x:27017, reason: errno:99 Cannot assign requested address
I REPL     [repl writer worker 2] repl: couldn't connect to server war-mongo2:27017 (x.x.x.x), connection attempt failed
W REPL     [repl writer worker 2] network problem detected while connecting to the sync source, attempt 2 of 3
W NETWORK  [repl writer worker 2] Failed to connect to x.x.x.x:27017, reason: errno:99 Cannot assign requested address
I REPL     [repl writer worker 2] repl: couldn't connect to server war-mongo2:27017 (x.x.x.x), connection attempt failed
W REPL     [repl writer worker 2] network problem detected while connecting to the sync source, attempt 3 of 3
I -        [repl writer worker 2] Assertion: 15916:Can no longer connect to initial sync source: war-mongo2:27017
I CONTROL  [repl writer worker 2]
0xf6d639 0xf0d371 0xef25cf 0xef267c 0xcaced2 0xcac176 0xcb2b5c 0xf0418b 0xfbab94 0x7f0eb110ce9a 0x7f0eafbe338d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"B6D639"},{"b":"400000","o":"B0D371"},{"b":"400000","o":"AF25CF"},{"b":"400000","o":"AF267C"},{"b":"400000","o":"8ACED2"},{"b":"400000","o":"8AC176"},{"b":"4
 mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf6d639]
 mongod(_ZN5mongo10logContextEPKc+0xE1) [0xf0d371]
 mongod(_ZN5mongo11msgassertedEiPKc+0xAF) [0xef25cf]
 mongod(+0xAF267C) [0xef267c]
 mongod(_ZN5mongo4repl4Sync13getMissingDocEPNS_16OperationContextEPNS_8DatabaseERKNS_7BSONObjE+0x622) [0xcaced2]
 mongod(_ZN5mongo4repl4Sync11shouldRetryEPNS_16OperationContextERKNS_7BSONObjE+0x196) [0xcac176]
 mongod(_ZN5mongo4repl21multiInitialSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x8C) [0xcb2b5c]
 mongod(_ZN5mongo10threadpool6Worker4loopERKSs+0x2FB) [0xf0418b]
 mongod(+0xBBAB94) [0xfbab94]
 libpthread.so.0(+0x7E9A) [0x7f0eb110ce9a]
 libc.so.6(clone+0x6D) [0x7f0eafbe338d]
-----  END BACKTRACE  -----
2015-05-27T19:18:56.663-0400 E REPL     [repl writer worker 2] writer worker caught exception:  :: caused by :: 15916 Can no longer connect to initial sync source: war-mongo2:27017 on: { ts
2015-05-27T19:18:56.663-0400 I -        [repl writer worker 2] Fatal Assertion 16361
2015-05-27T19:18:56.663-0400 I -        [repl writer worker 2]
 
***aborting after fassert() failure

We had this happen twice before we started messing with kernel parameters, which ultimately got us up and running.

Since we also rebuilt larger "mostly write, no delete" replica sets the same day and never saw this, it seems likely that our highly transient collection is what causes this sort of connection abuse.



 Comments   
Comment by Evin Roesle [ 08/Jun/20 ]

I am closing this ticket as this issue is fixed by upgrading to 4.4. Please upgrade to 4.4 or a later version if you are experiencing this issue.

Comment by Glen Miner [ 24/Aug/17 ]

The last workaround (net.ipv4.tcp_tw_reuse = 1, net.ipv4.ip_local_port_range = 10000 61000) worked again; we again saw a peak of ~51K connections in TIME_WAIT, but with reuse enabled it was able to keep up and not crash.

One thing I wondered, based on https://vincent.bernat.im/en/blog/2014-tcp-time-wait-state-linux#socket-lingering, was whether SO_LINGER on the replication thread's sockets is something to consider adjusting around the close() call so that you don't overrun things. It might be a surgical strike that solves this without having to face the bigger task of reusing a connection.
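
For reference, the trick described in that article can be illustrated with plain sockets. This is an illustrative Python sketch only, not an option mongod exposes: setting SO_LINGER with l_onoff=1 and l_linger=0 makes close() send an RST instead of a FIN, so the local socket skips TIME_WAIT entirely.

import socket
import struct

sock = socket.create_connection(("war-mongo2", 27017))  # host name from the logs above
# l_onoff=1, l_linger=0: close() sends an RST and leaves no TIME_WAIT reservation
sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
# ... fetch the missing document here ...
sock.close()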

Comment by Glen Miner [ 24/Aug/17 ]

FWIW we just upgraded to 3.4.7 and ran into exactly the same problem.

Comment by Eric Milkie [ 28/May/15 ]

Thanks for filing this.
We're rewriting the fetching logic for initial sync for MongoDB 3.2 so that it no longer initiates a new connection for each missing document.
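
To make the difference concrete, here is an illustrative pymongo sketch of the two patterns (not the server's actual C++ fetcher; database and collection names are hypothetical): the old behaviour opens and closes a connection per missing document, each of which parks a socket in TIME_WAIT, while the reworked approach reuses a single connection for every fetch.

from pymongo import MongoClient

missing_ids = []  # placeholder for the _ids that failed to apply

def fetch_per_connection(ids):
    # port-exhausting pattern: a fresh TCP connection per missing document
    for _id in ids:
        client = MongoClient("war-mongo2", 27017)
        client["lotus_temp_db"]["transient"].find_one({"_id": _id})
        client.close()  # socket parked in TIME_WAIT for ~60 seconds

def fetch_reusing_connection(ids):
    # one connection reused for every fetch
    client = MongoClient("war-mongo2", 27017)
    coll = client["lotus_temp_db"]["transient"]
    for _id in ids:
        coll.find_one({"_id": _id})
    client.close()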
