Priority: Major - P3
Affects Version/s: 2.6.10, 3.0.4
Fix Version/s: None
I'm adding a new replica to a mongo replica set with a huge ~3Tb single collection in it. The collection is very active, which means that while Mongo is seeding the data from the primary and building its primary key index, huge amounts of changes occur in the table.
The problem occurs when this new replica starts to catch up on oplog and performs checks against the primary. It looks like Mongo creates a new connection for each document (500+ new connections a second created from the replica to the primary according to our monitoring). After some time of that intense flood of connections we hit the limit on ephemeral source ports available for use because connections spend some time in TIME_WAIT before being recycled (even with tw_reuse and tw_recycle connection rate is high enough to exhaust all the ports).
Each time we hit the issue, Mongo receives a connection error (errno 99: Cannot assign requested address) from the kernel. There is logic in SyncTail::getMissingDoc to perform up to 3 retries before giving up and in some cases it helps – kernel manages to free up some ports and connection succeeds upon retry. But since it takes a huge amount of time to seed the servers and build the primary index (up to 24 hours), the amounts of changes Mongo needs to apply to catch up with oplog is huge, so there are many chances for it to fail all 3 retries and die with an Assertion: 15916:Can no longer connect to initial sync source message.
At the moment I'm trying to create the replica for the 5th time (each try taking 1-2 days) and so far I've managed to catch the moment mongo started hitting the limit of 60K TIME_WAIT sockets in time to be able to run cpulimit on the process to slow it down and reduce the rate of socket creation enough so that the kernel could recycle those sockets fast enough, but this is clearly an issue that needs fixing in the code.
Some additional resources:
- TIME_WAIT sockets count on the replica during the incident: https://www.evernote.com/l/ADQS6_1-TVpGKpMgSiHHxX8f0AG3Yvhe-ZgB/image.png
- Netstat excerpt with TIME_WAIT sockets during the incident: https://www.evernote.com/l/ADRM1hjdAeZJFZnjPTwv-9VnxDBweX3kZvsB/image.png
- TCP stack status graph, showing 500-600 new connections established a second from the replica: https://www.evernote.com/l/ADQku5K0WhtEgZ4q0Lei5mHEG9sRpfZHpiAB/image.png