Loading...

XML

Word

Printable

JSON

Type: Bug
Resolution: Duplicate
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 2.6.10, 3.0.4
Component/s: Networking, Replication
Labels:
None

Operating System:
ALL
Steps To Reproduce:

Hide

Get a huge mongo collection (3Tb), put it under an intense create/update/delete load (100+ updates sec), try to add a replica to the given replicaset, see it fail every time.

Show
Get a huge mongo collection (3Tb), put it under an intense create/update/delete load (100+ updates sec), try to add a replica to the given replicaset, see it fail every time.
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

I'm adding a new replica to a mongo replica set with a huge ~3Tb single collection in it. The collection is very active, which means that while Mongo is seeding the data from the primary and building its primary key index, huge amounts of changes occur in the table.

The problem occurs when this new replica starts to catch up on oplog and performs checks against the primary. It looks like Mongo creates a new connection for each document (500+ new connections a second created from the replica to the primary according to our monitoring). After some time of that intense flood of connections we hit the limit on ephemeral source ports available for use because connections spend some time in TIME_WAIT before being recycled (even with tw_reuse and tw_recycle connection rate is high enough to exhaust all the ports).

Each time we hit the issue, Mongo receives a connection error (errno 99: Cannot assign requested address) from the kernel. There is logic in SyncTail::getMissingDoc to perform up to 3 retries before giving up and in some cases it helps – kernel manages to free up some ports and connection succeeds upon retry. But since it takes a huge amount of time to seed the servers and build the primary index (up to 24 hours), the amounts of changes Mongo needs to apply to catch up with oplog is huge, so there are many chances for it to fail all 3 retries and die with an Assertion: 15916:Can no longer connect to initial sync source message.

At the moment I'm trying to create the replica for the 5th time (each try taking 1-2 days) and so far I've managed to catch the moment mongo started hitting the limit of 60K TIME_WAIT sockets in time to be able to run cpulimit on the process to slow it down and reduce the rate of socket creation enough so that the kernel could recycle those sockets fast enough, but this is clearly an issue that needs fixing in the code.

Some additional resources:

TIME_WAIT sockets count on the replica during the incident: https://www.evernote.com/l/ADQS6_1-TVpGKpMgSiHHxX8f0AG3Yvhe-ZgB/image.png
Netstat excerpt with TIME_WAIT sockets during the incident: https://www.evernote.com/l/ADRM1hjdAeZJFZnjPTwv-9VnxDBweX3kZvsB/image.png
TCP stack status graph, showing 500-600 new connections established a second from the replica: https://www.evernote.com/l/ADQku5K0WhtEgZ4q0Lei5mHEG9sRpfZHpiAB/image.png

duplicates

SERVER-18721 Initial-sync can exhaust available ports with rapid short-lived connections

Closed

Assignee:: Ramon Fernandez
Reporter:: Oleksiy Kovyrin
Participants:: Oleksiy Kovyrin, Ramon Fernandez
Votes:: 0 Vote for this issue
Watchers:: 7 Start watching this issue

Created:: Jun 28 2015 03:29:24 PM UTC
Updated:: Aug 03 2015 09:39:10 PM UTC
Resolved:: Aug 03 2015 09:39:10 PM UTC

Details

Description

Attachments

Issue Links

Activity

People

Dates

PagerDuty