[SERVER-19689] mongos dead peer detection Created: 31/Jul/15  Updated: 12/Dec/23

Status: Backlog
Project: Core Server
Component/s: Networking, Sharding
Affects Version/s: 3.0.5
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Luke Prochazka Assignee: Backlog - Cluster Scalability
Resolution: Unresolved Votes: 1
Labels: AdiZ
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Assigned Teams:
Cluster Scalability
Participants:

 Description   

Currently there is no socket timeout option for mongos connections to mongod. In the event of a mongod failure, convergence is delayed because mongos keeps waiting until the TCP socket closes completely.

The delay is exacerbated when there is a slow query or an idle connection, as there is no "keep alive" packet to detect a socket timeout.
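
For illustration only, here is a minimal sketch of what per-socket keepalive tuning looks like at the OS level, written in Python rather than in the server's own code. The helper name `enable_keepalive` and the timer values are assumptions made for this example, not mongos options. `TCP_KEEPIDLE`, `TCP_KEEPINTVL` and `TCP_KEEPCNT` are the Linux option names, and the sketch skips them on platforms where they are not defined.

```python
import socket


def enable_keepalive(sock, idle_s=30, interval_s=10, probes=3):
    """Enable TCP keepalive on sock and tune its timers where supported.

    Hypothetical helper: the name and default values are illustrative only.
    """
    # Ask the kernel to probe the peer when the connection is idle.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Linux-specific tuning knobs; other platforms may not define them.
    tuning = (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between unanswered probes
        ("TCP_KEEPCNT", probes),        # unanswered probes before giving up
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock


if __name__ == "__main__":
    s = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
    print("keepalive enabled:", s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
```

With values like these, a silently dead peer would be noticed after roughly idle_s + interval_s * probes seconds (about a minute here) instead of the much longer OS default.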



 Comments   
Comment by Adam Flynn [ 31/Jul/15 ]

Specifically, the behaviour I'd like to see here is closing connections when a mongod is marked DOWN within a replica set. In the case of a hard server failure where open connections aren't closed, the election happens within about 30s but mongos (and the app) stall for min(server reboot time, OS TCP keepalive timeout, time to manually restart mongos). In default configurations, TCP keepalive timeout is pretty huge.

In the case of hardware failure (where you can't immediately reboot the server), the smallest term in there is probably time for an operator to restart mongos (5-10+ minutes if you have to page someone). If you can immediately reboot the server, it's still a 5-minute delay.

Since ReplicaSetMonitorWatcher knows the mongod is down within about 30s, convergence times can be improved by an order of magnitude by having detection of a DOWN member preempt all of its open connections currently blocking in recv.
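
The preemption idea can be sketched outside the server as follows. This is a hedged Python illustration, not the mongos implementation: `ConnectionRegistry`, `register` and `mark_down` are hypothetical names, and the sketch assumes a monitor component calls `mark_down` once it decides a host is DOWN. It relies on `shutdown()` waking a thread blocked in `recv()` on the same socket, which is the behaviour on Linux.

```python
import socket
import threading
from collections import defaultdict


class ConnectionRegistry:
    """Tracks open sockets per host so they can be preempted together.

    Hypothetical class for illustration; not part of any MongoDB API.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._by_host = defaultdict(set)

    def register(self, host, sock):
        with self._lock:
            self._by_host[host].add(sock)

    def mark_down(self, host):
        """Shut down every socket registered for host; return how many."""
        with self._lock:
            socks = self._by_host.pop(host, set())
        for sock in socks:
            try:
                # shutdown() makes a blocked recv() return immediately.
                # The owning thread remains responsible for close().
                sock.shutdown(socket.SHUT_RDWR)
            except OSError:
                pass  # socket was already disconnected
        return len(socks)


if __name__ == "__main__":
    a, b = socket.socketpair()  # stands in for a mongos-to-mongod connection
    registry = ConnectionRegistry()
    registry.register("shard0-a", a)

    received = []
    reader = threading.Thread(target=lambda: received.append(a.recv(16)))
    reader.start()                 # blocks in recv(): the peer never answers
    registry.mark_down("shard0-a")
    reader.join(5)
    print("reader unblocked:", not reader.is_alive())
    a.close()
    b.close()
```

The sketch only shuts the sockets down and leaves closing to the thread that owns each one, since closing a file descriptor that another thread is still blocked on is unsafe.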

Generated at Thu Feb 08 03:51:45 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.