[SERVER-14993] Mongos does not abort non-responsive update operations on primary change Created: 21/Aug/14 Updated: 06/Dec/22 Resolved: 13/Jun/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 2.6.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Jose Luis Pedrosa | Assignee: | [DO NOT USE] Backlog - Sharding Team |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Assigned Teams: |
Sharding
|
| Operating System: | ALL |
| Participants: |
| Description |
|
Hi. We were using a two-node replica set with an extra arbiter and a mongos on top (the servers were started with --shardsvr and --replSet). The mongos was running on a Linux box, while the mongod processes were running on Windows, all on 2.6.4. In scenarios 1 and 2, updates were aborted with errors and started working again once the new primary was elected. In the third scenario the client gets stuck in the update command (using the latest stable C# driver). We could work around the issue by setting the socketTimeoutMS option in the connection string, but in principle the preferred behaviour is the one shown in scenarios 1 and 2. The test has been repeated several times to rule out mistakes. As a side note, the times were 3, 16 and 23 seconds for the three scenarios. |
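To illustrate the workaround mentioned above: the socket timeout can be set directly in the connection string. This is a sketch only — the host, port, database name, and the 5000 ms value are placeholders; socketTimeoutMS itself is a standard MongoDB connection-string option.

```
mongodb://mongos-host:27017/mydb?socketTimeoutMS=5000
```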
| Comments |
| Comment by Jose Luis Pedrosa [ 30/Aug/14 ] |
|
Hi Greg. I think the case you mention is pretty extreme, since the order of magnitude of the response time for a connection (even over the internet, across thousands of kilometres) should be at most a few hundred milliseconds, while a failover takes a few seconds in the best case, on the same VLAN with sub-millisecond latencies. If the operation has already completed, it means the primary at that moment had not even detected that it couldn't communicate with the other members (otherwise the election process would kick in), and detection takes at least heartbeatTimeoutSecs. The op also could not have been replicated to the other nodes, so it would be rolled back. If we think about the consequences of aborting in-progress CRUD operations on a new primary election, the worst case is that the application would treat a successful operation as failed (an operation which, as I said, I understand will be rolled back once communications are re-established, because of replication). I think that possible issue is much better than leaving connections hanging indefinitely. In my opinion it should not be "abort non-responsive"; it should be: abort CRUD operations when a new primary is elected in the replica set. BR |
| Comment by Greg Studer [ 29/Aug/14 ] |
|
> as even if they would reach the server it would not be primary anymore and should not be processed I think this all really boils down to "is the replica set timeout generally a good proxy for mongos to replica set timeout, or do we need additional configuration"? |
| Comment by Jose Luis Pedrosa [ 29/Aug/14 ] |
|
@Greg, I'm not 100% sure I'm following you. I think all queries (without any special write concern) and inserts/updates/deletes should be aborted when the new primary is elected (not when the old one is marked as unhealthy), since even if they reached the server it would no longer be primary and should not process them. I agree that some commands may not be abortable, but I don't think that applies to CRUD ops. Do you agree? |
| Comment by Greg Studer [ 29/Aug/14 ] |
|
Better handling of non-responsive nodes is definitely something we're looking to improve. It's not clear that aborting a write when a new primary is detected is always the correct behavior, however. A sporadic network problem or unexpected load which may be delaying your response can often also trigger replica set failovers - especially across data centers, etc. Without timing information there's (currently) no very general way to guess whether the operation has been lost or whether the replica set was simply faster to reconfigure and tell mongos than the write response was to arrive. Short-term, I think operation-level timeouts would best address this - in these tests your writes are expected to complete or fail quickly with little network delay (because your network is assumed fast), and that information can be passed to mongos or a driver. |
| Comment by Andy Schwerin [ 29/Aug/14 ] |
|
jlpedrosa, someone on the sharding team will take care of transforming this ticket into the described feature request and triaging it. Thanks. |
| Comment by Jose Luis Pedrosa [ 29/Aug/14 ] |
|
OK, so we are aligned. Do you want me to open another ticket with your suggested name and leave this one open? Do you need anything from me? Thanks for the quick answer. |
| Comment by Andy Schwerin [ 29/Aug/14 ] |
|
Yeah, that's because when a host disappears from the network, the other hosts' OSes can't distinguish between slow network traffic and failure. As a result, the OS never informs the MongoS that anything's wrong, because it doesn't know itself. In the other forms of failure you simulated, the MongoD (test 1) or the OS on node D (test 2) knows what's going on and is able to transmit a message. In test 1, it's the stepDown command. In test 2, the OS on node D sends an explicit hang-up message to the MongoS's OS to end the TCP connection for the process that it knows died. |
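Andy's point — that TCP alone cannot tell a slow peer from a dead one, so a blocked read only returns early if a timeout is set — can be demonstrated with nothing but the standard socket library. This is a minimal stdlib sketch, not MongoDB code: the silent server below stands in for a primary whose network cable was pulled (no FIN/RST ever arrives), and `settimeout` stands in for the effect of socketTimeoutMS.

```python
import socket

# A server that accepts a connection and then goes silent, like a
# primary that vanished from the network without closing the socket.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)

client = socket.create_connection(listener.getsockname())
server_side, _ = listener.accept()  # accepted, then never writes

# Without a timeout, recv() here would block indefinitely: the OS has
# no way to distinguish a slow peer from a dead one. With a timeout,
# the caller gets control back.
client.settimeout(0.2)
try:
    client.recv(1024)
    outcome = "data"
except socket.timeout:
    outcome = "timed out"

client.close()
server_side.close()
listener.close()
print(outcome)  # "timed out"
```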
| Comment by Jose Luis Pedrosa [ 29/Aug/14 ] |
|
Hi Andy, thanks! Yes, it's the in-flight connections that get stuck (as I said, just tuning the timeout fixes the issue; I actually keep using the same db and collection objects and everything keeps working). In this case I don't think the driver is involved, as I'm behind a mongos: the client does not have any TCP connection to the replica set. When we kill the process and the TCP socket is reset by the OS running mongod, mongos is smart enough to abort the operations in progress, but not if the failover does not involve a hard reset of the connections. Thanks again. Rgds JL |
| Comment by Andy Schwerin [ 29/Aug/14 ] |
|
So the problem is that in Scenario 3, MongoS cannot distinguish between Box D being slow and it being down, until the connection timeout expires. This is in some ways a TCP issue, but perhaps MongoS could more proactively monitor the replica set to detect the selection of the new primary. The write(s) in progress when the network cable is removed would be abandoned by the MongoS when it realizes that node D is no longer primary (keep in mind that MongoS doesn't yet know that node D is down, because the TCP timeout hasn't expired). greg_10gen, I think this boils down to a feature request to have MongoS (and perhaps also drivers that talk to replica sets) be more proactive in monitoring for primary changes and perhaps to quickly report as failures writes that are blocked on a replica set node that is now believed not to be primary. Something like the following: "When MongoS detects that a shard has elected a new primary, it should abort write operations blocked on the old primary." Separately, jlpedrosa, if when the client is stuck in its update you fire up a new client and connect to the same mongos to perform an update, does it also get stuck, or does it successfully contact the new primary? I think the problem you're experiencing may only affect operations that are in flight when the cable to node D is unplugged. |
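The feature request Andy describes — abort writes blocked on the old primary as soon as a monitor observes a new election — can be sketched with stdlib threading primitives. Everything here is hypothetical naming (`await_response`, `PrimaryChanged`), not mongos internals; the two events stand in for "the write's response arrived" and "the replica-set monitor saw a new primary".

```python
import threading


class PrimaryChanged(Exception):
    """Raised when a new primary is elected while a write is in flight."""


def await_response(response_ready, primary_changed, poll_secs=0.05):
    """Wait for a write's response, but abort early if the monitor
    (standing in for mongos's replica-set monitoring) reports that a
    new primary was elected while we were blocked on the old one.
    """
    while True:
        if response_ready.wait(poll_secs):
            return "ok"
        if primary_changed.is_set():
            raise PrimaryChanged("old primary stepped down; aborting write")
```

A caller blocked in `await_response` on a response that will never come is unblocked the moment `primary_changed` is set, instead of waiting for a TCP timeout.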
| Comment by Jose Luis Pedrosa [ 29/Aug/14 ] |
|
Hi Andy In the failovers we ensured that the mongod in BOX D was the primary. Best regards |
| Comment by Andy Schwerin [ 29/Aug/14 ] |
|
jlpedrosa, could you clarify which network link or node is disabled in scenario 3? |