[SERVER-31238] Stale mongos nodes can fail moveChunk commands without ever refreshing Created: 25/Sep/17 Updated: 30/Oct/23 Resolved: 04/Nov/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 3.4.9, 3.5.13 |
| Fix Version/s: | 4.3.1 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Dianna Hohensee (Inactive) | Assignee: | Janna Golden |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | sharding-wfbf-day | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||
| Operating System: | ALL | ||||||||
| Backport Requested: | v4.4, v4.2, v4.0 | ||||||||
| Sprint: | Sharding 2019-10-21, Sharding 2019-11-04, Sharding 2019-11-18 | ||||||||
| Participants: | |||||||||
| Linked BF Score: | 8 | ||||||||
| Description |
|
The mongos used to send moveChunk directly to the shard, so the shardVersion protocol was in effect and stale mongos nodes would refresh their routing tables and retry. However, mongos nodes no longer send moveChunk to a shard node, but rather to the config server. This means that a stale mongos can receive a moveChunk command and forward it to the config server with the chunk bounds (MinKey, MaxKey). The config server then forwards it to the shard, which fails it with IncompatibleShardingMetadata (chunk does not exist) because another mongos previously split the chunk. This error passes back through the config server to the mongos, which simply fails the command without refreshing. This is a regression from 3.2: moving the balancer to the config server in v3.4 changed the moveChunk behavior. |
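The failure path described above can be illustrated with a small toy model. This is only a sketch, not server code: `Chunk`, `find_chunk`, `shard_move_chunk`, `config_move_chunk`, and `stale_mongos_move_chunk` are hypothetical names, and floating-point infinities stand in for MinKey and MaxKey.

```python
from dataclasses import dataclass

MIN_KEY = float("-inf")  # stand-in for MinKey (assumption for this sketch)
MAX_KEY = float("inf")   # stand-in for MaxKey (assumption for this sketch)


class IncompatibleShardingMetadata(Exception):
    """Hypothetical stand-in for the shard's 'chunk does not exist' error."""


@dataclass(frozen=True)
class Chunk:
    lo: float
    hi: float
    shard: str


def find_chunk(routing_table, key):
    """Return the chunk whose [lo, hi) range contains key."""
    for chunk in routing_table:
        if chunk.lo <= key < chunk.hi:
            return chunk
    raise KeyError(key)


def shard_move_chunk(authoritative_chunks, lo, hi):
    """The shard only accepts bounds that exactly match an existing chunk."""
    if not any(c.lo == lo and c.hi == hi for c in authoritative_chunks):
        raise IncompatibleShardingMetadata(f"chunk [{lo}, {hi}) does not exist")
    return "ok"


def config_move_chunk(authoritative_chunks, lo, hi):
    """The config server forwards the request; shard errors pass straight through."""
    return shard_move_chunk(authoritative_chunks, lo, hi)


def stale_mongos_move_chunk(cached_table, authoritative_chunks, key):
    """A router that takes bounds from its cached table and never refreshes."""
    chunk = find_chunk(cached_table, key)
    return config_move_chunk(authoritative_chunks, chunk.lo, chunk.hi)
```

With a cached table holding a single (MinKey, MaxKey) chunk and an authoritative table in which that chunk has been split at 0, `stale_mongos_move_chunk` raises `IncompatibleShardingMetadata`, and nothing in the call chain triggers a refresh of the cached table.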
| Comments |
| Comment by Matthew Saltz (Inactive) [ 04/May/20 ] |
|
Saw this failure in a 4.0 patch build, and it's a one-line change, so I think it's worth backporting |
| Comment by Githook User [ 22/Oct/19 ] |
|
Author: {'username': 'jannaerin', 'email': 'janna.golden@mongodb.com', 'name': 'Janna Golden'}
Message: |
| Comment by Esha Maharishi (Inactive) [ 21/Oct/19 ] |
|
Note, mongos's moveChunk command does force a refresh before determining what chunk the find argument is in. |
| Comment by Dianna Hohensee (Inactive) [ 02/Oct/17 ] |
|
BF-5972 portrays a similar issue, but somehow returns StaleShardVersion. It may be a matter of changing cluster moveChunk's retry policy and/or making it recreate the command with the correct, newly refreshed bounds (when given the 'find' field). |
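The retry policy suggested in this comment could look roughly like the sketch below. It is an illustration of the idea only and is not the change that resolved this ticket. All names (`move_chunk_with_refresh`, `resolve_bounds`, `refresh`, `send`, `demo`) are hypothetical, and chunks are modeled as plain `(lo, hi)` tuples with infinities standing in for MinKey and MaxKey.

```python
class IncompatibleShardingMetadata(Exception):
    """Hypothetical stand-in for the error the shard returns."""


def move_chunk_with_refresh(resolve_bounds, refresh, send, key, max_attempts=2):
    """On a metadata mismatch, refresh and rebuild the bounds from the 'find' key."""
    last_error = None
    for _ in range(max_attempts):
        lo, hi = resolve_bounds(key)  # recomputed on every attempt
        try:
            return send(lo, hi)
        except IncompatibleShardingMetadata as exc:
            last_error = exc
            refresh()  # pull a fresh routing table before the next attempt
    raise last_error


def demo():
    """Drive the retry loop with a stale cache and a chunk that was split at 0."""
    inf = float("inf")
    authoritative = [(-inf, 0), (0, inf)]
    cache = {"chunks": [(-inf, inf)]}  # stale view: one unsplit chunk
    calls = []

    def resolve_bounds(key):
        return next((lo, hi) for lo, hi in cache["chunks"] if lo <= key < hi)

    def refresh():
        cache["chunks"] = list(authoritative)

    def send(lo, hi):
        calls.append((lo, hi))
        if (lo, hi) not in authoritative:
            raise IncompatibleShardingMetadata("chunk does not exist")
        return "ok"

    return move_chunk_with_refresh(resolve_bounds, refresh, send, key=5), calls
```

In `demo()`, the first attempt sends the stale full-range bounds and is rejected; after the refresh, the second attempt sends the bounds of the chunk that actually contains the key. The attempt limit keeps a persistently failing command from retrying forever.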