[SERVER-31238] Stale mongos nodes can fail moveChunk commands without ever refreshing
Created: 25/Sep/17  Updated: 30/Oct/23  Resolved: 04/Nov/19

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.4.9, 3.5.13
Fix Version/s: 4.3.1

Type: Bug
Priority: Major - P3
Reporter: Dianna Hohensee (Inactive)
Assignee: Janna Golden
Resolution: Fixed
Votes: 0
Labels: sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v4.4, v4.2, v4.0
Sprint: Sharding 2019-10-21, Sharding 2019-11-04, Sharding 2019-11-18
Participants:
Linked BF Score: 8

 Description   

The mongos used to send moveChunk directly against the shard, so the shardVersion protocol was in effect and stale mongos nodes would refresh their routing tables and retry.

However, mongos nodes no longer send moveChunk against a shard node, but rather against the config server. This means that a stale mongos can receive a command like this:

{
    moveChunk: nss,
    find: {_id: 1},
    to: 'shard1'
}

and forward it to the config server with the stale chunk bounds (MinKey, MaxKey). The config server then forwards it to the shard, which fails it with IncompatibleShardingMetadata ("chunk does not exist") because another mongos previously split the chunk. This error passes back through the config server to the mongos, which fails the command without refreshing its routing table or retrying.
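The failure path can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: every function and variable name below is hypothetical (none are MongoDB APIs), MinKey/MaxKey are approximated with -Infinity/Infinity, and the namespace "test.coll" and the split point _id: 0 are made up for illustration.

```javascript
// Hypothetical in-memory model of the failure; none of these names are MongoDB APIs.
// MinKey/MaxKey are approximated with -Infinity/Infinity.
const MIN_KEY = -Infinity;
const MAX_KEY = Infinity;

// Chunks the shard actually owns after another mongos split at _id: 0 (assumed split point).
const shardChunks = [
  { min: MIN_KEY, max: 0 },
  { min: 0, max: MAX_KEY },
];

// What the stale mongos still believes: a single unsplit chunk.
const staleRoutingTable = [{ min: MIN_KEY, max: MAX_KEY }];

// Resolve the 'find' value to the chunk that contains it in the given table.
function findChunk(routingTable, id) {
  return routingTable.find((c) => c.min <= id && id < c.max);
}

// The shard only accepts bounds that exactly match a chunk it owns.
function shardMoveChunk(bounds) {
  const exists = shardChunks.some(
    (c) => c.min === bounds.min && c.max === bounds.max
  );
  return exists
    ? { ok: 1 }
    : { ok: 0, codeName: "IncompatibleShardingMetadata", errmsg: "chunk does not exist" };
}

// The config server forwards the request and passes the shard's reply back unchanged.
function configServerMoveChunk(bounds) {
  return shardMoveChunk(bounds);
}

// mongos turns {find: ...} into chunk bounds using whatever routing table it has cached.
function mongosMoveChunk(routingTable, cmd) {
  const chunk = findChunk(routingTable, cmd.find._id);
  return configServerMoveChunk({ min: chunk.min, max: chunk.max });
}

const cmd = { moveChunk: "test.coll", find: { _id: 1 }, to: "shard1" };
const staleResult = mongosMoveChunk(staleRoutingTable, cmd); // stale bounds are rejected
const freshResult = mongosMoveChunk(shardChunks, cmd);       // up-to-date bounds succeed
console.log(staleResult.codeName, freshResult.ok);
```

In this model the same command succeeds or fails depending only on which routing table the mongos uses to compute the bounds, which is why the missing refresh matters.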

This is a regression from 3.2: moving the balancer to the config server in v3.4 changed the moveChunk behavior.



 Comments   
Comment by Matthew Saltz (Inactive) [ 04/May/20 ]

Saw this failure in a 4.0 patch build, and it's a one-line change, so I think it's worth backporting.

Comment by Githook User [ 22/Oct/19 ]

Author:

{'username': 'jannaerin', 'email': 'janna.golden@mongodb.com', 'name': 'Janna Golden'}

Message: SERVER-31238 Add awaitLastOpTimeCommitted between split and move chunk commands in multi_mongos2.js
Branch: master
https://github.com/mongodb/mongo/commit/f21fbad4a68fb9447a26c355a006e387b9eccd7b

Comment by Esha Maharishi (Inactive) [ 21/Oct/19 ]

Note, mongos's moveChunk command does force a refresh before determining what chunk the find argument is in.

Comment by Dianna Hohensee (Inactive) [ 02/Oct/17 ]

BF-5972 portrays a similar issue, but is returning StaleShardVersion somehow. Maybe it's a matter of changing cluster moveChunk's retry policy and/or making it recreate the command with the correct, newly refreshed bounds (when given the 'find' field).
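A rough sketch of what such a retry policy might look like follows. It is not the actual mongos implementation: the helper name moveChunkWithRetry, the router object, the sendToConfigServer callback, and the choice of which error names count as retryable are all assumptions made for illustration.

```javascript
// Hypothetical sketch of a retry policy; not the actual mongos implementation.
// Which error names indicate stale routing info is an assumption based on this ticket.
const RETRYABLE = new Set(["StaleShardVersion", "IncompatibleShardingMetadata"]);

function moveChunkWithRetry(cmd, router, sendToConfigServer, maxAttempts = 2) {
  let reply;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // Rebuild the bounds from 'find' on every attempt so a refresh takes effect.
    const chunk = router.chunkFor(cmd.find._id);
    reply = sendToConfigServer({ min: chunk.min, max: chunk.max, to: cmd.to });
    if (reply.ok || !RETRYABLE.has(reply.codeName)) return reply;
    router.refresh(); // stale routing info suspected: reload before the next attempt
  }
  return reply;
}

// Toy stand-ins for the cluster (all hypothetical).
const authoritative = [
  { min: -Infinity, max: 0 },
  { min: 0, max: Infinity },
];
const router = {
  table: [{ min: -Infinity, max: Infinity }], // stale: predates the split
  refreshes: 0,
  chunkFor(id) {
    return this.table.find((c) => c.min <= id && id < c.max);
  },
  refresh() {
    this.table = authoritative;
    this.refreshes++;
  },
};
const sent = []; // records every request forwarded to the config server
function sendToConfigServer(req) {
  sent.push(req);
  const ok = authoritative.some((c) => c.min === req.min && c.max === req.max);
  return ok ? { ok: 1 } : { ok: 0, codeName: "IncompatibleShardingMetadata" };
}

const retryResult = moveChunkWithRetry(
  { find: { _id: 1 }, to: "shard1" },
  router,
  sendToConfigServer
);
console.log(retryResult.ok, sent.length, router.refreshes);
```

The key design point in the sketch is that the bounds are recomputed from the 'find' field inside the loop, so the second attempt uses the refreshed chunk rather than resending the stale one.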
