When a shard restarts, it loses all its sharding metadata information. So if a mongos that has sharding information sends a write command to the restarted shard, it will get a stale config error from the shard because the shard contains no shard version (see note 1). The mongos will see the stale error with shard zero version from the response and will decide to perform a full reload (see note 2). In setups with very huge number of chunks (in millions), it takes time for the entire chunk metadata to be loaded and this can cause issues because this is done under a mutex, and will cause other threads which got the same response from the shard to decide to perform a full reload as well. This is exacerbated by the fact that the thread will have to acquire and release the same contentious mutex multiple times until it finishes, so it can hold on to the newly fully loaded data for even longer periods of time. Also note that for every full reload, it will create a new instance of ChunkManager and will try to atomically replace the old one with the new one when the reload finishes. In a mongos with multiple threads trying to execute a write command, it can create a situation where several threads will queue up trying to perform a full reload and some threads have loaded their own copy of the chunk metadata but are blocked waiting for the same mutex the other threads are waiting for the full reload. In certain cases with large enough chunks and simultaneous write command operations, it can spiral out of control, consume too much memory and ultimately get killed by the OOM killer in the operating system.
In mongod write command execution path, version is checked first inside here:
sets the error in the response, and then performs a refresh afterwards:
Shard response with zero version will result to unknown comparison result:
and ultimately, causing it to flush the entire chunk manager: