Priority: Major - P3
Affects Version/s: 2.6.9, 3.0.2, 3.2.10, 3.4.0-rc0
Fix Version/s: 3.4.0-rc1
When the special top/bottom splits are done it can result in a chunk which spans a tag range, leading to that chunk residing on the incorrect shard for (part of) the tag range. When the balancer runs it will then split the chunk to the tag range boundary and move the chunk to the tagged shard, but this leaves the chunk, which includes partial tag ranges, on a shard not allocated to that tag range temporarily.
When using tag-aware sharding there are circumstances where a chunk will be moved to the wrong shard, i.e. chunks containing a (sub) range of a given tag can be moved to shards which are not associated with that tag.
This issue described here is one in which documents are being inserted (all of which target a single Tag range) yet a split and move can occur during the insertion process in which the top chunk is moved to a non-aligned shard. New documents in this upper range can therefore end up on the wrong shard (though the balancer does seem to move the chunk to a valid shard shortly afterwards).
The issue may occur because the command to create a tag range seems to cause a split at the lower end of the range but not at the top. Auto-splits which later occur can result in a split point being generated below the max key for that tag range, with the resulting top chunk (which contains a portion of the tag-range) being moved to an invalid shard.
A workaround seems to be to manually create a split point at the top end of the tag range and then moving the resulting chunk (spanning the whole tag range) to an appropriate shard.
The same thing could be accomplished by creating 'dummy' tag ranges for all sections outside of the real tag ranges, effectively creating tags spanning the entire range, from min to max. This will automatically create a split at the top of the real tag ranges (because they are also the bottom of the dummy tag ranges). In this case, the move will occur automatically (but it does take some time, i.e. 30+ seconds, for the move to occur)
A repro script has been provided. It uses mtools to create a 4 shard cluster with 2 tag ranges (each tag is assigned to 2 shards). It then inserts data which targets a single tag range but there are brief periods (after a split) where the top chunk is moved to a shard which is not associated with that tag. The balancer will quickly realise this and move the chunk back - but this intermediate state can cause many issues, for example, it can mean expensive data transfers between different sites (in both directions).