Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 4.0.28, 4.2.19
Affects Version/s: 4.0.26
Component/s: Sharding
Labels:
None

Backwards Compatibility:
Fully Compatible
Operating System:
ALL
Backport Requested:
v4.2
Sprint:
Sharding EMEA 2021-10-04, Sharding EMEA 2021-10-18, Sharding EMEA 2021-11-01, Sharding EMEA 2021-11-15, Sharding EMEA 2021-11-29, Sharding EMEA 2021-12-13, Sharding EMEA 2021-12-27
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

Recently, we have been running several sharded clusters on version 4.0.26. Sometimes update operations become extremely slow, taking from tens of seconds to a few minutes.

After in-depth analysis, I believe this is a bug.

First, when a moveChunk moves a chunk from shard A to shard B, B first cleans up any existing data in that chunk's range and waits for the cleanup to finish, so that the newly migrated data cannot be deleted by an older, still-queued cleanup task. As a result, a moveChunk can take a very long time, up to 15 minutes (the default orphanCleanupDelaySecs of 900 seconds, plus rangeDeleterBatchDelayMS between delete batches).

// Wait for any other, overlapping queued deletions to drain        
auto status = CollectionShardingRuntime::waitForClean(opCtx, _nss, _epoch, footprint);

Secondly, there is a Jira ticket, https://jira.mongodb.org/browse/SERVER-56779: starting with 4.0.26, MongoDB no longer uses the collection distributed lock for chunk merges and uses the ActiveMigrationsRegistry instead. This introduces new blocking behavior:

 *   - Move || Move (same chunk): The second move will join the first
 *   - Move || Move (different chunks or collections): The second move will result in a
 *                                             ConflictingOperationInProgress error
 *   - Move || Split/Merge (same collection): The second operation will block behind the first
 *   - Move/Split/Merge || Split/Merge (for different collections): Can proceed concurrently

That is, a split is blocked behind a moveChunk on the same collection until the moveChunk finishes.
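The rules quoted above can be written out as a small decision function. This is only a toy sketch of the comment's four cases, not the server's ActiveMigrationsRegistry code: `OpKind`, `Outcome`, `ShardOp` and `resolve` are hypothetical names, and any combination the comment does not mention is reported as `Unspecified` rather than guessed.

```cpp
#include <string>

// Toy model of the concurrency rules quoted above. All names are
// hypothetical; this is not the server's ActiveMigrationsRegistry API.
enum class OpKind { Move, Split, Merge };
enum class Outcome { Join, Conflict, Block, Proceed, Unspecified };

struct ShardOp {
    OpKind kind;
    std::string collection;  // namespace, e.g. "db.coll"
    std::string chunk;       // chunk identifier (only compared for moves)
};

// Decide what happens to `second` when `first` is already running.
Outcome resolve(const ShardOp& first, const ShardOp& second) {
    const bool sameColl = first.collection == second.collection;
    const bool secondIsSplitOrMerge = second.kind != OpKind::Move;

    if (first.kind == OpKind::Move && second.kind == OpKind::Move) {
        // Move || Move: the same chunk joins, anything else conflicts.
        if (sameColl && first.chunk == second.chunk) return Outcome::Join;
        return Outcome::Conflict;
    }
    if (first.kind == OpKind::Move && secondIsSplitOrMerge && sameColl) {
        // Move || Split/Merge on the same collection: second waits.
        return Outcome::Block;
    }
    if (secondIsSplitOrMerge && !sameColl) {
        // Anything || Split/Merge on a different collection.
        return Outcome::Proceed;
    }
    return Outcome::Unspecified;  // not covered by the quoted comment
}
```

With a running move on `db.coll`, `resolve` returns `Outcome::Block` for a split on the same collection. That is the case this report depends on: the split has no timeout of its own in this model and simply inherits the duration of the move.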

Lastly, in 4.0.26 auto-split is also triggered by mongos and runs as part of the update operation.

So the following scenario sometimes occurs: a chunk is moved from shard A to shard B and then moved back from shard B to shard A. The second moveChunk is blocked for up to 15 minutes. The update operation is then blocked by splitChunk, and splitChunk is waiting for that moveChunk to finish.
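The wait chain in this scenario can be illustrated with a little arithmetic. The sketch below assumes each operation cannot start its own work until the one it is blocked on has finished, so the latency seen by the update is the sum along the chain. `Step` and `latencyOfLast` are hypothetical names, and every duration in the usage note other than the 15 minutes mentioned above is invented for illustration.

```cpp
#include <chrono>
#include <string>
#include <vector>

// One link in the wait chain: after whatever it is blocked on has
// finished, `name` still needs `ownWork` to complete.
struct Step {
    std::string name;
    std::chrono::seconds ownWork;
};

// Steps are ordered from the innermost blocker to the outermost waiter.
// Each step waits for the previous one, so the last step's latency is
// the sum of all the steps' own work.
std::chrono::seconds latencyOfLast(const std::vector<Step>& chain) {
    std::chrono::seconds total{0};
    for (const auto& step : chain) {
        total += step.ownWork;
    }
    return total;
}
```

For example, a chain of waitForClean (900 s), the second moveChunk (30 s, made up), splitChunk (1 s, made up) and the update itself (0 s) gives the update a latency of 931 s, almost all of it inherited from the range-deletion wait.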

From 4.2 onward, auto-split is triggered by mongod and runs as an asynchronous task, so this problem only affects 4.0.26.

is caused by: SERVER-56779 Do not use the collection distributed lock for chunk merges (Closed)

Assignee: Kaloian Manassiev
Reporter: FirstName lipengchong
Participants: FirstName lipengchong, Githook User, Kaloian Manassiev
Votes: 0
Watchers: 6

Created: Sep 14 2021 03:40:09 AM UTC
Updated: Oct 29 2023 09:48:39 PM UTC
Resolved: Dec 14 2021 11:50:22 AM UTC
Confidence Status Last Update: 13/Dec/21 4:52 PM
