Core Server / SERVER-45163

Balancer fails with moveChunk.error 'waiting for replication timed out' despite _secondaryThrottle unset

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • Affects Version/s: 4.2.2
    • Component/s: Replication, Sharding
    • Labels: None
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL

      Deploy a Sharded Cluster with at least 2 shards, deployed as a PSA Replica Set.

      Begin filling a sharded collection with lots of data so that the balancer starts running.

      After a short while, when checking the sharding status via sh.status(), errors similar to the one below will appear:

      Failed with error 'aborted', from shard3rs to shard2rs
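
      The failures can also be confirmed from the config server's changelog via mongos. This is a diagnostic sketch; config.changelog and the moveChunk.error event type are the standard schema used by the balancer:

```javascript
// Run in the mongo shell connected to a mongos: list the most
// recent failed chunk migrations recorded in the changelog.
use config
db.changelog.find(
    { what: "moveChunk.error" },          // only failed migrations
    { time: 1, shard: 1, details: 1 }     // keep the output readable
).sort({ time: -1 }).limit(5)
```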


      I have a MongoDB Deployment with 3 shards, all deployed as a PSA (Primary, Secondary, Arbiter) Replica Set.
       
      The Cluster works fine as long as the balancer is stopped.
       
      When I enable the balancer it successfully moves chunks for a short while but then begins failing with moveChunk.errors.
       
      This is the error I see on the primary for shard3rs:
       

      2019-12-16T09:39:51.860+0000 I SHARDING [conn47] about to log metadata event into changelog: { _id: "edaaf0746692:27017-2019-12-16T09:39:51.860+0000-5df750e7dc45e3a1a34c6889", server: "edaaf0746692:27017", shard: "shard3rs", clientAddr: "10.0.1.72:49758", time: new Date(1576489191860), what: "moveChunk.error", ns: "database.accounts.events", details: { min: { subscriberId: -1352160598807904125 }, max: { subscriberId: -1324388048193741545 }, from: "shard3rs", to: "shard2rs" } }
      2019-12-16T09:39:52.084+0000 W SHARDING [conn47] Chunk move failed :: caused by :: OperationFailed: Data transfer error: waiting for replication timed out

      On shard2rs, the shard the chunk was being moved to, I see the same:

      2019-12-16T09:39:51.831+0000 I SHARDING [Collection-Range-Deleter] Error when waiting for write concern after removing database.accounts.events range [{ subscriberId: -1352160598807904125 }, { subscriberId: -1324388048193741545 }) : waiting for replication timed out
      2019-12-16T09:39:51.831+0000 I SHARDING [Collection-Range-Deleter] Abandoning deletion of latest range in database.accounts.events after local deletions because of replication failure
      2019-12-16T09:39:51.831+0000 I SHARDING [migrateThread] waiting for replication timed out

       
      So it looks like the secondary on shard3rs can't keep up with the deletions and the moveChunk fails after a timeout as the replica set hasn't confirmed the deletions yet.
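
      One way to verify the lag is to check the secondary directly on shard3rs. On 4.2 the shell helper is still named rs.printSlaveReplicationInfo() (renamed in later versions); the rs.status() loop below is an equivalent manual check:

```javascript
// Run on the shard3rs primary: prints each secondary's lag
// behind the primary's latest oplog entry.
rs.printSlaveReplicationInfo()

// Or compare per-member optimes from rs.status():
rs.status().members.forEach(function (m) {
    print(m.name + "  " + m.stateStr + "  " + m.optimeDate);
});
```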
       
      From the first time a moveChunk.error occurs, the replica set gets out of sync and the replication lag just keeps growing without ever recovering. CPU usage rises to 100% as the replica set tries to keep up while the balancer continues executing moveChunk commands, which keep failing with the same error. This continues even after the balancer is stopped via sh.stopBalancer().
       
      In theory this shouldn't be happening.
       
      According to the documentation, the default _secondaryThrottle setting for WiredTiger on MongoDB 3.4 and later is false, so the migration process does not wait for replication to a secondary but continues immediately with the next document.
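
      For completeness, the throttle can also be disabled explicitly rather than relying on the default, following the documented config.settings pattern:

```javascript
// Run via mongos: pin _secondaryThrottle to false in the
// balancer settings document so no default is relied upon.
use config
db.settings.update(
    { "_id" : "balancer" },
    { $set : { "_secondaryThrottle" : false } },
    { upsert : true }
)
```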
       
      I can confirm that _secondaryThrottle is not set:
       
      use config
      db.settings.find({})

      { "_id" : "balancer", "mode" : "off", "stopped" : true }
      { "_id" : "chunksize", "value" : 16 }
      { "_id" : "autosplit", "enabled" : false }
      

      So why does the migration still fail with "waiting for replication timed out"?

      If necessary I can supply logs of the whole cluster to a secure upload. (Unsure if the Jira file attachment makes them publicly accessible.)

            Assignee:
            dmitry.agranat@mongodb.com Dmitry Agranat
            Reporter:
            jascha.brinkmann+mongodb@gmail.com Jascha Brinkmann
            Votes:
            0
            Watchers:
            7
