[SERVER-45163] Balancer fails with moveChunk.error 'waiting for replication timed out' despite _secondaryThrottle unset Created: 16/Dec/19  Updated: 20/Dec/19  Resolved: 19/Dec/19

Status: Closed
Project: Core Server
Component/s: Replication, Sharding
Affects Version/s: 4.2.2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Jascha Brinkmann Assignee: Dmitry Agranat
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File SERVER-45163.png    
Issue Links:
Duplicate
Related
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

Deploy a Sharded Cluster with at least 2 shards, deployed as a PSA Replica Set.

Begin filling a sharded collection with lots of data so that the balancer starts running.

After a short while, checking the sharding status via sh.status() will show errors similar to the one below:

Failed with error 'aborted', from shard3rs to shard2rs
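
A hedged way to list these failures directly is to query the config database's changelog collection (the same collection the moveChunk.error log lines later in this ticket are written into); the limit of 5 is arbitrary:

use config
// most recent moveChunk.error events, newest first
db.changelog.find({ what: "moveChunk.error" }).sort({ time: -1 }).limit(5).pretty()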

Participants:

 Description   

I have a MongoDB Deployment with 3 shards, all deployed as a PSA (Primary, Secondary, Arbiter) Replica Set.
 
The Cluster works fine as long as the balancer is stopped.
 
When I enable the balancer it successfully moves chunks for a short while but then begins failing with moveChunk.errors.
 
This is the error I see on the primary for shard3rs:
 

2019-12-16T09:39:51.860+0000 I SHARDING [conn47] about to log metadata event into changelog: { _id: "edaaf0746692:27017-2019-12-16T09:39:51.860+0000-5df750e7dc45e3a1a34c6889", server: "edaaf0746692:27017", shard: "shard3rs", clientAddr: "10.0.1.72:49758", time: new Date(1576489191860), what: "moveChunk.error", ns: "database.accounts.events", details: { min: { subscriberId: -1352160598807904125 }, max: { subscriberId: -1324388048193741545 }, from: "shard3rs", to: "shard2rs" } }
2019-12-16T09:39:52.084+0000 W SHARDING [conn47] Chunk move failed :: caused by :: OperationFailed: Data transfer error: waiting for replication timed out

On shard2rs, the shard the chunk was being moved to, I see the same:

2019-12-16T09:39:51.831+0000 I SHARDING [Collection-Range-Deleter] Error when waiting for write concern after removing database.accounts.events range [{ subscriberId: -1352160598807904125 }, { subscriberId: -1324388048193741545 }) : waiting for replication timed out
2019-12-16T09:39:51.831+0000 I SHARDING [Collection-Range-Deleter] Abandoning deletion of latest range in database.accounts.events after local deletions because of replication failure
2019-12-16T09:39:51.831+0000 I SHARDING [migrateThread] waiting for replication timed out

 
So it looks like the secondary on shard3rs can't keep up with the deletions and the moveChunk fails after a timeout as the replica set hasn't confirmed the deletions yet.
 
From the first time a moveChunk.error occurs, the replica set gets out of sync and the replication lag just keeps growing without ever recovering. The CPU rises to 100% as the replica set tries to keep up, while the balancer continues executing moveChunk commands which keep failing with the same error. This even happens after the balancer is stopped via sh.stopBalancer().
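
For reference, a hedged way to verify that the balancer really is stopped and that no balancing round is still in flight (standard mongo shell helpers, run via a mongos; note that a migration already in progress still runs to completion after the balancer is stopped):

sh.stopBalancer()
sh.getBalancerState()     // should return false once the balancer is disabled
sh.isBalancerRunning()    // should return false once the current round has finished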
 
In theory this shouldn't be happening.
 
According to the documentation, the default _secondaryThrottle setting for WiredTiger on MongoDB 3.4 and later is false, so that the migration process does not wait for replication to a secondary but continues immediately with the next document.
 
I can confirm that _secondaryThrottle is not set:
 
use config
db.settings.find({})

{ "_id" : "balancer", "mode" : "off", "stopped" : true }
{ "_id" : "chunksize", "value" : 16 }
{ "_id" : "autosplit", "enabled" : false }

So why does the migration still fail with an error of "waiting for replication timed out"?

If necessary I can supply logs of the whole cluster to a secure upload. (I'm unsure whether the Jira file attachment makes them publicly accessible.)



 Comments   
Comment by Jascha Brinkmann [ 20/Dec/19 ]

I have been running this for a couple of hours more. Result is the same:

  balancer:
        Currently enabled:  yes
        Currently running:  yes
        Collections with active migrations: 
                database.accounts.events started at Fri Dec 20 2019 13:55:48 GMT+0000 (UTC)
        Failed balancer rounds in last 5 attempts:  3
        Migration Results for the last 24 hours: 
                442 : Success
                1 : Failed with error 'aborted', from shard1rs to shard2rs
                360 : Failed with error 'aborted', from shard2rs to shard1rs
                335 : Failed with error 'aborted', from shard3rs to shard1rs
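
As an aside, a hedged sketch for breaking down recent migration outcomes directly from config.changelog; the field paths are taken from the log entries in this ticket, the 24-hour window mirrors the sh.status() summary above, and this is not necessarily the exact query sh.status() itself runs:

use config
db.changelog.aggregate([
    // moveChunk.* events from the last 24 hours
    { $match: { time: { $gte: new Date(Date.now() - 24 * 60 * 60 * 1000) },
                what: /^moveChunk/ } },
    // group by event type and source/destination shard
    { $group: { _id: { what: "$what", from: "$details.from", to: "$details.to" },
                count: { $sum: 1 } } },
    { $sort: { count: -1 } }
])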

Moving chunks works for a short while, but then all successive balancing operations fail after the first encountered moveChunk.error. The only remedy is to completely restart the whole cluster, after which it works again for a short while before entering the same vicious loop. The only difference from before is that I now see hundreds of these errors in the logs:

2019-12-20T13:39:22.649+0000 I  SHARDING [migrateThread] migrate failed: IllegalOperation: _transferMods failed in STEADY STATE:  :: caused by :: No active migrations were found for collection database.accounts.events
2019-12-20T13:39:22.649+0000 I  SHARDING [migrateThread] Abandoning in-migration of database.accounts.events range [{ subscriberId: 2562303366878155278 }, { subscriberId: 2571626787395687297 }); scheduling deletion of any documents already copied

CPU is once again at 100% and the cluster comes to a halt.

Comment by Jascha Brinkmann [ 20/Dec/19 ]

Hey Dmitry,

thanks for your input. We are still in the testing phase, so as I wrote in the community channel I've already tried different node sizes and still encountered the same issue.

After your suggestion I have tried once more, upgrading all cluster nodes to 4 CPUs and 8GB RAM. The cache size is now 14 times larger.

While the initial import into the sharded collection works fine and without any major problems, the balancer still has trouble moving chunks afterwards.

To test this I removed the other shards from the cluster via db.adminCommand({ removeShard: "shard2rs" }) in order to get the balancer to rebalance all chunks back to the first shard.
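
For reference, removeShard drains asynchronously, so a hedged way to track progress is to re-issue the same command and watch the returned state ("started", "ongoing", "completed") and remaining.chunks fields:

// re-run until "state" reports "completed"
db.adminCommand({ removeShard: "shard2rs" })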

After a short while I once again see a lot of moveChunk.errors (please note that these stats are for a brand-new cluster):

  balancer:
        Currently enabled:  yes
        Currently running:  yes
        Collections with active migrations: 
                database.accounts.events started at Fri Dec 20 2019 09:06:46 GMT+0000 (UTC)
        Failed balancer rounds in last 5 attempts:  0
        Migration Results for the last 24 hours: 
                173 : Success
                104 : Failed with error 'aborted', from shard3rs to shard1rs
                112 : Failed with error 'aborted', from shard2rs to shard1rs
                1 : Failed with error 'aborted', from shard1rs to shard2rs

This time I do see an error message on the first shard in addition to the moveChunk.errors in the other shards:

2019-12-20T07:36:06.994+0000 I  SHARDING [migrateThread] about to log metadata event into changelog: { _id: "f7727e09c92e:27017-2019-12-20T07:36:06.994+0000-5dfc79e60840ae25682e6765", server: "f7727e09c92e:27017", shard: "shard1rs", clientAddr: "", time: new Date(1576827366994), what: "moveChunk.to", ns: "database.accounts.events", details: { min: { subscriberId: -7164067322086218274 }, max: { subscriberId: -7126168440325733131 }, step 1 of 6: 16, step 2 of 6: 20893, step 3 of 6: 7742, step 4 of 6: 8, note: "aborted", errmsg: "aborted" } }
2019-12-20T07:36:07.016+0000 I  SHARDING [migrateThread] migrate failed: IllegalOperation: _transferMods failed in STEADY STATE:  :: caused by :: No active migrations were found for collection database.accounts.events

Only about 40% of the moveChunk commands seem to be successful. Is this the expected behaviour?

Comment by Dmitry Agranat [ 19/Dec/19 ]

Hi jascha.brinkmann+mongodb@gmail.com,

With 1 Core and 250 MB cache size per node, this cluster is under-provisioned. Please refer to our Production Notes to make sure your cluster is configured to sustain your workload.
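
For reference, a hedged way to confirm the configured WiredTiger cache size on each data-bearing node; the field name below is the standard serverStatus() one, but treat it as an assumption for this build:

// reports the configured cache size in bytes
db.serverStatus().wiredTiger.cache["maximum bytes configured"]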

As you can see below, chunks are being aborted (marked A, B, C and D) when your single core is pegged. This in turn creates a snowball effect of lagging secondaries, which adds more pressure on the cache, and so on.

I will go ahead and close this ticket as I do not see a MongoDB bug in this case. I also see you have already started a conversation in our community channel to discuss troubleshooting and tuning options.

Regards,
Dima

Comment by Jascha Brinkmann [ 18/Dec/19 ]

Hey Dima,

I have uploaded all the requested files.

As far as I am aware there is no diagnostic.data or db directory for mongoS nodes, or at least I wasn't able to locate one, so I have only uploaded the log files for the mongoS nodes. Please note that the cluster currently only has two mongoS nodes instead of three.

These are all nodes:

  • mongos-1
  • mongos-2
  • configrs-1
  • configrs-2
  • configrs-3
  • shard1rs-1
  • shard1rs-2
  • shard1rs-3
  • shard2rs-1
  • shard2rs-2
  • shard2rs-3
  • shard3rs-1
  • shard3rs-2
  • shard3rs-3

Please note as well that I have since wiped the data directory of one of the secondary instances of the replica sets (at least once) in order to trigger an initial sync and get the replication lag back down to zero, so the diagnostic.data for some of the secondary nodes might be incomplete. It looks like I was able to recover the previous contents, but I just wanted you to be aware.

Please let me know if you need anything else.

Thanks and best regards

Comment by Dmitry Agranat [ 18/Dec/19 ]

Hi jascha.brinkmann+mongodb@gmail.com,

Thank you for the provided information so far. In order to further investigate this issue, we'll need to collect some more data.

Please archive (tar or zip) the mongod.log files and the $dbpath/diagnostic.data directory (the contents are described here) and upload them to this support uploader location.

We will need this information from all shard members, 9 replica set nodes (PSS * 3) and 3 mongoS nodes. Per each node we'll need both mongod.log and diagnostic.data.

Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.

Thanks,
Dima

Comment by Jascha Brinkmann [ 17/Dec/19 ]

Shard 1:

shard1rs:PRIMARY> rs.conf()
{
	"_id" : "shard1rs",
	"version" : 3,
	"protocolVersion" : NumberLong(1),
	"writeConcernMajorityJournalDefault" : true,
	"members" : [
		{
			"_id" : 0,
			"host" : "database_shard-01-01:27017",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 2,
			"tags" : {
				
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 1
		},
		{
			"_id" : 1,
			"host" : "database_shard-01-02:27017",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 1,
			"tags" : {
				
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 1
		},
		{
			"_id" : 2,
			"host" : "database_shard-01-03:27017",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 1,
			"tags" : {
				
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 1
		}
	],
	"settings" : {
		"chainingAllowed" : true,
		"heartbeatIntervalMillis" : 2000,
		"heartbeatTimeoutSecs" : 10,
		"electionTimeoutMillis" : 10000,
		"catchUpTimeoutMillis" : -1,
		"catchUpTakeoverDelayMillis" : 30000,
		"getLastErrorModes" : {
			
		},
		"getLastErrorDefaults" : {
			"w" : 1,
			"wtimeout" : 0
		},
		"replicaSetId" : ObjectId("5df7bcdba8e92a74b2d2c03b")
	}
}

Shard 2:

shard2rs:PRIMARY> rs.conf()
{
	"_id" : "shard2rs",
	"version" : 3,
	"protocolVersion" : NumberLong(1),
	"writeConcernMajorityJournalDefault" : true,
	"members" : [
		{
			"_id" : 0,
			"host" : "database_shard-02-01:27017",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 2,
			"tags" : {
				
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 1
		},
		{
			"_id" : 1,
			"host" : "database_shard-02-02:27017",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 1,
			"tags" : {
				
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 1
		},
		{
			"_id" : 2,
			"host" : "database_shard-02-03:27017",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 1,
			"tags" : {
				
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 1
		}
	],
	"settings" : {
		"chainingAllowed" : true,
		"heartbeatIntervalMillis" : 2000,
		"heartbeatTimeoutSecs" : 10,
		"electionTimeoutMillis" : 10000,
		"catchUpTimeoutMillis" : -1,
		"catchUpTakeoverDelayMillis" : 30000,
		"getLastErrorModes" : {
			
		},
		"getLastErrorDefaults" : {
			"w" : 1,
			"wtimeout" : 0
		},
		"replicaSetId" : ObjectId("5df7bcfabadad6205d8ae6ca")
	}
}

Shard 3:

shard3rs:PRIMARY> rs.conf()
{
	"_id" : "shard3rs",
	"version" : 3,
	"protocolVersion" : NumberLong(1),
	"writeConcernMajorityJournalDefault" : true,
	"members" : [
		{
			"_id" : 0,
			"host" : "database_shard-03-01:27017",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 2,
			"tags" : {
				
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 1
		},
		{
			"_id" : 1,
			"host" : "database_shard-03-02:27017",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 1,
			"tags" : {
				
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 1
		},
		{
			"_id" : 2,
			"host" : "database_shard-03-03:27017",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 1,
			"tags" : {
				
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 1
		}
	],
	"settings" : {
		"chainingAllowed" : true,
		"heartbeatIntervalMillis" : 2000,
		"heartbeatTimeoutSecs" : 10,
		"electionTimeoutMillis" : 10000,
		"catchUpTimeoutMillis" : -1,
		"catchUpTakeoverDelayMillis" : 30000,
		"getLastErrorModes" : {
			
		},
		"getLastErrorDefaults" : {
			"w" : 1,
			"wtimeout" : 0
		},
		"replicaSetId" : ObjectId("5df7bd146221a173873fcc42")
	}
}

Comment by Dmitry Agranat [ 17/Dec/19 ]

Hi jascha.brinkmann+mongodb@gmail.com, thank you for the clarification and for providing more context. One last thing I'd like to check before collecting all the information: could you post the output of rs.conf() from each shard?

Thanks,
Dima

Comment by Jascha Brinkmann [ 17/Dec/19 ]

Hey Dmitry, thanks for your response. Sorry that I forgot to mention it, but I did have readConcern: majority disabled as advised in the documentation.

It seems that I was able to resolve the specific timeout error by restarting the whole cluster.

As I was still getting lots of moveChunk.errors, I removed the whole cluster and set it up once more, this time as a PSS (Primary, Secondary, Secondary) deployment, suspecting that taking advantage of flow control might resolve the balancer issues.

But I still encounter the same problem: hundreds of moveChunk.errors. This time the error doesn't seem to be due to timed-out replication, as there is always at least one member of each replica set up to date (even though one of the secondaries of each replica set always seems to have trouble keeping up), but due to something else.

        Migration Results for the last 24 hours: 
                41 : Success
                271 : Failed with error 'aborted', from shard3rs to shard2rs
                1 : Failed with error 'aborted', from shard1rs to shard2rs
                12 : Failed with error 'aborted', from shard3rs to shard1rs 
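
As an aside, since the PSS redeployment was meant to benefit from flow control, a hedged sketch for checking whether flow control is enabled and actually engaging on a shard primary; the enableFlowControl parameter and the serverStatus() flowControl section are 4.2 features and their exact field names are assumed here:

db.adminCommand({ getParameter: 1, enableFlowControl: 1 })
db.serverStatus().flowControl    // e.g. whether the primary is currently throttling writers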

Unfortunately, in the logs on shard3rs I only see a single entry for each failed moveChunk command:

2019-12-17T13:33:44.996+0000 I  SHARDING [conn987] about to log metadata event into changelog: { _id: "4c99ef4ea474:27017-2019-12-17T13:33:44.996+0000-5df8d93808ad73f454bcb1dc", server: "4c99ef4ea474:27017", shard: "shard3rs", clientAddr: "10.0.1.123:57712", time: new Date(1576589624996), what: "moveChunk.error", ns: "database.accounts.events", details: { min: { subscriberId: -282506604107036732 }, max: { subscriberId: -270313584911101356 }, from: "shard3rs", to: "shard2rs" } }

Only very occasionally (maybe once for every 100 failed moveChunk commands) do I see this:

2019-12-17T07:26:31.184+0000 I  SHARDING [conn987] about to log metadata event into changelog: { _id: "4c99ef4ea474:27017-2019-12-17T07:26:31.184+0000-5df8832708ad73f454b3880c", server: "4c99ef4ea474:27017", shard: "shard3rs", clientAddr: "10.0.1.123:57712", time: new Date(1576567591184), what: "moveChunk.error", ns: "database.accounts.events", details: { min: { subscriberId: -527165117499500881 }, max: { subscriberId: -513290454848528462 }, from: "shard3rs", to: "shard2rs" } }
2019-12-17T07:26:31.217+0000 W  SHARDING [conn987] Chunk move failed :: caused by :: ConflictingOperationInProgress: Unable to start new migration because this shard is currently receiving chunk [{ subscriberId: -527165117499500881 }, { subscriberId: -513290454848528462 }) for namespace database.accounts.events from shard3rs

But the second Chunk move failed error actually seems to be unrelated to the moveChunk.error right before it, as it already belongs to the next moveChunk command issued immediately afterwards. The two errors are only related in the sense that the shard still hasn't completely aborted the first failed moveChunk and is already trying to start it again, as you can see from the matching bounds { subscriberId: -527165117499500881 }, { subscriberId: -513290454848528462 }. That's also why I only see it occasionally: most of the time the balancer just continues trying to move the next chunk without problems until it fails again at the end.
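
A hedged sketch for correlating the two errors by pulling every changelog entry for the same chunk bounds; the field paths come from the log lines above, and the NumberLong wrapper assumes the shard key values are stored as 64-bit integers (drop it if they are plain numbers):

use config
db.changelog.find(
    { "details.min.subscriberId": NumberLong("-527165117499500881") },
    { time: 1, what: 1, "details.from": 1, "details.to": 1 }
).sort({ time: 1 })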

There are no errors within the shard2rs and shard1rs logs.

The constant aborting of chunk moves also seems to be the reason why my replica sets can't keep up with the constant applying and rolling back of changes:

shard1rs:PRIMARY> rs.printSlaveReplicationInfo()
source: database_shard-01-02:27017
	syncedTo: Tue Dec 17 2019 14:01:29 GMT+0000 (UTC)
	1 secs (0 hrs) behind the primary 
source: database_shard-01-03:27017
	syncedTo: Tue Dec 17 2019 10:59:47 GMT+0000 (UTC)
	10903 secs (3.03 hrs) behind the primary 
 
shard2rs:PRIMARY> rs.printSlaveReplicationInfo()
source: database_shard-02-02:27017
	syncedTo: Tue Dec 17 2019 10:14:49 GMT+0000 (UTC)
	14069 secs (3.91 hrs) behind the primary 
source: database_shard-02-03:27017
	syncedTo: Tue Dec 17 2019 14:09:17 GMT+0000 (UTC)
	1 secs (0 hrs) behind the primary
 
shard3rs:PRIMARY> rs.printSlaveReplicationInfo()
source: database_shard-03-02:27017
	syncedTo: Tue Dec 17 2019 01:40:27 GMT+0000 (UTC)
	44933 secs (12.48 hrs) behind the primary 
source: database_shard-03-03:27017
	syncedTo: Tue Dec 17 2019 14:09:19 GMT+0000 (UTC)
	1 secs (0 hrs) behind the primary
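
For reference, a hedged sketch that computes the same lag figures from rs.status() on each primary, using the standard name / stateStr / optimeDate member fields:

var s = rs.status();
var primary = s.members.filter(function (m) { return m.stateStr === "PRIMARY"; })[0];
s.members.forEach(function (m) {
    if (m.stateStr === "SECONDARY") {
        // optimeDate values are Dates, so the difference is in milliseconds
        print(m.name + ": " + (primary.optimeDate - m.optimeDate) / 1000 + " secs behind the primary");
    }
});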

Happy to share the logs, just let me know if it is safe to upload them here.

Comment by Dmitry Agranat [ 17/Dec/19 ]

Hi jascha.brinkmann+mongodb@gmail.com,

In your comment, you've mentioned this replica set uses arbiters, and (I assume) readConcern: majority is enabled. This is not a recommended configuration. Please see our documentation for how to address this issue.

Once this is addressed and you still have the same issue, please let us know and we'll be happy to take a look.

Thanks,
Dima
