[SERVER-32696] All replica set nodes freeze during heavy update with enableMajorityReadConcern Created: 14/Jan/18  Updated: 30/Oct/23  Resolved: 19/Jan/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Michael Smith Assignee: Bruce Lucas (Inactive)
Resolution: Fixed Votes: 0
Labels: RF, SWNA
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 16.04


Operating System: ALL
Participants:

 Description   

I have a 3-node replica set running version 3.4.10 on Ubuntu 16.04.

I ran a schema update that touched all 7 million documents in a collection with a $set and a $rename. Because one of the secondaries is about 30ms away in Azure, I used majority write concern to slow down the update and make sure at least one of the secondaries would stay in sync.

The query started at 14:19:29. At that point the Azure secondary was probably 3-5 minutes behind because of earlier schema migrations. But by 14:27:00, the main secondary was unable to get results for oplog queries:

Jan 13 14:27:00 secondary mongod.27017[28273]: [replication-163] Restarting oplog query due to error: ExceededTimeLimit: Operation timed out, request was RemoteCommand 18482564 -- target:primary:27017 db:local expDate:2018-01-13T14:27:00.216+0000 cmd:{ getMore: 16483339842, collection: "oplog.rs", maxTimeMS: 5000, term: 25, lastKnownCommittedOpTime: { ts: Timestamp 1515853539000|8343, t: 25 } }. Last fetched optime (with hash): { ts: Timestamp 1515853555000|1852, t: 25 }[1175973526525408650]. Restarts remaining: 3

That's also the time the replica set stopped accepting connections from clients.

To get things running again I had to kill all three mongod processes (eventually with kill -9, because shutdown tends to hang while in this state).

After letting the nodes sync up, I was able to reproduce this again with the same query.

I can provide logs and the query privately if that would be useful.

Just guessing based on what I learned in SERVER-32398: maybe the primary froze up because it ran out of cache while waiting for the secondary to apply changes. But the update was running with majority write concern, so I would have thought the secondary couldn't have fallen far enough behind for that to occur.



 Comments   
Comment by Bruce Lucas (Inactive) [ 19/Jan/18 ]

Happy to have been able to help.

Comment by Michael Smith [ 19/Jan/18 ]

Thanks Bruce! Yes, this has been very helpful.

Mike

Comment by Bruce Lucas (Inactive) [ 18/Jan/18 ]

Hi Michael,

> I couldn't find any documentation about the freeze at 20% dirty data. Is there a parameter to configure this threshold? I'd rather use more cache. Although, if we're able to blow 20% in 17 seconds, perhaps it wouldn't help anyway.

This is not exposed as a normal mongod parameter, but it can be configured using the mongod wiredTigerEngineConfigString parameter to change the WiredTiger eviction_dirty_trigger setting from its default of 20. However, I think you are correct: based on the rate at which the lag was growing and the proportion of the required work that had been done, you would likely still hit the same problem, so I would suggest instead breaking the update into batches.
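
For reference, the config-file form of that parameter should be storage.wiredTiger.engineConfig.configString; a sketch of what raising the threshold could look like (the value 30 is only an illustration, not a recommendation):

    # mongod.conf (illustrative only; verify the option path for your version)
    storage:
      wiredTiger:
        engineConfig:
          configString: "eviction_dirty_trigger=30"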

> The oplog fetch stalling when the dirty cache threshold is reached seems like it could be a problem in other situations, too. There's no recovery from this without restarting all nodes, and no log indicating what's happening. This is a little scary - is it still a problem in 3.6?

In 3.6, operations may still temporarily stall under cache pressure in order to allow time for cache data to be evicted. However, they won't stall indefinitely: under extreme cache pressure (for example, when a large amount of data is pinned in cache, as it was in this case) WiredTiger writes some of the cache to a lookaside table on disk to make room in memory. I was able to reproduce the deadlock you encountered on 3.4, but observed that on 3.6, as expected, the lookaside table is used to relieve cache pressure rather than deadlocking.

> To break up an update that touches all documents in a collection, I suppose I could do a find() to get all the _ids, then iterate through each, using the bulk API to issue the updates with majority write concern. Are there better approaches?

Rather than iterating through the _ids and updating the documents one by one, you could fetch the _ids in sorted order, and then do the updates on ranges of _ids using $gte and $lt, keeping the batches small enough to avoid a large lag.
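
As a rough sketch of that approach (not code from this ticket: the connection string, database, collection, and field names are placeholders, and the batch size is arbitrary), with PyMongo it could look something like:

    from pymongo import MongoClient
    from pymongo.write_concern import WriteConcern

    client = MongoClient("mongodb://primary:27017/?replicaSet=rs0")
    # Collection handle that acknowledges writes only after a majority of nodes apply them.
    coll = client["mydb"].get_collection(
        "mycoll", write_concern=WriteConcern(w="majority"))

    BATCH = 1000  # small enough that each batch replicates quickly

    # Fetch only the _ids, in sorted order.
    ids = [doc["_id"] for doc in coll.find({}, {"_id": 1}).sort("_id", 1)]

    # Update disjoint _id ranges; each batch waits for majority acknowledgement
    # before the next one starts, keeping replication lag bounded.
    for i in range(0, len(ids), BATCH):
        chunk = ids[i:i + BATCH]
        coll.update_many(
            {"_id": {"$gte": chunk[0], "$lte": chunk[-1]}},
            {"$set": {"newField": "newValue"},
             "$rename": {"oldField": "renamedField"}})

The point of the collection-level w:majority here is that each small batch blocks until a majority of nodes have applied it, which is what keeps the lag, and therefore the dirty data pinned in cache, bounded.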

Please let me know if this addresses your concerns.

Bruce

Comment by Michael Smith [ 18/Jan/18 ]

Hi Bruce,

This is really interesting. Thanks a lot for digging in and providing this analysis. I really appreciate it.

I'm glad to hear the situation is improved in 3.6.x.

Unfortunately I'm waiting for Spring Data MongoDB 1.10.10 to be released before I can upgrade to 3.6.x; the current version doesn't use the cursor option for aggregations (https://jira.spring.io/browse/DATAMONGO-1824).

I also need majority read concern in a couple of places (not in this collection, though).

1. I couldn't find any documentation about the freeze at 20% dirty data. Is there a parameter to configure this threshold? I'd rather use more cache. Although, if we're able to blow 20% in 17 seconds, perhaps it wouldn't help anyway.
2. The oplog fetch stalling when the dirty cache threshold is reached seems like it could be a problem in other situations, too. There's no recovery from this without restarting all nodes, and no log indicating what's happening. This is a little scary - is it still a problem in 3.6?
3. To break up an update that touches all documents in a collection, I suppose I could do a find() to get all the _ids, then iterate through each, using the bulk API to issue the updates with majority write concern. Are there better approaches?

Comment by Bruce Lucas (Inactive) [ 17/Jan/18 ]

Hi Michael,

It looks like you're encountering an issue caused by a combination of enableMajorityReadConcern, heavy writes, and replication lag.

In the second incident (2018-01-13T16:14:48Z), for example, replication lag started at 0 but built up during the bulk update to about 17 seconds on one secondary and about 30 seconds on the other (Azure) secondary. This causes the replica set's majority commit point to lag by about 17 seconds, which in turn requires all nodes to keep 17 seconds' worth of updates in cache to satisfy potential read concern majority reads. Because of the high rate of updates, this exceeds the limit of 20% dirty data in cache, which causes operations on the primary to stall, including the oplog fetches serving the secondaries. At this point we're stuck: replication is stalled, so the majority commit point can't advance, so the cache stays above the dirty data limit, which keeps replication stalled.

The information you provided about the document size lets us do a sanity check: we were doing about 10k updates per second, each generating about 5 kB of dirty data for the collection and a similar amount for the oplog, so a replication lag of 17 seconds generates enough dirty data to reach the limit of 20% of cache.
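
Spelling that estimate out (the cache size itself isn't stated above, so the last figure is only the implied break-even point rather than your actual cache size):

    10,000 updates/s x ~5 kB x 2 (collection + oplog) ~= 100 MB/s of dirty data
    100 MB/s x 17 s of lag ~= 1.7 GB pinned dirty, i.e. past a 20% dirty limit
    for any WiredTiger cache of roughly 8.5 GB or smaller.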

I've taken the liberty of adjusting the ticket summary to reflect the relationship to enableMajorityReadConcern.

Here are some options to avoid this issue:

  • Remove the enableMajorityReadConcern option if you don't actually require majority read concern operations (see the config sketch after this list).
  • Throttle write operations to avoid replication lag. Using majority write concern as you are doing is the right approach; however, it won't help for a bulk update like this one because the write concern applies to the operation as a whole, not to the individual updates. You can throttle the bulk update by breaking it up into many smaller updates, each with majority write concern.
  • Upgrade to 3.6; the implementation of majority read concern is improved so that it should not get stuck due to a full cache.
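
Regarding the first option: enableMajorityReadConcern is a startup-time setting, so removing it means taking it off the mongod command line or out of the config file and restarting each node. Assuming the YAML form of the option (replication.enableMajorityReadConcern), the relevant line would look like:

    # mongod.conf (illustrative; remove this line or set it to false)
    replication:
      enableMajorityReadConcern: true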

Please let me know if any of these steps addresses your issue.

Thanks,
Bruce

Comment by Michael Smith [ 16/Jan/18 ]

Hi Bruce,

I've uploaded collection stats and a redacted document (post-update) to the portal.

	"size" : 41953688284,
	"count" : 7415254,
	"avgObjSize" : 5657,
	"storageSize" : 8824041472,

Thanks,
Mike

Comment by Bruce Lucas (Inactive) [ 16/Jan/18 ]

Hi Michael,

Thanks for the information. Can you also tell me more about the collection - I understand it has 7 million documents; what is the total size? Would it be possible to upload a typical document (redacting field values if necessary) to the same upload portal?

Thanks
Bruce

Comment by Michael Smith [ 15/Jan/18 ]

Thanks Bruce. Files are uploaded. All times UTC.

As a shortcut you could look at the query that was issued at 2018-01-13 16:14:48. At that point we were re-running only the failing query, so all 3 nodes were starting from a position of being in sync. Replication died around 16:17:56:

Jan 13 16:17:56 secondary mongod.27017[24014]: [replication-2] Restarting oplog query due to error: ExceededTimeLimit: Operation timed out, request was RemoteCommand 135328 -- target:primary:27017 db:local expDate:2018-01-13T16:17:56.043+0000 cmd:{ getMore: 16467793176, collection: "oplog.rs", maxTimeMS: 5000, term: 32, lastKnownCommittedOpTime: { ts: Timestamp 1515860194000|1741, t: 32 } }. Last fetched optime (with hash): { ts: Timestamp 1515860211000|553, t: 32 }[1537854950139136924]. Restarts remaining: 3

The earlier occurrence was at 14:19:29 but there had been some updates in the 10-12 minutes leading up to it that hadn't fully synced to the Azure replica.

Comment by Bruce Lucas (Inactive) [ 15/Jan/18 ]

Hi Michael,

Thanks for your offer to provide logs and query. Can you please upload the logs, together with an archive of the $dbpath/diagnostic.data directory, from each node of the cluster, to this secure upload portal. Note that there is some urgency to uploading the diagnostic.data information as it will age out in a few days. You can also upload a file containing the query to the same location.

Also, can you please clarify the timeline: what is the timezone of the Jan 13 times in the incident described in your initial report? And can you please provide the timeline (including date and timezone) of the second attempt that you mention, after letting the nodes sync up?

Thanks,
Bruce
