[SERVER-32696] All replica set nodes freeze during heavy update with enableMajorityReadConcern Created: 14/Jan/18 Updated: 30/Oct/23 Resolved: 19/Jan/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Michael Smith | Assignee: | Bruce Lucas (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | RF, SWNA | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Ubuntu 16.04 |
| Operating System: | ALL |
| Participants: |
| Description |
|
I have a 3-node replica set running version 3.4.10 on Ubuntu 16.04. I ran a schema update that touched all 7 million documents of a collection with a $set and a $rename. Because one of the secondaries is about 30 ms away in Azure, I used majority write concern to slow down the update and make sure at least one of the secondaries would stay in sync. The query started at 14:19:29. At that point the Azure secondary was probably 3-5 minutes behind because of earlier schema migrations, but by 14:27:00 the main secondary was unable to get results for oplog queries:
That's also the time the replica set stopped accepting connections from clients. To get things running again I had to kill all three mongod processes (and then kill -9 because the shutdown tends to hang while in this state). After letting the nodes sync up, I was able to reproduce this again with the same query. I can provide logs and the query privately if that would be useful. Just guessing based on what I learned in |
| Comments |
| Comment by Bruce Lucas (Inactive) [ 19/Jan/18 ] | ||||
|
Happy to have been able to help. | ||||
| Comment by Michael Smith [ 19/Jan/18 ] | ||||
|
Thanks Bruce! Yes, this has been very helpful. Mike | ||||
| Comment by Bruce Lucas (Inactive) [ 18/Jan/18 ] | ||||
|
Hi Michael,
This threshold is not exposed as a normal mongod parameter, but it can be adjusted using the mongod wiredTigerEngineConfigString parameter to change the WiredTiger eviction_dirty_trigger setting from its default of 20 (a percentage of cache). However, I think you are correct: based on the rate at which lag was growing and the proportion of the required work that had been done, you would likely still hit the same problem, so I would suggest instead breaking the update into batches.
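For example, such a change could look like the following at mongod startup (the value 30 is purely illustrative, not a recommendation):

    # Illustrative only: raise the WiredTiger dirty-eviction trigger
    # from its default of 20 (percent of cache); other mongod options omitted.
    mongod --wiredTigerEngineConfigString="eviction_dirty_trigger=30"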
In 3.6, operations may still temporarily stall under cache pressure in order to allow time for cache data to be evicted. However, they won't stall indefinitely: under extreme cache pressure (for example, when a large amount of data is pinned in cache, as it was in this case), WiredTiger writes some of the cache to a lookaside table on disk to make room in memory. I was able to reproduce the deadlock you encountered on 3.4, but observed that 3.6, as expected, uses the cache lookaside table to relieve cache pressure rather than deadlocking.
Rather than iterating through the _ids and updating the documents one by one, you could fetch the _ids in sorted order, and then do the updates on ranges of _ids using $gte and $lt, keeping the batches small enough to avoid a large lag. Please let me know if this addresses your concerns. Bruce | ||||
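A rough sketch of the batched-range approach described above (pymongo is used purely for illustration; the database, collection, and field names are placeholders, and the batch size is arbitrary):

    # Sketch only: update documents in sorted _id ranges rather than all at once,
    # so that each majority-acknowledged batch keeps replication lag small.
    from pymongo import MongoClient
    from pymongo.write_concern import WriteConcern

    client = MongoClient("mongodb://host1,host2,host3/?replicaSet=rs0")
    coll = client["mydb"].get_collection(
        "mycoll", write_concern=WriteConcern(w="majority"))

    BATCH = 10_000  # tune so each batch replicates comfortably before the next starts

    # Fetch only the _ids, in sorted order (for a very large collection,
    # stream these rather than materializing the whole list).
    ids = [doc["_id"] for doc in coll.find({}, {"_id": 1}).sort("_id", 1)]

    # Batch boundaries: every BATCH-th _id, with None marking the open-ended tail.
    bounds = ids[::BATCH] + [None]
    for lo, hi in zip(bounds, bounds[1:]):
        query = {"_id": {"$gte": lo}} if hi is None else {"_id": {"$gte": lo, "$lt": hi}}
        coll.update_many(
            query,
            {"$set": {"newField": "value"},       # placeholder schema change
             "$rename": {"oldName": "newName"}})  # placeholder rename

Because each batch is acknowledged by a majority before the next one starts, replication lag, and with it the amount of dirty data pinned in cache, stays bounded.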
| Comment by Michael Smith [ 18/Jan/18 ] | ||||
|
Hi Bruce, This is really interesting. Thanks a lot for digging in and providing this analysis; I really appreciate it. I'm glad to hear the situation is improved in 3.6.x. Unfortunately I'm waiting for Spring Data MongoDB 1.10.10 to be released before I can upgrade to 3.6.x; the current version doesn't use the cursor option for aggregations (https://jira.spring.io/browse/DATAMONGO-1824). I also need majority read concern in a couple of places (not in this collection, though). One question: I couldn't find any documentation about the freeze at 20% dirty data. Is there a parameter to configure this threshold? I'd rather use more cache, although if we're able to blow through 20% in 17 seconds, perhaps it wouldn't help anyway. | ||||
| Comment by Bruce Lucas (Inactive) [ 17/Jan/18 ] | ||||
|
Hi Michael,
It looks like you're encountering an issue caused by a combination of enableMajorityReadConcern, heavy writes, and replication lag. In the second incident (2018-01-13T16:14:48Z), for example, replication lag started at 0, but during the bulk update it built to about 17 seconds on one secondary and about 30 seconds on the other (Azure) secondary. This causes the replica set majority commit point to lag by about 17 seconds, which in turn requires all nodes to keep 17 seconds' worth of updates in cache to satisfy potential read concern majority reads. Because of the high rate of updates, this exceeds the limit of 20% dirty data in cache, which causes operations to stall on the primary, including the oplog fetch operation. At this point we're stuck: the majority commit point can't advance because replication is stalled, so the cache remains above the dirty data limit, which in turn prevents replication from proceeding.
The information you provided about the document size allows us to do a sanity check (a back-of-the-envelope version of this arithmetic is sketched just after this comment): we were doing about 10k updates per second, each generating about 5 kB of dirty data for the collection and a similar amount for the oplog, so a replication lag of 17 seconds generates enough dirty data to reach the limit of 20% of cache.
I've taken the liberty of adjusting the ticket summary to reflect the relationship to enableMajorityReadConcern.
Here are some options to avoid this issue:
Please let me know if any of these steps addresses your issue. Thanks, | ||||
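For reference, a back-of-the-envelope version of the arithmetic above (the cache size used here is an assumption for illustration; it is not stated in the ticket):

    # Rough sanity check of the dirty-data estimate described above.
    updates_per_sec = 10_000
    dirty_per_update = 5 * 1024 * 2   # ~5 kB for the collection + ~5 kB for the oplog
    lag_seconds = 17

    pinned_dirty = updates_per_sec * dirty_per_update * lag_seconds
    print(pinned_dirty / 1e9)         # ~1.7 GB of dirty data pinned in cache

    assumed_cache = 8e9               # hypothetical ~8 GB WiredTiger cache (assumption)
    print(pinned_dirty / assumed_cache)  # ~0.22, i.e. past the 20% dirty-data trigger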
| Comment by Michael Smith [ 16/Jan/18 ] | ||||
|
Hi Bruce, I've uploaded collection stats and a redacted document (post-update) to the portal.
Thanks, | ||||
| Comment by Bruce Lucas (Inactive) [ 16/Jan/18 ] | ||||
|
Hi Michael, Thanks for the information. Can you also tell me more about the collection? I understand it has 7 million documents; what is the total size? Would it be possible to upload a typical document (redacting field values if necessary) to the same upload portal? Thanks | ||||
| Comment by Michael Smith [ 15/Jan/18 ] | ||||
|
Thanks Bruce. Files are uploaded. All times UTC. As a shortcut you could look at the query that was issued at 2018-01-13 16:14:48. At that point we were re-running only the failing query, so all 3 nodes were starting off from a position of being in sync. Replication died around 16:17:56:
The earlier occurrence was at 14:19:29 but there had been some updates in the 10-12 minutes leading up to it that hadn't fully synced to the Azure replica. | ||||
| Comment by Bruce Lucas (Inactive) [ 15/Jan/18 ] | ||||
|
Hi Michael, Thanks for your offer to provide the logs and query. Can you please upload the logs, together with an archive of the $dbpath/diagnostic.data directory from each node of the cluster, to this secure upload portal? Note that there is some urgency to uploading the diagnostic.data information, as it will age out in a few days. You can also upload a file containing the query to the same location. Can you also clarify the timeline: what is the timezone of the times on Jan 13 for the incident you mention in your initial report? And can you provide the timeline (including date and timezone) of the second attempt you mention after letting the nodes sync up? Thanks, |