[SERVER-35339] Complete recovery failure after unclean shutdown Created: 01/Jun/18 Updated: 27/Dec/18 Resolved: 17/Nov/18
| Status: | Closed |
| Project: | Core Server |
| Component/s: | WiredTiger |
| Affects Version/s: | 3.6.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Marc Fletcher | Assignee: | Bruce Lucas (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | SWKB |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: |
| Issue Links: |
| Operating System: | ALL |
| Steps To Reproduce: | Under-provision the secondary so that it eventually runs out of memory and crashes. |
| Participants: |
| Description |
| Comments |
| Comment by Marc Fletcher [ 19/Jun/18 ] |
I've spent quite a bit of time looking at the application triggers for these issues. Currently we have a very large mono mongo sharded cluster for everything (v3.0.12) and are in the process of transferring independent loads to independent mongo replica sets. One of the additional motivations for doing this transfer is to be able to better understand the load profiles of each service and optimise accordingly (if this can also help here then it's a win/win).

As I mentioned in an earlier comment, we saw corresponding increases in DB load in the mono cluster at the same moment this new replica set was hitting 100% CPU usage and then crashing. We tracked this load increase back to a process that was occasionally creating large documents of time series data (a single array containing 100K+ entries) and then, in quick succession, $push-ing additional elements onto that array.

By reducing the allowed time range for each bucket in the time series, and thereby considerably reducing the maximum size of this array, these load issues appear to be completely resolved. We've had everything running for a few days now in this configuration and the CPU/memory loads are now very flat. We have also manually triggered a few ungraceful shutdowns by hard-killing instances in the replica set, and recovery succeeded each time.
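The bucketing change described above lends itself to a short illustration. The sketch below is not from the ticket: the collection name, field names, bucket granularity, and connection string are all assumptions. It only shows the general pattern of capping array growth by directing each $push into a bounded, time-windowed bucket document.

```python
# Minimal sketch of the time-bucketed $push pattern described above.
# Collection name, field names, and bucket granularity are hypothetical;
# the idea is to cap how large any single array can grow by starting a
# new bucket document per (series, time window).
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
coll = client["metrics"]["timeseries_buckets"]     # hypothetical namespace

def record_sample(series_id: str, value: float, bucket_minutes: int = 10) -> None:
    """Append a sample to the bucket covering the current time window."""
    now = datetime.now(timezone.utc)
    # Truncate the timestamp to the bucket boundary; a smaller window means
    # smaller arrays, which is the change that resolved the load issue above.
    bucket_start = now.replace(
        minute=(now.minute // bucket_minutes) * bucket_minutes,
        second=0,
        microsecond=0,
    )
    coll.update_one(
        {"series_id": series_id, "bucket_start": bucket_start},
        {
            "$push": {"samples": {"t": now, "v": value}},
            "$inc": {"count": 1},
        },
        upsert=True,
    )

record_sample("sensor-42", 3.14)
```

In this sketch, repeated calls only ever append to the current window's document, so no single array grows beyond one window's worth of samples, which is consistent with the trigger Bruce describes below (many small updates to large documents).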
| Comment by Bruce Lucas (Inactive) [ 14/Jun/18 ] |
Hi Marc,

We've taken another look at the data and it appears that you also hit

I've adjusted the linked cases to reflect that you hit both issues.

Regarding the first issue, we are interested in learning more about the application triggers for this. One trigger that we know about is an application that applies a lot of small updates to large documents; does that match your application?

Thanks,
| Comment by Dmitry Agranat [ 14/Jun/18 ] |
Hi MarcF,

I am going to resolve this ticket as a duplicate of

Thanks,
| Comment by Marc Fletcher [ 14/Jun/18 ] |
| Comment by Dmitry Agranat [ 10/Jun/18 ] |
Hi MarcF,

Thanks for clarifying that the increased CPU load on the 1 CPU instance was a result of the increased workload concurrency and is expected.

In regards to issue #2:

As you mentioned in your initial comment, the instance ran out of disk space. We can see this in the mongod log:

Once the instance ran out of disk space, you deleted the contents of the dbpath and performed the initial sync, which was successful. This is expected. However, the reason you ran out of disk space is most likely

As you've mentioned:

Increasing the memory size helps in this situation, but a fix will be a part of

Thanks,
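Since the practical mitigation here was more memory, one hedged way to keep an eye on this kind of pressure is to watch the WiredTiger cache counters in serverStatus. The snippet below is only an illustrative sketch, not part of the ticket; the connection string is an assumption, and the field names are the standard serverStatus cache statistics.

```python
# A minimal sketch of watching WiredTiger cache pressure with pymongo;
# the connection string is an assumption. Sustained used/dirty cache close
# to the configured maximum is the kind of memory pressure that a larger
# instance (and cache) relieves.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
status = client.admin.command("serverStatus")
cache = status["wiredTiger"]["cache"]

max_bytes = cache["maximum bytes configured"]
used_bytes = cache["bytes currently in the cache"]
dirty_bytes = cache["tracked dirty bytes in the cache"]

print(f"cache used:  {used_bytes / max_bytes:6.1%} of {max_bytes // 2**20} MiB")
print(f"cache dirty: {dirty_bytes / max_bytes:6.1%}")
```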
| Comment by Marc Fletcher [ 08/Jun/18 ] |
The CPU credits are never depleted, so I would never expect any throttling on these instances (they only get reset when we change instance type):

Having done some further investigation, I have found some evidence to suggest the burst in load is due to an increase in the number of DB writes being made (apologies for the misleading comment that "load on this cluster should typically remain constant"). Referring to the instance observed on June 7th at 07:30 UTC, there is a commensurate increase in the number of 'active writes':

Active writes go from 0 to 40-50, which possibly also matches the increased number of processes. Further evidence that the increased write load is expected is that we are concurrently writing the same data to another cluster, which is also showing increased load on the same collection at matching times (an older cluster that is currently doing everything in our infrastructure, and something we are trying to break down into use-case-specific clusters). This may suggest that issue #1 referred to above is a consequence of our use case, which just leaves issue #2 as the unexpected one.

At every event I continue to add further logging, so if issue #1 occurs again I will hopefully be able to show conclusively that it's expected.

Kind regards,

Marc
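For reference, the 'active writes' figure discussed above can be sampled directly from serverStatus. The sketch below is illustrative only (the connection string and sampling interval are assumptions); globalLock.activeClients.writers reports how many clients are performing write operations at the moment of the sample.

```python
# A minimal sketch (not part of the ticket) of sampling active readers/writers
# from serverStatus with pymongo.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection string

for _ in range(10):  # sample every 5 seconds for ~50 seconds
    active = client.admin.command("serverStatus")["globalLock"]["activeClients"]
    print(f"readers={active['readers']} writers={active['writers']}")
    time.sleep(5)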
| Comment by Dmitry Agranat [ 07/Jun/18 ] |
Hi MarcF,

Thank you for providing all the information, it was very useful.

In regards to event #1 (100% CPU): based on the provided information, these events are periodic and follow a very common pattern: there is a burst of 100% user CPU utilization (during the last event on June 7th this lasted ~30 minutes) correlated with an increase in the number of system processes running (from 1-2 to 50-60):

Since we do not have information about the CPU credits on your instance at this time, we do not know if your sudden spike in CPU usage is a result of being throttled by Amazon, a sudden burst of work that was previously throttled by Amazon, or an actual increase in the workload.

Could you provide some input on the bursts in the number of processes running? Is this expected based on your workload? If these bursts of running processes are unexpected, as an experiment, are you able to perform the same workload on fixed-performance instances with more than 2 vCPUs instead of t2 burstable instances?

We can look into the second concern once we understand the reason behind the reported CPU bursts.

Thanks,
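To answer the CPU credit question without relying on console screenshots, the balance can be pulled from CloudWatch. The following is a hedged sketch rather than anything from the ticket: the region, instance ID, and time window are hypothetical, and it simply fetches the CPUCreditBalance metric so a never-depleted balance can be confirmed.

```python
# A hedged sketch of pulling a t2 instance's CPUCreditBalance from CloudWatch
# with boto3; instance ID, region, and time window are hypothetical.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # assumed region
end = datetime.now(timezone.utc)
start = end - timedelta(hours=24)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUCreditBalance",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # hypothetical
    StartTime=start,
    EndTime=end,
    Period=300,
    Statistics=["Minimum"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Minimum"])
```

A balance that never reaches zero over the incident window would support the conclusion that the CPU spikes reflect real workload rather than credit throttling.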
| Comment by Marc Fletcher [ 07/Jun/18 ] |
Further to my update, I believe there are two unexpected things happening here:

With a view to providing further debug information, I increased the default primary (RS0-0) from a t2.medium to a t2.large instance last night. At the time of writing we have just experienced issue #1 again (between 07:25 and 08:35 UTC, see graph) and, because of the increased primary instance size, it didn't crash. I have included the additional information that may help determine the cause of issue #1:

Replica set metrics relating to the above logs:

EBS /var/mongodata (100GB gp2) metrics relating to the above logs (which don't show any changes associated with the high CPU and memory):

Please let me know if any further information might be useful.

Regards,

Marc
| Comment by Marc Fletcher [ 05/Jun/18 ] |
Many thanks for the quick response. Please find the requested information as follows. For reference, all instances start off as t2.medium. EBS volumes are not initialised from snapshots so I don't expect pre-warming to be an issue here.

Here are the top-level system stats across the replica set during the period of interest (starting 2018-06-03 14:00 UTC):

All of the requested data came to ~500MB so it has been uploaded to the secure storage:

diagnostic.data_RS0-0.zip - The diagnostic data from RS0-0

Timeline (all times UTC):

Sunday 3rd: RS0-0, RS0-1, RS0-2

Tuesday 5th: 16:20, 16:33, 16:47, 16:58, 17:02, 17:14, 17:23, 17:25
| Comment by Bruce Lucas (Inactive) [ 01/Jun/18 ] |
Hi Marc,

In order for us to diagnose this, can you please

from each node of a replica set that's recently experienced this issue. You can either attach them to this ticket if the files are less than 150 MB, or upload them to this secure private portal.

Also, please provide a timeline of the incident so we are sure to look at the right place in the data.

Thanks,
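As an aside, one way to bundle a node's diagnostic data and log for upload is sketched below. This is not part of the ticket; the dbpath and log path are assumptions and should be adjusted to each node's actual configuration.

```python
# A minimal sketch of bundling a node's diagnostic.data directory and mongod
# log into one archive for upload; paths are assumptions.
import tarfile
from pathlib import Path

DBPATH = Path("/var/mongodata")                 # assumed dbpath
LOGFILE = Path("/var/log/mongodb/mongod.log")   # assumed log path

with tarfile.open("rs0-node-diagnostics.tar.gz", "w:gz") as archive:
    archive.add(DBPATH / "diagnostic.data", arcname="diagnostic.data")
    if LOGFILE.exists():
        archive.add(LOGFILE, arcname=LOGFILE.name)
```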
| Comment by Marc Fletcher [ 01/Jun/18 ] |
Apologies, but log1 should be split on the noformat tags and the comment: