[SERVER-35339] Complete recovery failure after unclean shutdown Created: 01/Jun/18  Updated: 27/Dec/18  Resolved: 17/Nov/18

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.6.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Marc Fletcher Assignee: Bruce Lucas (Inactive)
Resolution: Duplicate Votes: 0
Labels: SWKB
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File CPU.png     PNG File image-2018-06-07-09-53-02-151.png     PNG File image-2018-06-07-09-55-51-824.png     PNG File image-2018-06-08-17-14-04-212.png     Text File log1.txt     Text File log2.txt     Text File log3.txt     JPEG File top_level_replicaset_stats.JPG    
Issue Links:
Duplicate
duplicates SERVER-34938 Secondary slowdown or hang due to con... Closed
duplicates SERVER-34941 Add testing to cover cases where time... Closed
Related
is related to SERVER-36495 Cache pressure issues during recovery... Closed
Operating System: ALL
Steps To Reproduce:

Under-provision the secondary so that it eventually runs out of memory and crashes.

Participants:

 Description   

Environment:

  • A three-server replica set on AWS (t2.medium instances) running 3.6.5.
  • All writes employ MAJORITY write concern (a minimal example is sketched after this list).
  • Default journaling is enabled.
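
For illustration only, a minimal mongo shell sketch of a write under this configuration (the collection and field names are hypothetical, not taken from the actual application):

// Hypothetical insert using MAJORITY write concern, matching the setup above.
db.metrics.insertOne(
    { sensorId: "sensor-1", ts: new Date(), value: 42 },
    { writeConcern: { w: "majority", wtimeout: 5000 } }
);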

Expected behaviour:

That the supported recovery methods return the instance to health.

Observed Behaviour:

After an unclean shutdown a secondary never recovers on its own, never making it past the final step in the attached sample log, see log1.txt.

Clearly the instance had run out of disk space at this point (100GB provisioned for a database that is normally 1.6GB). Here are the contents of the /var/mongodata folder, see log2.txt.
 
So it appears the culprit is entirely the WiredTigerLAS.wt file.

For additional information the df output at this point:

Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        7.9G   60K  7.9G   1% /dev
tmpfs           7.9G     0  7.9G   0% /dev/shm
/dev/xvda1       20G  2.8G   17G  14% /
/dev/xvdi       100G  100G  140K 100% /var/mongodata

The only option to recover this instance is to do a full resync (after deleting the contents of /var/mongodata), see log3.txt.
 
The initial sync currently takes less than 60 seconds, but this will obviously not be suitable once the data set grows.



 Comments   
Comment by Marc Fletcher [ 19/Jun/18 ]

I've spent quite a bit of time looking at the application triggers for these issues. Currently we have a very large monolithic mongo sharded cluster for everything (v3.0.12) and are in the process of transferring independent loads to independent mongo replica sets. One of the additional motivations for this transfer is to better understand the load profile of each service and optimise accordingly (if this can also help here then it's a win/win).

As I mentioned in an earlier comment, we saw corresponding increases in DB load in the monolithic cluster at the same moment this new replica set was hitting 100% CPU usage and then crashing. We tracked this load increase back to a process that was occasionally creating large documents of time-series data (a single array containing 100K+ entries) and then, in quick succession, $push(ing) additional elements onto that array. By reducing the allowed time range for each bucket in the time series, and thereby considerably reducing the maximum size of this array, these load issues appear to be completely resolved. We've had everything running for a few days now in this configuration and the CPU/memory loads are very flat. We have also manually triggered a few ungraceful shutdowns by hard-killing instances in the replica set, and successful recovery has been achieved each time.
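
For illustration only, a minimal mongo shell sketch of the pattern described above and of the mitigation (collection, field, and bucket values are hypothetical, not the actual application code):

// Original pattern (hypothetical names): one large time-series document per
// wide bucket, with $push repeatedly appending to an array that can exceed
// 100K entries, so each update rewrites a very large document.
db.timeseries.updateOne(
    { sensorId: "sensor-1", bucket: "2018-06-07" },       // day-sized bucket
    { $push: { samples: { ts: new Date(), value: 42 } } },
    { upsert: true }
);

// Mitigation described above: shrink each bucket's time range (e.g. to an
// hour) so the array, and therefore the document being updated, stays small.
db.timeseries.updateOne(
    { sensorId: "sensor-1", bucket: "2018-06-07T07" },    // hour-sized bucket
    { $push: { samples: { ts: new Date(), value: 42 } } },
    { upsert: true }
);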

Comment by Bruce Lucas (Inactive) [ 14/Jun/18 ]

Hi Marc,

We've taken another look at the data and it appears that you also hit SERVER-34938. Specifically

  • After the server restart at 06-05 16:17, the node transitioned to secondary and attempted to catch up from the considerable lag that had built up, and during this process you hit SERVER-34938. This eventually resulted in a crash due to running out of memory.
  • After the subsequent server restart at 06-05 16:47, the server began recovery oplog application to re-establish a locally consistent state of the db (which is distinct from using the oplog to catch up a lagged secondary), and you encountered SERVER-34941.

I've adjusted the linked tickets to reflect that you hit both issues.

Regarding the first issue, we are interested in learning more about the application triggers for it. One trigger that we know about is an application that applies a lot of small updates to large documents; does that match your application?

Thanks,
Bruce

Comment by Dmitry Agranat [ 14/Jun/18 ]

Hi MarcF,

I am going to resolve this ticket as a duplicate of SERVER-34941, you can watch SERVER-34941 for updates.

Thanks,
Dima

Comment by Marc Fletcher [ 14/Jun/18 ]

SERVER-34941 sounds like it might be related. The problem appears to be fairly critical in my case though, as it's not just 'running out of disk space'. If the entire database is 1GB (and can be resynced in less than 60 seconds) and after consuming 100GB it has still not managed to restore itself, this sounds like an infinite loop that will never resolve regardless of disk space.

Comment by Dmitry Agranat [ 10/Jun/18 ]

Hi MarcF,

Thanks for clarifying that the increased CPU load on the 1 CPU instance was a result of the increased workload concurrency and that this is expected.

In regards to issue #2:

The inability for mongo to recover after the failover.

As you mentioned in your initial comment, the instance ran out of disk space. We can see this in the mongod log:

2018-05-30T19:39:55.036+0000 E STORAGE  [thread1] WiredTiger error (28) [1527709195:36383][4246:0x7f5fe8569700], eviction-server: cache eviction thread error: No space left on device

Once the Instance ran out of disk space, you deleted the contents of the dbpath and performed the initial sync which was successful. This is expected.

However, the reason you ran out of disk space is most likely SERVER-34941, where a secondary gets stuck with a full cache. One of the effects of SERVER-34941 is a lot of lookaside table activity and growth of the WiredTigerLAS.wt file.
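
As a rough way to watch for this condition, the WiredTiger section of serverStatus can be inspected from the mongo shell; a minimal sketch (exact statistic names vary by version, so this simply filters for anything mentioning the lookaside table alongside the headline cache figures):

// Print WiredTiger cache statistics whose names mention the lookaside table,
// plus overall cache sizing, to spot WiredTigerLAS.wt growth early.
var cache = db.serverStatus().wiredTiger.cache;
Object.keys(cache).forEach(function (k) {
    if (/lookaside/i.test(k) || /bytes currently in the cache|maximum bytes configured/.test(k)) {
        print(k + ": " + cache[k]);
    }
});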

As you've mentioned:

and because of the increased primary instance size it didn't crash

Increasing the memory size helps in this situation but a fix will be a part of SERVER-34941.

Thanks,
Dima

Comment by Marc Fletcher [ 08/Jun/18 ]

The CPU credits are never depleted so I would never expect any throttling on these instances (they only get reset when we change instance type):

Having done some further investigation, I have found some evidence to suggest the burst in load is due to an increase in the number of DB writes being made (apologies for the misleading comment that "load on this cluster should typically remain constant"). Referring to the incident observed June 7th 07:30 UTC, there is a commensurate increase in the number of 'active writes':

insert query update delete getmore command dirty  used flushes vsize   res qrw  arw net_in net_out conn set repl                time
    *0    *0     *0     *0       0     8|0  0.0% 38.3%       0 3.36G 1.53G 0|0  1|0   783b   71.1k  116 rs0  PRI Jun  7 07:30:56.487
    *0    *0     *0     *0       0     8|0  0.0% 38.3%       0 3.36G 1.53G 0|0  1|0   726b   70.3k  116 rs0  PRI Jun  7 07:30:57.487
    *0    *0     *0     *0       0     4|0  0.0% 38.3%       0 3.36G 1.53G 0|0  1|0   547b   67.1k  116 rs0  PRI Jun  7 07:30:58.488
    *0    *0     *0     *0       0    14|0  0.0% 38.3%       0 3.36G 1.53G 0|0  1|0  1.34k   72.2k  116 rs0  PRI Jun  7 07:30:59.487
    *0    *0     *0     *0       1    11|0  0.0% 38.3%       0 3.36G 1.53G 0|0  1|0  2.68k   72.4k  116 rs0  PRI Jun  7 07:31:00.487
    *0    *0     *0     *0       0     3|0  0.0% 38.3%       0 3.36G 1.53G 0|0  1|0   493b   66.9k  116 rs0  PRI Jun  7 07:31:01.488
    *0    *0     *0     *0       0     6|0  0.0% 38.3%       0 3.36G 1.53G 0|0  1|0   610b   68.7k  116 rs0  PRI Jun  7 07:31:02.487
    *0    *0     *0     *0       0     8|0  0.0% 38.3%       0 3.36G 1.53G 0|0  1|0   726b   70.2k  116 rs0  PRI Jun  7 07:31:03.487
    *0    *0     *0     *0       0     5|0  0.0% 38.3%       0 3.36G 1.53G 0|0  1|0   609b   68.6k  116 rs0  PRI Jun  7 07:31:04.488
    *0    *0     *0     *0       1     7|0  0.0% 38.3%       0 3.36G 1.53G 0|0  1|0  1.75k   69.6k  116 rs0  PRI Jun  7 07:31:05.487
insert query update delete getmore command dirty  used flushes vsize   res qrw  arw net_in net_out conn set repl                time
    *0    *0     *0     *0       0     8|0  0.0% 38.3%       0 3.36G 1.53G 0|0  1|0   726b   70.3k  116 rs0  PRI Jun  7 07:31:06.487
    *0    *0     *0     *0       1    15|0  0.0% 38.3%       0 3.36G 1.53G 0|0  1|0  4.97k   72.7k  116 rs0  PRI Jun  7 07:31:07.496
    *0    *0     *0     *0       0    10|0  0.0% 38.3%       0 3.36G 1.53G 0|0  1|0  1.12k   70.1k  116 rs0  PRI Jun  7 07:31:08.488
    *0    *0     *0     *0       0    10|0  0.0% 38.3%       0 3.36G 1.53G 0|0  1|0   842b   71.9k  116 rs0  PRI Jun  7 07:31:09.487
    *0    *0     *0     *0       0     8|0  0.0% 38.3%       0 3.36G 1.53G 0|0  1|0   726b   70.2k  116 rs0  PRI Jun  7 07:31:10.487
    *0    *0     72     *0      11    46|0  0.7% 38.1%       0 3.80G 2.05G 0|0 2|42  47.8k    106k  121 rs0  PRI Jun  7 07:31:11.623
    *0    *0     25     *0      39    83|0  1.2% 38.2%       0 3.88G 2.14G 0|0 1|41  89.5k    154k  121 rs0  PRI Jun  7 07:31:12.487
    *0    *0     21     *0      29    75|0  1.3% 38.3%       0 3.88G 2.14G 0|0 1|42  69.0k    126k  122 rs0  PRI Jun  7 07:31:13.487
    *0    *0     18     *0      25    62|0  1.6% 38.3%       0 3.88G 2.14G 0|0 1|43  59.6k    116k  122 rs0  PRI Jun  7 07:31:14.488
    *0    *0     16     *0      20    46|0  1.8% 38.3%       0 3.89G 2.18G 0|0 1|44  46.1k    104k  122 rs0  PRI Jun  7 07:31:15.487

Active writes go from 0 to 40-50, which possibly also matches the increase in the number of running processes.
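
For reference, the 'arw' column in the mongostat output above corresponds to active readers/writers; a hedged mongo shell sketch of reading the same counters directly from serverStatus:

// Active and queued readers/writers, corresponding roughly to mongostat's
// arw and qrw columns.
var gl = db.serverStatus().globalLock;
print("active r/w: " + gl.activeClients.readers + "|" + gl.activeClients.writers);
print("queued r/w: " + gl.currentQueue.readers + "|" + gl.currentQueue.writers);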

Further evidence that the increased write load is expected: we are concurrently writing the same data to another cluster (an older one that currently handles everything in our infrastructure, which we are trying to break down into use-case-specific clusters), and it also shows increased load on the same collection at matching times.

This may suggest that issue #1 referred to above is a consequence of our use case, which leaves issue #2 as the only unexpected one.

With every event I continue to add further logging, so if issue #1 occurs again I will hopefully be able to show conclusively that it is expected.

Kind regards,

Marc

Comment by Dmitry Agranat [ 07/Jun/18 ]

Hi MarcF,

Thank you for providing all the information, it was very useful.

In regards to event #1 (100% CPU): based on the provided information, these events are periodic and share a very common pattern: a burst of 100% user CPU utilization (during the last event on June 7th this lasted ~30 minutes) correlated with an increase in the number of running processes (from 1-2 to 50-60):

Since we do not have information about the CPU credits on your instance at this time, we do not know whether your sudden spike in CPU usage is a result of being throttled by Amazon, a sudden burst of work that was previously throttled by Amazon, or an actual increase in the workload.

the load on this cluster should typically remain constant

Could you provide some input regarding the bursts in the number of running processes? Are they expected based on your workload?

If these bursts in the number of running processes are unexpected, as an experiment, are you able to run the same workload on fixed-performance instances with more than 2 vCPUs instead of t2 burstable instances?

Further to my update I believe there are two unexpected things happening here:

1) The high CPU and memory usage that causes the primary failover (the load on this cluster should typically remain constant).
2) The inability for mongo to recover after the failover.

We can look into the second concern once we understand the reason behind the reported CPU bursts.

Thanks,
Dima

Comment by Marc Fletcher [ 07/Jun/18 ]

Further to my update I believe there are two unexpected things happening here:

  1. The high CPU and memory usage that causes the primary failover (the load on this cluster should typically remain constant).
  2. The inability for mongo to recover after the failover.

With a view to providing further debug information, I increased the default primary (RS0-0) from a t2.medium to a t2.large instance last night. At the time of writing we have just experienced issue #1 again (between 07:25 and 08:35 UTC, see graph), and because of the increased primary instance size it didn't crash. I have included additional information that may help determine the cause of issue #1:

  • mongod.log_RS0-0_2018-06-07.zip (secure file store) - The mongod.log file just after the high CPU period.
  • diagnostic.data_RS0-0_2018-06-07.zip (secure file store) - The diagnostic data to match the mongod.log file.

Replica set metrics relating to the above logs:

EBS /var/mongodata (100GB gp2) metrics relating to the above logs (which don't show any changes associated with the high CPU and memory usage):

Please let me know if any further information might be useful.

 

Regards,

Marc

Comment by Marc Fletcher [ 05/Jun/18 ]

Many thanks for the quick response. Please find the requested information as follows.

For reference, all instances start off as t2.medium. EBS volumes are not initialised from snapshots, so I don't expect prewarming to be an issue here.

Here are the top level system stats across the replica set during the period of interest (starting 2018-06-03 14:00 UTC):

All of the requested data came to ~500MB and so has been uploaded to the secure storage:

diagnostic.data_RS0-0.zip - The diagnostic data from RS0-0
diagnostic.data_RS0-1.zip - The diagnostic data from RS0-1
diagnostic.data_RS0-2.zip - The diagnostic data from RS0-2
mongod.log_RS0-0.zip - The mongod.log file for RS0-0. The full timespan discussed in the timeline was enclosed by a single file.
mongod.log_RS0-1.zip - The mongod.log file for RS0-1. The full timespan discussed in the timeline was enclosed by a single file.
mongod.log_RS0-2.zip - The mongod.log files for RS0-2. The full timespan discussed in the timeline required multiple files to be enclosed in a single zip file.
df_RS0-0_at_2018-06-05_17-19-UTC.log - The file contents of /var/mongodata on RS0-0 at 17:19 UTC 5th June.
mongostat_RS0-0.log - The mongostat log covering the period on Sunday 3rd.
mongotop_RS0-0.log - The mongotop log covering the period on Sunday 3rd.
rs.status_at_2018-06-05_16-20-UTC.log - The rs.status output at 16:20 UTC 5th June.

Timeline:

All times are UTC.

Sunday 3rd

RS0-0
21:16 - CPU usage hits 100%
21:31 - RES memory usage jumps from a steady 1.6G to 2.9G (see attached mongostat_RS0-0.log)
22:20 - CPU drops to 0 as the instance crashes.

RS0-1
22:20 - CPU bumps up to 25% around the time RS0-0 crashes, presumably as this instance takes over as primary.

RS0-2
CPU remains flat throughout; I suspect this is because the replica set goes read-only when RS0-1 subsequently crashes, so no further writes occur.

Tuesday 5th

16:20
Mongo applications manually restarted on RS0-0 and RS0-1.
RS0-1 becomes primary and is in sync with RS0-2 immediately. For reference RS0-0 has a higher member priority.
RS0-0 is 41.96 hours behind the primary and is not syncing, even though the oplog contains 83 hours of data.
See attached rs.status_at_2018-06-05_16-20-UTC.log attachment taken from RS0-1.
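
For reference, a sketch of how this lag can be checked from the mongo shell (not necessarily the exact commands used at the time):

// Replication lag and oplog window, viewed from any member.
rs.printSlaveReplicationInfo();   // per-secondary lag behind the primary
rs.printReplicationInfo();        // oplog size and the time range it covers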

16:33
RS0-0 crashes due to a memory allocation error, see logs.

16:47
For the purposes of recreating the high disk space usage issue I described initially, I increased RS0-0 to a t2.xlarge instance.
RS0-0 /var/mongodata size is 7.3G

16:58
RS0-0 /var/mongodata size is 16.0G

17:02
RS0-0 /var/mongodata is 24.0G

17:14
RS0-0 /var/mongodata is 32G and continuing to grow (see attached df_RS0-0_at_2018-06-05_17-19-UTC.log for file sizes)
I attempted to shut down the mongod process on RS0-0; after waiting 4 minutes I hard-killed it using kill -9.
Deleted all contents of /var/mongodata (after taking a copy of the diagnostic data) and restarted the instance as t2.medium so that it performs an initial sync.

17:23
RS0-0 is restarted as t2.medium and begins an initial sync

17:25
RS0-0 has completed the initial sync and become primary (due to the rs config priorities).

Comment by Bruce Lucas (Inactive) [ 01/Jun/18 ]

Hi Marc,

In order for us to diagnose this, can you please

  • archive and upload the contents of diagnostic.data
  • compress and upload the complete mongod log files

from each node of a replica set that has recently experienced this issue. You can either attach the files to this ticket if they are less than 150 MB, or upload them to this secure private portal.

Also, please provide a timeline of the incident so we are sure to look at the right place in the data.

Thanks,
Bruce

Comment by Marc Fletcher [ 01/Jun/18 ]

Apologies, but log1 should be split at the noformat tags, with the following comment in between:

I left the secondary in this state for a few hours to see if it would eventually get there as I was observing high CPU and disk usage. After about two hours the following additional information was output in the log:
