SERVER-35339 (Core Server)

Complete recovery failure after unclean shutdown


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Duplicate
    • Affects Version/s: 3.6.5
    • Fix Version/s: None
    • Component/s: WiredTiger
    • Labels:
    • Operating System:
      ALL
    • Steps To Reproduce:
      Under-provision the secondary so that it eventually runs out of memory and crashes.


      Description

      Environment:

      • 3-server replica sets on AWS (t2.medium) running MongoDB 3.6.5.
      • All writes employ MAJORITY write concern.
      • Default journaling is enabled.

      Expected behaviour:

      That the supported recovery methods return the instance to health.

      Observed Behaviour:

      After an unclean shutdown, a secondary never recovers on its own: it never makes it past the final step in the sample log (see log1.txt).

      By this point the instance had clearly run out of disk space (100 GB provisioned for a database that is normally 1.6 GB). For the contents of the /var/mongodata folder, see log2.txt.
       
      So it appears the culprit is entirely the WiredTigerLAS.wt file.
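To make the check above repeatable, here is a minimal sketch that ranks the files in a data directory by size, which is one way to confirm that a single file such as WiredTigerLAS.wt accounts for the disk usage. The function name `largest_files` and the `top` parameter are hypothetical, chosen for illustration only; it uses nothing beyond the Python standard library and does not talk to mongod.

```python
import os


def largest_files(directory, top=5):
    """Return up to `top` (size_in_bytes, file_name) pairs, largest first.

    Only regular files directly inside `directory` are considered;
    subdirectories are skipped rather than descended into.
    """
    sizes = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                size = entry.stat(follow_symlinks=False).st_size
                sizes.append((size, entry.name))
    # Tuples sort by size first, so reverse order puts the largest file first.
    sizes.sort(reverse=True)
    return sizes[:top]
```

Calling `largest_files("/var/mongodata")` on the affected host (with read access to the data directory) would list the biggest files there; the sizes are in bytes and need converting for display.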

      For additional information, here is the df output at this point:

      Filesystem      Size  Used Avail Use% Mounted on
      devtmpfs        7.9G   60K  7.9G   1% /dev
      tmpfs           7.9G     0  7.9G   0% /dev/shm
      /dev/xvda1       20G  2.8G   17G  14% /
      /dev/xvdi       100G  100G  140K 100% /var/mongodata
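As an illustration of how output like the above could be checked automatically, the following sketch parses `df`-style text and reports mount points at or above a usage threshold. The function name `full_mounts` and the default threshold of 95 are assumptions made for this example, not part of any MongoDB tooling; the parser expects the six-column layout shown above.

```python
def full_mounts(df_output, threshold=95):
    """Return (mount_point, use_percent) pairs at or above `threshold`.

    Expects the six-column layout printed by `df -h`:
    Filesystem, Size, Used, Avail, Use%, Mounted on.
    """
    flagged = []
    lines = df_output.strip().splitlines()
    for line in lines[1:]:  # the first line is the column header
        # Split into at most six fields so a mount point containing
        # spaces stays in one piece.
        fields = line.split(None, 5)
        if len(fields) < 6 or not fields[4].endswith("%"):
            continue  # skip rows that do not match the expected layout
        use_percent = int(fields[4].rstrip("%"))
        if use_percent >= threshold:
            flagged.append((fields[5], use_percent))
    return flagged
```

Fed the output shown above, this flags only /var/mongodata, the volume that holds the data directory.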
      

      The only option to recover this instance is to do a full resync (after deleting the contents of /var/mongodata), see log3.txt.
       
      The initial sync currently takes less than 60 seconds, but this approach will not remain practical once the data set grows.

        Attachments

        1. CPU.png (119 kB)
        2. image-2018-06-07-09-53-02-151.png (153 kB)
        3. image-2018-06-07-09-55-51-824.png (125 kB)
        4. image-2018-06-08-17-14-04-212.png (49 kB)
        5. log1.txt (14 kB)
        6. log2.txt (4 kB)
        7. log3.txt (44 kB)
        8. top_level_replicaset_stats.JPG (70 kB)


              People

              Assignee: Bruce Lucas (bruce.lucas)
              Reporter: Marc Fletcher (MarcF)
              Votes: 0
              Watchers: 8
