Uploaded image for project: 'WiredTiger'
  1. WiredTiger
  2. WT-7452

Improve logging when recovery (and RTS) is taking a long time

    XMLWordPrintable

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major - P3
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 5.1 Required
    • Component/s: None
    • Labels:
      None
    • Sprint:
      Storage - Ra 2021-05-31

      Description

      This ticket is the WiredTiger portion for SERVER-56222. Please look at SERVER ticket for more details and the context. Here is an extract from the SERVER ticket for this ticket completion:

      WT-7442 addresses an issue in rollback to stable, which makes the rollback/recovery process much longer than previous releases. WiredTiger also doesn't generate any logs to keep the user aware of what is going on. From the server ticket:

      To illustrate, it's been observed that a mongod spends multiple hours recovering after having a message similar to the following printed in the log with no other clear signs of progress:

      {"t":{"$date":"2021-04-21T00:51:14.703+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1618966274:703153][5978:0x7f67cef96bc0], file:collection-2-1893086824266225355.wt, txn-recover: [WT_VERB_RECOVERY_PROGRESS] Recovering log 62628 through 62628"}}
      {"t":{"$date":"2021-04-21T00:51:14.759+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1618966274:759611][5978:0x7f67cef96bc0], file:collection-2-1893086824266225355.wt, txn-recover: [WT_VERB_RECOVERY | WT_VERB_RECOVERY_PROGRESS] Set global recovery timestamp: (1618965463, 1)"}}
      {"t":{"$date":"2021-04-21T00:51:14.759+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1618966274:759674][5978:0x7f67cef96bc0], file:collection-2-1893086824266225355.wt, txn-recover: [WT_VERB_RECOVERY | WT_VERB_RECOVERY_PROGRESS] Set global oldest timestamp: (1618965458, 1)"}}
      

      The next message in the log - the timestamp show how large the gap in the logging is:

      {"t":{"$date":"2021-04-21T03:17:08.587+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1618975028:587005][5978:0x7f67cef96bc0], file:index-989625-1893086824266225355.wt, WT_SESSION.checkpoint: [WT_VERB_CHECKPOINT_PROGRESS] Checkpoint has been running for 843 seconds and wrote: 5000 pages (3 MB)"}}
      

      Though WiredTiger does have some verbose logging during the recovery, we need to re-evaluate if we can do better, especially during rollback to stable. Though WT-7442 is going to address the time taken to rollback when there are a lot of dhandles, this ticket should go over the code and add more logging to keep the user informed. Also, we should make an effort in identifying other portions of recovery where we can improve logging verbose messages.

      SERVER-56222 is marked blocked on this ticket. Once the WiredTiger ticket is resolved re-evaluate if any work is required for the server ticket. At the moment we do not think there is any server work required.

        Attachments

          Issue Links

            Activity

              People

              Assignee:
              backlog-server-storage-engines Backlog - Storage Engines Team
              Reporter:
              sulabh.mahajan Sulabh Mahajan
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

                Dates

                Created:
                Updated: