[SERVER-56222] Startup logging should provide regular status updates for the undergoing recovery Created: 21/Apr/21  Updated: 06/Dec/22  Resolved: 21/May/21

Status: Closed
Project: Core Server
Component/s: Logging
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Dmitry Ryabtsev Assignee: Backlog - Storage Engines Team
Resolution: Done Votes: 9
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on WT-7452 Improve logging when recovery (and RT... Closed
Related
related to WT-7442 RTS to open dhandle only when the dha... Closed
Assigned Teams:
Storage Engines
Participants:
Case:

 Description   

It has been observed in the most recent versions of MongoDB (4.4.5) that startup recovery of an instance with a large data set can take several hours.

This is problematic for TSEs and for the users as the status of the mongod process is not entirely clear.

For example:

  • There is no indication how many phases or steps the recovery process has in total
  • Where the current progress is
  • How many steps are outstanding

To illustrate, it's been observed that a mongod spends multiple hours recovering after having a message similar to the following printed in the log with no other clear signs of progress:

{"t":{"$date":"2021-04-21T00:51:14.703+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1618966274:703153][5978:0x7f67cef96bc0], file:collection-2-1893086824266225355.wt, txn-recover: [WT_VERB_RECOVERY_PROGRESS] Recovering log 62628 through 62628"}}
{"t":{"$date":"2021-04-21T00:51:14.759+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1618966274:759611][5978:0x7f67cef96bc0], file:collection-2-1893086824266225355.wt, txn-recover: [WT_VERB_RECOVERY | WT_VERB_RECOVERY_PROGRESS] Set global recovery timestamp: (1618965463, 1)"}}
{"t":{"$date":"2021-04-21T00:51:14.759+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1618966274:759674][5978:0x7f67cef96bc0], file:collection-2-1893086824266225355.wt, txn-recover: [WT_VERB_RECOVERY | WT_VERB_RECOVERY_PROGRESS] Set global oldest timestamp: (1618965458, 1)"}}



 Comments   
Comment by Luke Pearson [ 21/May/21 ]

WT-7452 has added additional logging to rollback to stable that will log a message every 20 or so seconds dependant on how long a single file takes to rollback.

Comment by Sulabh Mahajan [ 27/Apr/21 ]

daniel.gottlieb, very likely it will be recovery_progress. I will make a note on WT-7452 that if we use a verbose category other than the above three, we might need some work in the execution layer (can use this ticket for that work in such a case). For now, marking this ticket blocked.

Comment by Daniel Gottlieb (Inactive) [ 27/Apr/21 ]

sulabh.mahajan, is the logging WT-7452 planning to add fall in these verbose settings? (Slurping from the link) either:

  • recovery_progress
  • checkpoint_progress
  • compact_progress

If so, your proposal to mark this as blocked makes sense to me. AFAIK, nothing specific about MDB was identified to be improved.

Comment by Sulabh Mahajan [ 27/Apr/21 ]

I created WT-7452 to address logging messages during roll back to stable (and possibly more during recovery in general). I am not sure if there is any work required from MongoDB to address logging messages. If not we can mark this ticket blocked on the WT ticket, and close it when the WT ticket is done. daniel.gottlieb, can you confirm if there is any work required from execution?

Comment by Sulabh Mahajan [ 27/Apr/21 ]

Though WT-7442 has made it evident that we do not provide enough logs during recovery, WT-7442 is not going to address it. We need to create a different WT ticket to address.

Generated at Thu Feb 08 05:38:42 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.