[SERVER-74331] rollback-to-stables cause mongod shutdowns to take up to an hour in 4.4 Created: 23/Feb/23  Updated: 14/Mar/23  Resolved: 14/Mar/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Ian Springer Assignee: Yuan Fang
Resolution: Duplicate Votes: 4
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-52815 Investigate calling 'rollback_to_stab... Backlog
Operating System: ALL
Participants:
Case:

 Description   

Since upgrading from 4.2 to 4.4, we are experiencing extremely long shutdown times for shard replica mongod processes. We are seeing this in our production clusters as well as in our much smaller test clusters. With 4.2, mongod shutdowns typically took less than a minute; now they often take 30-60 minutes. Whenever a shutdown takes that long, it is because the process is executing a WiredTiger rollback-to-stable (RTS). Log messages like the following are emitted every 20 seconds:

 

 

{"t":{"$date":"2023-02-23T02:43:14.628+00:00"},"s":"I",  "c":"STORAGE",  "id":22322,   "ctx":"conn1315","msg":"Shutting down checkpoint thread"}
{"t":{"$date":"2023-02-23T02:43:14.628+00:00"},"s":"I",  "c":"STORAGE",  "id":22323,   "ctx":"conn1315","msg":"Finished shutting down checkpoint thread"}
{"t":{"$date":"2023-02-23T02:43:14.631+00:00"},"s":"I",  "c":"STORAGE",  "id":4795902, "ctx":"conn1315","msg":"Closing WiredTiger","attr":{"closeConfig":"leak_memory=true,"}}
{"t":{"$date":"2023-02-23T02:43:20.914+00:00"},"s":"I",  "c":"NETWORK",  "id":22944,   "ctx":"conn1292","msg":"Connection ended","attr":{"remote":"10.160.171.190:37458","connectionId":1292,"connectionCount":18}}
{"t":{"$date":"2023-02-23T02:43:21.988+00:00"},"s":"I",  "c":"NETWORK",  "id":22944,   "ctx":"conn262","msg":"Connection ended","attr":{"remote":"10.160.170.250:49662","connectionId":262,"connectionCount":17}}
{"t":{"$date":"2023-02-23T02:43:34.634+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"conn1315","msg":"WiredTiger message","attr":{"message":"[1677120214:634344][3628:0x7f70507e1700], file:cdpactivationtest/collection-0--6346230458143384831.wt, txn rollback_to_stable: [WT_VERB_RECOVERY_PROGRESS] Rollback to stable has been running for 20 seconds and has inspected 597611 files. For more detailed logging, enable WT_VERB_RTS"}}
{"t":{"$date":"2023-02-23T02:43:47.381+00:00"},"s":"I",  "c":"NETWORK",  "id":22944,   "ctx":"conn1290","msg":"Connection ended","attr":{"remote":"10.160.171.190:37008","connectionId":1290,"connectionCount":16}}
{"t":{"$date":"2023-02-23T02:43:54.636+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"conn1315","msg":"WiredTiger message","attr":{"message":"[1677120234:636642][3628:0x7f70507e1700], file:config/collection-112432-5941089686580730609.wt, txn rollback_to_stable: [WT_VERB_RECOVERY_PROGRESS] Rollback to stable has been running for 40 seconds and has inspected 601768 files. For more detailed logging, enable WT_VERB_RTS"}}
{"t":{"$date":"2023-02-23T02:44:14.637+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"conn1315","msg":"WiredTiger message","attr":{"message":"[1677120254:637117][3628:0x7f70507e1700], file:config/collection-13319-559579135213939646.wt, txn rollback_to_stable: [WT_VERB_RECOVERY_PROGRESS] Rollback to stable has been running for 60 seconds and has inspected 605383 files. For more detailed logging, enable WT_VERB_RTS"}}
{"t":{"$date":"2023-02-23T02:44:34.638+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"conn1315","msg":"WiredTiger message","attr":{"message":"[1677120274:638360][3628:0x7f70507e1700], file:config/collection-153892-5941089686580730609.wt, txn rollback_to_stable: [WT_VERB_RECOVERY_PROGRESS] Rollback to stable has been running for 80 seconds and has inspected 609305 files. For more detailed logging, enable WT_VERB_RTS"}}
{"t":{"$date":"2023-02-23T02:44:54.632+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"conn1315","msg":"WiredTiger message","attr":{"message":"[1677120294:632279][3628:0x7f70507e1700], file:config/collection-153892-5941089686580730609.wt, txn rollback_to_stable: [WT_VERB_RECOVERY_PROGRESS] Rollback to stable has been running for 100 seconds and has inspected 613075 files. For more detailed logging, enable WT_VERB_RTS"}}
{"t":{"$date":"2023-02-23T02:45:14.638+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"conn1315","msg":"WiredTiger message","attr":{"message":"[1677120314:638068][3628:0x7f70507e1700], file:config/collection-196706-5941089686580730609.wt, txn rollback_to_stable: [WT_VERB_RECOVERY_PROGRESS] Rollback to stable has been running for 120 seconds and has inspected 616721 files. For more detailed logging, enable WT_VERB_RTS"}}
{"t":{"$date":"2023-02-23T02:45:34.632+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"conn1315","msg":"WiredTiger message","attr":{"message":"[1677120334:632897][3628:0x7f70507e1700], file:config/collection-220381-5941089686580730609.wt, txn rollback_to_stable: [WT_VERB_RECOVERY_PROGRESS] Rollback to stable has been running for 140 seconds and has inspected 620387 files. For more detailed logging, enable WT_VERB_RTS"}}
{"t":{"$date":"2023-02-23T02:45:54.635+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"conn1315","msg":"WiredTiger message","attr":{"message":"[1677120354:635610][3628:0x7f70507e1700], file:config/collection-244326-5941089686580730609.wt, txn rollback_to_stable: [WT_VERB_RECOVERY_PROGRESS] Rollback to stable has been running for 160 seconds and has inspected 624076 files. For more detailed logging, enable WT_VERB_RTS"}}
{"t":{"$date":"2023-02-23T02:46:14.632+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"conn1315","msg":"WiredTiger message","attr":{"message":"[1677120374:632253][3628:0x7f70507e1700], file:config/collection-270505-5941089686580730609.wt, txn rollback_to_stable: [WT_VERB_RECOVERY_PROGRESS] Rollback to stable has been running for 180 seconds and has inspected 627528 files. For more detailed logging, enable WT_VERB_RTS"}}
... 

 

I checked the system stats during the rollback-to-stable, and none of them were saturated. I also took a sampling of the system calls being made using strace:

 

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0       910           read
  0.00    0.000000           0       503           write
  0.00    0.000000           0       461           open
  0.00    0.000000           0       591           close
  0.00    0.000000           0       585           stat
  0.00    0.000000           0         6           writev
  0.00    0.000000           0     85452           select
  0.00    0.000000           0       708           sched_yield
  0.00    0.000000           0       446           sendmsg
  0.00    0.000000           0       998           recvmsg
  0.00    0.000000           0       260           getdents
  0.00    0.000000           0         6           rename
  0.00    0.000000           0      3627           gettimeofday
  0.00    0.000000           0        65           getrusage
  0.00    0.000000           0       585           statfs
  0.00    0.000000           0     89303     37032 futex
  0.00    0.000000           0        65           sched_getaffinity
  0.00    0.000000           0        10        10 restart_syscall
  0.00    0.000000           0     44993           clock_gettime
  0.00    0.000000           0       130           openat
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000                229704     37042 total 
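
(For context: a per-syscall summary like the one above is what strace's -c counting mode prints when it detaches. The exact invocation isn't recorded here; something along the following lines, attached to the mongod process while the RTS was running, would produce equivalent output, with <mongod-pid> as a placeholder.)

# Assumed invocation, not recorded above: count system calls across all threads of
# the running mongod; strace prints the summary table when detached with Ctrl-C.
strace -c -f -p <mongod-pid>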

I used db.adminCommand({ "setParameter": 1, "wiredTigerEngineRuntimeConfig": "verbose=[rts]" }) to enable verbose logging for WiredTiger RTS. This produced a large volume of output, most of which consisted of the following three log messages:

 

 

{"t":{"$date":"2023-02-16T21:46:07.465+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"SignalHandler","msg":"WiredTiger message","attr":{"message":"[1676583967:465207][24280:0x7f09ea138700], file:test2nto/index-1020619--8977920817120252160.wt, txn rollback_to_stable: [WT_VERB_RECOVERY_RTS(session)] skip rollback to stable on file file:test2nto/index-4666-8107731100219616386.wt because its checkpoint address length is 0"}}
 
{"t":{"$date":"2023-02-16T21:46:07.459+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"SignalHandler","msg":"WiredTiger message","attr":{"message":"[1676583967:459380][24280:0x7f09ea138700], file:test2nto/index-1020619--8977920817120252160.wt, txn rollback_to_stable: [WT_VERB_RECOVERY_RTS(session)] file:test2nto/index-1021448--8977920817120252160.wt: tree skipped with durable timestamp: (0, 0) and stable timestamp: (1676583941, 44) or txnid: 0"}}
 
{"t":{"$date":"2023-02-16T21:46:07.436+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"SignalHandler","msg":"WiredTiger message","attr":{"message":"[1676583967:436685][24280:0x7f09ea138700], file:test2nto/index-1020156--8977920817120252160.wt, txn rollback_to_stable: [WT_VERB_RECOVERY_RTS(session)] 0x5583f59294e0: page walk skipped"}} 

This snippet from https://jira.mongodb.org/browse/SERVER-52815 appears to explain why these rollback-to-stables are now happening on every shutdown in 4.4:

This specifically became of interest because one of the costs of WT shutdown in 4.2 and earlier was typically bounded by how much data had come into the system since the last checkpoint (or how far the stable timestamp moved since the last checkpoint). But with durable history, the cost is now bounded by how much data exists ahead of the stable timestamp (which is much easier to grow). This is because WT now calls rollback_to_stable inside of WT_CONNECTION::close.

But it's not clear whether there's any way to change this behavior. We would like to return to how things behaved in 4.2 and earlier, where we were able to shut down our mongod processes in a minute or so. An hour-long shutdown is unacceptable: it makes rolling restarts for upgrades and other maintenance take 50 times longer, and, more importantly, it prevents us from dealing expediently with cluster availability and performance issues.

 

I have tried the following tweaks to the shutdown process in the hope that one of them might prevent the RTS, but none helped (rough mongo shell equivalents are sketched after the list):

  • make the replica hidden before shutting it down (we always do this anyway)
  • remove the replica from the replica set before shutting it down
  • run db.fsyncLock() to disable writes prior to shutdown
  • use db.shutdownServer({ force: true }) to shut down, rather than sending SIGTERM
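
These correspond roughly to the following mongo shell commands. The member index and hostname are placeholders, and the exact commands we ran are not recorded here:

// 1. Make the replica hidden before shutting it down (hidden members must also
//    have priority 0).
cfg = rs.conf();
cfg.members[2].priority = 0;   // placeholder member index
cfg.members[2].hidden = true;
rs.reconfig(cfg);

// 2. Remove the replica from the replica set before shutting it down.
rs.remove("shard0-replica-3.example.net:27017");   // placeholder hostname

// 3. Flush dirty data and block writes prior to shutdown.
db.fsyncLock();

// 4. Shut down via the server command with force, instead of sending SIGTERM.
db.getSiblingDB("admin").shutdownServer({ force: true });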

 

I also tried stopping the server with "kill -9". This is the one thing that actually worked: the subsequent mongod restart took longer (5 minutes rather than the typical 1 minute), but no RTS was performed. Nevertheless, using SIGKILL seems like a bad idea, since it gives mongod no opportunity to close files, network connections, etc. cleanly.

 

We see this issue about 4 out of 5 times in our test cluster, which consists of:

  • 2 shards
  • 4 replicas per shard: 3 active with local SSDs and 1 hidden EBS-backed replica for backup purposes
  • 150 databases with about 100 collections each
  • 250 GB total data

There is a load generator running that ensures there are regular updates being made to one of the databases, but the load is light.

 

Please let me know if there's any other information I can provide to help you diagnose this issue. 

 

Thank you!

 



 Comments   
Comment by Yuan Fang [ 14/Mar/23 ]

Hi ian.springer@salesforce.com,

Thank you for your report. We have investigated this issue previously and expect it to be resolved in v4.4 by the backport of WT-8652, which is estimated to be included in the upcoming 4.4.20 release (rc0 planned for April 4; see BACKPORT-14306). As a result, we are closing this ticket. If you have any further questions, please let us know and we will reopen it.

Regards,
Yuan
