Core Server / SERVER-74331

rollback-to-stables cause mongod shutdowns to take up to an hour in 4.4

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Labels: None

      Since upgrading from 4.2 to 4.4, we are experiencing extremely long shutdown times for shard replica mongod processes. We are seeing this in our production clusters, as well as our much smaller test clusters. With 4.2, mongod shutdowns typically took less than a minute. Now they often take 30-60 minutes. When they do take such a long time, it is always because they are executing a WiredTiger rollback-to-stable. You can see logs like the following spewing out every 20 seconds:

      {"t":{"$date":"2023-02-23T02:43:14.628+00:00"},"s":"I",  "c":"STORAGE",  "id":22322,   "ctx":"conn1315","msg":"Shutting down checkpoint thread"}
      {"t":{"$date":"2023-02-23T02:43:14.628+00:00"},"s":"I",  "c":"STORAGE",  "id":22323,   "ctx":"conn1315","msg":"Finished shutting down checkpoint thread"}
      {"t":{"$date":"2023-02-23T02:43:14.631+00:00"},"s":"I",  "c":"STORAGE",  "id":4795902, "ctx":"conn1315","msg":"Closing WiredTiger","attr":{"closeConfig":"leak_memory=true,"}}
      {"t":{"$date":"2023-02-23T02:43:20.914+00:00"},"s":"I",  "c":"NETWORK",  "id":22944,   "ctx":"conn1292","msg":"Connection ended","attr":{"remote":"10.160.171.190:37458","connectionId":1292,"connectionCount":18}}
      {"t":{"$date":"2023-02-23T02:43:21.988+00:00"},"s":"I",  "c":"NETWORK",  "id":22944,   "ctx":"conn262","msg":"Connection ended","attr":{"remote":"10.160.170.250:49662","connectionId":262,"connectionCount":17}}
      {"t":{"$date":"2023-02-23T02:43:34.634+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"conn1315","msg":"WiredTiger message","attr":{"message":"[1677120214:634344][3628:0x7f70507e1700], file:cdpactivationtest/collection-0--6346230458143384831.wt, txn rollback_to_stable: [WT_VERB_RECOVERY_PROGRESS] Rollback to stable has been running for 20 seconds and has inspected 597611 files. For more detailed logging, enable WT_VERB_RTS"}}
      {"t":{"$date":"2023-02-23T02:43:47.381+00:00"},"s":"I",  "c":"NETWORK",  "id":22944,   "ctx":"conn1290","msg":"Connection ended","attr":{"remote":"10.160.171.190:37008","connectionId":1290,"connectionCount":16}}
      {"t":{"$date":"2023-02-23T02:43:54.636+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"conn1315","msg":"WiredTiger message","attr":{"message":"[1677120234:636642][3628:0x7f70507e1700], file:config/collection-112432-5941089686580730609.wt, txn rollback_to_stable: [WT_VERB_RECOVERY_PROGRESS] Rollback to stable has been running for 40 seconds and has inspected 601768 files. For more detailed logging, enable WT_VERB_RTS"}}
      {"t":{"$date":"2023-02-23T02:44:14.637+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"conn1315","msg":"WiredTiger message","attr":{"message":"[1677120254:637117][3628:0x7f70507e1700], file:config/collection-13319-559579135213939646.wt, txn rollback_to_stable: [WT_VERB_RECOVERY_PROGRESS] Rollback to stable has been running for 60 seconds and has inspected 605383 files. For more detailed logging, enable WT_VERB_RTS"}}
      {"t":{"$date":"2023-02-23T02:44:34.638+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"conn1315","msg":"WiredTiger message","attr":{"message":"[1677120274:638360][3628:0x7f70507e1700], file:config/collection-153892-5941089686580730609.wt, txn rollback_to_stable: [WT_VERB_RECOVERY_PROGRESS] Rollback to stable has been running for 80 seconds and has inspected 609305 files. For more detailed logging, enable WT_VERB_RTS"}}
      {"t":{"$date":"2023-02-23T02:44:54.632+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"conn1315","msg":"WiredTiger message","attr":{"message":"[1677120294:632279][3628:0x7f70507e1700], file:config/collection-153892-5941089686580730609.wt, txn rollback_to_stable: [WT_VERB_RECOVERY_PROGRESS] Rollback to stable has been running for 100 seconds and has inspected 613075 files. For more detailed logging, enable WT_VERB_RTS"}}
      {"t":{"$date":"2023-02-23T02:45:14.638+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"conn1315","msg":"WiredTiger message","attr":{"message":"[1677120314:638068][3628:0x7f70507e1700], file:config/collection-196706-5941089686580730609.wt, txn rollback_to_stable: [WT_VERB_RECOVERY_PROGRESS] Rollback to stable has been running for 120 seconds and has inspected 616721 files. For more detailed logging, enable WT_VERB_RTS"}}
      {"t":{"$date":"2023-02-23T02:45:34.632+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"conn1315","msg":"WiredTiger message","attr":{"message":"[1677120334:632897][3628:0x7f70507e1700], file:config/collection-220381-5941089686580730609.wt, txn rollback_to_stable: [WT_VERB_RECOVERY_PROGRESS] Rollback to stable has been running for 140 seconds and has inspected 620387 files. For more detailed logging, enable WT_VERB_RTS"}}
      {"t":{"$date":"2023-02-23T02:45:54.635+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"conn1315","msg":"WiredTiger message","attr":{"message":"[1677120354:635610][3628:0x7f70507e1700], file:config/collection-244326-5941089686580730609.wt, txn rollback_to_stable: [WT_VERB_RECOVERY_PROGRESS] Rollback to stable has been running for 160 seconds and has inspected 624076 files. For more detailed logging, enable WT_VERB_RTS"}}
      {"t":{"$date":"2023-02-23T02:46:14.632+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"conn1315","msg":"WiredTiger message","attr":{"message":"[1677120374:632253][3628:0x7f70507e1700], file:config/collection-270505-5941089686580730609.wt, txn rollback_to_stable: [WT_VERB_RECOVERY_PROGRESS] Rollback to stable has been running for 180 seconds and has inspected 627528 files. For more detailed logging, enable WT_VERB_RTS"}}
      ... 

      I checked the system stats during the rollback-to-stable, and none of them were saturated. I also took a sampling of the system calls being made using strace:

      % time     seconds  usecs/call     calls    errors syscall
      ------ ----------- ----------- --------- --------- ----------------
        0.00    0.000000           0       910           read
        0.00    0.000000           0       503           write
        0.00    0.000000           0       461           open
        0.00    0.000000           0       591           close
        0.00    0.000000           0       585           stat
        0.00    0.000000           0         6           writev
        0.00    0.000000           0     85452           select
        0.00    0.000000           0       708           sched_yield
        0.00    0.000000           0       446           sendmsg
        0.00    0.000000           0       998           recvmsg
        0.00    0.000000           0       260           getdents
        0.00    0.000000           0         6           rename
        0.00    0.000000           0      3627           gettimeofday
        0.00    0.000000           0        65           getrusage
        0.00    0.000000           0       585           statfs
        0.00    0.000000           0     89303     37032 futex
        0.00    0.000000           0        65           sched_getaffinity
        0.00    0.000000           0        10        10 restart_syscall
        0.00    0.000000           0     44993           clock_gettime
        0.00    0.000000           0       130           openat
      ------ ----------- ----------- --------- --------- ----------------
      100.00    0.000000                229704     37042 total 
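
      For reference, the summary above was gathered with strace in syscall-counting mode attached to the running mongod. The exact invocation below is an approximation: the PID lookup and the 60-second sampling window are illustrative, not the precise command I ran.

      # Attach to the running mongod, follow all of its threads (-f), and count
      # syscalls (-c) for about a minute; strace prints the summary table when it detaches.
      timeout 60 strace -c -f -p "$(pidof mongod)"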

      I used db.adminCommand( { "setParameter": 1, "wiredTigerEngineRuntimeConfig": "verbose=[rts]"}) to enable verbose logging for WiredTiger RTS. This produced a ton of output, but most of it consisted of the following 3 log messages:

      {"t":{"$date":"2023-02-16T21:46:07.465+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"SignalHandler","msg":"WiredTiger message","attr":{"message":"[1676583967:465207][24280:0x7f09ea138700], file:test2nto/index-1020619--8977920817120252160.wt, txn rollback_to_stable: [WT_VERB_RECOVERY_RTS(session)] skip rollback to stable on file file:test2nto/index-4666-8107731100219616386.wt because its checkpoint address length is 0"}}
      
      {"t":{"$date":"2023-02-16T21:46:07.459+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"SignalHandler","msg":"WiredTiger message","attr":{"message":"[1676583967:459380][24280:0x7f09ea138700], file:test2nto/index-1020619--8977920817120252160.wt, txn rollback_to_stable: [WT_VERB_RECOVERY_RTS(session)] file:test2nto/index-1021448--8977920817120252160.wt: tree skipped with durable timestamp: (0, 0) and stable timestamp: (1676583941, 44) or txnid: 0"}}
      
      {"t":{"$date":"2023-02-16T21:46:07.436+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"SignalHandler","msg":"WiredTiger message","attr":{"message":"[1676583967:436685][24280:0x7f09ea138700], file:test2nto/index-1020156--8977920817120252160.wt, txn rollback_to_stable: [WT_VERB_RECOVERY_RTS(session)] 0x5583f59294e0: page walk skipped"}} 

      This snippet from https://jira.mongodb.org/browse/SERVER-52815 appears to explain why these rollback-to-stables are now happening on every shutdown in 4.4:

      This specifically became of interest because one of the costs of WT shutdown in 4.2 and earlier was typically bounded by how much data had come into the system since the last checkpoint (or how far the stable timestamp moved since the last checkpoint). But with durable history, the cost is now bounded by how much data exists ahead of the stable timestamp (which is much easier to grow). This is because WT now calls rollback_to_stable inside of WT_CONNECTION::close.

      But it's not clear if there's any way to change this behavior. We would like to go back to how things worked in 4.2 and earlier, where we were able to shut down our mongod processes in a minute or so. An hour-long shutdown is unacceptable: it makes rolling restarts for upgrades and other maintenance take 50 times longer, and, even more importantly, it prevents us from expediently dealing with cluster availability and performance issues.

       

      I have tried the following tweaks to the shutdown process in the hope that one of them might prevent the RTS, but none helped (rough equivalents of the commands are sketched after the list):

      • make the replica hidden before shutting it down (we always do this anyway)
      • remove the replica from the replica set before shutting it down
      • run db.fsyncLock() to disable writes prior to shutdown
      • use db.shutdownServer({ force: true }) to shut down, rather than sending SIGTERM
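
      Roughly, those attempts looked like the following from the shell. Member indexes, host names, and ports are placeholders, not our real topology:

      # 1. Hide the member before shutting it down (run against the PRIMARY; member index 2 is a placeholder):
      mongo --host <primary> --eval 'var c = rs.conf(); c.members[2].priority = 0; c.members[2].hidden = true; rs.reconfig(c);'

      # 2. Or remove the member from the replica set entirely (host:port is a placeholder):
      mongo --host <primary> --eval 'rs.remove("shard1-node3:27018")'

      # 3. Block writes on the node itself just before stopping it:
      mongo --host <node-to-stop> --eval 'db.fsyncLock()'

      # 4. Ask the node to shut itself down instead of sending SIGTERM:
      mongo --host <node-to-stop> --eval 'db.getSiblingDB("admin").shutdownServer({ force: true })'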

       

      I also tried stopping the server with "kill -9". This is the one thing that actually worked: the subsequent mongod restart took longer (5 minutes rather than the typical 1 minute), but no RTS was performed. Nevertheless, using SIGKILL seems like a bad idea, since it gives mongod no opportunity to gracefully close files, network connections, etc.
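
      Concretely, the two ways of stopping the process being compared here are (PID lookup illustrative):

      # Clean shutdown via SIGTERM: in 4.4 this runs rollback-to-stable inside
      # WT_CONNECTION::close and takes 30-60 minutes on our shard replicas.
      kill -TERM "$(pidof mongod)"

      # SIGKILL: skips the RTS, but the stop is unclean; the next startup took
      # about 5 minutes instead of the usual 1 minute while mongod recovered.
      kill -9 "$(pidof mongod)"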

       

      We see this issue about 4 out of 5 times in our test cluster, which consists of:

      • 2 shards
      • 4 replicas per shard: 3 active on local SSDs and 1 hidden EBS-backed replica for backup purposes
      • 150 databases with about 100 collections each
      • 250 GB total data

      There is a load generator running that ensures there are regular updates being made to one of the databases, but the load is light.

       

      Please let me know if there's any other information I can provide to help you diagnose this issue. 

       

      Thank you!

       

            Assignee: Yuan Fang (yuan.fang@mongodb.com)
            Reporter: Ian Springer (ian.springer@salesforce.com)
            Votes: 4
            Watchers: 12
