Core Server / SERVER-20876

Hang in scenario with sharded ttl collection under WiredTiger

    • Type: Bug
    • Resolution: Done
    • Priority: Critical - P2
    • Fix Version/s: 3.0.8
    • Affects Version/s: 3.0.6, 3.0.7
    • Component/s: Performance, WiredTiger
    • Labels: None
    • Backwards Compatibility: Fully Compatible

      Setup:

      • 3.0.6
      • two shards, each standalone mongod
      • shard key is _id (not hashed), using standard default ObjectIDs.
      • WT cache size 10 GB
      • a single TTL collection set to expire documents after 10 minutes
      • 4 threads inserting 1 kB documents into the TTL collection
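The setup above can be sketched as a mongosh configuration script. This is a reconstruction, not the reporter's actual script: the database/collection names, the TTL-indexed field name (`created`), and the host details are assumptions; the report only specifies the shard key, cache size, TTL, and document size.

```javascript
// Sketch of the repro setup, run against the mongos.
// Each shard is a standalone mongod started with a 10 GB WT cache, e.g.:
//   mongod --shardsvr --wiredTigerCacheSizeGB 10 ...

sh.enableSharding("test");                      // "test" db is an assumption

// Shard on plain (non-hashed) _id: default ObjectIds are monotonically
// increasing, so all new inserts land on the chunk with the highest _id
// range, i.e. one shard at a time is active.
sh.shardCollection("test.ttl", { _id: 1 });

// TTL index expiring documents 600 s (10 minutes) after `created`.
// (The report doesn't name the indexed field; `created` is an assumption.)
db.ttl.createIndex({ created: 1 }, { expireAfterSeconds: 600 });

// Insert workload: 4 client threads, each repeatedly doing roughly this
// with a ~1 kB document.
const payload = "x".repeat(1000);
db.ttl.insertOne({ created: new Date(), payload: payload });
```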

      After about 25 minutes of run time (about 15 minutes after ttl deletions started) one of the shards became completely stuck. Stack traces captured after it became stuck are attached:

      • TTLMonitor, splitVector, and serverStatus are all stuck waiting on a full cache:
        • TTLMonitor is stuck in __wt_page_in_func while traversing the collection
        • serverStatus and splitVector are both stuck in WT somewhere within mongo::WiredTigerRecoveryUnit::_txnOpen - why would a transaction open be stuck waiting for a full cache?
      • no sign of any eviction activity (or of the eviction worker threads themselves, though perhaps they are hidden in the "??" call sites).

      Here's an overview of the run for the shard that became stuck:

      • inserts are fast but unbalanced, because only one shard at a time is active in this particular test (due to the simple shard key).
      • the insert rate drops very dramatically when TTL deletions begin at A. This performance hit seems surprisingly large.
      • TTL deletion passes run A-B, C-D, E-F, G-. Each begins with a bump in "ttl passes" and ends with a bump in "ttl deletedDocuments".
      • the WT "range of IDs pinned" statistic correlates well with the deletion passes. However, I don't believe there are intentional long-running transactions; rather, I suspect one or more deletions simply run very slowly.
      • it appears that at H the cache becomes full and everything gets stuck, presumably because no evictions are happening, as per the stack traces above, which were captured after H.
      • interestingly, though, evictions appear to have been proceeding as normal right up to H - so why would they stop?
      • there is no data after H because serverStatus has become stuck as well.
      • the only clear connection to any sharding-related activity that I can find is the stuck splitVector call; however, it has been running frequently and appears stuck in the same way serverStatus is, so it seems unlikely to be the culprit.
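The cache-full condition described above can be read off the collected serverStatus samples (e.g. from the attached diagnostic.data) by comparing two WiredTiger cache statistics. A minimal sketch, assuming sample values that are made up for illustration; the 95% figure is WiredTiger's default eviction_trigger, above which application threads are pulled into eviction and, if eviction makes no progress, block when they need to bring a page into cache:

```javascript
// Sketch: compute cache fill from serverStatus().wiredTiger.cache.
// The byte counts below are illustrative, not from this incident.
const status = {
  wiredTiger: {
    cache: {
      "bytes currently in the cache": 10200547328,
      "maximum bytes configured": 10737418240, // 10 GB, as in this setup
    },
  },
};

function cacheFill(ss) {
  const c = ss.wiredTiger.cache;
  return c["bytes currently in the cache"] / c["maximum bytes configured"];
}

const fill = cacheFill(status);
// At or above the default eviction trigger (95%), operations that need a
// page in cache (like the stuck TTLMonitor here) can end up blocked.
console.log(fill.toFixed(3), fill >= 0.95 ? "cache at/above eviction trigger" : "below trigger");
```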

        Attachments:

        1. diagnostic.data.tar (233 kB)
        2. gdbmon.html (206 kB)
        3. gdbmon.log (1.17 MB)
        4. repro-04-gdbmon-s0r0.html (6.29 MB)
        5. repro-04-gdbmon-s0r0.log (6.59 MB)
        6. repro-06-diagnostic.data.tar (1.84 MB)
        7. repro-06-gdbmon.html (5.70 MB)
        8. repro-06-gdbmon.png (448 kB)
        9. repro-06-ts.png (135 kB)
        10. repro-12.png (141 kB)
        11. repro-14.png (111 kB)
        12. repro-14-diagnostic.data.tar (127 kB)
        13. repro-14-gdbmon.html (11.24 MB)
        14. repro-14-gdbmon.png (172 kB)
        15. repro-15.png (66 kB)
        16. standalone.png (130 kB)
        17. standalone-diagnostic.data.tar (444 kB)
        18. stuck.png (202 kB)

            Assignee:
            michael.cahill@mongodb.com Michael Cahill (Inactive)
            Reporter:
            bruce.lucas@mongodb.com Bruce Lucas (Inactive)
            Votes:
            0
            Watchers:
            17
