WiredTiger / WT-7366

Excessive eviction failures in v4.0 due to being blocked by hazard pointers


    Details

    • Type: Bug
    • Status: Open
    • Priority: Major - P3
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: Backlog
    • Component/s: None
    • Labels:
      None

      Description

      Background

      A v4.0 instance running as part of our Evergreen infrastructure stalled for 10 minutes, during which it accepted no reads. A few slow queries were killed by Cloud Manager around this time, though it's not clear whether this is related.

      Most of the abnormalities in the FTDC seem to stem from eviction being ineffective during this period because hazard pointers are blocking it. Checkpoint also takes longer around this time. Since eviction is blocked by hazard pointer acquisitions rather than by the checkpoint, I'm led to believe that ineffective eviction is slowing the checkpoint down (it can't read pages into cache), rather than the usual case of the checkpoint disrupting eviction.
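
      For readers unfamiliar with the mechanism, the following is a minimal conceptual sketch of how a held hazard pointer makes an eviction attempt fail. It is not WiredTiger code: `Session`, `try_evict` and the page IDs are hypothetical names used only for illustration, and the real implementation differs in many details.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Hypothetical reader session; records hazard pointers on pages it is using."""
    name: str
    hazards: set = field(default_factory=set)

    def acquire(self, page_id):
        # Publish a hazard pointer: "I am using this page, do not free it."
        self.hazards.add(page_id)

    def release(self, page_id):
        self.hazards.discard(page_id)


def try_evict(cache, sessions, page_id):
    """Evict page_id unless some session still holds a hazard pointer on it."""
    if any(page_id in s.hazards for s in sessions):
        return False  # eviction attempt fails; the page stays in cache
    cache.discard(page_id)
    return True


cache = {"page-1", "page-2"}
reader = Session("slow-aggregation")  # assumed long-running operation
reader.acquire("page-1")

blocked = try_evict(cache, [reader], "page-1")  # hazard pointer still held
evicted = try_evict(cache, [reader], "page-2")  # no session references this page
reader.release("page-1")
retried = try_evict(cache, [reader], "page-1")  # pointer released before retry
```

      In this toy model, an operation that holds its hazard pointer for a long time causes every eviction attempt on that page to fail, which is the pattern the FTDC appears to show.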

      In the original HELP ticket, we speculated a bit on the operations that got killed (they are aggregations, which can be expensive). Regardless, it's not clear why they would involve hazard pointers being held for extended periods of time.

      Some more preliminary investigation can be found on the linked HELP ticket, but I've attached the FTDC, logs and relevant screenshot to this ticket.

      Goal

      We should aim to identify what the problem is and, if the change is trivial, make a fix. If it's non-trivial, we should create a follow-up ticket to implement the change.

        Attachments

        1. log.tar.gz
          20.93 MB
        2. metrics-1.2021-03-10t15-04-41z-00000
          3.47 MB
        3. Screen Shot 2021-03-16 at 5.41.42 pm.png
          241 kB

            People

            Assignee:
            backlog-server-storage-engines Backlog - Storage Engines Team
            Reporter:
            alex.cameron Alex Cameron
            Votes:
            0
            Watchers:
            4

              Dates

              Created:
              Updated: