Core Server / SERVER-22062

Foreground index build may hang 3.0.x WiredTiger node


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical - P2
    • Resolution: Won't Fix
    • Affects Version/s: 3.0.8
    • Fix Version/s: None
    • Component/s: WiredTiger
    • Labels:
    • Environment:
      Mongo 3.0.8 + parallel
    • Operating System:
      ALL
    • Sprint:
      Integrate+Tuning 15 (06/03/16)

      Description

      Using MongoDB 3.0.8, when replicating a large data set (this didn't happen with a smaller one) with max_threads = 4 and a patch that parallelizes the cloning process. In our setup we used 17 threads cloning different databases, each thread holding a database lock instead of a global lock.
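
      For context, the per-database locking scheme described above can be sketched as follows. This is an illustrative pthreads sketch with hypothetical names, not the actual patch code: each worker takes only its own database's lock for the duration of the clone, so the 17 clones can proceed concurrently.

      ```c
      /* Sketch: one cloning thread per database, each holding a
       * per-database lock instead of one global lock (hypothetical
       * names; not the actual mongod patch). */
      #include <pthread.h>
      #include <stdio.h>

      #define NDBS 4 /* the report used 17 threads, one per database */

      struct db {
          pthread_mutex_t lock; /* per-db lock, replaces the global lock */
          int docs_total;
          int docs_cloned;
      };

      static struct db dbs[NDBS];

      static void *clone_db(void *arg)
      {
          struct db *db = arg;

          pthread_mutex_lock(&db->lock); /* held for the whole clone */
          while (db->docs_cloned < db->docs_total)
              db->docs_cloned++; /* stand-in for copying one document */
          pthread_mutex_unlock(&db->lock);
          return NULL;
      }

      int main(void)
      {
          pthread_t tids[NDBS];
          int i;

          for (i = 0; i < NDBS; i++) {
              pthread_mutex_init(&dbs[i].lock, NULL);
              dbs[i].docs_total = 1000;
          }
          for (i = 0; i < NDBS; i++)
              pthread_create(&tids[i], NULL, clone_db, &dbs[i]);
          for (i = 0; i < NDBS; i++)
              pthread_join(tids[i], NULL);
          for (i = 0; i < NDBS; i++)
              printf("db %d: cloned %d docs\n", i, dbs[i].docs_cloned);
          return 0;
      }
      ```

      Because every worker still allocates pages in the shared WiredTiger cache, the per-db locks do not prevent all threads from contending on cache eviction, which is where the hang below occurs.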

      Examining the cache flags with gdb shows the STUCK bit is set; here is the cache structure:

      {
        bytes_inmem = 60122154727,
        pages_inmem = 5668935,
        bytes_internal = 243665765,
        bytes_overflow = 0,
        bytes_evict = 486975238126,
        pages_evict = 5648345,
        bytes_dirty = 59789008836,
        pages_dirty = 6448,
        bytes_read = 1482481072,
        evict_max_page_size = 31232046,
        read_gen = 1682697,
        read_gen_oldest = 1682790,
        evict_cond = 0x39abcd0,
        evict_lock = {
          lock = {
            __data = {
              __lock = 0,
              __count = 0,
              __owner = 0,
              __nusers = 0,
              __kind = 0,
              __spins = 0,
              __elision = 0,
              __list = {
                __prev = 0x0,
                __next = 0x0
              }
            },
            __size = '\000' <repeats 39 times>,
            __align = 0
          },
          counter = 0,
          name = 0x17446c2 "cache eviction",
          id = 0 '\000',
          initialized = 1 '\001'
        },
        evict_walk_lock = {
          lock = {
            __data = {
              __lock = 0,
              __count = 0,
              __owner = 0,
              __nusers = 0,
              __kind = 0,
              __spins = 0,
              __elision = 0,
              __list = {
                __prev = 0x0,
                __next = 0x0
              }
            },
            __size = '\000' <repeats 39 times>,
            __align = 0
          },
          counter = 0,
          name = 0x17446d1 "cache walk",
          id = 0 '\000',
          initialized = 1 '\001'
        },
        evict_waiter_cond = 0x39abd40,
        eviction_trigger = 95,
        eviction_target = 80,
        eviction_dirty_target = 80,
        overhead_pct = 8,
        evict = 0x4bd4000,
        evict_current = 0x0,
        evict_candidates = 100,
        evict_entries = 100,
        evict_max = 400,
        evict_slots = 400,
        evict_file_next = 0x570f9c700,
        sync_request = 0,
        sync_complete = 0,
        cp_saved_read = 0,
        cp_current_read = 0,
        cp_skip_count = 0,
        cp_reserved = 0,
        cp_session = 0x0,
        cp_tid = 0,
        flags = 40
      }
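
      The `flags = 40` field at the end of the dump is a bitmask; WiredTiger tests such flags with its `F_SET`/`F_CLR`/`F_ISSET` macros. A minimal sketch of decoding that style of flag word follows; the `CACHE_EVICT_STUCK` value here is a hypothetical bit position for illustration, not the real `WT_CACHE_*` constant from the WiredTiger headers.

      ```c
      /* Sketch of WiredTiger-style flag testing. The bit value below is
       * illustrative only, not the real WT_CACHE_* definition. */
      #include <assert.h>
      #include <stdint.h>
      #include <stdio.h>

      #define F_SET(p, f)   ((p)->flags |= (f))
      #define F_CLR(p, f)   ((p)->flags &= ~(f))
      #define F_ISSET(p, f) (((p)->flags & (f)) != 0)

      #define CACHE_EVICT_STUCK 0x08u /* hypothetical bit position */

      struct cache { uint32_t flags; };

      int main(void)
      {
          struct cache cache = { .flags = 40 }; /* value from the gdb dump */

          /* 40 decimal == 0x28, so bits 0x08 and 0x20 are set */
          printf("stuck? %d\n", F_ISSET(&cache, CACHE_EVICT_STUCK)); /* prints "stuck? 1" */
          F_CLR(&cache, CACHE_EVICT_STUCK);
          assert(!F_ISSET(&cache, CACHE_EVICT_STUCK));
          return 0;
      }
      ```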
      

      Attached is a stack trace. As you can see, all cloning threads (threads 51 through 67) are hung on the eviction condition "0x39abd40", which comes from the __wt_cache_full_check() call. Thread #8 is also stuck on the same call, via a _deleteExcessDocuments call.
      The eviction server (thread #2) is sleeping, and this happens constantly.
      The eviction workers appear to have no work: all three live eviction workers (threads 68 through 70) are waiting on the same condition.
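
      The waiter-versus-server relationship described above follows the standard condition-variable pattern: each blocked thread sleeps on the condition inside a predicate loop and can only make progress once the eviction server signals it after making room in the cache. An illustrative sketch in plain pthreads (not actual WiredTiger code) of why the cloning threads stay asleep when the server never runs an eviction pass:

      ```c
      /* Illustrative pattern (not WiredTiger source): application threads
       * block on a cache-full condition until an eviction pass signals
       * that room was made. If the server never signals, they sleep forever. */
      #include <pthread.h>
      #include <stdbool.h>
      #include <stdio.h>

      static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
      static pthread_cond_t evict_waiter_cond = PTHREAD_COND_INITIALIZER;
      static bool cache_full = true;

      static void *application_thread(void *arg)
      {
          (void)arg;
          pthread_mutex_lock(&mtx);
          /* Re-check the predicate in a loop: if the eviction server
           * never signals (as in this ticket), the thread parks here. */
          while (cache_full)
              pthread_cond_wait(&evict_waiter_cond, &mtx);
          pthread_mutex_unlock(&mtx);
          return NULL;
      }

      static void eviction_server(void)
      {
          pthread_mutex_lock(&mtx);
          cache_full = false;                         /* eviction made room */
          pthread_cond_broadcast(&evict_waiter_cond); /* wake all waiters */
          pthread_mutex_unlock(&mtx);
      }

      int main(void)
      {
          pthread_t tid;

          pthread_create(&tid, NULL, application_thread, NULL);
          eviction_server();
          pthread_join(tid, NULL);
          printf("waiter woke up\n");
          return 0;
      }
      ```

      In the hang reported here, the equivalent of eviction_server() never clears the cache-full condition, so every waiter on "0x39abd40" stays parked indefinitely.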

      This situation reproduced itself over and over at some point during the initial clone; any idea as to why this happens would be appreciated.
      The small patch for the parallelization is available here:
      https://github.com/liranms/mongo/commit/a216bb0d8159f8030b5d666ffa8869c57f28fcc0

        Attachments

        1. 22062_pinned_high.png
          22062_pinned_high.png
          26 kB
        2. 22062_pinned_low.png
          22062_pinned_low.png
          28 kB
        3. another_stack_trace
          134 kB
        4. fg_index_hang.png
          fg_index_hang.png
          31 kB
        5. fg_index_no_hang.png
          fg_index_no_hang.png
          37 kB
        6. sslog.log.gz
          2.75 MB
        7. stacktrace.txt
          130 kB
        8. timeseries.png
          timeseries.png
          151 kB


              People

              • Votes: 1
              • Watchers: 26
