WiredTiger / WT-2045

Eviction server can get tapped for eviction and stall the system.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: WT2.7.0
    • Labels: None
    • # Replies: 7
    • Last comment by Customer: true

      Description

      There are two places in the page-read code where a reading thread calls __wt_cache_eviction_check without first checking the WT_READ_NO_EVICT flag, and that can lead to deadlock. The case I have is a configuration with no eviction-worker threads and lots of eviction pressure, so the application threads are busy doing eviction.

      The eviction server thread splits a page, tries to get a hazard pointer on the parent, and for some reason can't get it immediately. The eviction server thread is then pressed into doing more eviction, but there are no pages on the queue because the application threads have emptied it, so things stall forever while the eviction server looks for pages on the empty eviction queue.
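
The guard described above can be sketched as a small, self-contained C model. This is not the WiredTiger source: `SKETCH_READ_NO_EVICT`, `sketch_eviction_check` and `sketch_read_retry` are hypothetical stand-ins for WT_READ_NO_EVICT, __wt_cache_eviction_check and one retry of the page-read loop, and the real signatures are not reproduced here. The sketch only shows the shape of the fix, which is to test the flag before helping with eviction.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the WT_READ_NO_EVICT read flag. */
#define SKETCH_READ_NO_EVICT 0x01u

/* Counts how often a reading thread was pulled into eviction work. */
static int sketch_eviction_calls;

/* Hypothetical stand-in for __wt_cache_eviction_check. */
static void
sketch_eviction_check(void)
{
    ++sketch_eviction_calls;
}

/*
 * Model of one retry in a page-read loop. The page could not be pinned
 * (for example, a hazard pointer was not immediately available), so the
 * reader may help with eviction before retrying, unless the caller has
 * forbidden it with the no-evict flag.
 *
 * Returns true if the calling thread did eviction work.
 */
static bool
sketch_read_retry(uint32_t flags)
{
    /* The check the description says was missing in two places. */
    if (flags & SKETCH_READ_NO_EVICT)
        return false;

    sketch_eviction_check();
    return true;
}
```

In this model, a caller that must never be tapped for eviction (the eviction server in the scenario above) would pass `SKETCH_READ_NO_EVICT` and simply retry the read, while ordinary application threads would pass 0 and may be asked to help.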

          Activity

          xgen-internal-githook Githook User added a comment -

          Author: Keith Bostic (keithbostic) <keith@wiredtiger.com>

          Message: WT-2045: There are two places in the page-read code where a reading
          thread calls __wt_cache_eviction_check without first checking the
          WT_READ_NO_EVICT flag, and that can lead to deadlock. The case I have
          is a configuration with no eviction-worker threads, lots of eviction
          pressure, and so the application threads are busy doing eviction.

          The eviction server thread splits a page, is getting a hazard pointer
          on the parent, and for some reason can't get it immediately. The
          eviction server thread then gets pressed into doing more eviction, there
          are no pages on the queue because the application threads have emptied
          it, and things stall forever while the eviction server looks for pages
          on the empty eviction queue.
          Branch: develop
          https://github.com/wiredtiger/wiredtiger/commit/ce223acbf67101ddb1845e4f3a10c208ba82b067

          xgen-internal-githook Githook User added a comment -

          Author: Keith Bostic (keithbostic) <keith@wiredtiger.com>

          Message: Update a comment to match the WT-2045 failure.
          Branch: develop
          https://github.com/wiredtiger/wiredtiger/commit/41b410e08778ab1f328e35d323b60bcb39f18bf2

          michael.cahill Michael Cahill added a comment -

          Keith Bostic, is there any reason to believe that this can't happen in MongoDB 3.0? In other words, should we backport?

          michael.cahill Michael Cahill added a comment -

          P.S. I think we decided it was extremely unlikely because MongoDB configures 4 eviction workers, but that's advisory: it may not fork any workers and could still theoretically get into this state?

          keith.bostic Keith Bostic added a comment -

          Michael Cahill, it might theoretically happen.

          MongoDB configures eviction=(threads_max=4) but doesn't configure threads_min, so we won't create eviction worker threads until there's cache pressure, which means the eviction server might get tapped to do eviction, and if somehow the eviction list became empty (an application thread was tapped for eviction and cleared it?), we could get into the deadlock.

          Obviously, that scenario – not enough cache pressure to start eviction workers, but somehow an application thread emptied the eviction list – is wildly unlikely, but I wouldn't want to claim it can't happen somehow.

          The backport should be trivial, so maybe that's reason enough to do it?
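
For readers unfamiliar with the setting discussed above, here is a minimal sketch of how the `eviction=(...)` configuration string could be assembled in C. The helper name `sketch_eviction_config` and its error convention are hypothetical. Only the `threads_min` and `threads_max` keys come from the comment. Passing the result to WiredTiger is not shown, and setting `threads_min` is an illustration of the knob, not a fix recommended in this ticket.

```c
#include <stdio.h>

/*
 * Build an eviction configuration string into buf.
 *
 * A threads_min of 0 means "leave threads_min unconfigured", which
 * mirrors the MongoDB setting quoted above: eviction=(threads_max=4).
 *
 * Returns 0 on success, -1 on invalid arguments or if buf is too small.
 */
static int
sketch_eviction_config(char *buf, size_t len, int threads_min, int threads_max)
{
    int n;

    if (threads_max < 1 || threads_min < 0 || threads_min > threads_max)
        return -1;

    if (threads_min > 0)
        n = snprintf(buf, len, "eviction=(threads_min=%d,threads_max=%d)",
            threads_min, threads_max);
    else
        n = snprintf(buf, len, "eviction=(threads_max=%d)", threads_max);

    /* snprintf reports the length it wanted; treat truncation as failure. */
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Calling it with `threads_min` 0 and `threads_max` 4 reproduces the string quoted in the comment. A non-zero `threads_min` adds that key as well.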

          xgen-internal-githook Githook User added a comment -

          Author: Keith Bostic (keithbostic) <keith@wiredtiger.com>

          Message: WT-2045: Avoid a potential deadlock in the eviction server.

          There are two places in the page-read code where a reading
          thread calls __wt_cache_eviction_check without first checking the
          WT_READ_NO_EVICT flag, and that can lead to deadlock. The case I have
          is a configuration with no eviction-worker threads, lots of eviction
          pressure, and so the application threads are busy doing eviction.

          The eviction server thread splits a page, is getting a hazard pointer
          on the parent, and for some reason can't get it immediately. The
          eviction server thread then gets pressed into doing more eviction, there
          are no pages on the queue because the application threads have emptied
          it, and things stall forever while the eviction server looks for pages
          on the empty eviction queue.

          (cherry picked from commit 7c8ce54f261faaf643ce5d063f27bb78db0c2cd8)
          Branch: mongodb-3.0
          https://github.com/wiredtiger/wiredtiger/commit/36d6e156d62356caa15ef3df8d43212f1c32f4d1

          xgen-internal-githook Githook User added a comment -

          Author: Keith Bostic (keithbostic) <keith@wiredtiger.com>

          Message: WT-2045 Update a comment to match the failure.

          (cherry picked from commit 7540ec4cb094e49bddf368eb8001c1fd4e47dce1)
          Branch: mongodb-3.0
          https://github.com/wiredtiger/wiredtiger/commit/3cb514ed7863810f95848a46aea597b71eba7ede

            People

            • Votes: 0
            • Watchers: 2
