Uploaded image for project: 'WiredTiger'
  1. WiredTiger
  2. WT-2131

Algorithm used to wait to split into parent can be unfair

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: WT2.7.0
    • Labels:
      None
    • # Replies:
      8
    • Last comment by Customer:
      true

      Description

      There are cases when threads can wait for a long time for a chance to split into their parent. This is possible in a workload that:

      • Has a lot of cache pressure (i.e: relatively small cache and a large number of threads).
      • Generates in-memory internal pages with lots of entries.

      I have seen cases where a thread has waited for over 3 seconds in __split_parent before getting in to do an update.

        Issue Links

          Activity

          Hide
          alexander.gorrod Alexander Gorrod added a comment -

          I added instrumentation into __split_parent to track when a thread is spinning for a long time waiting for a chance to lock the parent. The instrumentation looked like (almost certainly won't apply):

          --- a/src/btree/bt_split.c
          +++ b/src/btree/bt_split.c
          @@ -7,6 +7,7 @@
            */
           
           #include "wt_internal.h"
          +#include <syscall.h>
           
           #define	WT_MEM_TRANSFER(from_decr, to_incr, len) do {			\
           	size_t __len = (len);						\
           
          @@ -857,6 +863,9 @@ __split_parent(WT_SESSION_IMPL *session, WT_REF *ref,
           	uint32_t i, j;
           	uint32_t deleted_entries, parent_entries, result_entries;
           	int complete, hazard;
          +	struct timespec end, start;
          +	uint64_t saved_split_count, yield_attempts;
          +	uint32_t start_waiters;
           
           	parent = NULL;			/* -Wconditional-uninitialized */
           	alloc_index = pindex = NULL;
           
          @@ -880,6 +897,10 @@ __split_parent(WT_SESSION_IMPL *session, WT_REF *ref,
           	 * reason to use a different lock if we have to block reconciliation
           	 * anyway.
           	 */
          +	WT_RET(__wt_epoch(session, &start));
          +	yield_attempts = 0;
          +	saved_split_count = ref->home->split_count;
          +	start_waiters = __wt_atomic_add32(&ref->home->waiter_count, 1);
           	for (;;) {
           		parent = ref->home;
           		F_CAS_ATOMIC(parent, WT_PAGE_RECONCILIATION, ret);
          @@ -904,11 +926,44 @@ __split_parent(WT_SESSION_IMPL *session, WT_REF *ref,
           		 * the child's exclusive lock (which causes reconciliation to
           		 * loop until the exclusive lock is resolved). If we can't lock
           		 * the parent, give up to avoid that deadlock.
          -		 */
          -		if (S2BT(session)->checkpointing)
          +		 *
          +		if (S2BT(session)->checkpointing) {
          +			(void)__wt_atomic_sub32(&parent->waiter_count, 1);
           			return (EBUSY);
          +		}*/
          +		if (++yield_attempts > 300000)
          +			parent->track_splits = 1;
           		__wt_yield();
           	}
          +	(void)__wt_atomic_sub32(&parent->waiter_count, 1);
          +	WT_TRET(__wt_epoch(session, &end));
          +	if (WT_TIMEDIFF(end, start) > 1000 * WT_MILLION) {
          +		parent->track_splits = 0;
          +		fprintf(stdout, "%d:%d: %s:%d:"
          +		    " WiredTiger split_parent page lock: duration: %lu us, parent entries %"PRIu32" yields: %" PRIu64
          +		    ", start split count: %" PRIu64
          +		    ", end split count: %" PRIu64
          +		    ", start waiter count: %" PRIu32
          +		    ", end waiter count: %" PRIu32
          +		    " parent: %p db: %s\n",
          +		    (int) time(NULL), (int)syscall(SYS_gettid), __FILE__, __LINE__,
          +		    WT_TIMEDIFF(end, start) / 1000,
          +		    WT_INTL_INDEX_GET_SAFE(parent)->entries, yield_attempts,
          +		    saved_split_count, parent->split_count,
          +		    start_waiters, parent->waiter_count,
          +		    parent, S2C(session)->home);
          +	}
          +	WT_RET(__wt_epoch(session, &start));
           
           	/*
           	 * We have exclusive access to split the parent, and at this point, the
          @@ -1128,16 +1200,42 @@ __split_parent(WT_SESSION_IMPL *session, WT_REF *ref,
           	 * are holding it locked.
           	 */
           	if (ret == 0 && !exclusive &&
          -	    __split_should_deepen(session, parent_ref))
          +	    __split_should_deepen(session, parent_ref)) {
          +		WT_RET(__wt_epoch(session, &start));
           		ret = __split_deepen(session, parent);
          +		WT_TRET(__wt_epoch(session, &end));
          +		if (WT_TIMEDIFF(end, start) > 1000 * WT_MILLION) {
          +			fprintf(stdout, "%d:%d: %s:%d:"
          +			    " WiredTiger split_deepen: duration: %lu us db: %s\n",
          +			    (int) time(NULL), (int)syscall(SYS_gettid), __FILE__, __LINE__,
          +			    WT_TIMEDIFF(end, start) / 1000, S2C(session)->home);
          +		}
          +	}
           
          +	++parent->split_count;
          +	if (parent->track_splits) {
          +		fprintf(stdout, "%d:%d: %s:%d:"
          +		    "WiredTiger split_deepen tracking, added refs: %d"
          +		    ", start split count: %" PRIu64
          +		    ", end split count: %" PRIu64
          +		    ", start waiter count: %" PRIu32
          +		    ", end waiter count: %" PRIu32
          +		    ", checkpointing? %s, parent %p, db %s\n",
          +		    (int) time(NULL), (int)syscall(SYS_gettid), __FILE__, __LINE__,
          +		    new_entries, saved_split_count,
          +		    parent->split_count,
          +		    start_waiters, parent->waiter_count,
          +		    S2BT(session)->checkpointing ? "yes" : "no",
          +		    parent, S2C(session)->home);
          +	}
           err:	if (!complete)
           		for (i = 0; i < parent_entries; ++i) {
           			next_ref = pindex->index[i];
           			if (next_ref->state == WT_REF_SPLIT)
           				next_ref->state = WT_REF_DELETED;
           		}
          --- a/src/include/btmem.h
          +++ b/src/include/btmem.h
          @@ -563,6 +563,9 @@ struct __wt_page {
           #undef	pg_var_entries
           #define	pg_var_entries	u.col_var.entries
           	} u;
          +	uint64_t split_count;
          +	uint64_t track_splits;
          +	uint32_t waiter_count;
           
           	/*
           	 * The page's type and flags are positioned at the end of the WT_PAGE
          

          That is: Once a thread has been waiting for a while to split the page, start logging other splits that are going on in that page.

          The results were like:

          split_parent page lock: duration: 1081397 us, parent entries 67317 yields: 760085, start split count: 9211, end split count: 9245, start waiter count: 6, end waiter count: 2
          

          So the thread:

          • Waited over a second before it could do the split.
          • The internal page has 67 thousand entries (see WT-2127)
          • This thread yielded 760 thousand times before switching the PAGE_RECONCILIATION flag (wasting a lot of CPU and context switches).
          • There were 34 other threads allowed to complete a split prior to this thread getting in, even though there were only 6 threads waiting when it arrived.
          Show
          alexander.gorrod Alexander Gorrod added a comment - I added instrumentation into __split_parent to track when a thread is spinning for a long time waiting for a chance to lock the parent. The instrumentation looked like (almost certainly won't apply): --- a/src/btree/bt_split.c +++ b/src/btree/bt_split.c @@ -7,6 +7,7 @@ */ #include "wt_internal.h" +#include <syscall.h> #define WT_MEM_TRANSFER(from_decr, to_incr, len) do { \ size_t __len = (len); \   @@ -857,6 +863,9 @@ __split_parent(WT_SESSION_IMPL *session, WT_REF *ref, uint32_t i, j; uint32_t deleted_entries, parent_entries, result_entries; int complete, hazard; + struct timespec end, start; + uint64_t saved_split_count, yield_attempts; + uint32_t start_waiters; parent = NULL; /* -Wconditional-uninitialized */ alloc_index = pindex = NULL;   @@ -880,6 +897,10 @@ __split_parent(WT_SESSION_IMPL *session, WT_REF *ref, * reason to use a different lock if we have to block reconciliation * anyway. */ + WT_RET(__wt_epoch(session, &start)); + yield_attempts = 0; + saved_split_count = ref->home->split_count; + start_waiters = __wt_atomic_add32(&ref->home->waiter_count, 1); for (;;) { parent = ref->home; F_CAS_ATOMIC(parent, WT_PAGE_RECONCILIATION, ret); @@ -904,11 +926,44 @@ __split_parent(WT_SESSION_IMPL *session, WT_REF *ref, * the child's exclusive lock (which causes reconciliation to * loop until the exclusive lock is resolved). If we can't lock * the parent, give up to avoid that deadlock. - */ - if (S2BT(session)->checkpointing) + * + if (S2BT(session)->checkpointing) { + (void)__wt_atomic_sub32(&parent->waiter_count, 1); return (EBUSY); + }*/ + if (++yield_attempts > 300000) + parent->track_splits = 1; __wt_yield(); } + (void)__wt_atomic_sub32(&parent->waiter_count, 1); + WT_TRET(__wt_epoch(session, &end)); + if (WT_TIMEDIFF(end, start) > 1000 * WT_MILLION) { + parent->track_splits = 0; + fprintf(stdout, "%d:%d: %s:%d:" + " WiredTiger split_parent page lock: duration: %lu us, parent entries %"PRIu32" yields: %" PRIu64 + ", start split count: %" PRIu64 + ", end split count: %" PRIu64 + ", start waiter count: %" PRIu32 + ", end waiter count: %" PRIu32 + " parent: %p db: %s\n", + (int) time(NULL), (int)syscall(SYS_gettid), __FILE__, __LINE__, + WT_TIMEDIFF(end, start) / 1000, + WT_INTL_INDEX_GET_SAFE(parent)->entries, yield_attempts, + saved_split_count, parent->split_count, + start_waiters, parent->waiter_count, + parent, S2C(session)->home); + } + WT_RET(__wt_epoch(session, &start)); /* * We have exclusive access to split the parent, and at this point, the @@ -1128,16 +1200,42 @@ __split_parent(WT_SESSION_IMPL *session, WT_REF *ref, * are holding it locked. */ if (ret == 0 && !exclusive && - __split_should_deepen(session, parent_ref)) + __split_should_deepen(session, parent_ref)) { + WT_RET(__wt_epoch(session, &start)); ret = __split_deepen(session, parent); + WT_TRET(__wt_epoch(session, &end)); + if (WT_TIMEDIFF(end, start) > 1000 * WT_MILLION) { + fprintf(stdout, "%d:%d: %s:%d:" + " WiredTiger split_deepen: duration: %lu us db: %s\n", + (int) time(NULL), (int)syscall(SYS_gettid), __FILE__, __LINE__, + WT_TIMEDIFF(end, start) / 1000, S2C(session)->home); + } + } + ++parent->split_count; + if (parent->track_splits) { + fprintf(stdout, "%d:%d: %s:%d:" + "WiredTiger split_deepen tracking, added refs: %d" + ", start split count: %" PRIu64 + ", end split count: %" PRIu64 + ", start waiter count: %" PRIu32 + ", end waiter count: %" PRIu32 + ", checkpointing? %s, parent %p, db %s\n", + (int) time(NULL), (int)syscall(SYS_gettid), __FILE__, __LINE__, + new_entries, saved_split_count, + parent->split_count, + start_waiters, parent->waiter_count, + S2BT(session)->checkpointing ? "yes" : "no", + parent, S2C(session)->home); + } err: if (!complete) for (i = 0; i < parent_entries; ++i) { next_ref = pindex->index[i]; if (next_ref->state == WT_REF_SPLIT) next_ref->state = WT_REF_DELETED; } --- a/src/include/btmem.h +++ b/src/include/btmem.h @@ -563,6 +563,9 @@ struct __wt_page { #undef pg_var_entries #define pg_var_entries u.col_var.entries } u; + uint64_t split_count; + uint64_t track_splits; + uint32_t waiter_count; /* * The page's type and flags are positioned at the end of the WT_PAGE That is: Once a thread has been waiting for a while to split the page, start logging other splits that are going on in that page. The results were like: split_parent page lock: duration: 1081397 us, parent entries 67317 yields: 760085, start split count: 9211, end split count: 9245, start waiter count: 6, end waiter count: 2 So the thread: Waited over a second before it could do the split. The internal page has 67 thousand entries (see WT-2127 ) This thread yielded 760 thousand times before switching the PAGE_RECONCILIATION flag (wasting a lot of CPU and context switches). There were 34 other threads allowed to complete a split prior to this thread getting in, even though there were only 6 threads waiting when it arrived.
          Hide
          xgen-internal-githook Githook User added a comment -

          Author:

          {u'username': u'agorrod', u'name': u'Alex Gorrod', u'email': u'alexg@wiredtiger.com'}

          Message: WT-2131 Switch to using a lock to control page splits.

          We used to spin on setting an atomic flag, but we saw cases where
          that wasn't fair (and burned CPU).

          This implementation adds a lock into the page structure - we could
          probably find a better place for that if the change is a general
          improvement.
          Branch: develop
          https://github.com/wiredtiger/wiredtiger/commit/5bcbed387bd398b87f8fab0c47d04ed4ffb640e9

          Show
          xgen-internal-githook Githook User added a comment - Author: {u'username': u'agorrod', u'name': u'Alex Gorrod', u'email': u'alexg@wiredtiger.com'} Message: WT-2131 Switch to using a lock to control page splits. We used to spin on setting an atomic flag, but we saw cases where that wasn't fair (and burned CPU). This implementation adds a lock into the page structure - we could probably find a better place for that if the change is a general improvement. Branch: develop https://github.com/wiredtiger/wiredtiger/commit/5bcbed387bd398b87f8fab0c47d04ed4ffb640e9
          Hide
          xgen-internal-githook Githook User added a comment -

          Author:

          {u'username': u'agorrod', u'name': u'Alex Gorrod', u'email': u'alexg@wiredtiger.com'}

          Message: WT-2131 Add a new fair-lock type.

          That is a simple ticket based lock implementation. Switch the page
          lock from a read/write lock to a fair lock.
          Branch: develop
          https://github.com/wiredtiger/wiredtiger/commit/ce1fc84bfe5914e0dd3cf9b715cece89ad684f88

          Show
          xgen-internal-githook Githook User added a comment - Author: {u'username': u'agorrod', u'name': u'Alex Gorrod', u'email': u'alexg@wiredtiger.com'} Message: WT-2131 Add a new fair-lock type. That is a simple ticket based lock implementation. Switch the page lock from a read/write lock to a fair lock. Branch: develop https://github.com/wiredtiger/wiredtiger/commit/ce1fc84bfe5914e0dd3cf9b715cece89ad684f88
          Hide
          xgen-internal-githook Githook User added a comment -

          Author:

          {u'username': u'agorrod', u'name': u'Alex Gorrod', u'email': u'alexg@wiredtiger.com'}

          Message: WT-2131 Switch all WT_PAGE_RECONCILIATION locks to fair locks.

          The main purpose is to keep a single lock for all page operations.
          Branch: develop
          https://github.com/wiredtiger/wiredtiger/commit/744696e318c0d8042f32f8ab5d858fa1a8365d69

          Show
          xgen-internal-githook Githook User added a comment - Author: {u'username': u'agorrod', u'name': u'Alex Gorrod', u'email': u'alexg@wiredtiger.com'} Message: WT-2131 Switch all WT_PAGE_RECONCILIATION locks to fair locks. The main purpose is to keep a single lock for all page operations. Branch: develop https://github.com/wiredtiger/wiredtiger/commit/744696e318c0d8042f32f8ab5d858fa1a8365d69
          Hide
          xgen-internal-githook Githook User added a comment -

          Author:

          {u'username': u'agorrod', u'name': u'Alex Gorrod', u'email': u'alexg@wiredtiger.com'}

          Message: WT-2131 Fix __wt_fair_trylock implementation.
          Branch: develop
          https://github.com/wiredtiger/wiredtiger/commit/6d7678f0c92f310139b53427f1fa6d0bea3857ae

          Show
          xgen-internal-githook Githook User added a comment - Author: {u'username': u'agorrod', u'name': u'Alex Gorrod', u'email': u'alexg@wiredtiger.com'} Message: WT-2131 Fix __wt_fair_trylock implementation. Branch: develop https://github.com/wiredtiger/wiredtiger/commit/6d7678f0c92f310139b53427f1fa6d0bea3857ae
          Hide
          xgen-internal-githook Githook User added a comment -

          Author:

          {u'username': u'agorrod', u'name': u'Alex Gorrod', u'email': u'alexander.gorrod@mongodb.com'}

          Message: Merge pull request #2224 from wiredtiger/split-fairer-single-lock

          WT-2131 Switch all WT_PAGE_RECONCILIATION locks to fair locks.
          Branch: develop
          https://github.com/wiredtiger/wiredtiger/commit/3f226406880e988b9e00b112e554ffba1058a5cf

          Show
          xgen-internal-githook Githook User added a comment - Author: {u'username': u'agorrod', u'name': u'Alex Gorrod', u'email': u'alexander.gorrod@mongodb.com'} Message: Merge pull request #2224 from wiredtiger/split-fairer-single-lock WT-2131 Switch all WT_PAGE_RECONCILIATION locks to fair locks. Branch: develop https://github.com/wiredtiger/wiredtiger/commit/3f226406880e988b9e00b112e554ffba1058a5cf
          Hide
          xgen-internal-githook Githook User added a comment -

          Author:

          {u'username': u'michaelcahill', u'name': u'Michael Cahill', u'email': u'michael.cahill@mongodb.com'}

          Message: Merge pull request #2213 from wiredtiger/split-fairer

          WT-2131 Switch to using a lock to control page splits.
          Branch: develop
          https://github.com/wiredtiger/wiredtiger/commit/79f74e505ae7b15c3c695cdc72f71e4f9a105647

          Show
          xgen-internal-githook Githook User added a comment - Author: {u'username': u'michaelcahill', u'name': u'Michael Cahill', u'email': u'michael.cahill@mongodb.com'} Message: Merge pull request #2213 from wiredtiger/split-fairer WT-2131 Switch to using a lock to control page splits. Branch: develop https://github.com/wiredtiger/wiredtiger/commit/79f74e505ae7b15c3c695cdc72f71e4f9a105647
          Hide
          xgen-internal-githook Githook User added a comment -

          Author:

          {u'username': u'keithbostic', u'name': u'Keith Bostic', u'email': u'keith@wiredtiger.com'}

          Message: WT-2131: don't grow the page structure.
          Branch: develop
          https://github.com/wiredtiger/wiredtiger/commit/b338a6e14483badc2788b1f1ff96dc242dee1e56

          Show
          xgen-internal-githook Githook User added a comment - Author: {u'username': u'keithbostic', u'name': u'Keith Bostic', u'email': u'keith@wiredtiger.com'} Message: WT-2131 : don't grow the page structure. Branch: develop https://github.com/wiredtiger/wiredtiger/commit/b338a6e14483badc2788b1f1ff96dc242dee1e56

            People

            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:
                Days since reply:
                1 year, 22 weeks, 1 day ago
                Date of 1st Reply: