Uploaded image for project: 'WiredTiger'
  1. WiredTiger
  2. WT-2216

simplify row-store search loop slightly

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: WT2.8.0
    • Labels:
      None
    • # Replies:
      7
    • Last comment by Customer:
      true

      Description

      There's a test for internal pages with a single child in the row-store search function:

      /*
       * Fast-path internal pages with one child, a common case for
       * the root page in new trees.
       */
      if (pindex->entries == 1) {
              descent = pindex->index[0];
              goto descend;
      }
      

      Michael Cahill, you put this code in https://github.com/wiredtiger/wiredtiger/commit/875c10c72b6f5bdb53a8e14fc7e15f7491d6b412, do you remember where this made a difference?

      It seems to me it's only useful until we split the root, and then it's wasted from that point on, was there a benchmark that triggered this?

      1. aws.clang36
        9 kB
        Keith Bostic
      2. aws.gcc48
        9 kB
        Keith Bostic
      3. pixie.clang33
        9 kB
        Keith Bostic
      4. rr
        1 kB
        Keith Bostic

        Issue Links

          Activity

          Hide
          alexander.gorrod Alexander Gorrod added a comment -

          Keith Bostic Have you seen a workload where this check appears in performance traces, or are you being generally aggressive about our fast path search code?

          Show
          alexander.gorrod Alexander Gorrod added a comment - Keith Bostic Have you seen a workload where this check appears in performance traces, or are you being generally aggressive about our fast path search code?
          Hide
          keith.bostic Keith Bostic added a comment - - edited

          Alexander Gorrod, I hadn't measured the cost of the test, I was being generally aggressive about our fast path search code. That's why I asked if there's a known workload where this is the right thing to do, because this check is performed for every row-store internal page ever searched. Shrinking the amount of code in this loop is also known to help performance, we've seen code footprint make a difference here.

          All that said, it's a good question so I measured the results of the changes today, and naturally it turned out to be a real mess.

          I've attached a test program, op.c, and the script I'm using to run it.

          I used it to create big (5 million rows), and small (5 rows) trees, then timed 25 million WT_CURSOR.search operations. To force a full search, I removed the pinned-page test from the row-search function:

          diff --git a/src/btree/bt_cursor.c b/src/btree/bt_cursor.c
          index 69512f4..08ac671 100644
          --- a/src/btree/bt_cursor.c
          +++ b/src/btree/bt_cursor.c
          @@ -312,6 +312,7 @@ __wt_btcur_search(WT_CURSOR_BTREE *cbt)
                   * from the root.
                   */
                  valid = false;
          +#if 0
                  if (F_ISSET(cbt, WT_CBT_ACTIVE) &&
                      cbt->ref->page->read_gen != WT_READGEN_OLDEST) {
                          __wt_txn_cursor_op(session);
          @@ -321,6 +322,7 @@ __wt_btcur_search(WT_CURSOR_BTREE *cbt)
                              __cursor_col_search(session, cbt, cbt->ref));
                          valid = cbt->compare == 0 && __cursor_valid(cbt, &upd);
                  }
          +#endif
                  if (!valid) {
                          WT_ERR(__cursor_func_init(cbt, true));
          

          (There's a WT_CURSOR.reset commented out in the search path, but I suspect that's relatively heavier-weight.)

          I then did timed runs, using the develop branch, the develop branch with the test replaced so it will never fire, and with the test removed. It turned out removing that chunk of code on pixiebob loses us about 4% on the run. Since that's impossible, I dug further, and as far as I can tell, the position in the instruction cache is dominating those runs, and I can get the timing to change based on where I load the wt_row_search function.

          I've attached 3 run files: pixiebob (clang 3.3), and an AWS c3.4xlarge instance with clang 3.6 and gcc48, they show the full runs with a diff of the tree for each build.

          tl;dr

          test pixiebob clang 3.3 c3.4xlarge gcc 4.8 c3.4xlarge clang 3.6
          small tree develop 4.84 4.64 4.65
          small tree no-op 4.84 4.74 4.74
          small tree remove 4.95 4.67 4.75
          large tree develop 11.49 11.12 11.16
          large tree no-op 11.48 11.54 11.41
          large tree remove 11.95 11.01 11.04

          There's an interesting effect in there: the "no-op" test is quite a bit slower than the existing test (the existing test is pindex.entries == 1, and I compared it against pindex->entries == 0 or pindex->entries == UINT32_MAX), which confuses me; I'm pretty sure none of those comparisons are ever true in the large-tree run, but it's consistently slower to never match against 0 or UINT32_MAX, vs. never matching against 1.

          Based on this, I'm still inclined to remove the code:

          • I'm discounting the small test runs because I don't think trees with a single leaf page are common enough to optimize for (back to the original question of what workload this was intended to address),
          • on better hardware and more recent compiler versions, removing the code is marginally faster,
          • and, having less code in a performance path is generally better than more code.

          Copying Sasha on this write-up, hoping she has comments or ways to make this run faster.
          Cc: Alexandra (Sasha) Fedorova

          Show
          keith.bostic Keith Bostic added a comment - - edited Alexander Gorrod , I hadn't measured the cost of the test, I was being generally aggressive about our fast path search code. That's why I asked if there's a known workload where this is the right thing to do, because this check is performed for every row-store internal page ever searched. Shrinking the amount of code in this loop is also known to help performance, we've seen code footprint make a difference here. All that said, it's a good question so I measured the results of the changes today, and naturally it turned out to be a real mess. I've attached a test program, op.c, and the script I'm using to run it. I used it to create big (5 million rows), and small (5 rows) trees, then timed 25 million WT_CURSOR.search operations. To force a full search, I removed the pinned-page test from the row-search function: diff --git a/src/btree/bt_cursor.c b/src/btree/bt_cursor.c index 69512f4..08ac671 100644 --- a/src/btree/bt_cursor.c +++ b/src/btree/bt_cursor.c @@ -312,6 +312,7 @@ __wt_btcur_search(WT_CURSOR_BTREE *cbt) * from the root. */ valid = false; +#if 0 if (F_ISSET(cbt, WT_CBT_ACTIVE) && cbt->ref->page->read_gen != WT_READGEN_OLDEST) { __wt_txn_cursor_op(session); @@ -321,6 +322,7 @@ __wt_btcur_search(WT_CURSOR_BTREE *cbt) __cursor_col_search(session, cbt, cbt->ref)); valid = cbt->compare == 0 && __cursor_valid(cbt, &upd); } +#endif if (!valid) { WT_ERR(__cursor_func_init(cbt, true)); (There's a WT_CURSOR.reset commented out in the search path, but I suspect that's relatively heavier-weight.) I then did timed runs, using the develop branch, the develop branch with the test replaced so it will never fire, and with the test removed. It turned out removing that chunk of code on pixiebob loses us about 4% on the run. Since that's impossible, I dug further, and as far as I can tell, the position in the instruction cache is dominating those runs, and I can get the timing to change based on where I load the wt_row_search function. I've attached 3 run files: pixiebob (clang 3.3), and an AWS c3.4xlarge instance with clang 3.6 and gcc48, they show the full runs with a diff of the tree for each build. tl;dr test pixiebob clang 3.3 c3.4xlarge gcc 4.8 c3.4xlarge clang 3.6 small tree develop 4.84 4.64 4.65 small tree no-op 4.84 4.74 4.74 small tree remove 4.95 4.67 4.75 large tree develop 11.49 11.12 11.16 large tree no-op 11.48 11.54 11.41 large tree remove 11.95 11.01 11.04 There's an interesting effect in there: the "no-op" test is quite a bit slower than the existing test (the existing test is pindex.entries == 1 , and I compared it against pindex->entries == 0 or pindex->entries == UINT32_MAX ), which confuses me; I'm pretty sure none of those comparisons are ever true in the large-tree run, but it's consistently slower to never match against 0 or UINT32_MAX, vs. never matching against 1. Based on this, I'm still inclined to remove the code: I'm discounting the small test runs because I don't think trees with a single leaf page are common enough to optimize for (back to the original question of what workload this was intended to address), on better hardware and more recent compiler versions, removing the code is marginally faster, and, having less code in a performance path is generally better than more code. Copying Sasha on this write-up, hoping she has comments or ways to make this run faster. Cc: Alexandra (Sasha) Fedorova
          Hide
          fedorova Alexandra (Sasha) Fedorova added a comment -

          Pretty crazy! @keith.bostic, to clarify: are you suggesting that when that code snippet is removed, the overall code layout changes, so that two or more frequently accessed code sections end up mapped to the same set of cache lines and so there's more i-cache misses? If that is in fact what is occurring, that would be pretty difficult to fix, taking into account different cache architectures (whose implementation details are mostly secret), compilers, etc. If you are interested, I could confirm whether this is likely to be the effect you are seeing by measuring the I-cache misses.

          Show
          fedorova Alexandra (Sasha) Fedorova added a comment - Pretty crazy! @keith.bostic, to clarify: are you suggesting that when that code snippet is removed, the overall code layout changes, so that two or more frequently accessed code sections end up mapped to the same set of cache lines and so there's more i-cache misses? If that is in fact what is occurring, that would be pretty difficult to fix, taking into account different cache architectures (whose implementation details are mostly secret), compilers, etc. If you are interested, I could confirm whether this is likely to be the effect you are seeing by measuring the I-cache misses.
          Hide
          keith.bostic Keith Bostic added a comment -

          Alexandra (Sasha) Fedorova, yes, that's the only thing I can think of that explains what I'm seeing: why else would removing code slow things down, is there some other explanation?

          However, that doesn't explain why the 'no-op' test is slower than the test in the code, I'm left arguing the compiler generates better code for x == 1 vs. x == 0 or a really big number, which I don't really believe.

          Show
          keith.bostic Keith Bostic added a comment - Alexandra (Sasha) Fedorova , yes, that's the only thing I can think of that explains what I'm seeing: why else would removing code slow things down, is there some other explanation? However, that doesn't explain why the 'no-op' test is slower than the test in the code, I'm left arguing the compiler generates better code for x == 1 vs. x == 0 or a really big number , which I don't really believe.
          Hide
          alexander.gorrod Alexander Gorrod added a comment -

          I also crafted a wtperf configuration that should take the fast path as often as possible. It populated a table with 1 million entries in a tree and didn't split into the root (so the root page contains a single entry). Then executes a parallel read-only workload.

          I didn't see any performance difference when removing this check on any of the machines I tested. The configuration I used is:

          conn_config="cache_size=500MB"
          table_config="type=file"
          icount=1000000
          report_interval=5
          reopen_connection=false
          run_time=120
          value_sz=50
          populate_threads=2
          threads=((count=8,reads=1))
          

          That is a vote for taking the test out as per the open pull request.

          Show
          alexander.gorrod Alexander Gorrod added a comment - I also crafted a wtperf configuration that should take the fast path as often as possible. It populated a table with 1 million entries in a tree and didn't split into the root (so the root page contains a single entry). Then executes a parallel read-only workload. I didn't see any performance difference when removing this check on any of the machines I tested. The configuration I used is: conn_config="cache_size=500MB" table_config="type=file" icount=1000000 report_interval=5 reopen_connection=false run_time=120 value_sz=50 populate_threads=2 threads=((count=8,reads=1)) That is a vote for taking the test out as per the open pull request.
          Hide
          xgen-internal-githook Githook User added a comment -

          Author:

          {u'username': u'keithbostic', u'name': u'Keith Bostic', u'email': u'keith@wiredtiger.com'}

          Message: WT-2216: remove minor test in the row-store search loop.
          Branch: develop
          https://github.com/wiredtiger/wiredtiger/commit/7ba2f94b95bf76c9642372df35940b919975a2bb

          Show
          xgen-internal-githook Githook User added a comment - Author: {u'username': u'keithbostic', u'name': u'Keith Bostic', u'email': u'keith@wiredtiger.com'} Message: WT-2216 : remove minor test in the row-store search loop. Branch: develop https://github.com/wiredtiger/wiredtiger/commit/7ba2f94b95bf76c9642372df35940b919975a2bb
          Hide
          xgen-internal-githook Githook User added a comment -

          Author:

          {u'name': u'Ramon Fernandez', u'email': u'ramon@mongodb.com'}

          Message: Import wiredtiger-wiredtiger-2.7.0-559-g07966a4.tar.gz from wiredtiger branch mongodb-3.2

          ref: 3c2ad56..07966a4

          WT-1517 schema format edge cases
          WT-1801 Add a directory sync after rollback of a WT_SESSION::rename operation
          WT-2060 Simplify aggregation of statistics
          WT-2073 metadata cleanups
          WT-2099 Seeing memory underflow messages
          WT-2113 truncate01 sometimes fails
          WT-2142 Connection cleanup in Python tests
          WT-2177 Add an optional per-thread seed to random number generator
          WT-2198 bulk load and column store appends
          WT-2216 simplify row-store search loop slightly
          WT-2225 New split code performance impact
          WT-2231 pinned page cursor searches could check parent keys
          WT-2235 wt printlog option without unicode
          WT-2242 WiredTiger treats dead trees the same as other trees in eviction
          WT-2244 Trigger in-memory splits sooner
          WT-2245 WTPERF Truncate has no ability to catch up when it falls behind
          WT-2246 column-store append searches the leaf page; the maximum record number fails CRUD operations
          WT-2247 variable-length column-store in-memory page splits
          WT-2256 WTPERFs throttle option fires in bursts
          WT-2257 wtperf doesn't handle overriding workload config
          WT-2258 WiredTiger preloads pages even when direct-IO is configured.
          WT-2259 __wt_evict_file_exclusive_on() should clear WT_BTREE_NO_EVICTION on error
          WT-2260 Workloads evict internal pages unexpectedly
          WT-2262 Random sampling is skewed by tree shape
          WT-2265 Wiredtiger related change in ppc64le specific code block in gcc.h
          WT-2266 Add wtperf config to set if perf thresholds are fatal
          WT-2267 Improve wtperf throttling implementation to provide steady load
          WT-2269 wtperf should dump its config everytime it runs
          WT-2272 Stress test assertion in the sweep server
          WT-2275 broken DB after application crash
          WT-2276 tool to decode checkpoint addr
          WT-2277 Remove WT check against big-endian systems
          WT-2279 Define WT_PAUSE(), WT_FULL_BARRIER(), etc when s390x is defined
          WT-2281 wtperf smoke.sh fails on ppc64le
          WT-2282 error in wt_txn_update_oldest verbose message test
          WT-2283 retry in txn_update_oldest results in a hang
          WT-2284 Repeated macro definition
          WT-2285 configure should set BUFFER_ALIGNMENT_DEFAULT to 4kb on linux
          WT-2287 WT_SESSION.rebalance
          WT-2289 failure in fast key check
          WT-2290 WT_SESSION.compact could be more effective.
          WT-2291 Random cursor walk inefficient in skip list only trees
          WT-2295 WT_SESSION.create does a full-scan of the main table
          WT-2296 New log algorithm needs improving for sync/flush settings
          WT-2297 Fix off-by-one error in Huffman config file parsing
          WT-2299 upper-level WiredTiger code is reaching into the block manager
          WT-2301 Add reading a range to wtperf
          WT-2303 Build warning in wtperf
          WT-2304 wtperf crash dumping config
          WT-2305 Fix coverity scan issues on 23/12/2015
          WT-2307 Internal page splits can corrupt cursor iteration
          WT-2308 custom extractor for ref_cursors in join cursor
          WT-2311 Support Sparc
          WT-2312 re-creating a deleted column-store page can corrupt the in-memory tree
          WT-2313 sweep-server: conn_dhandle.c, 610: dhandle != conn->cache->evict_file_next
          WT-2314 page-swap error handling is inconsistent
          WT-2316 stress test failure: WT_CURSOR.prev out-of-order returns
          WT-2320 Only check copyright when cutting releases
          WT-2321 WT-2321: race between eviction and worker threads on the eviction queue
          WT-2326 Change WTPERF to use new memory allocation functions instead of the standard
          WT-2328 schema drop does direct unlink, it should use a block manager interface.
          WT-2331 Checking of search() result for reference cursors before join()
          WT-2332 Bug in logging write-no-sync mode
          WT-2333 Add a flag so drop doesn't block
          WT-2335 NULL pointer crash in config_check_search with invalid configuration string
          WT-2338 Disable using pre-allocated log files when backup cursor is open
          WT-2339 format post-rebalance verify failure (stress run #11586)
          WT-2340 Add logging guarantee assertions, whitespace
          WT-2342 Enhance wtperf to support background create and drop operations
          WT-2344 OS X compiler warning
          WT-2347 Java: schema format edge cases
          WT-2348 xargs -P isn't portable
          WT-2355 Fix minor scratch buffer usage in logging
          SERVER-21833 Compact does not release space to the system with WiredTiger
          SERVER-21887 $sample takes disproportionately long time on newly created collection
          SERVER-22064 Coverity analysis defect 77699: Unchecked return value
          SERVER-21944 WiredTiger changes for 3.2.2
          Branch: v3.2
          https://github.com/mongodb/mongo/commit/5d6532f3d5227ff76f62c4810c98a4ef4d0c8c56

          Show
          xgen-internal-githook Githook User added a comment - Author: {u'name': u'Ramon Fernandez', u'email': u'ramon@mongodb.com'} Message: Import wiredtiger-wiredtiger-2.7.0-559-g07966a4.tar.gz from wiredtiger branch mongodb-3.2 ref: 3c2ad56..07966a4 WT-1517 schema format edge cases WT-1801 Add a directory sync after rollback of a WT_SESSION::rename operation WT-2060 Simplify aggregation of statistics WT-2073 metadata cleanups WT-2099 Seeing memory underflow messages WT-2113 truncate01 sometimes fails WT-2142 Connection cleanup in Python tests WT-2177 Add an optional per-thread seed to random number generator WT-2198 bulk load and column store appends WT-2216 simplify row-store search loop slightly WT-2225 New split code performance impact WT-2231 pinned page cursor searches could check parent keys WT-2235 wt printlog option without unicode WT-2242 WiredTiger treats dead trees the same as other trees in eviction WT-2244 Trigger in-memory splits sooner WT-2245 WTPERF Truncate has no ability to catch up when it falls behind WT-2246 column-store append searches the leaf page; the maximum record number fails CRUD operations WT-2247 variable-length column-store in-memory page splits WT-2256 WTPERFs throttle option fires in bursts WT-2257 wtperf doesn't handle overriding workload config WT-2258 WiredTiger preloads pages even when direct-IO is configured. WT-2259 __wt_evict_file_exclusive_on() should clear WT_BTREE_NO_EVICTION on error WT-2260 Workloads evict internal pages unexpectedly WT-2262 Random sampling is skewed by tree shape WT-2265 Wiredtiger related change in ppc64le specific code block in gcc.h WT-2266 Add wtperf config to set if perf thresholds are fatal WT-2267 Improve wtperf throttling implementation to provide steady load WT-2269 wtperf should dump its config everytime it runs WT-2272 Stress test assertion in the sweep server WT-2275 broken DB after application crash WT-2276 tool to decode checkpoint addr WT-2277 Remove WT check against big-endian systems WT-2279 Define WT_PAUSE(), WT_FULL_BARRIER(), etc when s390x is defined WT-2281 wtperf smoke.sh fails on ppc64le WT-2282 error in wt_txn_update_oldest verbose message test WT-2283 retry in txn_update_oldest results in a hang WT-2284 Repeated macro definition WT-2285 configure should set BUFFER_ALIGNMENT_DEFAULT to 4kb on linux WT-2287 WT_SESSION.rebalance WT-2289 failure in fast key check WT-2290 WT_SESSION.compact could be more effective. WT-2291 Random cursor walk inefficient in skip list only trees WT-2295 WT_SESSION.create does a full-scan of the main table WT-2296 New log algorithm needs improving for sync/flush settings WT-2297 Fix off-by-one error in Huffman config file parsing WT-2299 upper-level WiredTiger code is reaching into the block manager WT-2301 Add reading a range to wtperf WT-2303 Build warning in wtperf WT-2304 wtperf crash dumping config WT-2305 Fix coverity scan issues on 23/12/2015 WT-2307 Internal page splits can corrupt cursor iteration WT-2308 custom extractor for ref_cursors in join cursor WT-2311 Support Sparc WT-2312 re-creating a deleted column-store page can corrupt the in-memory tree WT-2313 sweep-server: conn_dhandle.c, 610: dhandle != conn->cache->evict_file_next WT-2314 page-swap error handling is inconsistent WT-2316 stress test failure: WT_CURSOR.prev out-of-order returns WT-2320 Only check copyright when cutting releases WT-2321 WT-2321 : race between eviction and worker threads on the eviction queue WT-2326 Change WTPERF to use new memory allocation functions instead of the standard WT-2328 schema drop does direct unlink, it should use a block manager interface. WT-2331 Checking of search() result for reference cursors before join() WT-2332 Bug in logging write-no-sync mode WT-2333 Add a flag so drop doesn't block WT-2335 NULL pointer crash in config_check_search with invalid configuration string WT-2338 Disable using pre-allocated log files when backup cursor is open WT-2339 format post-rebalance verify failure (stress run #11586) WT-2340 Add logging guarantee assertions, whitespace WT-2342 Enhance wtperf to support background create and drop operations WT-2344 OS X compiler warning WT-2347 Java: schema format edge cases WT-2348 xargs -P isn't portable WT-2355 Fix minor scratch buffer usage in logging SERVER-21833 Compact does not release space to the system with WiredTiger SERVER-21887 $sample takes disproportionately long time on newly created collection SERVER-22064 Coverity analysis defect 77699: Unchecked return value SERVER-21944 WiredTiger changes for 3.2.2 Branch: v3.2 https://github.com/mongodb/mongo/commit/5d6532f3d5227ff76f62c4810c98a4ef4d0c8c56

            People

            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:
                Days since reply:
                1 year, 12 weeks, 4 days ago
                Date of 1st Reply: