WiredTiger / WT-10424

cursor::search_near slow performance if many deleted items are present

    • Story Points: 8
    • Sprints: Storage Engines - 2023-01-10, StorEng - 2023-01-24, StorEng - 2023-02-07, 2023-02-23 "Stoney Baloney", 2023-03-21 Ellen Ripley, 2023-04-04 Bibbidi-Bobbidi-Boo
    • Versions: v6.3, v6.2, v6.0, v5.0, v4.4

      Problem description:

      While working on a customer issue, I wrote a small reproducer to measure the performance of search_near when many deleted items are present. The performance turned out to be extremely slow.
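      For context, search_near positions the cursor on the closest available key and reports, through its return value, whether that key compares smaller than, equal to, or larger than the requested key; deleted records matter because the cursor has to skip past them to reach the nearest visible key. A minimal sketch of those semantics against a standalone database (the home directory and table name here are illustrative, not part of the reproducer):

      import os
      import wiredtiger

      # Open a standalone database; 'WT_HOME' is an illustrative directory name.
      os.makedirs('WT_HOME', exist_ok=True)
      conn = wiredtiger.wiredtiger_open('WT_HOME', 'create')
      session = conn.open_session()
      session.create('table:demo', 'key_format=i,value_format=S')

      cursor = session.open_cursor('table:demo')
      cursor[10] = 'value'

      # Look up a key that does not exist: search_near lands on the closest
      # visible key and returns the comparison against the requested key
      # (< 0: smaller, 0: exact match, > 0: larger).
      cursor.set_key(5)
      exact = cursor.search_near()
      assert exact > 0 and cursor.get_key() == 10

      conn.close()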

      Reproducer:

      import random
      import time
      from time import sleep

      import wttest
      from wtscenario import make_scenarios

      class test_practice(wttest.WiredTigerTestCase):
          uri = "table:test_reposition01"
          format_values = [
              ('row_integer', dict(key_format='i', value_format='S')),
          ]
          conn_config = "cache_size=1GB,statistics=(all),statistics_log=(wait=1,json=true,on_close=true)"
          scenarios = make_scenarios(format_values)

          def test_practice(self):
              config = 'key_format={},value_format={}'.format(self.key_format, self.value_format)
              self.session.create(self.uri, config)

              session1 = self.conn.open_session()
              cur1 = session1.open_cursor(self.uri)

              self.conn.set_timestamp('oldest_timestamp=' + self.timestamp_str(1))
              nrows = 1_000_000
              rows = 200_000

              # Populate the table.
              for i in range(1, nrows):
                  cur1[i] = 'aaaa'
              cur1.reset()

              # Delete every row, one timestamped transaction per delete. The
              # transactions run on session1 so that they cover cur1's operations.
              self.prout("A")
              for i in range(1, nrows):
                  session1.begin_transaction()
                  num = random.randrange(1, nrows)
                  cur1.set_key(num)
                  cur1.search_near()
                  cur1.remove()
                  session1.commit_transaction('commit_timestamp=' + self.timestamp_str(2))

              # Re-insert a smaller number of rows.
              self.prout("B")
              sleep(5)
              for i in range(1, rows):
                  cur1[i] = 'aaaa'
              cur1.reset()
              self.prout("C")

              sleep(5)
              self.conn.set_timestamp('oldest_timestamp=' + self.timestamp_str(3))

              # Time search_near followed by an update at random keys: this is
              # the slow path when many deleted items are present.
              t0 = time.time()
              for i in range(1, rows):
                  session1.begin_transaction()
                  num = random.randrange(1, nrows)
                  cur1.set_key(num)
                  cur1.search_near()
                  cur1.set_value("213")
                  cur1.update()
                  if i % 1000 == 0:
                      self.prout("H")
                  session1.commit_transaction('commit_timestamp=' + self.timestamp_str(4))
              t1 = time.time()
              total = t1 - t0
              self.prout(str(total))

      Output:

      ubuntu@ip-10-122-7-59:~/wiredtiger$ ./run_python_test.sh test_practice
      [pid:31042]: A
      [pid:31042]: B
      [pid:31042]: C
      [pid:31042]: H
      [pid:31042]: H
      [pid:31042]: H
      [pid:31042]: H
      [pid:31042]: H
      [pid:31042]: H
      .
      .
      .
      [pid:31042]: 1179.7585327625275
      .
      ----------------------------------------------------------------------
      Ran 1 test in 1243.290s

      OK

      The final loop of 200,000 search_near/update operations took nearly 1179 seconds, roughly 20 minutes, to finish.
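      Since the test already enables statistics=(all), one way to confirm where the time goes is to read the connection statistics through a statistics cursor at the end of the run. A sketch of that, assuming it is placed inside the test method; the counter shown is the connection-level search_near call count:

      from wiredtiger import stat

      # Connection-level statistics: each entry is a (description,
      # printable value, raw value) tuple keyed by a stat.conn constant.
      stat_cursor = self.session.open_cursor('statistics:')
      search_near_calls = stat_cursor[stat.conn.cursor_search_near][2]
      self.prout('cursor search near calls: ' + str(search_near_calls))
      stat_cursor.close()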

      Note:

      Reducing the cache size to 100MB improves performance drastically, from 20 minutes down to about 40 seconds.
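      For reference, that faster run differs from the reproducer above only in the connection configuration, i.e. a single changed line:

      # Same reproducer with a 100MB cache instead of 1GB.
      conn_config = "cache_size=100MB,statistics=(all),statistics_log=(wait=1,json=true,on_close=true)"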

            Assignee:
            siddhartha.mahajan@mongodb.com Sid Mahajan
            Reporter:
            siddhartha.mahajan@mongodb.com Sid Mahajan
            Votes:
            0
            Watchers:
            23
