  1. WiredTiger
  2. WT-5621

Unrecoverable WT_ROLLBACK error

    • Type: Bug
    • Resolution: Gone away
    • Priority: Minor - P4
    • None
    • Affects Version/s: WT3.2.1
    • Component/s: Cache and eviction
    • Labels:
      None

      I've been trying to run some performance/stress tests with WiredTiger 3.2.1, but I've encountered what seems to be an unrecoverable state.

      The error I get is `WT_ROLLBACK: conflict between concurrent operations` on `cursor->insert`, even though there are no concurrent operations. I tried adding retry logic, sleeps, and flushes, but the behavior did not change. While debugging, I came to suspect that the problem is in the cache, which cannot handle the load.

      I'm adding a simple test app that reproduces it every time. You can also see that in the beginning the average insert time is pretty good, but just before entering this state it is about 10x slower (for example, on my machine it starts at 1-2us, and right before I get the error it is ~25us). Different configurations and/or keys lengthen or shorten the time until the failure, but the end result is the same. (I've omitted the error handling for simplicity.)

      I don't think I can upload a file, so I will just paste it (sorry if the formatting is bad).

       

      #include <algorithm>   // std::min
      #include <chrono>
      #include <iostream>
      #include <stdexcept>   // std::runtime_error
      
      #include <boost/uuid/random_generator.hpp>
      #include <boost/filesystem/operations.hpp>
      #include <boost/format.hpp>
      
      #include <wiredtiger.h>
      
      using namespace std::chrono;
      using ull = unsigned long long;
      
      const auto table = "table:test";
      const auto dbPath = "WT";
      
      const ull keys = 10 * 1000 * 1000;  // 10M
      const ull batchSize = 100000;       // 100K
      
      static boost::uuids::random_generator generator;
      
      // Time elapsed since the given time point, in microseconds.
      microseconds ElapsedTime(high_resolution_clock::time_point since)
      {
          return std::chrono::duration_cast<microseconds>(high_resolution_clock::now() - since);
      }
      
      // Inserts one batch of random UUID keys (the value is a copy of the key)
      // inside the caller's transaction. Throws if an insert returns WT_ROLLBACK;
      // other return codes are not checked.
      void InsertUnsafe(WT_SESSION* session, ull i)
      {
          WT_ITEM k;
          WT_ITEM v;
          WT_CURSOR* cursor;
          session->open_cursor(session, table, nullptr, nullptr, &cursor);
      
          for (ull b = 1; b <= std::min(batchSize, keys - i); b++)
          {
              const auto key = generator();
      
              k.data = key.data;
              k.size = key.size();
      
              v.data = key.data;
              v.size = key.size();
      
              cursor->set_key(cursor, &k);
              cursor->set_value(cursor, &v);
              const auto rc = cursor->insert(cursor);
              if (rc == WT_ROLLBACK)
              {
                  cursor->close(cursor);
                  // std::exception has no portable constructor taking a message.
                  throw std::runtime_error(wiredtiger_strerror(rc));
              }
          }
          cursor->close(cursor);
      }
      
      int main()
      {
          // Setup
          boost::filesystem::remove_all(dbPath);
          boost::filesystem::create_directories(dbPath);
      
          WT_CONNECTION* connection;
          wiredtiger_open(dbPath, nullptr, "create", &connection);
      
          WT_SESSION* session;
          connection->open_session(connection, nullptr, nullptr, &session);
          session->create(session, table, nullptr);
          session->close(session, nullptr);
      
          // Test
          const auto start = high_resolution_clock::now();
          auto lastUpdate = start;
          ull lastCount = 0;
      
          for (ull i = 0; i <= keys; i += batchSize)
          {
              if (i > lastCount && ElapsedTime(lastUpdate) > milliseconds(250))
              {
                  std::cout << boost::format("%d/%d (%dus avg time)...          \n") % i % keys %
                               (ElapsedTime(lastUpdate).count() / (i - lastCount));
                  lastUpdate = high_resolution_clock::now();
                  lastCount = i;
              }
      
              connection->open_session(connection, nullptr, nullptr, &session);
              session->begin_transaction(session, nullptr);
      
              try
              {
                  InsertUnsafe(session, i);
              }
              catch (std::exception& e)
              {
                  std::cout << std::endl << e.what() << std::endl;
      
                  i -= batchSize;
      
                  session->rollback_transaction(session, nullptr);
                  session->close(session, nullptr);
                  continue;
              }
      
              session->commit_transaction(session, nullptr);
              session->close(session, nullptr);
          }
      
          std::cout << "Average time: " << ElapsedTime(start).count() / keys << "us          " << std::endl;
      
          return 0;
      }
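
If the cache suspicion above is right, the size of each transaction (100K inserts in the reproducer) may matter more than the retry itself, since retrying the same batch repeats the same load. The sketch below is one way to test that hypothesis: it halves the batch size after each rollback instead of retrying unchanged. It is not taken from the WiredTiger documentation; `InsertAdaptive`, `runBatch`, and `kRollback` are hypothetical names, and `kRollback` is only a stand-in for the real `WT_ROLLBACK` constant so that the example compiles without `<wiredtiger.h>`.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>

// Hypothetical stand-in for WT_ROLLBACK; a real program would compare
// against the constant from <wiredtiger.h> instead.
constexpr int kRollback = -1;

// Runs `total` inserts in transactions of at most `batch` inserts each.
// `runBatch(first, count)` is an assumed callback that begins a transaction,
// inserts `count` items starting at index `first`, and either commits and
// returns 0, or rolls back and returns kRollback (or another error code).
// After a rollback the batch size is halved and the same range is retried.
// Returns the batch size in use when all inserts succeeded, or 0 on failure
// (a non-rollback error, a rollback with a batch size of 1, or batch == 0).
std::size_t InsertAdaptive(
    std::size_t total, std::size_t batch,
    const std::function<int(std::size_t, std::size_t)>& runBatch)
{
    if (batch == 0)
        return 0;

    std::size_t done = 0;
    while (done < total)
    {
        const std::size_t count = std::min(batch, total - done);
        const int rc = runBatch(done, count);
        if (rc == 0)
        {
            done += count;  // Batch committed; move on to the next range.
            continue;
        }
        if (rc != kRollback || batch == 1)
            return 0;       // Not retryable, or cannot shrink any further.
        batch /= 2;         // Retry the same range with a smaller transaction.
    }
    return batch;
}
```

In the reproducer, `runBatch` would wrap the `begin_transaction` / `InsertUnsafe` / `commit_transaction` sequence from `main`. If the rollbacks stop below some batch size, that would point at transaction size rather than a real write conflict; if they persist even for single inserts, the cause is more likely elsewhere.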


            Assignee: backlog-server-storage-engines [DO NOT USE] Backlog - Storage Engines Team
            Reporter: danislav.kirov@lucidlink.com Danislav Kirov
            Votes: 0
            Watchers: 5

              Created:
              Updated:
              Resolved: