withTransaction API retries too frequently

      Key            Status/Resolution   FixVersion
      CDRIVER-6084   Blocked
      CXX-3332       Blocked
      CSHARP-5712    Blocked
      GODRIVER-3647  Blocked
      JAVA-5950      Blocked
      NODE-7122      Blocked
      PYTHON-5518    Blocked
      PHPLIB-1714    Blocked
      RUBY-3701      Blocked
      RUST-2268      Blocked

      Note that this is not really a bug because the spec was designed to retry immediately on purpose.

      The withTransaction API retries immediately when encountering a TransientTransactionError. I think this may cause resource utilization problems on both the client and the server in real-world use cases.

      As a simple example, let's say the client is running two concurrent transactions (A and B) that touch the same document. One of these transactions will fail with a TransientTransactionError (caused by a WriteConflict) and immediately retry; let's assume this is transaction B. If A is still in progress, then the retry of B will also fail with the same error; B will only succeed after A has completed. This can lead to hundreds of failed transactions per second.

      Another example is what happens with many concurrent transactions that all contend on the same document. I've provided an example in withTransaction.py that starts 200 threads that all attempt a single with_transaction call on the same document. The output looks like this:

      $ python3.7 withTransaction.py
      Testing 200 threads
      Finished RunOrderTransaction with 0 retry attempts
      Finished RunOrderTransaction with 0 retry attempts
      Finished RunOrderTransaction with 0 retry attempts
      Finished RunOrderTransaction with 1 retry attempts
      Finished RunOrderTransaction with 2 retry attempts
      Finished RunOrderTransaction with 2 retry attempts
      Finished RunOrderTransaction with 3 retry attempts
      Finished RunOrderTransaction with 4 retry attempts
      ...
      Finished RunOrderTransaction with 46 retry attempts
      Finished RunOrderTransaction with 48 retry attempts
      Finished RunOrderTransaction with 49 retry attempts
      Finished RunOrderTransaction with 51 retry attempts
      Finished RunOrderTransaction with 50 retry attempts
      Finished RunOrderTransaction with 51 retry attempts
      ...
      Finished RunOrderTransaction with 108 retry attempts
      Finished RunOrderTransaction with 112 retry attempts
      Finished RunOrderTransaction with 116 retry attempts
      Finished RunOrderTransaction with 112 retry attempts
      Finished RunOrderTransaction with 114 retry attempts
      Finished RunOrderTransaction with 116 retry attempts
      Finished RunOrderTransaction with 118 retry attempts
      Finished RunOrderTransaction with 116 retry attempts
      All threads completed after 21.644919872283936 seconds
      

      One solution to this problem could be to add a delay before retrying. When I change with_transaction to use a 250 millisecond retry delay, the withTransaction.py script completes much faster and with far fewer retry attempts:

      $ python3.7 withTransaction.py
      Testing 200 threads
      Finished RunOrderTransaction with 0 retry attempts
      ...
      Finished RunOrderTransaction with 27 retry attempts
      Finished RunOrderTransaction with 28 retry attempts
      Finished RunOrderTransaction with 30 retry attempts
      Finished RunOrderTransaction with 30 retry attempts
      Finished RunOrderTransaction with 31 retry attempts
      Finished RunOrderTransaction with 33 retry attempts
      All threads completed after 10.54933214187622 seconds
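
      For illustration only, a fixed retry delay could be sketched roughly as follows. This is not the driver's actual with_transaction implementation: TransientTransactionError here is a hypothetical stand-in exception class, run_with_fixed_delay is a made-up helper name, and the 120 second overall time limit is an assumption. Only the 250 millisecond delay comes from the experiment above.

```python
import time


class TransientTransactionError(Exception):
    """Hypothetical stand-in for an error labeled TransientTransactionError."""


def run_with_fixed_delay(txn_func, retry_delay=0.25, timeout=120.0,
                         sleep=time.sleep, clock=time.monotonic):
    """Call txn_func until it succeeds, pausing retry_delay seconds between tries.

    Returns (result, retry_attempts). The timeout value is an assumption made
    for this sketch; once it has elapsed, the last error is re-raised.
    """
    start = clock()
    retry_attempts = 0
    while True:
        try:
            return txn_func(), retry_attempts
        except TransientTransactionError:
            if clock() - start >= timeout:
                raise
            retry_attempts += 1
            # Wait before retrying instead of retrying immediately.
            sleep(retry_delay)
```

      The sleep and clock parameters exist only so the loop can be exercised without real waiting; a real change would live inside the driver's retry loop.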
      

      Note that a fixed retry delay is only one possible solution. We can also investigate others, such as exponential backoff.
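
      As a rough sketch of the exponential backoff alternative, the delay before each retry could be computed like this. The function name, the 10 millisecond base, the 500 millisecond cap, and the use of random jitter are all assumptions chosen for illustration; none of them come from the spec or from the experiment above.

```python
import random


def backoff_delay(attempt, base=0.01, cap=0.5, rng=random.random):
    """Return the delay in seconds before retry number `attempt` (1-based).

    The upper bound doubles with each attempt until it reaches `cap`, and the
    actual delay is drawn uniformly from [0, bound) so that contending
    transactions do not all retry at the same instant.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    bound = min(cap, base * (2 ** (attempt - 1)))
    return rng() * bound
```

      Randomizing the delay is one way to keep many threads contending on the same document from retrying in lockstep; whether it beats a fixed delay for this workload would need to be measured.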

            Assignee:
            Iris Ho
            Reporter:
            Shane Harvey
            Jib Adegunloye