Core Server / SERVER-30636

RangeDeleter assert failed because of replication lag

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: 3.2.14, 3.4.7
    • Fix Version/s: 3.2.17, 3.4.9
    • Component/s: Sharding
    • Labels:
      None
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Backport Requested:
      v3.2

      Description

      The problem is essentially the same as SERVER-22618, but that issue is already closed, so I am creating a new one here.

      2017-08-14T10:00:53.015+0800 I -        [RangeDeleter] Assertion: 64:waiting for replication timed out
      2017-08-14T10:00:53.018+0800 I CONTROL  [RangeDeleter]
       0x16f6552 0x16a0d28 0x16896c8 0x168977c 0xc8cad4 0x1231e5c 0x122f679 0x2001310 0x2b642e152aa1 0x2b642e45093d
      ----- BEGIN BACKTRACE -----
      {"backtrace":[{"b":"400000","o":"12F6552","s":"_ZN5mongo15printStackTraceERSo"},{"b":"400000","o":"12A0D28","s":"_ZN5mongo10logContextEPKc"},{"b":"400000","o":"12896C8","s":"_ZN5mongo11msgassertedEiPKc"},{"b":"400000","o":"128977C"},{"b":"400000","o":"88CAD4","s":"_ZN5mongo7Helpers11removeRangeEPNS_16OperationContextERKNS_8KeyRangeEbRKNS_19WriteConcernOptionsEPNS0_11RemoveSaverEbb"},{"b":"400000","o":"E31E5C","s":"_ZN5mongo17RangeDeleterDBEnv11deleteRangeEPNS_16OperationContextERKNS_16RangeDeleteEntryEPxPSs"},{"b":"400000","o":"E2F679","s":"_ZN5mongo12RangeDeleter6doWorkEv"},{"b":"400000","o":"1C01310","s":"execute_native_thread_routine"},{"b":"2B642E14B000","o":"7AA1"},{"b":"2B642E368000","o":"E893D","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.2.13", "gitVersion" : "22ec9e93b40c85fc7cae7d56e7d6a02fd811088c", "compiledModules" : [ "rocks" ], "uname" : { "sysname" : "Linux", "release" : "2.6.32-358.23.2.ali1195.el6.x86_64", "version" : "#1 SMP Wed Oct 29 20:23:37 CST 2014", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000" }, { "b" : "7FFF6E0FF000", "elfType" : 3, "buildId" : "64BEF22A9E37EEE1F8FC59AB5AE35209E49D47EA" }, { "b" : "2B642DABA000", "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "BCA23A35FA002CA6FC960E2794BCA3AE9F5BD371" }, { "b" : "2B642DCC2000", "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "F9230CF0BAC57ADEF2764E4F59E8556753907D31" }, { "b" : "2B642DEC7000", "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "C919E2ECEAE285422EC98BF3EB8C089E8DE9F1AA" }, { "b" : "2B642E14B000", "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "0E51B5AD014B2F88B369C39264F02DE3AFC38A63" }, { "b" : "2B642E368000", "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "6F269DCAFA81F126A3671240850D8EE9A5AF543A" }, { "b" : "2B642D898000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "2946077BEE84E65DD56C7732F2D89F536DBE0966" } ] }}
       mongod(_ZN5mongo15printStackTraceERSo+0x32) [0x16f6552]
       mongod(_ZN5mongo10logContextEPKc+0x138) [0x16a0d28]
       mongod(_ZN5mongo11msgassertedEiPKc+0x88) [0x16896c8]
       mongod(+0x128977C) [0x168977c]
       mongod(_ZN5mongo7Helpers11removeRangeEPNS_16OperationContextERKNS_8KeyRangeEbRKNS_19WriteConcernOptionsEPNS0_11RemoveSaverEbb+0x13F4) [0xc8cad4]
       mongod(_ZN5mongo17RangeDeleterDBEnv11deleteRangeEPNS_16OperationContextERKNS_16RangeDeleteEntryEPxPSs+0x44C) [0x1231e5c]
       mongod(_ZN5mongo12RangeDeleter6doWorkEv+0x209) [0x122f679]
       mongod(execute_native_thread_routine+0x20) [0x2001310]
       libpthread.so.0(+0x7AA1) [0x2b642e152aa1]
       libc.so.6(clone+0x6D) [0x2b642e45093d]
      -----  END BACKTRACE  -----
      

      I dove into the source code and found the cause.

      In Helpers::removeRange, the code tries to ignore the write concern timeout error by comparing the error code against ErrorCodes::ExceededTimeLimit:

       if (writeConcern.shouldWaitForOtherNodes() && numDeleted > 0) {
           repl::ReplicationCoordinator::StatusAndDuration replStatus =
               repl::getGlobalReplicationCoordinator()->awaitReplication(
                   txn,
                   repl::ReplClientInfo::forClient(txn->getClient()).getLastOp(),
                   writeConcern);
           if (replStatus.status.code() == ErrorCodes::ExceededTimeLimit) {
               warning(LogComponent::kSharding) << "replication to secondaries for removeRange at "
                                                   "least 60 seconds behind";
           } else {
               massertStatusOK(replStatus.status);
           }
           millisWaitingForReplication += replStatus.duration;
       }
      

      However, the implementation of awaitReplication returns ErrorCodes::WriteConcernFailed, not ExceededTimeLimit, when the wait times out because of replication lag:

       if (writeConcern.wTimeout != WriteConcernOptions::kNoTimeout &&
           elapsed > Milliseconds{writeConcern.wTimeout}) {
           return StatusAndDuration(
               Status(ErrorCodes::WriteConcernFailed, "waiting for replication timed out"),
               elapsed);
       }
      

      I tried to create a pull request to fix it, but found that the master branch has changed a lot. Both the latest 3.2 and 3.4 releases still have this problem.

      I am not sure why removeRange needs to assert when awaitReplication fails; perhaps we could remove the assert, or change the error code that awaitReplication returns on timeout.
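      To illustrate the mismatch, here is a minimal, self-contained sketch (not the real MongoDB types; the enum, the stand-in awaitReplication, and the wouldAssert helpers are all hypothetical) showing why the ExceededTimeLimit check never matches, and how accepting WriteConcernFailed as well would prevent the assertion:

      ```cpp
      #include <iostream>

      // Hypothetical stand-ins for the real MongoDB Status machinery,
      // reduced to the two codes that matter for this bug.
      enum class ErrorCode { OK, ExceededTimeLimit, WriteConcernFailed };

      // Models awaitReplication timing out on replication lag: it reports
      // WriteConcernFailed, not ExceededTimeLimit.
      ErrorCode awaitReplication(bool timedOut) {
          return timedOut ? ErrorCode::WriteConcernFailed : ErrorCode::OK;
      }

      // Models the check in Helpers::removeRange. Returns true if the status
      // would trip the massert (i.e. the timeout is NOT swallowed).
      bool wouldAssert(ErrorCode status) {
          if (status == ErrorCode::ExceededTimeLimit) {
              return false;  // logged as a warning, deletion continues
          }
          return status != ErrorCode::OK;  // massertStatusOK fires
      }

      // One possible fix: treat both timeout codes as the benign
      // "replication is lagging" case.
      bool wouldAssertFixed(ErrorCode status) {
          if (status == ErrorCode::ExceededTimeLimit ||
              status == ErrorCode::WriteConcernFailed) {
              return false;
          }
          return status != ErrorCode::OK;
      }

      int main() {
          ErrorCode status = awaitReplication(/*timedOut=*/true);
          std::cout << "unpatched asserts: " << wouldAssert(status) << "\n";
          std::cout << "patched asserts:   " << wouldAssertFixed(status) << "\n";
          return 0;
      }
      ```

      With the unpatched check, a lag-induced timeout always falls through to massertStatusOK and kills the RangeDeleter pass; with either fix direction (accept both codes, or change awaitReplication's return code), it is downgraded to a warning.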
