Core Server / SERVER-30636

RangeDeleter assert failed because of replication lag

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: 3.2.14, 3.4.7
    • Fix Version/s: 3.2.17, 3.4.9
    • Component/s: Sharding
    • Labels:
      None
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Backport Requested:
      v3.2

      Description

      The problem is essentially the same as SERVER-22618, but that issue is already closed, so I am creating a new one here.

      2017-08-14T10:00:53.015+0800 I -        [RangeDeleter] Assertion: 64:waiting for replication timed out
      2017-08-14T10:00:53.018+0800 I CONTROL  [RangeDeleter]
       0x16f6552 0x16a0d28 0x16896c8 0x168977c 0xc8cad4 0x1231e5c 0x122f679 0x2001310 0x2b642e152aa1 0x2b642e45093d
      ----- BEGIN BACKTRACE -----
      {"backtrace":[{"b":"400000","o":"12F6552","s":"_ZN5mongo15printStackTraceERSo"},{"b":"400000","o":"12A0D28","s":"_ZN5mongo10logContextEPKc"},{"b":"400000","o":"12896C8","s":"_ZN5mongo11msgassertedEiPKc"},{"b":"400000","o":"128977C"},{"b":"400000","o":"88CAD4","s":"_ZN5mongo7Helpers11removeRangeEPNS_16OperationContextERKNS_8KeyRangeEbRKNS_19WriteConcernOptionsEPNS0_11RemoveSaverEbb"},{"b":"400000","o":"E31E5C","s":"_ZN5mongo17RangeDeleterDBEnv11deleteRangeEPNS_16OperationContextERKNS_16RangeDeleteEntryEPxPSs"},{"b":"400000","o":"E2F679","s":"_ZN5mongo12RangeDeleter6doWorkEv"},{"b":"400000","o":"1C01310","s":"execute_native_thread_routine"},{"b":"2B642E14B000","o":"7AA1"},{"b":"2B642E368000","o":"E893D","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.2.13", "gitVersion" : "22ec9e93b40c85fc7cae7d56e7d6a02fd811088c", "compiledModules" : [ "rocks" ], "uname" : { "sysname" : "Linux", "release" : "2.6.32-358.23.2.ali1195.el6.x86_64", "version" : "#1 SMP Wed Oct 29 20:23:37 CST 2014", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000" }, { "b" : "7FFF6E0FF000", "elfType" : 3, "buildId" : "64BEF22A9E37EEE1F8FC59AB5AE35209E49D47EA" }, { "b" : "2B642DABA000", "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "BCA23A35FA002CA6FC960E2794BCA3AE9F5BD371" }, { "b" : "2B642DCC2000", "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "F9230CF0BAC57ADEF2764E4F59E8556753907D31" }, { "b" : "2B642DEC7000", "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "C919E2ECEAE285422EC98BF3EB8C089E8DE9F1AA" }, { "b" : "2B642E14B000", "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "0E51B5AD014B2F88B369C39264F02DE3AFC38A63" }, { "b" : "2B642E368000", "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "6F269DCAFA81F126A3671240850D8EE9A5AF543A" }, { "b" : "2B642D898000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "2946077BEE84E65DD56C7732F2D89F536DBE0966" } ] }}
       mongod(_ZN5mongo15printStackTraceERSo+0x32) [0x16f6552]
       mongod(_ZN5mongo10logContextEPKc+0x138) [0x16a0d28]
       mongod(_ZN5mongo11msgassertedEiPKc+0x88) [0x16896c8]
       mongod(+0x128977C) [0x168977c]
       mongod(_ZN5mongo7Helpers11removeRangeEPNS_16OperationContextERKNS_8KeyRangeEbRKNS_19WriteConcernOptionsEPNS0_11RemoveSaverEbb+0x13F4) [0xc8cad4]
       mongod(_ZN5mongo17RangeDeleterDBEnv11deleteRangeEPNS_16OperationContextERKNS_16RangeDeleteEntryEPxPSs+0x44C) [0x1231e5c]
       mongod(_ZN5mongo12RangeDeleter6doWorkEv+0x209) [0x122f679]
       mongod(execute_native_thread_routine+0x20) [0x2001310]
       libpthread.so.0(+0x7AA1) [0x2b642e152aa1]
       libc.so.6(clone+0x6D) [0x2b642e45093d]
      -----  END BACKTRACE  -----
      

      I dove into the source code and found the cause.

      In Helpers::removeRange, the code tries to ignore the write concern timeout error by comparing the error code against ErrorCodes::ExceededTimeLimit:

       if (writeConcern.shouldWaitForOtherNodes() && numDeleted > 0) {
           repl::ReplicationCoordinator::StatusAndDuration replStatus =
               repl::getGlobalReplicationCoordinator()->awaitReplication(
                   txn,
                   repl::ReplClientInfo::forClient(txn->getClient()).getLastOp(),
                   writeConcern);
           if (replStatus.status.code() == ErrorCodes::ExceededTimeLimit) {
               warning(LogComponent::kSharding) << "replication to secondaries for removeRange at "
                                                   "least 60 seconds behind";
           } else {
               massertStatusOK(replStatus.status);
           }
           millisWaitingForReplication += replStatus.duration;
       }
      

      However, the implementation of awaitReplication returns ErrorCodes::WriteConcernFailed, not ExceededTimeLimit, when the wait times out because of replication lag:

       if (writeConcern.wTimeout != WriteConcernOptions::kNoTimeout &&
           elapsed > Milliseconds{writeConcern.wTimeout}) {
           return StatusAndDuration(
               Status(ErrorCodes::WriteConcernFailed, "waiting for replication timed out"),
               elapsed);
       }
      

      I tried to create a pull request to fix it, but found that the master branch has changed a lot. Both the latest 3.2 and 3.4 releases still have this problem.

      I am not sure why removeRange needs to assert when awaitReplication fails; perhaps we could remove the assert, or change the error code that awaitReplication returns on timeout.
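      To illustrate the mismatch, here is a minimal, self-contained sketch (not the real MongoDB types; the enum, the stand-in awaitReplication, and the wouldAssert helpers are all hypothetical) showing why the ExceededTimeLimit check never matches, and how accepting WriteConcernFailed as well would prevent the assertion:

      ```cpp
      #include <iostream>

      // Hypothetical stand-ins for the real MongoDB Status machinery,
      // reduced to the two codes that matter for this bug.
      enum class ErrorCode { OK, ExceededTimeLimit, WriteConcernFailed };

      // Models awaitReplication timing out on replication lag: it reports
      // WriteConcernFailed, not ExceededTimeLimit.
      ErrorCode awaitReplication(bool timedOut) {
          return timedOut ? ErrorCode::WriteConcernFailed : ErrorCode::OK;
      }

      // Models the check in Helpers::removeRange. Returns true if the status
      // would trip the massert (i.e. the timeout is NOT swallowed).
      bool wouldAssert(ErrorCode status) {
          if (status == ErrorCode::ExceededTimeLimit) {
              return false;  // logged as a warning, deletion continues
          }
          return status != ErrorCode::OK;  // massertStatusOK fires
      }

      // One possible fix: treat both timeout codes as the benign
      // "replication is lagging" case.
      bool wouldAssertFixed(ErrorCode status) {
          if (status == ErrorCode::ExceededTimeLimit ||
              status == ErrorCode::WriteConcernFailed) {
              return false;
          }
          return status != ErrorCode::OK;
      }

      int main() {
          ErrorCode status = awaitReplication(/*timedOut=*/true);
          std::cout << "unpatched asserts: " << wouldAssert(status) << "\n";
          std::cout << "patched asserts:   " << wouldAssertFixed(status) << "\n";
          return 0;
      }
      ```

      With the unpatched check, a lag-induced timeout always falls through to massertStatusOK and kills the RangeDeleter pass; with either fix direction (accept both codes, or change awaitReplication's return code), it is downgraded to a warning.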
