[SERVER-30636] RangeDeleter assert failed because of replication lag Created: 14/Aug/17  Updated: 30/Oct/23  Resolved: 28/Aug/17

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.2.14, 3.4.7
Fix Version/s: 3.2.17, 3.4.9

Type: Bug Priority: Major - P3
Reporter: Zhang Youdong Assignee: Nathan Myers
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v3.2
Participants:
Case:

 Description   

The problem is essentially the same as SERVER-22618, but that issue is already closed, so I am creating a new one here.

2017-08-14T10:00:53.015+0800 I -        [RangeDeleter] Assertion: 64:waiting for replication timed out
2017-08-14T10:00:53.018+0800 I CONTROL  [RangeDeleter]
 0x16f6552 0x16a0d28 0x16896c8 0x168977c 0xc8cad4 0x1231e5c 0x122f679 0x2001310 0x2b642e152aa1 0x2b642e45093d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"12F6552","s":"_ZN5mongo15printStackTraceERSo"},{"b":"400000","o":"12A0D28","s":"_ZN5mongo10logContextEPKc"},{"b":"400000","o":"12896C8","s":"_ZN5mongo11msgassertedEiPKc"},{"b":"400000","o":"128977C"},{"b":"400000","o":"88CAD4","s":"_ZN5mongo7Helpers11removeRangeEPNS_16OperationContextERKNS_8KeyRangeEbRKNS_19WriteConcernOptionsEPNS0_11RemoveSaverEbb"},{"b":"400000","o":"E31E5C","s":"_ZN5mongo17RangeDeleterDBEnv11deleteRangeEPNS_16OperationContextERKNS_16RangeDeleteEntryEPxPSs"},{"b":"400000","o":"E2F679","s":"_ZN5mongo12RangeDeleter6doWorkEv"},{"b":"400000","o":"1C01310","s":"execute_native_thread_routine"},{"b":"2B642E14B000","o":"7AA1"},{"b":"2B642E368000","o":"E893D","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.2.13", "gitVersion" : "22ec9e93b40c85fc7cae7d56e7d6a02fd811088c", "compiledModules" : [ "rocks" ], "uname" : { "sysname" : "Linux", "release" : "2.6.32-358.23.2.ali1195.el6.x86_64", "version" : "#1 SMP Wed Oct 29 20:23:37 CST 2014", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000" }, { "b" : "7FFF6E0FF000", "elfType" : 3, "buildId" : "64BEF22A9E37EEE1F8FC59AB5AE35209E49D47EA" }, { "b" : "2B642DABA000", "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "BCA23A35FA002CA6FC960E2794BCA3AE9F5BD371" }, { "b" : "2B642DCC2000", "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "F9230CF0BAC57ADEF2764E4F59E8556753907D31" }, { "b" : "2B642DEC7000", "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "C919E2ECEAE285422EC98BF3EB8C089E8DE9F1AA" }, { "b" : "2B642E14B000", "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "0E51B5AD014B2F88B369C39264F02DE3AFC38A63" }, { "b" : "2B642E368000", "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "6F269DCAFA81F126A3671240850D8EE9A5AF543A" }, { "b" : "2B642D898000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "2946077BEE84E65DD56C7732F2D89F536DBE0966" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x32) [0x16f6552]
 mongod(_ZN5mongo10logContextEPKc+0x138) [0x16a0d28]
 mongod(_ZN5mongo11msgassertedEiPKc+0x88) [0x16896c8]
 mongod(+0x128977C) [0x168977c]
 mongod(_ZN5mongo7Helpers11removeRangeEPNS_16OperationContextERKNS_8KeyRangeEbRKNS_19WriteConcernOptionsEPNS0_11RemoveSaverEbb+0x13F4) [0xc8cad4]
 mongod(_ZN5mongo17RangeDeleterDBEnv11deleteRangeEPNS_16OperationContextERKNS_16RangeDeleteEntryEPxPSs+0x44C) [0x1231e5c]
 mongod(_ZN5mongo12RangeDeleter6doWorkEv+0x209) [0x122f679]
 mongod(execute_native_thread_routine+0x20) [0x2001310]
 libpthread.so.0(+0x7AA1) [0x2b642e152aa1]
 libc.so.6(clone+0x6D) [0x2b642e45093d]
-----  END BACKTRACE  -----

I dug into the source code and found the cause.

In Helpers::removeRange, the code tries to ignore the write concern timeout error by comparing the error code against ErrorCodes::ExceededTimeLimit:

    if (writeConcern.shouldWaitForOtherNodes() && numDeleted > 0) {
        repl::ReplicationCoordinator::StatusAndDuration replStatus =
            repl::getGlobalReplicationCoordinator()->awaitReplication(
                txn,
                repl::ReplClientInfo::forClient(txn->getClient()).getLastOp(),
                writeConcern);
        if (replStatus.status.code() == ErrorCodes::ExceededTimeLimit) {
            warning(LogComponent::kSharding) << "replication to secondaries for removeRange at "
                                                "least 60 seconds behind";
        } else {
            massertStatusOK(replStatus.status);
        }
        millisWaitingForReplication += replStatus.duration;
    }

But the implementation of awaitReplication returns ErrorCodes::WriteConcernFailed when it times out because of replication lag:

    if (writeConcern.wTimeout != WriteConcernOptions::kNoTimeout &&
        elapsed > Milliseconds{writeConcern.wTimeout}) {
        return StatusAndDuration(
            Status(ErrorCodes::WriteConcernFailed, "waiting for replication timed out"),
            elapsed);
    }

I am trying to create a pull request to fix it, but I find that the master branch has changed a lot; however, the latest versions of both 3.2 and 3.4 still have this problem.

I am not sure why removeRange needs to assert when awaitReplication fails; maybe we can remove the assert or change the error code returned by awaitReplication.
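
For illustration, here is a minimal sketch of the first option (not a final patch): instead of asserting, removeRange could tolerate WriteConcernFailed the same way it already tolerates ExceededTimeLimit.

    if (writeConcern.shouldWaitForOtherNodes() && numDeleted > 0) {
        repl::ReplicationCoordinator::StatusAndDuration replStatus =
            repl::getGlobalReplicationCoordinator()->awaitReplication(
                txn,
                repl::ReplClientInfo::forClient(txn->getClient()).getLastOp(),
                writeConcern);
        // Sketch only: also accept WriteConcernFailed, which awaitReplication
        // actually returns when wTimeout expires, instead of asserting on it.
        if (replStatus.status.code() == ErrorCodes::ExceededTimeLimit ||
            replStatus.status.code() == ErrorCodes::WriteConcernFailed) {
            warning(LogComponent::kSharding) << "replication to secondaries for removeRange at "
                                                "least 60 seconds behind";
        } else {
            massertStatusOK(replStatus.status);
        }
        millisWaitingForReplication += replStatus.duration;
    }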



 Comments   
Comment by bhargav [ 16/Oct/17 ]

After upgrading to 3.4 the replica set is not working; below are the logs. We have a fresh dataset containing only the system log and table.

{ "ok" : 0, "errmsg" : "'10.10.10.25:27017' has data already, cannot initiate set.", "code" : 110, "codeName" : "CannotInitializeNodeWithData" }

We have followed the same steps as in the previous version.

Thanks,
Bhargav

Comment by Githook User [ 29/Aug/17 ]

Author:

{'name': 'Nathan Myers', 'username': 'nathan-myers-mongo', 'email': 'nathan.myers@10gen.com'}

Message: Merge pull request zyd_com/SERVER-30636

Backport SERVER-30636 b39e21c774d0ff3afeac1730f08d8defed38044d (v3.4) to v3.2
Thanks to Youdong Zhang
Branch: v3.2
https://github.com/mongodb/mongo/commit/281b824dc09f5a104e39ff4538dd95f2fb3ef902

Comment by Nathan Myers [ 28/Aug/17 ]

zyd_com Yes, the PR does completely fix the problem. It will go out in 3.4.8, and is scheduled for a 3.2 release. Thank you again for the patch and the work behind it.

Comment by Githook User [ 28/Aug/17 ]

Author:

{'username': 'nathan-myers-mongo', 'name': 'Nathan Myers', 'email': 'nathan.myers@10gen.com'}

Message: SERVER-30636 (lint)
Branch: v3.4
https://github.com/mongodb/mongo/commit/b39e21c774d0ff3afeac1730f08d8defed38044d

Comment by Githook User [ 28/Aug/17 ]

Author:

{'username': 'nathan-myers-mongo', 'name': 'Nathan Myers', 'email': 'nathan.myers@10gen.com'}

Message: SERVER-30636 Detect range deletion write concern timeout correctly
Branch: v3.4
https://github.com/mongodb/mongo/commit/d30891747be8d61e3b5a8f07451cee08d5ba2d38

Comment by Zhang Youdong [ 23/Aug/17 ]

@Nathan Myers

What's the progress on this issue? Does the PR fix the problem?

Comment by Nathan Myers [ 17/Aug/17 ]

zyd_com

No, your fix is correct. Without it, the range deleter will skip deleting remaining documents in the range.

Comment by Zhang Youdong [ 17/Aug/17 ]

@Nathan Myers

I thought the assertion was by design, so the PR just skips the WriteConcernFailed error code.

It's much better to fix the problem by catching the exception, thank you.

Comment by Nathan Myers [ 16/Aug/17 ]

Thank you for the PR. I agree that this should fix the immediate problem.

I have investigated the master branch, and this problem cannot arise in the new range deletion system in 3.5.

In releases 3.2 and 3.4, the massertStatusOK statement (like its replacement in 3.4, uassertStatusOK) throws an exception that is not handled in the range deleter thread function itself; it is caught in deleteRange, producing the backtrace and aborting the range deletion. A complete fix would add code to catch the exception so that the failure does not abort the rest of the deletion. I will do that code, and I will keep this ticket open until we have that fix in place.
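
Roughly the kind of guard I have in mind, as an illustrative sketch only; the member and local names below follow the 3.2/3.4 RangeDeleter code loosely and are placeholders, not exact names from the tree:

    // Sketch only -- not the actual patch. Inside the range deleter worker,
    // tolerate a failure coming out of deleteRange() instead of letting the
    // thrown assertion abort the remaining range deletion work.
    try {
        long long numDeleted = 0;
        std::string errMsg;
        // RangeDeleterDBEnv::deleteRange calls Helpers::removeRange, whose
        // massertStatusOK throws when waiting for replication times out.
        if (!_env->deleteRange(txn, *entry, &numDeleted, &errMsg)) {
            warning() << "range deletion failed: " << errMsg;
        }
    } catch (const DBException& ex) {
        warning() << "range deletion stopped by exception: " << ex.toString();
    }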

Comment by Zhang Youdong [ 16/Aug/17 ]

I created a pull request for v3.4, and it can also be used for v3.2. For details see https://github.com/mongodb/mongo/pull/1171

Comment by Nathan Myers [ 15/Aug/17 ]

Thank you, YZ. A PR just for 3.2 and/or 3.4 would be welcome. The 3.6 code is very different, but I will see if it needs similar attention.

Comment by Ramon Fernandez Marina [ 14/Aug/17 ]

Thanks for the detailed report zyd_com, we're investigating.
