[SERVER-28877] Cancel race in replication executor can cause elections to assert. Created: 20/Apr/17  Updated: 30/Oct/23  Resolved: 18/May/17

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.4.3
Fix Version/s: 3.2.15, 3.4.6, 3.5.8

Type: Bug Priority: Major - P3
Reporter: Bodenhaltung [X] Assignee: Matthew Russotto
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 14.04.5 LTS
db version v3.4.3
git version: f07437fb5a6cca07c10bafa78365456eb1d6d5e1
OpenSSL version: OpenSSL 1.0.2k 26 Jan 2017
allocator: tcmalloc
modules: none
build environment:
distmod: ubuntu1404
distarch: x86_64
target_arch: x86_64


Attachments: Text File rs.status.txt    
Issue Links:
Backports
Depends
Related
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v3.4, v3.2
Sprint: Repl 2017-05-29
Participants:
Linked BF Score: 51

 Description   

Output from mongodb.log:

2017-04-19T13:15:23.807Z I COMMAND  [conn50682] command local.oplog.rs command: getMore { getMore: 16914237176, collection: "oplog.rs", maxTimeMS: 5000, term: 161, lastKnownCommittedOpTime: { ts: Timestamp 1492607700000|2, t: 161 } } originatingCommand: { find: "oplog.rs", filter: { ts: { $gte: Timestamp 1492156061000|1 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 161 }
 planSummary: COLLSCAN cursorid:16914237176 keysExamined:0 docsExamined:0 numYields:1 nreturned:0 reslen:292 locks:{ Global: { acquireCount: { r: 6 } }, Database: { acquireCount: { r: 3 } }, oplog: { acquireCount: { r: 3 } } } protocol:op_command 6625ms
2017-04-19T13:15:46.612Z I COMMAND  [conn51155] command local.oplog.rs command: getMore { getMore: 16914237176, collection: "oplog.rs", maxTimeMS: 5000, term: 161, lastKnownCommittedOpTime: { ts: Timestamp 1492607722000|1, t: 161 } } originatingCommand: { find: "oplog.rs", filter: { ts: { $gte: Timestamp 1492156061000|1 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 161 }
 planSummary: COLLSCAN cursorid:16914237176 keysExamined:0 docsExamined:0 numYields:0 nreturned:0 reslen:292 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, oplog: { acquireCount: { r: 2 } } } protocol:op_command 17700ms
2017-04-19T13:15:46.614Z I REPL     [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
2017-04-19T13:15:46.614Z I REPL     [ReplicationExecutor] conducting a dry run election to see if we could be elected
2017-04-19T13:15:46.614Z I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to mdbfh:27017
2017-04-19T13:15:46.614Z I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to mdb1:27017
2017-04-19T13:15:46.614Z I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to mdbarb:27017
2017-04-19T13:15:46.656Z I NETWORK  [conn51178] received client metadata from 37.120.167.74:33775 conn51178: { driver: { name: "NetworkInterfaceASIO-Replication", version: "3.4.2" }, os: { type: "Linux",
name: "Ubuntu", architecture: "x86_64", version: "14.04" } }
2017-04-19T13:15:46.670Z I NETWORK  [conn51179] received client metadata from 84.39.96.207:38332 conn51179: { driver: { name: "NetworkInterfaceASIO-Replication", version: "3.4.3" }, os: { type: "Linux", n
ame: "PRETTY_NAME="Debian GNU/Linux 8 (jessie)"", architecture: "x86_64", version: "Kernel 2.6.32-042stab120.18" } }
2017-04-19T13:15:46.740Z I REPL     [ReplicationExecutor] Error in heartbeat request to mdb1:27017; ExceededTimeLimit: Remote command timed out while waiting to get a connection from the pool, took 12407ms, timeout was set to 10000ms
2017-04-19T13:15:46.742Z I REPL     [ReplicationExecutor] Error in heartbeat request to mdbfh:27017; ExceededTimeLimit: Remote command timed out while waiting to get a connection from the pool, took 12532ms, timeout was set to 10000ms
2017-04-19T13:15:46.742Z I REPL     [ReplicationExecutor] Error in heartbeat request to mdbarb:27017; ExceededTimeLimit: Remote command timed out while waiting to get a connection from the pool, took 12533ms, timeout was set to 10000ms
2017-04-19T13:15:46.938Z I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to mdb1:27017
2017-04-19T13:15:46.942Z I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to mdbfh:27017
2017-04-19T13:15:46.942Z I REPL     [ReplicationExecutor] VoteRequester(term 161 dry run) received a no vote from mdb1:27017 with reason "candidate's data is staler than mine"; response message: { term: 161, voteGranted: false, reason: "candidate's data is staler than mine", ok: 1.0 }
2017-04-19T13:15:46.963Z I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to mdbarb:27017
2017-04-19T13:15:46.975Z I REPL     [ReplicationExecutor] VoteRequester(term 161 dry run) received a no vote from mdbarb:27017 with reason "can see a healthy primary of equal or greater priority"; response message: { term: 161, voteGranted: false, reason: "can see a healthy primary of equal or greater priority", ok: 1.0 }
2017-04-19T13:15:46.996Z I -        [ReplicationExecutor] Invariant failure !_callbacks.empty() src/mongo/db/repl/scatter_gather_runner.cpp 149
2017-04-19T13:15:46.997Z I -        [ReplicationExecutor]
 
***aborting after invariant() failure
 
2017-04-19T13:15:47.344Z F -        [ReplicationExecutor] Got signal: 6 (Aborted).
 
 0x7f12b5fb2981 0x7f12b5fb1a79 0x7f12b5fb1f5d 0x7f12b362f330 0x7f12b328dc37 0x7f12b3291028 0x7f12b5243b9c 0x7f12b5a6b654 0x7f12b5a31ba1 0x7f12b5a2e719 0x7f12b5a36784 0x7f12b6a24510 0x7f12b3627184 0x7f12b3354bed
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"7F12B4A34000","o":"157E981","s":"_ZN5mongo15printStackTraceERSo"},{"b":"7F12B4A34000","o":"157DA79"},{"b":"7F12B4A34000","o":"157DF5D"},{"b":"7F12B361F000","o":"10330"},{"b":"7F12B3257000","o":"36C37","s":"gsignal"},{"b":"7F12B3257000","o":"3A028","s":"abort"},{"b":"7F12B4A34000","o":"80FB9C","s":"_ZN5mongo17invariantOKFailedEPKcRKNS_6StatusES1_j"},{"b":"7F12B4A34000","o":"1037654","s":"_ZN5mongo4repl19ScatterGatherRunner10RunnerImpl15processResponseERKNS_8executor12TaskExecutor25RemoteCommandCallbackArgsE"},{"b":"7F12B4A34000","o":"FFDBA1"},{"b":"7F12B4A34000","o":"FFA719"},{"b":"7F12B4A34000","o":"1002784","s":"_ZN5mongo4repl19ReplicationExecutor3runEv"},{"b":"7F12B4A34000","o":"1FF0510"},{"b":"7F12B361F000","o":"8184"},{"b":"7F12B3257000","o":"FDBED","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.4.3", "gitVersion" : "f07437fb5a6cca07c10bafa78365456eb1d6d5e1", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "3.13.0-112-generic", "version" : "#159-Ubuntu SMP Fri Mar 3 15:26:07 UTC 2017", "machine" : "x86_64" }, "somap" : [ { "b" : "7F12B4A34000", "elfType" : 3, "buildId" : "CE26BD3A6C680DD87D137892B07C2EEFC1162144" }, { "b" : "7FFFC5A9B000", "elfType" : 3, "buildId" : "012E1338BA43AF7C0DC7D069F64F0A6490CC6D9C" }, { "b" : "7F12B45A7000", "path" : "/usr/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "58087C35580286708FF6469B51886E5FF5740101" }, { "b" : "7F12B4165000", "path" : "/usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "1C949DBD43FCE50C9B691B7EA8F31229891B2218" }, { "b" : "7F12B3F5D000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "1EEBA762A6A2C8884D56033EE8CCE79B95CD974D" }, { "b" : "7F12B3D59000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "D0F881E59FF88BE4F29A228C8657376B3C325C2C" }, { "b" : "7F12B3A53000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "1654CB13B1D24ED03F4BDCB51FC7524B9181A771" }, { "b" : "7F12B383D000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "36311B4457710AE5578C4BF00791DED7359DBB92" }, { "b" : "7F12B361F000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "22F9078CFA529CCE1A814A4A1A1C018F169D5652" }, { "b" : "7F12B3257000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "CA5C6CFE528AF541C3C2C15CEE4B3C74DA4E2FB4" }, { "b" : "7F12B480F000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "237E22E5AAC2DDFCD06518F63FD720FE758E6E5B" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x41) [0x7f12b5fb2981]
 mongod(+0x157DA79) [0x7f12b5fb1a79]
 mongod(+0x157DF5D) [0x7f12b5fb1f5d]
 libpthread.so.0(+0x10330) [0x7f12b362f330]
 libc.so.6(gsignal+0x37) [0x7f12b328dc37]
 libc.so.6(abort+0x148) [0x7f12b3291028]
 mongod(_ZN5mongo17invariantOKFailedEPKcRKNS_6StatusES1_j+0x0) [0x7f12b5243b9c]
 mongod(_ZN5mongo4repl19ScatterGatherRunner10RunnerImpl15processResponseERKNS_8executor12TaskExecutor25RemoteCommandCallbackArgsE+0x254) [0x7f12b5a6b654]
 mongod(+0xFFDBA1) [0x7f12b5a31ba1]
 mongod(+0xFFA719) [0x7f12b5a2e719]
 mongod(_ZN5mongo4repl19ReplicationExecutor3runEv+0x274) [0x7f12b5a36784]
 mongod(+0x1FF0510) [0x7f12b6a24510]
 libpthread.so.0(+0x8184) [0x7f12b3627184]
 libc.so.6(clone+0x6D) [0x7f12b3354bed]
-----  END BACKTRACE  -----

My Replica Setup:
4 Member:
1 Arbiter with default prio 1
3 "Worker" with prio 1, 2, 2



 Comments   
Comment by Githook User [ 15/Jun/17 ]

Author:

{u'username': u'mtrussotto', u'name': u'Matthew Russotto', u'email': u'matthew.russotto@10gen.com'}

Message: SERVER-28877 Fix cancel race which can cause crashes during elections.

(cherry picked from commit 7844ffc9be25e16daa9615949941b8749ad3de9f)
Branch: v3.2
https://github.com/mongodb/mongo/commit/daa09e3bfa5382cfc51de3f339a486cfc8aed219

Comment by Githook User [ 14/Jun/17 ]

Author:

{u'username': u'mtrussotto', u'name': u'Matthew Russotto', u'email': u'matthew.russotto@10gen.com'}

Message: SERVER-28877 Fix cancel race which can cause crashes during elections.

(cherry picked from commit 7844ffc9be25e16daa9615949941b8749ad3de9f)
Branch: v3.4
https://github.com/mongodb/mongo/commit/ca14d6a53664d228d0115be55a33f5fcd11d7fd2

Comment by Githook User [ 18/May/17 ]

Author:

{u'username': u'mtrussotto', u'name': u'Matthew Russotto', u'email': u'matthew.russotto@10gen.com'}

Message: SERVER-28877 Fix cancel race which can cause crashes during elections.
Branch: master
https://github.com/mongodb/mongo/commit/7844ffc9be25e16daa9615949941b8749ad3de9f

Comment by Bodenhaltung [X] [ 20/Apr/17 ]

Sure, please see attached the rs.status() output.

Comment by Spencer Brody (Inactive) [ 20/Apr/17 ]

Hi Bodenhaltung,
We are looking through the logs and code to determine if we can identify the source of the crash. In the meantime, could you please attach the replica set config in use to aid in our diagnostics?. This can be obtained by running rs.status() in the shell when connected to one of the nodes in the set.

Thanks,
-Spencer

Comment by Bodenhaltung [X] [ 20/Apr/17 ]

One member runs on "Debian GNU/Linux 8.7", but all with the same mongod Version.

Generated at Thu Feb 08 04:19:18 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.