[SERVER-15972] Replication state sometimes transitions from PRIMARY to SECONDARY while writes are in progress Created: 22/Oct/14  Updated: 19/Nov/14  Resolved: 07/Nov/14

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.7.8
Fix Version/s: 2.8.0-rc0

Type: Bug Priority: Critical - P2
Reporter: Ian Whalen (Inactive) Assignee: Spencer Brody (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-15832 Primary server steps down without hol... Closed
Related
is related to SERVER-15881 ReplCoordinatorImpl invokes forwardSl... Closed
Operating System: ALL
Participants:
Linked BF Score: 0

 Description   

When the topology coordinator chooses to step down in response to a heartbeat, it changes its own state immediately. The replication coordinator then schedules a callback to change its own state under the exclusive lock. However, it is possible that before this callback executes, another callback runs that updates the replication executor's state, but without the exclusive lock. This is an error. A reasonable solution would be for the topology coordinator to enter a "going to step down" state, but defer actual stepdown until the replication coordinator can be updated synchronously.

Original description of symptoms follows.

this fassert appears to have happened before:

TASK HISTORY (load older entries to see previous failures)

LOGS

m31000| 2014-10-22T00:49:27.419+0000 I COMMANDS [conn1] insert fgIndexSec.jstests_fgsec query: { _id: ObjectId('5446ff0445925d0f7bf0b56f'), i: 297862.0 } ninserted:1 keyUpdates:0 numYields:0  235ms
 m31000| 2014-10-22T00:49:27.422+0000 I REPLSETS Standing for election
 m31000| 2014-10-22T00:49:27.423+0000 I REPLSETS replSet info electSelf
 m31000| 2014-10-22T00:49:27.423+0000 I REPLSETS replSet election succeeded, assuming primary role
 m31002| 2014-10-22T00:49:27.423+0000 I REPLSETS [conn3] replSet received elect msg { replSetElect: 1, set: "fgIndex", who: "ip-10-123-133-157:31000", whoid: 0, cfgver: 1, round: ObjectId('5446ff175480529d829ad261') }
 m31002| 2014-10-22T00:49:27.423+0000 I REPLSETS replSetElect voting yea for ip-10-123-133-157:31000 (0)
 m31000| 2014-10-22T00:49:27.429+0000 F REPLSETS [conn1] replSet error : logOp() but can't accept write to collection fgIndexSec.jstests_fgsec
 m31000| 2014-10-22T00:49:27.431+0000 I -        [conn1] Fatal Assertion 17405
 m31000| 2014-10-22T00:49:27.447+0000 I -        [conn1] 
 m31000|  0x8e6ea23 0x8e06258 0x8de6d76 0x8a1b8c2 0x8a1ac05 0x8644238 0x86448db 0x8645e5c 0x86466ad 0x864a15c 0x866edba 0x8670f77 0x86724ea 0x89494ad 0x87c219d 0x87c4114 0x83ac3f3 0x8e202e7 0xe34939 0x5837ae
 m31000| ----- BEGIN BACKTRACE -----
 m31000| {"backtrace":[{"b":"8048000","o":"E26A23"},{"b":"8048000","o":"DBE258"},{"b":"8048000","o":"D9ED76"},{"b":"8048000","o":"9D38C2"},{"b":"8048000","o":"9D2C05"},{"b":"8048000","o":"5FC238"},{"b":"8048000","o":"5FC8DB"},{"b":"8048000","o":"5FDE5C"},{"b":"8048000","o":"5FE6AD"},{"b":"8048000","o":"60215C"},{"b":"8048000","o":"626DBA"},{"b":"8048000","o":"628F77"},{"b":"8048000","o":"62A4EA"},{"b":"8048000","o":"9014AD"},{"b":"8048000","o":"77A19D"},{"b":"8048000","o":"77C114"},{"b":"8048000","o":"3643F3"},{"b":"8048000","o":"DD82E7"},{"b":"E2F000","o":"5939"},{"b":"4AC000","o":"D77AE"}],"processInfo":{ "mongodbVersion" : "2.7.8-pre-", "gitVersion" : "9fbc9684ce381f5ef49709696ad0eb69f521d93c", "uname" : { "sysname" : "Linux", "release" : "2.6.18-194.el5xen", "version" : "#1 SMP Tue Mar 16 22:08:06 EDT 2010", "machine" : "i686" }, "somap" : [ { "elfType" : 2, "b" : "8048000" }, { "b" : "278000", "elfType" : 3 }, { "b" : "DC0000", "path" : "/lib/i686/nosegneg/librt.so.1", "elfType" : 3 }, { "b" : "76F000", "path" : "/lib/libdl.so.2", "elfType" : 3 }, { "b" : "286000", "path" : "/usr/lib/libstdc++.so.6", "elfType" : 3 }, { "b" : "120000", "path" : "/lib/i686/nosegneg/libm.so.6", "elfType" : 3 }, { "b" : "181000", "path" : "/lib/libgcc_s.so.1", "elfType" : 3 }, { "b" : "E2F000", "path" : "/lib/i686/nosegneg/libpthread.so.0", "elfType" : 3 }, { "b" : "4AC000", "path" : "/lib/i686/nosegneg/libc.so.6", "elfType" : 3 }, { "b" : "48F000", "path" : "/lib/ld-linux.so.2", "elfType" : 3 } ] }}
 m31000|  mongod(_ZN5mongo15printStackTraceERSo+0x33) [0x8e6ea23]
 m31000|  mongod(_ZN5mongo10logContextEPKc+0xC8) [0x8e06258]
 m31000|  mongod(_ZN5mongo13fassertFailedEi+0xA6) [0x8de6d76]
 m31000|  mongod(+0x9D38C2) [0x8a1b8c2]
 m31000|  mongod(_ZN5mongo4repl5logOpEPNS_16OperationContextEPKcS4_RKNS_7BSONObjEPS5_Pbb+0x135) [0x8a1ac05]
 m31000|  mongod(_ZN5mongo18WriteBatchExecutor13execOneInsertEPNS0_16ExecInsertsStateEPPNS_16WriteErrorDetailE+0x2A8) [0x8644238]
 m31000|  mongod(_ZN5mongo18WriteBatchExecutor11execInsertsERKNS_21BatchedCommandRequestEPSt6vectorIPNS_16WriteErrorDetailESaIS6_EE+0x2CB) [0x86448db]
 m31000|  mongod(_ZN5mongo18WriteBatchExecutor11bulkExecuteERKNS_21BatchedCommandRequestEPSt6vectorIPNS_19BatchedUpsertDetailESaIS6_EEPS4_IPNS_16WriteErrorDetailESaISB_EE+0x3C) [0x8645e5c]
 m31000|  mongod(_ZN5mongo18WriteBatchExecutor12executeBatchERKNS_21BatchedCommandRequestEPNS_22BatchedCommandResponseE+0x4ED) [0x86466ad]
 m31000|  mongod(_ZN5mongo8WriteCmd3runEPNS_16OperationContextERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x1EC) [0x864a15c]
 m31000|  mongod(_ZN5mongo12_execCommandEPNS_16OperationContextEPNS_7CommandERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x5A) [0x866edba]
 m31000|  mongod(_ZN5mongo7Command11execCommandEPNS_16OperationContextEPS0_iPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0xEE7) [0x8670f77]
 m31000|  mongod(_ZN5mongo12_runCommandsEPNS_16OperationContextEPKcRNS_7BSONObjERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x8CA) [0x86724ea]
 m31000|  mongod(_ZN5mongo11newRunQueryEPNS_16OperationContextERNS_7MessageERNS_12QueryMessageERNS_5CurOpES3_b+0x25AD) [0x89494ad]
 m31000|  mongod(+0x77A19D) [0x87c219d]
 m31000|  mongod(_ZN5mongo16assembleResponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortEb+0xB74) [0x87c4114]
 m31000|  mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0xB3) [0x83ac3f3]
 m31000|  mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x4B7) [0x8e202e7]
 m31000|  libpthread.so.0(+0x5939) [0xe34939]
 m31000|  libc.so.6(clone+0x5E) [0x5837ae]
 m31000| -----  END BACKTRACE  -----



 Comments   
Comment by Githook User [ 07/Nov/14 ]

Author:

{u'username': u'stbrody', u'name': u'Spencer T Brody', u'email': u'spencer@mongodb.com'}

Message: SERVER-15972 Prevent stepping down without holding the global lock
Branch: master
https://github.com/mongodb/mongo/commit/af31dc7ba520145df053e72232647556e467dd00

Comment by Scott Hernandez (Inactive) [ 05/Nov/14 ]

This seems to be either insert releasing the lock during which a stepdown (from heartbeat) happens or the stepdown not holding the lock while transitioning (out of primary). I am investigating which one.

Comment by Ian Whalen (Inactive) [ 03/Nov/14 ]

problem seems to be ongoing as of 11/3.

Comment by Daniel Pasette (Inactive) [ 28/Oct/14 ]

crystal, please assign to repl team

Generated at Thu Feb 08 03:39:33 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.