[SERVER-13822] Running resync before replset config is loaded can crash mongod Created: 17/Apr/14  Updated: 11/Jul/16  Resolved: 19/May/14

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 2.6.4, 2.7.1

Type: Bug Priority: Major - P3
Reporter: Shaun Verch Assignee: Eric Milkie
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Operating System: ALL
Backport Completed:
Participants:

 Description   
Issue Status as of Jul 22, 2014

ISSUE SUMMARY
In a replica set, if a resync operation is attempted on a node before it loads a valid replica set config, the mongod process crashes.

A mongod newly started with the --replSet parameter does not immediately have a config. It must first load a valid config from disk, receive a config from another node, or have the replica set initiate command (replSetInitiate) run by an administrator.

USER IMPACT
The mongod process crashes with a segmentation fault and prints a stack trace to the log. This only affects newly started mongod processes that have not yet joined a replica set, so the impact on the replica set as a whole is minimal.

WORKAROUNDS
Do not run resync on a mongod before loading a valid replica set config.

AFFECTED VERSIONS
MongoDB production releases 2.6.0 through 2.6.3 are affected by this issue.

FIX VERSION
The fix is included in the 2.6.4 production release.

RESOLUTION DETAILS
The resync command now checks the replSet pointer and is rejected if the replica set config has not yet been loaded.
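The guard described above can be illustrated with a minimal sketch. This is not the code from the fix commit: ReplSetSketch, replSetSketch, loadConfigSketch, and runResyncSketch are hypothetical names invented for illustration, and the error message text is an assumption. The sketch only shows the general idea of checking a pointer that stays null until a config is loaded, rather than dereferencing it.

```cpp
#include <memory>
#include <string>

// Hypothetical stand-in for the state that exists only after a valid
// replica set config has been loaded. This is not the real mongod type.
struct ReplSetSketch {
    int resyncRequests = 0;
    void requestResync() { ++resyncRequests; }
};

// Hypothetical global pointer: null until a config has been loaded.
std::unique_ptr<ReplSetSketch> replSetSketch;

// Simulates a config becoming available (from disk, from another node,
// or through an initiate command).
void loadConfigSketch() {
    replSetSketch.reset(new ReplSetSketch());
}

// Sketch of a guarded resync handler. When no config has been loaded it
// returns false and fills errmsg instead of dereferencing a null pointer.
bool runResyncSketch(std::string& errmsg) {
    if (!replSetSketch) {
        errmsg = "no replica set config has been loaded yet";  // assumed wording
        return false;
    }
    replSetSketch->requestResync();
    return true;
}
```

In this sketch, calling runResyncSketch before loadConfigSketch returns false with an error message. Calling it afterwards returns true and records the request.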

Original description

https://mci.10gen.com/ui/task/mongodb_mongo_master_osx_108_dur_off_b1300e3f5656423eac55efaedf6440ab10c37125_14_04_16_21_30_07_replicasets_osx_108_dur_off
https://mci.10gen.com/ui/task/mongodb_mongo_master_osx_108_b1300e3f5656423eac55efaedf6440ab10c37125_14_04_16_21_30_07_replicasets_osx_108

 m31001| 2014-04-16T20:14:35.279-0400 [conn2] SEVERE: Invalid access at address: 0
 m31001| 2014-04-16T20:14:35.280-0400 [rsStart] replSet I am mci-osx108-5.build.10gen.cc:31001
 m31001| 2014-04-16T20:14:35.283-0400 [conn2] SEVERE: Got signal: 11 (Segmentation fault: 11).
 m31001| 0x1006b125b 0x1006b0dfe 0x7fff88b2790a 0 0x1001aa945 0x1001ab3db 0x1001ac09c 0x1003c0d5f 0x1002927b0 0x1000065b4 0x1006760f1 0x1006e57d5 0x7fff88b39772 0x7fff88b261a1 
 m31001|  /data/mci/shell/mongodb-mongo-master/mongod(_ZN5mongo15printStackTraceERSo+0x2b) [0x1006b125b]
 m31001|  /data/mci/shell/mongodb-mongo-master/mongod(_ZN5mongo12_GLOBAL__N_124abruptQuitWithAddrSignalEiP9__siginfoPv+0xde) [0x1006b0dfe]
 m31001|  /usr/lib/system/libsystem_c.dylib(_sigtramp+0x1a) [0x7fff88b2790a]
 m31001|  ??? [0]
 m31001|  /data/mci/shell/mongodb-mongo-master/mongod(_ZN5mongo12_execCommandEPNS_7CommandERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x25) [0x1001aa945]
 m31001|  /data/mci/shell/mongodb-mongo-master/mongod(_ZN5mongo7Command11execCommandEPS0_RNS_6ClientEiPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0x85f) [0x1001ab3db]
 m31001|  /data/mci/shell/mongodb-mongo-master/mongod(_ZN5mongo12_runCommandsEPKcRNS_7BSONObjERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x56c) [0x1001ac09c]
 m31001|  /data/mci/shell/mongodb-mongo-master/mongod(_ZN5mongo11newRunQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1_+0x64f) [0x1003c0d5f]
 m31001|  /data/mci/shell/mongodb-mongo-master/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x7b0) [0x1002927b0]
 m31001|  /data/mci/shell/mongodb-mongo-master/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x134) [0x1000065b4]
 m31001|  /data/mci/shell/mongodb-mongo-master/mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x691) [0x1006760f1]
 m31001|  /data/mci/shell/mongodb-mongo-master/mongod(thread_proxy+0xe5) [0x1006e57d5]
 m31001|  /usr/lib/system/libsystem_c.dylib(_pthread_start+0x147) [0x7fff88b39772]
 m31001|  /usr/lib/system/libsystem_c.dylib(thread_start+0xd) [0x7fff88b261a1]

The only change to actual code in the intersection of the blamelists is: https://github.com/mongodb/mongo/commit/0fbd76d233e213e43f53b8882c4dd3c71897a7f3

Other changes:

https://github.com/mongodb/mongo/commit/8bbe304cde912c0e2f96ff6b8f6e4badd90d60f0
https://github.com/mongodb/mongo/commit/b1300e3f5656423eac55efaedf6440ab10c37125



 Comments   
Comment by Githook User [ 14/Jul/14 ]

Author: Eric Milkie (milkie) <milkie@10gen.com>

Message: SERVER-13822 check for replSet pointer in resync command

(cherry picked from commit c10e8282a7af38f8512e911a14889e14df8a2c6a)

Conflicts:
src/mongo/db/repl/resync.cpp
Branch: v2.6
https://github.com/mongodb/mongo/commit/9965572e13e395240def08cbef56f997931d61eb

Comment by Ramon Fernandez Marina [ 20/May/14 ]

Author: Eric Milkie (milkie) <milkie@10gen.com>

Message: SERVER-13822 check for replSet pointer in resync command

Branch: master
https://github.com/mongodb/mongo/commit/c10e8282a7af38f8512e911a14889e14df8a2c6a

Comment by Eric Milkie [ 02/May/14 ]

Good news: I can reproduce this locally by increasing the poll frequency of the assert.soon. I'm going to try to diagnose it.

Comment by David Storch [ 01/May/14 ]

5b4ad32814 OS X 10.8 DUR OFF replicasets

https://mci.10gen.com/ui/task/mongodb_mongo_master_osx_108_dur_off_5b4ad3281493f6fcdccf96781237e79a13b59621_14_05_01_13_15_06_replicasets_osx_108_dur_off

Comment by David Storch [ 01/May/14 ]

Saw another instance of this failure:

57e01bdc25 OS X 10.8 DUR OFF replicasets_auth

https://mci.10gen.com/ui/task/mongodb_mongo_master_osx_108_dur_off_57e01bdc252cb06225edb0ac5fc712666236dbcf_14_04_30_18_53_10_replicasets_auth_osx_108_dur_off

http://buildlogs.mongodb.org/mci_0.9_osx-108-dur-off/builds/74720/test/replicasets_auth_0/resync.js
