[SERVER-5948] replSet member can fall out of RECOVERING back into SECONDARY even if no other node in the set can be authenticated against. Created: 28/May/12  Updated: 11/Jul/16  Resolved: 30/May/12

Status: Closed
Project: Core Server
Component/s: Replication, Security
Affects Version/s: None
Fix Version/s: 2.1.2

Type: Bug Priority: Major - P3
Reporter: Ian Whalen (Inactive) Assignee: Kristina Chodorow (Inactive)
Resolution: Done Votes: 0
Labels: buildbot
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   

 m31000| Mon May 28 08:52:17 [rsHealthPoll] could not authenticate against ip-10-110-9-236:31002, { errmsg: "auth fails", ok: 0.0 }
 m31000| Mon May 28 08:52:17 [rsHealthPoll] couldn't connect to ip-10-110-9-236:31001: couldn't connect to server ip-10-110-9-236:31001
 m31002| Mon May 28 08:52:17 [rsHealthPoll] couldn't connect to ip-10-110-9-236:31001: couldn't connect to server ip-10-110-9-236:31001
 m31002| Mon May 28 08:52:17 [rsBackgroundSync] waiting for 3 pings from other members before syncing
 m31000| Mon May 28 08:52:19 [rsHealthPoll] couldn't connect to ip-10-110-9-236:31001: couldn't connect to server ip-10-110-9-236:31001
 m31000| Mon May 28 08:52:19 [rsBackgroundSync] waiting for 6 pings from other members before syncing
 m31002| Mon May 28 08:52:19 [rsHealthPoll] couldn't connect to ip-10-110-9-236:31001: couldn't connect to server ip-10-110-9-236:31001
 m31000| Mon May 28 08:52:21 [rsMgr] replset error could not reach/authenticate against any members
 m31000| Mon May 28 08:52:21 [rsMgr] replSet RECOVERING
 m31000| Mon May 28 08:52:21 [rsHealthPoll] couldn't connect to ip-10-110-9-236:31001: couldn't connect to server ip-10-110-9-236:31001
 m31000| Mon May 28 08:52:21 [rsSync] replSet SECONDARY
assert.soon failed: function () {
    var result = m.getDB("admin").runCommand({isMaster:1});
    return !result.ismaster && !result.secondary;
}, msg:undefined
Error("Printing Stack Trace")@:0
()@src/mongo/shell/utils.js:37
("assert.soon failed: function () {\n    var result = m.getDB(\"admin\").runCommand({isMaster:1});\n    return !result.ismaster && !result.secondary;\n}, msg:undefined")@src/mongo/shell/utils.js:58
((function () {var result = m.getDB("admin").runCommand({isMaster:1});return !result.ismaster && !result.secondary;}))@src/mongo/shell/utils.js:167
()@/mnt/home/buildbot/slave/Linux_64bit_Nightly/mongo/jstests/replsets/auth2.js:55
@/mnt/home/buildbot/slave/Linux_64bit_Nightly/mongo/jstests/replsets/auth2.js:99

http://buildbot.mongodb.org/builders/Nightly%20Linux%2064-bit/builds/986/steps/test_9/logs/stdio
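For context on what the failing assertion above is waiting for: a node that reports neither ismaster nor secondary from isMaster is in a non-readable state such as RECOVERING. A minimal re-implementation of the shell's assert.soon polling loop (a sketch only; the real helper lives in src/mongo/shell/utils.js, and the busy-wait here stands in for the shell's sleep) shows how the test times out when the node stays SECONDARY:

```javascript
// Sketch of assert.soon: poll a predicate until it is true or a timeout expires.
function assertSoon(predicate, timeoutMs, intervalMs) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (predicate()) return;
    // Busy-wait between attempts (the real shell sleeps instead).
    const until = Date.now() + intervalMs;
    while (Date.now() < until) {}
  }
  throw new Error("assert.soon failed: " + predicate.toString());
}

// Hypothetical isMaster result for a node stuck in SECONDARY state:
const result = { ismaster: false, secondary: true };

let timedOut = false;
try {
  // The test waits for the node to be neither master nor secondary,
  // i.e. to reach RECOVERING. A node stuck in SECONDARY never satisfies this.
  assertSoon(() => !result.ismaster && !result.secondary, 300, 50);
} catch (e) {
  timedOut = true;
}
console.log(timedOut); // true: the node never left SECONDARY, so the test fails
```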



 Comments   
Comment by auto [ 30/May/12 ]

Author: Kristina (kchodorow) &lt;kristina@10gen.com&gt;

Message: Prevent a member from transitioning to secondary state when there are auth problems SERVER-5948
Branch: master
https://github.com/mongodb/mongo/commit/135bbe4cc95023348feb05e3bbc59892b9edf23d

Comment by Kristina Chodorow (Inactive) [ 30/May/12 ]

This was a regression with the recent sync code changes.

Comment by Spencer Brody (Inactive) [ 30/May/12 ]

Excerpt of the buildbot log showing where the node goes briefly into RECOVERING before flipping back into SECONDARY (and remaining SECONDARY from then on, which causes the test to fail).

 m31002| Wed May 30 03:28:40 [initandlisten] connection accepted from 10.10.223.44:63707 #6 (2 connections now open)
 m31002| Wed May 30 03:28:40 [conn6]  authenticate db: local { authenticate: 1, nonce: "4bd569804f887713", user: "__system", key: "5a6357b05d3b720cb2c7d7ff23a01d6c" }
 m31000| Wed May 30 03:28:40 [rsHealthPoll] replSet member bs-rhel-57-x86_64-1:31002 is now in state UNKNOWN
 m31000| Wed May 30 03:28:40 [rsMgr] replset error could not reach/authenticate against any members
 m31000| Wed May 30 03:28:40 [rsMgr] replSet RECOVERING
 m31000| Wed May 30 03:28:40 [rsSync] replSet SECONDARY
 m31002| Wed May 30 03:28:40 [conn6] auth: key mismatch __system, ns:local

Comment by Spencer Brody (Inactive) [ 30/May/12 ]

I think I understand the problem, and it's a real bug. Manager::checkAuth in manager.cpp puts the node into RECOVERING when at least one of the other nodes has auth problems and every remaining node either also has auth problems or is down. The rsSync thread, however, periodically calls ReplSetImpl::tryToGoLiveAsASecondary, which is designed to detect whether the initial sync has finished by checking this node's optime. If the node is up to date, it is put into SECONDARY state regardless of auth. tryToGoLiveAsASecondary should only take a node out of RECOVERING if the node entered RECOVERING because it was doing initial sync; if it entered RECOVERING for some other reason (such as auth failures), being up to date on optime is not a sufficient reason to promote it.
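The fix described above can be sketched as a tiny state machine (in JavaScript for illustration; the real code is C++ in manager.cpp and the rsSync path, and the field name recoveringBecause is hypothetical): record why the node entered RECOVERING, and let tryToGoLiveAsASecondary promote it only when the reason was initial sync.

```javascript
const State = { RECOVERING: "RECOVERING", SECONDARY: "SECONDARY" };

function makeMember() {
  return { state: State.SECONDARY, recoveringBecause: null };
}

// rsMgr path: every reachable member has auth problems.
function onAuthFailure(member) {
  member.state = State.RECOVERING;
  member.recoveringBecause = "auth";
}

// rsSync path: the node's optime is caught up, so try to go live.
function tryToGoLiveAsASecondary(member, caughtUp) {
  if (!caughtUp) return;
  // The bug was promoting unconditionally here. The fix: only leave
  // RECOVERING if we entered it for initial sync, not for auth problems.
  if (member.state === State.RECOVERING &&
      member.recoveringBecause !== "initialSync") {
    return; // stay in RECOVERING
  }
  member.state = State.SECONDARY;
  member.recoveringBecause = null;
}

const m = makeMember();
onAuthFailure(m);
tryToGoLiveAsASecondary(m, /* caughtUp */ true);
console.log(m.state); // RECOVERING: auth problems keep the node out of SECONDARY
```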

The reason this test sometimes passes is that the node does briefly enter RECOVERING before being set back to SECONDARY. Whether the test passes therefore depends on timing: the check that verifies the node is in RECOVERING has to run before the node flips back to SECONDARY.

Comment by Spencer Brody (Inactive) [ 30/May/12 ]

New failure with more output: http://buildlogs.mongodb.org/build/4fc5af36d2a60f1800000c50/test/4fc5cbded2a60f2a8f0008ed/

Comment by auto [ 29/May/12 ]

Author: Spencer T Brody (stbrody) &lt;spencer@10gen.com&gt;

Message: Add debugging output to test to understand failure. SERVER-5948
Branch: master
https://github.com/mongodb/mongo/commit/7c6ad5669d859fe6e2fbbdd5233c69fb7536db73

Generated at Thu Feb 08 03:10:20 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.