[SERVER-5948] replSet member can fall out of RECOVERING back into SECONDARY even if no other node in the set can be authenticated against. Created: 28/May/12 Updated: 11/Jul/16 Resolved: 30/May/12 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication, Security |
| Affects Version/s: | None |
| Fix Version/s: | 2.1.2 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Ian Whalen (Inactive) | Assignee: | Kristina Chodorow (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | buildbot | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Operating System: | ALL |
| Participants: |
| Description |
http://buildbot.mongodb.org/builders/Nightly%20Linux%2064-bit/builds/986/steps/test_9/logs/stdio |
| Comments |
| Comment by auto [ 30/May/12 ] | |||||||
|
Author: {u'login': u'kchodorow', u'name': u'Kristina', u'email': u'kristina@10gen.com'}Message: Prevent a member from transitioning to secondary state when there are auth problems | |||||||
| Comment by Kristina Chodorow (Inactive) [ 30/May/12 ] | |||||||
|
This was a regression with the recent sync code changes. | |||||||
| Comment by Spencer Brody (Inactive) [ 30/May/12 ] | |||||||
|
Excerpt of buildbot log that shows where the node goes briefly into RECOVERING before going back into SECONDARY (and staying that way from then on, which causes the test to fail).
| |||||||
| Comment by Spencer Brody (Inactive) [ 30/May/12 ] | |||||||
|
I think I understand the problem, and I think it's a real bug. Manager::checkAuth in manager.cpp puts the node into RECOVERING if at least one of the other nodes has auth problems and all other nodes either also have auth problems or are down. The rsSync thread, however, periodically calls ReplSetImpl::tryToGoLiveAsASecondary, which is designed to check if the initial sync is finished by checking this node's optime. If it sees that it's up-to-date, it puts the node into SECONDARY state, regardless of auth. It should probably only take a node out of RECOVERING if the reason it was in RECOVERING in the first place was because it was doing the initial sync. If it went into RECOVERING for some other reason, then being up-to-date on optime isn't a good enough reason to take the node out of RECOVERING. The reason this test sometimes passes is because it does briefly get into RECOVERING before being set back to SECONDARY, so it depends on the timing of the test and whether or not the part that is checking to make sure that the node is in RECOVERING gets in to do its check before the node goes back to SECONDARY. | |||||||
| Comment by Spencer Brody (Inactive) [ 30/May/12 ] | |||||||
|
New failure with more output: http://buildlogs.mongodb.org/build/4fc5af36d2a60f1800000c50/test/4fc5cbded2a60f2a8f0008ed/ | |||||||
| Comment by auto [ 29/May/12 ] | |||||||
|
Author: {u'login': u'stbrody', u'name': u'Spencer T Brody', u'email': u'spencer@10gen.com'}Message: Add debugging output to test to understand failure. |