[SERVER-20856] During CSRS upgrade, config server gets stuck in STARTUP2 when restarted with --replSet and --configsvrMode=sccc set
Created: 09/Oct/15 Updated: 25/Jan/17 Resolved: 28/Oct/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 3.1.9, 3.2.0-rc0 |
| Fix Version/s: | 3.2.0-rc2 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Timothy Olsen (Inactive) | Assignee: | Andy Schwerin |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Sprint: | Sharding B (10/30/15) |
| Description |
|
During the CSRS upgrade, the first restart of the first config server (with --replSet and --configsvrMode=sccc set) results in the first config server getting stuck in STARTUP2. Log file of first config server:
|
| Comments |
| Comment by Githook User [ 28/Oct/15 ] |
|
Author: Andy Schwerin (andy10gen, schwerin@mongodb.com)
Message: Otherwise, if the server running replSetInitiate crashes between writing the
| Comment by Timothy Olsen (Inactive) [ 22/Oct/15 ] |
|
The problem appears to occur only when journaling is disabled. With journaling enabled, I was able to do 5 automated wiredTiger SCCC -> CSRS conversions without any problems.
| Comment by Timothy Olsen (Inactive) [ 21/Oct/15 ] |
|
btw, this is with commit dbbc9a2e3d8c4d7fe1748fa980ba7d01b9489dbe
| Comment by Timothy Olsen (Inactive) [ 21/Oct/15 ] |
|
logs attached
| Comment by Timothy Olsen (Inactive) [ 21/Oct/15 ] |
|
I am still seeing this happen, although now only about 60% of the time rather than 100% of the time. It only happens in an automated scenario; efforts to reproduce it manually fail. This leads me to believe a race is being triggered in the automated situation because the shutdown of the config server happens so quickly after the rs.initiate(). Should we reopen this ticket or open a new one? Regardless, I will attach logs now.
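If the race described above is real, one harness-side idea is to poll the node until it reports PRIMARY before shutting it down. The sketch below is only an illustration and is not the fix applied in this ticket. `wait_for_state` and its `get_state` callable are hypothetical names; against a real deployment, `get_state` could wrap the `myState` field returned by the `replSetGetStatus` command, where 1 means PRIMARY. Waiting this way may narrow the window, but it does not by itself guarantee that the oplog write is durable.

```python
import time

PRIMARY = 1  # replSetGetStatus "myState" value for PRIMARY


def wait_for_state(get_state, want=PRIMARY, timeout=30.0, interval=0.5,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until it returns `want` or `timeout` seconds pass.

    get_state is a hypothetical caller-supplied callable. It may raise while
    the node is still starting up, which is treated as "not ready yet".
    Returns True if the state was reached, False on timeout.
    """
    deadline = clock() + timeout
    while True:
        try:
            if get_state() == want:
                return True
        except Exception:
            pass  # node not answering yet; keep polling
        if clock() >= deadline:
            return False
        sleep(interval)
```

An automation script would call `wait_for_state(...)` right after rs.initiate() and only proceed to the shutdown step when it returns True.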
| Comment by Githook User [ 13/Oct/15 ] |
|
Author: Andy Schwerin (andy10gen, schwerin@mongodb.com)
Message: This manual resetting allows creation of a replica set oplog while the
| Comment by Andy Schwerin [ 13/Oct/15 ] |
|
Well, I found a way that we might be able to continue to use _logOp for initializing the oplog when you run replSetInitiate on a node started without --replSet. On the assumption that it doesn't happen all the time, we can just reset the oplog.cpp internal state that caches the Collection object for the oplog.
| Comment by Andy Schwerin [ 13/Oct/15 ] |
|
I think this problem arose because repl::_logOp should not be a publicly exposed method in oplog.h. It can only really be called by logOp(), and should be a private function inside of oplog.cpp.
| Comment by Scott Hernandez (Inactive) [ 09/Oct/15 ] |
|
nvm, repro'd it. Both the oplog.rs and oplog.$main collections are created, which means that the server is slightly confused about the "active" replication system. To get around this issue one would need to copy all (or at least the last) of the oplog entries from local.oplog.$main to local.oplog.rs before restarting as a replica set member.
| Comment by Scott Hernandez (Inactive) [ 09/Oct/15 ] |
|
Did you check the result of the rs.initiate() call to make sure it didn't error? If you have that log, please post it. To answer your question: yes, an oplog is created when you run rs.initiate() in non-replica-set mode. There should be a single entry like this:
We will see if we can repro this.
| Comment by Timothy Olsen (Inactive) [ 09/Oct/15 ] |
|
This is during the CSRS upgrade procedure as outlined in https://docs.google.com/document/d/1Ic8NLIX-uiR_1_4HBdURFmhpgb3QsVH3gUdHoHmEVag/edit# . It happens during the step 'Restart “first” config server as a standalone replica set'. There are no other members, nor do I believe there are expected to be any at this stage of the CSRS upgrade procedure. I did not explicitly delete the oplog. I do not know whether it is expected that there be no oplog at this part of the upgrade procedure. Log file before the restart:
| Comment by Scott Hernandez (Inactive) [ 09/Oct/15 ] |
|
Please upload all logs from this node, including those from before this restart. Are there other members of this replica set? The config doesn't show any. If so, can you provide those logs as well? This node seems not to have an oplog; is that expected? Was it deleted?