[SERVER-15655] Secondary can't join if w:0 is in getLastErrorDefaults Created: 14/Oct/14  Updated: 11/Jul/16  Resolved: 31/Oct/14

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 2.8.0-rc0

Type: Bug Priority: Major - P3
Reporter: A. Jesse Jiryu Davis Assignee: Spencer Brody (Inactive)
Resolution: Done Votes: 0
Labels: 28qa
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-15623 Test that w:0 is prohibited in getLas... Closed
related to SERVER-13055 Shouldn't allow w:0 in replset getLas... Closed
is related to SERVER-15868 Don't persist invalid configs receive... Closed
Backwards Compatibility: Minor Change
Operating System: ALL
Steps To Reproduce:

Start a 2.6.4 single-member replica set:

> db.serverBuildInfo().version
2.6.4
> rs.initiate({
  _id: 'rs',
  members:[{_id: 0, host: 'localhost:27017'}],
  settings: {getLastErrorDefaults: {w:0}}
})
{
	"info" : "Config now saved locally.  Should come online in about a minute.",
	"ok" : 1
}
rs:OTHER> rs.status().members[0].stateStr
PRIMARY

Now start a member running the latest code from master (I have 4bea2a90) with the option "-vvvv". Try to add it from the shell connected to the 2.6.4 primary:

rs:PRIMARY> rs.add('localhost:27018')
{ "ok" : 1 }

The primary thinks the member was added. However:

rs:PRIMARY> rs.status()
{
	"set" : "rs",
	"date" : ISODate("2014-10-14T16:27:05Z"),
	"myState" : 1,
	"members" : [
		{
			"_id" : 0,
			"name" : "localhost:27017",
			"health" : 1,
			"state" : 1,
			"stateStr" : "PRIMARY",
			"uptime" : 436,
			"optime" : Timestamp(1413303922, 1),
			"optimeDate" : ISODate("2014-10-14T16:25:22Z"),
			"electionTime" : Timestamp(1413303652, 2),
			"electionDate" : ISODate("2014-10-14T16:20:52Z"),
			"self" : true
		},
		{
			"_id" : 1,
			"name" : "localhost:27018",
			"health" : 1,
			"state" : 6,
			"stateStr" : "UNKNOWN",
			"uptime" : 103,
			"optime" : Timestamp(0, 0),
			"optimeDate" : ISODate("1970-01-01T00:00:00Z"),
			"lastHeartbeat" : ISODate("2014-10-14T16:27:04Z"),
			"lastHeartbeatRecv" : ISODate("1970-01-01T00:00:00Z"),
			"pingMs" : 0,
			"lastHeartbeatMessage" : "still initializing"
		}
	],
	"ok" : 1
}

The member we tried to add logs "User Assertion: 17505:replSet illegal config: getLastErrorDefaults w:0":

2014-10-14T12:25:22.093-0400 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:62519 #1 (1 connection now open)
2014-10-14T12:25:22.093-0400 D COMMANDS [conn1] run command admin.$cmd { _isSelf: 1 }
2014-10-14T12:25:22.093-0400 I QUERY    [conn1] command admin.$cmd command: _isSelf { _isSelf: 1 } ntoreturn:1 keyUpdates:0 numYields:0  reslen:53 0ms
2014-10-14T12:25:22.093-0400 D NETWORK  [conn1] Socket recv() conn closed? 127.0.0.1:62519
2014-10-14T12:25:22.093-0400 D NETWORK  [conn1] SocketException: remote: 127.0.0.1:62519 error: 9001 socket exception [CLOSED] server [127.0.0.1:62519]
2014-10-14T12:25:22.093-0400 I NETWORK  [conn1] end connection 127.0.0.1:62519 (0 connections now open)
2014-10-14T12:25:22.093-0400 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:62520 #2 (2 connections now open)
2014-10-14T12:25:22.093-0400 D COMMANDS [conn2] run command admin.$cmd { replSetHeartbeat: "rs", v: -1, pv: 1, checkEmpty: false, from: "", fromId: 0 }
2014-10-14T12:25:22.093-0400 D COMMANDS [conn2] command: { replSetHeartbeat: "rs", v: -1, pv: 1, checkEmpty: false, from: "", fromId: 0 }
2014-10-14T12:25:22.093-0400 I QUERY    [conn2] command admin.$cmd command: replSetHeartbeat { replSetHeartbeat: "rs", v: -1, pv: 1, checkEmpty: false, from: "", fromId: 0 } ntoreturn:1 keyUpdates:0 numYields:0  reslen:72 0ms
2014-10-14T12:25:22.094-0400 D COMMANDS [conn2] run command admin.$cmd { replSetHeartbeat: "rs", v: 2, pv: 1, checkEmpty: false, from: "localhost:27017", fromId: 0 }
2014-10-14T12:25:22.094-0400 D COMMANDS [conn2] command: { replSetHeartbeat: "rs", v: 2, pv: 1, checkEmpty: false, from: "localhost:27017", fromId: 0 }
2014-10-14T12:25:22.094-0400 I QUERY    [conn2] command admin.$cmd command: replSetHeartbeat { replSetHeartbeat: "rs", v: 2, pv: 1, checkEmpty: false, from: "localhost:27017", fromId: 0 } ntoreturn:1 keyUpdates:0 numYields:0  reslen:72 0ms
2014-10-14T12:25:22.408-0400 D QUERY    [rsStart] Running query: query: {} sort: {} projection: {} skip: 0 limit: 1
2014-10-14T12:25:22.408-0400 D QUERY    [rsStart] Collection local.system.replset does not exist. Using EOF stage: query: {} sort: {} projection: {} skip: 0 limit: 1
2014-10-14T12:25:22.408-0400 I QUERY    [rsStart] query local.system.replset planSummary: EOF ntoreturn:1 ntoskip:0 nscanned:0 nscannedObjects:0 keyUpdates:0 numYields:0  nreturned:0 reslen:20 0ms
2014-10-14T12:25:22.408-0400 D REPLSETS [rsStart] ReplSetConfig load localhost:27017
2014-10-14T12:25:22.409-0400 D COMMANDS [ConnectBG] BackgroundJob starting: ConnectBG
2014-10-14T12:25:22.409-0400 D NETWORK  [rsStart] connected to server localhost:27017 (127.0.0.1)
2014-10-14T12:25:22.409-0400 I REPLSETS [rsStart] trying to contact localhost:27017
2014-10-14T12:25:22.409-0400 D COMMANDS [ConnectBG] BackgroundJob starting: ConnectBG
2014-10-14T12:25:22.409-0400 D NETWORK  [rsStart] connected to server localhost:27017 (127.0.0.1)
2014-10-14T12:25:22.410-0400 D -        [rsStart] User Assertion: 17505:replSet illegal config: getLastErrorDefaults w:0
2014-10-14T12:25:22.410-0400 D REPLSETS [rsStart] replSet exception trying to load config from discovered seed localhost:27017
2014-10-14T12:25:22.410-0400 I REPLSETS [rsStart] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)
2014-10-14T12:25:22.410-0400 D REPLSETS [rsStart] replSet info no seed hosts were specified on the --replSet command line

Description:

Two bugs:

1. If you try to add a member running 2.8 to a replica set configured with {getLastErrorDefaults: {w:0}}, it stays in state UNKNOWN. It should be REMOVED.

2. The member doesn't log the problem at the default log level. We shouldn't have to increase verbosity to see what the problem is.
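The rule the 2.8 node enforces here is that the default write concern must wait for at least one member, so a numeric w of 0 is rejected. A minimal sketch of that check in plain JavaScript (illustrative only; the real validation lives in the server's C++ ReplicaSetConfig code, and the function name below is hypothetical):

```javascript
// Hypothetical sketch of the config check that rejects getLastErrorDefaults w:0.
// The error text mirrors the server log message quoted in the comments below.
function validateGetLastErrorDefaults(config) {
  var defaults = config.settings && config.settings.getLastErrorDefaults;
  if (!defaults) {
    return; // no defaults configured; nothing to validate
  }
  var w = defaults.w;
  // A numeric default write concern must wait for at least one member.
  if (typeof w === "number" && w < 1) {
    throw new Error(
      "BadValue: Default write concern mode must wait for at least 1 member");
  }
}
```

With {w: 0} the sketch throws; with {w: 1}, or with no getLastErrorDefaults at all, it passes.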



Comments:
Comment by Githook User [ 03/Nov/14 ]

Author:

{u'username': u'stbrody', u'name': u'Spencer T Brody', u'email': u'spencer@mongodb.com'}

Message: SERVER-15655 Fix remove1.js test
Branch: master
https://github.com/mongodb/mongo/commit/1bb29da08f761b4ed79c8bb1cacb74aaec702530

Comment by Githook User [ 31/Oct/14 ]

Author:

{u'username': u'stbrody', u'name': u'Spencer T Brody', u'email': u'spencer@mongodb.com'}

Message: SERVER-15868 SERVER-15655 Better handling of receiving an invalid config via a heartbeat
Branch: master
https://github.com/mongodb/mongo/commit/0bb7d05719e7364cbeac27a78a9817e3b0ebd149

Comment by Spencer Brody (Inactive) [ 29/Oct/14 ]

Also, after shutting down that node and restarting it, I see:

2014-10-29T17:31:56.919-0400 I REPLSETS [rsSync] replSet warning did not receive a valid config yet, sleeping 5 seconds 
2014-10-29T17:31:56.919-0400 W REPLSETS [initandlisten] Failed to load timestamp of most recently applied operation; NoMatchingDocument Did not find any entries in local.oplog.rs
2014-10-29T17:31:56.920-0400 I NETWORK  [initandlisten] waiting for connections on port 20002
2014-10-29T17:31:56.921-0400 E REPLSETS [ReplicationExecutor] Locally stored replica set configuration is invalid; See http://www.mongodb.org/dochub/core/recover-replica-set-from-invalid-config for information on how to recover from this. Got "BadValue Default write concern mode must wait for at least 1 member" while validating { _id: "mySet", version: 3, members: [ { _id: 0, host: "ubuntu:20001", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 1, host: "ubuntu:20002", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatTimeoutSecs: 10, getLastErrorModes: {}, getLastErrorDefaults: { w: 0, wtimeout: 0 } } }
2014-10-29T17:31:56.921-0400 I -        [ReplicationExecutor] Fatal Assertion 28544

It seems we shouldn't persist an invalid configuration to local.system.replset, as doing so will just cause us to crash on restart and make recovery more difficult.

Comment by Spencer Brody (Inactive) [ 29/Oct/14 ]

When I do this on master now I see:

2014-10-29T17:29:07.072-0400 E REPLSETS [ReplicationExecutor] Could not validate configuration received from remote node; Removing self until an acceptable configuration arrives; BadValue Default write concern mode must wait for at least 1 member
2014-10-29T17:29:07.072-0400 I REPLSETS [ReplicationExecutor] transition to REMOVED

and rs.status() reports REMOVED for the secondary.

milkie, your previous comment said we wanted this to cause the secondary to be UNKNOWN not REMOVED, so is the current behavior a bug?

Comment by Eric Milkie [ 29/Oct/14 ]

We should put the config version of every node in replSetGetStatus, at least. And we can log when we reject configs.

Comment by A. Jesse Jiryu Davis [ 28/Oct/14 ]

Now it's worse. If I go through the same steps as before (2.6.4 on port 27017, replSetInitiate with w:0 in getLastErrorDefaults, then try to add a secondary running the current master code, hash 4f4a5103), the secondary reports that it can't find a member to sync from:

rs:PRIMARY> rs.status()
{
	"set" : "rs",
	"date" : ISODate("2014-10-28T22:00:28Z"),
	"myState" : 1,
	"members" : [
		{
			"_id" : 0,
			"name" : "localhost:27017",
			"health" : 1,
			"state" : 1,
			"stateStr" : "PRIMARY",
			"uptime" : 1224,
			"optime" : Timestamp(1414533570, 1),
			"optimeDate" : ISODate("2014-10-28T21:59:30Z"),
			"electionTime" : Timestamp(1414533558, 2),
			"electionDate" : ISODate("2014-10-28T21:59:18Z"),
			"self" : true
		},
		{
			"_id" : 1,
			"name" : "localhost:27018",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			"uptime" : 58,
			"optime" : Timestamp(1414533570, 1),
			"optimeDate" : ISODate("2014-10-28T21:59:30Z"),
			"lastHeartbeat" : ISODate("2014-10-28T22:00:28Z"),
			"lastHeartbeatRecv" : ISODate("2014-10-28T22:00:28Z"),
			"pingMs" : 0,
			"lastHeartbeatMessage" : "could not find member to sync from"
		}
	],
	"ok" : 1
}

At the default log level, the secondary logs:

2014-10-28T18:00:36.176-0400 I REPLSETS [ReplicationExecutor] replSet could not find member to sync from

If I turn it up to "-vvvvv", it logs:

2014-10-28T18:01:30.284-0400 D COMMANDS [conn2] command: { replSetHeartbeat: "rs", v: 2, pv: 1, checkEmpty: false, from: "localhost:27017", fromId: 0 }
2014-10-28T18:01:30.284-0400 I QUERY    [conn2] command admin.$cmd command: replSetHeartbeat { replSetHeartbeat: "rs", v: 2, pv: 1, checkEmpty: false, from: "localhost:27017", fromId: 0 } ntoreturn:1 keyUpdates:0 numYields:0  reslen:148 0ms
2014-10-28T18:01:30.588-0400 D REPLSETS [ReplicationExecutor] Scheduling replSetHeartbeat to localhost:27017
2014-10-28T18:01:30.589-0400 D REPLSETS Network status of sending replSetHeartbeat to localhost:27017 was OK
2014-10-28T18:01:30.589-0400 D REPLSETS [ReplicationExecutor] setUpValues: heartbeat response good for member _id:0, msg:
2014-10-28T18:01:30.589-0400 D REPLSETS [ReplicationExecutor] Scheduling heartbeat to localhost:27017 at 2014-10-28T22:01:32.589Z

The secondary never logs the real reason it can't sync: it won't accept the config. If I reconfig on the primary to replace the w:0 with a w:1, the new secondary immediately starts syncing.
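For reference, the reconfig workaround described above might look like the following from a shell connected to the 2.6.4 primary (a sketch; it assumes the write-concern default is the only setting that needs to change, and it requires a live replica set):

```javascript
// Connected to the 2.6.4 primary: replace w:0 with w:1 in the
// replica set settings, then submit the updated config.
var cfg = rs.conf();
cfg.settings.getLastErrorDefaults = { w: 1, wtimeout: 0 };
rs.reconfig(cfg);
// After this, the new secondary accepts the config and starts syncing.
```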

Comment by Eric Milkie [ 17/Oct/14 ]

New code has been released; I expect that bug #2 above (log issue) is now resolved.
We have decided that bug #1 is as-designed behavior; any 2.8 node with an invalid config will be reported as UNKNOWN rather than REMOVED by 2.6 replica set members.

Comment by Spencer Brody (Inactive) [ 14/Oct/14 ]

Moving into "2.7 Required" to ensure that we at least look at this again before release, as this is a 2.6/2.8 compatibility problem.

Comment by Eric Milkie [ 14/Oct/14 ]

As the replication codepaths of 2.7.8 are about to change, we should put this on hold and retest after the switchover.
