[SERVER-33257] Sync doesn't start when adding a new node Created: 10/Feb/18  Updated: 21/Mar/18  Resolved: 13/Feb/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.2.18, 3.2.19
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Alexander S. Assignee: Kelsey Schubert
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   

Two nodes, one on 3.2.18 and one on 3.2.19, both with the same config and the same replica set name rs0.

When adding a node to the replica set:

rs0:PRIMARY> rs.add({ host: "mongo2.nyc3.wegohealth.com:27017", priority: 0.5 })
{ "ok" : 1 }
rs0:PRIMARY> cfg = rs.conf()
{
	"_id" : "rs0",
	"version" : 690581,
	"members" : [
		{
			"_id" : 6,
			"host" : "mongo1.nyc3.wegohealth.com:27017",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 1,
			"tags" : {
 
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 1
		},
		{
			"_id" : 7,
			"host" : "mongo2.nyc3.wegohealth.com:27017",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 0.5,
			"tags" : {
 
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 1
		}
	],
	"settings" : {
		"chainingAllowed" : true,
		"heartbeatIntervalMillis" : 2000,
		"heartbeatTimeoutSecs" : 10,
		"electionTimeoutMillis" : 10000,
		"getLastErrorModes" : {
 
		},
		"getLastErrorDefaults" : {
			"w" : 1,
			"wtimeout" : 0
		}
	}
}

In the log file on the second node I see the following:

2018-02-10T06:21:22.318-0500 I CONTROL  [main] ***** SERVER RESTARTED *****
2018-02-10T06:21:22.323-0500 I CONTROL  [initandlisten] MongoDB starting : pid=7328 port=27017 dbpath=/srv/data/mongo/mongodb 64-bit host=mongo2
2018-02-10T06:21:22.323-0500 I CONTROL  [initandlisten] db version v3.2.19
2018-02-10T06:21:22.323-0500 I CONTROL  [initandlisten] git version: a9f574de6a566a58b24d126b44a56718d181e989
2018-02-10T06:21:22.323-0500 I CONTROL  [initandlisten] OpenSSL version: OpenSSL 1.0.2g  1 Mar 2016
2018-02-10T06:21:22.323-0500 I CONTROL  [initandlisten] allocator: tcmalloc
2018-02-10T06:21:22.323-0500 I CONTROL  [initandlisten] modules: none
2018-02-10T06:21:22.323-0500 I CONTROL  [initandlisten] build environment:
2018-02-10T06:21:22.323-0500 I CONTROL  [initandlisten]     distmod: ubuntu1604
2018-02-10T06:21:22.323-0500 I CONTROL  [initandlisten]     distarch: x86_64
2018-02-10T06:21:22.323-0500 I CONTROL  [initandlisten]     target_arch: x86_64
2018-02-10T06:21:22.323-0500 I CONTROL  [initandlisten] options: { config: "/etc/mongod.conf", net: { bindIp: "0.0.0.0", port: 27017 }, operationProfiling: { mode: "slowOp", slowOpThresholdMs: 300 }, processManagement: { fork: false }, replication: { replSetName: "rs0" }, storage: { dbPath: "/srv/data/mongo/mongodb", engine: "wiredTiger", journal: { enabled: true }, mmapv1: { smallFiles: true } }, systemLog: { destination: "file", logAppend: true, path: "/srv/data/mongo/log/mongod.log", quiet: true } }
2018-02-10T06:21:22.339-0500 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=8G,session_max=20000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),verbose=(recovery_progress),
2018-02-10T06:21:22.429-0500 I STORAGE  [initandlisten] WiredTiger [1518261682:429075][7328:0x7fa66cebdc80], txn-recover: Main recovery loop: starting at 2/6656
2018-02-10T06:21:22.479-0500 I STORAGE  [initandlisten] WiredTiger [1518261682:479659][7328:0x7fa66cebdc80], txn-recover: Recovering log 2 through 3
2018-02-10T06:21:22.480-0500 I STORAGE  [initandlisten] WiredTiger [1518261682:480392][7328:0x7fa66cebdc80], txn-recover: Recovering log 3 through 3
2018-02-10T06:21:22.558-0500 W STORAGE  [initandlisten] Detected configuration for non-active storage engine mmapv1 when current storage engine is wiredTiger
2018-02-10T06:21:22.558-0500 I CONTROL  [initandlisten]
2018-02-10T06:21:22.558-0500 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/enabled is 'always'.
2018-02-10T06:21:22.558-0500 I CONTROL  [initandlisten] **        We suggest setting it to 'never'
2018-02-10T06:21:22.558-0500 I CONTROL  [initandlisten]
2018-02-10T06:21:22.558-0500 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/defrag is 'always'.
2018-02-10T06:21:22.558-0500 I CONTROL  [initandlisten] **        We suggest setting it to 'never'
2018-02-10T06:21:22.558-0500 I CONTROL  [initandlisten]
2018-02-10T06:21:22.559-0500 I REPL     [initandlisten] Did not find local voted for document at startup.
2018-02-10T06:21:22.559-0500 I REPL     [initandlisten] Did not find local replica set configuration document at startup;  NoMatchingDocument: Did not find replica set configuration document in local.system.replset
2018-02-10T06:21:22.559-0500 I NETWORK  [HostnameCanonicalizationWorker] Starting hostname canonicalization worker
2018-02-10T06:21:22.559-0500 I FTDC     [initandlisten] Initializing full-time diagnostic data capture with directory '/srv/data/mongo/mongodb/diagnostic.data'
2018-02-10T06:21:22.560-0500 I NETWORK  [initandlisten] waiting for connections on port 27017
2018-02-10T06:21:50.487-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to mongo1.nyc3.wegohealth.com:27017
2018-02-10T06:21:50.488-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to mongo1.nyc3.wegohealth.com:27017, took 1ms (1 connections now open to mongo1.nyc3.wegohealth.com:27017)

But synchronization doesn't happen, and there is no data on the new node:

> rs.slaveOk()
> show databases
local  0.000GB

The replica set config is missing:

> db.isMaster()
{
	"ismaster" : false,
	"secondary" : false,
	"info" : "Does not have a valid replica set config",
	"isreplicaset" : true,
	"maxBsonObjectSize" : 16777216,
	"maxMessageSizeBytes" : 48000000,
	"maxWriteBatchSize" : 1000,
	"localTime" : ISODate("2018-02-10T11:24:17.697Z"),
	"maxWireVersion" : 4,
	"minWireVersion" : 0,
	"ok" : 1
}
> rs.conf()
2018-02-10T06:24:27.130-0500 E QUERY    [thread1] Error: Could not retrieve replica set config: {
	"info" : "run rs.initiate(...) if not yet done for the set",
	"ok" : 0,
	"errmsg" : "no replset config has been received",
	"code" : 94
} :
rs.conf@src/mongo/shell/utils.js:1217:11
@(shell):1:1

Even though the primary's log shows that the node was added and it exists in the replica set config:

2018-02-10T06:21:50.486-0500 I REPL     [conn2562] replSetReconfig admin command received from client
2018-02-10T06:21:50.487-0500 I REPL     [conn2562] replSetReconfig config object with 2 members parses ok
2018-02-10T06:21:50.487-0500 I ASIO     [ReplicationExecutor] dropping unhealthy pooled connection to mongo2.nyc3.wegohealth.com:27017
2018-02-10T06:21:50.487-0500 I ASIO     [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2018-02-10T06:21:50.487-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to mongo2.nyc3.wegohealth.com:27017
2018-02-10T06:21:50.488-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to mongo2.nyc3.wegohealth.com:27017, took 1ms (1 connections now open to mongo2.nyc3.wegohealth.com:27017)
2018-02-10T06:21:50.488-0500 I REPL     [ReplicationExecutor] New replica set config in use: { _id: "rs0", version: 690581, members: [ { _id: 6, host: "mongo1.nyc3.wegohealth.com:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 7, host: "mongo2.nyc3.wegohealth.com:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 0.5, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 } } }
2018-02-10T06:21:50.488-0500 I REPL     [ReplicationExecutor] This node is mongo1.nyc3.wegohealth.com:27017 in the config
2018-02-10T06:21:50.489-0500 I REPL     [ReplicationExecutor] Member mongo2.nyc3.wegohealth.com:27017 is now in state STARTUP
2018-02-10T06:22:22.179-0500 I STORAGE  [conn1467] passes = 5000 in CappedRecordStoreV1::allocRecord: ns: local.oplog.rs, lenToAlloc: 2191240, maxCappedDocs: 9223372036854775807, nrecords: 29226101, datasize: 4444541292. Continuing to delete old records to make room.
2018-02-10T06:22:22.184-0500 I STORAGE  [conn1467] passes = 10000 in CappedRecordStoreV1::allocRecord: ns: local.oplog.rs, lenToAlloc: 2191240, maxCappedDocs: 9223372036854775807, nrecords: 29221101, datasize: 4443817604. Continuing to delete old records to make room.
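
For reference, a quick way to confirm that the new member never progresses past STARTUP is to list the member states from the primary (illustrative command only; I didn't capture its output at the time):

rs0:PRIMARY> rs.status().members.forEach(function (m) { print(m.name + "  " + m.stateStr) })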



 Comments   
Comment by Kelsey Schubert [ 13/Feb/18 ]

Thanks for following up, alexs – I'm glad you were able to track down the misconfiguration.

Comment by Alexander S. [ 13/Feb/18 ]

Sorry, the bug was on our side – we had a wrong DNS configuration in Chef for this new node, so both the mongo1.nyc3.wegohealth.com and mongo2.nyc3.wegohealth.com names appeared in the /etc/hosts file on the mongo2 node, both mapped to the IP address of the mongo2 node itself.

So the mongo2 node saw that it had been added to a replica set, but when it tried to connect back to the primary, it connected to itself instead.
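
For illustration, the /etc/hosts on mongo2 ended up looking roughly like this (the IP addresses here are made up), so both hostnames resolved to the local node:

# /etc/hosts on mongo2 (sketch; real IPs not shown)
10.0.0.2   mongo2.nyc3.wegohealth.com mongo2
10.0.0.2   mongo1.nyc3.wegohealth.com mongo1   # wrong: this should be mongo1's own IP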

Comment by Alexander S. [ 13/Feb/18 ]

Hi, I already dropped the secondary that I tried to add, but I will re-deploy it now and try again. I didn't try other hosts; what I know for sure is that I was able to add nodes running 3.2.18 to this replica set previously.

This time I only needed to change the storage engine, so the plan was to add a new node running the WiredTiger engine, wait for it to catch up, and then step down the previous PRIMARY running MMAPv1. So the differences this time were that the nodes were running different storage engines and different versions: MMAPv1 on 3.2.18 and WiredTiger on 3.2.19. Later I upgraded the old node to 3.2.19 so both were running the same version, which didn't help; then I tried MMAPv1 on the new node as well (I removed the old data directory with the WiredTiger files) so both were running the same storage engine, which didn't help either.
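
A sketch of the intended sequence in the shell (the priority matches the rs.add above; waiting for the new member to catch up happens in between):

rs0:PRIMARY> rs.add({ host: "mongo2.nyc3.wegohealth.com:27017", priority: 0.5 })  // new node on WiredTiger
// ...wait until the new member reaches SECONDARY and its optime catches up...
rs0:PRIMARY> rs.stepDown()  // then demote the old MMAPv1 primary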

So the only remaining difference is that the replication worked on 3.2.18 and stopped working on 3.2.19.

Comment by Kelsey Schubert [ 13/Feb/18 ]

Hi alexs,

Thanks for reporting this issue. So we can continue to investigate, would you please provide the complete logs from the primary, mongo1.nyc3.wegohealth.com:27017? Are you able to successfully add nodes from other hosts?

Kind regards,
Kelsey

Comment by Alexander S. [ 10/Feb/18 ]

Also tried updating both nodes to 3.2.19, which didn't help. The new node still does not receive the config.
