[SERVER-25147] "not master" error from execCommand through mongos Created: 19/Jul/16  Updated: 21/Nov/16  Resolved: 21/Nov/16

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.0.8
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Kay Agahd Assignee: Randolph Tan
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File mongo-035.log     Text File mongo-035_1707.log     Text File mongo-036.log     Text File mongo-036_1707.log     Text File router.log     Text File router_1707.log    
Operating System: ALL
Sprint: Sharding 2016-10-10, Sharding 2016-11-21, Sharding 2016-12-12
Participants:

 Description   

We encounter "not master" error even though the command has been executed against a mongos and all primaries of all (both) shards are up and running.

This happens, for example, when trying to drop a database:

mongos> db.dropDatabase()
{
	"ok" : 0,
	"errmsg" : "{ note: \"from execCommand\", ok: 0.0, errmsg: \"not master\" }"
}

However, the command may succeed a few seconds later in the same mongo shell, over the same connection:

mongos> db.dropDatabase()
{ "dropped" : "loadtest", "ok" : 1 }
mongos>

The error may also happen when executing mongorestore against the router:

mongorestore -h mongo-router-01.hotel02.pro00.eu.idealo.com --port 27017 --db kay /data/reise/tmp/dump/import_20160713/
2016-07-19T15:41:16.055+0200	building a list of collections to restore from /data/reise/tmp/dump/import_20160713/ dir
2016-07-19T15:41:16.060+0200	Failed: restore error: kay.additional: error reading database: error creating initial database config information :: caused by :: call to listDatabases on reise_shard01/mongo-035.ipx:27017,mongo-036.ipx:27017 failed: { note: "from execCommand", ok: 0.0, errmsg: "not master" }

It may also happen in the middle of a mongorestore run:

admin@sx724:~$ mongorestore -h mongo-router-01.hotel02.pro00.eu.idealo.com --port 27017 --db kay2 /data/reise/tmp/dump/import_20160713/
2016-07-19T17:07:04.430+0200	building a list of collections to restore from /data/reise/tmp/dump/import_20160713/ dir
2016-07-19T17:07:04.731+0200	reading metadata file from /data/reise/tmp/dump/import_20160713/additional.metadata.json
2016-07-19T17:07:04.731+0200	restoring kay2.additional from file /data/reise/tmp/dump/import_20160713/additional.bson
2016-07-19T17:07:04.732+0200	reading metadata file from /data/reise/tmp/dump/import_20160713/catalog.metadata.json
2016-07-19T17:07:04.733+0200	restoring kay2.catalog from file /data/reise/tmp/dump/import_20160713/catalog.bson
2016-07-19T17:07:04.734+0200	reading metadata file from /data/reise/tmp/dump/import_20160713/offer.metadata.json
2016-07-19T17:07:04.734+0200	restoring kay2.offer from file /data/reise/tmp/dump/import_20160713/offer.bson
2016-07-19T17:07:04.735+0200	reading metadata file from /data/reise/tmp/dump/import_20160713/pages.metadata.json
2016-07-19T17:07:04.736+0200	restoring kay2.pages from file /data/reise/tmp/dump/import_20160713/pages.bson
2016-07-19T17:07:07.431+0200	[........................]  kay2.additional   32.0 MB/13.5 GB  (0.2%)
2016-07-19T17:07:07.431+0200	[........................]     kay2.catalog    32.0 MB/3.8 GB  (0.8%)
2016-07-19T17:07:07.431+0200	[........................]       kay2.offer     8.1 MB/1.8 GB  (0.4%)
2016-07-19T17:07:07.431+0200	[........................]       kay2.pages  32.0 MB/945.6 MB  (3.4%)
2016-07-19T17:07:07.431+0200
2016-07-19T17:07:10.431+0200	[........................]  kay2.additional   32.0 MB/13.5 GB  (0.2%)
2016-07-19T17:07:10.431+0200	[........................]     kay2.catalog    32.0 MB/3.8 GB  (0.8%)
2016-07-19T17:07:10.431+0200	[........................]       kay2.offer     8.1 MB/1.8 GB  (0.4%)
2016-07-19T17:07:10.431+0200	[........................]       kay2.pages  32.0 MB/945.6 MB  (3.4%)
2016-07-19T17:07:10.431+0200
2016-07-19T17:07:13.431+0200	[........................]  kay2.additional   32.0 MB/13.5 GB  (0.2%)
2016-07-19T17:07:13.431+0200	[........................]     kay2.catalog    32.0 MB/3.8 GB  (0.8%)
2016-07-19T17:07:13.431+0200	[........................]       kay2.offer     8.1 MB/1.8 GB  (0.4%)
2016-07-19T17:07:13.431+0200	[........................]       kay2.pages  32.0 MB/945.6 MB  (3.4%)
2016-07-19T17:07:13.431+0200
2016-07-19T17:07:16.421+0200	error: not master
2016-07-19T17:07:16.431+0200	[........................]  kay2.additional   32.0 MB/13.5 GB  (0.2%)
2016-07-19T17:07:16.431+0200	[........................]     kay2.catalog    32.0 MB/3.8 GB  (0.8%)
2016-07-19T17:07:16.431+0200	[........................]       kay2.offer    17.9 MB/1.8 GB  (1.0%)
2016-07-19T17:07:16.431+0200	[........................]       kay2.pages  32.0 MB/945.6 MB  (3.4%)
2016-07-19T17:07:16.431+0200
2016-07-19T17:07:16.787+0200	error: not master
2016-07-19T17:07:17.635+0200	error: not master
2016-07-19T17:07:18.541+0200	error: not master
2016-07-19T17:07:18.556+0200	error: not master
2016-07-19T17:07:19.037+0200	error: not master
2016-07-19T17:07:19.431+0200	[........................]  kay2.additional   95.9 MB/13.5 GB   (0.7%)
...

After the restore, some documents are missing whenever the "not master" error occurred.
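For illustration, a rough way to see the gap is to compare the restored collection counts (through mongos) against the document totals mongorestore reports while reading the dump; the database and collection names below are simply the ones from the example above:

// Run through mongos after the restore finishes; compare each count
// with the totals mongorestore printed for the corresponding collection.
db.getSiblingDB("kay2").additional.count()
db.getSiblingDB("kay2").catalog.count()
db.getSiblingDB("kay2").offer.count()
db.getSiblingDB("kay2").pages.count()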

However, executing the failed command "listDatabases" directly against the replica set, using the same connection string the router reported, works without problems:

$ mongo --host reise_shard01/mongo-035.ipx:27017,mongo-036.ipx:27017
MongoDB shell version: 3.0.2
connecting to: reise_shard01/mongo-035.ipx:27017,mongo-036.ipx:27017/test
2016-07-19T15:58:52.973+0200 I NETWORK  starting new replica set monitor for replica set reise_shard01 with seeds mongo-035.ipx:27017,mongo-036.ipx:27017
2016-07-19T15:58:52.973+0200 I NETWORK  [ReplicaSetMonitorWatcher] starting
reise_shard01:PRIMARY> db.adminCommand({listDatabases:1})
{
	"databases" : [
		{
			"name" : "loadtest",
			"sizeOnDisk" : 1466814464,
			"empty" : false
		},
...

These are our shards:

mongos> sh.status()
--- Sharding Status ---
  sharding version: {
	"_id" : 1,
	"minCompatibleVersion" : 5,
	"currentVersion" : 6,
	"clusterId" : ObjectId("5661b33ff5f2bfb6a339840e")
}
  shards:
	{  "_id" : "reise_shard01",  "host" : "reise_shard01/mongo-035.ipx:27017,mongo-036.ipx:27017" }
	{  "_id" : "reise_shard02",  "host" : "reise_shard02/mongo-037.ipx:27017,mongo-038.ipx:27017" }
  active mongoses:
	"3.0.8" : 3
	....

These are the replica set statuses:
First shard:

reise_shard01:PRIMARY> rs.status()
{
	"set" : "reise_shard01",
	"date" : ISODate("2016-07-19T16:17:23.053Z"),
	"myState" : 1,
	"members" : [
		{
			"_id" : 1,
			"name" : "mongo-036.ipx:27017",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			"uptime" : 1063103,
			"optime" : Timestamp(1468945028, 20),
			"optimeDate" : ISODate("2016-07-19T16:17:08Z"),
			"lastHeartbeat" : ISODate("2016-07-19T16:17:21.469Z"),
			"lastHeartbeatRecv" : ISODate("2016-07-19T16:17:22.490Z"),
			"pingMs" : 0,
			"syncingTo" : "mongo-035.ipx:27017",
			"configVersion" : 7
		},
		{
			"_id" : 2,
			"name" : "mongo-router-01.hotel02.pro00.eu.idealo.com:27018",
			"health" : 1,
			"state" : 7,
			"stateStr" : "ARBITER",
			"uptime" : 15576582,
			"lastHeartbeat" : ISODate("2016-07-19T16:17:21.590Z"),
			"lastHeartbeatRecv" : ISODate("2016-07-19T16:17:21.662Z"),
			"pingMs" : 0,
			"configVersion" : 7
		},
		{
			"_id" : 3,
			"name" : "mongo-035.ipx:27017",
			"health" : 1,
			"state" : 1,
			"stateStr" : "PRIMARY",
			"uptime" : 15576941,
			"optime" : Timestamp(1468945028, 20),
			"optimeDate" : ISODate("2016-07-19T16:17:08Z"),
			"electionTime" : Timestamp(1453368132, 1),
			"electionDate" : ISODate("2016-01-21T09:22:12Z"),
			"configVersion" : 7,
			"self" : true
		}
	],
	"ok" : 1,
	"$gleStats" : {
		"lastOpTime" : Timestamp(0, 0),
		"electionId" : ObjectId("56a0a344b5288f9897220e23")
	}
}

Second shard:

reise_shard02:PRIMARY> rs.status()
{
	"set" : "reise_shard02",
	"date" : ISODate("2016-07-19T16:18:28.306Z"),
	"myState" : 1,
	"members" : [
		{
			"_id" : 0,
			"name" : "mongo-037.ipx:27017",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			"uptime" : 3527123,
			"optime" : Timestamp(1468945106, 551),
			"optimeDate" : ISODate("2016-07-19T16:18:26Z"),
			"lastHeartbeat" : ISODate("2016-07-19T16:18:26.489Z"),
			"lastHeartbeatRecv" : ISODate("2016-07-19T16:18:27.153Z"),
			"pingMs" : 0,
			"syncingTo" : "mongo-038.ipx:27017",
			"configVersion" : 4
		},
		{
			"_id" : 1,
			"name" : "mongo-038.ipx:27017",
			"health" : 1,
			"state" : 1,
			"stateStr" : "PRIMARY",
			"uptime" : 15576995,
			"optime" : Timestamp(1468945108, 132),
			"optimeDate" : ISODate("2016-07-19T16:18:28Z"),
			"electionTime" : Timestamp(1453368159, 1),
			"electionDate" : ISODate("2016-01-21T09:22:39Z"),
			"configVersion" : 4,
			"self" : true
		},
		{
			"_id" : 2,
			"name" : "mongo-router-02.hotel02.pro00.eu.idealo.com:27018",
			"health" : 1,
			"state" : 7,
			"stateStr" : "ARBITER",
			"uptime" : 3557101,
			"lastHeartbeat" : ISODate("2016-07-19T16:18:27.972Z"),
			"lastHeartbeatRecv" : ISODate("2016-07-19T16:18:26.996Z"),
			"pingMs" : 0,
			"configVersion" : 4
		}
	],
	"ok" : 1,
	"$gleStats" : {
		"lastOpTime" : Timestamp(0, 0),
		"electionId" : ObjectId("56a0a35f00d0725cbecc2f18")
	}
}



 Comments   
Comment by Kay Agahd [ 21/Nov/16 ]

Hi renctan,

Today I replaced all IPs with hostnames in all mongod configs to try to reproduce the issue. However, the restore succeeded without any error.
So please feel free to close this ticket as "not reproducible", even though we don't understand why the mongod logs say nothing about state change(s), in contrast to MMS.

Regards,

Kay

Comment by Kay Agahd [ 19/Nov/16 ]

Hi renctan,

I think MMS is wrong, because the mongod logs do not show any state change during the restore.

If there really had been one or more state changes, our workaround would not have been successful either. Just to recall our workaround: we replaced the hostnames with IPs in each member's rs.conf().members.host and executed rs.reconfig(cfgWithIP). Since then there has been no "not master" error anymore. The "not master" error occurs again as soon as we put the hostnames back into rs.conf().members.host.

Do you simply need more logs, or do you need logs with more verbosity?
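In case higher verbosity would help, this is roughly how we would raise it on the affected mongod from the shell (a sketch only; the components and levels chosen here are our assumption of what would be useful):

// Run directly against the mongod in question (not through mongos).
db.setLogLevel(1)                   // raise the global verbosity
db.setLogLevel(2, "replication")    // log replica set state transitions in detail
db.setLogLevel(2, "network")        // log connection handling in detail

// Reset afterwards, since the higher levels are noisy:
db.setLogLevel(0)
db.setLogLevel(-1, "replication")   // -1 = inherit the global level again
db.setLogLevel(-1, "network")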

Comment by Randolph Tan [ 18/Nov/16 ]

Hi kay.agahd@idealo.de,

Wanted to check again and see if you still need help.

Thanks!

Comment by Randolph Tan [ 04/Nov/16 ]

Hi kay.agahd@idealo.de,

I checked the host in MMS and it looks like the node did become secondary and then primary again around both of the time periods covered by the attached logs. If you have more log context (a couple of hours, say), we can check why the node became secondary.

Thanks!

Comment by Kay Agahd [ 17/Aug/16 ]

Hi ramon.fernandez,

thanks for the notice.
We're looking forward to the issue eventually being resolved.

Regards,
Kay

Comment by Ramon Fernandez Marina [ 17/Aug/16 ]

Hi kay.agahd@idealo.de,

the "debugging with submitter" fixVersion really should be read as "need verification", meaning we're still trying to determine if there's a bug in the server, and if so, where. If you were to point out that this terminology is not the clearest I probably wouldn't disagree

Essentially this ticket is in our backlog of tickets that need further investigation. If the ticket is in the "Waiting for user input" state it means we're waiting for some information from you, but if it's in the "Open" state it means the ball is in our court. Thanks for your continued patience, we'll update this ticket as we know more about this particular issue.

Regards,
Ramón.

Comment by Kay Agahd [ 15/Aug/16 ]

The status of this ticket is "debugging with submitter", but we haven't heard from you for almost a month now.
Do you need any supplementary information from us to work on this issue?

Comment by Kay Agahd [ 28/Jul/16 ]

We are able to reproduce the issue: the "not master" error occurs again as soon as we replace the IPs with hostnames in rs.conf().members.host.
It seems that mongod's resolver library has a bug.

Comment by Kay Agahd [ 28/Jul/16 ]

We found something that seems to work: we replaced the hostnames with IPs in each member's rs.conf().members.host and executed rs.reconfig(cfgWithIP). Since then there has been no "not master" error anymore.

Do you have any explanation for this? Do we need to use IPs instead of hostnames?
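For reference, this is roughly what we did from the mongo shell, connected to the primary of each shard (a sketch only; the member indexes and IP addresses below are placeholders, not our real configuration):

// Connected to the PRIMARY of the replica set:
cfgWithIP = rs.conf()

// Replace each member's hostname with its IP address
// (placeholder addresses; repeat for every member, including the arbiter).
cfgWithIP.members[0].host = "192.0.2.36:27017"
cfgWithIP.members[1].host = "192.0.2.10:27018"
cfgWithIP.members[2].host = "192.0.2.35:27017"

rs.reconfig(cfgWithIP)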

Comment by Kay Agahd [ 28/Jul/16 ]

Just for your info: in order to exclude DNS as the culprit, we added the following entries to /etc/hosts on the router mongo-router-01.hotel02.pro00.eu.idealo.com, from which we executed mongorestore:

172.16.70.243	mongo-035.ipx	mongo-035
172.16.70.243	mongo-036.ipx	mongo-036
172.16.70.244	mongo-037.ipx	mongo-037
172.16.70.245	mongo-038.ipx	mongo-038

As expected, and without having restarted the services, only /etc/hosts is now consulted: we don't see any A-record DNS queries for mongo-035, mongo-036, mongo-037 or mongo-038 against either of our resolvers.

However, mongorestore still throws the "not master" error.

Any help, hints or thoughts are greatly appreciated.

Comment by Kay Agahd [ 26/Jul/16 ]

Just for your info: the issue also happens when the current 3.0 version (3.0.12) of the mongo shell or mongorestore is used.

Comment by Kay Agahd [ 21/Jul/16 ]

Just for your info: we have found out that MMSSUPPORT-10695 is unrelated to this issue.

Comment by Kay Agahd [ 21/Jul/16 ]

Just for info: the error occurs nonetheless (even after having cleaned up MMS):

[17:49:52]root@sx724.ipx /home/admin# mongorestore -h mongo-router-01.hotel02.pro00.eu.idealo.com --port 27017 --db kay /data/reise/tmp/dump/import_20160713/
2016-07-21T17:49:57.606+0200	building a list of collections to restore from /data/reise/tmp/dump/import_20160713/ dir
2016-07-21T17:49:57.852+0200	reading metadata file from /data/reise/tmp/dump/import_20160713/additional.metadata.json
2016-07-21T17:49:57.853+0200	restoring kay.additional from file /data/reise/tmp/dump/import_20160713/additional.bson
2016-07-21T17:49:57.854+0200	reading metadata file from /data/reise/tmp/dump/import_20160713/catalog.metadata.json
2016-07-21T17:49:57.854+0200	restoring kay.catalog from file /data/reise/tmp/dump/import_20160713/catalog.bson
2016-07-21T17:49:57.855+0200	reading metadata file from /data/reise/tmp/dump/import_20160713/offer.metadata.json
2016-07-21T17:49:57.856+0200	restoring kay.offer from file /data/reise/tmp/dump/import_20160713/offer.bson
2016-07-21T17:49:57.859+0200	reading metadata file from /data/reise/tmp/dump/import_20160713/pages.metadata.json
2016-07-21T17:49:57.859+0200	restoring kay.pages from file /data/reise/tmp/dump/import_20160713/pages.bson
2016-07-21T17:50:00.607+0200	[........................]  kay.additional   32.0 MB/13.5 GB  (0.2%)
2016-07-21T17:50:00.607+0200	[........................]     kay.catalog    32.0 MB/3.8 GB  (0.8%)
2016-07-21T17:50:00.607+0200	[........................]       kay.offer    16.2 MB/1.8 GB  (0.9%)
2016-07-21T17:50:00.607+0200	[........................]       kay.pages  32.0 MB/945.6 MB  (3.4%)
2016-07-21T17:50:00.607+0200
2016-07-21T17:50:03.607+0200	[........................]  kay.additional   95.9 MB/13.5 GB  (0.7%)
2016-07-21T17:50:03.607+0200	[........................]     kay.catalog    64.0 MB/3.8 GB  (1.6%)
2016-07-21T17:50:03.607+0200	[........................]       kay.offer    32.4 MB/1.8 GB  (1.8%)
2016-07-21T17:50:03.607+0200	[#.......................]       kay.pages  64.0 MB/945.6 MB  (6.8%)
2016-07-21T17:50:03.607+0200
2016-07-21T17:50:06.080+0200	error: not master
2016-07-21T17:50:06.174+0200	error: not master
2016-07-21T17:50:06.607+0200	[........................]  kay.additional   127.9 MB/13.5 GB   (0.9%)
2016-07-21T17:50:06.607+0200	[........................]     kay.catalog    128.0 MB/3.8 GB   (3.3%)
2016-07-21T17:50:06.607+0200	[........................]       kay.offer     48.6 MB/1.8 GB   (2.7%)
2016-07-21T17:50:06.607+0200	[###.....................]       kay.pages  128.0 MB/945.6 MB  (13.5%)
2016-07-21T17:50:06.607+0200
2016-07-21T17:50:07.201+0200	error: not master
2016-07-21T17:50:09.607+0200	[........................]  kay.additional   191.9 MB/13.5 GB   (1.4%)
2016-07-21T17:50:09.607+0200	[........................]     kay.catalog    160.0 MB/3.8 GB   (4.1%)
2016-07-21T17:50:09.607+0200	[........................]       kay.offer     64.8 MB/1.8 GB   (3.6%)
2016-07-21T17:50:09.607+0200	[####....................]       kay.pages  160.0 MB/945.6 MB  (16.9%)

Comment by Kay Agahd [ 21/Jul/16 ]

Might this have to do with issue MMSSUPPORT-10695, which I opened a few hours ago? Please see my comment from Jul 21 2016 05:04:49 PM GMT+0200.

Comment by Kay Agahd [ 19/Jul/16 ]

Attached are the log files from 17:07, when the error occurred during mongorestore.

Comment by Kay Agahd [ 19/Jul/16 ]

Please find attached the 3 log files covering the time period when the error occurred.
This is from router.log at the time of the error:

2016-07-19T15:41:16.060+0200 I SHARDING [conn1200183] couldn't find database [kay] in config db
2016-07-19T15:41:16.064+0200 W SHARDING [conn1200183] error creating initial database config information :: caused by :: call to listDatabases on reise_shard01/mongo-035.ipx:27017,mongo-036.ipx:27017 failed: { note: "from execCommand", ok: 0.0, errmsg: "not master" }
2016-07-19T15:41:16.065+0200 I SHARDING [conn1200183] couldn't find database [kay] in config db
2016-07-19T15:41:16.066+0200 W SHARDING [conn1200183] error creating initial database config information :: caused by :: call to listDatabases on reise_shard01/mongo-035.ipx:27017,mongo-036.ipx:27017 failed: { note: "from execCommand", ok: 0.0, errmsg: "not master" }


I couldn't find a related error message in either of the mongod log files.

By the way, the message "could not find user nagios@admin" is due to our monitoring, which expects a user named nagios; since this MongoDB system exceptionally runs without authentication, we get the previously mentioned message in the logs.

Comment by Ramon Fernandez Marina [ 19/Jul/16 ]

kay.agahd@idealo.de, can you please provide logs from the relevant mongos and mongod from when you get a "not master" error? If you can reproduce the problem at will via mongorestore, that may be the easiest way to narrow down the subset of logs to upload.

Thanks,
Ramón.
