[SERVER-25147] "not master" error from execCommand through mongos Created: 19/Jul/16 Updated: 21/Nov/16 Resolved: 21/Nov/16 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 3.0.8 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Kay Agahd | Assignee: | Randolph Tan |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: |
|
| Operating System: | ALL |
| Sprint: | Sharding 2016-10-10, Sharding 2016-11-21, Sharding 2016-12-12 |
| Participants: |
| Description |
|
We encounter a "not master" error even though the command was executed against a mongos and the primaries of both shards are up and running. This happens, for example, when trying to delete a database:
However, the command may succeed some seconds later in the same mongo shell using the same connection:
The error may also happen when executing mongorestore against the router:
It may also happen during mongorestore:
After the restore, some documents are missing if the "not master" error occurred. However, executing the failed command "listDatabases" directly against the replica set, using the same connection string that the router reported, works without problems:
These are our shards:
These are the replica set statuses:
Second shard:
|
| Comments |
| Comment by Kay Agahd [ 21/Nov/16 ] |
|
Hi renctan, today I replaced all IPs with hostnames in all mongod configs to try to reproduce the issue. However, the restore succeeded without any error. Regards, Kay |
| Comment by Kay Agahd [ 19/Nov/16 ] |
|
Hi renctan, I think MMS is wrong, because the mongod logs do not show any state change during the restore. If there really had been one or more state changes, our workaround would not have been successful either. To recall our workaround: we replaced the hostnames with IPs in each mongod config (rs.conf().members.host) and executed rs.reconfig(cfgWithIP). Since then the not-master error has not occurred anymore. It occurs again as soon as we replace the IPs with hostnames in rs.conf().members.host. Do you simply need more logs, or do you need logs with more verbosity? |
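One way to check a mongod log for replica set state changes during the restore window is to scan it for transition messages. The following is only a minimal sketch and is not taken from this ticket: it assumes that mongod writes a line containing "transition to <STATE>" whenever a member changes state (an assumption about the log format that should be verified against the actual logs), and the function name findStateTransitions and the sample lines are made up for illustration.

```javascript
// Sketch: list replica set state transitions found in mongod log text.
// ASSUMPTION: state changes are logged as "... transition to <STATE>".
function findStateTransitions(logText) {
  var pattern = /transition to ([A-Z0-9]+)/;
  return logText
    .split("\n")
    .filter(function (line) { return pattern.test(line); })
    .map(function (line) {
      return {
        timestamp: line.split(" ")[0],   // first field of the log line
        state: pattern.exec(line)[1]     // e.g. "PRIMARY", "SECONDARY"
      };
    });
}

// Illustrative input only; these lines are NOT from the attached logs.
var exampleLog = [
  "2016-01-01T10:00:00.000+0000 I REPL     [ReplicationExecutor] transition to SECONDARY",
  "2016-01-01T10:00:01.000+0000 I NETWORK  [conn1] end connection",
  "2016-01-01T10:00:05.000+0000 I REPL     [ReplicationExecutor] transition to PRIMARY"
].join("\n");

findStateTransitions(exampleLog).forEach(function (t) {
  console.log(t.timestamp + " -> " + t.state);
});
```

An empty result for the time window of a failed restore would support the observation that no state change happened on that node; a non-empty result would give the timestamps to compare against MMS.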
| Comment by Randolph Tan [ 18/Nov/16 ] |
|
Wanted to check again and see if you still need help. Thanks! |
| Comment by Randolph Tan [ 04/Nov/16 ] |
|
I checked the host in MMS and it looks like the node did become secondary and then primary again around both of the time periods close to the attached logs. If you have more log context (like a couple of hours), we can check why the node became secondary. Thanks! |
| Comment by Kay Agahd [ 17/Aug/16 ] |
|
Hi ramon.fernandez, thanks for the notice. Regards, |
| Comment by Ramon Fernandez Marina [ 17/Aug/16 ] |
|
The "debugging with submitter" fixVersion really should be read as "need verification", meaning we're still trying to determine if there's a bug in the server, and if so, where. If you were to point out that this terminology is not the clearest, I probably wouldn't disagree. Essentially this ticket is in our backlog of tickets that need further investigation. If the ticket is in the "Waiting for user input" state it means we're waiting for some information from you, but if it's in the "Open" state it means the ball is in our court. Thanks for your continued patience; we'll update this ticket as we know more about this particular issue. Regards, |
| Comment by Kay Agahd [ 15/Aug/16 ] |
|
The status of this ticket is "debugging with submitter" but we haven't heard from you for almost one month already. |
| Comment by Kay Agahd [ 28/Jul/16 ] |
|
We are able to reproduce the issue. The not-master error occurs again as soon as we replace the IPs with hostnames in rs.conf().members.host. |
| Comment by Kay Agahd [ 28/Jul/16 ] |
|
We found something that seems to work: we replaced the hostnames with IPs in each mongod config (rs.conf().members.host) and executed rs.reconfig(cfgWithIP). Since then the not-master error has not occurred anymore. Do you have any explanation for this? Do we need to use IPs instead of hostnames? |
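For reference, the workaround described above could be sketched roughly as follows. This is a hedged illustration, not the exact commands that were run: the helper name useIpHosts, the hostname mongo-a.example.net and the address 192.0.2.10 are hypothetical placeholders, and only rs.conf() and rs.reconfig() are the shell helpers named in this ticket.

```javascript
// Sketch: rewrite members[].host in a replica set config from
// hostname:port to ip:port. `hostToIp` is a HYPOTHETICAL lookup table.
// The config object is modified in place and returned.
function useIpHosts(cfg, hostToIp) {
  cfg.members.forEach(function (member) {
    var idx = member.host.lastIndexOf(":");
    var name = idx === -1 ? member.host : member.host.slice(0, idx);
    var port = idx === -1 ? "" : member.host.slice(idx); // keeps the ":"
    // Only touch members whose hostname appears in the lookup table.
    if (Object.prototype.hasOwnProperty.call(hostToIp, name)) {
      member.host = hostToIp[name] + port;
    }
  });
  return cfg;
}

// Intended use in the mongo shell, connected to the primary of a shard
// (placeholder values, adjust to the real members):
//   var cfgWithIP = useIpHosts(rs.conf(), { "mongo-a.example.net": "192.0.2.10" });
//   rs.reconfig(cfgWithIP);
```

The helper leaves members that are not in the lookup table unchanged, so it can be applied to one member at a time. Swapping the keys and values of the table gives the reverse change (IP back to hostname) that was used to reproduce the error.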
| Comment by Kay Agahd [ 28/Jul/16 ] |
|
Just for your info: in order to rule out DNS as the culprit, we added the following entries to /etc/hosts on the router mongo-router-01.hotel02.pro00.eu.idealo.com, from which we executed mongorestore:
As expected, and without our having restarted the services, only /etc/hosts is consulted: we don't see any DNS queries for the A records of mongo-035, mongo-036, mongo-037 or mongo-038 against either of our resolvers. However, mongorestore still throws the "not master" error. Any help, hints or thoughts are greatly appreciated. |
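A quick way to confirm that every replica set member has an entry in a hosts file is to parse the file and list the names that are missing. The sketch below is illustrative only: the function name missingHostsEntries, the hostnames and the 192.0.2.x addresses are made-up placeholders, and it assumes the usual hosts file layout of an address followed by one or more names, with "#" starting a comment.

```javascript
// Sketch: return the hostnames that have no entry in hosts-file text.
// ASSUMPTION: each line is "<address> <name> [<alias> ...]" and "#"
// starts a comment.
function missingHostsEntries(hostsText, hostnames) {
  var known = {};
  hostsText.split("\n").forEach(function (line) {
    var content = line.split("#")[0].trim(); // drop comments and padding
    if (!content) { return; }                // skip blank/comment lines
    var fields = content.split(/\s+/);
    fields.slice(1).forEach(function (name) {
      known[name.toLowerCase()] = fields[0]; // name -> address
    });
  });
  return hostnames.filter(function (name) {
    return !Object.prototype.hasOwnProperty.call(known, name.toLowerCase());
  });
}

// Placeholder data, NOT the entries used on the router in this ticket.
var exampleHosts = [
  "192.0.2.35 mongo-a.example.net mongo-a  # shard 1",
  "# a comment line",
  "192.0.2.36\tmongo-b.example.net"
].join("\n");

console.log(missingHostsEntries(exampleHosts, ["mongo-a", "mongo-b.example.net", "mongo-c.example.net"]));
```

On the router, the text could be read from /etc/hosts and the list of names taken from the host fields of each shard's replica set config.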
| Comment by Kay Agahd [ 26/Jul/16 ] |
|
Just for your info: the issue also happens when the current 3.0 version (3.0.12) of the mongo shell or mongorestore is used. |
| Comment by Kay Agahd [ 21/Jul/16 ] |
|
Just for your info: we have found out that MMSSUPPORT-10695 is unrelated to this issue. |
| Comment by Kay Agahd [ 21/Jul/16 ] |
|
Just for info: the error occurs nonetheless (even after having cleaned up MMS):
|
| Comment by Kay Agahd [ 21/Jul/16 ] |
|
Might this be related to issue MMSSUPPORT-10695, which I opened a few hours ago? Please see my comment from Jul 21 2016 05:04:49 PM GMT+0200. |
| Comment by Kay Agahd [ 19/Jul/16 ] |
|
Log files from 17:07 when the error occurred during mongorestore. |
| Comment by Kay Agahd [ 19/Jul/16 ] |
|
Please see attached the 3 log files of the time period when the error occurred.
I couldn't find a related error message in any of the mongod log files. By the way, the message "could not find user nagios@admin" is due to our monitoring, which expects the user nagios; since this MongoDB system is exceptionally running without authentication, we get the previously mentioned message in the logs. |
| Comment by Ramon Fernandez Marina [ 19/Jul/16 ] |
|
kay.agahd@idealo.de, can you please provide logs for the relevant mongos and mongod when you get a "not master" error? If you can reproduce the problem at will via mongorestore that may be the easiest way to narrow down the subset of logs to upload. Thanks, |