[SERVER-20775] cluster not reachable while only one (of three) configserver was down Created: 06/Oct/15 Updated: 24/Feb/16 Resolved: 10/Feb/16 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 2.6.4, 2.6.10 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Kay Agahd | Assignee: | Ramon Fernandez Marina |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Operating System: | ALL |
| Participants: |
| Description |
|
We have one cluster consisting of 5 shards, each consisting of 3 physical replset members. 3 configservers and 3 routers (mongos) are running on 3 different VMs, called sx350, sx351 and sx352. We also have 3 other VMs, called offerstore-en-router-01, offerstore-en-router-02 and offerstore-en-router-03, on which we have installed 3 more routers (mongos). The problem is that no connections through mongos on offerstore-en-router-01, offerstore-en-router-02 and offerstore-en-router-03 were possible until sx352 came back, roughly 20 minutes after it had crashed! While sx352 was down, the mongo shell took so long to connect (using auth) that I closed it before it came back. Without using --user and --password, the mongo shell could connect quickly, but as soon as I entered db.auth("admin", "XXX"), the shell blocked, so I closed it after a few seconds. Do you know why one crashed configserver is able to compromise access to the cluster through mongos instances running on different VMs, and how one can avoid this issue? |
| Comments |
| Comment by Kay Agahd [ 24/Feb/16 ] |
|
alexbool, your case seems to be different because "you still can make queries through mongo console and succeed", which was not possible in my case, for which I opened this ticket. |
| Comment by Alexander Bulaev [ 24/Feb/16 ] |
|
This used to reproduce in our production and testing environments. |
| Comment by Ramon Fernandez Marina [ 10/Feb/16 ] |
|
kay.agahd@idealo.de, it just occurred to me that if the VM that went down was running the first config server this could also be MongoDB 3.2 includes support for Config Servers as a Replica Set, which is a big improvement over mirrored config servers. If you continue to experience reachability issues in your cluster because of config server unavailability I'd recommend you test MongoDB 3.2. Unfortunately I was not able to reproduce this specific behavior, so I'm going to resolve this ticket for now. If you find a reliable way to reproduce please post a comment here and we can reopen the ticket for further investigation. Regards, |
| Comment by Kay Agahd [ 10/Nov/15 ] |
|
Hi ramon.fernandez, sure, it's very difficult to reproduce. I also tried multiple times, in vain. However, I wanted to let you know about the issue. |
| Comment by Ramon Fernandez Marina [ 09/Nov/15 ] |
|
kay.agahd@idealo.de, I was not able to reproduce this behavior using a trivial setup (1 shard, 1 mongos, 3 config servers). That being said, the long delay could be related to |
| Comment by Kay Agahd [ 06/Oct/15 ] |
|
The logs of the configserver still running on sx350 show that user "admin" seemed to be connected during the time frame in which the sx352 server was down:
Other applications, written in Java, couldn't connect to any of the three routers running on offerstore-en-router-01, offerstore-en-router-02 and offerstore-en-router-03 either. Their stacktrace is as follows:
Here is the mongod and mongos version of offerstore-en-router-03:
Here is the mongod and mongos version of sx352:
I tried to reproduce the problem by shutting down the configserver on sx352 again. This time there was no problem connecting through any of the 3 mongos running on offerstore-en-router-01, offerstore-en-router-02 and offerstore-en-router-03. |
| Comment by Kay Agahd [ 06/Oct/15 ] |
|
Since I tried to connect against mongos running on offerstore-en-router-03, I filtered the log file by "user: admin" and listed everything that belongs to that connectionId:
As you can see, sx350 and sx351 were reachable, but every connection to the configserver on sx352:20019 failed. The question is why it blocked, since the two other configservers were still reachable. |
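
If this happens again, a quick TCP probe run from one of the router VMs could help separate "the configserver host is unreachable" from "mongos is blocking even though the other configservers answer". The following is only a minimal sketch under assumptions: the function names are hypothetical, and only sx352:20019 is taken from this ticket; that sx350 and sx351 listen on the same port is assumed. It checks plain TCP reachability with a short timeout and says nothing about mongos or authentication behaviour itself.

```python
import socket

# Only sx352:20019 is mentioned in this ticket; using the same port
# for sx350 and sx351 is an assumption made for illustration.
CONFIG_SERVERS = [("sx350", 20019), ("sx351", 20019), ("sx352", 20019)]


def probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def probe_all(servers, timeout=2.0):
    """Map each "host:port" string to whether it accepted a TCP connection."""
    return {f"{host}:{port}": probe(host, port, timeout) for host, port in servers}


def report(servers, timeout=2.0):
    """Print one line per server with its reachability."""
    for address, reachable in probe_all(servers, timeout).items():
        print(address, "reachable" if reachable else "UNREACHABLE")
```

Calling `report(CONFIG_SERVERS)` while one configserver is down should return after roughly the timeout per unreachable host, which gives a baseline to compare against how long the mongo shell hangs.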