[SERVER-5911] Mongos crash on restart (signal 11) and never figures out a replicaset primary/secondary change

| Created: | 23/May/12 | Updated: | 11/Jul/16 | Resolved: | 27/Nov/12 |
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.0.2, 2.0.5 |
| Fix Version/s: | 2.1.2 |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Johnny Boy | Assignee: | Mathias Stearn |
| Resolution: | Done | Votes: | 0 |
| Labels: | crash, mongos |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Debian stable with latest .deb from 10gen |
| Attachments: | mongos.log |
| Operating System: | Linux |
| Participants: | Johnny Boy, Mathias Stearn |
| Description |
I have two machines running mongos locally, each connecting to 4 replica sets in a sharded environment. One set has a primary, a secondary, and an arbiter. Sometimes when I flip the primary/secondary roles, mongos has trouble figuring out which member is the primary. This is what the error looks like after switching the primary:

```
Wed May 23 11:14:28 [conn386] DBException in process: could not initialize cursor across all shards because : socket exception @ DuegoB/mongo1:27017,mongo4:27017
Wed May 23 11:14:29 [conn389] ns: xxx.communications could not initialize cursor across all shards because : stale config detected for ns: xxx.communications ParallelCursor::_init @ DuegoB/mongo1:27017,mongo4:27017 attempt: 0
```

Also see the attached mongos.log for what happens sometimes when I try to restart mongos (since it never seemed to find the new primary). Another weird entry in that log mentions 172.16.49.111, which is one of the servers in the replica set that switched primary, but it also mentions Duego2/mongo2:27027,mongo3:27027, which is not part of this set?
| Comments |
| Comment by Mathias Stearn [ 28/Aug/12 ] |
The failure on shutdown has been fixed for 2.2: we switched from `exit()` to `_exit()`. There have also been many fixes to the mongos replica set code. Could you try 2.2 and see whether you still hit this issue? Note that you may still get some failed operations during the handoff (or immediately after it), but mongos should detect the change and send all subsequent requests to the new primary.
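For context on the fix: `exit()` runs atexit handlers and static destructors, which can race with worker threads that are still running, while `_exit()` terminates the process immediately with no cleanup. A minimal sketch of the failure mode (hypothetical code, not the actual mongos shutdown path):

```cpp
// Hypothetical sketch -- not the actual mongos shutdown path -- of why
// exit() can crash a multi-threaded process while _exit() cannot.
#include <cstddef>
#include <thread>
#include <unistd.h>  // _exit()
#include <vector>

// A static object with a destructor: exit() destroys it during shutdown,
// _exit() leaves it alone.
static std::vector<int> sharedState(1024, 0);

// Stands in for a mongos connection handler that keeps touching shared
// state while the process is shutting down.
void worker() {
    for (;;) {
        for (std::size_t i = 0; i < sharedState.size(); ++i) {
            ++sharedState[i];
        }
    }
}

int main() {
    std::thread t(worker);
    t.detach();

    // exit(0) would run static destructors and atexit handlers while the
    // detached worker is still dereferencing sharedState, which can end
    // in a SIGSEGV (signal 11) during shutdown.
    // _exit(0) terminates every thread immediately with no cleanup, so
    // the race never happens -- the behavior the 2.2 fix relies on.
    _exit(0);
}
```

Built with `g++ -std=c++11 -pthread`, calling `exit(0)` here instead of `_exit(0)` is undefined behavior and can crash intermittently, matching the signal 11 seen on mongos shutdown.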
| Comment by Johnny Boy [ 23/May/12 ] |
It also lists "attempt: 0" several times in the log, which seems a bit off since it actually retries several times.
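One plausible, purely hypothetical explanation for the repeated "attempt: 0": if the attempt counter is scoped to the retried call rather than to the retry loop, it is reset to zero on every pass. A sketch of that pattern (not the actual ParallelCursor::_init code):

```cpp
// Purely hypothetical sketch of one way a log can show "attempt: 0" on
// every retry: the counter lives inside the retried function, so each
// re-entry starts from zero.
#include <iostream>

bool tryInitCursor() {
    return false;  // simulate a persistent "stale config" failure
}

void initCursorOnce() {
    int attempt = 0;  // reset on every call, so it never advances
    if (!tryInitCursor()) {
        std::cout << "attempt: " << attempt << std::endl;
    }
}

int main() {
    // The caller retries by calling the function again, so the log
    // prints "attempt: 0" three times even though three attempts ran.
    for (int i = 0; i < 3; ++i) {
        initCursorOnce();
    }
    return 0;
}
```

Whether the real retry path does this would have to be confirmed against the server source.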