[SERVER-5911] Mongos crash on restart (signal 11) and never figures out a replicaset primary/secondary change
Created: 23/May/12  Updated: 11/Jul/16  Resolved: 27/Nov/12

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.0.2, 2.0.5
Fix Version/s: 2.1.2

Type: Bug
Priority: Critical - P2
Reporter: Johnny Boy
Assignee: Mathias Stearn
Resolution: Done
Votes: 0
Labels: crash, mongos
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Debian stable with latest .deb from 10gen


Attachments: Text File mongos.log    
Operating System: Linux
Participants:

 Description   

I have two machines, each running mongos locally, which connect to 4 replica sets in a sharded environment.

One set has a primary, a secondary and an arbiter. Sometimes when I flip the secondary and primary, mongos has trouble figuring out which one is the primary.
The new primary was elected when I used rs.reconfig() to set a new priority on the other member.

This is what the error looks like after switching primary:

Wed May 23 11:14:28 [conn386] DBException in process: could not initialize cursor across all shards because : socket exception @ DuegoB/mongo1:27017,mongo4:27017
Wed May 23 11:14:29 [conn389] ns: xxx.communications could not initialize cursor across all shards because : stale config detected for ns: xxx.communications ParallelCursor::_init @ DuegoB/mongo1:27017,mongo4:27017 attempt: 0

Also see the attached mongos.log for what sometimes happens when I try to restart mongos (since it never seemed to find the new primary).
It freezes up and never restarts.

Another weird entry in that log is:
Wed May 23 11:15:55 [conn458] Socket say send() errno:32 Broken pipe 172.16.49.111:27017
Wed May 23 11:15:55 [conn458] DBException in process: could not initialize cursor across all shards because : socket exception @ Duego2/mongo2:27027,mongo3:27027
We

Here 172.16.49.111 is one of the servers in the replica set that switched primary. But the log also mentions Duego2/mongo2:27027,mongo3:27027, which is not part of this set?
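
The "@ SetName/host:port,host:port" suffix on these log lines names a replica set and its seed hosts, so pulling it out makes it easier to see which set each error refers to. A minimal sketch in Python, assuming the suffix always has the form shown in the excerpts above; parse_shard_suffix is a hypothetical helper, not part of any MongoDB tool.

```python
import re

# Matches the "@ SetName/host:port,host:port" suffix seen in the mongos log excerpts.
# Assumption: set names and host names contain no whitespace, '/' or ','.
SHARD_SUFFIX = re.compile(r"@ (?P<set>[^/\s]+)/(?P<hosts>[^\s,]+(?:,[^\s,]+)*)")


def parse_shard_suffix(line):
    """Return (set_name, [hosts]) for a log line, or None if no suffix is present."""
    match = SHARD_SUFFIX.search(line)
    if match is None:
        return None
    return match.group("set"), match.group("hosts").split(",")


line = (
    "Wed May 23 11:15:55 [conn458] DBException in process: could not initialize "
    "cursor across all shards because : socket exception "
    "@ Duego2/mongo2:27027,mongo3:27027"
)
print(parse_shard_suffix(line))
```

Running this over the whole log and grouping by set name would show whether the Duego2 errors only appear around the time of the primary switch.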



 Comments   
Comment by Mathias Stearn [ 28/Aug/12 ]

The failure on shutdown has been fixed for 2.2. We switched from exit() to _exit().
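
For context on the difference: in C, exit() runs the handlers registered with atexit() (and, in C++, destructors of static objects) before the process ends, while _exit() terminates immediately without running them. The sketch below only illustrates that distinction using Python's analogues, sys.exit() and os._exit(); it is an analogy and not the actual mongos change.

```python
import subprocess
import sys

# Child program: registers a cleanup handler, then exits in the requested way.
CHILD = """
import atexit, os, sys
atexit.register(lambda: print("cleanup handler ran", flush=True))
if sys.argv[1] == "exit":
    sys.exit(0)      # normal shutdown: atexit handlers run
else:
    os._exit(0)      # immediate termination: atexit handlers are skipped
"""


def run_child(mode):
    """Run the child with the given exit mode and return what it printed."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD, mode],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


print("exit :", repr(run_child("exit")))
print("_exit:", repr(run_child("_exit")))
```

Only the "exit" child prints the cleanup message; the "_exit" child ends without running the handler.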

There have also been many fixes for the mongos replica set code. Could you try with 2.2 and see if you still have this issue? Note that you may still get some failed operations during the handoff (or immediately after), but mongos should detect this and send all future requests to the new primary.
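
Since a few operations may still fail during or right after the handoff, one option on the application side is to retry them a bounded number of times. A minimal sketch, assuming a hypothetical TransientFailoverError standing in for whatever exception your driver raises in this situation; retry_on_failover and its parameters are illustrative names, not a driver API.

```python
import time


class TransientFailoverError(Exception):
    """Hypothetical stand-in for a driver error raised during a primary handoff."""


def retry_on_failover(operation, attempts=3, delay_seconds=0.5, sleep=time.sleep):
    """Call operation(); on a transient failover error, wait and try again.

    Re-raises the last error once the attempts are used up.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientFailoverError:
            if attempt == attempts:
                raise
            sleep(delay_seconds * attempt)  # simple linear backoff


# Usage: an operation that fails twice (as if mid-handoff), then succeeds.
calls = {"count": 0}


def flaky_query():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientFailoverError("socket exception")
    return "ok"


print(retry_on_failover(flaky_query, sleep=lambda seconds: None))
```

This is only reasonable for operations that are safe to repeat, such as reads or idempotent writes.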

Comment by Johnny Boy [ 23/May/12 ]

It also lists "attempt: 0" several times in the log, which seems a bit off since it actually makes several attempts.

Generated at Thu Feb 08 03:10:13 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.