[SERVER-16122] PRIMARY down on sharded cluster - downtime Created: 13/Nov/14  Updated: 10/Apr/15  Resolved: 10/Apr/15

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.4.12, 2.6.5
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Przemek Wroblewski
Assignee: Ramon Fernandez Marina
Resolution: Incomplete
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: HTML File logs    
Operating System: ALL
Steps To Reproduce:

I can reproduce this behaviour when running mongos on top of a single replica set (PRIMARY, SECONDARY, ARBITER). I then run a simple Ruby script (https://gist.github.com/lowang/5fc24c6e40b03a613d2b, using the original mongo gem with secondary_preferred) that issues 2 read queries per second against a sharded collection.
Then I simulate a PRIMARY server failure by issuing "halt -n -f" on its virtual machine. After a few seconds the script can no longer get results.
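
The probe described above can be sketched roughly as follows. This is not the linked gist, only a minimal illustration of the idea: `probe` and `longest_outage` are hypothetical helper names, and `query` stands for any callable that performs one read through mongos (for example a find_one on the sharded collection). The clock and sleeper arguments are injectable assumptions so the loop can be exercised without a live cluster.

```ruby
# Hypothetical probe loop: calls `query` at a fixed interval and records
# the start time and outcome (:ok or the exception class name) of each attempt.
def probe(query, interval: 0.5, attempts: 10,
          clock: -> { Time.now }, sleeper: ->(s) { sleep(s) })
  results = []
  attempts.times do
    started = clock.call
    begin
      query.call
      results << [started, :ok]
    rescue StandardError => e
      results << [started, e.class.name]
    end
    sleeper.call(interval)
  end
  results
end

# Longest gap, in seconds, between the first failed attempt of a run of
# failures and the next successful attempt.
def longest_outage(results)
  longest = 0.0
  outage_start = nil
  results.each do |time, status|
    if status == :ok
      longest = [longest, time - outage_start].max if outage_start
      outage_start = nil
    else
      outage_start ||= time
    end
  end
  longest
end
```

With an interval of 0.5 this matches the 2 reads per second used in the reproduction. Note that a read that blocks until a socket timeout stretches the spacing between attempts, so the measured outage is only as precise as the slowest failing call.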

Participants:

 Description   

I get downtime (read queries cannot complete) when the PRIMARY is down in a sharded cluster.
The outage lasts about 20-30s with the original mongo gem.
However, the gem itself is not the problem, as I also can't issue a query during that time when connecting to mongos directly with the mongo client.
Upgrading to mongo 2.6 didn't reduce the downtime at all.



 Comments   
Comment by Ramon Fernandez Marina [ 01/Apr/15 ]

lowang, we haven't heard back from you for a while. Is this still an issue for you? If yes, can you please follow up on Randolph's questions above?

Thanks,
Ramón

Comment by Randolph Tan [ 11/Mar/15 ]

Hi,

What kind of query are you performing? Is it a slave ok read or does it have a read preference other than PRIMARY?

Thanks!

Comment by Przemek Wroblewski [ 26/Jan/15 ]

I've uploaded logs to https://gist.github.com/lowang/f47ff1372728efc356d6
Since I'm running this experiment on a local computer using:

mongo --nodb
cluster = new ShardingTest({shards : 1, rs : {nodes : [{}, {}, {arbiter: true}]} });

all logs were redirected to stdout and merged together.
The "failing instance" was added from a VM by connecting to the existing primary:

mongo Lowang-MacBook-Pro.local:31100
> rs.add({_id: 3, host: "192.168.59.103:27017", priority: 2});

The higher priority made it primary after initialisation.

I then paused the remote instance at 15:12:45; from then on my test script was only getting:
Mongo::OperationTimeout: Timed out waiting on socket read.
At 15:13:16 my test script was able to fetch data successfully again.
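
As a quick check, the two timestamps above can be turned into an outage length. The helper name `seconds_of_day` is made up for this illustration; the only inputs are the two times quoted in this comment.

```ruby
# Convert an "HH:MM:SS" wall-clock string to seconds since midnight.
def seconds_of_day(hms)
  h, m, s = hms.split(":").map { |part| Integer(part, 10) }
  h * 3600 + m * 60 + s
end

paused_at    = seconds_of_day("15:12:45") # remote primary paused
recovered_at = seconds_of_day("15:13:16") # first successful read afterwards

puts recovered_at - paused_at # prints 31
```

That gives roughly 31 seconds of failed reads in this run, close to the 20-30s reported in the description.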

Comment by Ramon Fernandez Marina [ 23/Jan/15 ]

Can you please upload logs from the mongos servers you're seeing this behavior from, as well as mongod logs for all the shards that contain a PRIMARY that's down?

Generated at Thu Feb 08 03:40:02 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.