[SERVER-5405] mongos does not send reads to secondaries after replica restart when using keyFiles
Created: 26/Mar/12  Updated: 11/Jul/16  Resolved: 09/May/12

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.0.3
Fix Version/s: 2.0.6, 2.1.1

Type: Bug
Priority: Major - P3
Reporter: Kristina Chodorow (Inactive)
Assignee: Greg Studer
Resolution: Done
Votes: 3
Labels: buildbot
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File replset_monitor_correctly_routes_slaveok.js    
Issue Links:
Depends
depends on SERVER-5585 "this.awaitOk is not a function" erro... Closed
depends on SERVER-5422 easier way to track which secondary a... Closed
Duplicate
is duplicated by SERVER-5737 Mongorestore to mongos when using aut... Closed
is duplicated by SERVER-5418 High number of commands on replica se... Closed
Related
related to SERVER-5082 Queries with slaveOk() do not reauthe... Closed
related to SERVER-5746 Wrong initialization for ReplicaSetMo... Closed
related to SERVER-5651 Better implementation of Windows serv... Closed
Operating System: ALL
Participants:

 Description   

It looks like, on certain types of errors, secondaries are taken out of circulation and never put back in, even if they are healthy.

See log @ end of https://groups.google.com/forum/?fromgroups#!topic/mongodb-user/KLqbtxLNzUQ



 Comments   
Comment by auto [ 11/May/12 ]

Author: Greg Studer (gregstuder) <greg@10gen.com>

Message: SERVER-5405 make sure we recycle authenticated conn when done

Conflicts:

client/dbclient_rs.cpp
Branch: v2.0
https://github.com/mongodb/mongo/commit/0397af7906e351942b0c4d228c19ee64ad794bde

Comment by auto [ 11/May/12 ]

Author: Randolph Tan <randolph@10gen.com>

Message: SERVER-5405 mongos does not send reads to secondaries

Authenticate connection to replica members with keyFile credentials when calling
replSetGetStatus internally.

Conflicts:

client/dbclient_rs.cpp
Branch: v2.0
https://github.com/mongodb/mongo/commit/2cf91c83a188f74d726836a7cfdfcb6fa95520e5

Comment by Randolph Tan [ 09/May/12 ]

Buildbot failure was caused by test update in SERVER-5746 and is already fixed by this commit:
https://github.com/mongodb/mongo/commit/9e1d746c9c7865c5644ec623b63d4cc6e6249c74

Comment by Ian Whalen (Inactive) [ 09/May/12 ]

I'm reopening this because it appears to have reemerged on master at http://buildbot.mongodb.org/builders/Linux%2064-bit/builds/4421/steps/test_9/logs/stdio - the error logs look identical to previous failures.

Comment by Andy Schwerin [ 02/May/12 ]

I'm not planning another RC for 2.0.5. If I do one, it will be to fix a regression from 2.0.4, only. It can be targeted for 2.0.6, though. Just mark the two bugs as "backport: yes", without a specified 2.0.x target version, and we'll triage it in a few weeks for 2.0.6.

-Andy

Comment by Eric Milkie [ 01/May/12 ]

Windows 32-bit test now fixed.

Comment by auto [ 24/Apr/12 ]

Author: Greg Studer (gregstuder) <greg@10gen.com>

Message: SERVER-5405 make sure we recycle authenticated conn when done
Branch: master
https://github.com/mongodb/mongo/commit/2a68c8581ac6e333bb6ceee11dba0d7fc12414cf

Comment by Eric Milkie [ 23/Apr/12 ]

Also http://buildbot.mongodb.org/builders/Linux%2032-bit%20debug/builds/1627/steps/test_9/logs/stdio

Comment by Randolph Tan [ 20/Apr/12 ]

The test fails on Windows because the Windows build uses the shutdown command (as opposed to the kill method used in Linux builds) when stopMongod is called. Since the shutdown command requires admin auth, it will never succeed in the test setup.

This patch adds an extra parameter that allows passing the admin user and password when calling stopMongod. I initially tried calling shutdown directly in the test instead of using stopMongod (which is called by stopSet), but that was problematic: the test script would not wait for the mongod servers to fully shut down, which can make the test fail sporadically. I also think this infrastructure will make it easier to write auth tests in the future.

This is just a short-term fix. A better stopMongod implementation will be addressed in SERVER-5651.
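
To illustrate the difference between the two stop paths described above, here is a minimal sketch in plain JavaScript. It is a toy model, not the shell's stopMongod: the names `stopServer`, `makeServer`, `useShutdownCommand`, and `credentials`, and the sample user and password, are all hypothetical and chosen only for illustration.

```javascript
// Toy model of the two stop strategies; NOT the real stopMongod helper.
// Every name and credential below is a hypothetical placeholder.
function stopServer(server, { useShutdownCommand, credentials } = {}) {
  if (!useShutdownCommand) {
    // Kill path (Linux builds in the comment above): no auth involved.
    server.running = false;
    return true;
  }
  // Shutdown-command path (Windows builds): requires admin credentials.
  const authed =
    credentials !== undefined &&
    credentials.user === server.admin.user &&
    credentials.pwd === server.admin.pwd;
  if (!authed) return false; // command rejected, server keeps running
  server.running = false;
  return true;
}

function makeServer() {
  return { running: true, admin: { user: "admin", pwd: "secret" } };
}

const killed = makeServer();
console.log(stopServer(killed)); // true

const rejected = makeServer();
console.log(stopServer(rejected, { useShutdownCommand: true })); // false

const accepted = makeServer();
console.log(
  stopServer(accepted, {
    useShutdownCommand: true,
    credentials: { user: "admin", pwd: "secret" },
  })
); // true
```

In this model, only the shutdown-command path can fail for lack of credentials, which matches why the test broke on Windows builds but not on Linux builds.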

Comment by auto [ 20/Apr/12 ]

Author: Randolph Tan <randolph@10gen.com>

Message: Updated test for SERVER-5405 to make it pass Windows builds.
Branch: master
https://github.com/mongodb/mongo/commit/7399b791675c31fd53d9bc759db3197b36c6ae68

Comment by Randolph Tan [ 13/Apr/12 ]

Based on the logs, after the test kills all the members of the replica set and tries to start them up again, the members detect an unclean shutdown and will not start.

Comment by Eric Milkie [ 13/Apr/12 ]

The above commit broke the Windows 32-bit build; slaveok_routing.js is not passing there.

Comment by auto [ 12/Apr/12 ]

Author: Randolph Tan <randolph@10gen.com>

Message: SERVER-5405 mongos does not send reads to secondaries

Authenticate connection to replica members with keyFile credentials when calling
replSetGetStatus internally.
Branch: master
https://github.com/mongodb/mongo/commit/06e99af88d14406e8b85d660625e1f75938f9a09

Comment by Randolph Tan [ 12/Apr/12 ]

Detailed Cause:
This bug only manifests on authenticated setups. Once a node (a replica set member) in ReplicaSetMonitor is marked as bad, it never becomes ok again. ReplicaSetMonitor calls the replSetGetStatus command to refresh whether each node is ok, and this command requires admin authentication. Because ReplicaSetMonitor never authenticates the connection it uses, the command never succeeds.

Fix:
Authenticate the connection used to call replSetGetStatus with the keyFile credentials.
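
The cause and fix above can be sketched with a small, self-contained model in plain JavaScript. This is not the ReplicaSetMonitor code in client/dbclient_rs.cpp: `FakeMember`, `ToyMonitor`, `recoversAfterRestart`, and the `authenticate` option are hypothetical names, and the status call is only a stand-in for replSetGetStatus that rejects unauthenticated connections.

```javascript
// Toy model of the failure mode; NOT the real ReplicaSetMonitor.
// All class, method, and option names here are hypothetical.
class FakeMember {
  constructor(name) {
    this.name = name;
    this.up = true;
  }

  // Stand-in for replSetGetStatus: needs an authenticated connection.
  getStatus(conn) {
    if (!this.up) throw new Error("connection refused");
    if (!conn.authenticated) throw new Error("unauthorized");
    return { ok: 1 };
  }
}

class ToyMonitor {
  constructor(members, { authenticate }) {
    this.members = members;
    this.authenticate = authenticate;
    this.ok = new Map(members.map((m) => [m.name, true]));
  }

  // Called when a read against the node fails.
  markBad(name) {
    this.ok.set(name, false);
  }

  // Refresh: a node counts as ok only if the status call succeeds.
  refresh() {
    for (const m of this.members) {
      const conn = { authenticated: this.authenticate };
      try {
        m.getStatus(conn);
        this.ok.set(m.name, true);
      } catch (e) {
        this.ok.set(m.name, false);
      }
    }
  }

  isOk(name) {
    return this.ok.get(name);
  }
}

function recoversAfterRestart(authenticate) {
  const secondary = new FakeMember("secondary-1");
  const monitor = new ToyMonitor([secondary], { authenticate });
  secondary.up = false;
  monitor.markBad("secondary-1"); // a read failed while the node was down
  secondary.up = true;            // the node restarts and is healthy again
  monitor.refresh();
  return monitor.isOk("secondary-1");
}

console.log(recoversAfterRestart(false)); // false: node stays bad
console.log(recoversAfterRestart(true));  // true: node is restored
```

The only difference between the two runs is whether the refresh connection is authenticated, which is the change the fix describes.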

Comment by Randolph Tan [ 10/Apr/12 ]

Updated the test so that it reproduces the bug. The bug did not manifest in the earlier test because mongod allows access even with auth on when the connection is local and the server has no admin user. The new test script therefore adds an admin user to the replica set shard, which reproduces the behavior seen when connecting remotely.
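
The access rule described in this comment can be written as a small predicate. The sketch below is plain JavaScript and only encodes the rule as stated here; the function name `accessAllowed` and its field names are hypothetical, and it is not taken from the mongod source.

```javascript
// Toy predicate for the access rule described above; names are hypothetical.
function accessAllowed({ authEnabled, isLocalConnection, adminUserExists, authenticated }) {
  if (!authEnabled) return true;   // auth is off: everything is allowed
  if (authenticated) return true;  // caller has already authenticated
  // Auth is on and the caller is unauthenticated: access is allowed only
  // for a local connection to a server that has no admin user.
  return isLocalConnection && !adminUserExists;
}

// Earlier test: local connection, no admin user, so the bug stayed hidden.
console.log(accessAllowed({
  authEnabled: true, isLocalConnection: true,
  adminUserExists: false, authenticated: false,
})); // true

// Updated test: an admin user exists, so unauthenticated calls are refused.
console.log(accessAllowed({
  authEnabled: true, isLocalConnection: true,
  adminUserExists: true, authenticated: false,
})); // false
```

Under this rule, adding an admin user makes a local unauthenticated connection behave like a remote one, which is how the updated test exposes the bug on a single machine.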

Comment by Randolph Tan [ 09/Apr/12 ]

Update: Bug reproduced.

For the bug to manifest, mongos must be running on a different machine from the sharded replica set.

Caused by:

ReplicaSetMonitor never authenticates the connection it uses to call replSetGetStatus (which requires admin privileges) when it tries to refresh the replica connection states.

Comment by Randolph Tan [ 04/Apr/12 ]

Status update: Unable to reproduce.

The attached test was used. To run it on v2.0.3, copy the js files from the shell directory and the cpp files (except bench.cpp) from the scripting directory, then rebuild the mongo shell binary.

Test summary:
1. Create a one-shard cluster with 3 replica set members and auth enabled.
2. Authenticate.
3. Insert a couple of docs.
4. Query using slaveOk.
5. Try killing 2 members and query again using slaveOk.
6. Kill all members.
7. Restart all members.
8. Query again using slaveOk.
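
What step 8 is meant to check can be sketched with a toy routing function in plain JavaScript. This is an assumption-laden model rather than mongos code: `pickReadTarget`, the node names, and the `ok` flag (standing in for the monitor's view of each member) are hypothetical, and elections are not modeled.

```javascript
// Toy routing model; NOT mongos code. Names and node layout are hypothetical.
function pickReadTarget(nodes, slaveOk) {
  const usable = nodes.filter((n) => n.ok);
  if (slaveOk) {
    // Prefer a secondary that the monitor still considers ok.
    const secondary = usable.find((n) => n.role === "secondary");
    if (secondary) return secondary.name;
  }
  const primary = usable.find((n) => n.role === "primary");
  return primary ? primary.name : null;
}

const nodes = [
  { name: "rs0-a", role: "primary", ok: true },
  { name: "rs0-b", role: "secondary", ok: true },
  { name: "rs0-c", role: "secondary", ok: true },
];

console.log(pickReadTarget(nodes, true)); // prints rs0-b

// After the restart with the bug present: the secondaries were marked bad
// and never restored, so slaveOk reads fall back to the primary.
nodes[1].ok = false;
nodes[2].ok = false;
console.log(pickReadTarget(nodes, true)); // prints rs0-a

// Expected state once the monitor can refresh the secondaries again.
nodes[1].ok = true;
nodes[2].ok = true;
console.log(pickReadTarget(nodes, true)); // prints rs0-b
```

In this model the test passes only if the final slaveOk read is served by a secondary rather than by the primary.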

Previous incarnations of test:
1. No auth.
2. Made the collection sharded (according to printShardingStatus, the collection the user was trying to query is not sharded).

Comment by Randolph Tan [ 03/Apr/12 ]

Needs SERVER-5422 for test script.
