[SERVER-6297] Socket Exception code 9001 Created: 04/Jul/12  Updated: 15/Aug/12  Resolved: 10/Jul/12

Status: Closed
Project: Core Server
Component/s: Security, Sharding
Affects Version/s: 2.0.4, 2.0.6
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Jeff lee Assignee: Greg Studer
Resolution: Duplicate Votes: 0
Labels: authentication, configserver, mongos
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

OSX, linux


Attachments: File auth_with_config_down.js     File mongos_local.log.gz    
Issue Links:
Depends
depends on SERVER-6378 auth against config server fails if a... Closed
Operating System: ALL
Participants:

 Description   

Hi.

I have a sharded cluster using authentication. If I stop one of the config servers along with one of my data nodes, I start getting this error when attempting to connect to mongos and run any commands: uncaught exception: error

{ "$err" : "socket exception", "code" : 9001 }

.

The problem appears to be worse in 2.0.6. If I just shut down a single config server in 2.0.6 I immediately start getting socket exception errors.

Looks like this is probably related to SERVER-6178.

Steps to reproduce:

1. Create a sharded authenticated database with the following config

2 shards - 2xdata, 1xarb
3 config dbs
1 mongos

2. Add an admin user

3. Stop one configdb

4. Stop secondary on one shard

5. Wait a few minutes - seems to start after syncluster fails

6. Attempt to connect

I turned up logging in mongos and got the following:

Tue Jul 3 22:04:33 [mongosMain] connection accepted from 127.0.0.1:62779 #32
Tue Jul 3 22:04:33 [conn32] authenticate:

{ authenticate: 1.0, user: "admin", nonce: "ee41c70123ef341", key: "dc29f18bef6f52f3abf09f0f49574f83" }

Tue Jul 3 22:04:33 [conn32] DBClientCursor::init call() failed
Tue Jul 3 22:04:33 [conn32] sharded connection to localhost:50010,localhost:50020,localhost:50030 not being returned to the pool
Tue Jul 3 22:04:33 [conn32] end connection 127.0.0.1:62779
Tue Jul 3 22:04:35 [ReplicaSetMonitorWatcher] trying reconnect to localhost:20020
Tue Jul 3 22:04:35 [ReplicaSetMonitorWatcher] reconnect localhost:20020 failed couldn't connect to server localhost:20020
Tue Jul 3 22:04:35 [LockPinger] SyncClusterConnection connecting to [localhost:50010]
Tue Jul 3 22:04:35 [LockPinger] SyncClusterConnection connecting to [localhost:50020]
Tue Jul 3 22:04:35 [LockPinger] SyncClusterConnection connecting to [localhost:50030]
Tue Jul 3 22:04:35 [LockPinger] SyncClusterConnection connect fail to: localhost:50030 errmsg: couldn't connect to server localhost:50030
Tue Jul 3 22:04:35 [LockPinger] trying reconnect to localhost:50030
Tue Jul 3 22:04:35 [LockPinger] reconnect localhost:50030 failed couldn't connect to server localhost:50030
Tue Jul 3 22:04:35 [LockPinger] warning: distributed lock pinger 'localhost:50010,localhost:50020,localhost:50030/Jeffs-MacBook-Air.local:27017:1341378125:16807' detected an exception while pinging. :: caused by :: socket exception
Tue Jul 3 22:04:38 [mongosMain] connection accepted from 127.0.0.1:62792 #33
Tue Jul 3 22:04:38 [conn33] authenticate:

{ authenticate: 1.0, user: "admin", nonce: "320e3c628074ca23", key: "9e5c499327208a089c7cdab6e16bd5dc" }

Tue Jul 3 22:04:38 [conn33] SyncClusterConnection connecting to [localhost:50010]
Tue Jul 3 22:04:38 [conn33] SyncClusterConnection connecting to [localhost:50020]
Tue Jul 3 22:04:38 [conn33] SyncClusterConnection connecting to [localhost:50030]
Tue Jul 3 22:04:38 [conn33] SyncClusterConnection connect fail to: localhost:50030 errmsg: couldn't connect to server localhost:50030
Tue Jul 3 22:04:38 [conn33] trying reconnect to localhost:50030
Tue Jul 3 22:04:38 [conn33] reconnect localhost:50030 failed couldn't connect to server localhost:50030
Tue Jul 3 22:04:38 [conn33] DBException in process: socket exception
Tue Jul 3 22:04:38 [conn33] end connection 127.0.0.1:62792
Tue Jul 3 22:04:40 [mongosMain] connection accepted from 127.0.0.1:62800 #34
Tue Jul 3 22:04:40 [conn34] authenticate:

{ authenticate: 1.0, user: "admin", nonce: "6d642df666f58f77", key: "058a8a22d7a7d96b486bdce134e53026" }

Tue Jul 3 22:04:40 [conn34] SyncClusterConnection connecting to [localhost:50010]
Tue Jul 3 22:04:40 [conn34] SyncClusterConnection connecting to [localhost:50020]
Tue Jul 3 22:04:40 [conn34] SyncClusterConnection connecting to [localhost:50030]
Tue Jul 3 22:04:40 [conn34] SyncClusterConnection connect fail to: localhost:50030 errmsg: couldn't connect to server localhost:50030
Tue Jul 3 22:04:40 [conn34] trying reconnect to localhost:50030
Tue Jul 3 22:04:40 [conn34] reconnect localhost:50030 failed couldn't connect to server localhost:50030
Tue Jul 3 22:04:40 [conn34] DBException in process: socket exception
Tue Jul 4 22:04:40 [conn34] end connection 127.0.0.1:62800



 Comments   
Comment by Greg Studer [ 10/Jul/12 ]

Resolved as duplicate for better triaging.

Comment by Greg Studer [ 09/Jul/12 ]

Thanks for the log - think I see the issue. Sort-of related to time, think this happens when the connection pool to the config server becomes empty after successive errors, and needs to be re-authed.

Comment by Jeff lee [ 06/Jul/12 ]

Narrowed it down a bit. Seems like things go south as soon as I get this in the mongos log:

Fri Jul 6 14:45:49 [LockPinger] warning: distributed lock pinger 'localhost:50010,localhost:50020,localhost:50030/Jeffs-MacBook-Air.local:27017:1341611119:16807' detected an exception while pinging. :: caused by :: SyncClusterConnection::udpate prepare failed: 10276 DBClientBase::findN: transport error: localhost:50030 query:

{ fsync: 1 }

localhost:50030:{}

$ mongo localhost/admin -u admin -p admin
MongoDB shell version: 2.0.6
connecting to: localhost/admin
Fri Jul 6 14:45:50 uncaught exception: error

{ "$err" : "socket exception", "code" : 9001 }

exception: login failed

Comment by Jeff lee [ 06/Jul/12 ]

Sure thing - here ya go.

I restarted mongos after creating the admin user. Here's the log after starting it back up with -vvvvv. You may need to wait a bit for the socket exception to occur it seems to be after the synccluster fails.

It's at line 337 here.

Comment by Greg Studer [ 06/Jul/12 ]

I've tried to reproduce locally on my system with the script above, but shutting down a config server doesn't seem to be triggering a problem (this is in 2.0.6, where you said the failure was immediate after terminating the config server).

Would it be possible to post the same log snippet above at logLevel 5 - basically start mongos with -vvvvv?

Comment by Jeff lee [ 05/Jul/12 ]

Hi Greg,

I actually found this while trying to replicate some other errors we were having in one our staging environments following the Amazon explosion this weekend. We lost 1 config server and a data node and starting seeing some strange auth issues after I recovered the db but not the config server.

I'm not running anything following the login under 2.0.6 - I can't connect at all after I kill the config server.

In 2.0.4, if I shut down the config server I get errors trying to do a show dbs after shutting down the configserver but can still use the db and do a find. If I kill the db ( with the configserver down ) I start gettting could not initialize cursor across all shards errors.

Here's how I'm setting up the test cluster:

mongod --dbpath 1a --nojournal --smallfiles --oplogSize 1 --replSet shard01 --noprealloc --port 10010 --keyFile keyfile
mongod --dbpath 1b --nojournal --smallfiles --oplogSize 1 --replSet shard01 --noprealloc --port 10020 --keyFile keyfile
mongod --dbpath 1arb --nojournal --smallfiles --oplogSize 1 --replSet shard01 --noprealloc --port 10030 --keyFile keyfile

mongod --dbpath 2a --nojournal --smallfiles --oplogSize 1 --replSet shard02 --noprealloc --port 20010 --keyFile keyfile
mongod --dbpath 2b --nojournal --smallfiles --oplogSize 1 --replSet shard02 --noprealloc --port 20020 --keyFile keyfile
mongod --dbpath 2arb --nojournal --smallfiles --oplogSize 1 --replSet shard02 --noprealloc --port 20030 --keyFile keyfile

mongod --dbpath config01 --nojournal --smallfiles --noprealloc --configsvr --port 50010 --keyFile keyfile
mongod --dbpath config02 --nojournal --smallfiles --noprealloc --configsvr --port 50020 --keyFile keyfile
mongod --dbpath config03 --nojournal --smallfiles --noprealloc --configsvr --port 50030 --keyFile keyfile

mongos --configdb localhost:50010,localhost:50020,localhost:50030 --chunkSize 1 --keyFile keyfile

mongo localhost:10010/admin --eval "rs.initiate({ _id:'shard01', members:[{_id:0, host:'localhost:10010'}, {_id:1, host:'localhost:10020'}, {_id:2, host:'localhost:10030', arbiterOnly:true}]})"

mongo localhost:20010/admin --eval "rs.initiate({ _id:'shard02', members:[{_id:0, host:'localhost:20010'}, {_id:1, host:'localhost:20020'}, {_id:2, host:'localhost:20030', arbiterOnly:true}]})"

mongo localhost/admin --eval "db.runCommand(

{addShard:'shard01/localhost:10010,localhost:10020,localhost:10030'}

)"
mongo localhost/admin --eval "db.runCommand(

{addShard:'shard02/localhost:20010,localhost:20020,localhost:20030'}

)"
mongo localhost/admin --eval "db.runCommand(

{enableSharding:'test'}

)"

mongo localhost/test --eval "for ( var i=1; i<=5000; i++ ){ db.foo.save({_id:i, name:Array(1000).join('a'), ts: new Date() })}"

mongo localhost/admin --eval "db.runCommand({shardCollection:'test.foo', key:{_id:1}})"

mongo localhost/admin --eval "db.addUser('admin','admin')"

Comment by Greg Studer [ 05/Jul/12 ]

Thanks for posting the error and logs here - are you able to post a fuller log, starting ~1hr before you shut down the config server until you restart it again? We'll try to reproduce on our side as well.

Don't think this is related to SERVER-6178 - that issue only occurs if a shard is up and reachable but writing to disk fails - but it should be possible to authenticate with a config server down.

What command are you running immediately after authentication? It's not 100% clear from the log, but it could be that auth is succeeding but the next command needs to write to the config server, which should (correctly) fail.

Generated at Thu Feb 08 03:11:17 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.