[SERVER-6297] Socket Exception code 9001 Created: 04/Jul/12 Updated: 15/Aug/12 Resolved: 10/Jul/12 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Security, Sharding |
| Affects Version/s: | 2.0.4, 2.0.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Jeff lee | Assignee: | Greg Studer |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | authentication, configserver, mongos |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | OSX, linux |
| Attachments: | |
| Issue Links: | |
| Operating System: | ALL |
| Participants: | |
| Description |
|
Hi. I have a sharded cluster using authentication. If I stop one of the config servers along with one of my data nodes, I start getting this error when attempting to connect to mongos and run any commands: uncaught exception: error { "$err" : "socket exception", "code" : 9001 }. The problem appears to be worse in 2.0.6: if I just shut down a single config server in 2.0.6, I immediately start getting socket exception errors. Looks like this is probably related to

Steps to reproduce:
1. Create a sharded, authenticated database with the following config: 2 shards, each 2x data + 1x arbiter
2. Add an admin user
3. Stop one configdb
4. Stop the secondary on one shard
5. Wait a few minutes - it seems to start after the syncCluster connection fails
6. Attempt to connect

I turned up logging in mongos and got the following:

Tue Jul 3 22:04:33 [mongosMain] connection accepted from 127.0.0.1:62779 #32
Tue Jul 3 22:04:33 [conn32] DBClientCursor::init call() failed
Tue Jul 3 22:04:38 [conn33] SyncClusterConnection connecting to [localhost:50010]
Tue Jul 3 22:04:40 [conn34] SyncClusterConnection connecting to [localhost:50010]
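(A minimal sketch of the failing connection attempt in step 6, assuming the admin/admin user and the default mongos port from the setup script posted in the comments; any command after login appears to trigger it, show dbs is just an example:)

$ mongo localhost/admin -u admin -p admin
> show dbs
uncaught exception: error { "$err" : "socket exception", "code" : 9001 }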
| Comments |
| Comment by Greg Studer [ 10/Jul/12 ] |
|
Resolved as duplicate for better triaging. |
| Comment by Greg Studer [ 09/Jul/12 ] |
|
Thanks for the log - I think I see the issue. It's sort of related to timing: I think this happens when the connection pool to the config server becomes empty after successive errors and then needs to be re-authed. |
| Comment by Jeff lee [ 06/Jul/12 ] |
|
Narrowed it down a bit. Seems like things go south as soon as I get this in the mongos log:

Fri Jul 6 14:45:49 [LockPinger] warning: distributed lock pinger 'localhost:50010,localhost:50020,localhost:50030/Jeffs-MacBook-Air.local:27017:1341611119:16807' detected an exception while pinging. :: caused by :: SyncClusterConnection::udpate prepare failed: 10276 DBClientBase::findN: transport error: localhost:50030 query: { fsync: 1 }localhost:50030:{}

$ mongo localhost/admin -u admin -p admin
exception: login failed
| Comment by Jeff lee [ 06/Jul/12 ] |
|
Sure thing - here ya go. I restarted mongos after creating the admin user. Here's the log after starting it back up with -vvvvv. You may need to wait a bit for the socket exception to occur; it seems to happen after the syncCluster connection fails. It's at line 337 here. |
| Comment by Greg Studer [ 06/Jul/12 ] |
|
I've tried to reproduce locally on my system with the script above, but shutting down a config server doesn't seem to be triggering a problem (this is in 2.0.6, where you said the failure was immediate after terminating the config server). Would it be possible to post the same log snippet above at logLevel 5 - basically start mongos with -vvvvv? |
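(A sketch of one way to capture that, assuming the configdb string and keyfile from the setup script below; the log file name is arbitrary:)

$ mongos --configdb localhost:50010,localhost:50020,localhost:50030 --chunkSize 1 --keyFile keyfile -vvvvv > mongos-vvvvv.log 2>&1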
| Comment by Jeff lee [ 05/Jul/12 ] |
|
Hi Greg, I actually found this while trying to replicate some other errors we were having in one of our staging environments following the Amazon explosion this weekend. We lost 1 config server and a data node, and we started seeing some strange auth issues after I recovered the db but not the config server.

I'm not running anything following the login under 2.0.6 - I can't connect at all after I kill the config server. In 2.0.4, if I shut down the config server I get errors trying to do a show dbs, but I can still use the db and do a find. If I then kill the db (with the config server still down) I start getting "could not initialize cursor across all shards" errors.

Here's how I'm setting up the test cluster:

mongod --dbpath 1a --nojournal --smallfiles --oplogSize 1 --replSet shard01 --noprealloc --port 10010 --keyFile keyfile
mongod --dbpath 2a --nojournal --smallfiles --oplogSize 1 --replSet shard02 --noprealloc --port 20010 --keyFile keyfile
mongod --dbpath config01 --nojournal --smallfiles --noprealloc --configsvr --port 50010 --keyFile keyfile
mongos --configdb localhost:50010,localhost:50020,localhost:50030 --chunkSize 1 --keyFile keyfile
mongo localhost:10010/admin --eval "rs.initiate({ _id:'shard01', members:[{_id:0, host:'localhost:10010'}, {_id:1, host:'localhost:10020'}, {_id:2, host:'localhost:10030', arbiterOnly:true}]})"
mongo localhost:20010/admin --eval "rs.initiate({ _id:'shard02', members:[{_id:0, host:'localhost:20010'}, {_id:1, host:'localhost:20020'}, {_id:2, host:'localhost:20030', arbiterOnly:true}]})"
mongo localhost/admin --eval "db.runCommand({addShard:'shard01/localhost:10010,localhost:10020,localhost:10030'})"
mongo localhost/test --eval "for ( var i=1; i<=5000; i++ ){ db.foo.save({_id:i, name:Array(1000).join('a'), ts: new Date() })}"
mongo localhost/admin --eval "db.runCommand({shardCollection:'test.foo', key:{_id:1}})"
mongo localhost/admin --eval "db.addUser('admin','admin')"
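(For reference, a hypothetical way to carry out steps 3-4 of the repro against processes started by the script above; the pkill patterns are assumptions based on the command lines shown, and any clean shutdown of those two processes should be equivalent:)

# stop one config server (config01, started on port 50010 above)
pkill -f 'port 50010'
# stop the secondary on shard01 (localhost:10020 per the rs.initiate above)
pkill -f 'port 10020'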
| Comment by Greg Studer [ 05/Jul/12 ] |
|
Thanks for posting the error and logs here - are you able to post a fuller log, starting ~1hr before you shut down the config server until you restart it again? We'll try to reproduce on our side as well. Don't think this is related to

What command are you running immediately after authentication? It's not 100% clear from the log, but it could be that auth is succeeding but the next command needs to write to the config server, which should (correctly) fail.
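(A hypothetical post-auth shell session illustrating the distinction being asked about, based on the 2.0.4 behavior described in the comments above; the collection name comes from the setup script:)

> use test
> db.foo.find({_id: 1})   // routed to a shard - reported to still work in 2.0.4
> show dbs                // needs the config servers - reported to fail once one is down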