[SERVER-43211] mongos claims it is accepting connections but does not Created: 06/Sep/19  Updated: 27/Oct/23  Resolved: 09/Sep/19

Status: Closed
Project: Core Server
Component/s: Networking
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor - P4
Reporter: Oleg Pudeyev (Inactive) Assignee: Mira Carey
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Service Arch 2019-09-09, Service Arch 2019-09-23
Participants:

 Description   

I started a 4.3 sharded deployment with the args specified here (https://github.com/p-mongo/dev/blob/master/script/launch-4.4-sharded-multishard):

port=14440
launchargs="--replicaset --nodes 2 --sharded 2 --name ruby-driver-rs --mongos 2"

The server is:

butler% /usr/local/m/versions/4.4/mongos --version
mongos version v4.3.0-574-g6e02a4d
git version: 6e02a4d34bd972e6755bb5f71a5b26f69fe2cfb0
OpenSSL version: OpenSSL 1.1.1c  28 May 2019
allocator: tcmalloc
modules: none
build environment:
    distarch: x86_64
    target_arch: x86_64

mlaunch produced this output:

butler% ./script/launch-4.4-sharded-multishard
Base port: 14440
launching: config server on port 14446
launching: "/usr/local/m/versions/4.4/mongod" on port 14442
launching: "/usr/local/m/versions/4.4/mongod" on port 14443
launching: "/usr/local/m/versions/4.4/mongod" on port 14444
launching: "/usr/local/m/versions/4.4/mongod" on port 14445
launching: /usr/local/m/versions/4.4/mongos on port 14440
launching: /usr/local/m/versions/4.4/mongos on port 14441

The log from mongos is here: https://gist.github.com/p-mongo/a206f10247c39eaa92c63e1c0c977f72

Note it contains the following line:

2019-09-06T16:08:09.263-0400 I  NETWORK  [mongosMain] Listening on 127.0.0.1
2019-09-06T16:08:09.263-0400 I NETWORK [mongosMain] waiting for connections on port 14440

However, connection to 14440 fails:

butler% mongo --port 14440
MongoDB shell version v3.6.9
connecting to: mongodb://127.0.0.1:14440/
2019-09-06T16:10:39.341-0400 W NETWORK  [thread1] Failed to connect to 127.0.0.1:14440 after 5000ms milliseconds, giving up.
2019-09-06T16:10:39.359-0400 E QUERY    [thread1] Error: couldn't connect to server 127.0.0.1:14440, connection attempt failed :
connect@src/mongo/shell/mongo.js:257:13
@(connect):1:6
exception: connect failed

Also note that there are no errors indicated in the server log. There is one warning which is this:

2019-09-06T16:08:07.259-0400 W SHARDING [replSetDistLockPinger] pinging failed for distributed lock pinger :: caused by :: LockStateChangeFailed: findAndModify query predicate didn't match any lock document

I expect that if mongos claims that it accepts connections, that it actually accepts connections, and writes error level messages to the log if a connection attempt fails. If mongos is not accepting connections I expect it to indicate what it is doing so that I can track its progress toward being in a usable state.

When my deployment gets in the state described in this ticket, it appears to be stuck in this state and killing all processes and restarting them does not seem to unstick it. I need to nuke the data directories for all mongos+mongod nodes and rebuild the entire deployment from scratch.



 Comments   
Comment by Oleg Pudeyev (Inactive) [ 09/Sep/19 ]

My bad, it was a firewall rule I forgot about. Sorry for wasting so much time.

Comment by Oleg Pudeyev (Inactive) [ 06/Sep/19 ]

mongos appears to be listening:

butler% netstat -an|grep 14440
tcp        0      0 127.0.0.1:14440         0.0.0.0:*               LISTEN     
unix  2      [ ACC ]     STREAM     LISTENING     4779675  /tmp/mongodb-14440.sock

4.4 shell behaves the same:

butler% /usr/local/m/versions/4.4/mongo --port 14440
MongoDB shell version v4.3.0-574-g6e02a4d
connecting to: mongodb://127.0.0.1:14440/?compressors=disabled&gssapiServiceName=mongodb
2019-09-06T18:00:36.742-0400 E  QUERY    [js] Error: couldn't connect to server 127.0.0.1:14440, connection attempt failed: SocketException: Error connecting to 127.0.0.1:14440 :: caused by :: Connection timed out :
connect@src/mongo/shell/mongo.js:342:17
@(connect):2:6
2019-09-06T18:00:36.751-0400 F  -        [main] exception: connect failed
2019-09-06T18:00:36.751-0400 E  -        [main] exiting with code 1

Connecting to socket works:

butler% /usr/local/m/versions/4.4/mongo mongodb://%2ftmp%2fmongodb-14440.sock
MongoDB shell version v4.3.0-574-g6e02a4d
connecting to: mongodb://%2Ftmp%2Fmongodb-14440.sock/?compressors=disabled&gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("74de849a-1675-4463-945e-65d7898f3245") }
MongoDB server version: 4.3.0-574-g6e02a4d
Server has startup warnings: 
2019-09-06T16:08:07.086-0400 I  CONTROL  [main] 
2019-09-06T16:08:07.086-0400 I  CONTROL  [main] ** NOTE: This is a development version (4.3.0-574-g6e02a4d) of MongoDB.
2019-09-06T16:08:07.086-0400 I  CONTROL  [main] **       Not recommended for production.
2019-09-06T16:08:07.086-0400 I  CONTROL  [main] 
2019-09-06T16:08:07.086-0400 I  CONTROL  [main] ** WARNING: Access control is not enabled for the database.
2019-09-06T16:08:07.086-0400 I  CONTROL  [main] **          Read and write access to data and configuration is unrestricted.
2019-09-06T16:08:07.086-0400 I  CONTROL  [main] 
2019-09-06T16:08:07.086-0400 I  CONTROL  [main] ** WARNING: This server is bound to localhost.
2019-09-06T16:08:07.086-0400 I  CONTROL  [main] **          Remote systems will be unable to connect to this server. 
2019-09-06T16:08:07.086-0400 I  CONTROL  [main] **          Start the server with --bind_ip <address> to specify which IP 
2019-09-06T16:08:07.086-0400 I  CONTROL  [main] **          addresses it should serve responses from, or with --bind_ip_all to
2019-09-06T16:08:07.086-0400 I  CONTROL  [main] **          bind to all interfaces. If this behavior is desired, start the
2019-09-06T16:08:07.086-0400 I  CONTROL  [main] **          server with --bind_ip 127.0.0.1 to disable this warning.
2019-09-06T16:08:07.086-0400 I  CONTROL  [main] 
mongos> 

--bind_ip_all makes no difference:

butler% netstat -an|grep 14440
tcp        0      0 0.0.0.0:14440           0.0.0.0:*               LISTEN     
unix  2      [ ACC ]     STREAM     LISTENING     4868033  /tmp/mongodb-14440.sock
 
 
butler% /usr/local/m/versions/4.4/mongo --port 14440
MongoDB shell version v4.3.0-574-g6e02a4d
connecting to: mongodb://127.0.0.1:14440/?compressors=disabled&gssapiServiceName=mongodb
2019-09-06T18:03:51.619-0400 E  QUERY    [js] Error: couldn't connect to server 127.0.0.1:14440, connection attempt failed: SocketException: Error connecting to 127.0.0.1:14440 :: caused by :: Connection timed out :
connect@src/mongo/shell/mongo.js:342:17
@(connect):2:6
2019-09-06T18:03:51.624-0400 F  -        [main] exception: connect failed
2019-09-06T18:03:51.624-0400 E  -        [main] exiting with code 1

Nothing else is listening:

butler% ps awwxu|grep 14440   
m        22650  0.0  0.3 247284 28336 ?        Sl   18:03   0:00 /usr/local/m/versions/4.4/mongos --logpath /mnt/zram/mongodb/4.4-scm/mongos/mongos_14440.log --port 14440 --configdb configRepl/localhost:14446 --setParameter diagnosticDataCollectionEnabled=false --setParameter enableTestCommands=1 --filePermissions 0666 --fork --bind_ip_all
m        22802  0.0  0.0   6076   828 pts/14   S+   18:09   0:00 grep 14440
butler% kill 22650
butler% netstat -an|grep 14440
butler% 

I am working on the logs.

Comment by Mira Carey [ 06/Sep/19 ]

I've one observation, followed by a couple of questions:

First the observation:

I expect that if mongos claims that it accepts connections, that it actually accepts connections, and writes error level messages to the log if a connection attempt fails. If mongos is not accepting connections I expect it to indicate what it is doing so that I can track its progress toward being in a usable state.

The log line "Listening on ..." proceeds directly after a call to ::listen() on each bound interface socket. Then the line "waiting for connections on port ..." comes after spawning the background thread which will start actively calling ::accept(). See transport_layer_asioc.cpp

Those two lines are about as close to ready as we can get. The only bit that's missing would be another lined logged from the accepting socket background threads.

Obviously we still have something wrong there, but it's unlikely to be the mongos' process failing to have called ::bind() / ::listen() / ::accept()

Then the questions:

  1. Does this problem reproduce for you with a same version shell? I'm noticing that you're running a 4.3 cluster at some git commit, but a 3.6.9 shell. Just wondering if you have similar problems with a newer one
  2. Can you check to see if your mongos processes are actually listening on the relevant ports? Something like "lsof -i -P -n | grep LISTEN" would do the trick. It'd also be interesting to know if something was listening on those ports while the processes were not active (I.e. after being killed, before being restarted)
  3. Can you connect to your mongos when --bind_ip_all has been passed?
  4. Can you connect to your mongos over the unix domain socket? Something like: 'mongo "mongodb://%2Ftmp%2Fmongodb-14440.sock"'
  5. You mention "when my deployments gets in the state described", implying that it doesn't immediately become so. And that the only fix is to destroy all data directories and start over. When in this state, can the nodes talk to each other? I.e. do you see log messages on the config servers indicating connections from the mongos'? Or do you see a battery of failure to connects from the server processes as well?
Generated at Thu Feb 08 05:02:34 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.