[SERVER-25629] Accidental "host:port/replicaSetName"-format mongos --configdb arg should be rejected asap.
Created: 16/Aug/16  Updated: 25/Jan/17  Resolved: 30/Sep/16

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.3.11
Fix Version/s: 3.4.0-rc0

Type: Improvement
Priority: Trivial - P5
Reporter: Akira Kurogane
Assignee: Andy Schwerin
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Sprint: Sharding 2016-08-29, Sharding 2016-09-19, Sharding 2016-10-10

 Description   

When I started a mongos for a 3.3.11 test cluster, I ran the following command:

akira:~$ /usr/local/bin/mongodb-linux-x86_64-3.3.11/bin/mongos --fork --logpath data/mongos.log --configdb akira-macbookpro:27019 --port 30000
BadValue: configdb supports only replica set connection string
try '/usr/local/bin/mongodb-linux-x86_64-3.3.11/bin/mongos --help' for more information

So the replica set-style config server, now compulsory, has come into play, and I can no longer use a plain host list. That is fine.

It's also good that the error message points to --help, which shows the correct syntax:

Sharding options:
  --configdb arg                   Connection string for communicating with 
                                   config servers:
                                   <config replset name>/<host1:port>,<host2:po
                                   rt>,[...]

... but I did it backwards! That is, I wrote "--configdb <host>[,<host>]*/<replsetName>" instead of the correct "--configdb <replsetName>/<host>[,<host>]*", probably going from my memory of MongoDB connection URIs.
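To see why the reversed string is accepted at all, here is a minimal Python sketch of splitting a config string at the first "/" into a replica set name and a seed list. It is a hypothetical illustration, not mongos's actual parser: the name parse_configdb is made up, and appending a default port of 27019 to hosts without one is an assumption chosen to match the "akira-macbookpro:27019/cfgrs:27019" line in the log below.

```python
from typing import List, Tuple


def parse_configdb(arg: str, default_port: int = 27019) -> Tuple[str, List[str]]:
    """Split '<setName>/<host[:port]>[,<host[:port]>]*' at the first '/'.

    Hypothetical sketch, not mongos's parser: hosts without an explicit
    port get default_port appended.
    """
    set_name, sep, seeds = arg.partition("/")
    if not sep or not set_name or not seeds:
        raise ValueError("expected <setName>/<host1:port>,<host2:port>,...")
    hosts = [s if ":" in s else f"{s}:{default_port}" for s in seeds.split(",")]
    return set_name, hosts
```

With the correct order, parse_configdb("cfgrs/akira-macbookpro:27019") yields the set name "cfgrs" and the seed "akira-macbookpro:27019". With the reversed order, the same split happily yields the set name "akira-macbookpro:27019" and the seed "cfgrs:27019": nothing in a purely syntactic split can tell a host:port from a set name.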

When I do this, mongos starts, implicitly suggesting that I got the --configdb argument format right.

akira:~$ mongos --fork --logpath mongos2.log --configdb akira-macbookpro:27019/cfgrs --port 30000
2016-08-16T14:25:19.460+1000 W SHARDING [main] Running a sharded cluster with fewer than 3 config servers should only be done for testing purposes and is not recommended for production.
about to fork child process, waiting until server is ready for connections.
forked process: 4539
<hangs there>

But it hangs without completing the fork. Meanwhile, the log file reports the problem like this:

2016-08-16T14:25:19.464+1000 I SHARDING [mongosMain] mongos version v3.3.11-30-gc96009e
2016-08-16T14:25:19.464+1000 I CONTROL  [mongosMain] git version: c96009ecd439bbd960ae1c01d6379e64ecdb5eeb
2016-08-16T14:25:19.464+1000 I CONTROL  [mongosMain] allocator: tcmalloc
2016-08-16T14:25:19.464+1000 I CONTROL  [mongosMain] modules: none
2016-08-16T14:25:19.464+1000 I CONTROL  [mongosMain] build environment:
2016-08-16T14:25:19.464+1000 I CONTROL  [mongosMain]     distarch: x86_64
2016-08-16T14:25:19.464+1000 I CONTROL  [mongosMain]     target_arch: x86_64
2016-08-16T14:25:19.464+1000 I CONTROL  [mongosMain] db version v3.3.11-30-gc96009e
2016-08-16T14:25:19.464+1000 I CONTROL  [mongosMain] git version: c96009ecd439bbd960ae1c01d6379e64ecdb5eeb
2016-08-16T14:25:19.464+1000 I CONTROL  [mongosMain] allocator: tcmalloc
2016-08-16T14:25:19.464+1000 I CONTROL  [mongosMain] modules: none
2016-08-16T14:25:19.464+1000 I CONTROL  [mongosMain] build environment:
2016-08-16T14:25:19.464+1000 I CONTROL  [mongosMain]     distarch: x86_64
2016-08-16T14:25:19.464+1000 I CONTROL  [mongosMain]     target_arch: x86_64
2016-08-16T14:25:19.464+1000 I CONTROL  [mongosMain] options: { net: { port: 30000 }, processManagement: { fork: true }, sharding: { configDB: "akira-macbookpro:27019/cfgrs" }, systemLog: { destination: "file", path: "/tmp/mongos2.log" } }
2016-08-16T14:25:19.485+1000 I NETWORK  [mongosMain] Starting new replica set monitor for akira-macbookpro:27019/cfgrs:27019
2016-08-16T14:25:19.485+1000 I SHARDING [thread1] creating distributed lock ping thread for process akira-macbookpro:30000:1471321519:7538175379671250174 (sleeping for 30000ms)
2016-08-16T14:25:19.490+1000 I NETWORK  [ReplicaSetMonitor-TaskExecutor-0] getaddrinfo("cfgrs") failed: Name or service not known
2016-08-16T14:25:19.490+1000 W NETWORK  [ReplicaSetMonitor-TaskExecutor-0] No primary detected for set akira-macbookpro:27019
2016-08-16T14:25:19.490+1000 I NETWORK  [ReplicaSetMonitor-TaskExecutor-0] All nodes for set akira-macbookpro:27019 are down. This has happened for 1 checks in a row.
2016-08-16T14:25:19.994+1000 I NETWORK  [replSetDistLockPinger] getaddrinfo("cfgrs") failed: Name or service not known
2016-08-16T14:25:19.994+1000 W NETWORK  [replSetDistLockPinger] No primary detected for set akira-macbookpro:27019
2016-08-16T14:25:19.994+1000 I NETWORK  [replSetDistLockPinger] All nodes for set akira-macbookpro:27019 are down. This has happened for 2 checks in a row.
2016-08-16T14:25:20.498+1000 I NETWORK  [mongosMain] getaddrinfo("cfgrs") failed: Name or service not known
2016-08-16T14:25:20.498+1000 W NETWORK  [mongosMain] No primary detected for set akira-macbookpro:27019
2016-08-16T14:25:20.498+1000 I NETWORK  [mongosMain] All nodes for set akira-macbookpro:27019 are down. This has happened for 3 checks in a row.
2016-08-16T14:25:21.002+1000 I NETWORK  [mongosMain] getaddrinfo("cfgrs") failed: Name or service not known
2016-08-16T14:25:21.002+1000 W NETWORK  [mongosMain] No primary detected for set akira-macbookpro:27019

The line "No primary detected for set akira-macbookpro:27019" was the first to catch my eye, and I took it to mean something like the config server replica set not having been initialized, or having lost a majority, etc.

It didn't occur to me for quite a while that the "<host>:<port>" string had been misinterpreted as a replica set name.

So I request some sanity checking of the --configdb argument.



 Comments   
Comment by Githook User [ 30/Sep/16 ]

Author:

Andy Schwerin (andy10gen) <schwerin@mongodb.com>

Message: SERVER-25629 Log a warning if it looks like the operator reversed the two halves of the CSRS config string passed to --configdb.
Branch: master
https://github.com/mongodb/mongo/commit/67c441cf094920674fc76668d5b0d225503a9c8a

Comment by Akira Kurogane [ 10/Sep/16 ]

Hi schwerin. Sorry for the delay. I made a mistake with my task management; all blame on me for that one.

I have to admit that a warning message, rather than an exit, is sufficient, even though I was originally keen on an exit. The warning will be especially noticeable because nothing else is being printed on the terminal at that point.

Thanks for adding this usability patch.

Comment by Andy Schwerin [ 07/Sep/16 ]

akira.kurogane, I'm uncomfortable actually making mongos exit as in your proposal, but we could definitely add a log message. What do you think? These are the first several lines of the log:

$  ./mongos --configdb localhost/frimrs
2016-09-07T10:57:06.413-0400 W SHARDING [main] The replica set name "localhost" resolves as a host name, but none of the servers in the seed list do. Did you reverse the replica set name and the seed list in localhost/frimrs:27017?
2016-09-07T10:57:06.413-0400 W SHARDING [main] Running a sharded cluster with fewer than 3 config servers should only be done for testing purposes and is not recommended for production.
...

Comment by Akira Kurogane [ 17/Aug/16 ]

Here's an idea in pseudo-code.

With "--configdb <firststring>/<secondstring>":
 
if ( getaddrinfo(secondstring) != OK ) {
   //'Can't find an IP for host(s) <secondstring>'
   if (getaddrinfo(firststring) == OK) {
     //'Looks like you got a hostname in the replicaSetName position'
     //Exit ... Yes I really mean exit. The probability they temporarily had a valid host unresolvable by DNS, but have 
     //  another host up that coincidentally had the same name as the replica set, is not impossibly small. 
     //  But it is way less likely than accidental argument switchover.
   }
}
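The pseudo-code above can be turned into a runnable check. The following Python version is a sketch under stated assumptions, not mongos code: the names looks_reversed, resolves, and strip_port are hypothetical; ports are stripped before resolving, because getaddrinfo() expects a bare host name; and a list of hosts counts as resolvable if any one of them resolves. The resolver is a parameter so the logic can be exercised without real DNS, and the function returns a message instead of exiting, leaving the exit-versus-warn decision to the caller.

```python
import socket
from typing import Callable, Optional


def resolves(host: str) -> bool:
    """Return True if getaddrinfo() finds at least one address for host."""
    try:
        return len(socket.getaddrinfo(host, None)) > 0
    except (OSError, UnicodeError):  # socket.gaierror is a subclass of OSError
        return False


def strip_port(seed: str) -> str:
    """Turn 'host:port' into 'host' (IPv6 literals are not handled here)."""
    return seed.rsplit(":", 1)[0]


def looks_reversed(arg: str,
                   resolver: Callable[[str], bool] = resolves) -> Optional[str]:
    """Hypothetical check for '<seed list>/<set name>' passed to --configdb."""
    first, sep, second = arg.partition("/")
    if not sep:
        return None
    seed_hosts = [strip_port(s) for s in second.split(",") if s]
    # Nothing suspicious if any host in the seed list resolves.
    if any(resolver(h) for h in seed_hosts):
        return None
    # Suspicious: the seed list does not resolve, but the "set name" does.
    name_hosts = [strip_port(s) for s in first.split(",") if s]
    if any(resolver(h) for h in name_hosts):
        return (f'The replica set name "{first}" resolves as a host name, but no '
                f"seed host does. Did you mean --configdb {second}/{first}?")
    return None
```

For example, with a stub resolver that only knows "akira-macbookpro", the reversed string "akira-macbookpro:27019/cfgrs" produces a message suggesting "cfgrs/akira-macbookpro:27019", while the correctly ordered string produces None. A caller following the original proposal would print the message and exit non-zero; one following the committed change would only log it.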

Additional (not alternative) idea: if the replicaSetName string contains ":" or "," characters, print a warning to stderr of the launching shell (only before the fork, obviously).
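The character-based idea is even simpler, since it needs no DNS lookup. Below is a minimal Python sketch, assuming the warning is computed from the raw --configdb value in the launching process before any fork; the function name set_name_char_warning and the message text are hypothetical. Because such names are legal (as the comment below points out), it only warns and never rejects.

```python
import sys
from typing import Optional

# Characters that are legal in a replica set name but typical of a host list.
SUSPICIOUS_CHARS = (":", ",")


def set_name_char_warning(arg: str) -> Optional[str]:
    """Return a warning if the part before '/' looks like a host list."""
    set_name, sep, _ = arg.partition("/")
    found = [c for c in SUSPICIOUS_CHARS if c in set_name]
    if not sep or not found:
        return None
    chars = " and ".join(f'"{c}"' for c in found)
    return (f'Warning: replica set name "{set_name}" contains {chars}; '
            "did you put the seed list before the '/'?")


if __name__ == "__main__":
    # Runs in the launching process, before any fork, so the user sees it.
    for arg in sys.argv[1:]:
        msg = set_name_char_warning(arg)
        if msg:
            print(msg, file=sys.stderr)
```

For "akira-macbookpro:27019/cfgrs" this warns about ":"; for "cfgrs/akira-macbookpro:27019" it stays silent.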

Comment by Andy Schwerin [ 16/Aug/16 ]

Unfortunately, "host1:20000,host2:30000" is a legal (though strange) replica set name, and "cfgrs" is a legal host name, so I'm not sure how we could do that sanity checking, akira.kurogane. As such, I think this might have to be closed as "Works as Designed". That is disappointing. Do you have any ideas?

Generated at Thu Feb 08 04:09:44 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.