[SERVER-15879] Better error reporting when the mongos configdb string does not match with cached one Created: 30/Oct/14  Updated: 22/Mar/16  Resolved: 22/Mar/16

Status: Closed
Project: Core Server
Component/s: Diagnostics, Sharding
Affects Version/s: 2.6.0
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Anil Kumar Assignee: Kevin Pulo
Resolution: Won't Fix Votes: 10
Labels: polish
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-16774 Better enforcement of configdb string... Closed
is related to SERVER-16733 mongos does not fail when different c... Closed
is related to SERVER-23282 Provide diagnostic log message when C... Closed
Backwards Compatibility: Fully Compatible
Sprint: Sharding 2 04/24/15, Sharding 3 05/15/15, Sharding 4 06/05/15, Sharding 5 06/26/15, Sharding 6 07/17/15, Sharding 7 08/10/15, Sharding 8 08/28/15, Sharding 9 (09/18/15), Sharding A (10/09/15), Sharding B (10/30/15)
Participants:

 Description   

If the config db string specified by mongos does not match the one cached in the config db, there are no errors or warnings in the logs at the default log level (in MongoDB 2.6 onwards). As a result, a partially impaired cluster goes undetected until a metadata write operation starts failing. The following snippet, taken at the default log level, shows a very vague error that can easily send the investigation in the wrong direction:

2014-10-21T22:00:21.460+0000 [conn2] addshard request { addShard: "testshard/examplehost" } failed: could not verify config servers were active and reachable before write
2014-10-21T22:02:07.275+0000 [conn2] end connection 127.0.0.1:51706 (0 connections now open)
2014-10-21T22:40:05.079+0000 [mongosMain] connection accepted from 127.0.0.1:53069 #3 (1 connection now open)

Even when the impact is noticed, it is not clear what is causing it; the underlying error only becomes visible when a higher log level is enabled on mongos. The following logs were captured with the mongos log level raised (logLevel = 2):

2014-10-25T04:46:59.871+0000 [Balancer] calling onCreate auth for host1.cluster1.example.com:27019 (10.10.10.10)
2014-10-25T04:46:59.872+0000 [Balancer] initializing shard connection to host1.cluster1.example.com:27019 (10.10.10.10)
2014-10-25T04:46:59.872+0000 [Balancer] initial sharding settings : { setShardVersion: "", init: true, configdb: "host1.cluster1.example.com:27019,host2.cluster1.example.com:27019,host3.cluster1.example.com:27019", serverID: ObjectId('544a936356680b7002b9a101'), authoritative: true }
2014-10-25T04:46:59.872+0000 [Balancer] initial sharding result : { configdb: { stored: "cfg1.tmp.example.com:27019,cfg2.tmp.example.com:27019,cfg3.tmp.example.com:27019", given: "host1.cluster1.example.com:27019,host2.cluster1.example.com:27019,host3.cluster1.example.com:27019" }, ok: 0.0, errmsg: "mongos specified a different config database string : stored : cfg1.tmp.example.com:27019,cfg2.tmp.example.com:27019,cfg3.tmp.example.com:27019 vs given ..." }
2014-10-25T04:46:59.872+0000 [Balancer] User Assertion: 15907:could not initialize sharding on connection host1.cluster1.example.com:27019 (10.10.10.10) :: caused by :: mongos specified a different config database string : stored : cfg1.tmp.example.com:27019,cfg2.tmp.example.com:27019,cfg3.tmp.example.com:27019 vs given : host1.cluster1.example.com:27019,host2.cluster1.example.com:27019,host3.cluster1.example.com:27019
2014-10-25T04:46:59.872+0000 [Balancer] creating new connection to:host2.cluster1.example.com:27019
2014-10-25T04:46:59.872+0000 [ConnectBG] BackgroundJob starting: ConnectBG
2014-10-25T04:46:59.873+0000 [Balancer] connected to server host2.cluster1.example.com:27019 (10.10.10.11)

Since this impairs the operation of the cluster, the error / assertion should be logged at the default log level.
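
For illustration only, here is a minimal sketch of the kind of warning that could be emitted at the default log level when a setShardVersion-style result reports a stored-vs-given configdb mismatch. The helper name and its placement are hypothetical (this is not the actual mongos code path), and it assumes the 2.6-era warning() stream from mongo/util/log.h plus a result document shaped like the "initial sharding result" above:

    #include "mongo/db/jsobj.h"   // BSONObj / BSONElement
    #include "mongo/util/log.h"   // warning() stream, visible at the default log level
 
    namespace mongo {
 
        // Hypothetical helper: surface a configdb string mismatch at default verbosity.
        void warnOnConfigdbMismatch( const BSONObj& ssvResult ) {
            if ( ssvResult["ok"].trueValue() )
                return;  // the command succeeded, nothing to report
            BSONElement configdb = ssvResult["configdb"];
            if ( !configdb.isABSONObj() )
                return;  // the failure was not a configdb mismatch
            warning() << "mongos configdb string does not match the one cached by the cluster:"
                      << " stored: " << configdb.Obj()["stored"].str()
                      << " given: " << configdb.Obj()["given"].str();
        }
    }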



 Comments   
Comment by Kevin Pulo [ 22/Mar/16 ]

With CSRS (Config Server Replica Sets), this is no longer relevant or a problem.

Comment by Kevin Pulo [ 31/Oct/14 ]

In terms of triaging this, I believe the root issue is that combineFsyncErrors() in src/mongo/s/write_ops/config_coordinator.cpp takes a vector of fsync responses but does not incorporate any of that information into the client response it assembles; it includes only the generic "could not verify config servers were active and reachable before write" error message. If this function appended the actual underlying error messages from the (failed) config server fsyncs to that message, it would give the necessary visibility into the real cause of the mongos's inability to write to the config db (see the sketch after the snippets below).

    static void combineFsyncErrors( const vector<ConfigFsyncResponse*>& responses,
                                    BatchedCommandResponse* clientResponse ) {
 
        // Note: 'responses' (which holds the per-host fsync errors) is never
        // consulted here; only the generic message below reaches the client.
        clientResponse->setOk( false );
        clientResponse->setErrCode( ErrorCodes::RemoteValidationError );
        clientResponse->setErrMessage( "could not verify config servers were "
                                       "active and reachable before write" );
    }

            bool fsyncError = false;
            while ( _dispatcher->numPending() > 0 ) {
 
                fsyncResponses.push_back( new ConfigFsyncResponse() );
                ConfigFsyncResponse& fsyncResponse = *fsyncResponses.back();
                Status dispatchStatus = _dispatcher->recvAny( &fsyncResponse.configHost,
                                                              &fsyncResponse.response );
 
                // We've got to recv everything, no matter what
                if ( !dispatchStatus.isOK() ) {
                    fsyncError = true;
                    buildFsyncErrorFrom( dispatchStatus, &fsyncResponse.response );
                }
                else if ( !fsyncResponse.response.getOk() ) {
                    fsyncError = true;
                }
            }
 
            if ( fsyncError ) {
                combineFsyncErrors( fsyncResponses, clientResponse );
                return;
            }
            else {
                fsyncResponsesOwned.clear();
            }
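
As a concrete (but simplified) sketch of the direction suggested above — keep the generic prefix, but append the per-host fsync failures to it instead of discarding them — here is a self-contained example. HostError and this free-function signature are illustrative stand-ins, not the actual ConfigFsyncResponse / BatchedCommandResponse API:

    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>
 
    // Simplified stand-in for a per-config-server fsync result.
    struct HostError {
        std::string host;    // e.g. "cfg1.tmp.example.com:27019"
        std::string errmsg;  // the underlying error reported for that host
    };
 
    // Build the combined client-facing message: the familiar generic text,
    // followed by the actual per-host failures so the cause is visible.
    std::string combineFsyncErrors( const std::vector<HostError>& failures ) {
        std::ostringstream msg;
        msg << "could not verify config servers were active and reachable before write";
        for ( size_t i = 0; i < failures.size(); ++i ) {
            msg << ( i == 0 ? " :: caused by :: " : "; " )
                << failures[i].host << ": " << failures[i].errmsg;
        }
        return msg.str();
    }
 
    int main() {
        HostError e;
        e.host = "cfg1.tmp.example.com:27019";
        e.errmsg = "example underlying fsync error from this config server";
 
        std::vector<HostError> failures;
        failures.push_back( e );
 
        std::cout << combineFsyncErrors( failures ) << std::endl;
        return 0;
    }

Wired into config_coordinator.cpp, the same loop would walk the responses vector that combineFsyncErrors() already receives but currently ignores.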
