[SERVER-35957] While upgrading a sharded cluster from 3.4 to 3.6, config.system.sessions is not sharded. Created: 04/Jul/18  Updated: 23/Jul/18  Resolved: 10/Jul/18

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.6.5
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: babu balakrishnan Assignee: Esha Maharishi (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File Logs.rar     File a.conf     File b.conf     File cfg.conf     Text File log0.log     File mongos.conf    
Issue Links:
Related
related to SERVER-35795 3.4 secondaries crashing after some t... Closed
Operating System: ALL
Steps To Reproduce:

1. Run db.adminCommand( { setFeatureCompatibilityVersion: "3.4" } ) from a mongos.

2. Make sure the feature compatibility version is 3.4 before starting the upgrade (a sketch of the shell commands for these checks follows this list).

3. Stop the balancer.

4. Choose a secondary and gracefully shut down the server.

5. Remove the 3.4 packages.

6. Install the 3.6.5 packages.

7. Start the shard mongod and the config server mongod.

8. Wait for the servers to rejoin the cluster.

9. Repeat steps 4 to 8 for the other secondary.

10. Step down the shard primary and the config server primary.

11. Repeat steps 4 to 8 on the last secondary.

12. After that, upgrade the mongos components.

13. Start the balancer.

14. Monitor the error logs.
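
For reference, the checks in steps 1-4 and 10 can be run from the mongo shell roughly as follows. This is a minimal sketch using standard shell helpers, not the exact commands run against this cluster:

// Verify the feature compatibility version before upgrading any binaries
// (run against a mongos or a shard/config server primary).
db.adminCommand( { getParameter: 1, featureCompatibilityVersion: 1 } )

// Stop the balancer and confirm that it is stopped (run from a mongos).
sh.stopBalancer()
sh.getBalancerState()

// Step down a shard or config server primary before upgrading it (run on that primary).
rs.stepDown()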

Sprint: Sharding 2018-07-16
Participants:

 Description   

Hello, 

 

I have 2 shards, and each shard is running with 1 primary and 2 secondaries. 

The config server replica set members are co-located with the shard mongods to share the cost of the project. 

All components were running on MongoDB 3.4.10. 

For the upgrade I followed the process described in the Steps To Reproduce section above. 

 

These are the error logs found. 

 

2018-07-04T22:57:34.431+0800 I REPL [rsBackgroundSync] could not find member to sync from
2018-07-04T22:57:39.429+0800 I REPL [SyncSourceFeedback] SyncSourceFeedback error sending update to mongo-dba-test01.mylab.com:27017: InvalidSyncSource: Sync source was cleared. Was mongo-dba-test01.mylab.com:27017
2018-07-04T22:57:40.476+0800 I REPL [replexec-7] Starting an election, since we've seen no PRIMARY in the past 10000ms

 

NetworkInterfaceASIO-ShardRegistry-0] Connecting to mongo-dba-test01.mylab.com:27017
2018-07-04T22:56:48.659+0800 I ASIO [NetworkInterfaceASIO-ShardRegistry-0] Successfully connected to mongo-dba-test01.mylab.com:27017, took 2ms (1 connections now open to mongo-dba-test01.mylab.com:27017)
2018-07-04T22:56:48.661+0800 I SHARDING [ShardServerCatalogCacheLoader-0] Refresh for collection config.system.sessions took 4 ms and failed :: caused by :: CommandNotFound: no such command: 'forceRoutingTableRefresh', bad cmd: '{ forceRoutingTableRefresh: "config.system.sessions", maxTimeMS: 30000 }'

bad command.

The configuration files and the error logs are attached. 

 

 

 



 Comments   
Comment by babu balakrishnan [ 11/Jul/18 ]

I honestly think the issue can't be resolved in a public forum. Indeed this is a bug: upgrading from 3.4 to 3.6.5 using the standard operating procedure described in the documentation does not shard the config.system.sessions collection, even after following your recommendation of setting the feature compatibility version to 3.6. 

config.system.sessions is a system collection, not a user collection. We don't see this issue when upgrading from 3.2 to 3.4. 

Please advise.

 

Comment by Esha Maharishi (Inactive) [ 10/Jul/18 ]

Hi baabhu@yahoo.com,

From your responses, it seems you are looking for general support rather than describing a specific bug in the server. For MongoDB-related support discussion, please post on the mongodb-user group or Stack Overflow with the mongodb tag; a question like this, which involves more discussion, would be best posted on the mongodb-user group.

See also our Technical Support page for additional support resources.

Comment by babu balakrishnan [ 10/Jul/18 ]

I have announced the production deployment for two weeks from now, and there are dozens of servers that need to be upgraded. I need this investigation to be completed by the end of this week. Can you please help?

 

Comment by babu balakrishnan [ 10/Jul/18 ]

Are you sure that works? I just executed it on my lower environment and it just hangs.   

It finished running after a few minutes, but I still see the same message. Since this is my UAT environment, let me know where I should upload my log files. 

Comment by Esha Maharishi (Inactive) [ 10/Jul/18 ]

baabhu@yahoo.com, have you set the feature compatibility version of the cluster to 3.6? See https://docs.mongodb.com/manual/reference/command/setFeatureCompatibilityVersion/.

Until the feature compatibility version has been set to 3.6, the config server will not create (and shard) the config.system.sessions collection (see v3.6.5 code).
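
For anyone following the same upgrade, a minimal sketch of the commands involved (run from the mongo shell connected to a mongos; the verification query assumes the collection's entry appears in config.collections once the config server has created and sharded it):

// Raise the feature compatibility version once all binaries are on 3.6.
db.adminCommand( { setFeatureCompatibilityVersion: "3.6" } )

// Then verify that config.system.sessions has been created and sharded.
use config
db.collections.find( { _id: "config.system.sessions" } ).pretty()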

Comment by babu balakrishnan [ 10/Jul/18 ]

How do I reopen the issue? The link you attached is for a different issue. I have a working cluster; it is just that the config.system.sessions collection is not sharded.

Comment by babu balakrishnan [ 10/Jul/18 ]

Thanks for your comments. The issue is that config.system.sessions is not sharded. I followed the standard operating procedure and am still facing this issue. Any advice on this? I have a cluster upgraded to 3.6.5 with this error popping up 

Refresh for collection config.system.sessions took 8 ms and found the collection is not sharded

every few seconds. 

I am fairly concerned about the system collection not being sharded. 

If you think CommandNotFound: no such command: 'forceRoutingTableRefresh' is not an issue, let's ignore it, continue investigating the root cause of config.system.sessions not being sharded, and help me shard the collection. 

I would also like to have someone in the APAC time zone assigned, if possible; that way the responses to my queries will be much faster. 

 

Comment by Esha Maharishi (Inactive) [ 09/Jul/18 ]

Note, the fact that the secondary tried to refresh for config.system.sessions while in feature compatibility version 3.4 is a separate issue tracked by SERVER-35795.

Comment by Esha Maharishi (Inactive) [ 09/Jul/18 ]

Hi baabhu@yahoo.com,

I looked closely at your logs and configuration files and found:

This is the cluster configuration:

mongo-dba-test01
 27017, shard01
 27018, shard02
 27019, config
 
mongo-dba-test02
 27017, shard01
 27018, shard02
 27019, config
 
mongo-dba-test03
 27017, shard01
 27018, shard02
 27019, config 

The "CommandNotFound" errors for the forceRoutingTableRefresh command were only logged by the (secondary) nodes for shard01 and shard02 on mongo-dba-test03:

./03/mongo_shard1.log:2980:2018-07-04T22:56:48.661+0800 I SHARDING [ShardServerCatalogCacheLoader-0] Refresh for collection config.system.sessions took 4 ms and failed :: caused by :: CommandNotFound: no such command: 'forceRoutingTableRefresh', bad cmd: '{ forceRoutingTableRefresh: "config.system.sessions", maxTimeMS: 30000 }'
 
./03/mongo_shard2.log:3256:2018-07-04T22:56:56.845+0800 I SHARDING [ShardServerCatalogCacheLoader-0] Refresh for collection config.system.sessions took 4 ms and failed :: caused by :: CommandNotFound: no such command: 'forceRoutingTableRefresh', bad cmd: '{ forceRoutingTableRefresh: "config.system.sessions", maxTimeMS: 30000 }' 

At the time (2018-07-04T22:56:48) the error was logged by the shard01 node on mongo-dba-test03, the primary for shard01 was mongo-dba-test01/27017, which was running v3.4.15:

01/mongo_shard_01.log:2018-07-04T21:36:18.425+0800 I CONTROL  [initandlisten] db version v3.4.15
 
01/mongo_shard_01.log:2018-07-04T21:44:41.632+0800 I REPL     [ReplicationExecutor] transition to PRIMARY
01/mongo_shard_01.log:2018-07-04T21:44:43.557+0800 I REPL     [rsSync] transition to primary complete; database writes are now permitted
 
// from 2018-07-04T21:44:43 to 2018-07-04T22:57:30, this node was shard01's primary, running v3.4.15
 
01/mongo_shard_01.log:2018-07-04T22:57:30.721+0800 I COMMAND  [conn43] Attempting to step down in response to replSetStepDown command
01/mongo_shard_01.log:2018-07-04T22:57:30.722+0800 I REPL [conn43] transition to SECONDARY 
 
01/mongo_shard_01.log:2018-07-04T23:00:32.939+0800 I CONTROL  [conn56] shutting down with code:0
01/mongo_shard_01.log:2018-07-04T23:03:32.905+0800 I CONTROL  [main] ***** SERVER RESTARTED *****
01/mongo_shard_01.log:2018-07-04T23:03:32.918+0800 I CONTROL  [initandlisten] db version v3.6.5

Similarly, at the time (2018-07-04T22:56:56) the error was logged by the shard02 node on mongo-dba-test03, the primary for shard02 was mongo-dba-test01/27018:

01/mongo_shard_02.log:2018-07-04T21:36:22.240+0800 I CONTROL  [initandlisten] db version v3.4.15
 
01/mongo_shard_02.log:2018-07-04T22:50:05.569+0800 I REPL     [ReplicationExecutor] transition to PRIMARY
01/mongo_shard_02.log:2018-07-04T22:50:07.344+0800 I REPL     [rsSync] transition to primary complete; database writes are now permitted
 
// from 2018-07-04T22:50:07 to 2018-07-04T22:57:52, this node was shard02's primary, running v3.4.15
 
01/mongo_shard_02.log:2018-07-04T22:57:52.803+0800 I COMMAND  [conn39] Attempting to step down in response to replSetStepDown command
01/mongo_shard_02.log:2018-07-04T22:57:52.803+0800 I REPL     [conn39] transition to SECONDARY
 
01/mongo_shard_02.log:2018-07-04T23:00:22.718+0800 I CONTROL  [conn52] shutting down with code:0
01/mongo_shard_02.log:2018-07-04T23:03:44.676+0800 I CONTROL  [main] ***** SERVER RESTARTED *****
01/mongo_shard_02.log:2018-07-04T23:03:44.689+0800 I CONTROL  [initandlisten] db version v3.6.5

Assuming the clocks were relatively synchronized across the machines, it is expected that the v3.6 secondary nodes for the shards on mongo-dba-test03 would log this error while the shards had v3.4 primary nodes.

I am closing this as Works as Designed, but please feel free to re-open if you feel this was an incorrect diagnosis.

 

Comment by babu balakrishnan [ 09/Jul/18 ]

Any updates?

Comment by babu balakrishnan [ 06/Jul/18 ]

I could start a session, no problem. I confirm there are no secondaries or primaries running on 3.4. The error should have gone away on its own.

 

I am still seeing the messages even after I upgraded all components to 3.6.5.

Comment by Esha Maharishi (Inactive) [ 05/Jul/18 ]

Hi baabhu@yahoo.com, thank you for your report.

You should see this error message logged on any 3.6.x secondary whose replica set primary is still running 3.4.x.

The error message is an artifact of adding safe secondary reads in 3.6. With safe secondary reads, primaries persist their routing table cache so that secondaries within their replica set can replicate the routing table cache via the oplog. If a secondary finds that the persisted routing table cache has not been written yet (the issue you are probably running into, since the routing table refresh is for the "config.system.sessions" namespace), or that it is stale (because the secondary was contacted by a mongos that had a fresher version of the routing table than what had been replicated to that secondary), the secondary will try to provoke its primary to refresh by sending the "forceRoutingTableRefresh" command to the primary.

However, if the primary node is still 3.4.x, it will not have the "forceRoutingTableRefresh" command, so the request against the secondary will simply fail with CommandNotFound.

A user can read from such a secondary with the default readConcern for secondary reads, 'available', until the entire replica set has been upgraded to 3.6.
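
As an illustration only (the test.mycollection namespace below is a placeholder), reading from a secondary with the 'available' read concern from the 3.6 mongo shell could look roughly like this:

// Target secondaries and issue a read with readConcern "available",
// which does not depend on the persisted routing table cache.
db.getMongo().setReadPref("secondary")
db.getSiblingDB("test").mycollection.find().readConcern("available")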

However, I am surprised that this issue is happening for the refresh for "config.system.sessions", which is a system collection (not a user collection). This suggests that a read or direct refresh against "config.system.sessions" is being done on a secondary with non-default readConcern.

baabhu@yahoo.com,

1) In addition to seeing this error message in the logs, are you unable to use the cluster properly? For example, are you unable to start or use a session? (A quick shell check for this is sketched below.)

2) If you step up the 3.6 node to primary, are your issues resolved?
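
For question 1, one way to sanity-check sessions from a 3.6 mongo shell connected to a mongos is sketched below; the ping command is just a stand-in for any operation run on the session:

// Start a server session from the shell, run a command on it, then end it.
var session = db.getMongo().startSession()
session.getDatabase("admin").runCommand( { ping: 1 } )
session.endSession()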

misha.tyulenev, do you know where the internal refresh for config.system.sessions comes from?

Comment by babu balakrishnan [ 04/Jul/18 ]

Logs.rar contains the log files of the entire cluster. 
