[SERVER-61125] TooManyLogicalSessions error on sharded cluster when balancer active Created: 30/Oct/21  Updated: 22/Jun/22  Resolved: 31/Jan/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.4.9
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Daniele Tessaro Assignee: Eric Sedor
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   

This happened twice within a two-day period.

We have a sharded cluster with two shards (A and B), 3 mongos and 3 config servers.

On the first day (2021-10-27T20:45-0700), the shard B primary started reporting the TooManyLogicalSessions error and became essentially unusable (we couldn't even run rs.stepDown() or shutdownServer).

We stopped traffic to the cluster and waited, and the machine became responsive again. We re-enabled the traffic, and the machine became unusable again (we couldn't even tell the number of active sessions at that time).
We stopped the traffic again and got to a state where we could promote a different primary and restart the machine. This fixed the issue.

Around 24 hours later (2021-10-28T20:45-0700), shard A experienced the same issue. On that occasion we were able to see the number of sessions (db.serverStatus().logicalSessionRecordCache.activeSessionsCount), which was around 950k.
We stopped the traffic again and noticed that, after a few minutes, the number of sessions started oscillating: it would stay at a very low number (around 50), then suddenly jump to 950k, then drop and climb again. This continued until we decided to stop the shard balancer, which seemed to fix the issue.
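For reference, a minimal shell sketch of how such a counter can be watched on the shard primary (the 5-second interval is arbitrary):

// Poll the logical session cache counter every 5 seconds
while (true) {
	var cache = db.serverStatus().logicalSessionRecordCache;
	print(new Date().toISOString() + " activeSessionsCount=" + cache.activeSessionsCount);
	sleep(5000);
}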

We kept the balancer off for a few hours, then switched it back on; after around 12 hours with the balancer on, the issue has not recurred so far.
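(For completeness, these are the standard shell helpers, run on a mongos, for toggling and checking the balancer:)

mongos> sh.stopBalancer()      // disables the balancer; waits for any in-progress balancing round
mongos> sh.getBalancerState()  // returns false while disabled
mongos> sh.startBalancer()     // re-enables the balancer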

I think it is worth noting that we are in the middle of a very big chunk migration: shard A has >500k chunks that we are slowly moving to shard B, which started out empty. The migration has been extremely slow. To try to speed it up we merged some chunks so that they are ~200 MB each, and those are the ones currently being migrated.
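(For reference, chunk merges like this use the mergeChunks admin command on a mongos; the namespace and shard-key bounds below are placeholders, not our real values:)

mongos> db.adminCommand({
	mergeChunks: "mydb.mycoll",                              // hypothetical namespace
	bounds: [ { myShardKey: MinKey }, { myShardKey: 100 } ]  // contiguous chunk range on one shard
})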

This piece of the shard A log from the Oct 28 event seems relevant:

{"t":{"$date":"2021-10-28T20:15:12.311-07:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"LogicalSessionCacheReap","msg":"Slow query","attr":{"type":"command","ns":"config.transactions","command":{"getMore":8936347662363425973,"collection":"transactions","$db":"config"},"originatingCommand":{"find":"transactions","filter":{"lastWriteDate":{"$lt":{"$date":"2021-10-29T02:45:08.517Z"}}},"projection":{"_id":1},"sort":{"_id":1},"readConcern":{},"$db":"config"},"planSummary":"IXSCAN { _id: 1 }","cursorid":8936347662363425973,"keysExamined":187923,"docsExamined":187923,"cursorExhausted":true,"numYields":193,"nreturned":136087,"reslen":12137122,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":194}},"Global":{"acquireCount":{"r":194}},"Database":{"acquireCount":{"r":194}},"Collection":{"acquireCount":{"r":194}},"Mutex":{"acquireCount":{"r":1}}},"readConcern":{"provenance":"clientSupplied"},"storage":{"data":{"bytesRead":88715347,"timeReadingMicros":253343}},"protocol":"op_msg","durationMillis":1057}}
{"t":{"$date":"2021-10-28T20:18:39.894-07:00"},"s":"I",  "c":"CONTROL",  "id":20710,   "ctx":"LogicalSessionCacheRefresh","msg":"Failed to refresh session cache, will try again at the next refresh interval","attr":{"error":"WriteConcernFailed: waiting for replication timed out; Error details: { wtimeout: true, writeConcern: { w: \"majority\", wtimeout: 15000, provenance: \"clientSupplied\" } } at raw_shard_brsn0"}}
{"t":{"$date":"2021-10-28T20:21:33.303-07:00"},"s":"I",  "c":"CONTROL",  "id":20710,   "ctx":"LogicalSessionCacheRefresh","msg":"Failed to refresh session cache, will try again at the next refresh interval","attr":{"error":"WriteConcernFailed: waiting for replication timed out; Error details: { wtimeout: true, writeConcern: { w: \"majority\", wtimeout: 15000, provenance: \"clientSupplied\" } } at raw_shard_brsn0"}}
{"t":{"$date":"2021-10-28T20:27:01.972-07:00"},"s":"I",  "c":"CONTROL",  "id":20710,   "ctx":"LogicalSessionCacheRefresh","msg":"Failed to refresh session cache, will try again at the next refresh interval","attr":{"error":"WriteConcernFailed: waiting for replication timed out; Error details: { wtimeout: true, writeConcern: { w: \"majority\", wtimeout: 15000, provenance: \"clientSupplied\" } } at raw_shard_brsn0"}}
{"t":{"$date":"2021-10-28T20:30:31.718-07:00"},"s":"I",  "c":"CONTROL",  "id":20710,   "ctx":"LogicalSessionCacheRefresh","msg":"Failed to refresh session cache, will try again at the next refresh interval","attr":{"error":"WriteConcernFailed: waiting for replication timed out; Error details: { wtimeout: true, writeConcern: { w: \"majority\", wtimeout: 15000, provenance: \"clientSupplied\" } } at raw_shard_brsn0"}}
{"t":{"$date":"2021-10-28T20:36:12.032-07:00"},"s":"I",  "c":"CONTROL",  "id":20710,   "ctx":"LogicalSessionCacheRefresh","msg":"Failed to refresh session cache, will try again at the next refresh interval","attr":{"error":"WriteConcernFailed: waiting for replication timed out; Error details: { wtimeout: true, writeConcern: { w: \"majority\", wtimeout: 15000, provenance: \"clientSupplied\" } } at raw_shard_brsn0"}}
{"t":{"$date":"2021-10-28T20:40:32.434-07:00"},"s":"I",  "c":"CONTROL",  "id":20710,   "ctx":"LogicalSessionCacheRefresh","msg":"Failed to refresh session cache, will try again at the next refresh interval","attr":{"error":"WriteConcernFailed: waiting for replication timed out; Error details: { wtimeout: true, writeConcern: { w: \"majority\", wtimeout: 15000, provenance: \"clientSupplied\" } } at raw_shard_brsn0"}}
{"t":{"$date":"2021-10-28T20:50:36.166-07:00"},"s":"I",  "c":"CONTROL",  "id":20710,   "ctx":"LogicalSessionCacheRefresh","msg":"Failed to refresh session cache, will try again at the next refresh interval","attr":{"error":"WriteConcernFailed: waiting for replication timed out; Error details: { wtimeout: true, writeConcern: { w: \"majority\", wtimeout: 15000, provenance: \"clientSupplied\" } } at raw_shard_brsn0"}}
{"t":{"$date":"2021-10-28T20:50:58.783-07:00"},"s":"I",  "c":"CONTROL",  "id":20710,   "ctx":"LogicalSessionCacheRefresh","msg":"Failed to refresh session cache, will try again at the next refresh interval","attr":{"error":"WriteConcernFailed: waiting for replication timed out; Error details: { wtimeout: true, writeConcern: { w: \"majority\", wtimeout: 15000, provenance: \"clientSupplied\" } } at raw_shard_brsn0"}}

The other thing I want to mention is that when I run this command on a mongos (as suggested in other related Jira tickets), I get an Unauthorized error:

mongos> db.getSiblingDB('config').system.sessions.stats()
{
	"ok" : 0,
	"errmsg" : "not authorized on config to execute command { collStats: \"system.sessions\", scale: undefined, lsid: { id: UUID(\"051501ce-f5ab-4eaa-80a4-687810b8f7da\") }, $clusterTime: { clusterTime: Timestamp(1635569249, 19), signature: { hash: BinData(0, 2E85395E3928AF293BC4339EB8FBB389043E3BA7), keyId: 6962678295087284246 } }, $db: \"config\" }",
	"code" : 13,
	"codeName" : "Unauthorized",
	"operationTime" : Timestamp(1635569255, 54),
	"$clusterTime" : {
		"clusterTime" : Timestamp(1635569255, 54),
		"signature" : {
			"hash" : BinData(0,"ckGeG5WkxyjzXSgDOrVk0Cd4xfE="),
			"keyId" : NumberLong("6962678295087284246")
		}
	}
}
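(A possible workaround, sketched under the assumption that the built-in clusterMonitor role covers collStats on config.system.sessions; a custom role scoped to that collection may be needed instead. "myUser" is a placeholder:)

mongos> db.getSiblingDB("admin").grantRolesToUser(
	"myUser",                                      // hypothetical user name
	[ { role: "clusterMonitor", db: "admin" } ]
)
mongos> db.getSiblingDB("config").system.sessions.stats()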



 Comments   
Comment by Eric Sedor [ 31/Jan/22 ]

I understand daniele@spiketrap.io; be well!

Comment by Daniele Tessaro [ 31/Jan/22 ]

Hi Eric, sorry for the late reply. Unfortunately it's been too long since the event happened. We no longer have access to the logs (we actually had to move away from a sharded solution).

Sorry I can't help more.

Comment by Eric Sedor [ 31/Jan/22 ]

Hi daniele@spiketrap.io, are you able to provide the additional information I requested above?

Comment by Eric Sedor [ 10/Dec/21 ]

Hi daniele@spiketrap.io, and apologies for the delay. For the question of authorization I encourage you to ask our community for help by posting on the MongoDB Developer Community Forums.

For the main issue you're reporting:

  • Can you describe the architecture of each replica set (Shard A and B) in the cluster?
  • Can you please archive (tar or zip) the mongod.log files and the $dbpath/diagnostic.data directory for the shard B primary and a representative mongos (the contents are described here) and upload them to this support uploader location?

Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.

Thanks,
Eric
