[SERVER-35222] Crash on the config server at expired session cleanup Created: 25/May/18  Updated: 29/Oct/23  Resolved: 28/Aug/18

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.6.0, 4.0.0
Fix Version/s: 3.6.9, 4.0.3, 4.1.3

Type: Bug Priority: Major - P3
Reporter: PARK-MinSoo [X] Assignee: Randolph Tan
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File log.txt    
Issue Links:
Backports
Depends
Duplicate
Related
is related to SERVER-36904 Fuzzer drops config.system.sessions a... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v4.0, v3.6
Sprint: Sharding 2018-06-18, Sharding 2018-07-16, Sharding 2018-07-30, Sharding 2018-08-13, Sharding 2018-09-10
Participants:
Linked BF Score: 52

 Description   

Hello,

Our MongoDB server is in operation.
There was a crash on the config server, and the stack dump below was logged.

Could you please determine whether the cause of the dump is a bug?

Thank you.

 

----- BEGIN BACKTRACE -----{"backtrace":[{"b":"7F2340D96000","o":"21AB111","s":"_ZN5mongo15printStackTraceERSo"},{"b":"7F2340D96000","o":"21AA329"},{"b":"7F2340D96000","o":"21AA80D"},{"b":"7F233FE19000","o":"F370"},{"b":"7F233FA58000","o":"351D7","s":"gsignal"},{"b":"7F233FA58000","o":"368C8","s":"abort"},{"b":"7F2340D96000","o":"97650C","s":"_ZN5mongo17invariantOKFailedEPKcRKNS_6StatusES1_j"},{"b":"7F2340D96000","o":"10E1F9C"},{"b":"7F2340D96000","o":"1225253","s":"_ZN5mongo19AsyncRequestsSender10RemoteData27resolveShardIdToHostAndPortERKNS_21ReadPreferenceSettingE"},{"b":"7F2340D96000","o":"1225AFD","s":"_ZN5mongo19AsyncRequestsSender16_scheduleRequestENS_8WithLockEm"},{"b":"7F2340D96000","o":"122617F","s":"_ZN5mongo19AsyncRequestsSender17_scheduleRequestsENS_8WithLockE"},{"b":"7F2340D96000","o":"122ACDA","s":"_ZN5mongo19AsyncRequestsSenderC2EPNS_16OperationContextEPNS_8executor12TaskExecutorENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKSt6vectorINS0_7RequestESaISD_EERKNS_21ReadPreferenceSettingENS_5Shard11RetryPolicyE"},{"b":"7F2340D96000","o":"10A0C6C","s":"_ZN5mongo14BatchWriteExec12executeBatchEPNS_16OperationContextERNS_10NSTargeterERKNS_21BatchedCommandRequestEPNS_22BatchedCommandResponseEPNS_19BatchWriteExecStatsE"},{"b":"7F2340D96000","o":"10AD016","s":"_ZN5mongo13ClusterWriter5writeEPNS_16OperationContextERKNS_21BatchedCommandRequestEPNS_19BatchWriteExecStatsEPNS_22BatchedCommandResponseE"},{"b":"7F2340D96000","o":"1092CBB"},{"b":"7F2340D96000","o":"10930B5"},{"b":"7F2340D96000","o":"1AFC3D3"},{"b":"7F2340D96000","o":"1AFFA14"},{"b":"7F2340D96000","o":"1AFFECD","s":"_ZN5mongo18SessionsCollection9doRefreshERKNS_15NamespaceStringERKSt13unordered_setINS_20LogicalSessionRecordENS_24LogicalSessionRecordHashESt8equal_toIS5_ESaIS5_EESt8functionIFNS_6StatusENS_7BSONObjEEE"},{"b":"7F2340D96000","o":"1091502","s":"_ZN5mongo25SessionsCollectionSharded15refreshSessionsEPNS_16OperationContextERKSt13unordered_setINS_20LogicalSessionRecordENS_24LogicalSessionRe
cordHashESt8equal_toIS4_ESaIS4_EE"},{"b":"7F2340D96000","o":"1AF56EF","s":"_ZN5mongo23LogicalSessionCacheImpl8_refreshEPNS_6ClientE"},{"b":"7F2340D96000","o":"1AF64A8","s":"_ZN5mongo23LogicalSessionCacheImpl16_periodicRefreshEPNS_6ClientE"},{"b":"7F2340D96000","o":"106B1B2"},{"b":"7F2340D96000","o":"1BC03EA","s":"_ZN4asio6detail14strand_service8dispatchINS0_7binder1ISt8functionIFvSt10error_codeEES5_EEEEvRPNS1_11strand_implERT_"},{"b":"7F2340D96000","o":"1BC0DAC","s":"_ZN4asio6detail14strand_service8dispatchINS0_17rewrapped_handlerINS0_7binder1INS0_15wrapped_handlerINS_10io_context6strandESt8functionIFvSt10error_codeEENS0_26is_continuation_if_runningEEES9_EESB_EEEEvRPNS1_11strand_implERT_"},{"b":"7F2340D96000","o":"1BC1184","s":"_ZN4asio6detail12wait_handlerINS0_15wrapped_handlerINS_10io_context6strandESt8functionIFvSt10error_codeEENS0_26is_continuation_if_runningEEEE11do_completeEPvPNS0_19scheduler_operationERKS6_m"},{"b":"7F2340D96000","o":"1D25829","s":"_ZN4asio6detail9scheduler10do_run_oneERNS0_27conditionally_enabled_mutex11scoped_lockERNS0_21scheduler_thread_infoERKSt10error_code"},{"b":"7F2340D96000","o":"1D25A71","s":"_ZN4asio6detail9scheduler3runERSt10error_code"},{"b":"7F2340D96000","o":"106A0BD"},{"b":"7F2340D96000","o":"22B9060"},{"b":"7F233FE19000","o":"7DC5"},{"b":"7F233FA58000","o":"F776D","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.6.0""gitVersion" : "a57d8e71e6998a2d0afde7edc11bd23e5661c915""compiledModules" : [], "uname" : { "sysname" : "Linux""release" : "3.10.0-514.el7.x86_64""version" : "#1 SMP Tue Nov 22 16:42:41 UTC 2016""machine" : "x86_64" }, "somap" : [ { "b" : "7F2340D96000""elfType" : 3"buildId" : "0900573D611BFAA62156892DB8AAF9BC4331496B" }, { "b" : "7FFC834ED000""elfType" : 3"buildId" : "BF6B0C931C67B5BB2B0E80E07CBF73BAA5A466D2" }, { "b" : "7F2340959000""path" : "/lib64/libresolv.so.2""elfType" : 3"buildId" : "FE7AE845A123A3DFC0FDC2408BCBC2BA8B61B158" }, { "b" : "7F2340751000""path" : "/lib64/librt.so.1""elfType" : 3"buildId" : 
"82E77ADE22BC9FFF8D3458BD37331E7EDF174C28" }, { "b" : "7F234054D000""path" : "/lib64/libdl.so.2""elfType" : 3"buildId" : "C5F560504E1AF52E29679C3B52FF11121015D6BB" }, { "b" : "7F234024B000""path" : "/lib64/libm.so.6""elfType" : 3"buildId" : "721C7CC9488EFA25F83B48AF713AB27DBE48EF3E" }, { "b" : "7F2340035000""path" : "/lib64/libgcc_s.so.1""elfType" : 3"buildId" : "408B46E291B2D4C9612E27C0509D165D7E186D40" }, { "b" : "7F233FE19000""path" : "/lib64/libpthread.so.0""elfType" : 3"buildId" : "C3DEB1FA27CD0C1C3CC575B944ABACBA0698B0F2" }, { "b" : "7F233FA58000""path" : "/lib64/libc.so.6""elfType" : 3"buildId" : "1CC0E171FD5E7E28A6BB49667AFBD730CDBF22A0" }, { "b" : "7F2340B73000""path" : "/lib64/ld-linux-x86-64.so.2""elfType" : 3"buildId" : "0874508AA13D28E3F48637C1D5BF067BA8D9FD3A" } ] }} mongod(_ZN5mongo15printStackTraceERSo+0x41) [0x7f2342f41111] mongod(+0x21AA329) [0x7f2342f40329] mongod(+0x21AA80D) [0x7f2342f4080d] libpthread.so.0(+0xF370) [0x7f233fe28370]

 

 

mongod version : 3.6.0

os : CentOS Linux 7.3.1611



 Comments   
Comment by Park YoungSoo [ 05/Apr/19 ]

Hi,

Was the fix included in the versions listed above (3.6.9, 4.0.3, 4.1.3)?

Is there an explanation of the cause of the bug?

Thank you.

Comment by Githook User [ 19/Sep/18 ]

Author:

{'name': 'Randolph Tan', 'email': 'randolph@10gen.com', 'username': 'renctan'}

Message: SERVER-35222 Make sure that SessionsCollectionConfigServer will shard config.system.sessions for the first time

(cherry picked from commit ce0602665adb7ec7d241dd77e585f7907e405e84)
Branch: v3.6
https://github.com/mongodb/mongo/commit/3a28630bc2a3e457ffb2497a137ad6db34c808f1

Comment by Githook User [ 13/Sep/18 ]

Author:

{'name': 'Randolph Tan', 'email': 'randolph@10gen.com', 'username': 'renctan'}

Message: SERVER-35222 Make sure that SessionsCollectionConfigServer will shard config.system.sessions for the first time

(cherry picked from commit ce0602665adb7ec7d241dd77e585f7907e405e84)
Branch: v4.0
https://github.com/mongodb/mongo/commit/51272842bf6a7f30842a7c8167d145488f46f099

Comment by Githook User [ 28/Aug/18 ]

Author:

{'name': 'Randolph Tan', 'email': 'randolph@10gen.com', 'username': 'renctan'}

Message: SERVER-35222 Make sure that SessionsCollectionConfigServer will shard config.system.sessions for the first time
Branch: master
https://github.com/mongodb/mongo/commit/ce0602665adb7ec7d241dd77e585f7907e405e84

Comment by Kaloian Manassiev [ 28/Jun/18 ]

It doesn't seem right that the reaper code should be using separate code paths for whether it is going against itself versus against a shard. jack.mulrow/renctan, can you guys please figure out if there is a cleaner way to solve this?

Comment by Misha Tyulenev [ 25/May/18 ]

Hey kaloian.manassiev. ShardRegistry keeps shards built by ShardFactory according to their ConnectionString type. If the type is local, it builds a ShardLocal, which has no targeter or ReplicaSetMonitor, so the invariant that was hit is expected. This all boils down to the initialization code, which on config servers creates the corresponding shards as local.
The fix should be in the reaper code, which needs to use the API that matches the shard's type.
Moving to the needs triage state as this is a bug.
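
The failure mode described in this comment can be illustrated with a small toy model. This is only a sketch and not the server's C++ code: the method names (get_targeter, run_write), the helper functions, and the InvariantFailure exception are hypothetical stand-ins, and the real invariant aborts the process rather than raising. It also illustrates the suggestion made in this comment, which is not necessarily the change that was eventually committed.

```python
class InvariantFailure(Exception):
    """Stand-in for the server's fatal invariant (the real one aborts mongod)."""


class ShardRemote:
    """Toy model of a shard reached over the network (hypothetical API)."""

    def __init__(self, shard_id, hosts):
        self.shard_id = shard_id
        self.hosts = list(hosts)

    def get_targeter(self):
        # A remote shard can resolve itself to a host and port.
        return self.hosts

    def run_write(self, write):
        host = self.get_targeter()[0]
        return f"sent {write!r} to {host}"


class ShardLocal:
    """Toy model of the config server's view of itself: no targeter exists."""

    def __init__(self, shard_id):
        self.shard_id = shard_id

    def get_targeter(self):
        # Models the invariant seen in the backtrace.
        raise InvariantFailure("ShardLocal has no targeter")

    def run_write(self, write):
        return f"applied {write!r} locally"


def refresh_via_targeter(shard, write):
    """Models the crashing path: always resolves a host, whatever the shard type."""
    host = shard.get_targeter()[0]
    return f"sent {write!r} to {host}"


def refresh_via_shard_api(shard, write):
    """Models the suggested approach: let the shard type pick the code path."""
    return shard.run_write(write)
```

In this model, refresh_via_targeter works for a ShardRemote but raises for a ShardLocal, while refresh_via_shard_api works for both because each shard type supplies its own write path.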

Comment by Kaloian Manassiev [ 25/May/18 ]

Hi misha.tyulenev,

From the call stack I think this might be happening when there is a session created against the config server (e.g., customer write with a session against the config/admin databases). Then these sessions expire and when the reaper goes to clean them up, it ends up using the write commands code, which does targeting by calling Shard::getTargeter and this is not allowed to be called on the config server.

-Kal.

Comment by Kaloian Manassiev [ 25/May/18 ]

Hi TheCoin,

Thank you very much for your report!

In your comment above you mention that you might have a backup of the cluster from before the crash. Would it be possible to upload the part which contains the config server's data? It is not a big deal if you don't, but it might help us diagnose this issue faster.

Also, does your application by any chance write to the config or admin databases of the cluster?

Thank you in advance.

-Kal.

Comment by PARK-MinSoo [X] [ 25/May/18 ]

Hi kelsey.schubert,

Unfortunately, the complete log file is no longer available,

so I will upload the log from the time of the crash, as requested.
Mongodb-consistent-backup ran before the config server's crash, and the crash occurred two minutes later.

 

In addition, the mongos log kept showing this message:

<Refresh for collection config.system.sessions took 0 and found the collection is not sharded>

 

The entry for config.system.sessions was missing from config.collections.

So I ran db.chunks.remove({_id:"config.system.sessions-_id_MinKey"})

db.runCommand({shardcollection:"config.system.sessions",key:{_id:"hashed"}})

After re-sharding config.system.sessions, the 'not sharded' log message no longer appeared.

 

Thank you for your help,

Park-Minsoo
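
The inconsistency described in this comment (a chunk document for config.system.sessions with no matching entry in config.collections) can be expressed as a small check. This is a minimal sketch over plain dictionaries, not a query against a live cluster. It assumes chunk documents carry an "ns" field and collection documents use the namespace as "_id" with an optional "dropped" flag, and the function name is hypothetical.

```python
def orphaned_chunk_namespaces(collections, chunks):
    """Return namespaces that have chunk documents but no live collection entry.

    collections: documents shaped like config.collections (assumed layout).
    chunks: documents shaped like config.chunks (assumed layout).
    """
    # Namespaces that the metadata still considers sharded.
    sharded = {c["_id"] for c in collections if not c.get("dropped", False)}
    # Any chunk whose namespace is not in that set is left over.
    return sorted({ch["ns"] for ch in chunks if ch["ns"] not in sharded})


# Example data mirroring the state reported above (values are illustrative).
example_collections = [{"_id": "test.users", "dropped": False}]
example_chunks = [
    {"_id": "config.system.sessions-_id_MinKey", "ns": "config.system.sessions"},
    {"_id": "test.users-_id_MinKey", "ns": "test.users"},
]
leftover = orphaned_chunk_namespaces(example_collections, example_chunks)
```

With the example data, the only namespace flagged is config.system.sessions, which matches the stale chunk that was removed before the collection was sharded again.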

Comment by Kelsey Schubert [ 25/May/18 ]

Hi TheCoin,

So we can continue to investigate, would you please upload the complete log file of the affected node? I'd like to see what was happening before the mongod crashed.

Thank you for your help,
Kelsey
