[SERVER-77935] Investigate timeout from fcv downgrade in jstests/core/query/push/push2.js Created: 09/Jun/23  Updated: 05/Feb/24

Status: Open
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Adi Zaimi Assignee: Adi Zaimi
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File core_32287_stacktrace_uniq.txt     Text File lock_mgr_dump.txt    
Operating System: ALL
Sprint: Sharding NYC 2023-06-12, Sharding NYC 2023-06-26, Sharding NYC 2023-07-10, Sharding NYC 2023-07-24, Sharding NYC 2023-08-07, Sharding NYC 2023-08-21, Sharding NYC 2023-09-04, Sharding NYC 2023-09-18, Sharding NYC 2023-10-02, Sharding NYC 2023-10-16, Sharding NYC 2023-10-30, Cluster Scalability 2023-11-13, Cluster Scalability 2023-11-27, Cluster Scalability 2023-12-11, Cluster Scalability 2023-12-25, Cluster Scalability 2024-1-8, Cluster Scalability 2024-1-22, Cluster Scalability 2024-2-5, Cluster Scalability 2024-2-19
Participants:

 Description   

Investigate timeout from fcv downgrade in jstests/core/query/push/push2.js

evergreen run: https://spruce.mongodb.com/task/mongodb_mongo_master_enterprise_rhel_80_64_bit_dynamic_all_feature_flags_fcv_upgrade_downgrade_sharded_collections_jscore_passthrough_0_linux_enterprise_patch_dcd8c050fb807cd6a30f1c3f833f4be23c22fdcf_6480f63e32f417f2ff94f46f_23_06_07_21_27_30/tests?execution=0&sortBy=STATUS&sortDir=ASC

 



 Comments   
Comment by Adi Zaimi [ 09/Jun/23 ]

Looking at the db log output 

https://parsley.mongodb.com/resmoke/a3c8a5c366080396f2838f449df63c50/all?bookmarks=0,666,107087,107158,107207,107324,107364,107404,107564,107785,107786,1963855,1965416&filters=000setFeatureCompatibilityVersion,000j2%253As0%253Aprim,100fa475df0-632f-4d03-a688-740fcb34aea,1009189fc29-8fd1-4196-8189-b25a6697f147&shareLine=107821

It seems that the internal transaction was started, and the last thing it logged was
"Going to write participant list"
 

Comment by Adi Zaimi [ 09/Jun/23 ]

Looking at the lock manager dump, it can be seen that the FeatureCompatibilityVersion lock is granted to:

    resourceId = {2305843009213693953: Global, 1}, FeatureCompatibilityVersion,
    grantedList = mongo::LockRequestList = {{
        locker = 0x5574bb846800,
        notify = 0x5574bb8468c0,
        enqueueAtFront = false,
        compatibleFirst = false,
        partitioned = true,
        recursiveCount = 1,
        lock = 0x5574b9170880,
        partitionedLock = 0x0,
        prev = 0x0,
        next = 0x0,
        status = mongo::LockRequest::STATUS_GRANTED,
        mode = mongo::MODE_IX,
        convertMode = mongo::MODE_NONE,
        unlockPending = 1
      }},

 

and is being requested by 

    conflictList = mongo::LockRequestList = {{
        locker = 0x5574b9c8c000,
        notify = 0x5574b9c8c0c0,
        enqueueAtFront = true,
        compatibleFirst = true,
        partitioned = false,
        recursiveCount = 1,
        lock = 0x5574b9170880,
        partitionedLock = 0x0,
        prev = 0x0,
        next = 0x5574b8d6fd58,
        status = mongo::LockRequest::STATUS_WAITING,
        mode = mongo::MODE_S,
        convertMode = mongo::MODE_NONE,
        unlockPending = 0
      }

 

The threadId holding the lock is 0, which means it is probably an internal transaction that has already exited. The debug info shows:

    _debugInfo = "lsid: { id: UUID(\"fa475df0-632f-4d03-a688-740fcb34aea1\"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855), txnUUID: UUID(\"9189fc29-8fd1-4196-8189-b25a6697f147\") }"
 

The thread waiting for the lock is the SetFeatureCompatibilityVersionCommand.

Thread 89: "" (Thread 0x7f3bd9852700 (LWP 919))
....
#13 0x00007f3c1bafac51 in operator() (speed=mongo::Interruptible::WakeSpeed::kFast, deadline=..., __closure=<synthetic pointer>) at src/mongo/util/interruptible.h:340
#14 mongo::Interruptible::waitForConditionOrInterruptUntil<std::unique_lock<mongo::latch_detail::Latch>, mongo::CondVarLockGrantNotification::wait(mongo::OperationContext*, mongo::Milliseconds)::<lambda()> > (pred=..., finalDeadline=..., m=..., cv=...,
this=0x5574b9c90480) at src/mongo/util/interruptible.h:352
#15 mongo::Interruptible::waitForConditionOrInterruptFor<std::unique_lock<mongo::latch_detail::Latch>, mongo::CondVarLockGrantNotification::wait(mongo::OperationContext*, mongo::Milliseconds)::<lambda()> > (pred=..., ms=..., m=..., cv=..., this=0x5574b9
c90480) at src/mongo/util/interruptible.h:392
#16 mongo::CondVarLockGrantNotification::wait (this=this@entry=0x5574b9c8c0c0, opCtx=opCtx@entry=0x5574b9c90480, timeout=..., timeout@entry=...) at src/mongo/db/concurrency/lock_state.cpp:277
#17 0x00007f3c1bafc076 in mongo::LockerImpl::_lockComplete(mongo::OperationContext*, mongo::ResourceId, mongo::LockMode, mongo::Date_t, std::function<void ()> const&) (this=this@entry=0x5574b9c8c000, opCtx=opCtx@entry=0x5574b9c90480, resId=..., mode=mode@entry=mongo::MODE_S, deadline=..., deadline@entry=..., onTimeout=...) at src/mongo/db/concurrency/lock_state.cpp:1034
#18 0x00007f3c1bafccd5 in mongo::LockerImpl::lock (this=0x5574b9c8c000, opCtx=0x5574b9c90480, resId=..., mode=mongo::MODE_S, deadline=...) at src/mongo/db/concurrency/lock_state.cpp:579
#19 0x00007f3c1bae784b in mongo::Lock::ResourceLock::_lock (this=this@entry=0x7f3bd984b680, mode=mode@entry=mongo::MODE_S, deadline=..., deadline@entry=...) at src/mongo/db/concurrency/d_concurrency.cpp:306
#20 0x00007f3c1b9797df in mongo::Lock::ResourceLock::ResourceLock (deadline=..., mode=mongo::MODE_S, rid=..., opCtx=<optimized out>, this=0x7f3bd984b680) at src/mongo/db/concurrency/d_concurrency.h:56
#21 mongo::(anonymous namespace)::SetFeatureCompatibilityVersionCommand::_prepareToDowngrade (changeTimestamp=..., request=..., opCtx=<optimized out>, this=0x7f3c1b9ffc40 <mongo::(anonymous namespace)::setFeatureCompatibilityVersionCommand>) at src/mongo/db/commands/set_feature_compatibility_version_command.cpp:1358
#22 mongo::(anonymous namespace)::SetFeatureCompatibilityVersionCommand::run (this=<optimized out>, opCtx=<optimized out>, cmdObj=..., result=...) at src/mongo/db/commands/set_feature_compatibility_version_command.cpp:481
#23 0x00007f3c1c15e9e7 in mongo::BasicCommand::runWithReplyBuilder (replyBuilder=0x5574b9af8310, cmdObj=owned BSONObj 823 bytes @ 0x5574ba27f51d = {...}, dbName=..., opCtx=0x5574b9c90480, this=0x7f3c1b9ffc40 <mongo::(anonymous namespace)::setFeatureCompatibilityVersionCommand>) at src/mongo/db/commands.h:1022
#24 mongo::BasicCommandWithReplyBuilderInterface::Invocation::run (this=0x5574b94452a0, opCtx=0x5574b9c90480, result=0x5574b9af8310) at src/mongo/db/commands.cpp:926
#25 0x00007f3c1c14b410 in mongo::CommandHelpers::runCommandInvocation (opCtx=0x5574b9c90480, request=..., invocation=0x5574b94452a0, response=0x5574b9af8310) at src/mongo/db/commands.cpp:186
#26 0x00007f3c1c14f0bd in operator() (__closure=<optimized out>) at src/mongo/db/request_execution_context.h:69
#27 mongo::makeReadyFutureWith<mongo::CommandHelpers::runCommandInvocation(std::shared_ptr<mongo::RequestExecutionContext>, std::shared_ptr<mongo::CommandInvocation>, bool)::<lambda()> > (func=...) at src/mongo/util/future.h:1348
#28 mongo::CommandHelpers::runCommandInvocation (rec=std::shared_ptr<mongo::RequestExecutionContext> (empty) = {...}, invocation=std::shared_ptr<mongo::CommandInvocation> (empty) = {...}, useDedicatedThread=<optimized out>) at src/mongo/db/commands.cpp:1
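To summarize the conflict above: the FeatureCompatibilityVersion resource is held in MODE_IX by the exited internal transaction, while the setFCV command waits for MODE_S. As a rough illustration only (a minimal sketch of a standard intent-lock compatibility matrix, not MongoDB's actual lock manager code), intent-exclusive (IX) and shared (S) modes are incompatible, so the MODE_S request stays in STATUS_WAITING:

```python
# Sketch of an intent-lock compatibility matrix (illustrative, not MongoDB source).
# A requested mode is granted only if it is compatible with every held mode.
COMPATIBLE = {
    "IS": {"IS", "IX", "S"},  # intent-shared coexists with everything but X
    "IX": {"IS", "IX"},       # intent-exclusive conflicts with S and X
    "S":  {"IS", "S"},        # shared conflicts with IX and X
    "X":  set(),              # exclusive conflicts with everything
}

def is_granted(requested: str, held: set) -> bool:
    """Return True if `requested` can be granted alongside all `held` modes."""
    return all(h in COMPATIBLE[requested] for h in held)
```

With this matrix, `is_granted("S", {"IX"})` is False, matching the waiting MODE_S request in the conflictList while the MODE_IX grant is never released.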

Comment by Adi Zaimi [ 09/Jun/23 ]

No setFeatureCompatibilityVersion commands are sent after 10:21:
https://parsley.mongodb.com/resmoke/a3c8a5c366080396f2838f449df63c50/all?bookmarks=0,666,107785,1963855,1965416&filters=100setFeatureCompatibilityVersion,000j2%253As0%253Aprim&shareLine=107785
Dumping the mutexes of process 32287 (from dump_mongod.32287.core) and filtering with 'grep true' yields:

{"client": "TTLMonitor", "waiting": true, "mutex": "CondVarLockGrantNotification::_mutex", "since": "2023-06-08T00:10:49.117000+00:00"}
{"client": "TopologyVersionObserver", "waiting": true, "mutex": "AnonymousLockable", "since": "2023-06-07T22:21:36.387000+00:00"}
{"client": "conn39", "waiting": true, "mutex": "AnonymousLockable", "since": "2023-06-08T00:10:41.791000+00:00"}
{"client": "conn24", "waiting": true, "mutex": "AnonymousLockable", "since": "2023-06-08T00:10:41.987000+00:00"}
{"client": "range-deleter", "waiting": true, "mutex": "ReadyRangeDeletionsProcessor", "since": "2023-06-07T22:20:21.986000+00:00"}
{"client": "conn32", "waiting": true, "mutex": "AnonymousLockable", "since": "2023-06-08T00:10:41.807000+00:00"}
{"client": "TransactionCoordinator", "waiting": true, "mutex": "CondVarLockGrantNotification::_mutex", "since": "2023-06-08T00:10:49.152000+00:00"}
{"client": "conn57", "waiting": true, "mutex": "CondVarLockGrantNotification::_mutex", "since": "2023-06-08T00:10:49.337000+00:00"}
{"client": "ChangeStreamExpiredPreImagesRemover", "waiting": true, "mutex": "CondVarLockGrantNotification::_mutex", "since": "2023-06-08T00:10:49.178000+00:00"}
{"client": "conn118", "waiting": true, "mutex": "AnonymousLockable", "since": "2023-06-07T22:21:36.846000+00:00"}
{"client": "conn33", "waiting": true, "mutex": "AnonymousLockable", "since": "2023-06-08T00:10:41.782000+00:00"}
{"client": "ReplBatcher", "waiting": true, "mutex": "OplogBufferBlockingQueue::mutex", "since": "2023-06-08T00:10:49.075000+00:00"}
{"client": "monitoring-keys-for-HMAC", "waiting": true, "mutex": "PeriodicRunner::_mutex", "since": "2023-06-07T22:20:26.136000+00:00"}
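The filtering step above can be reproduced with a plain grep over the exported diagnostics. The file name below is hypothetical (the real data came from the dump_mongod.32287.core dump); a tiny sample is recreated first so the command is self-contained:

```shell
# Recreate a small sample of the mutex diagnostics (hypothetical file name;
# the real lines were extracted from dump_mongod.32287.core).
cat > /tmp/mutex_dump.jsonl <<'EOF'
{"client": "TTLMonitor", "waiting": true, "mutex": "CondVarLockGrantNotification::_mutex", "since": "2023-06-08T00:10:49.117000+00:00"}
{"client": "conn1", "waiting": false, "mutex": "AnonymousLockable", "since": "2023-06-08T00:10:41.791000+00:00"}
EOF

# Keep only the threads that are actually blocked on a mutex.
grep '"waiting": true' /tmp/mutex_dump.jsonl
```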

 

Generated at Thu Feb 08 06:37:01 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.