[SERVER-30790] ServerStatus on WiredTiger accesses the storage engine without any locks Created: 23/Aug/17  Updated: 30/Oct/23  Resolved: 30/Aug/17

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: None
Fix Version/s: 3.4.9, 3.5.13

Type: Bug Priority: Major - P3
Reporter: Daniel Gottlieb (Inactive) Assignee: Daniel Gottlieb (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
is related to SERVER-30888 Have FTDC code paths obtain locks wit... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v3.4
Sprint: Storage 2017-09-11
Participants:
Linked BF Score: 0

 Description   

Note that "ServerStatus" is also called by FTDC which isn't necessarily shutdown before the storage engine shutsdown.

Moreover, recover to a timestamp expects the global lock to be sufficient for requiring the system to keep its dirty paws from accessing the storage engine during this critical and complicated operation.



 Comments   
Comment by Githook User [ 01/Sep/17 ]

Author:

{'username': 'dgottlieb', 'name': 'Daniel Gottlieb', 'email': 'daniel.gottlieb@mongodb.com'}

Message: SERVER-30790: Have WiredTigerServerStatusSection take a Global IS lock.

(cherry picked from commit 1dc2afc8b5ce09d88535049df2ad72adc0434cc8)
Branch: v3.4
https://github.com/mongodb/mongo/commit/59316ba410b75c88b0c50319f1f0aaf0a9bfda52

Comment by Daniel Gottlieb (Inactive) [ 30/Aug/17 ]

>> oplog application uses a different lock than the global lock to block readers, which we call the Parallel Batch Writer Mode

Thanks for the clarification. Both are reported rolled together in the "Global" locks section in serverStatus and log files, correct?

I believe so yes.

I wonder if that's the case, or if in some way serverStatus opts out of PWBM lock

I also believe FTDC opts out of the PBWM lock:
https://github.com/mongodb/mongo/blob/1dc2afc8b5ce09d88535049df2ad72adc0434cc8/src/mongo/db/ftdc/collector.cpp#L69

I'm filing a follow-up ticket for FDTC to skip over sections that can't acquire locks. It's non-trivial as the AutoGetCollection constructor does not currently have a way to express lock acquisition timeouts. It also would be an API change if some sections were returned non-existent or empty. Monitoring can be prepped to handle that (if it's not already), but product would have better knowledge if omitting some data would pose a problem for customers.

Comment by Githook User [ 30/Aug/17 ]

Author:

{'name': 'Daniel Gottlieb', 'username': 'dgottlieb', 'email': 'daniel.gottlieb@mongodb.com'}

Message: SERVER-30790: Have WiredTigerServerStatusSection take a Global IS lock.
Branch: master
https://github.com/mongodb/mongo/commit/1dc2afc8b5ce09d88535049df2ad72adc0434cc8

Comment by Bruce Lucas (Inactive) [ 23/Aug/17 ]

oplog application uses a different lock than the global lock to block readers, which we call the Parallel Batch Writer Mode

Thanks for the clarification. Both are reported rolled together in the "Global" locks section in serverStatus and log files, correct?

Are you running with replication enabled?

I was not, and when I enable replication I do indeed see regular acquisitions reported in the Global locks metrics. Given the above I can't tell whether this is PBWM lock or actual global lock, but I assume the stack trace you quoted above shows it's indeed the global lock. Given this, then I agree the proposed change won't make a difference to FTDC.

AutoGetCollectionForRead probably acquires it.

I wonder if that's the case, or if in some way serverStatus opts out of PWBM lock. In HELP-4808 we see secondary readers regularly logging 5-10 seconds per "Global" lock acquisition, which I assume to be PWBM lock not actual global lock because a) actual global lock is rare and b) secondary is lagging indicating heavy replication activity. However FTDC is not impacted at all, suggesting it doesn't acquire PBWM lock. I'd be grateful to get clarity on this.

Comment by Andy Schwerin [ 23/Aug/17 ]

I believe that several replication state transitions require the global lock in mode X, but Eric's point remains. I think lockers have to opt out of participation in the PBWM lock, though. AutoGetCollectionForRead probably acquires it.

Comment by Eric Milkie [ 23/Aug/17 ]

For example, on a secondary ftdc would be blocked by oplog application's global lock.

This isn't quite correct; oplog application uses a different lock than the global lock to block readers, which we call the Parallel Batch Writer Mode. We also have an easy way to allow FTDC to not be blocked by this lock.
The only place I would see FTDC getting blocked would be when the GlobalLock is enqueued or acquired in X mode, which I believe only happens for replica set stepdown and at shutdown.

Comment by Daniel Gottlieb (Inactive) [ 23/Aug/17 ]

Are you running with replication enabled? https://github.com/mongodb/mongo/blob/c1e7921e9d69bd9a37761deb58d119a324341a54/src/mongo/db/ftdc/ftdc_mongod.cpp#L47-L60

Thread 16 "ftdc" hit Breakpoint 3, mongo::Lock::GlobalLock::GlobalLock (this=0x7fffeaf3cdf8, opCtx=0x7ffff1011a20, 
    lockMode=mongo::MODE_IS, timeoutMs=4294967295) at src/mongo/db/concurrency/d_concurrency.cpp:140
140	    : GlobalLock(opCtx, lockMode, timeoutMs, EnqueueOnly()) {

#0  mongo::Lock::GlobalLock::GlobalLock (this=0x7fffeaf3cdf8, opCtx=0x7ffff1011a20, lockMode=mongo::MODE_IS, 
    timeoutMs=4294967295) at src/mongo/db/concurrency/d_concurrency.cpp:140
#1  0x00005555592be5b8 in mongo::Lock::DBLock::DBLock (this=0x7fffeaf3cdc0, opCtx=0x7ffff1011a20, db="local", 
    mode=mongo::MODE_IS) at src/mongo/db/concurrency/d_concurrency.cpp:188
#2  0x00005555579b8c74 in mongo::AutoGetDb::AutoGetDb (this=0x7fffeaf3cdc0, opCtx=0x7ffff1011a20, ns="local", 
    mode=mongo::MODE_IS) at src/mongo/db/db_raii.cpp:50
#3  0x00005555579b8dca in mongo::AutoGetCollection::AutoGetCollection (this=0x7fffeaf3cdb8, opCtx=0x7ffff1011a20, nss=..., 
    modeDB=mongo::MODE_IS, modeColl=mongo::MODE_IS, viewMode=mongo::AutoGetCollection::ViewMode::kViewsForbidden)
    at src/mongo/db/db_raii.cpp:58
#4  0x00005555579bb3e3 in boost::optional_detail::optional_base<mongo::AutoGetCollection>::emplace_assign<mongo::OperationContext*&, mongo::NamespaceString const&, mongo::LockMode, mongo::LockMode, mongo::AutoGetCollection::ViewMode&>(mongo::OperationContext*&, mongo::NamespaceString const&, mongo::LockMode&&, mongo::LockMode&&, mongo::AutoGetCollection::ViewMode&) (
    this=0x7fffeaf3cdb0, args=@0x7fffeaf3c72c: mongo::AutoGetCollection::ViewMode::kViewsForbidden, 
    args=@0x7fffeaf3c72c: mongo::AutoGetCollection::ViewMode::kViewsForbidden, 
    args=@0x7fffeaf3c72c: mongo::AutoGetCollection::ViewMode::kViewsForbidden, 
    args=@0x7fffeaf3c72c: mongo::AutoGetCollection::ViewMode::kViewsForbidden, 
    args=@0x7fffeaf3c72c: mongo::AutoGetCollection::ViewMode::kViewsForbidden)
    at src/third_party/boost-1.60.0/boost/optional/optional.hpp:494
#5  0x00005555579bb0d1 in boost::optional<mongo::AutoGetCollection>::emplace<mongo::OperationContext*&, mongo::NamespaceString const&, mongo::LockMode, mongo::LockMode, mongo::AutoGetCollection::ViewMode&>(mongo::OperationContext*&, mongo::NamespaceString const&, mongo::LockMode&&, mongo::LockMode&&, mongo::AutoGetCollection::ViewMode&) (this=0x7fffeaf3cdb0, 
    args=@0x7fffeaf3c72c: mongo::AutoGetCollection::ViewMode::kViewsForbidden, 
    args=@0x7fffeaf3c72c: mongo::AutoGetCollection::ViewMode::kViewsForbidden, 
    args=@0x7fffeaf3c72c: mongo::AutoGetCollection::ViewMode::kViewsForbidden, 
    args=@0x7fffeaf3c72c: mongo::AutoGetCollection::ViewMode::kViewsForbidden, 
    args=@0x7fffeaf3c72c: mongo::AutoGetCollection::ViewMode::kViewsForbidden)
    at src/third_party/boost-1.60.0/boost/optional/optional.hpp:981
#6  0x00005555579b9bc6 in mongo::AutoGetCollectionForRead::AutoGetCollectionForRead (this=0x7fffeaf3cdb0, 
    opCtx=0x7ffff1011a20, nss=..., viewMode=mongo::AutoGetCollection::ViewMode::kViewsForbidden)
    at src/mongo/db/db_raii.cpp:129
#7  0x00005555579bb4a5 in boost::optional_detail::optional_base<mongo::AutoGetCollectionForRead>::emplace_assign<mongo::OperationContext*&, mongo::NamespaceString const&, mongo::AutoGetCollection::ViewMode&> (this=0x7fffeaf3cda8, 
    args=@0x7fffeaf3c87c: mongo::AutoGetCollection::ViewMode::kViewsForbidden, 
    args=@0x7fffeaf3c87c: mongo::AutoGetCollection::ViewMode::kViewsForbidden, 
---Type <return> to continue, or q <return> to quit---
    args=@0x7fffeaf3c87c: mongo::AutoGetCollection::ViewMode::kViewsForbidden)
    at src/third_party/boost-1.60.0/boost/optional/optional.hpp:494
#8  0x00005555579bb1e7 in boost::optional<mongo::AutoGetCollectionForRead>::emplace<mongo::OperationContext*&, mongo::NamespaceString const&, mongo::AutoGetCollection::ViewMode&> (this=0x7fffeaf3cda8, 
    args=@0x7fffeaf3c87c: mongo::AutoGetCollection::ViewMode::kViewsForbidden, 
    args=@0x7fffeaf3c87c: mongo::AutoGetCollection::ViewMode::kViewsForbidden, 
    args=@0x7fffeaf3c87c: mongo::AutoGetCollection::ViewMode::kViewsForbidden)
    at src/third_party/boost-1.60.0/boost/optional/optional.hpp:981
#9  0x00005555579b9fb9 in mongo::AutoGetCollectionForReadCommand::AutoGetCollectionForReadCommand (this=0x7fffeaf3cda8, 
    opCtx=0x7ffff1011a20, nss=..., viewMode=mongo::AutoGetCollection::ViewMode::kViewsForbidden)
    at src/mongo/db/db_raii.cpp:174
#10 0x0000555557157c0a in mongo::AutoGetCollectionForReadCommand::AutoGetCollectionForReadCommand (this=0x7fffeaf3cda8, 
    opCtx=0x7ffff1011a20, nss=...) at src/mongo/db/db_raii.h:309
#11 0x0000555557584fa8 in mongo::appendCollectionStorageStats (opCtx=0x7ffff1011a20, nss=..., 
    param=owned BSONObj 44 bytes @ 0x7ffff6175728 = {...}, result=0x7fffeaf3d1d0) at src/mongo/db/stats/storage_stats.cpp:57
#12 0x00005555574a25a5 in mongo::CollectionStats::errmsgRun (this=0x55555abf4e90 <mongo::cmdCollectionStats>, 
    opCtx=0x7ffff1011a20, dbname=..., jsobj=owned BSONObj 44 bytes @ 0x7ffff6175728 = {...}, errmsg=..., result=...)
    at src/mongo/db/commands/dbcommands.cpp:992
#13 0x0000555558dbab15 in mongo::ErrmsgCommandDeprecated::run (this=0x55555abf4e90 <mongo::cmdCollectionStats>, 
    opCtx=0x7ffff1011a20, db=..., cmdObj=owned BSONObj 44 bytes @ 0x7ffff6175728 = {...}, result=...)
    at src/mongo/db/commands.cpp:388
#14 0x0000555558dbaa32 in mongo::BasicCommand::enhancedRun (this=0x55555abf4e90 <mongo::cmdCollectionStats>, 
    opCtx=0x7ffff1011a20, request=..., result=...) at src/mongo/db/commands.cpp:380
#15 0x0000555558db8bd3 in mongo::Command::runCommandDirectly (opCtx=0x7ffff1011a20, request=...)
    at src/mongo/db/commands.cpp:192
#16 0x00005555575cad50 in mongo::FTDCSimpleInternalCommandCollector::collect (this=0x7ffff65bff60, opCtx=0x7ffff1011a20, 
    builder=...) at src/mongo/db/ftdc/ftdc_server.cpp:256
#17 0x00005555575dd07c in mongo::FTDCCollectorCollection::collect (this=0x7ffff134a770, client=0x7ffff2748020)
    at src/mongo/db/ftdc/collector.cpp:86
#18 0x00005555575e3fa2 in mongo::FTDCController::doLoop (this=0x7ffff134a6a0) at src/mongo/db/ftdc/controller.cpp:244
#19 0x00005555575e65f9 in std::_Mem_fn_base<void (mongo::FTDCController::*)(), true>::operator()<, void>(mongo::FTDCController*) const (this=0x7ffff6175548, __object=0x7ffff134a6a0)
    at /opt/mongodbtoolchain/v2/bin/../lib/gcc/x86_64-mongodb-linux/5.4.0/../../../../include/c++/5.4.0/functional:600
#20 0x00005555575e6560 in std::_Bind<std::_Mem_fn<void (mongo::FTDCController::*)()> (mongo::FTDCController*)>::__call<void, , ---Type <return> to continue, or q <return> to quit---
0ul>(std::tuple<>&&, std::_Index_tuple<0ul>) (this=0x7ffff6175548, 
    __args=<unknown type in /home/dgottlieb/xgen/mongo/mongod, CU 0x252290d, DIE 0x25457ad>)
    at /opt/mongodbtoolchain/v2/bin/../lib/gcc/x86_64-mongodb-linux/5.4.0/../../../../include/c++/5.4.0/functional:1073
#21 0x00005555575e64cf in std::_Bind<std::_Mem_fn<void (mongo::FTDCController::*)()> (mongo::FTDCController*)>::operator()<, void>() (this=0x7ffff6175548)
    at /opt/mongodbtoolchain/v2/bin/../lib/gcc/x86_64-mongodb-linux/5.4.0/../../../../include/c++/5.4.0/functional:1131
#22 0x00005555575e6475 in std::_Bind_simple<std::_Bind<std::_Mem_fn<void (mongo::FTDCController::*)()> (mongo::FTDCController*)> ()>::_M_invoke<>(std::_Index_tuple<>) (this=0x7ffff6175548)
    at /opt/mongodbtoolchain/v2/bin/../lib/gcc/x86_64-mongodb-linux/5.4.0/../../../../include/c++/5.4.0/functional:1530
#23 0x00005555575e6445 in std::_Bind_simple<std::_Bind<std::_Mem_fn<void (mongo::FTDCController::*)()> (mongo::FTDCController*)> ()>::operator()() (this=0x7ffff6175548)
    at /opt/mongodbtoolchain/v2/bin/../lib/gcc/x86_64-mongodb-linux/5.4.0/../../../../include/c++/5.4.0/functional:1520
#24 0x00005555575e6239 in std::thread::_Impl<std::_Bind_simple<std::_Bind<std::_Mem_fn<void (mongo::FTDCController::*)()> (mongo::FTDCController*)> ()> >::_M_run() (this=0x7ffff6175530)
    at /opt/mongodbtoolchain/v2/bin/../lib/gcc/x86_64-mongodb-linux/5.4.0/../../../../include/c++/5.4.0/thread:115
#25 0x000055555967a720 in std::execute_native_thread_routine (__p=<optimized out>)
    at ../../../../../gcc-5.4.0/libstdc++-v3/src/c++11/thread.cc:84
#26 0x00007ffff72966ba in start_thread (arg=0x7fffeaf3e700) at pthread_create.c:333
#27 0x00007ffff6fcc3dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Comment by Bruce Lucas (Inactive) [ 23/Aug/17 ]

When I look at lock acquisition metrics for an idle mongod, I don't see any lock acquisitions happening at 1 per second. (There is some lock acquisition happening once every 4 seconds, but ftdc is once per second, and moreover if I increase that to 10 times per second the lock acquisitions every 4 seconds don't change, so that must be some other periodic task).

So it appears that serverStatus doesn't do the normal locking as you describe and I think would be hampered by this change. For example, on a secondary ftdc would be blocked by oplog application's global lock. Can we come up with another way to accomplish the objective without negatively impacting diagnosability?

Comment by Daniel Gottlieb (Inactive) [ 23/Aug/17 ]

My understanding is that the typical FTDC loop already does a collstats on the oplog. Unless that's special cased out of grabbing the typical locks, the proposed lock required to grab for this WiredTigerServerStatusSection is strictly easier to obtain. This proposed fix is to get the Global Lock in MODE_IS. Collstats on the oplog is probably a sequence of getting the Global Lock in IS, DBLock in IS and the collection lock in IS or S.

The only time the Global Lock in MODE_IS should be difficult to obtain is for rare, tricky operations (e.g: recover to a timestamp). It's possible to add a timeout to lock acquisition and omit the wiredtiger server status section if the timeout is exceeded. Although, I expect further investigation would show the collstats on the oplog not doing such a thing.

Comment by Bruce Lucas (Inactive) [ 23/Aug/17 ]

Will this negatively impact FTDC collection by blocking it, in fact during the times when it might be particularly important to collect for diagnostic purposes?

Generated at Thu Feb 08 04:25:01 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.