[SERVER-20877] Under cache-full conditions serverStatus can become stuck Created: 12/Oct/15  Updated: 07/Dec/16  Resolved: 18/Nov/15

Status: Closed
Project: Core Server
Component/s: Diagnostics, WiredTiger
Affects Version/s: 3.0.6
Fix Version/s: 3.2.0-rc4

Type: Bug Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: David Hows
Resolution: Done Votes: 1
Labels: WTplaybook
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-20961 Large amounts of create and drop coll... Closed
is related to SERVER-20876 Hang in scenario with sharded ttl col... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

This is a spinoff from SERVER-20876. In that ticket we saw that under cache-full conditions serverStatus became stuck with this stack trace:

Thread 17 (Thread 0x7f58d3df9700 (LWP 7721)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x0000000001366e61 in __wt_cond_wait ()
#2  0x000000000134cfa7 in __wt_cache_wait ()
#3  0x000000000139312d in ?? ()
#4  0x0000000000d7c6ad in mongo::WiredTigerRecoveryUnit::_txnOpen(mongo::OperationContext*) ()
#5  0x0000000000d7c7ef in mongo::WiredTigerRecoveryUnit::getSession(mongo::OperationContext*) ()
#6  0x0000000000d8020b in mongo::WiredTigerServerStatusSection::generateSection(mongo::OperationContext*, mongo::BSONElement const&) const ()
#7  0x0000000000974e49 in mongo::CmdServerStatus::run(mongo::OperationContext*, std::string const&, mongo::BSONObj&, int, std::string&, mongo::BSONObjBuilder&, bool) ()
#8  0x00000000009bdc64 in mongo::_execCommand(mongo::OperationContext*, mongo::Command*, std::string const&, mongo::BSONObj&, int, std::string&, mongo::BSONObjBuilder&, bool) ()
#9  0x00000000009bebed in mongo::Command::execCommand(mongo::OperationContext*, mongo::Command*, int, char const*, mongo::BSONObj&, mongo::BSONObjBuilder&, bool) ()
#10 0x00000000009bf8fb in mongo::_runCommands(mongo::OperationContext*, char const*, mongo::BSONObj&, mongo::_BufBuilder<mongo::TrivialAllocator>&, mongo::BSONObjBuilder&, bool, int) ()
#11 0x0000000000b9340a in mongo::runQuery(mongo::OperationContext*, mongo::Message&, mongo::QueryMessage&, mongo::NamespaceString const&, mongo::CurOp&, mongo::Message&) ()
#12 0x0000000000aa3480 in mongo::assembleResponse(mongo::OperationContext*, mongo::Message&, mongo::DbResponse&, mongo::HostAndPort const&) ()
#13 0x00000000007e99fd in mongo::MyMessageHandler::process(mongo::Message&, mongo::AbstractMessagingPort*, mongo::LastError*) ()
#14 0x0000000000f1badb in mongo::PortMessageServer::handleIncomingMsg(void*) ()
#15 0x00007f58d99c6182 in start_thread (arg=0x7f58d3df9700) at pthread_create.c:312
#16 0x00007f58d8ac747d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

This means that unfortunately no useful diagnostic data could be collected after the cache became full. Ideally serverStatus should succeed regardless of the state of the WT cache.



 Comments   
Comment by Githook User [ 19/Nov/15 ]

Author: David Hows (daveh86) <howsdav@gmail.com>

Message: SERVER-20877 - fix linting errors
Branch: master
https://github.com/mongodb/mongo/commit/e2eff50dba9396769caac67c78c7e7cc2968029b

Comment by Githook User [ 18/Nov/15 ]

Author: David Hows (daveh86) <howsdav@gmail.com>

Message: SERVER-20877 - Allow getSession to specify if a WiredTiger txn is needed
Branch: master
https://github.com/mongodb/mongo/commit/c5e3d38ac9f63191749844a906fe54777e775136
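
The stack trace in the description shows getSession calling _txnOpen, which then waits on the cache. The commit message above says the fix lets the caller state whether a WiredTiger transaction is needed. The following is a minimal sketch of that idea only, not the code from the commit: FakeCache, FakeSession, SketchRecoveryUnit, the needsTxn parameter and generateStatsSection are all hypothetical names, and the sketch throws where the real server would block so the difference is observable.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for the storage engine cache; not the WiredTiger API.
struct FakeCache {
    bool full = false;
};

// Hypothetical stand-in for a session handle.
struct FakeSession {
    int statsReads = 0;
};

// Hypothetical recovery unit. Method names echo the stack trace,
// but the bodies are illustrative assumptions.
class SketchRecoveryUnit {
public:
    explicit SketchRecoveryUnit(FakeCache& cache) : _cache(cache) {}

    // Returns the session. A transaction is opened only when the caller
    // asks for one (needsTxn is an assumed parameter name).
    FakeSession& getSession(bool needsTxn) {
        if (needsTxn && !_active) {
            _txnOpen();
        }
        return _session;
    }

    bool inTransaction() const { return _active; }

private:
    void _txnOpen() {
        // The real server would wait for cache space here. The sketch
        // throws instead of blocking so the test can observe it.
        if (_cache.full) {
            throw std::runtime_error("would wait for cache space");
        }
        _active = true;
    }

    FakeCache& _cache;
    FakeSession _session;
    bool _active = false;
};

// Hypothetical diagnostics path: it reads statistics only, so it asks
// for a session without a transaction and never reaches _txnOpen.
int generateStatsSection(SketchRecoveryUnit& ru) {
    FakeSession& session = ru.getSession(/*needsTxn=*/false);
    return ++session.statsReads;
}
```

With the cache marked full, getSession(true) fails in this sketch while generateStatsSection still returns, which is the behaviour the description asks for from serverStatus.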

Comment by David Hows [ 04/Nov/15 ]

On re-run, serverStatus was slow but did not hang completely:

2015-11-04T11:45:37.433+1100 I COMMAND  [conn2] serverStatus was very slow: { after basic: 0, after asserts: 0, after connections: 0, after extra_info: 0, after globalLock: 0, after locks: 0, after network: 0, after opcounters: 0, after opcountersRepl: 0, after storageEngine: 0, after tcmalloc: 0, after wiredTiger: 1070, at end: 1070 }
2015-11-04T11:45:37.433+1100 I COMMAND  [conn2] command admin.$cmd command: serverStatus { serverStatus: 1.0 } ntoreturn:1 ntoskip:0 keyUpdates:0 writeConflicts:0 numYields:0 reslen:17268 locks:{} protocol:op_command 1071ms

I have another patch for this in CR:
https://mongodbcr.appspot.com/30370001/

After a rebase and another re-run, serverStatus returns immediately.

Comment by Alexander Gorrod [ 03/Nov/15 ]

david.hows, could you re-run your reproducer with a debug build that includes the latest WiredTiger drop? I'm interested to see whether it times out after 5 minutes due to the new diagnostic code added here:

https://github.com/wiredtiger/wiredtiger/commit/66b44e1344219f445305248cb1ea630536af41d2
