[SERVER-26987] Sharded find operations stall for several seconds during chunk migration Created: 10/Nov/16  Updated: 05/Apr/17  Resolved: 27/Dec/16

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.2.11
Fix Version/s: 3.2.12

Type: Bug Priority: Major - P3
Reporter: Yoni Douek Assignee: Nathan Myers
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding 2017-01-02
Participants:

 Description   

Note: this is different than SERVER-22611.

WT nodes, running 3.2.10. During chunk migration - we notice severe drops in mongos query ability (from 500 req/sec to 0) - rendering the cluster unusable for several seconds . This happens when the chunk is being read from the source node, and exactly every 16mb of reads (we see a surge of disk reads at this time). Using a tester that tries to query both mongos and the source mongod at times in which the freeze happens, we noticed that only mongos is frozen - while the source mongod is responsive . To eliminate any hardware bottleneck - we used very very powerful machines, in mongos, config servers and mongods - in terms of CPU and disk IOPS.

This is how a drop looks like (every line represents a second):

444 Requests, 0 Failed (0 Timeouts)
574 Requests, 0 Failed (0 Timeouts)
516 Requests, 0 Failed (0 Timeouts)
512 Requests, 0 Failed (0 Timeouts)
463 Requests, 0 Failed (0 Timeouts)
212 Requests, 0 Failed (0 Timeouts)
28 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
203 Requests, 0 Failed (0 Timeouts)
66 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
221 Requests, 0 Failed (0 Timeouts)
69 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
320 Requests, 0 Failed (0 Timeouts)
7 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
16 Requests, 0 Failed (0 Timeouts)
98 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
0 Requests, 0 Failed (0 Timeouts)
313 Requests, 0 Failed (0 Timeouts)
783 Requests, 0 Failed (0 Timeouts)
355 Requests, 0 Failed (0 Timeouts)
682 Requests, 0 Failed (0 Timeouts)

Only the balanced collection is affected during the freeze - and we see nothing special in the logs of either mongos and the source mongod when this happens. Secondary throttling does not help.

This renders our cluster unusable during balancing - so currently our balancer is stopped. Any advice on the source of this issue and how we can address it will be extremely helpful. Thanks



 Comments   
Comment by Githook User [ 27/Dec/16 ]

Author:

{u'name': u'Nathan Myers', u'email': u'ncm@asperasoft.com'}

Message: SERVER-26987 Release migration manager lock during find ops in clone
Branch: v3.2
https://github.com/mongodb/mongo/commit/1c43b90c0483bc7574fecd0eaeee610480cc9217

Comment by Yoni Douek [ 12/Dec/16 ]

Thanks for the info. Can you fix this on the 3.2 branch? We're stuck with an unbalanced cluster and don't want to use 3.4 anytime soon until it proves to be stable.

Also - what is the ETA?

Comment by Kaloian Manassiev [ 12/Dec/16 ]

Hi yonido,

Thank you very much for the latest reproduction and the set of stacks. From there it becomes clear what the problem is. In short - during migration we are performing potentially blocking I/O while holding a mutex, which is used to protect the in-memory state for a migration and the find operations are acquiring this mutex every time. This problem is specific to 3.2 (and now 3.4 as well).

There are two possible ways to fix this - (1) change the find command to not do this check every time or (2) fix the migration cloner to not do blocking I/O while holding this mutex.

More detail:

All the user finds are blocked in the waitTillNotInCriticalSection call, which deceptively indicates that migration might be in a critical section. However, in fact they are only stuck waiting for the mutex, which protects the migration state, but we are not in critical section yet:

Thread 4 (Thread 0x7f76355de700 (LWP 12647)):
#0  0x00007f766ab7ef4d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f766ab7ad02 in _L_lock_791 () from /lib64/libpthread.so.0
#2  0x00007f766ab7ac08 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000000b3a6e3 in std::mutex::lock() ()
#4  0x0000000000f63d09 in mongo::MigrationSourceManager::waitTillNotInCriticalSection(int) ()
#5  0x0000000000af3852 in mongo::FindCmd::run(mongo::OperationContext*, std::string const&, mongo::BSONObj&, int, std::string&, mongo::BSONObjBuilder&) ()
#6  0x0000000000b77d83 in mongo::Command::run(mongo::OperationContext*, mongo::rpc::RequestInterface const&, mongo::rpc::ReplyBuilderInterface*) ()
#7  0x0000000000b78c09 in mongo::Command::execCommand(mongo::OperationContext*, mongo::Command*, mongo::rpc::RequestInterface const&, mongo::rpc::ReplyBuilderInterface*) ()
#8  0x0000000000acbfd0 in mongo::runCommands(mongo::OperationContext*, mongo::rpc::RequestInterface const&, mongo::rpc::ReplyBuilderInterface*) ()
#9  0x0000000000c92865 in mongo::assembleResponse(mongo::OperationContext*, mongo::Message&, mongo::DbResponse&, mongo::HostAndPort const&) ()
#10 0x000000000095284c in mongo::MyMessageHandler::process(mongo::Message&, mongo::AbstractMessagingPort*) ()
#11 0x00000000012c3c45 in mongo::PortMessageServer::handleIncomingMsg(void*) ()

The reason we are stuck waiting for this mutex is that the migration cloner process resolves record is to actual records in a loop, while holding this mutex:

Thread 157 (Thread 0x7f762c64f700 (LWP 11668)):
#0  0x00007f766ab7fa93 in pread64 () from /lib64/libpthread.so.0
#1  0x0000000001a684c6 in ?? ()
#2  0x00000000019c33d8 in __wt_block_read_off ()
#3  0x00000000019c3600 in __wt_bm_read ()
#4  0x00000000019df5f8 in __wt_bt_read ()
#5  0x00000000019e422d in __wt_page_in_func ()
#6  0x0000000001a06bf7 in __wt_row_search ()
#7  0x00000000019d1749 in __wt_btcur_search ()
#8  0x0000000001a23dbf in ?? ()
#9  0x000000000106eb49 in mongo::WiredTigerRecordStore::findRecord(mongo::OperationContext*, mongo::RecordId const&, mongo::RecordData*) const ()
#10 0x0000000000a8bf98 in mongo::Collection::findDoc(mongo::OperationContext*, mongo::RecordId const&, mongo::Snapshotted<mongo::BSONObj>*) const ()
#11 0x0000000000f66900 in mongo::MigrationSourceManager::clone(mongo::OperationContext*, mongo::MigrationSessionId const&, std::string&, mongo::BSONObjBuilder&) ()
#12 0x00000000011f1539 in ?? ()
#13 0x0000000000b77d83 in mongo::Command::run(mongo::OperationContext*, mongo::rpc::RequestInterface const&, mongo::rpc::ReplyBuilderInterface*) ()
#14 0x0000000000b78c09 in mongo::Command::execCommand(mongo::OperationContext*, mongo::Command*, mongo::rpc::RequestInterface const&, mongo::rpc::ReplyBuilderInterface*) ()
#15 0x0000000000acbfd0 in mongo::runCommands(mongo::OperationContext*, mongo::rpc::RequestInterface const&, mongo::rpc::ReplyBuilderInterface*) ()
#16 0x0000000000c92865 in mongo::assembleResponse(mongo::OperationContext*, mongo::Message&, mongo::DbResponse&, mongo::HostAndPort const&) ()
#17 0x000000000095284c in mongo::MyMessageHandler::process(mongo::Message&, mongo::AbstractMessagingPort*) ()
#18 0x00000000012c3c45 in mongo::PortMessageServer::handleIncomingMsg(void*) ()

This problem only happens for find operations, because they do the critical section check early in the sequence, while updates do it only if an actual version mismatch was discovered. This is specific to the implementation of the find command in 3.2.

Comment by Yoni Douek [ 12/Dec/16 ]

Hey,

3.2.11 does not fix this. Attached dumps of both mongos, and the reading mongod, at times of freeze. Reproduced twice on each one (5-10sec freeze) just in case, filenames = mongod_1.txt mongod_2.txt mongos_1.txt mongo2_2.txt

Please approve that you've received the files.
If you can please expedite checking this, as we've spin up a special environment with load tests in order to reproduce it + we cannot balance data in our cluster until this is fixed.

Thanks!

Comment by Kaloian Manassiev [ 06/Dec/16 ]

Hi yoni@appsee.com,

Apologies for the delayed response.

From looking at the diagnostic data on both shards and correlating it with the times when the chunk clone progress stalls and from when the slow queries appear, I was able to see that the number of WT pinned pages steadily increases during that period. This is an indication of a long-running transaction and is consistent with the symptoms in SERVER-26898.

This bug was introduced in 3.2.0 and was fixed in 3.2.11. Would it be possible that you upgrade your shards to 3.2.11 and see if the problem still reproduces?

Just in case it still occurs, would it be possible to run the same gdb command, but on one of the affected shards and include the output, because there is a possibility that we might be missing some metrics to help us better pintpoint this kind of problem. The command is:

sudo gdb -ex "thread apply all bt" --batch -p $(pidof mongos) > stacks.txt

Thank you in advance.

Best regards,
-Kal.

Comment by Yoni Douek [ 02/Dec/16 ]

Hey,

Any news?
Thanks

Comment by Kelsey Schubert [ 22/Nov/16 ]

Hi yonido,

Thanks for following up, we have the required files and are investigating. We understand that this is significantly impacting your performance and will update you as we continue to diagnosis the root cause of this behavior. Please be aware that this Thursday and Friday are holidays in the United States, which will delay our investigation.

Kind regards,
Thomas

Comment by Yoni Douek [ 21/Nov/16 ]

Hey,

Wanted to make sure you got what you need. This is pretty urgent.

Thanks

Comment by Yoni Douek [ 16/Nov/16 ]

Hey,

I've uploaded them to the "upload portal" that Thomas added on this thread.

I don't think it's the same bug, as SERVER-20854 talks about blocking all collections, and here - only the balanced collection is affected, and only when accessed via mongos - not via mongod.

Comment by Kaloian Manassiev [ 15/Nov/16 ]

Hi yoni@appsee.com,

You seem to have forgotten to attach the files which you listed in your comment - can you please attach them?

From what you are describing, this seems like a manifestation of SERVER-20854, but I would like to look at the stacks and the logs to confirm.

-Kal.

Comment by Yoni Douek [ 15/Nov/16 ]

Hey,

1. I want to stress out that its not that the entire mongos freezes, only queries via mongos that target the collection that is being balanced. In the attached repro this is "Clients". same queries that are done on the source mongod at times of freeze are not affected.

2. I cannot tell if it happened before 3.2.10 since our setup was different at that time. We have 5 shards, 2 mongos servers.

3. We can reproduce it very easily. In the attached files there are 2 reproductions:

@ Timestamp 2016-11-15 13:14:42 - when the chunk is being read at from its beginning. The source node at this time was db2m1. Note that after a few seconds you can see a lot of slow operations - this is due to mongos being stuck and then unstuck.
@ Timestamp 2016-11-15 13:16:30 - when a chunk is being read and arrived at the 16MB cloned bytes. The source node at this time was db3m2. Please note how much time the source mongod spends on reading the 16233551 cloned bytes (unlike other parts). This is exactly when mongos freezes.

This is how our request log looks like:

2016-11-15 13:14:26,507 MainThread@8568 root INFO   592 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:14:27,508 MainThread@8568 root INFO   682 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:14:28,509 MainThread@8568 root INFO   560 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:14:29,510 MainThread@8568 root INFO   262 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:14:30,511 MainThread@8568 root INFO   262 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:14:31,512 MainThread@8568 root INFO   493 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:14:32,513 MainThread@8568 root INFO   334 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:14:33,514 MainThread@8568 root INFO   272 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:14:34,515 MainThread@8568 root INFO   290 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:14:35,516 MainThread@8568 root INFO   21 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:14:36,518 MainThread@8568 root INFO   0 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:14:37,519 MainThread@8568 root INFO   0 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:14:38,520 MainThread@8568 root INFO   0 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:14:39,522 MainThread@8568 root INFO   0 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:14:40,523 MainThread@8568 root INFO   0 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:14:41,524 MainThread@8568 root INFO   0 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:14:42,525 MainThread@8568 root INFO   0 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:14:43,527 MainThread@8568 root INFO   0 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:14:44,528 MainThread@8568 root INFO   0 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:14:45,529 MainThread@8568 root INFO   0 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:14:46,530 MainThread@8568 root INFO   0 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:14:47,531 MainThread@8568 root INFO   0 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:14:48,533 MainThread@8568 root INFO   0 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:14:49,534 MainThread@8568 root INFO   0 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:14:50,535 MainThread@8568 root INFO   0 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:14:51,536 MainThread@8568 root INFO   0 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:14:52,537 MainThread@8568 root INFO   0 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:14:53,538 MainThread@8568 root INFO   0 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:14:54,540 MainThread@8568 root INFO   167 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:14:55,541 MainThread@8568 root INFO   229 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:14:56,542 MainThread@8568 root INFO   10 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:14:57,543 MainThread@8568 root INFO   103 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:14:58,545 MainThread@8568 root INFO   115 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:14:59,545 MainThread@8568 root INFO   385 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:00,546 MainThread@8568 root INFO   352 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:01,547 MainThread@8568 root INFO   389 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:02,549 MainThread@8568 root INFO   416 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:03,550 MainThread@8568 root INFO   504 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:04,551 MainThread@8568 root INFO   306 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:05,551 MainThread@8568 root INFO   227 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:06,552 MainThread@8568 root INFO   275 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:07,554 MainThread@8568 root INFO   413 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:08,555 MainThread@8568 root INFO   235 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:09,555 MainThread@8568 root INFO   319 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:10,556 MainThread@8568 root INFO   462 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:11,558 MainThread@8568 root INFO   297 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:12,559 MainThread@8568 root INFO   410 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:13,560 MainThread@8568 root INFO   304 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:14,561 MainThread@8568 root INFO   158 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:15,561 MainThread@8568 root INFO   270 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:16,562 MainThread@8568 root INFO   282 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:17,564 MainThread@8568 root INFO   278 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:18,565 MainThread@8568 root INFO   331 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:19,566 MainThread@8568 root INFO   252 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:20,567 MainThread@8568 root INFO   425 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:21,569 MainThread@8568 root INFO   430 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:22,570 MainThread@8568 root INFO   413 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:23,570 MainThread@8568 root INFO   94 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:24,571 MainThread@8568 root INFO   319 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:25,573 MainThread@8568 root INFO   200 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:26,573 MainThread@8568 root INFO   101 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:27,574 MainThread@8568 root INFO   131 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:28,575 MainThread@8568 root INFO   48 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:29,576 MainThread@8568 root INFO   33 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:30,578 MainThread@8568 root INFO   176 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:31,578 MainThread@8568 root INFO   160 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:32,579 MainThread@8568 root INFO   42 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:33,579 MainThread@8568 root INFO   60 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:34,580 MainThread@8568 root INFO   180 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:35,581 MainThread@8568 root INFO   2 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:36,583 MainThread@8568 root INFO   62 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:37,583 MainThread@8568 root INFO   65 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:38,584 MainThread@8568 root INFO   57 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:39,586 MainThread@8568 root INFO   54 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:40,586 MainThread@8568 root INFO   111 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:41,587 MainThread@8568 root INFO   240 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:42,589 MainThread@8568 root INFO   458 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:43,590 MainThread@8568 root INFO   563 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:44,591 MainThread@8568 root INFO   507 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:45,593 MainThread@8568 root INFO   454 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:46,594 MainThread@8568 root INFO   226 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:47,595 MainThread@8568 root INFO   351 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:48,596 MainThread@8568 root INFO   503 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:49,598 MainThread@8568 root INFO   330 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:50,599 MainThread@8568 root INFO   424 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:51,599 MainThread@8568 root INFO   391 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:52,600 MainThread@8568 root INFO   261 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:53,602 MainThread@8568 root INFO   252 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:54,603 MainThread@8568 root INFO   248 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:55,603 MainThread@8568 root INFO   373 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:56,604 MainThread@8568 root INFO   418 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:57,605 MainThread@8568 root INFO   442 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:58,607 MainThread@8568 root INFO   392 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:15:59,607 MainThread@8568 root INFO   386 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:00,608 MainThread@8568 root INFO   367 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:01,610 MainThread@8568 root INFO   287 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:02,610 MainThread@8568 root INFO   301 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:03,611 MainThread@8568 root INFO   246 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:04,612 MainThread@8568 root INFO   268 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:05,614 MainThread@8568 root INFO   261 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:06,615 MainThread@8568 root INFO   610 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:07,615 MainThread@8568 root INFO   611 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:08,616 MainThread@8568 root INFO   265 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:09,618 MainThread@8568 root INFO   353 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:10,619 MainThread@8568 root INFO   416 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:11,620 MainThread@8568 root INFO   231 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:12,621 MainThread@8568 root INFO   279 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:13,622 MainThread@8568 root INFO   197 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:14,623 MainThread@8568 root INFO   202 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:15,625 MainThread@8568 root INFO   215 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:16,626 MainThread@8568 root INFO   299 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:17,627 MainThread@8568 root INFO   490 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:18,628 MainThread@8568 root INFO   323 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:19,630 MainThread@8568 root INFO   220 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:20,631 MainThread@8568 root INFO   329 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:21,631 MainThread@8568 root INFO   278 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:22,632 MainThread@8568 root INFO   216 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:23,634 MainThread@8568 root INFO   92 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:24,635 MainThread@8568 root INFO   87 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:25,636 MainThread@8568 root INFO   120 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:26,638 MainThread@8568 root INFO   47 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:27,639 MainThread@8568 root INFO   43 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:28,640 MainThread@8568 root INFO   0 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:29,641 MainThread@8568 root INFO   0 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:30,643 MainThread@8568 root INFO   18 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:31,643 MainThread@8568 root INFO   38 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:32,644 MainThread@8568 root INFO   111 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:33,646 MainThread@8568 root INFO   153 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:34,647 MainThread@8568 root INFO   16 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:35,648 MainThread@8568 root INFO   0 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:36,649 MainThread@8568 root INFO   87 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:37,651 MainThread@8568 root INFO   164 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:38,652 MainThread@8568 root INFO   179 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:39,653 MainThread@8568 root INFO   155 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:40,654 MainThread@8568 root INFO   207 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:41,655 MainThread@8568 root INFO   414 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:42,656 MainThread@8568 root INFO   424 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:43,658 MainThread@8568 root INFO   432 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:44,659 MainThread@8568 root INFO   313 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:45,660 MainThread@8568 root INFO   267 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:46,661 MainThread@8568 root INFO   339 Requests, 0 Failed (0 Timeouts)
2016-11-15 13:16:47,663 MainThread@8568 root INFO   305 Requests, 0 Failed (0 Timeouts)

Attachments:

  • mongos_log
  • db2m1_log (first drop)
  • db3m2_log (2nd drop)
  • gdb stacks for both drops
  • diag data for both source mongods
Comment by Kaloian Manassiev [ 10/Nov/16 ]

Hi Yoni,

Just to make clear - is this something which began after you upgraded to 3.2.10 or it started happening out of nowhere? What is the number of shards that you have in your cluster?

When you are attaching the logs, would it be possible to also include the time stamp of the events where you observe the drop in throughput so we can correlate with what is happening in the logs?

Also, if you have the repro setup, would it be possible to execute the following command against the mongos, which gets frozen on order for us to see at what point in the execution does this happen?

sudo gdb -ex "thread apply all bt" --batch -p $(pidof mongos) > stacks.txt

Thanks in advance,

-Kal.

Comment by Kelsey Schubert [ 10/Nov/16 ]

Hi yonido,

I'm sorry you're running into this issue. So we can continue to investigate, would you please provide the complete logs for the mongos and mongod?

Additionally, would you please provide the diagnostic.data of the mongod?

I've created a secure upload portal for you to use. Files uploaded to this portal are visible only to MongoDB employees investigating this issue and are routinely deleted after some time.

Thank you for your help,
Thomas

Generated at Thu Feb 08 04:13:48 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.