[SERVER-34029] mongod get stuck after upgrade Created: 21/Mar/18  Updated: 23/Apr/18  Resolved: 22/Mar/18

Status: Closed
Project: Core Server
Component/s: Querying, WiredTiger
Affects Version/s: 3.4.5, 3.4.14, 3.6.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Tarvi Pillessaar Assignee: Kelsey Schubert
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates WT-3972 Allow more than 64K cursors to be ope... Closed
Related
related to SERVER-34012 Planner's logic for taking union of i... Closed
Operating System: ALL
Steps To Reproduce:

I don't have exact steps, but I can test different versions; it usually happens within a few hours, and I can enable more logging, etc.

Participants:

 Description   

One of our MongoDB clusters has a very strange issue.
I tried to upgrade the cluster from 3.4.1 to 3.6.3. After the upgrade I noticed that one slave server gets stuck after a while: I can connect to that slave and some commands work, but not all of them, e.g. rs.printSlaveReplicationInfo() does not respond. That slave also starts lagging, yet there is no CPU usage or disk I/O; the server is idling.
Later even the primary server got stuck (I suspect it is the same issue, but I'm not sure).
After the primary got stuck, I decided to roll the cluster back to 3.4.13; the same thing happened.
I then rolled the whole cluster back to 3.4.1; that version works.

I started experimenting with one slave server and tried different versions; it turned out that 3.4.4 is OK, but 3.4.5 is not. Using git bisect I found that the following commit causes the issue for us: https://github.com/mongodb/mongo/commit/c3fda2d13c2f60eb3a68564403985cb1c7e8aa43

The slave that gets stuck is also used for read-only queries; another slave that does not receive queries works fine.

Every time the slave gets stuck, there are a few huge queries like the following in the log:

2018-03-21T11:16:02.888+0000 I COMMAND  [conn124] warning: log line attempted (644kB) over max size (10kB), printing beginning and end ... command Baz.foocollection command: getMore { getMore: 120605126995, collection: "foocollection" } originatingCommand: { find: "foocollection", filter: { $or: [ { fooObjectId: ObjectId('58931cae526107078bbf71f7'), stamp: 1494514165.463685 }, { fooObjectId: ObjectId('58b412d65e9206070a1dd5f6'), stamp: 1494422652.740239 }, { fooObjectId: ObjectId('58b412d65e9206070a1dd5f6'), stamp: 1494422716.137933 }, { fooObjectId: ObjectId('58b412d65e9206070a1dd5f6'), stamp: 1494422723.481655 }, { fooObjectId: ObjectId('58b412d65e9206070a1dd5f6'), stamp: 1495113378.631109 }, { fooObjectId: ObjectId('58b412d65e9206070a1dd5f6'), stamp: 1494422718.830563 }, { fooObjectId: ObjectId('58b412d65e9206070a1dd5f6'), stamp: 1495113381.569376 }, { fooObjectId: ObjectId('58b412d65e9206070a1dd5f6'), stamp: 1495113388.42224 }, { fooObjectId: ObjectId('58b412d65e9206070a1dd5f6'), stamp: 1495113426.608498 }, { fooObjectId: ObjectId('58931cae526107078bbf71f7'), stamp: 1493802053.254245 }, { fooObjectId: ObjectId('58931cae526107078bbf71f7'), stamp: 1493802049.093342 }, { fooObjectId: ObjectId('58931cae526107078bbf71f7'), stamp: 1493802054.233313 }, { fooObjectId: ObjectId('58931cae526107078bbf71f7'), stamp: 1494329972.558202 }, { fooObjectId: ObjectId('58931cae526107078bbf71f7'), stamp: 1493802055.457206 }, { stamp: 1487772816.953898, fooObjectId: ObjectId('5805bf6e74712800bb012f05') }, { fooObjectId: ObjectId('5805bf6e74712800bb012f05'), stamp: 1487772831.640239 }, { fooObjectId: ObjectId('589c35934e89350783564909'), stamp: 1496404387.08405 }, { fooObjectId: ObjectId('589c35934e89350783564909'), stamp: 1496404383.584682 }, { fooObjectId: ObjectId('589c35934e89350783564909'), stamp: 1496404357.338354 }, { fooObjectId: ObjectId('58931cae526107078bbf71f7'), stamp: 1494513722.518675 }, { fooObjectId: ObjectId('58b412645e9206070a1dd06a'), stamp: 1494423044.121071 }, { 
fooObjectId: ObjectId('58b412645e9206070a1dd06a'), stamp: 1494423041.918467 }, { fooObjectId: ObjectId('58b412645e9206070a1dd06a'), stamp: 1494423039.961413 }, { fooObjectId: ObjectId('58b412645e9206070a1dd06a'), stamp: 1494423034.819859 }, { stamp: 1501573001.144575, fooObjectId: ObjectId('5812fc35b2a83a00b8c9061f') }, { fooObjectId: ObjectId('5812fc35b2a83a00b8c9061f'), stamp: 1501573004.145434 }, { fooObjectId: ObjectId('5812fc35b2a83a00b8c9061f'), stamp: 1501572997.895948 }, { fooObjectId: ObjectId('5812fc35b2a83a00b8c9061f'), stamp: 1501573008.895141 }, { fooObjectId: ObjectId('5812fc35b2a83a00b8c9061f'), stamp: 1501573005.394891 }, { fooObjectId: ObjectId('5812fc35b2a83a00b8c9061f'), stamp: 1501090931.443829 }, { fooObjectId: ObjectId('58931cae526107078bbf71f7'), stamp: 1497892396.056092 }, { stamp: 1493890957.166266, fooObjectId: ObjectId('58b412d65e9206070a1dd5f6') }, { fooObjectId: ObjectId('58b412d65e9206070a1dd5f6'), stamp: 1493890819.475622 }, { fooObjectId: ObjectId('58b412d65e9206070a1dd5f6'), stamp: 1493890822.902208 }, { fooObjectId: ObjectId('57beea4dd93194b60029ee9a'), stamp: 1490703301.703321 }, { stamp: 1487949706.666049, fooObjectId: ObjectId('5805c00f74712800bb01320a') }, { fooObjectId: ObjectId('58b412d65e9206070a1dd5f6'), stamp: 1497550576.221023 }, { fooObjectId: ObjectId('58b412d65e9206070a1dd5f6'), stamp: 1497550576.95499 }, { fooObjectId: ObjectId('58b412645e9206070a1dd06a'), stamp: 1511271224.294596 }, { fooObjectId: ObjectId('58b412645e9206070a1dd06a'), stamp: 15112712 .......... 
on.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, 
IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 }, IXSCAN { stamp: 1, position.x: 1, position.y: 1 } cursorid:120605126995 keysExamined:34140 docsExamined:34140 cursorExhausted:1 numYields:307 nreturned:34140 reslen:11101775 locks:{ Global: { acquireCount: { r: 616 }, acquireWaitCount: { r: 146 }, timeAcquiringMicros: { r: 469372 } }, Database: { acquireCount: { r: 308 } }, Collection: { acquireCount: { r: 308 } } } protocol:op_query 1110ms

It doesn't happen every time; most of the time these queries succeed.

I can provide logs, but I'm not sure how to do that privately, as our logs can contain sensitive information.



 Comments   
Comment by Kelsey Schubert [ 22/Mar/18 ]

Hi tarvip,

Thank you for the stacks. I can confirm that the issue you're encountering is being tracked in WT-3972, and I expect that the fix will be backported into the next release of MongoDB 3.4. As a workaround, I would suggest looking into whether you can modify your application to use $in queries rather than $or queries, which should also have performance benefits, as described in SERVER-34012.
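Such a rewrite could look like the sketch below: clauses in the $or that share a fooObjectId are collapsed into one branch with a $in over the stamps. The field names are taken from the log above; groupOrIntoIn is a hypothetical helper (not a MongoDB API), and ids are plain strings here, whereas a real filter would use ObjectId values.

```javascript
// Hypothetical sketch: collapse an $or of { fooObjectId, stamp } equality
// clauses into one { fooObjectId, stamp: { $in: [...] } } branch per id.
// Assumes string ids so Map keys compare by value.
function groupOrIntoIn(orClauses) {
  const byId = new Map();
  for (const { fooObjectId, stamp } of orClauses) {
    if (!byId.has(fooObjectId)) byId.set(fooObjectId, []);
    byId.get(fooObjectId).push(stamp);
  }
  return {
    $or: [...byId.entries()].map(([fooObjectId, stamps]) => ({
      fooObjectId,
      stamp: { $in: stamps },
    })),
  };
}

// Three clauses over two ids collapse into two branches.
const filter = groupOrIntoIn([
  { fooObjectId: "58931cae526107078bbf71f7", stamp: 1494514165.463685 },
  { fooObjectId: "58b412d65e9206070a1dd5f6", stamp: 1494422652.740239 },
  { fooObjectId: "58b412d65e9206070a1dd5f6", stamp: 1494422716.137933 },
]);
console.log(JSON.stringify(filter));
```

With the original query's handful of distinct ids, this shrinks hundreds of $or branches (and the matching IXSCAN stages seen in the log) down to one index scan per id.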

Kind regards,
Kelsey

Comment by Tarvi Pillessaar [ 22/Mar/18 ]

Hi Kelsey,

Files are uploaded. From the diagnostic data you can probably see that it actually happened twice (the first occurrence was a few hours earlier; that first time I forgot to run gdb).
The gdb output was the following:

gdb -p $(pidof mongod) -batch -ex 'thread apply all bt' > gdb_`date +"%Y%m%d_%H%M%S"`.txt
185	../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S: No such file or directory.

The slave started lagging at ~11:39:05 UTC, so I guess something happened around that time.

Also, if needed, I can increase log verbosity for some components; at the moment verbosity is set to the default.

Thank you,
Tarvi Pillessaar

Comment by Kelsey Schubert [ 21/Mar/18 ]

Hi tarvip,

Would you please upload the complete logs and diagnostic.data from an affected node? I've created a secure portal for you to use.

Additionally, if performance isn't critical, would you please collect a gdb backtrace while the issue is occurring, using the following command?

gdb -p $(pidof mongod) -batch -ex 'thread apply all bt' >gdb.txt

Please be sure to specify the time the gdb.txt file was generated when you upload so we can correlate it to the mongod logs and diagnostic.data.

Thank you,
Kelsey

Comment by Tarvi Pillessaar [ 21/Mar/18 ]

Slave is running on:
OS: Ubuntu Xenial, 4.4.0-116-generic #140-Ubuntu SMP Mon Feb 12 21:23:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
HW: AWS VM, c4.4xlarge
Data volume: 1300GB gp2

Today I also tested 3.4.14; it still does not work.

Generated at Thu Feb 08 04:35:19 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.