[SERVER-26923] OOM Killer Terminates All 3 Nodes in a Shard Using WiredTiger Created: 06/Nov/16  Updated: 08/Jan/17  Resolved: 06/Jan/17

Status: Closed
Project: Core Server
Component/s: Text Search
Affects Version/s: 3.0.11, 3.2.10
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Shy Tamir Assignee: Kelsey Schubert
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Zip Archive Wired Tiger Crash Logs.zip    
Issue Links:
Duplicate
duplicates SERVER-26534 Text search uses excessive memory Backlog
Operating System: ALL
Steps To Reproduce:

1. Configure shard to use wiredTiger.
2. Wait an indefinite period of time while automated tests are running (6-12 hours).
3. Identify the oom-killer and resulting crash in the server's logs.

Participants:

 Description   

Hello,

We've recently upgraded our MongoDB deployment to 3.2.10. During this upgrade we intended to migrate the storage engine to the new wiredTiger but ran into stability issues. Seemingly randomly throughout the day all data bearing nodes would crash due to oom-killer termination.

There are many memory leak issues with wiredTiger in JIRA, most of them fixed. The one we hoped would be beneficial was fixed in 3.2.10 (WT-2796), but alas, we still ran into the same problem while running automated tests (not particularly stressful ones) on our cluster.

We have a multiple environment deployment and the problem presented itself in all lower environments, causing us to reverse the decision to migrate to wiredTiger until we find a way to stabilize it.

We are using the 1.11 C# driver in our application. The higher environments are both 5 sharded clusters with 3 data bearing nodes in each shard's replica set. The config servers are not configured as a replica set and will not be migrated to wiredTiger at this point. Our application is cloud hosted in AWS and the number of servers running mongos.exe locally scales up and down automatically according to the load on service queues.

Please let me know if I can provide any further information. This bug is marked as critical because it involves a severe memory leak, per the table of priorities.

Thanks,
Shy



 Comments   
Comment by Shy Tamir [ 08/Jan/17 ]

Understood, Tom.

Thanks for the time you've put into this. I'll make sure we have the ticket watched and voted and explain to our developers we must explore other alternatives.

Thanks again,
Shy

Comment by Kelsey Schubert [ 14/Dec/16 ]

Hi shy@tegrity.com,

I'm sorry for the difficulties that this issue is causing you. Currently, SERVER-26534 and SERVER-26833 both have the fix version "planned but not scheduled." This fix version indicates that we would like to complete this work, but it is not currently targeted for a specific release.

We have not heard many reports of users encountering this issue, and, unfortunately, the changes to the underlying logic is significant. If you have not already done so, I would recommend watching and voting on both these tickets, as we take this into account during the planning process.

Kind regards,
Thomas

Comment by Shy Tamir [ 14/Dec/16 ]

Hi Thomas,

I wanted to touch base after having had the opportunity to discuss this with our dev leads. We were surprised to learn that such a critical issue is affecting all MongoDB versions and hasn't been fixed yet. Making changes to our application behavior will require more than a little effort on our part, and does not guarantee a stable database even if we do change our application logic.

We are currently exploring ways to mitigate this phenomenon per your suggestions, but the only effective solutions so far seem to be in the realm of long-term solutions, postponing our shard reduction efforts for longer than we'd like. Is there any news on what MongoDB version might include a fix for this?

Thanks,
Shy

Comment by Shy Tamir [ 06/Dec/16 ]

Hi Thomas,

We've experienced OOM-killer on all nodes, but the secondaries go down first, and then the primary gets demoted before going down itself, so the symptoms described in the server issue conform with what we're seeing in the logs.

The average object size in the recordings collection is reported by the stats method as 46,555 in the lower affected environment out of 186k objects. And ~26,000 across the higher environments shards out of 3.3m objects.

Do those numbers sit well with the suspected diagnosis? Barring a change to our application's logic (which can take time and requires a lot of coordinated effort) would it be reasonable to expect a solution for this server issue in the near future?

Thanks,
Shy

Comment by Kelsey Schubert [ 30/Nov/16 ]

Hi shy@tegrity.com,

Thanks for the detailed information. It looks like you are encountering SERVER-26534. The increase in memory consumption occurs while the following queries complete.

2016-11-21T22:06:33.036-0500 I COMMAND  [conn203] command test.recordings command: find { find: "recordings", filter: { UniversityId: "b1527c26-46f5-4d33-8e6f-7a5d6cc668b1", CourseId: { $in: [ "4976b169-e883-4d82-b591-f9dc0936de75" ] }, $text: { $search: "Neurology Clinical Localization Joel Shenker, M.D., Ph.D. University o..." }, RecordingType: { $ne: "Test" }, IsDeleted: false }, projection: { score: { $meta: "textScore" }, $sortKey: { $meta: "sortKey" } }, sort: { score: { $meta: "textScore" } }, shardVersion: [ Timestamp 1000|100, ObjectId('5553043951c058815a10b78d') ] } planSummary: IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 } keysExamined:62682 docsExamined:42260 hasSortStage:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:713 nreturned:2 reslen:109413 locks:{ Global: { acquireCount: { r: 1430 } }, Database: { acquireCount: { r: 715 } }, Collection: { acquireCount: { r: 715 } } } protocol:op_command 4249ms
2016-11-21T22:06:33.044-0500 I COMMAND  [conn181] command test.recordings command: find { find: "recordings", filter: { UniversityId: "b1527c26-46f5-4d33-8e6f-7a5d6cc668b1", CourseId: { $in: [ "770c4a14-12f1-427a-b8fb-eda41148baf5" ] }, $text: { $search: "Neurology Clinical Localization Joel Shenker, M.D., Ph.D. University o..." }, RecordingType: { $ne: "Test" }, IsDeleted: false }, projection: { score: { $meta: "textScore" }, $sortKey: { $meta: "sortKey" } }, sort: { score: { $meta: "textScore" } }, shardVersion: [ Timestamp 1000|100, ObjectId('5553043951c058815a10b78d') ] } planSummary: IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 } keysExamined:62682 docsExamined:42260 hasSortStage:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:713 nreturned:0 reslen:218 locks:{ Global: { acquireCount: { r: 1430 } }, Database: { acquireCount: { r: 715 } }, Collection: { acquireCount: { r: 715 } } } protocol:op_command 4263ms
2016-11-21T22:06:33.080-0500 I COMMAND  [conn204] command test.recordings command: find { find: "recordings", filter: { UniversityId: "b1527c26-46f5-4d33-8e6f-7a5d6cc668b1", CourseId: { $in: [ "a0d9495d-138b-44dd-b221-ab847e23dcf8" ] }, $text: { $search: "Neurology Clinical Localization Joel Shenker, M.D., Ph.D. University o..." }, RecordingType: { $ne: "Test" }, IsDeleted: false }, projection: { score: { $meta: "textScore" }, $sortKey: { $meta: "sortKey" } }, sort: { score: { $meta: "textScore" } }, shardVersion: [ Timestamp 1000|100, ObjectId('5553043951c058815a10b78d') ] } planSummary: IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 } keysExamined:62682 docsExamined:42260 hasSortStage:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:712 nreturned:0 reslen:218 locks:{ Global: { acquireCount: { r: 1428 } }, Database: { acquireCount: { r: 714 } }, Collection: { acquireCount: { r: 714 } } } protocol:op_command 4282ms
2016-11-21T22:06:33.088-0500 I COMMAND  [conn208] command test.recordings command: find { find: "recordings", filter: { UniversityId: "b1527c26-46f5-4d33-8e6f-7a5d6cc668b1", CourseId: { $in: [ "a6ff7224-37cf-499b-9ece-f28ff24bcebf" ] }, $text: { $search: "Neurology Clinical Localization Joel Shenker, M.D., Ph.D. University o..." }, RecordingType: { $ne: "Test" }, IsDeleted: false }, projection: { score: { $meta: "textScore" }, $sortKey: { $meta: "sortKey" } }, sort: { score: { $meta: "textScore" } }, shardVersion: [ Timestamp 1000|100, ObjectId('5553043951c058815a10b78d') ] } planSummary: IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 } keysExamined:62682 docsExamined:42260 hasSortStage:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:710 nreturned:0 reslen:218 locks:{ Global: { acquireCount: { r: 1424 } }, Database: { acquireCount: { r: 712 } }, Collection: { acquireCount: { r: 712 } } } protocol:op_command 4289ms
2016-11-21T22:06:33.107-0500 I COMMAND  [conn205] command test.recordings command: find { find: "recordings", filter: { UniversityId: "b1527c26-46f5-4d33-8e6f-7a5d6cc668b1", CourseId: { $in: [ "81dc93d0-5602-4ba6-b01e-5a76b3b52fdf" ] }, $text: { $search: "Neurology Clinical Localization Joel Shenker, M.D., Ph.D. University o..." }, RecordingType: { $ne: "Test" }, IsDeleted: false }, projection: { score: { $meta: "textScore" }, $sortKey: { $meta: "sortKey" } }, sort: { score: { $meta: "textScore" } }, shardVersion: [ Timestamp 1000|100, ObjectId('5553043951c058815a10b78d') ] } planSummary: IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 } keysExamined:62682 docsExamined:42260 hasSortStage:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:712 nreturned:0 reslen:218 locks:{ Global: { acquireCount: { r: 1428 } }, Database: { acquireCount: { r: 714 } }, Collection: { acquireCount: { r: 714 } } } protocol:op_command 4326ms
2016-11-21T22:06:33.109-0500 I COMMAND  [conn180] command test.recordings command: find { find: "recordings", filter: { UniversityId: "b1527c26-46f5-4d33-8e6f-7a5d6cc668b1", CourseId: { $in: [ "3be4fdb0-879e-4b15-adbf-48b4434a4abe" ] }, $text: { $search: "Neurology Clinical Localization Joel Shenker, M.D., Ph.D. University o..." }, RecordingType: { $ne: "Test" }, IsDeleted: false }, projection: { score: { $meta: "textScore" }, $sortKey: { $meta: "sortKey" } }, sort: { score: { $meta: "textScore" } }, shardVersion: [ Timestamp 1000|100, ObjectId('5553043951c058815a10b78d') ] } planSummary: IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 } keysExamined:62682 docsExamined:42260 hasSortStage:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:712 nreturned:0 reslen:218 locks:{ Global: { acquireCount: { r: 1428 } }, Database: { acquireCount: { r: 714 } }, Collection: { acquireCount: { r: 714 } } } protocol:op_command 4328ms
2016-11-21T22:06:33.113-0500 I COMMAND  [conn182] command test.recordings command: find { find: "recordings", filter: { UniversityId: "b1527c26-46f5-4d33-8e6f-7a5d6cc668b1", CourseId: { $in: [ "fa98c340-e4ce-4462-9273-6c7f22e7b69f" ] }, $text: { $search: "Neurology Clinical Localization Joel Shenker, M.D., Ph.D. University o..." }, RecordingType: { $ne: "Test" }, IsDeleted: false }, projection: { score: { $meta: "textScore" }, $sortKey: { $meta: "sortKey" } }, sort: { score: { $meta: "textScore" } }, shardVersion: [ Timestamp 1000|100, ObjectId('5553043951c058815a10b78d') ] } planSummary: IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 } keysExamined:62682 docsExamined:42260 hasSortStage:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:713 nreturned:16 reslen:667099 locks:{ Global: { acquireCount: { r: 1430 } }, Database: { acquireCount: { r: 715 } }, Collection: { acquireCount: { r: 715 } } } protocol:op_command 4331ms
2016-11-21T22:06:33.120-0500 I COMMAND  [conn202] command test.recordings command: find { find: "recordings", filter: { UniversityId: "b1527c26-46f5-4d33-8e6f-7a5d6cc668b1", CourseId: { $in: [ "c99a2db8-9c81-4fcc-90ee-939b3e15b9f1" ] }, $text: { $search: "Neurology Clinical Localization Joel Shenker, M.D., Ph.D. University o..." }, RecordingType: { $ne: "Test" }, IsDeleted: false }, projection: { score: { $meta: "textScore" }, $sortKey: { $meta: "sortKey" } }, sort: { score: { $meta: "textScore" } }, shardVersion: [ Timestamp 1000|100, ObjectId('5553043951c058815a10b78d') ] } planSummary: IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 } keysExamined:62682 docsExamined:42260 hasSortStage:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:712 nreturned:0 reslen:218 locks:{ Global: { acquireCount: { r: 1428 } }, Database: { acquireCount: { r: 714 } }, Collection: { acquireCount: { r: 714 } } } protocol:op_command 4328ms
2016-11-21T22:06:33.121-0500 I COMMAND  [conn183] command test.recordings command: find { find: "recordings", filter: { UniversityId: "b1527c26-46f5-4d33-8e6f-7a5d6cc668b1", CourseId: { $in: [ "2914fbd2-db56-4905-8bee-f4544b0a81df" ] }, $text: { $search: "Neurology Clinical Localization Joel Shenker, M.D., Ph.D. University o..." }, RecordingType: { $ne: "Test" }, IsDeleted: false }, projection: { score: { $meta: "textScore" }, $sortKey: { $meta: "sortKey" } }, sort: { score: { $meta: "textScore" } }, shardVersion: [ Timestamp 1000|100, ObjectId('5553043951c058815a10b78d') ] } planSummary: IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 } keysExamined:62682 docsExamined:42260 hasSortStage:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:710 nreturned:0 reslen:218 locks:{ Global: { acquireCount: { r: 1424 } }, Database: { acquireCount: { r: 712 } }, Collection: { acquireCount: { r: 712 } } } protocol:op_command 4340ms
2016-11-21T22:06:33.122-0500 I COMMAND  [conn184] command test.recordings command: find { find: "recordings", filter: { UniversityId: "b1527c26-46f5-4d33-8e6f-7a5d6cc668b1", CourseId: { $in: [ "b5e6f12d-e018-402b-b049-278550df2988" ] }, $text: { $search: "Neurology Clinical Localization Joel Shenker, M.D., Ph.D. University o..." }, IsDeleted: false }, projection: { score: { $meta: "textScore" }, $sortKey: { $meta: "sortKey" } }, sort: { score: { $meta: "textScore" } }, shardVersion: [ Timestamp 1000|100, ObjectId('5553043951c058815a10b78d') ] } planSummary: IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 } keysExamined:62682 docsExamined:42260 hasSortStage:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:711 nreturned:0 reslen:218 locks:{ Global: { acquireCount: { r: 1426 } }, Database: { acquireCount: { r: 713 } }, Collection: { acquireCount: { r: 713 } } } protocol:op_command 4334ms
2016-11-21T22:06:33.122-0500 I COMMAND  [conn206] command test.recordings command: find { find: "recordings", filter: { UniversityId: "b1527c26-46f5-4d33-8e6f-7a5d6cc668b1", CourseId: { $in: [ "6baec395-5768-40c9-a9ad-6ac3885fb104" ] }, $text: { $search: "Neurology Clinical Localization Joel Shenker, M.D., Ph.D. University o..." }, RecordingType: { $ne: "Test" }, IsDeleted: false }, projection: { score: { $meta: "textScore" }, $sortKey: { $meta: "sortKey" } }, sort: { score: { $meta: "textScore" } }, shardVersion: [ Timestamp 1000|100, ObjectId('5553043951c058815a10b78d') ] } planSummary: IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 } keysExamined:62682 docsExamined:42260 hasSortStage:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:713 nreturned:0 reslen:218 locks:{ Global: { acquireCount: { r: 1430 } }, Database: { acquireCount: { r: 715 } }, Collection: { acquireCount: { r: 715 } } } protocol:op_command 4339ms
2016-11-21T22:06:33.125-0500 I COMMAND  [conn186] command test.recordings command: find { find: "recordings", filter: { UniversityId: "b1527c26-46f5-4d33-8e6f-7a5d6cc668b1", CourseId: { $in: [ "4ee3c2ba-f12f-4789-a2e8-0a27effca427" ] }, $text: { $search: "Neurology Clinical Localization Joel Shenker, M.D., Ph.D. University o..." }, RecordingType: { $ne: "Test" }, IsDeleted: false }, projection: { score: { $meta: "textScore" }, $sortKey: { $meta: "sortKey" } }, sort: { score: { $meta: "textScore" } }, shardVersion: [ Timestamp 1000|100, ObjectId('5553043951c058815a10b78d') ] } planSummary: IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 } keysExamined:62682 docsExamined:42260 hasSortStage:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:712 nreturned:0 reslen:218 locks:{ Global: { acquireCount: { r: 1428 } }, Database: { acquireCount: { r: 714 } }, Collection: { acquireCount: { r: 714 } } } protocol:op_command 4338ms
2016-11-21T22:06:33.129-0500 I COMMAND  [conn198] command test.recordings command: find { find: "recordings", filter: { UniversityId: "b1527c26-46f5-4d33-8e6f-7a5d6cc668b1", CourseId: { $in: [ "40b1a9d2-aa25-4bd6-92fb-c505b451282d" ] }, $text: { $search: "Neurology Clinical Localization Joel Shenker, M.D., Ph.D. University o..." }, IsDeleted: false }, projection: { score: { $meta: "textScore" }, $sortKey: { $meta: "sortKey" } }, sort: { score: { $meta: "textScore" } }, shardVersion: [ Timestamp 1000|100, ObjectId('5553043951c058815a10b78d') ] } planSummary: IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 } keysExamined:62682 docsExamined:42260 hasSortStage:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:710 nreturned:6 reslen:203996 locks:{ Global: { acquireCount: { r: 1424 } }, Database: { acquireCount: { r: 712 } }, Collection: { acquireCount: { r: 712 } } } protocol:op_command 4331ms
2016-11-21T22:06:33.131-0500 I COMMAND  [conn185] command test.recordings command: find { find: "recordings", filter: { UniversityId: "b1527c26-46f5-4d33-8e6f-7a5d6cc668b1", CourseId: { $in: [ "d0f2c2f6-c45f-4385-add9-a12253eb6b39" ] }, $text: { $search: "Neurology Clinical Localization Joel Shenker, M.D., Ph.D. University o..." }, RecordingType: { $ne: "Test" }, IsDeleted: false }, projection: { score: { $meta: "textScore" }, $sortKey: { $meta: "sortKey" } }, sort: { score: { $meta: "textScore" } }, shardVersion: [ Timestamp 1000|100, ObjectId('5553043951c058815a10b78d') ] } planSummary: IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 }, IXSCAN { UniversityId: 1.0, _fts: "text", _ftsx: 1 } keysExamined:62682 docsExamined:42260 hasSortStage:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:718 nreturned:0 reslen:218 locks:{ Global: { acquireCount: { r: 1440 } }, Database: { acquireCount: { r: 720 } }, Collection: { acquireCount: { r: 720 } } } protocol:op_command 4332ms

Therefore, I would recommend modifying your application to reduce the number of concurrent text searches, or modifying these queries to reduce the number of documents examined.

It looks like these queries are targeted at the secondaries, do you send similar queries to the primary, or have you only experienced OOM killer on secondaries?

To confirm that SERVER-26534 completely explains the memory consumption, would you please provide the average data size of the documents being examined by these queries?

Thank you,
Thomas

Comment by Shy Tamir [ 27/Nov/16 ]

Hi Thomas, and a pleasure to meet you,

I can share detailed information about the tests and configurations, what specifically is of interest. I'll gloss over the major details so you can help me out by pointing out what's left out or needs expanding:

Infrastructure

  1. Reproduced in two separate environments of a similar nature.
  2. Both environments are sharded clusters, one has only one shard, the other has 5 shards.
  3. The servers are all AWS EC2 instances (Memory optimized r3.2xlarge type), running dedicated mongod processes, using assigned ENIs.
  4. All communication is secured using self-signed certificates, between the servers with a key file and using SSL for clients.
  5. The config servers are all identical but not a replica set.
  6. The shard replica sets are all data bearing nodes, and only the 3 data bearing nodes (one in each us-east-1 AZ) exist per replica set.
  7. When reproducing the issue only the storage engine for and only for non config data bearing nodes (3 in QA, 15 in PQA).

Application

  1. Our application is a video capture and quick-edit software as a service used by higher education institutions to allow their faculty and students to record instructional materials such as lectures - aptly dubbed "Tegrity Lecture Capture".
  2. The nature of the application is such that DB operations are very read and write intensive, especially during peak times but sometimes during routine operations.
  3. During manual testing operations for both environments the issue is not reproducible.
  4. During a suite of load tests run on PQA the issue is not reproducible.
  5. A suite of automated tests performed on both environments, simulating most of the application features' functionalities, reproduces the issue consistently every time we run it, sometimes several times.
  6. Nothing but normal activity is recorded in application logs for the times of the crashes.
  7. Secondary read preference is set on certain queries (at the query level) that have to do with collecting aggregated data.

If it will help I can loop in our QA Manager who knows the nature of the tests better than anyone, or one of developer leads for more insight into how the application makes database queries.

Thanks so much,
Shy

Comment by Kelsey Schubert [ 25/Nov/16 ]

Hi shy@tegrity.com,

Thanks for uploading the files. I've spent some time examining them, but haven't identified the root cause yet. It appears that the the memory usage is consistent before spiking, causing the OOM. My understanding is that you are able to consistently reproduce this issue in your testing enviroment. Are you able to share the tests and configuration that reproduces this issue?

Thanks again,
Thomas

Comment by Shy Tamir [ 25/Nov/16 ]

Hello,

Has anyone had a chance to look into this yet.?Just curious...

Thanks,
Shy

Comment by Shy Tamir [ 22/Nov/16 ]

Hello,

Please find the logging information requested attached to this ticket.

Please note that we've reverted the change so reproducing the issue will take another round of changing it back again and running our tests.

We're eager to hear the findings of your analysis.

Thanks,
Shy

Comment by Shy Tamir [ 17/Nov/16 ]

Hi,

Please note that we have failed to reproduce this issue in an heretofore untested environment, and so we don't have the diagnostic data available yet.

However, we do plan on running the same suite of tests on the environments that were giving us problems during next week, at which time I will upload them here.

Thanks,
Shy

Comment by Shy Tamir [ 13/Nov/16 ]

Hi,

Apologies for the delays in getting you those logs.

We have since reverted the storage engine change and will need to recreate the issue, which will take some time since we need to coordinate it with multiple departments.

I will post the logs here by the end of the week with any luck and we can progress from there.

Thanks,
Shy

Comment by Ramon Fernandez Marina [ 13/Nov/16 ]

shy@tegrity.com, we haven't heard back from you. If this issue is still critical to you, can you please upload the diagnostic.data directory and the mongod log file for all impacted nodes as Dan requested above?

Thanks,
Ramón.

Comment by Daniel Pasette (Inactive) [ 07/Nov/16 ]

Hi Shy,
Please tar up and the files located in the diagnostic.data directory and the mongod run logs for all impacted nodes.

Dan

Generated at Thu Feb 08 04:13:37 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.