[SERVER-41326] Initial replica member sync failing due to OOM Created: 27/May/19 Updated: 30/Jul/19 Resolved: 16/Jul/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 3.4.16 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Stefan Adamov | Assignee: | Eric Sedor |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Operating System: | ALL |
| Steps To Reproduce: |
|
| Participants: |
| Description |
|
Hey guys, we needed to do a re-sync of some of our smaller (CPU/memory) replica set members that are used for DB backups. During the process we noticed that on a few occasions the mongod process dies with the following logged error:
Similar issues
I noticed this issue: https://jira.mongodb.org/browse/SERVER-28241, but it is related to MongoDB 3.4.1 and we're running 3.4.16.
Setup information
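Since the crash output itself was attached rather than pasted inline, here is a minimal monitoring sketch (assuming PyMongo and a placeholder hostname, neither taken from the ticket) that one could run against the re-syncing member to watch its resident memory and replication state while the initial sync runs:

```python
# Minimal sketch, not from the ticket: poll the re-syncing member's resident
# memory and replication state during initial sync. The host is a placeholder.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://backup-member.example:27017")  # hypothetical backup member

while True:
    status = client.admin.command("serverStatus")
    rs_status = client.admin.command("replSetGetStatus")
    resident_mb = status["mem"]["resident"]   # resident set size, in MB
    my_state = rs_status["myState"]           # 5 (STARTUP2) while initial sync is running
    print(f"state={my_state} resident={resident_mb} MB")
    time.sleep(30)
```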
|
| Comments |
| Comment by Stefan Adamov [ 30/Jul/19 ] |
|
Hello Joseph, sorry for the late reply. We experimented with a lower value for cacheSizeGB (previously 1.5 GB, now 1.0 GB) and we haven't seen any crashes since then. |
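For reference, the setting being lowered here is storage.wiredTiger.engineConfig.cacheSizeGB in the member's mongod.conf. A minimal sketch (assuming PyMongo and a placeholder host) to confirm the new limit took effect after the restart:

```python
# Sketch, placeholder host: read back the configured WiredTiger cache limit
# from serverStatus after restarting the member with the lowered cacheSizeGB.
from pymongo import MongoClient

client = MongoClient("mongodb://backup-member.example:27017")  # hypothetical host
cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]

configured_gb = cache["maximum bytes configured"] / 1024 ** 3
in_use_gb = cache["bytes currently in the cache"] / 1024 ** 3
print(f"configured: {configured_gb:.2f} GB, currently in cache: {in_use_gb:.2f} GB")
```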
| Comment by Eric Sedor [ 16/Jul/19 ] |
|
Hi, we haven't heard back from you for some time, so I'm going to mark this ticket as resolved. If this is still an issue for you, please provide additional information and we will reopen the ticket. Regards, |
| Comment by Eric Sedor [ 28/Jun/19 ] |
|
Hi, we still need additional information to diagnose the problem. If this is still an issue for you, would you please see my previous comment? Gratefully, |
| Comment by Eric Sedor [ 11/Jun/19 ] |
|
Hi adamof, right now we think some activity during initial sync (possibly an index build) is requiring more memory than is available. To continue researching the possibility of a bug, we still need more information. If you still need help, can you:
Thanks in advance. |
| Comment by Eric Sedor [ 06/Jun/19 ] |
|
Hi adamof, we're looking into the data but are having difficulty reconciling the timestamps provided above with that diagnostic data. Are the complete logs for the Secondary from 2019-05-27T08:21:44.586Z to 2019-05-27T10:31:48.053Z available? If so, could you provide them? |
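A small sketch (the log path is a placeholder, not from the ticket) of one way to cut a large mongod log down to the requested window before attaching it; 3.4 log lines begin with an ISO-8601 timestamp, so a lexicographic comparison on that prefix is enough when all lines share one timezone:

```python
# Sketch: extract only the requested time window from a mongod log.
# The input path is a placeholder; adjust to the member's actual log file.
START = "2019-05-27T08:21:44.586"
END = "2019-05-27T10:31:48.053"

with open("/var/log/mongodb/mongod.log") as src, open("mongod-window.log", "w") as dst:
    for line in src:
        timestamp = line[:23]  # e.g. "2019-05-27T08:21:44.586"
        if START <= timestamp <= END:
            dst.write(line)
```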
| Comment by Stefan Adamov [ 31/May/19 ] |
|
There you go |
| Comment by Eric Sedor [ 30/May/19 ] |
|
With the exception of the sudden drop in pageheap_free_bytes leading into the crash at A, this does appear similar to SERVER-33296. It could be that memory simply wasn't being reclaimed swiftly enough at crash time. Setting the TCMALLOC_AGGRESSIVE_DECOMMIT environment variable to 1 on this node may be a workaround for you. Still, we're looking at this further to make sure something else isn't going on.
Again, data from the Primary may be of help. |
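The counters referenced above are exposed under the tcmalloc section of serverStatus; a minimal sketch (placeholder host, not from the ticket) of how one might watch them before and after setting TCMALLOC_AGGRESSIVE_DECOMMIT=1:

```python
# Sketch, placeholder host: watch tcmalloc's page-heap counters so the effect of
# TCMALLOC_AGGRESSIVE_DECOMMIT=1 (memory returned to the OS sooner) is visible.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://backup-member.example:27017")  # hypothetical host

while True:
    tc = client.admin.command("serverStatus")["tcmalloc"]
    allocated = tc["generic"]["current_allocated_bytes"]   # bytes currently in use by mongod
    pageheap_free = tc["tcmalloc"]["pageheap_free_bytes"]  # freed but still held by tcmalloc
    unmapped = tc["tcmalloc"]["pageheap_unmapped_bytes"]   # returned to the operating system
    print(f"allocated={allocated} pageheap_free={pageheap_free} unmapped={unmapped}")
    time.sleep(30)
```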
| Comment by Eric Sedor [ 30/May/19 ] |
|
Sorry about that, adamof; I've corrected the link and we are looking at the data you provided. Because activity on the Primary node can provide important context for Secondary activity, it would be helpful if you could attach the diagnostic data for the Primary during these OOMs as well. Thanks in advance! |
| Comment by Stefan Adamov [ 30/May/19 ] |
|
Hey Eric, the link you've given me (about diagnostic data) seems to be broken. We haven't seen this issue on primaries; I'm uploading the diagnostic data from the secondary. |
| Comment by Eric Sedor [ 29/May/19 ] |
|
Hi adamof, would you please archive (tar or zip) the $dbpath/diagnostic.data directory (described here) and attach it to this ticket? Can you please do this for both the Primary and the Secondary experiencing the issue? An initial possibility we'll be looking to confirm or rule out is whether this is a case of SERVER-33296. |
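A minimal sketch for producing the requested archive (the dbPath below is a placeholder; use the value from the member's mongod.conf):

```python
# Sketch: archive $dbpath/diagnostic.data for upload. DBPATH is a placeholder.
import tarfile

DBPATH = "/var/lib/mongodb"  # hypothetical dbPath; take it from the member's mongod.conf

with tarfile.open("diagnostic-data.tar.gz", "w:gz") as tar:
    tar.add(f"{DBPATH}/diagnostic.data", arcname="diagnostic.data")
```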