[SERVER-25922] Replication fails due to too many open files or out of memory Created: 01/Sep/16  Updated: 04/Oct/16  Resolved: 04/Oct/16

Status: Closed
Project: Core Server
Component/s: Replication, WiredTiger
Affects Version/s: 3.2.4, 3.2.9
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Kral Markus [X] Assignee: Kelsey Schubert
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

see attached additional_info


Attachments: HTML File additional_info     HTML File mongodb_config     HTML File oom_messages     HTML File startup_after_failure     HTML File too_many_open_files    
Operating System: ALL
Participants:

 Description   

We have 2 running Servers in replication.

We are adding a 3rd server. The data seems to get synced without any problems, but the index build fails with either "too many open files" (although the limit here is set to 512000, while 64000 is the recommended value) or an out-of-memory error.

Attached you can find the log-outputs of both.

I tested with version 3.2.4 and 3.2.9 on the affected instance.
The cluster is overall version 3.2.4 and running without problems.
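For reference, the open-files limit that applies to the running mongod process can differ from the interactive shell's ulimit, for example when mongod is started via an init script. One illustrative way to compare the two, assuming a single mongod on the host:

ulimit -n
cat /proc/$(pidof mongod)/limits | grep 'open files'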



 Comments   
Comment by Kelsey Schubert [ 04/Oct/16 ]

Hi KralMar,

Since we haven't heard back from you, I assume that the ulimit settings explained the issue. Regarding the OOM kills, you have been affected by SERVER-20306, which was fixed in MongoDB 3.2.10. If this is still an issue for you after upgrading, please let us know and we will continue to investigate.
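If it helps, a quick way to confirm which binary is actually running after the upgrade (illustrative commands, run on the affected member) is:

mongod --version

or, from a connected mongo shell:

db.version()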

Thank you,
Thomas

Comment by Kelsey Schubert [ 12/Sep/16 ]

Hi KralMar,

The index build is failing to open /srv/mongodb/data/_tmp/extsort.976 and I see WiredTiger has 45 currently open files, which brings the total number suspiciously close to the default ulimit setting of 1024.

Would you please double check that the user running mongod has the correct ulimit settings?

cat /proc/<mongod-pid>/limits
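It may also help to compare that limit against the number of descriptors the process currently holds; a rough, illustrative check (substitute the real pid) is:

ls /proc/<mongod-pid>/fd | wc -l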

Thank you,
Thomas

Comment by Kral Markus [X] [ 12/Sep/16 ]

Hi,

what is the current status here?

Thanks
Markus

Comment by Kral Markus [X] [ 02/Sep/16 ]

Hi Ramón,

thanks for your quick response.
I uploaded the requested files.

As the environment is not that critical and we do have a running replica set with 2 members and a backup,
I will wait for your instructions before trying anything further on my own.

Comment by Ramon Fernandez Marina [ 01/Sep/16 ]

I understand your concern KralMar. You can see the data that it collects here, which is essentially the data produced by the following mongo shell commands:

db.serverStatus({tcmalloc: true})
rs.status()
db.getSiblingDB('local').oplog.rs.stats()
db.adminCommand({getCmdLineOpts: true})
db.adminCommand({buildInfo: true})
db.adminCommand({hostInfo: true})

This data is gathered at periodic intervals by the server, compressed, and stored inside the diagnostic.data directory. It contains no collection data.
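If you would like to see what one of these commands returns on your own node before uploading anything, something like the following works from the command line (illustrative only; adjust connection options as needed):

mongo --eval "printjson(db.serverStatus({tcmalloc: true}))"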

Comment by Kral Markus [X] [ 01/Sep/16 ]

Hi Ramón,

we are more than happy to provide you with everything you need;
however, we need to know what kind of sensitive data the diagnostic.data includes, because our database contains personal data that cannot easily be shared (we could run into legal problems).

Kind Regards
Markus

Comment by Ramon Fernandez Marina [ 01/Sep/16 ]

Sorry to hear you're having trouble adding a third node KralMar. I don't see a "smoking gun" in the information you already sent, so I'd like to ask you for the following:

  • The output of free
  • The contents of the /srv/mongodb/data/diagnostic.data directory
  • The full log file for the initial sync attempts (/srv/mongodb/log/mongodb.log)
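For illustration, one way to capture and package those items (paths as given above; output filenames are arbitrary):

free
tar czf diagnostic.data.tar.gz /srv/mongodb/data/diagnostic.data
tar czf mongodb.log.tar.gz /srv/mongodb/log/mongodb.log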

I've created a secure upload portal so you can share this information privately with us.

If this issue is critical for you, I would recommend you consider one or both of these workarounds:

I can't guarantee that these workarounds will address the issue since the data may indicate there's a bug somewhere, but trying won't hurt either. If you decide to try, please make sure you keep the existing contents of the diagnostic.data directory somewhere (uploading them to us is sufficient) – if the initial sync succeeds with the workaround we'll need to compare this data before and after to understand what the issue is.

Thanks,
Ramón.
