[SERVER-18231] Primary cannot be reached while the secondary does a full sync Created: 28/Apr/15 Updated: 29/Apr/15 Resolved: 29/Apr/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Admin |
| Affects Version/s: | 3.0.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Maziyar Panahi | Assignee: | Ramon Fernandez Marina |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: | |
| Operating System: | ALL |
| Steps To Reproduce: | Shut down the secondary |
| Participants: | |
| Description |
|
Hello, I have a replica set with three nodes (primary, secondary, and arbiter). The MongoDB version is 3.0.2 and I started my replica set with this command:
This is the output of db.stats():
And this is how it looks in mongostat while I am syncing the secondary:
This is the second time I have tried to do a full sync of my secondary. Near the end, when the storage on both nodes is almost equal (650GB) and the secondary is building indexes, the primary suddenly shows high CPU usage and eventually freezes. The SSH connection drops and the machine becomes inaccessible. From the alerts in MMS and at the application level I can see that all operations on the primary are blocked: there are no inserts, updates, or queries. I didn't wait to see what would happen once the secondary finished building its indexes, since it was only at 22% and I would have been without a primary for a long time; but when I restarted the primary, the secondary removed everything and started the sync from the beginning. Both the primary and the secondary have a 10-core CPU, 80GB of memory, and 3TB of storage. I don't have CPU profiling enabled in MMS (as I recall I couldn't enable it a while back), so let me know if you need more information or want me to log anything next time. |
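As an illustration only (this is not the reporter's actual command; hostnames, paths, and the replica set name are placeholders), a three-member set of this shape on 3.0.2 with WiredTiger could be started along these lines:
{code}
# Start mongod on each of the three hosts with the same replica set name.
mongod --replSet rs0 --storageEngine wiredTiger \
       --dbpath /data/db --logpath /var/log/mongodb/mongod.log --fork

# From the intended primary, initiate the set with a data-bearing
# secondary and an arbiter:
mongo --eval 'rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "primary.example.net:27017" },
    { _id: 1, host: "secondary.example.net:27017" },
    { _id: 2, host: "arbiter.example.net:27017", arbiterOnly: true }
  ]
})'
{code}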
| Comments |
| Comment by Ramon Fernandez Marina [ 29/Apr/15 ] | |||||||||||||||||||||||||||||||||||||||||||
|
Thanks for the update, maziyar. In general I'd recommend letting WiredTiger choose the default cache size, but I suggested a lower limit to be on the safe side and to help your system complete the initial sync. I am going to close this ticket as a duplicate of
Regards, |
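For reference, the WiredTiger cache size can be capped either on the command line or in the YAML configuration file; the value below is a placeholder, not the limit suggested in this ticket:
{code}
# Command-line form (3.0.x):
mongod --replSet rs0 --storageEngine wiredTiger --wiredTigerCacheSizeGB 30

# Equivalent setting in the YAML config file:
#   storage:
#     wiredTiger:
#       engineConfig:
#         cacheSizeGB: 30
{code}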
| Comment by Maziyar Panahi [ 29/Apr/15 ] | |||||||||||||||||||||||||||||||||||||||||||
|
Well, I started syncing yesterday right after my comment and so far it has only copied 358GB of the 650GB of data on the primary. Here is some insight into how resources are being used during the sync, in case it's needed in the future. My best guess is that the problem came from ignoring the startup warning that says not to run mongod as root, and perhaps from the primary using too much memory. I have decreased wiredTigerCacheSizeGB as you suggested and now run mongod as a regular user.
Primary:
Secondary:
mongostat --discover:
|
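A rough sketch of the "run mongod as a regular user" change described above; the mongodb service account and the paths are assumptions:
{code}
# Make the data and log directories writable by the service account,
# then start mongod under that account instead of root:
sudo chown -R mongodb:mongodb /data/db /var/log/mongodb
sudo -u mongodb mongod --config /etc/mongod.conf
{code}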
| Comment by Maziyar Panahi [ 28/Apr/15 ] | |||||||||||||||||||||||||||||||||||||||||||
|
Hi Ramon, I am doing exactly what you said and will get back to you with the results.
Regards, |
| Comment by Ramon Fernandez Marina [ 28/Apr/15 ] | |||||||||||||||||||||||||||||||||||||||||||
|
maziyar, I'm afraid we'll need more information to understand what's going on. My first guess is that your primary uses so much memory that it starts thrashing and eventually hangs, and since the mongod process runs as root the OOM killer does not terminate it before your interactive sessions become unresponsive. You mention 80GB of RAM, but how much swap do you have? Is your system allowed to degrade gracefully? My first suggestion is that you run mongod as a regular user on all your replica set members, and then:
In addition, this could be the same issue as
Cheers, |
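As an illustration (not part of the original suggestions), swap capacity and OOM-killer activity on a Linux host can be checked with commands along these lines:
{code}
free -m                            # RAM and swap: total vs. used
swapon --summary                   # configured swap devices
dmesg | grep -i 'out of memory'    # evidence of the OOM killer firing
grep -i oom /var/log/syslog        # same, via syslog (path varies by distro)
{code}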
| Comment by Maziyar Panahi [ 28/Apr/15 ] | |||||||||||||||||||||||||||||||||||||||||||
|
I forgot to mention that I also tried using scp to copy all the files to the secondary to avoid stalling the primary, but when I started my secondary I got this error:
This is the beginning of frok.log from when I started my secondary:
Thanks |
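For context, seeding a secondary from copied data files only works if the copy is consistent; a rough sketch under that assumption (the source mongod is shut down while copying, and hostnames and paths are placeholders):
{code}
# On the source member: stop mongod cleanly so the files are consistent.
mongod --dbpath /data/db --shutdown

# Copy the entire dbpath (including the local database) to the new member.
scp -r /data/db mongodb@secondary.example.net:/data/

# Restart the source, then start the new member normally; it should join
# the set and catch up from the oplog instead of doing a full initial sync.
mongod --config /etc/mongod.conf
{code}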