[SERVER-38089] Fatal error 40292 due to missing oplog entry at the start time stamp Created: 12/Nov/18 Updated: 12/Dec/18 Resolved: 12/Dec/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 4.0.1, 4.0.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Arun Lakhotia [X] | Assignee: | Danny Hatcher (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | SWNA | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Operating System: | ALL | ||||||||
| Steps To Reproduce: | I have the various log, turtle, wt, and diagnostic data files, but I have no idea how to recreate the problem. I am attaching all the pertinent data I could think of.
We are using MongoDB Cloud Manager. The tar file contains logs of the automation agents as well. Since the automation agent keeps trying to restart the server, you'll find repetitive entries. It contains errors from 4.0.1 and from 4.0.4.
|
||||||||
| Description |
|
A series of weird happenings has ended up in a situation triggering Fatal Error 40292 in mongo/db/repl/replication_recovery.cpp line 133. We have a PSA configuration. The entire issue started when our DB approached (or hit) the disk limit. We resized the disk about mid October; from there on we have not been able to keep a stable system. The servers keep going down. I got involved two days ago, so I don't have the entire history. I found that our system was triggering the error around this point in the log:
2018-11-12T05:41:22.612+0000 I FTDC [initandlisten] Initializing full-time diagnostic data capture with directory '/mongo/rs0_8/diagnostic.data'
For reference, here is the log from 4.0.1 (prior to the above error):
2018-11-11T10:23:54.182+0000 I ROLLBACK [rsBackgroundSync] Rolling back to the stable timestamp. StableTimestamp: Timestamp(1539794142, 116) Initial Data Timestamp: Timestamp(1539794142, 116)
2018-11-11T10:23:54.275+0000 E STORAGE [rsBackgroundSync] WiredTiger error (22) [1541931834:274955][23314:0x7f35f5bd7700], WT_CONNECTION.rollback_to_stable: rollback_to_stable illegal with active transactions: Invalid argument Raw: [1541931834:274955][23314:0x7f35f5bd7700], WT_CONNECTION.rollback_to_stable: rollback_to_stable illegal with active transactions: Invalid argument
2018-11-11T10:23:54.275+0000 F ROLLBACK [rsBackgroundSync] RecoverToStableTimestamp failed. :: caused by :: UnrecoverableRollbackError: Error rolling back to stable. Err: Invalid argument
|
| Comments |
| Comment by Danny Hatcher (Inactive) [ 12/Dec/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hello Arun, I'm sorry that you are still encountering some inconsistent data. I can recommend our Technical Support team for future issues with Automation (and data recovery) as they are experts on those subjects. Please let us know if you discover anything else that could help identify the root cause. Thank you, Danny | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Arun Lakhotia [X] [ 10/Dec/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Daniel and the rest of the team: Thanks for your responses to this issue and for providing helpful guidance. Yes, at this time there is nothing more we can do on this issue. At best, I (and I am sure others) can benefit from some do's and don'ts. That would certainly be very useful for people who also use the MongoDB web service for automation and monitoring. It so happens I am still finding inconsistent state in the DB, but it's not something you can help me with. So you may go ahead and close the issue.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Danny Hatcher (Inactive) [ 10/Dec/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hello Arun, I've been trying to think of any alternative possibilities but I believe that your assessment is accurate. MongoDB has known issues when it actually runs out of disk space so that situation plus multiple mongod's trying to access the same data would absolutely cause problems. For the former issue, SERVER-13811 would track the work when we come up with a viable solution. As there are so many unknowns as to the original cause of the issue and your system is now back up and running, I think this ticket can be closed. Do you have anything else for us? Thank you, Danny | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Arun Lakhotia [X] [ 03/Dec/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hi Danny: I think I'd need a therapist to make him talk. FWIW, my initial upload of log files contained logs from just before, during, and after, so there is a possibility the log contains some useful story. Here are the steps that I am aware of:
The possibilities I could think of:
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Danny Hatcher (Inactive) [ 03/Dec/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hello Arun, I think your explanation makes sense as a possible original cause for these issues. Would you happen to have the list of steps the other user followed back in October? If we can identify exactly what happened then it may reveal something that we can fix. Thank you, Danny | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Arun Lakhotia [X] [ 30/Nov/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Based on the above statement that automation may have interfered with recovery, maybe this helps identify the root cause of where this all started. When we reconfigured our disk (about October 17) and restarted mongo, the person responsible was not aware of the automation agent. He started mongod manually AND he started it as root. While he was configuring the system, the automation agent was running in the background, competing to do the same work. He wasn't aware of the contention. He would look at the logs, see issues related to file ownership, and change ownerships to root. But the process itself was running as the mongodb user. Maybe this fight between two mongodb processes (and more on the replica set) put the system in some state that it usually doesn't enter. Add to it the file limit, and the issue ballooned. So my plan is that once we reach stable operation manually, we will then configure the automation agent. And from there on we will be very careful when doing anything manually.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Danny Hatcher (Inactive) [ 30/Nov/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hello Arun,
Actually, I think Automation is very useful for the normal running state of systems. If a mongod process managed by Automation goes down for any reason, the Automation agent will attempt to bring it back up. In situations where the node won't start back up there's no real advantage but otherwise you don't have to worry about a mongod silently crashing and never restarting. It also makes things like version upgrades easier and as you mentioned other reconfigurations. The downside of Automation is that it is "optimistic". It assumes that any changes that occur are happening to a relatively healthy system. In this scenario, we have been attempting some configurations that you would never apply to a healthy system so Automation tries and fails to fix the problems it sees. Unfortunately, some of those attempted fixes directly contradict what we are telling the system to do. Thus, for periods of abnormal maintenance I believe it is beneficial to remove the relevant processes from Automation but in all other circumstances to take advantage of its features.
We actually recommend a minimum value of 64000 for the "max open files" ulimit. Of course you can go higher if you believe your system requires it. It may not have impacted you before due to a relatively low amount of load on MongoDB under most circumstances but it is useful to have the higher value just in case.
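For reference, a minimal sketch of raising that limit follows; the user name, file paths, and service name are assumptions and should be adapted to your deployment.
# Check the limit in the shell that launches mongod
ulimit -n
# Persist a higher limit for the mongodb user via PAM limits
# (e.g. in /etc/security/limits.d/99-mongodb.conf):
#   mongodb soft nofile 64000
#   mongodb hard nofile 64000
# If mongod runs under systemd, set the limit in the unit instead
# (systemctl edit mongod), then restart the service:
#   [Service]
#   LimitNOFILE=64000
sudo systemctl restart mongod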
The impact of Monitoring should be so negligible that it is not even noticed and Backup should only be slightly more. However, if you have no intention of ever using the Backup feature I see no reason to install the agent.
Possibly, but I do not see any indication of that. 1024 really is a low value so it is not surprising you ran into that issue. If you do run out of open files at a much larger value, like the recommended 64000, then that would be something to investigate. Thank you, Danny | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Arun Lakhotia [X] [ 30/Nov/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
About the PANIC attack: the panic happened because the number of open files exceeded the ulimit settings. Does that by any chance point to some issue in the server code? Is there a possibility that files are being opened and not closed? I don't per se need an explanation; I am sure there are very good design reasons why one may open such a large number of files. So this is just a comment, in case there is some issue that may have been overlooked.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Arun Lakhotia [X] [ 30/Nov/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Danny: My status: I am currently running a standalone configuration. In this mode we did find some corruption in the data, that is, our collections were out of sync. That is because we never focused on robustness in the event of such outages. At this time, I am more concerned about getting back into an operational state using the RS. Since I do not really know what caused the issue, I am being ultra cautious. I have understood the following from your note:
1) The automation agent has to be used with care. Do you have any dos/don'ts for using the automation agent? At this time, I have cut it out entirely. I can see the need for the automation agent when configuring or reconfiguring machines; I don't see that it serves a lot of purpose when a system is in steady state. (Maybe it does, and I'd like to learn.)
2) ulimit. See below. I see that the open files value in my settings is much smaller (1024) than what is on the MongoDB page (21000). I can see that as the first tangible cause for the PANIC attack. So this is great.
3) You didn't mention this, but it is related: how do the monitoring agent and backup agent interfere with mongod operation? I'd like to use at least the monitoring agent. I can do without the backup agent, since we are using disk snapshots as backups.
For reference, here are my ulimit settings.
And the limits of the mongod process on the currently functional machine. We have not changed any configuration here, so I am sure that the processes were running with the same limits earlier.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Danny Hatcher (Inactive) [ 26/Nov/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hello Arun, Thank you for providing the extended amount of detail surrounding your environment. I think there's enough evidence that Automation has been interfering with your recovery efforts so I am glad to see that you are currently running things manually until everything is healthy again. In regards to the latest panic during the repair, it appears that we hit the limit of open files on the Linux server. Do you happen to know the ulimit settings these servers are using? If you have a mongod currently up and running on one of the problem servers, the following is probably the best way to obtain that information:
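For example, assuming a single mongod per host, something like the following should show the limits the running process actually has:
# Dump the effective limits of the running mongod from /proc
cat /proc/$(pgrep -n mongod)/limits
# Or just the open-files line
grep "Max open files" /proc/$(pgrep -n mongod)/limits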
I can see that the fatal assertion here was the same one that caused you to open this case. I know that there have been multiple copies of data files. Are the data files under this node considered the "clean" ones or are they the known "broken" ones? Do you still have a "clean" copy of the data files that do not cause errors when starting a server against it?
As far as I can see, each iteration that has been tested thus far has been PSA or a standalone node. Have there been any attempts to run the data as a single-member replica set or as a member of a PSS replica set? The latter would involve your arbiter server hosting actual data load so I am not sure if this is too onerous. Thank you, Danny | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Arun Lakhotia [X] [ 22/Nov/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Here is another crash report. I now have my production system running as a standalone and am using the replica set for experiments. The planned set will have three machines: mongod-1, mongod-2, and mongod-arbiter. On starting mongod-1 we got an error that it was improperly shut down, so we started it with --repair. It crashed after about 35 hours. Here is an extract from the error triggering the panic.
I have uploaded all the data. The server was run manually, so you can ignore the automation logs.
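For the record, the repair run was started roughly like this (a sketch; the dbpath and port are assumptions based on paths mentioned elsewhere in this ticket):
# Run repair with no other mongod holding the dbpath; start normally afterwards
mongod --dbpath /mongo/rs0_8 --port 6655 --repair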
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Arun Lakhotia [X] [ 20/Nov/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
FYI – I am running my production server in standalone mode, which gives me reduced performance. This is not ideal, but it is better than being completely shut down. To help with debugging this issue, I am keeping the replicas intact. If you'd like me to try any new builds, turn on more intrusive logging, or run any other experiment, I'd be happy to give it a try.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Arun Lakhotia [X] [ 20/Nov/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Pardon me for this big blast. It appears there is some mess-up with automation going on as well. The automation logs show lots of permission issues on both sides. Also, I found that mongod-2 started on an empty database, not on a copy of the old DB. (There was an error in the directory path of the DB configuration which led mongod to create an empty DB. That explains why it started right away, instead of taking hours.) FWIW, this may offer some clues as well. In the morning I will have the three servers started afresh without any automation agent.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Arun Lakhotia [X] [ 20/Nov/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Continuing from the previous comment, I figure I should give as much data as I can think of. Here is the local directory on mongod-2.
And their SHA1s.
Now the SHA1s of the files in mongod-1::local.
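(The sums were produced with something along these lines; the dbpath is an assumption.)
# Checksum the files of the local database so the two nodes can be
# compared file by file
sha1sum /mongo/rs0_8/local/*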
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Arun Lakhotia [X] [ 20/Nov/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hi Danny: I appreciate you and your team continuing to look at the issue. Per your suggestion, I brought up the replica set again in a PSA configuration. This time too it crashed, but it turns out I made a mistake along the way that may provide some clue. I have uploaded the log and relevant files. Here are their details:
Here is what we did.
Relevant IP addresses:
The error log says that the crash is due to Permission denied on files on both servers. I checked the permissions, and they look good to me.
Here is log from mongod-1.
Here is log from mongod-2
The mongod-1:local directory. The error above says the collection-16 file exists, implying maybe it shouldn't.
And the mongod-1::journal directory. The error says permission was denied listing the directory. The processes were running as the mongodb user, a fact I further verified by the ownership of /tmp/mongod-6655.sock. So I do not understand why permission would be denied listing the directory.
On mongod-2 the error is only about permission denied listing the journal directory. Here are the permissions in that directory.
On a whim I also computed the SHA1 of the two journal directories. It turns out the Preplog files on both servers are the same, though they differ in their time stamps. The time stamps indicate roughly when each server was started.
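One way to confirm whether the mongodb user can actually traverse the journal directory is to repeat the listing as that user; the dbpath here is an assumption.
# List the journal directory as the user mongod runs under; a failure here
# would reproduce the "permission denied" seen in the server log
sudo -u mongodb ls -la /mongo/rs0_8/journal
# Check ownership and mode bits of every directory on the path, since
# traversal needs execute permission on each parent
namei -l /mongo/rs0_8/journal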
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Danny Hatcher (Inactive) [ 19/Nov/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hello Arun, We've done some more investigation into the data you've provided thus far. We are still looking into why the oldest timestamp did not advance once the node applied all of its oplog. However, we believe that some of the improvements in later versions of 4.0.x have improved memory usage for situations such as this so that you can get around the OOM. Would it be possible to try running the node in the replica set using the latest 4.0.4 binaries? Thank you very much, Danny | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Danny Hatcher (Inactive) [ 16/Nov/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hello Arun, Yes, the setting forces read-only mode on the server for technical reasons. We are seeing some unexplained activity on the server after the oplog batches have been applied, and we are investigating what could be causing it. As such, I am currently unable to state unequivocally that the data on the node is complete. That being said, I believe that all the issues have dealt with recovering data, and all the data currently on the node should be safe (not corrupted). Do you have some way of validating the data presence on the client side? You may wish to run the validate command on the standalone as it should touch all the data files. Thank you, Danny
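A rough way to run validate over every collection on the standalone is sketched below; the port is an assumption, and full validation can take a long time on large collections.
# Validate every collection in every database on the standalone
mongo --port 6655 --quiet --eval '
  db.getMongo().getDBNames().forEach(function (dbName) {
    var d = db.getSiblingDB(dbName);
    d.getCollectionNames().forEach(function (collName) {
      var res = d.getCollection(collName).validate(true);  // full validation
      print(dbName + "." + collName + " valid: " + res.valid);
    });
  });'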
| Comment by Arun Lakhotia [X] [ 16/Nov/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Uploaded logs and diagnostics. Configuration: standalone with --setParameter recoverFromOplogAsStandAlone=1.
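(The node was started with something like the following; the dbpath and port are assumptions, and current 4.0 binaries document the parameter as recoverFromOplogAsStandalone.)
# Replay the oplog during startup recovery, then come up read-only
mongod --dbpath /mongo/rs0_8 --port 6655 \
       --setParameter recoverFromOplogAsStandalone=true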
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Arun Lakhotia [X] [ 16/Nov/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hi Danny:
Re: Saving all the data. It's a matter of the trade-off I need to make. a) I have over 2 years of data; the oplog replay covers about a month. b) For over a week now I have completely shut down my entire system – which analyzes a stream of malware I get on a daily basis. Prior to that the system was flaky, so I figured I should solve the core problem. As it is, I do not know the state of the data. (My database schema has not been designed to account for intermittent failures.) Which means that if I have to lose a month's worth of data to get back to reliable operation today, as opposed to waiting an indefinite amount of time, I will choose losing the oplog data. However, at this time we have not ruled out the possibility that the data storage itself is corrupted. I have shut down all queries (and especially insertions) against the data. So I am anxious to know whether the primary data storage has any issues.
------ Re: Status from using the --recoverFromOplogAsStandalone switch. First, for anyone else following this thread, depending on your mongod version you may need to use --setParameter recoverFromOplogAsStandAlone=1. So far the server has been up and has not crashed, under no activity of any kind.
------ Observation: The banner message splashed by the server after restart with the switch says it's running in readOnly mode. I take that to mean that the DB will not permit any inserts. If so, this mode is good for me to get the data out, but not much else.
MY PLAN:
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Danny Hatcher (Inactive) [ 16/Nov/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hello Arun, That flag is actually relatively new with MongoDB 4.0: in previous versions we recovered from the journal, while in 4.0 we journal the oplog and recover from the oplog after the journal. As you noticed, starting the node as a plain standalone did not replay the oplog, so the oplog there could contain data that has not been applied yet. We are operating under the assumption that those updates are valuable; if instead there has not been any real production data since the disk resizing, then it should be easy to recover from this state, albeit losing some of those writes. However, we will continue assuming that you wish to save all of the data. Starting the standalone with the --recoverFromOplogAsStandalone flag may cause you to run into the same OOM issue again, but we do think it's the next step. Regardless of the outcome, please upload fresh diagnostic.data to this ticket after the node has gone through the replay process. Thank you, Danny
| Comment by Arun Lakhotia [X] [ 16/Nov/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Dennis: Thanks for the note of caution. I do have a backup of my two replicas, so I am good. I did run the server as a standalone; it restarted in a jiffy, with no oplog replay, and stayed up for long enough. But I didn't put it under any load – no queries, nothing. Following up on your note, I also got this hint from:
This escaped my mind when I restarted the server. (It's been a busy day with meetings, and I got both notes while I was in a meeting.) So, FWIW, I will do the standalone run with the flag Bruce mentions. As and when you have something new for me to try, I can bring my other two servers back up. Arun
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Danny Hatcher (Inactive) [ 15/Nov/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hello Arun, To keep you updated, we've been having an internal discussion on the best way forward and a few snippets have been getting through; Tess's comment and Bruce's graph were two of those. Unfortunately, the graph was generated by a tool that is not yet customer-available. There is a project in the works to make it so, but it's still down the line. It may be best to keep Automation disabled for the entire replica set until we get your cluster back into a healthy state; in its attempts to fix the problem, it is possible that Automation may contradict some of your actions. We are discussing the standalone idea, as we are not sure whether the content will still be pinned in cache if the node is no longer in the replica set. If you can hold off for a little, we'd like to make sure that going to a standalone is the best option before recommending it. However, if you need to get your system up as soon as possible, please make sure that you have the data files backed up to a separate location just in case. We will let you know more as soon as we can. I appreciate your patience in working with us through this; getting to the root cause here will also help us support the rest of our customers. Thank you, Danny
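A minimal sketch of taking such a backup with the mongod stopped; the dbpath, destination, and service name are assumptions.
# With mongod cleanly shut down, take a file-level copy of the dbpath
sudo systemctl stop mongod
sudo tar czf /backup/mongod-0-dbpath-$(date +%F).tgz -C /mongo rs0_8
sudo systemctl start mongod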
| Comment by Arun Lakhotia [X] [ 15/Nov/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Bruce: That is one cool graph showing the timeline. Is that an in-house tool, or something available to users? All: I ran the replica set again with just the PRIMARY and ARBITER – no SECONDARY. I also turned off any other communication with the server, so there were absolutely no queries. After 3 hours of replaying the oplog, and about 9 hours of staying up, the server crashed. CAVEATS: I found that my automation manager was not able to change the replica config. Reason: the _id of the PRIMARY (mongod-0) in the replica config was different from the _id being given by the automation manager. So I turned off automation and manually removed mongod-1 from the replica set. The good thing is that mongod-1 was powered off all through, so mongod-0 was really by itself – except that it was expecting mongod-1 in the replica set, so you will see heartbeats to mongod-1 that fail. I have uploaded the log and diagnostics of both mongod-0 and mongod-arbiter. Here are the SHA1s of the pushed files.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Arun Lakhotia [X] [ 14/Nov/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I received a note from Tess Avitable. I don't see it in the ticket here; for reference, here it is:
Good observation. I had disabled "enableMajorityReadConcern" while shooting in the dark and trying to do something. I found a reference to the property on Stack Exchange, and then found this description in the docs. I thought of giving it a chance.
I figured my deployment was immobilized, so I gave it a try. I have now re-enabled it (by deleting the property on Cloud Manager, so it reverts to the default). My plan is still to bring up the PRIMARY first, let it run for some time, and then bring up the SECONDARY.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Arun Lakhotia [X] [ 14/Nov/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Kelsey: Thanks for the upload location. I have uploaded the files from my CURRENT configuration. However, there may be a major issue in using this data: yesterday we "disk copied" the PRIMARY (mongod-0) to create the SECONDARY (mongod-1), which means a good bit of the history on mongod-1 is from mongod-0. FWIW, I have also included mongod.log files from around October 22 as well; that is when we resized the disk and the problems started. In case it is useful, the oplog replay starts applying oplogs from October 17. I do have a copy of the old mongod-1; I'd need to get the data out of it for you. Here are the sha1sums of the files I uploaded today:
9a18047b359ca3b332e0450c208acbb802f6e1e8 mongod-1.tgz
EDIT: Uploaded 5778a9a467a89314c1cb0e9d596d43fe4bca060e mongod-2.tgz
RECAP: My early configuration, where this problem started, had mongod-0 as PRIMARY and mongod-2 as SECONDARY (but at that time it was called mongod-1). The configuration set up yesterday had mongod-0 as PRIMARY and mongod-1 as SECONDARY (where mongod-1 was created by a disk copy from mongod-0).
Arun
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kelsey Schubert [ 14/Nov/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hi DrArunL, Would you please upload a more recent archive of diagnostic.data directory as well as the complete logs for the past three days for all the nodes in the replica set (including the arbiter)? I've created a secure upload portal for you to use to provide these files to help you get around the size limitations of attachments to JIRA. Thanks, | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Arun Lakhotia [X] [ 14/Nov/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Excitement short-lived. The SECONDARY (mongod-1) created from copying the PRIMARY (mongod-0) crashed due to a memory allocation error:
2018-11-14T02:50:44.488+0000 E STORAGE [WTCheckpointThread] WiredTiger error (12) [1542163844:402387][22365:0x7f2e68ca8700], file:WiredTigerLAS.wt, WT_CURSOR.next: memory allocation of 27665 bytes failed: Cannot allocate memory Raw: [1542163844:402387][22365:0x7f2e68ca8700], file:WiredTigerLAS.wt, WT_CURSOR.next: memory allocation of 27665 bytes failed: Cannot allocate memory
This then triggers cascading errors, and the server restarts. I have attached the recent log, extracting only the lines between two SERVER RESTARTED lines. When the server restarts, I see that it goes through the "replaying stored operations" phase, which takes almost 3 hours. As I write, mongod-1 is going through that sequence again. I am curious if there is a way to make the server remember the stored operations and not replay them again.
EDIT: It turns out both the PRIMARY (mongod-0) and the SECONDARY (mongod-1) crashed at exactly the same time (within a minute of each other). I have attached the logs of both. The two log files show a lot of network connections. The following may be helpful in understanding the connections:
10.128.0.10 - mongod-0 mongod-0.c.virusbattle-1.internal
My plan: Turn off the secondary, let the primary run by itself for a day, maybe two, and see if it croaks. Whether it does or doesn't should help isolate the issue further.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Arun Lakhotia [X] [ 14/Nov/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Went through the steps. I now have the PSA configuration back up, in that the mongod processes are running and accepting connections (as indicated in the mongod.log files). However, the cloud manager shows there is a problem: it says "Wait until this process is primary" (for mongod-0). I looked in the mongod.log file, and it shows that the process has indeed transitioned to PRIMARY. And mongod-1 also says that mongod-0 is PRIMARY. There is, however, an E (ERROR) message in the mongod-0 log file and a recurring I (INFO) message in the mongod-1 log file which may hold some clue. The mongod-0 error is also referenced in the "View Details" error log of the automation agent. Here is one instance of the error on mongod-0 (PRIMARY):
2018-11-13T23:45:20.660+0000 E REPL [conn5585] replSetReconfig got NewReplicaSetConfigurationIncompatible: New and old configurations both have members with host of mongod-0.c.virusbattle-1.internal:6655 but in the new configuration the _id field is 0 and in the old configuration it is 8 for replica set rs0 while validating { _id: "rs0", version: 30, members: [ { priority: 2.0, votes: 1, arbiterOnly: false, buildIndexes: true, _id: 0, host: "mongod-0.c.virusbattle-1.internal:6655", hidden: false, tags: {}, slaveDelay: 0 }, { hidden: false, host: "mongod-arbiter-0.c.virusbattle-1.internal:6655", tags: {}, _id: 2, priority: 0.0, buildIndexes: true, arbiterOnly: true, votes: 1, slaveDelay: 0 }, { _id: 3, tags: {}, arbiterOnly: false, hidden: false, host: "mongod-1.c.virusbattle-1.internal:6655", priority: 1.0, votes: 1, buildIndexes: true, slaveDelay: 0 } ], settings: { replicaSetId: ObjectId('5a4c05bb0f58fd8f4ca635e0'), getLastErrorDefaults: { w: 1, wtimeout: 0 }, getLastErrorModes: {}, electionTimeoutMillis: 10000.0, chainingAllowed: true, catchUpTimeoutMillis: -1.0, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10.0, catchUpTakeoverDelayMillis: 30000.0 }, protocolVersion: 1 }
And the INFO entry in mongod-1 (SECONDARY). For reference, this server was restarted at 20:16, and these are some of the messages:
2018-11-13T23:36:03.194+0000 I NETWORK [LogicalSessionCacheRefresh] Starting new replica set monitor for rs0/mongod-0.c.virusbattle-1.internal:6655,mongod-1.c.virusbattle-1.internal:6655
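One way to see the member _ids the running replica set actually has, versus what Automation is pushing, is to dump the current config; the port is an assumption.
# Print each member's _id and host from the config the primary is using
mongo --port 6655 --quiet --eval '
  rs.conf().members.forEach(function (m) { print(m._id + "  " + m.host); });'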
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Danny Hatcher (Inactive) [ 13/Nov/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hello Arun, That's actually our recommended strategy if you think that MongoDB's initial sync process is too slow. It's mentioned in our documentation on resyncing a member of a replica set. The procedure is:
1. Stop the Secondary process
Once the Secondary is started again, it should apply everything in the oplog that it didn't have from the data copy (inserts/updates between steps 5 and 6). Then it will be a healthy member again. Please note that because you are using Cloud Manager Automation, you must first remove these processes from Automation before performing the above procedure. Otherwise, as you mentioned in your initial description, Automation will automatically try to start the processes if it notices that they are down. Please let me know if you have any issues with this. Thank you, Danny
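As a rough sketch only, under assumed hostnames, dbpath, and service management, the copy-the-files approach looks roughly like this (the authoritative steps are the ones in the resync documentation):
# On the secondary (mongod-1): stop the process
sudo systemctl stop mongod
# On the primary (mongod-0): stop it as well (or snapshot its volume) so the
# data files are consistent while they are copied
sudo systemctl stop mongod
# On the secondary: pull the primary's data files over the old dbpath
rsync -a --delete mongod-0:/mongo/rs0_8/ /mongo/rs0_8/
# Restart the primary first, then the secondary; the secondary applies the
# oplog entries it missed during the copy and rejoins as a healthy member
sudo systemctl start mongod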
| Comment by Arun Lakhotia [X] [ 13/Nov/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hi Danny: Thanks for the suggestion about initial sync; that seems like a reset button for mongod. I am a newbie at this and will try it tomorrow morning. In regards to making a replica, what if I stopped my PRIMARY, copied the drive, and used that to start the SECONDARY? I am guessing that's a no-no, but I thought I'd ask. I think copying the disk may be faster than an initial sync.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Danny Hatcher (Inactive) [ 12/Nov/18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hello Arun, Your summary makes sense to me. Have you attempted to perform an initial sync since the problem on 4.0.1? If not, could you do so on this node that is currently experiencing the fatal assertion? It's possible that the oplog for it is in an unrecoverable state. Thank you, Danny |
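A minimal sketch of forcing that initial sync, assuming the node has been removed from Automation and uses the dbpath seen earlier in this ticket: when restarted with an empty data directory and its usual replica set options, the member re-copies everything from the primary.
# Stop the member that hits the fatal assertion
sudo systemctl stop mongod
# Move its data files aside (kept in case they are needed later)
sudo mv /mongo/rs0_8 /mongo/rs0_8.broken
sudo install -d -o mongodb -g mongodb /mongo/rs0_8
# Start it again; it performs an initial sync from the primary and
# rebuilds its data files and oplog
sudo systemctl start mongod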