[SERVER-38089] Fatal error 40292 due to missing oplog entry at the start time stamp Created: 12/Nov/18  Updated: 12/Dec/18  Resolved: 12/Dec/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 4.0.1, 4.0.4
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Arun Lakhotia [X] Assignee: Danny Hatcher (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: SWNA
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File image-2018-11-13-21-54-57-681.png     File mongod-0.tgz     File mongod-1.tgz     File mongodb-0.log-2018-11-14     File mongodb-1.log-2018-11-14     PNG File node1.png    
Issue Links:
Related
related to SERVER-36495 Cache pressure issues during recovery... Closed
Operating System: ALL
Steps To Reproduce:

I have the various log, turtle, wt, and diagnostic data files, but I have no idea how to recreate it.

I am attaching all the pertinent data I could think of.

  • mongod-1.tgz: Secondary (with the problem)
  • mongod-0.tgz: Primary

We are using MongoDB Cloud Manager. The tar files contain logs of the automation agents as well.

Since the automation agent keeps trying to restart the server, you'll find repetitive entries. The logs contain errors from both 4.0.1 and 4.0.4.

 

Participants:

 Description   

 

A series of unusual events has left us in a situation that triggers Fatal Error 40292 in mongo/db/repl/replication_recovery.cpp line 133.

We have a PSA configuration. The entire issue started when our DB approached (or hit) the disk limit. We resized the disk about mid-October. Since then we have not been able to keep the system stable; the servers keep going down.

I got involved two days ago, so I don't have the entire history.

I found that our system was triggering WT-4335 with version 4.0.1. I upgraded to 4.0.4 yesterday. I am guessing the fatal error now triggers closer to the root cause. (WT-4335 was triggering after all the oplog replay; now the error occurs at the very start.)

2018-11-12T05:41:22.612+0000 I FTDC [initandlisten] Initializing full-time diagnostic data capture with directory '/mongo/rs0_8/diagnostic.data'
2018-11-12T05:41:22.614+0000 I REPL [initandlisten] Rollback ID is 12
2018-11-12T05:41:22.614+0000 I REPL [initandlisten] Recovering from stable timestamp: Timestamp(1539794142, 116) (top of oplog: { ts: Timestamp(1541830881, 1), t: 85 }, appliedThrough: { ts: Timestamp(0, 0), t: -1 }, TruncateAfter: Timestamp(0, 0))
2018-11-12T05:41:22.614+0000 I REPL [initandlisten] Starting recovery oplog application at the stable timestamp: Timestamp(1539794142, 116)
2018-11-12T05:41:22.614+0000 I REPL [initandlisten] Replaying stored operations from { : Timestamp(1539794142, 116) } (exclusive) to { : Timestamp(1541830881, 1) } (inclusive).
2018-11-12T05:41:22.617+0000 F REPL [initandlisten] Oplog entry at { : Timestamp(1539794142, 116) } is missing; actual entry found is { : Timestamp(1539794616, 312) }
2018-11-12T05:41:22.617+0000 F - [initandlisten] Fatal Assertion 40292 at src/mongo/db/repl/replication_recovery.cpp 134
2018-11-12T05:41:22.617+0000 F - [initandlisten]

For reference, here is log with 4.0.1 (prior to the above error).

2018-11-11T10:23:54.182+0000 I ROLLBACK [rsBackgroundSync] Rolling back to the stable timestamp. StableTimestamp: Timestamp(1539794142, 116) Initial Data Timestamp: Timestamp(1539794142, 116)
2018-11-11T10:23:54.275+0000 E STORAGE [rsBackgroundSync] WiredTiger error (22) [1541931834:274955][23314:0x7f35f5bd7700], WT_CONNECTION.rollback_to_stable: rollback_to_stable illegal with active transactions: Invalid argument Raw: [1541931834:274955][23314:0x7f35f5bd7700], WT_CONNECTION.rollback_to_stable: rollback_to_stable illegal with active transactions: Invalid argument
2018-11-11T10:23:54.275+0000 F ROLLBACK [rsBackgroundSync] RecoverToStableTimestamp failed. :: caused by :: UnrecoverableRollbackError: Error rolling back to stable. Err: Invalid argument

 

 

 



 Comments   
Comment by Danny Hatcher (Inactive) [ 12/Dec/18 ]

Hello Arun,

I'm sorry that you are still encountering some inconsistent data. I can recommend our Technical Support team for future issues with Automation (and data recovery) as they are experts on those subjects.

Please let us know if you discover anything else that could help identify the root cause.

Thank you,

Danny

Comment by Arun Lakhotia [X] [ 10/Dec/18 ]

Daniel and the rest of the team:

Thanks for your responses to this issue and for providing helpful guidance. Yes, at this time there is nothing more we can do on this issue. At best, I (and I am sure others) can benefit from some do's and don'ts. That would certainly be very useful for people who also use the MongoDB web service for automation and monitoring.

It so happens I am still finding inconsistent state in the DB, but it's not something you can help me with.

So you may go ahead and close the issue.

 

Comment by Danny Hatcher (Inactive) [ 10/Dec/18 ]

Hello Arun,

I've been trying to think of any alternative possibilities but I believe that your assessment is accurate. MongoDB has known issues when it actually runs out of disk space, so that situation plus multiple mongod processes trying to access the same data would absolutely cause problems. For the former issue, SERVER-13811 would track the work when we come up with a viable solution.

As there are so many unknowns as to the original cause of the issue and your system is now back up and running, I think this ticket can be closed. Do you have anything else for us?

Thank you,

Danny

Comment by Arun Lakhotia [X] [ 03/Dec/18 ]

Hi Danny:

I think I'd need a therapist to make him talk; he may be scared of reliving the experience, or scared he will be held responsible. I have been very clear all along that none of this will be held against him, and that I want to know what may have happened so we do not repeat it.

FWIW, my initial upload of log files contained logs from just before, during, and after. So there is a possibility the log contains some useful story.

Here are steps that I am aware of:

  • Early Oct'18: We have a 4T x 2 replica set. It was running close to full. I don't know if the system had started giving errors/warnings. My colleague observed that we were running out of space.
  • Mid-October: He manually reconfigured the disks to increase the size to 4T, and brought the servers up manually. He didn't know about the automation agent.
  • After some period of operation, the system (as observed externally) would shut down, and he'd start the server again. This kept going for some time.
  • I got involved a day or two before I started this note, which means the system had been flaky for over three weeks.

The possibilities I could think of:

  1. We were already out of disk space, or close to it, before the resizing. Some operations would succeed and others fail, leading to an inconsistent state. So the damage was already done. There could be some issue with how MongoDB performs when it hits disk limits.
  2. The disk resizing on Google Cloud Platform has some issue that messed up the data. (Low probability, based on all other indications.)
  3. There was some contention between a "root mongodb" and a "non-root mongodb", and they didn't know how to resolve the contention, especially when they may acquire the lock alternately (because a process is killed or dies).
  4. A combination of #1 and #3.

 

Comment by Danny Hatcher (Inactive) [ 03/Dec/18 ]

Hello Arun,

I think your explanation makes sense as a possible original cause for these issues. Would you happen to have the list of steps the other user followed back in October? If we can identify exactly what happened then it may reveal something that we can fix.

Thank you,

Danny

Comment by Arun Lakhotia [X] [ 30/Nov/18 ]

Based on the above statement that automation may have interfered with recovery, maybe this will help identify the root cause of where this all started.

When we reconfigured our disk (about October 17) and restarted mongod, the person responsible was not aware of the automation agent. He started mongod manually AND he started it as root. While he was configuring the system, the automation agent was running in the background, competing to do the same work.

He wasn't aware of the contention. He would look at the logs, see issues related to file ownership, and change ownership to root, but the process itself was running as the mongodb user.

Maybe this fight between two mongod processes (and more across the replica set) put the system into some state it usually doesn't enter. Add to that the file limit, and the issue ballooned.

So my plan is that once we reach stable operation manually, we will then configure the automation agent, and from there on we will be very careful when doing anything manually.

 

Comment by Danny Hatcher (Inactive) [ 30/Nov/18 ]

Hello Arun,

Do you have any do's/don'ts for using the automation agent? At this time, I have cut it out entirely. I can see the need for the automation agent when configuring or reconfiguring machines. I don't see that it serves much purpose when a system is in steady state. (Maybe it does, and I'd like to learn.)

Actually, I think Automation is very useful for the normal running state of systems. If a mongod process managed by Automation goes down for any reason, the Automation agent will attempt to bring it back up. In situations where the node won't start back up there's no real advantage, but otherwise you don't have to worry about a mongod silently crashing and never restarting. It also makes things like version upgrades easier and, as you mentioned, other reconfigurations.

The downside of Automation is that it is "optimistic". It assumes that any changes that occur are happening to a relatively healthy system. In this scenario, we have been attempting some configurations that you would never apply to a healthy system so Automation tries and fails to fix the problems it sees. Unfortunately, some of those attempted fixes directly contradict what we are telling the system to do. Thus, for periods of abnormal maintenance I believe it is beneficial to remove the relevant processes from Automation but in all other circumstances to take advantage of its features.

ulimit. See below. I see that the open-files limit in my settings is much lower (1024) than what is on the MongoDB page (21000).

We actually recommend a minimum value of 64000 for the "max open files" ulimit. Of course you can go higher if you believe your system requires it. It may not have impacted you before due to a relatively low amount of load on MongoDB under most circumstances but it is useful to have the higher value just in case.
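For example, a sketch of raising the limit on an Ubuntu host like the one shown later in this ticket. Whether pam_limits or a systemd unit governs the process is an assumption; a systemd-managed mongod would need LimitNOFILE set in the unit or an override file instead:

# for the current shell only
ulimit -n 64000

# persistently for the mongodb user, via /etc/security/limits.conf
echo 'mongodb soft nofile 64000' | sudo tee -a /etc/security/limits.conf
echo 'mongodb hard nofile 64000' | sudo tee -a /etc/security/limits.conf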

You didn't mention this, but it's related: how do the monitoring agent and backup agent interfere with mongod operation? I'd like to use at least the monitoring agent. I can do without the backup agent, since we are using disk snapshots as backups.

The impact of Monitoring should be so negligible that it is not even noticed and Backup should only be slightly more. However, if you have no intention of ever using the Backup feature I see no reason to install the agent.

The attack happened because the number of open files exceeded the ulimit settings. Does that by any chance point to some issue in the server code? Is there a possibility that files are opened, and not being closed?

Possibly, but I do not see any indication of that. 1024 really is a low value so it is not surprising you ran into that issue. If you do run out of open files at a much larger value, like the recommended 64000, then that would be something to investigate.

Thank you,

Danny

Comment by Arun Lakhotia [X] [ 30/Nov/18 ]

About the PANIC attack.

The attack happened because the number of open files exceeded the ulimit settings. Does that by any chance point to some issue in the server code? Is there a possibility that files are opened, and not being closed?

I don't per se need an explanation. I am sure there are very good design reasons why one may open such a large number of files. So this is just a comment, in case there is some issue that may have been overlooked.

 

Comment by Arun Lakhotia [X] [ 30/Nov/18 ]

Danny:

My status: I am currently running a standalone configuration. In this mode we did find some corruption in the data; that is, our collections were out of sync. That is because we never focused on robustness in the event of such outages.

At this time, I am more concerned about getting back into an operational state using RS. Since I do not really know what caused the issue, I am being ultra cautious. 

I have understood the following from your note:

1) Automation agent has to be used with care.

Do you have any do's/don'ts for using the automation agent? At this time, I have cut it out entirely. I can see the need for the automation agent when configuring or reconfiguring machines. I don't see that it serves much purpose when a system is in steady state. (Maybe it does, and I'd like to learn.)

2) ulimit. See below. I see that the open-files limit in my settings is much lower (1024) than what is on the MongoDB page (21000).

I can see that as the first tangible cause of the PANIC attack. So this is great.

3) You didn't mention this, but it's related: how do the monitoring agent and backup agent interfere with mongod operation? I'd like to use at least the monitoring agent. I can do without the backup agent, since we are using disk snapshots as backups.

 

For reference, here is my ulimit settings

ubuntu@mongod-0:~$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 483382
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 483382
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

 

And the limits of the mongod process on the currently functional machine. We have not changed any configuration here, so I am sure the processes were running with the same limits earlier.

 

ubuntu@mongod-0:~$ cat /proc/11087/limits
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             483382               483382               processes 
Max open files            1024                 1048576              files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       483382               483382               signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us  

 

 

Comment by Danny Hatcher (Inactive) [ 26/Nov/18 ]

Hello Arun,

Thank you for providing the extended amount of detail surrounding your environment. I think there's enough evidence that Automation has been interfering with your recovery efforts so I am glad to see that you are currently running things manually until everything is healthy again.

In regards to the latest panic during the repair, it appears that we hit the limit of open files on the Linux server. Do you happen to know the ulimit settings these servers are using? If you have a mongod currently up and running on one of the problem servers, the following is probably the best way to obtain that information:

cat /proc/$pid/limits
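If the pid isn't handy, something along these lines should work (a sketch, assuming a single mongod process on the host):

pid=$(pgrep -x mongod)                          # assumes exactly one mongod on the host
cat /proc/$pid/limits | grep 'Max open files'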

On starting mongod-1 we got an error that it was improperly shutdown. So we started it with --repair.

I can see that the fatal assertion here was the same one that caused you to open this case. I know that there have been multiple copies of data files. Are the data files under this node considered the "clean" ones or are they the known "broken" ones? Do you still have a "clean" copy of the data files that do not cause errors when starting a server against it?

2018-11-20T20:40:10.026+0000 I REPL     [initandlisten] Recovering from stable timestamp: Timestamp(1539794142, 116) (top of oplog: { ts: Timestamp(1542212444, 1), t: 217 }, appliedThrough: { ts: Timestamp(0, 0), t: -1 }, TruncateAfter: Timestamp(0, 0))
2018-11-20T20:40:10.026+0000 I REPL     [initandlisten] Starting recovery oplog application at the stable timestamp: Timestamp(1539794142, 116)
2018-11-20T20:40:10.026+0000 I REPL     [initandlisten] Replaying stored operations from { : Timestamp(1539794142, 116) } (exclusive) to { : Timestamp(1542212444, 1) } (inclusive).
2018-11-20T20:40:10.029+0000 F REPL     [initandlisten] Oplog entry at { : Timestamp(1539794142, 116) } is missing; actual entry found is { : Timestamp(1539795252, 332) }
2018-11-20T20:40:10.029+0000 F -        [initandlisten] Fatal Assertion 40292 at src/mongo/db/repl/replication_recovery.cpp 134

As far as I can see, each iteration that has been tested thus far has been PSA or a standalone node. Have there been any attempts to run the data as a single-member replica set or as a member of a PSS replica set? The latter would involve your arbiter server hosting actual data load so I am not sure if this is too onerous.

Thank you,

Danny

Comment by Arun Lakhotia [X] [ 22/Nov/18 ]

Here is another crash report. My production system is now running as a standalone; I am using the replica set for experiments. The planned set will have three machines: mongod-1, mongod-2, and mongod-arbiter.

On starting mongod-1 we got an error that it was improperly shut down, so we started it with --repair. It crashed after about 35 hours.

Here is an extract of the errors triggering the panic.

2018-11-22T09:28:53.338+0000 E STORAGE  [thread1] WiredTiger error (24) [1542878933:337974][17240:0x7fbffff42700], log-server: __directory_list_worker, 48: /mongo/rs0_8//journal: directory-list: opendir: Too many open files Raw: [1542878933:337974][17240:0x7fbffff42700], log-server: __directory_list_worker, 48: /mongo/rs0_8//journal: directory-list: opendir: Too many open files
2018-11-22T09:28:53.338+0000 E STORAGE  [thread1] WiredTiger error (24) [1542878933:338216][17240:0x7fbffff42700], log-server: __log_prealloc_once, 469: log pre-alloc server error: Too many open files Raw: [1542878933:338216][17240:0x7fbffff42700], log-server: __log_prealloc_once, 469: log pre-alloc server error: Too many open files
2018-11-22T09:28:53.338+0000 E STORAGE  [thread1] WiredTiger error (24) [1542878933:338232][17240:0x7fbffff42700], log-server: __log_server, 1015: log server error: Too many open files Raw: [1542878933:338232][17240:0x7fbffff42700], log-server: __log_server, 1015: log server error: Too many open files
2018-11-22T09:28:53.338+0000 E STORAGE  [thread1] WiredTiger error (-31804) [1542878933:338248][17240:0x7fbffff42700], log-server: __wt_panic, 523: the process must exit and restart: WT_PANIC: WiredTiger library panic Raw: [1542878933:338248][17240:0x7fbffff42700], log-server: __wt_panic, 523: the process must exit and restart: WT_PANIC: WiredTiger library panic
2018-11-22T09:28:53.384+0000 F -        [WTJournalFlusher] Invariant failure: _waitUntilDurableSession->log_flush(_waitUntilDurableSession, "sync=on") resulted in status UnknownError: -31804: WT_PANIC: WiredTiger library panic at src/mongo/db/storage/wiredtiger/wiredtiger_session_cache.cpp 296
2018-11-22T09:28:53.384+0000 F -        [WTJournalFlusher] ***aborting after invariant() failure

I have uploaded all the data. The server was run manually, so you can ignore the automation logs.

725d5af2ade5541c3a1411d08aafec1e7c3e4882  mongod-1-2018-11-22T19-03-39.tgz

 

Comment by Arun Lakhotia [X] [ 20/Nov/18 ]

FYI – I am running my production server in standalone mode, which gives me reduced performance. This is not ideal, but it is better than being completely shut down.

To help with debugging this issue, I am keeping the replicas intact. If you'd like me to try any new builds, turn on more intrusive logging, or any other experiment, I'd be happy to give it a try.

 

Comment by Arun Lakhotia [X] [ 20/Nov/18 ]

Pardon me for this big blast. It appears there is some mess-up with automation going on as well. The automation logs show lots of permission issues on both sides.

Also, I found that mongod-2 started on an empty database, not on a copy of the old DB. (There was an error in the directory path of the DB configuration which led mongod to create an empty DB. That explains why it started right away, instead of taking hours.)

FWIW, this may offer some clues as well.

In the morning I will have the three servers started afresh without any automation agent.

 

Comment by Arun Lakhotia [X] [ 20/Nov/18 ]

Continuing from the previous comment, I figure I should give as much data as I can think of. Here is the local directory on mongod-2.

root@mongod-2:~# ls -lh  /mongo/rs0_6/local
total 136K
-rw------- 1 mongodb nogroup  16K Nov 20 03:01 collection-0-714744392904966981.wt
-rw------- 1 mongodb nogroup 4.0K Nov 20 03:00 collection-2-714744392904966981.wt
-rw------- 1 mongodb nogroup  16K Nov 20 03:01 collection-4-714744392904966981.wt
-rw------- 1 mongodb nogroup  16K Nov 20 03:01 collection-6-714744392904966981.wt
-rw------- 1 mongodb nogroup  16K Nov 20 03:01 collection-8-714744392904966981.wt
-rw------- 1 mongodb nogroup  16K Nov 20 03:01 index-1-714744392904966981.wt
-rw------- 1 mongodb nogroup 4.0K Nov 20 03:00 index-3-714744392904966981.wt
-rw------- 1 mongodb nogroup  16K Nov 20 03:01 index-5-714744392904966981.wt
-rw------- 1 mongodb nogroup  16K Nov 20 03:01 index-7-714744392904966981.wt
-rw------- 1 mongodb nogroup  16K Nov 20 03:01 index-9-714744392904966981.wt
root@mongod-2:~# 

And their SHA1

 

root@mongod-2:/mongo/rs0_6/local# find . -type f -exec sha1sum {} \;
5bd422b5f3e0bce89c1e56327c6f05cf2b125cee  ./collection-0-714744392904966981.wt
3a974266d660c996aef5327b8b7e8b69fbdaf7f6  ./index-1-714744392904966981.wt
ef338c4b8d25224f30566e1ade30a1ca9eb26866  ./collection-2-714744392904966981.wt
ef338c4b8d25224f30566e1ade30a1ca9eb26866  ./index-3-714744392904966981.wt
fd2592fd13660ad11bca8f65579cdd0de5de001e  ./collection-4-714744392904966981.wt
d30d746858bde08873e4ce04d61877811b3ba5f4  ./index-5-714744392904966981.wt
3c985315d6cb096020609164ee53188c0e22e52d  ./collection-6-714744392904966981.wt
41129285edc0117139d365c0b63313afa0a634d8  ./index-7-714744392904966981.wt
d5c669006a44bfa50cbd42763ddc3a2aa8c02782  ./collection-8-714744392904966981.wt
82a25b11b9c412f69d810fda82f4563cf8b0f176  ./index-9-714744392904966981.wt

Now SHA1s of files in mongod-1::local.

root@mongod-1:/mongo/rs0_8/local# find . -type f -exec sha1sum {} \;
a5ee9865a488720fefae8fb50d93dc59f02f5085  ./collection-0-8568352613365886055.wt
aed1c55406f8c04f2d1ba66b1328fdd5c10dccf0  ./index-1-8568352613365886055.wt
252b62bf3cd48434769cc77ca4a17255ec5b4b28  ./collection-2-8568352613365886055.wt
ac3c642b06a213a4897508d83087fbdf8e7bad71  ./index-3-8568352613365886055.wt
ac469e36e1f2a2d7b09aaa557939d6437a24321e  ./collection-4-8568352613365886055.wt
f1d054442f0b15814000d0cbfce3336f43057e67  ./index-5-8568352613365886055.wt
000adfebaa328c3e85a89627b6fa08b8b7eab88d  ./collection-6-8568352613365886055.wt
41129285edc0117139d365c0b63313afa0a634d8  ./index-7-8568352613365886055.wt
b467ba25df451e5a67404a9fcf3199e31d7b1f59  ./collection-8-8568352613365886055.wt
8950463508ba036aff8b2657f5b1dc247a43d259  ./index-9-8568352613365886055.wt
9b0e624ab42c8047836571c69d687191660925fc  ./collection-82-8568352613365886055.wt
9175a4fd5af37e93d966b56a25a96087153239da  ./index-83-8568352613365886055.wt
d2ee304f279051fb1f3d2700e4e58caa2339541a  ./collection-14-8568352613365886055.wt
3f4049692686ee877457aeb581d853ff16ad6162  ./index-15-8568352613365886055.wt

 

Comment by Arun Lakhotia [X] [ 20/Nov/18 ]

Hi Danny:

I appreciate you and your team continuing to look at the issue.

Per your suggestion, I brought up the replica set again in a PSA configuration. This time too it crashed, but it turns out I made a mistake along the way that may provide some clue.

I have uploaded the log and relevant files. Here are their details:

 

f15ad69720901051e8038d49f85a058cbf14fd9d  mongod-1-2018-11-20T03-20-59.tgz
7b5778ef42a8e8f7afa691809ded69413c5f211e  mongod-2-2018-11-20T03-21-41.tgz
3f40af687662330147b8ea1294b600e3b37c7550  mongod-arbiter-0-2018-11-20T03-21-55.tgz

 

Here is what we did.

  • Used Cloud Manager for everything. Everything was done on freshly booted machines, but they had all the old DB data.
  • Machines: mongod-1 (PRIMARY), mongod-2 (SECONDARY), mongo-arbiter
  • Replica set name: RS1-2 (chose a different name from earlier, just in case).
  • It turns out that in the initial launch we had not mounted the DB drive on mongod-2.
  • The automation agent log shows attempts to launch that failed.
  • Then, when I mounted the drive, the mongod process on mongod-2 started.
  • It immediately crashed, and brought mongod-1 down with it.

 

Relevant IP addresses:

 

10.128.0.12 - mongod-1
10.128.0.19 - mongod-2
10.128.0.13 - mongod-arbiter

The error log says that the crash is due to Permission denied on files on both servers. I checked the permissions, and they look good to me.

 

Here is log from mongod-1.

 

2018-11-20T03:09:51.000+0000 E STORAGE  [ftdc] WiredTiger error (13) [1542683391:916][4448:0x7f89d88b4700], WT_SESSION.open_cursor: __posix_fs_exist, 189: /mongo/rs0_8//local/collection-16-8568352613365886055.wt: file-exist: stat: Permission denied Raw: [1542683391:916][4448:0x7f89d88b4700], WT_SESSION.open_cursor: __posix_fs_exist, 189: /mongo/rs0_8//local/collection-16-8568352613365886055.wt: file-exist: stat: Permission denied
2018-11-20T03:09:51.024+0000 E STORAGE  [thread179] WiredTiger error (13) [1542683391:24081][4448:0x7f89de8c0700], log-server: __directory_list_worker, 48: /mongo/rs0_8//journal: directory-list: opendir: Permission denied Raw: [1542683391:24081][4448:0x7f89de8c0700], log-server: __directory_list_worker, 48: /mongo/rs0_8//journal: directory-list: opendir: Permission denied

 

Here is log from mongod-2

 

2018-11-20T03:08:24.750+0000 E STORAGE [thread911] WiredTiger error (13) [1542683304:750754][4423:0x7f993a6a6700], log-server: __directory_list_worker, 48: /mongo/rs0_6//journal: directory-list: opendir: Permission denied Raw: [1542683304:750754][4423:0x7f993a6a6700], log-server: __directory_list_worker, 48: /mongo/rs0_6//journal: directory-list: opendir: Permission denied
2018-11-20T03:08:24.750+0000 E STORAGE [thread911] WiredTiger error (13) [1542683304:750857][4423:0x7f993a6a6700], log-server: __log_prealloc_once, 469: log pre-alloc server error: Permission denied Raw: [1542683304:750857][4423:0x7f993a6a6700], log-server: __log_prealloc_once, 469: log pre-alloc server error: Permission denied
2018-11-20T03:08:24.750+0000 E STORAGE [thread911] WiredTiger error (13) [1542683304:750868][4423:0x7f993a6a6700], log-server: __log_server, 1015: log server error: Permission denied Raw: [1542683304:750868][4423:0x7f993a6a6700], log-server: __log_server, 1015: log server error: Permission denied
2018-11-20T03:08:24.750+0000 E STORAGE [thread911] WiredTiger error (-31804) [1542683304:750877][4423:0x7f993a6a6700], log-server: __wt_panic, 523: the process must exit and restart: WT_PANIC: WiredTiger library panic Raw: [1542683304:750877][4423:0x7f993a6a6700], log-server: __wt_panic, 523: the process must exit and restart: WT_PANIC: WiredTiger library panic
2018-11-20T03:08:24.750+0000 F - [thread911] Fatal Assertion 50853 at src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp 409
2018-11-20T03:08:24.750+0000 F - [thread911]

 

 

The mongod-1:local directory. The error above says the collection-16 file exists, implying maybe it shouldn't.

 

root@mongod-1:~# ls -l /mongo/rs0_8/local/*
-rw------- 1 mongodb mongodb        53248 Nov 19 22:37 /mongo/rs0_8/local/collection-0-8568352613365886055.wt
-rw------- 1 mongodb mongodb        36864 Nov 20 02:33 /mongo/rs0_8/local/collection-14-8568352613365886055.wt
-rw------- 1 mongodb mongodb 124586246144 Nov 19 22:36 /mongo/rs0_8/local/collection-16-8568352613365886055.wt
-rw------- 1 mongodb mongodb        16384 Nov 19 22:36 /mongo/rs0_8/local/collection-2-8568352613365886055.wt
-rw------- 1 mongodb mongodb        36864 Nov 20 03:08 /mongo/rs0_8/local/collection-4-8568352613365886055.wt
-rw------- 1 mongodb mongodb        36864 Nov 19 22:36 /mongo/rs0_8/local/collection-6-8568352613365886055.wt
-rw------- 1 mongodb mongodb        36864 Nov 19 22:36 /mongo/rs0_8/local/collection-80-8568352613365886055.wt
-rw------- 1 mongodb mongodb        36864 Nov 19 22:36 /mongo/rs0_8/local/collection-82-8568352613365886055.wt
-rw------- 1 mongodb mongodb        36864 Nov 19 22:36 /mongo/rs0_8/local/collection-8-8568352613365886055.wt
-rw------- 1 mongodb mongodb        16384 May  1  2018 /mongo/rs0_8/local/index-15-8568352613365886055.wt
-rw------- 1 mongodb mongodb        36864 Nov 19 22:37 /mongo/rs0_8/local/index-1-8568352613365886055.wt
-rw------- 1 mongodb mongodb        16384 May  1  2018 /mongo/rs0_8/local/index-3-8568352613365886055.wt
-rw------- 1 mongodb mongodb        16384 May  1  2018 /mongo/rs0_8/local/index-5-8568352613365886055.wt
-rw------- 1 mongodb mongodb        16384 Nov 19 22:36 /mongo/rs0_8/local/index-7-8568352613365886055.wt
-rw------- 1 mongodb mongodb        16384 Nov 19 22:36 /mongo/rs0_8/local/index-81-8568352613365886055.wt
-rw------- 1 mongodb mongodb        16384 May  1  2018 /mongo/rs0_8/local/index-83-8568352613365886055.wt
-rw------- 1 mongodb mongodb        16384 May  1  2018 /mongo/rs0_8/local/index-9-8568352613365886055.wt

 

 

And the mongod-1::journal directory. The error says permission was denied listing the directory. The processes were running as the mongodb user, a fact I further verified by the ownership of /tmp/mongod-6655.sock. So I do not understand why permission would be denied listing the directory.
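For completeness, a quick way to cross-check the effective user and group of the running process against the directory (a sketch; the pid lookup assumes a single mongod on the host):

pid=$(pgrep -x mongod)                        # assumes one mongod process on this host
ps -o user=,group= -p "$pid"                  # effective user and group of the process
sudo -u mongodb ls /mongo/rs0_8/journal       # can the mongodb user itself list the directory?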

 

root@mongod-1:~# ls -al /mongo/rs0_8//journal/
total 307220
drwx------  2 mongodb mongodb       110 Nov 19 22:36 .
drwxr-x--- 19 mongodb mongodb     16384 Nov 20 03:08 ..
-rw-------  1 mongodb nogroup 104857600 Nov 20 03:09 WiredTigerLog.0000062081
-rw-------  1 mongodb nogroup 104857600 Nov 19 22:36 WiredTigerPreplog.0000000001
-rw-------  1 mongodb nogroup 104857600 Nov 19 22:36 WiredTigerPreplog.0000000002

On mongod-2 the error is only about permission denied listing the journal directory. Here are permissions in that directory.

root@mongod-2:~# ls -l /mongo/rs0_6/journal/
total 307200
-rw------- 1 mongodb nogroup 104857600 Nov 20 03:07 WiredTigerLog.0000000001
-rw------- 1 mongodb nogroup 104857600 Nov 20 03:00 WiredTigerPreplog.0000000001
-rw------- 1 mongodb nogroup 104857600 Nov 20 03:00 WiredTigerPreplog.0000000002

On a whim I also computed the SHA1s of the two journal directories. It turns out the Preplog files on both servers are identical, though their timestamps differ. The timestamps roughly indicate when each server was started.

root@mongod-1:~#  find  /mongo/rs0_8/journal/ -type f -exec sha1sum {} \;
31c9a529772abec60ef3121227b4db3455cf17fc  /mongo/rs0_8/journal/WiredTigerLog.0000062081
4aebdfaa295761d1bce1f996be6051b3a6e9990f  /mongo/rs0_8/journal/WiredTigerPreplog.0000000001
4aebdfaa295761d1bce1f996be6051b3a6e9990f  /mongo/rs0_8/journal/WiredTigerPreplog.0000000002

root@mongod-2:~# find  /mongo/rs0_6/journal/ -type f -exec sha1sum {} \;
b86612183c99e587cd8a3b0e708c3ffa3486eccf  /mongo/rs0_6/journal/WiredTigerLog.0000000001
4aebdfaa295761d1bce1f996be6051b3a6e9990f  /mongo/rs0_6/journal/WiredTigerPreplog.0000000001
4aebdfaa295761d1bce1f996be6051b3a6e9990f  /mongo/rs0_6/journal/WiredTigerPreplog.0000000002
root@mongod-2:~# 

Comment by Danny Hatcher (Inactive) [ 19/Nov/18 ]

Hello Arun,

We've done some more investigation into the data you've provided thus far. We are still looking into why the oldest timestamp did not advance once the node applied all of its oplog. However, we believe that later 4.0.x versions have improved memory usage for situations such as this, so that you can get around the OOM. Would it be possible to try running the node in the replica set using the latest 4.0.4 binaries?
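(For reference, a sketch of one way to move a node to 4.0.4 on Ubuntu, assuming the mongodb-org apt packages are in use — an assumption, since Cloud Manager Automation normally manages the binaries itself:)

sudo apt-get update
sudo apt-get install -y mongodb-org=4.0.4 mongodb-org-server=4.0.4 \
    mongodb-org-shell=4.0.4 mongodb-org-mongos=4.0.4 mongodb-org-tools=4.0.4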

Thank you very much,

Danny

Comment by Danny Hatcher (Inactive) [ 16/Nov/18 ]

Hello Arun,

Yes, the setting forces read-only mode on the server for technical reasons.

We are seeing some unexplained activity on the server after the oplog batches have been applied, and we are investigating what could be causing it. As such, I am currently unable to state unequivocally that the data on the node is complete.

That being said, I believe that all the issues so far have dealt with recovering data, and all the data currently on the node should be safe (not corrupted). Do you have some way of validating the presence of the data on the client side? You may wish to run the validate command on the standalone, as it should touch all the data files.
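For example, a minimal sketch run from the shell against the standalone that validates every collection (port 6666 is the port shown in your logs; treat it as a placeholder):

mongo --port 6666 --quiet --eval '
  db.getMongo().getDBNames().forEach(function (dbName) {
    var d = db.getSiblingDB(dbName);
    d.getCollectionNames().forEach(function (coll) {
      var res = d.runCommand({ validate: coll });
      print(dbName + "." + coll + " valid: " + res.valid);
    });
  });
'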

Thank you,

Danny

Comment by Arun Lakhotia [X] [ 16/Nov/18 ]

Uploaded logs and diagnostics. Configuration – standalone with --setParameter recoverFromOplogAsStandAlone=1.

 

7411370a7d5a6ae0d9134d5608dd4ad4a1f86050 mongod-0-2018-11-16T16-39-11.tgz

 

Comment by Arun Lakhotia [X] [ 16/Nov/18 ]

Hi Danny:

Re: Saving all the data. 

It's a matter of trade-off I need to make.

a) I have over 2 years of data. The oplog replay is for about a month.

b) For over a week now I have completely shut down my entire system – which analyzes a stream of malware I get on a daily basis. Prior to that the system was flaky, so I figured I should solve the core problem. As it is, I do not know the state of the data. (My database schema has not been designed to account for intermittent failures.)

Which means, if the choice is between losing a month's worth of data to get back to reliable operation today, or waiting an indefinite amount of time, I will choose losing the oplog data.

However, at this time we have not ruled out the possibility that the data storage itself is corrupted. I have shut down all queries (and especially insertions) against the data.

So I am anxious to know whether the primary data storage has any issues.

------

Re: Status from using the --recoverFromOplogAsStandalone switch.

First, for anyone else following this thread: depending on your mongod version, you may need to use --setParameter recoverFromOplogAsStandAlone=1.
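For illustration, an invocation of that form might look like the following (the dbpath and port here are placeholders, not the actual values used on this deployment):

mongod --dbpath /path/to/dbpath --port 6666 --setParameter recoverFromOplogAsStandAlone=1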

So far the server has been up and has not crashed, under no activity of any kind.

------

Observation:

The banner message splashed by the server after restarting with the switch says it's running in readOnly mode. I take that to mean the DB will not permit any inserts. If so, this mode is good for getting the data out, but not much else.

 

2018-11-16T09:05:24.293+0000 W REPL [initandlisten] Setting mongod to readOnly mode as a result of specifying 'recoverFromOplogAsStandalone'.
2018-11-16T09:05:24.370+0000 I STORAGE [initandlisten] 
2018-11-16T09:05:24.372+0000 I STORAGE [initandlisten] ** WARNING: mongod started without --replSet yet 1 documents are present in local.system.replset.
2018-11-16T09:05:24.372+0000 I STORAGE [initandlisten] ** Database contents may appear inconsistent with the oplog and may appear to not contain
2018-11-16T09:05:24.372+0000 I STORAGE [initandlisten] ** writes that were visible when this node was running as part of a replica set.
2018-11-16T09:05:24.372+0000 I STORAGE [initandlisten] ** Restart with --replSet unless you are doing maintenance and no other clients are connected.
2018-11-16T09:05:24.372+0000 I STORAGE [initandlisten] ** The TTL collection monitor will not start because of this.
2018-11-16T09:05:24.372+0000 I STORAGE [initandlisten] ** 
2018-11-16T09:05:24.372+0000 I STORAGE [initandlisten] For more info see http://dochub.mongodb.org/core/ttlcollections
2018-11-16T09:05:24.372+0000 I STORAGE [initandlisten] 
2018-11-16T09:05:24.414+0000 I NETWORK [initandlisten] waiting for connections on port 6666

 

MY PLAN:

  • I will shut down the server and post the logs. I am not sure if that will create new diagnostic.data, but I will upload the most recent anyway.
  • I will run the server standalone and open it up to normal processing for a few hours. In the process I will be exposing myself to some loss of data – either the new data that I am creating or the old data in the oplog. But I will get peace of mind if I can rule out major corruption of the data.

 

 

Comment by Danny Hatcher (Inactive) [ 16/Nov/18 ]

Hello Arun,

That flag is actually relatively new with MongoDB 4.0: in previous versions we recovered from the journal; in 4.0 we journal the oplog and recover from the oplog after the journal.

As you noticed, starting the node as a standalone did not replay the oplog. Thus, the oplog there could still contain operations that have not been applied yet. We are operating under the assumption that those updates are valuable; if instead there has not been any real production data since the disk resizing, then it would be easy to recover from this state while losing some of those writes. However, we will continue assuming that you wish to save all of the data.

Starting the standalone with the --recoverFromOplogAsStandalone flag may cause you to run into the same OOM issue again. But we do think it's the next step. Regardless of the outcome, please upload fresh diagnostic.data to this ticket after the node has gone through the replay process.

Thank you,

Danny

Comment by Arun Lakhotia [X] [ 16/Nov/18 ]

Danny:

Thanks for the note of caution. I do have a backup of my two replicas. So I am good.

I did run the server standalone; it restarted in a jiffy. No oplog replay, and it stayed up long enough. But I didn't put it under any load – no queries, nothing.

Following up on your note, I also got this hint:

bruce.lucas, the standalone node will replay the oplog if started with --recoverFromOplogAsStandalone.

This escaped my mind when I restarted the server. (Been a busy day with meetings, and I got both the notes when I was in a meeting.)

So FWIW, I will do the standalone with the flag Bruce mentions.  As and when you have something new for me to try, I can bring my other two servers back up.

Arun

 

 

 

 

Comment by Danny Hatcher (Inactive) [ 15/Nov/18 ]

Hello Arun,

To keep you updated, we’ve been having an internal discussion on the best way forward and a few snippets have been getting through. Tess’s comment and Bruce’s graph were two of those. Unfortunately, the graph was generated by a tool that is not yet customer-available. There is a project in the works to make it so but it's still down-the-line.

It may be best to keep Automation disabled for the entire replica set until we get your cluster back into a healthy state. In its attempts to fix the problem it is possible that Automation may contradict some of your actions.

We are discussing the standalone idea as we are not sure if the content will still be pinned in cache if the node is no longer in the replica set. If you can hold off for a little, we'd like to make sure that going to a standalone is the best option before recommending it. However, if you need to get your system up as soon as possible, please make sure that you have the data files backed up to a separate location just in case.

We will let you know more as soon as we can. I appreciate your patience in working with us through this; getting to the root cause here will also help us support the rest of our customers.

Thank you,

Danny

Comment by Arun Lakhotia [X] [ 15/Nov/18 ]

Bruce: That is one cool graph, showing the timeline. Is that an in-house tool, or something available to users?

All: I ran the replica set again with just the PRIMARY and ARBITER – no SECONDARY. I also turned off any other communication with the server, so absolutely no queries.

After 3 hours of replaying oplog, and about 9 hours of staying up, the server crashed.

CAVEATS: I did find that the automation manager was not able to change the replica set config. Reason: the _id of the PRIMARY (mongod-0) in the replica set was different from the _id being given by the automation manager.

So I turned off automation and manually removed mongod-1 from the replica set. The good thing is that mongod-1 was powered off throughout, so mongod-0 was really by itself – except it was expecting mongod-1 in the replica set, so you see heartbeats to mongod-1 that fail.

I have uploaded the log and diagnostics of both mongod-0 and mongod-arbiter. Here are the SHA1s of the pushed files.
 

6c09c42dacb8c0860b850dfe80102ef37a4058b8 mongod-0-2018-11-15T16-18-15.tgz 
5a603937fe5c7a518100d4346c9fb5a7b07883ac mongod-arbiter-0-2018-11-15T16-21-47.tgz

 
 
In the diagnostic data I have kept metrics.interim and the most recent metrics.$TIMESTAMP file. In the logs, I have kept only the recent entries.
 
I have a crude script to collect the necessary data and push it to the server. If it's of any value, I'd be happy to share.
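For illustration, a minimal sketch of what such a collection step could look like (log and dbpath locations are placeholders, and the actual upload step is omitted):

host=$(hostname)
stamp=$(date -u +%Y-%m-%dT%H-%M-%S)
out="${host}-${stamp}.tgz"
tar czf "$out" /path/to/mongodb.log* /path/to/dbpath/diagnostic.data   # gather logs and diagnostic.data
sha1sum "$out"                                                         # checksum to post alongside the upload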
 
MY PLAN:
 

  • Run the server standalone. Do it manually (no cloud manager). Run with --repair (see the sketch below).
  • Disconnect any access to the server (no queries).
  • Wait 9 hours.
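For reference, a minimal sketch of such a repair run (the dbpath is a placeholder for whatever this deployment actually uses):

mongod --dbpath /path/to/dbpath --repair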
     
     
Comment by Arun Lakhotia [X] [ 14/Nov/18 ]

I received a note from Tess Avitable. I don't see it in the ticket here. For reference, here it is:

When read concern majority is disabled, we don't update the stable timestamp, but we do update the oldest timestamp, which controls how much history is pinned in memory.

 

Good observation. I had disabled "enableMajorityReadConcern" while shooting in the dark, trying to do something. I found a reference to the property on Stack Exchange, and then found this description in the docs. I thought I'd give it a chance.

you can disable read concern "majority" to prevent the storage cache pressure from immobilizing a deployment with a three-member primary-secondary-arbiter (PSA) architecture

I figured my deployment was immobilized, so I gave it a try.

I have now re-enabled it (by deleting the property in Cloud Manager, so it reverts to the default).

My plan is still to bring up the PRIMARY first, let it run for some time, and then bring up the SECONDARY.

 

 

Comment by Arun Lakhotia [X] [ 14/Nov/18 ]

Kelsey:

Thanks for the upload location.

I have uploaded the files from my CURRENT configuration. However, there may be a major issue in using this data: yesterday we "disk copied" the PRIMARY (mongod-0) to create the SECONDARY (mongod-1), which means a good bit of the history on mongod-1 is from mongod-0.

FWIW, I have also included mongod.log files from around October 22. That is when we resized the disk and the problems started. In case it is useful, the oplog replay starts applying oplogs from October 17.

I do have a copy of the old mongod-1. I'd need to get the data out from it for you.

Here are the sha1sums of the files I uploaded today.

9a18047b359ca3b332e0450c208acbb802f6e1e8 mongod-1.tgz
bdaf9bb331288ac5ccf046d99ee64fa9a3bc42d1 mongod-0.tgz
e8de78c5e0053b4992f06a5f98a7b1ddd13e2334 mongod-arbiter-0.tgz

EDIT: Uploaded

5778a9a467a89314c1cb0e9d596d43fe4bca060e mongod-2.tgz

 

RECAP: My early configuration where this problem started had mongod-0 as PRIMARY and mongod-2 as SECONDARY (but at that time it was called mongod-1).

The configuration setup yesterday had mongod-0 as PRIMARY and mongod-1 as SECONDARY (where mongod-1 was created by disk copy from mongod-0).

 

Arun

 

Comment by Kelsey Schubert [ 14/Nov/18 ]

Hi DrArunL,

Would you please upload a more recent archive of diagnostic.data directory as well as the complete logs for the past three days for all the nodes in the replica set (including the arbiter)? I've created a secure upload portal for you to use to provide these files to help you get around the size limitations of attachments to JIRA.

Thanks,
Kelsey

Comment by Arun Lakhotia [X] [ 14/Nov/18 ]

The excitement was short-lived. The SECONDARY (mongod-1), created by copying the PRIMARY (mongod-0), crashed due to a memory allocation error.

2018-11-14T02:50:44.488+0000 E STORAGE [WTCheckpointThread] WiredTiger error (12) [1542163844:402387][22365:0x7f2e68ca8700], file:WiredTigerLAS.wt, WT_CURSOR.next: memory allocation of 27665 bytes failed: Cannot allocate memory Raw: [1542163844:402387][22365:0x7f2e68ca8700], file:WiredTigerLAS.wt, WT_CURSOR.next: memory allocation of 27665 bytes failed: Cannot allocate memory

This then triggers cascading errors, and the server restarts.

I have attached the recent log. Extracted only the lines between two SERVER RESTARTED lines.

When the server restarts, I see that it goes through "replaying stored operations", which takes almost 3 hours. As I write, mongod-1 is going through that sequence again. I am curious if there is a way to make the server remember the stored operations and not replay them again.

EDIT: It turns out both the PRIMARY (mongod-0) and SECONDARY (mongod-1) crashed at almost exactly the same time (within a minute of each other). I have attached logs of both.

The two log files show a lot of network connections. The following may be helpful to understand the connections.

10.128.0.10 - mongod-0 mongod-0.c.virusbattle-1.internal
10.128.0.12 - mongod-1 mongod-1.c.virusbattle-1.internal
10.128.0.13 - arbiter mongod-arbiter-0.c.virusbattle-1.internal

 My plan:

Turn off the secondary. Let the primary run by itself for a day, maybe two. See if it croaks. Whether it does or doesn't should help isolate the issue further.

 

 

 

Comment by Arun Lakhotia [X] [ 14/Nov/18 ]

I went through the steps. I now have the PSA configuration back up, in that the mongod processes are running and accepting connections (as indicated in the mongod.log files).

However, Cloud Manager shows there is a problem. It says "Wait until this process is primary" (for mongod-0). I looked in the mongod.log file, and it shows that the process has indeed transitioned to PRIMARY. And mongod-1 also says that mongod-0 is PRIMARY.

There is, however, an E (ERROR) message in the mongod-0 log file and a recurring I (INFO) message in the mongod-1 log file which may hold some clue. The mongod-0 error is also referenced in the "View Details" error log of the automation agent.

Here is one instance of the error on mongod-0 (PRIMARY)

2018-11-13T23:45:20.660+0000 E REPL [conn5585] replSetReconfig got NewReplicaSetConfigurationIncompatible: New and old configurations both have members with host of mongod-0.c.virusbattle-1.internal:6655 but in the new configuration the _id field is 0 and in the old configuration it is 8 for replica set rs0 while validating { _id: "rs0", version: 30, members: [ { priority: 2.0, votes: 1, arbiterOnly: false, buildIndexes: true, _id: 0, host: "mongod-0.c.virusbattle-1.internal:6655", hidden: false, tags: {}, slaveDelay: 0 }, { hidden: false, host: "mongod-arbiter-0.c.virusbattle-1.internal:6655", tags: {}, _id: 2, priority: 0.0, buildIndexes: true, arbiterOnly: true, votes: 1, slaveDelay: 0 }, { _id: 3, tags: {}, arbiterOnly: false, hidden: false, host: "mongod-1.c.virusbattle-1.internal:6655", priority: 1.0, votes: 1, buildIndexes: true, slaveDelay: 0 } ], settings: { replicaSetId: ObjectId('5a4c05bb0f58fd8f4ca635e0'), getLastErrorDefaults: { w: 1, wtimeout: 0 }, getLastErrorModes: {}, electionTimeoutMillis: 10000.0, chainingAllowed: true, catchUpTimeoutMillis: -1.0, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10.0, catchUpTakeoverDelayMillis: 30000.0 }, protocolVersion: 1 }

 

And the INFO entries in mongod-1 (SECONDARY) – for reference, this server was restarted at 20:16, and these are some of the messages.

 

2018-11-13T23:36:03.194+0000 I NETWORK [LogicalSessionCacheRefresh] Starting new replica set monitor for rs0/mongod-0.c.virusbattle-1.internal:6655,mongod-1.c.virusbattle-1.internal:6655
2018-11-13T23:36:03.196+0000 I NETWORK [LogicalSessionCacheRefresh] Successfully connected to mongod-0.c.virusbattle-1.internal:6655 (11 connections now open to mongod-0.c.virusbattle-1.internal:6655 with a 0 second timeout)
2018-11-13T23:36:03.196+0000 I NETWORK [LogicalSessionCacheReap] Successfully connected to mongod-0.c.virusbattle-1.internal:6655 (12 connections now open to mongod-0.c.virusbattle-1.internal:6655 with a 0 second timeout)
2018-11-13T23:36:03.196+0000 I NETWORK [LogicalSessionCacheRefresh] scoped connection to mongod-0.c.virusbattle-1.internal:6655 not being returned to the pool
2018-11-13T23:36:03.196+0000 I NETWORK [LogicalSessionCacheReap] scoped connection to mongod-0.c.virusbattle-1.internal:6655 not being returned to the pool

 

Comment by Danny Hatcher (Inactive) [ 13/Nov/18 ]

Hello Arun,

That's actually our recommended strategy if you think that MongoDB's initial sync process is too slow. It's mentioned in our documentation on resyncing a member of a replica set. The procedure is:

1. Stop the Secondary process
2. Remove all the files under the Secondary's dbpath
3. Stop the Primary process
4. Copy the data files from the Primary's dbpath to the Secondary's dbpath
5. Start the Primary
6. Start the Secondary

Once the Secondary is started again, it should apply everything in the oplog that it didn't have from the data copy (inserts/updates between steps 5 and 6). Then it will be a healthy member again.

Please note that because you are using Cloud Manager Automation you must first remove these processes from automation before performing the above procedure. Otherwise, as you mentioned in your initial description, Automation will automatically try to start the processes if it notices that they are down.
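For illustration, a rough shell sketch of steps 1–6 above. The hostnames, dbpaths, and the use of a systemd service are assumptions; adapt them to however these mongod processes are actually started, and remove both processes from Automation first, as noted above:

# on the SECONDARY
sudo systemctl stop mongod                                   # 1. stop the Secondary
sudo rm -rf /path/to/secondary-dbpath/*                      # 2. remove its data files

# on the PRIMARY
sudo systemctl stop mongod                                   # 3. stop the Primary
sudo rsync -a /path/to/primary-dbpath/ secondary-host:/path/to/secondary-dbpath/   # 4. copy the data files
sudo systemctl start mongod                                  # 5. start the Primary

# on the SECONDARY again
sudo chown -R mongodb:mongodb /path/to/secondary-dbpath      # ensure the copy is owned by the mongodb user
sudo systemctl start mongod                                  # 6. start the Secondary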

Please let me know if you have any issues with this.

Thank you,

Danny

Comment by Arun Lakhotia [X] [ 13/Nov/18 ]

Hi Danny:

Thanks for the suggestion about initial sync. That seems like a reset button for mongod. I am a newbie at this, and will try it tomorrow morning.

In regard to making a replica, what if I stopped my PRIMARY, copied the drive, and used that to start my SECONDARY?

I am guessing that's a no-no, but I thought I'd ask. I think copying the disk may be faster than an initial sync.

 

 

Comment by Danny Hatcher (Inactive) [ 12/Nov/18 ]

Hello Arun,

Your summary makes sense to me. Have you attempted to perform an initial sync since the problem on 4.0.1? If not, could you do so on this node that is currently experiencing the fatal assertion? It's possible that the oplog for it is in an unrecoverable state.

Thank you,

Danny
