-
Type: Bug
-
Resolution: Incomplete
-
Priority: Major - P3
-
None
-
Affects Version/s: 4.0.3
-
Component/s: None
-
None
-
ALL
It's a replica set, running on version 4.0.3 in k8s.
I would like to both find out how to recover from this and actually understand how did it end up in this state in the first place.
My understanding is that after any applied operation (on the primary) that change is reflected on the oplog as well. I believe these happen atomically, so client gets an OK write response if both of these succeed.
However, this does not mean the data is persisted yet (nor the operations in the journal, if they are less than 128 KB #anchor 1[]). It's persisted when the next checkpoint kicks in. So when a hard shutdown occur, and there is some data in the journal buffer, that's basically lost.
Which is in correspondence with the docs: 'In between write operations, while the journal records remain in the WiredTiger buffers, updates can be lost following a hard shutdown of mongod.'
Going back to the log `Applied op { : Timestamp(1539695503, 1) } not found. Top of oplog is { : Timestamp(1539695466, 1) }`. Now, I'm not 100% where is the 'applied op' coming from (is it from the last checkpoint? - probably yes), but it's an operation that is newer than the top of the oplog, which means that when it was persisted, the related document in the oplog wasn't.
My rough guess is that the hard shutdown occurred exactly at the point of the checkpoint, which then has succeeded only partially.
I've been a user of mongo for years now, but never really had the need to dig deeper into its working. Any help is greatly appreciated!
```
...
2018-10-24T09:08:43.547+0000 I FTDC [initandlisten] Initializing full-time diagnostic data capture with directory '/data/db/diagnostic.data'
2018-10-24T09:08:43.550+0000 I REPL [initandlisten] Rollback ID is 9
2018-10-24T09:08:43.550+0000 I REPL [initandlisten] Recovering from stable timestamp: Timestamp(1539695503, 1) (top of oplog: { ts: Timestamp(1539695466, 1), t: 236 }, appliedThrough: { ts: Timestamp(1539695503, 1), t: 237 }, TruncateAfter: Timestamp(0, 0))
2018-10-24T09:08:43.550+0000 I REPL [initandlisten] Starting recovery oplog application at the stable timestamp: Timestamp(1539695503, 1)
2018-10-24T09:08:43.550+0000 F REPL [initandlisten] Applied op { : Timestamp(1539695503, 1) } not found. Top of oplog is { : Timestamp(1539695466, 1) }.
2018-10-24T09:08:43.550+0000 F - [initandlisten] Fatal Assertion 40313 at src/mongo/db/repl/replication_recovery.cpp 359
2018-10-24T09:08:43.550+0000 F - [initandlisten]
***aborting after fassert() failure
```
1- https://docs.mongodb.com/manual/core/journaling/#journaling-process