Loading...

XML

Word

Printable

JSON

Type: Bug
Resolution: Incomplete
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 4.0.3
Component/s: None
Labels:
None

Operating System:
ALL
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

It's a replica set, running on version 4.0.3 in k8s.

I would like to both find out how to recover from this and actually understand how did it end up in this state in the first place.

My understanding is that after any applied operation (on the primary) that change is reflected on the oplog as well. I believe these happen atomically, so client gets an OK write response if both of these succeed.

However, this does not mean the data is persisted yet (nor the operations in the journal, if they are less than 128 KB #anchor 1[]). It's persisted when the next checkpoint kicks in. So when a hard shutdown occur, and there is some data in the journal buffer, that's basically lost.

Which is in correspondence with the docs: 'In between write operations, while the journal records remain in the WiredTiger buffers, updates can be lost following a hard shutdown of mongod.'

Going back to the log `Applied op { : Timestamp(1539695503, 1) } not found. Top of oplog is { : Timestamp(1539695466, 1) }`. Now, I'm not 100% where is the 'applied op' coming from (is it from the last checkpoint? - probably yes), but it's an operation that is newer than the top of the oplog, which means that when it was persisted, the related document in the oplog wasn't.

My rough guess is that the hard shutdown occurred exactly at the point of the checkpoint, which then has succeeded only partially.

I've been a user of mongo for years now, but never really had the need to dig deeper into its working. Any help is greatly appreciated!

```

...

2018-10-24T09:08:43.547+0000 I FTDC [initandlisten] Initializing full-time diagnostic data capture with directory '/data/db/diagnostic.data'
2018-10-24T09:08:43.550+0000 I REPL [initandlisten] Rollback ID is 9
2018-10-24T09:08:43.550+0000 I REPL [initandlisten] Recovering from stable timestamp: Timestamp(1539695503, 1) (top of oplog: { ts: Timestamp(1539695466, 1), t: 236 }, appliedThrough: { ts: Timestamp(1539695503, 1), t: 237 }, TruncateAfter: Timestamp(0, 0))
2018-10-24T09:08:43.550+0000 I REPL [initandlisten] Starting recovery oplog application at the stable timestamp: Timestamp(1539695503, 1)
2018-10-24T09:08:43.550+0000 F REPL [initandlisten] Applied op { : Timestamp(1539695503, 1) } not found. Top of oplog is { : Timestamp(1539695466, 1) }.
2018-10-24T09:08:43.550+0000 F - [initandlisten] Fatal Assertion 40313 at src/mongo/db/repl/replication_recovery.cpp 359
2018-10-24T09:08:43.550+0000 F - [initandlisten]

***aborting after fassert() failure

```

1- https://docs.mongodb.com/manual/core/journaling/#journaling-process

- - Sort By Name
  - Sort By Date
  - Ascending
  - Descending
  - Thumbnails
  - List
  - Download All

mongo.log
10 kB
Oct 24 2018 09:24:16 AM UTC

Assignee:: Danny Hatcher (Inactive)
Reporter:: Zoltan Mark Bodor [X]
Participants:: Danny Hatcher, Joachim Aumann, Zoltan Mark Bodor [X]
Votes:: 0 Vote for this issue
Watchers:: 11 Start watching this issue

Created:: Oct 24 2018 10:32:56 AM UTC
Updated:: Jan 03 2019 08:32:13 PM UTC
Resolved:: Nov 14 2018 03:25:42 PM UTC

Details

Description

Attachments

Attachments

Forms

Activity

People

Dates