[SERVER-54150] Recovery from a stable checkpoint should fassert on oplog application failures Created: 29/Jan/21  Updated: 29/Oct/23  Resolved: 26/Apr/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.1.0-rc0, 6.0.6, 5.0.18, 7.0.0-rc1

Type: Task
Priority: Major - P3
Reporter: Lingzhi Deng
Assignee: Moustafa Maher
Resolution: Fixed
Votes: 0
Labels: repl-shortlist
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
is depended on by SERVER-75865 check constraints for prepared transa... Closed
Problem/Incident
Related
related to SERVER-21700 Do not relax constraints during stead... Closed
is related to SERVER-71490 Move steady state replication constra... Closed
is related to SERVER-75865 check constraints for prepared transa... Closed
is related to SERVER-46221 Remove oplogApplicationEnforcesSteady... Open
Assigned Teams:
Replication
Backwards Compatibility: Minor Change
Backport Requested:
v7.0, v6.3, v6.2, v6.0, v5.0
Sprint: Repl 2023-03-06, Repl 2023-03-20, Repl 2023-04-03, Repl 2023-05-01
Participants:
Linked BF Score: 135

 Description   

Currently, we ignore certain errors when applying oplog entries in Mode::kRecovering for idempotency (e.g. this). This makes sense for initial sync, eMRC=false, and rollback via refetch. But when we are recovering from a stable checkpoint, oplog application should be able to finish without any errors, so we should fassert on oplog application errors as we do in secondary oplog application.
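
To illustrate the proposed policy, here is a minimal Python sketch. It is not the server's C++ implementation. The names ApplyContext, IGNORABLE_ERRORS, FatalAssertion and on_apply_error are hypothetical. The error set is only an example, using the NamespaceNotFound case mentioned in the comments on this ticket.

```python
from enum import Enum, auto


class ApplyContext(Enum):
    """Situations in which oplog entries are applied (illustrative names)."""
    INITIAL_SYNC = auto()
    EMRC_FALSE = auto()
    ROLLBACK_VIA_REFETCH = auto()
    STABLE_CHECKPOINT_RECOVERY = auto()


# Contexts the ticket lists as legitimately needing relaxed error handling.
RELAXED_CONTEXTS = frozenset({
    ApplyContext.INITIAL_SYNC,
    ApplyContext.EMRC_FALSE,
    ApplyContext.ROLLBACK_VIA_REFETCH,
})

# Hypothetical example set; the real list of ignorable errors is not given here.
IGNORABLE_ERRORS = frozenset({"NamespaceNotFound"})


class FatalAssertion(Exception):
    """Stand-in for an fassert, which would terminate the process."""


def on_apply_error(context, error_code):
    """Ignore known idempotency errors only in relaxed contexts; otherwise fail hard."""
    if context in RELAXED_CONTEXTS and error_code in IGNORABLE_ERRORS:
        return "ignored"
    raise FatalAssertion(f"{error_code} during {context.name}")
```

Under these assumptions, the same error code is tolerated during initial sync but is fatal when recovering from a stable checkpoint.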

Skip fassert on oplog application failures in the selective restore process:
The first step of selective restore is to start the shard node as a replica set so that it applies the oplog from the snapshot taken for the restore. The oplog may therefore still contain entries that reference unrestored collections, and applying those entries will fail (see this test). The selective restore process runs with the --restore flag enabled, so in that case we need to skip the fassert on oplog application failures. (See this PR for implementation)
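
The selective restore exemption can be sketched as a single predicate. This is a rough Python illustration only. The function and parameter names are hypothetical. Only the --restore flag itself comes from the description above.

```python
def should_fassert_on_apply_error(from_stable_checkpoint: bool,
                                  restore_enabled: bool) -> bool:
    """Return True when an oplog application error should be fatal.

    from_stable_checkpoint: recovery started from a stable checkpoint.
    restore_enabled: the node was started with --restore (selective restore),
        where entries may reference collections that were not restored.
    """
    return from_stable_checkpoint and not restore_enabled
```

With this shape, a node started with --restore keeps tolerating these failures, while an ordinary recovery from a stable checkpoint treats them as fatal.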

 



 Comments   
Comment by Githook User [ 27/Apr/23 ]

Author:

{'name': 'Moustafa Maher Khalil', 'email': 'm.maher@mongodb.com', 'username': 'moustafamaher'}

Message: SERVER-54150 Recovery from a stable checkpoint should fassert on oplog application failures
Branch: v7.0
https://github.com/mongodb/mongo/commit/2aeab9063cc965e6198a20be4818bff40a4fe90f

Comment by Githook User [ 26/Apr/23 ]

Author:

{'name': 'Moustafa Maher Khalil', 'email': 'm.maher@mongodb.com', 'username': 'moustafamaher'}

Message: SERVER-54150 Recovery from a stable checkpoint should fassert on oplog application failures
Branch: v5.0
https://github.com/mongodb/mongo/commit/933edd562c2e658b0a0bcb9f4d13b0179a7c87b2

Comment by Githook User [ 26/Apr/23 ]

Author:

{'name': 'Moustafa Maher Khalil', 'email': 'm.maher@mongodb.com', 'username': 'moustafamaher'}

Message: SERVER-54150 Recovery from a stable checkpoint should fassert on oplog application failures
Branch: v6.0
https://github.com/mongodb/mongo/commit/1deb4f36dd1845450929fb692965b14838007a08

Comment by Githook User [ 26/Apr/23 ]

Author:

{'name': 'Moustafa Maher Khalil', 'email': 'm.maher@mongodb.com', 'username': 'moustafamaher'}

Message: SERVER-54150 Recovery from a stable checkpoint should fassert on oplog application failures
Branch: master
https://github.com/mongodb/mongo/commit/269961220e0a5b954a2e0d878c82bd58068030ae

Comment by Githook User [ 12/Apr/23 ]

Author:

{'name': 'Moustafa Maher Khalil', 'email': 'm.maher@mongodb.com', 'username': 'moustafamaher'}

Message: Revert "SERVER-54150 Recovery from a stable checkpoint should fassert on oplog application failures"

This reverts commit d8d5582fd381ed87b8463782747399a6c1965892.
Branch: v7.0
https://github.com/mongodb/mongo/commit/1e5272dcce8ab47ce0ee5857845f50b2549a76fb

Comment by Moustafa Maher [ 12/Apr/23 ]

We need to investigate this error before committing this change.

Comment by Githook User [ 12/Apr/23 ]

Author:

{'name': 'Moustafa Maher Khalil', 'email': 'm.maher@mongodb.com', 'username': 'moustafamaher'}

Message: Revert "SERVER-54150 Recovery from a stable checkpoint should fassert on oplog application failures"

This reverts commit d8d5582fd381ed87b8463782747399a6c1965892.
Branch: master
https://github.com/mongodb/mongo/commit/4f91163fae04825430a7396443bf5fae7813f75e

Comment by Githook User [ 12/Apr/23 ]

Author:

{'name': 'Moustafa Maher Khalil', 'email': 'm.maher@mongodb.com', 'username': 'moustafamaher'}

Message: Revert "SERVER-54150 Recovery from a stable checkpoint should fassert on oplog application failures"

This reverts commit 7822344e72464810f6614d3491b86c7d0971b1bd.
Branch: v6.0
https://github.com/mongodb/mongo/commit/9cdf8d188da5c964807c8474c2a3dc421b45a50c

Comment by Githook User [ 12/Apr/23 ]

Author:

{'name': 'Moustafa Maher Khalil', 'email': 'm.maher@mongodb.com', 'username': 'moustafamaher'}

Message: Revert "SERVER-54150 Recovery from a stable checkpoint should fassert on oplog application failures"

This reverts commit 347f059c439dfafe9e8a34365b4c5e7a17c22acf.
Branch: v5.0
https://github.com/mongodb/mongo/commit/e19b996a2d05a3a1a8f361b73f5d50fa36a451b3

Comment by Githook User [ 10/Apr/23 ]

Author:

{'name': 'Moustafa Maher Khalil', 'email': 'm.maher@mongodb.com', 'username': 'moustafamaher'}

Message: SERVER-54150 Recovery from a stable checkpoint should fassert on oplog application failures

(cherry picked from commit d8d5582fd381ed87b8463782747399a6c1965892)
(cherry picked from commit 4b9fcc952fa5193c42a832bb33152ba0da92068d)
Branch: v6.0
https://github.com/mongodb/mongo/commit/7822344e72464810f6614d3491b86c7d0971b1bd

Comment by Githook User [ 10/Apr/23 ]

Author:

{'name': 'Moustafa Maher Khalil', 'email': 'm.maher@mongodb.com', 'username': 'moustafamaher'}

Message: SERVER-54150 Recovery from a stable checkpoint should fassert on oplog application failures

(cherry picked from commit d8d5582fd381ed87b8463782747399a6c1965892)
(cherry picked from commit 4b9fcc952fa5193c42a832bb33152ba0da92068d)
Branch: v5.0
https://github.com/mongodb/mongo/commit/347f059c439dfafe9e8a34365b4c5e7a17c22acf

Comment by Githook User [ 28/Mar/23 ]

Author:

{'name': 'Moustafa Maher Khalil', 'email': 'm.maher@mongodb.com', 'username': 'moustafamaher'}

Message: SERVER-54150 Recovery from a stable checkpoint should fassert on oplog application failures
Branch: master
https://github.com/mongodb/mongo/commit/d8d5582fd381ed87b8463782747399a6c1965892

Comment by Opal Hoyt [ 06/Feb/23 ]

Consider how far this can be backported

Comment by Judah Schvimer [ 02/Feb/23 ]

While we should fassert in testing, we might want to be careful and first introduce this as a log message in production and change it to an fassert in production after some confidence that we truly do not need to ignore these errors in any cases. See SERVER-71490 and SERVER-46221 for similar efforts.
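
A rough sketch of that staged rollout, in Python for brevity: errors are fatal when a testing switch is on and are only logged otherwise. The testing parameter, FatalAssertion and report_apply_error are hypothetical names. How the server would actually distinguish testing from production is not stated in this ticket.

```python
import logging


class FatalAssertion(Exception):
    """Stand-in for an fassert, which would terminate the process."""


def report_apply_error(error_code: str, testing: bool,
                       logger: logging.Logger) -> str:
    """Fail hard in testing; in production, log the error and continue."""
    message = f"oplog application failed during recovery: {error_code}"
    if testing:
        raise FatalAssertion(message)
    logger.error(message)
    return message
```

Once the logged errors show that nothing needs to be ignored in practice, the production branch could be switched to raise as well.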

Comment by Lingzhi Deng [ 02/Feb/23 ]

Another example: we ignore NamespaceNotFound errors for CRUD operations during startup recovery.

Comment by Judah Schvimer [ 02/Feb/23 ]

We should reconsider this as an extra safeguard against data corruption.

Generated at Thu Feb 08 05:32:48 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.