[SERVER-77278] Replication rollback of a dropDatabase oplog entries leaves the in-memory database closed on the primary but open on secondaries, leading to secondaries crashing on receipt of conflicting database name Created: 18/May/23  Updated: 29/Oct/23  Resolved: 24/May/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.1.0-rc0, 7.0.0-rc3, 6.0.11

Type: Bug Priority: Major - P3
Reporter: Dianna Hohensee (Inactive) Assignee: Dianna Hohensee (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
related to SERVER-74703 Primary running dropDatabase may fail... Closed
is related to SERVER-77281 Investigate removing use of the Datab... Backlog
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v7.0, v6.0
Sprint: Execution Team 2023-05-29
Participants:
Linked BF Score: 29

 Description   

A primary running dropDatabase first majority commits dropCollection oplog entries and then writes a final dropDatabase oplog entry, with which the in-memory database is closed atomically. However, if replication rollback goes back to right before the dropDatabase and after the dropCollection(s), this leaves the node with no in-memory database state. Whereas nodes that were secondaries when the dropDatabase was running on the primary will still have the database open in-memory.

This sets the replica set up to allow a createCollection on a conflicting database on the primary, and then crash on oplog application on the secondaries with a DatabaseDifferCase error.

Potential solution: 

  • Secondaries could check whether a database is empty before returning name conflict errors.
  • SERVER-74703 should be re-examined to see whether or not it is simple to change something back to fix this new bug, without undoing the fix for the previous bug.


 Comments   
Comment by Githook User [ 07/Sep/23 ]

Author:

{'name': 'Dianna Hohensee', 'email': 'dianna.hohensee@mongodb.com', 'username': ''}

Message: SERVER-77278 Secondaries close empty database on receipt of a conflicting createCollection database name to handle primary dropDatabase oplog rollback

(cherry picked from commit 7d6513611f97a60746af4c9ebd7e1cec2d4b844a)
Branch: v6.0
https://github.com/mongodb/mongo/commit/a247065b7edf958c9fcbb672cd97e5648cdbe9ca

Comment by Githook User [ 24/May/23 ]

Author:

{'name': 'Dianna Hohensee', 'email': 'dianna.hohensee@mongodb.com', 'username': 'DiannaHohensee'}

Message: SERVER-77278 Secondaries close empty database on receipt of a conflicting createCollection database name to handle primary dropDatabase oplog rollback

(cherry picked from commit 7d6513611f97a60746af4c9ebd7e1cec2d4b844a)
Branch: v7.0
https://github.com/mongodb/mongo/commit/6481f5f4dd925395d8ac6271ee294cb6dc040d62

Comment by Githook User [ 24/May/23 ]

Author:

{'name': 'Dianna Hohensee', 'email': 'dianna.hohensee@mongodb.com', 'username': 'DiannaHohensee'}

Message: SERVER-77278 Secondaries close empty database on receipt of a conflicting createCollection database name to handle primary dropDatabase oplog rollback
Branch: master
https://github.com/mongodb/mongo/commit/7d6513611f97a60746af4c9ebd7e1cec2d4b844a

Comment by Dianna Hohensee (Inactive) [ 22/May/23 ]

So, interestingly, I've ascertained that this is not a hot BF because it is also broken in 6.0, not just 7.0.

Turns out that the test on which I reproduced the bug was actually broken from 4.3 to just recently during 7.0 development: Greg's work in SERVER-74703 actually unintentionally fixed the test. So my reproducer didn't catch anything earlier because the base test was not testing anything. The test looks like it was broken since v4.3. The test wasn't working properly because a failpoint got moved to after onDropDatabase is called, when the test was expecting to hang before the dropDatabase oplog entry is written. I adjusted the failpoint and can reproduce the crash in 6.0.

Comment by Dianna Hohensee (Inactive) [ 18/May/23 ]

I got this to reproduce modifying an existing replication JS test. The test hangs because I have a w:2 that never gets acknowledged (secondary crashes, other secondary isn't taking writes), but the logs show

[js_test:rollback_drop_database] 
[js_test:rollback_drop_database] 
[js_test:rollback_drop_database] [jsTest] ----
[js_test:rollback_drop_database] [jsTest] ~~~after step up node
[js_test:rollback_drop_database] [jsTest] ----
[js_test:rollback_drop_database] 
[js_test:rollback_drop_database] d20050| \{"t":{"$date":"2023-05-18T16:29:15.402+00:00"},"s":"I",  "c":"STORAGE",  "id":20320,   "ctx":"conn32","msg":"createCollection","attr":\{"namespace":"OLDDATABASE.afterRollback","uuidDisposition":"generated","uuid":{"uuid":{"$uuid":"8a3554d9-d148-4726-9067-e48b8ee35ce1"}},"options":{}}}
[js_test:rollback_drop_database] d20052| \{"t":{"$date":"2023-05-18T16:29:15.417+00:00"},"s":"I",  "c":"REPL",     "id":21079,   "ctx":"BackgroundSync","msg":"bgsync - stopReplProducer fail point enabled. Blocking until fail point is disabled."}
[js_test:rollback_drop_database] d20050| \{"t":{"$date":"2023-05-18T16:29:15.421+00:00"},"s":"I",  "c":"REPL",     "id":7360103, "ctx":"conn32","msg":"Wrote oplog entry for create","attr":\{"namespace":"OLDDATABASE.$cmd","uuid":{"uuid":{"$uuid":"8a3554d9-d148-4726-9067-e48b8ee35ce1"}},"opTime":\{"ts":{"$timestamp":{"t":1684427355,"i":3}},"t":3},"object":\{"create":"afterRollback","idIndex":{"v":2,"key":{"_id":1},"name":"_id_"}}}}
[js_test:rollback_drop_database] d20050| \{"t":{"$date":"2023-05-18T16:29:15.422+00:00"},"s":"I",  "c":"INDEX",    "id":20345,   "ctx":"conn32","msg":"Index build: done building","attr":\{"buildUUID":null,"collectionUUID":{"uuid":{"$uuid":"8a3554d9-d148-4726-9067-e48b8ee35ce1"}},"namespace":"OLDDATABASE.afterRollback","index":"_id_","ident":"index-3--2789427611038464271","collectionIdent":"collection-2--2789427611038464271","commitTimestamp":\{"$timestamp":{"t":1684427355,"i":3}}}}
[js_test:rollback_drop_database] d20051| \{"t":{"$date":"2023-05-18T16:29:15.424+00:00"},"s":"I",  "c":"REPL",     "id":7360109, "ctx":"ReplBatcher","msg":"Processing DDL command oplog entry in OplogBatcher","attr":\{"oplogEntry":{"oplogEntry":{"op":"c","ns":"OLDDATABASE.$cmd","ui":{"$uuid":"8a3554d9-d148-4726-9067-e48b8ee35ce1"},"o":\{"create":"afterRollback","idIndex":{"v":2,"key":{"_id":1},"name":"_id_"}},"ts":\{"$timestamp":{"t":1684427355,"i":3}},"t":3,"v":2,"wall":\{"$date":"2023-05-18T16:29:15.421Z"}}}}}
[js_test:rollback_drop_database] d20051| \{"t":{"$date":"2023-05-18T16:29:15.427+00:00"},"s":"I",  "c":"REPL",     "id":7360110, "ctx":"ReplWriterWorker-0","msg":"Applying DDL command oplog entry","attr":\{"oplogEntry":{"oplogEntry":{"op":"c","ns":"OLDDATABASE.$cmd","ui":{"$uuid":"8a3554d9-d148-4726-9067-e48b8ee35ce1"},"o":\{"create":"afterRollback","idIndex":{"v":2,"key":{"_id":1},"name":"_id_"}},"ts":\{"$timestamp":{"t":1684427355,"i":3}},"t":3,"v":2,"wall":\{"$date":"2023-05-18T16:29:15.421Z"}}},"oplogApplicationMode":"Secondary"}}
[js_test:rollback_drop_database] d20051| \{"t":{"$date":"2023-05-18T16:29:15.429+00:00"},"s":"E",  "c":"REPL",     "id":21262,   "ctx":"ReplWriterWorker-0","msg":"Failed command during oplog application","attr":\{"command":{"create":"afterRollback","idIndex":{"v":2,"key":{"_id":1},"name":"_id_"}},"db":"OLDDATABASE","error":\{"code":13297,"codeName":"DatabaseDifferCase","errmsg":"db already exists with different case already have: [olddatabase] trying to create [OLDDATABASE]"}}}
[js_test:rollback_drop_database] d20051| \{"t":{"$date":"2023-05-18T16:29:15.429+00:00"},"s":"F",  "c":"REPL",     "id":21237,   "ctx":"ReplWriterWorker-0","msg":"Error applying operation","attr":\{"oplogEntry":{"oplogEntry":{"op":"c","ns":"OLDDATABASE.$cmd","ui":{"$uuid":"8a3554d9-d148-4726-9067-e48b8ee35ce1"},"o":\{"create":"afterRollback","idIndex":{"v":2,"key":{"_id":1},"name":"_id_"}},"ts":\{"$timestamp":{"t":1684427355,"i":3}},"t":3,"v":2,"wall":\{"$date":"2023-05-18T16:29:15.421Z"}}},"error":" :: caused by :: DatabaseDifferCase: db already exists with different case already have: [olddatabase] trying to create [OLDDATABASE]"}}
[js_test:rollback_drop_database] d20051| \{"t":{"$date":"2023-05-18T16:29:15.430+00:00"},"s":"F",  "c":"REPL",     "id":21235,   "ctx":"OplogApplier-0","msg":"Failed to apply batch of operations","attr":\{"numOperationsInBatch":1,"firstOperation":{"oplogEntry":{"op":"c","ns":"OLDDATABASE.$cmd","ui":{"$uuid":"8a3554d9-d148-4726-9067-e48b8ee35ce1"},"o":\{"create":"afterRollback","idIndex":{"v":2,"key":{"_id":1},"name":"_id_"}},"ts":\{"$timestamp":{"t":1684427355,"i":3}},"t":3,"v":2,"wall":\{"$date":"2023-05-18T16:29:15.421Z"}}},"lastOperation":\{"oplogEntry":{"op":"c","ns":"OLDDATABASE.$cmd","ui":{"$uuid":"8a3554d9-d148-4726-9067-e48b8ee35ce1"},"o":\{"create":"afterRollback","idIndex":{"v":2,"key":{"_id":1},"name":"_id_"}},"ts":\{"$timestamp":{"t":1684427355,"i":3}},"t":3,"v":2,"wall":\{"$date":"2023-05-18T16:29:15.421Z"}}},"failedWriterThread":2,"error":"DatabaseDifferCase: db already exists with different case already have: [olddatabase] trying to create [OLDDATABASE]"}}
[js_test:rollback_drop_database] d20051| \{"t":{"$date":"2023-05-18T16:29:15.430+00:00"},"s":"F",  "c":"ASSERT",   "id":23095,   "ctx":"OplogApplier-0","msg":"Fatal assertion","attr":\{"msgid":34437,"error":"DatabaseDifferCase: db already exists with different case already have: [olddatabase] trying to create [OLDDATABASE]","file":"src/mongo/db/repl/oplog_applier_impl.cpp","line":436}}

Generated at Thu Feb 08 06:35:02 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.