[SERVER-34211] A failed restartCatalog command can clear the cached repl oplog pointer without reestablishing it Created: 30/Mar/18  Updated: 29/Oct/23  Resolved: 03/May/18

Status: Closed
Project: Core Server
Component/s: Replication, Storage
Affects Version/s: None
Fix Version/s: 4.0.0-rc0

Type: Bug Priority: Major - P3
Reporter: Kyle Suarez Assignee: Kyle Suarez
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:
Linked BF Score: 0

 Description   

Imagine this sequence of events:

  1. I run a background index build (or any background job, really) on the namespace "test.coll".
  2. Someone issues a restartCatalog command.
  3. We close all of the open databases via DBHolder::closeAll(). This simply loops through each database and attempts to close it. Suppose the order of databases is "local", then "test".
    1. Database "local" is closed. The cached oplog collection pointer is cleared.
    2. We attempt to close database "test" but then throw because a background operation is in progress.
  4. A later operation writes to the oplog and dereferences the cleared oplog pointer, because logOp() does not call acquireOplogCollectionForLogging() to re-establish it.
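
The sequence above can be modeled with a small standalone sketch. Everything here (`Catalog`, `Collection`, `cachedOplogSurvivesFailedClose`) is a hypothetical stand-in invented for illustration, not the server's actual classes; it only assumes, as described, that closing "local" clears the cached pointer and that closing a later database can throw.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-ins; none of these are the real server types.
struct Collection {
    std::string ns;
};

struct Catalog {
    Collection oplog{"local.oplog.rs"};
    Collection* cachedOplog = &oplog;  // stand-in for the cached oplog pointer
    std::vector<std::string> dbs{"local", "test"};
    std::string dbWithBackgroundOp;  // non-empty: this database has a background job

    // Mirrors the described closeAll(): close databases in order, throwing
    // as soon as one of them cannot be closed.
    void closeAll() {
        for (const auto& db : dbs) {
            if (db == dbWithBackgroundOp)
                throw std::runtime_error("background operation in progress on " + db);
            if (db == "local")
                cachedOplog = nullptr;  // closing "local" clears the cache
        }
    }
};

// Returns true if the cached pointer is still set after a failed close.
bool cachedOplogSurvivesFailedClose() {
    Catalog catalog;
    catalog.dbWithBackgroundOp = "test";
    try {
        catalog.closeAll();
    } catch (const std::runtime_error&) {
        // The command fails here, but "local" has already been closed.
    }
    return catalog.cachedOplog != nullptr;
}
```

In this model the function returns false: the close is abandoned halfway, and nothing restores the pointer that was cleared in the first iteration.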

One solution would be to add a ScopeGuard to restartCatalog that calls repl::acquireOplogCollectionForLogging() if the call to catalog::closeCatalog() fails for any reason.
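
A minimal sketch of that idea, under stated assumptions: the `ScopeGuard` class below is a simplified illustration rather than the server's own utility, and `FakeServer` with its member functions is a hypothetical model that only borrows the names closeCatalog, openCatalog and acquireOplogCollectionForLogging from this ticket. The guard is dismissed on success because a successful restart is expected to re-establish the pointer when the catalog is reopened.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <utility>

// Simplified scope guard for illustration (not the server's implementation).
class ScopeGuard {
public:
    explicit ScopeGuard(std::function<void()> fn) : _fn(std::move(fn)) {}
    ~ScopeGuard() {
        if (_fn)
            _fn();  // runs on any exit path unless dismissed
    }
    void dismiss() {
        _fn = nullptr;
    }
    ScopeGuard(const ScopeGuard&) = delete;
    ScopeGuard& operator=(const ScopeGuard&) = delete;

private:
    std::function<void()> _fn;
};

// Hypothetical model of the server state touched by restartCatalog.
struct FakeServer {
    int oplog = 0;
    int* cachedOplog = &oplog;  // stand-in for the cached oplog pointer
    bool backgroundOpInProgress = false;

    void closeCatalog() {
        cachedOplog = nullptr;  // "local" is closed first
        if (backgroundOpInProgress)
            throw std::runtime_error("background operation in progress");
    }
    void openCatalog() {
        cachedOplog = &oplog;
    }
    void acquireOplogCollectionForLogging() {
        cachedOplog = &oplog;
    }

    // Returns true on success, false if the close step failed.
    bool restartCatalog() {
        try {
            ScopeGuard restoreOplog([this] { acquireOplogCollectionForLogging(); });
            closeCatalog();
            restoreOplog.dismiss();  // success: openCatalog() restores the pointer
        } catch (const std::runtime_error&) {
            return false;  // the guard has already restored the pointer
        }
        openCatalog();
        return true;
    }
};
```

The sketch ignores locking entirely. In the real server the restore step would have to acquire locks, so the callback itself could fail or block.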



 Comments   
Comment by Githook User [ 03/May/18 ]

Author: Kyle Suarez (ksuarz) <kyle.suarez@mongodb.com>

Message: SERVER-34211 restore cached oplog pointer if restartCatalog exits early
Branch: master
https://github.com/mongodb/mongo/commit/50bb9fd6ff7e87b39a6317c7bf3b783e0dbf836d

Comment by Kyle Suarez [ 02/Apr/18 ]

Thanks Andy, that sounds like a solid approach.

Comment by Andy Schwerin [ 02/Apr/18 ]

I think you should suppress lock acquisition interruption using the recently introduced guard type. As for deadlock, it may not be an issue if you hold the global lock in MODE_X. Just keep an eye on it.

Comment by Kyle Suarez [ 02/Apr/18 ]

While the command is executing, the global lock is held in exclusive mode. Does that still leave open the possibility for lock acquisition to throw?

To prevent deadlock, we could do the locking ourselves (rather than calling repl::acquireOplogCollectionForLogging(), which uses one of the AutoGet* helpers), given that we know we're exclusively locked.

Comment by Andy Schwerin [ 01/Apr/18 ]

The risk with the proposed solution is that acquiring locks can throw or deadlock.

Comment by Kyle Suarez [ 30/Mar/18 ]

Note that a successful restartCatalog command will re-establish the cached oplog collection pointer in catalog::openCatalog().
