[SERVER-61343] Invariant when killing backup cursor Created: 09/Nov/21  Updated: 27/Oct/23  Resolved: 04/Mar/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Matthew Russotto Assignee: Backlog - Storage Execution Team
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-61344 Make sure cloning is complete before ... Closed
Assigned Teams:
Storage Execution
Operating System: ALL
Sprint: Execution Team 2022-02-21, Execution Team 2022-03-21
Participants:

 Description   

On Windows, killing a backup cursor while a client is still copying from it triggers an invariant failure: WiredTiger cannot delete the WiredTiger.backup file while it is in use, so closing the session fails.

This was found during file-copy-based initial sync testing. I will also file a ticket so that initial sync does not do this, but crashing the server seems undesirable regardless.

[j0:prim] | 2021-11-09T14:08:02.331+00:00 D2 COMMAND  21965   [conn131] "About to run the command","attr":{"db":"admin","client":"127.0.0.1:54849","commandArgs":{"killCursors":"$cmd.aggregate","cursors":[3906021346930736600],"$db":"admin"}}
[j0:prim] | 2021-11-09T14:08:02.331+00:00 E  STORAGE  22435   [conn131] "WiredTiger error","attr":{"error":16,"message":"[1636466882:330745][5124:140706774209120], WT_CURSOR.close: int __cdecl __win_fs_remove(struct __wt_file_system *,struct __wt_session *,const char *,unsigned int), 81: \\data\\db\\job0\\resmoke\\node0\\WiredTiger.backup: file-remove: DeleteFileW: The process cannot access the file because it is being used by another process.\r\n: Resource device"}
[j0:prim] | 2021-11-09T14:08:02.331+00:00 F  ASSERT   23083   [conn131] "Invariant failure","attr":{"expr":"_session->close(_session, nullptr)","error":"ObjectIsBusy: 16: Resource device","file":"src\\mongo\\db\\storage\\wiredtiger\\wiredtiger_session_cache.cpp","line":78}
[j0:prim] | 2021-11-09T14:08:02.331+00:00 F  ASSERT   23084   [conn131] "\n\n***aborting after invariant() failure\n\n"
[j0:prim] | 2021-11-09T14:08:02.331+00:00 F  CONTROL  4757800 [conn131] "Writing fatal message","attr":{"message":"Got signal: 22 (SIGABRT).\n"}
[j0:prim] | 2021-11-09T14:08:02.332+00:00 I  COMMAND  51803   [conn132] "Slow query","attr":{"type":"command","ns":"admin.$cmd.aggregate","appName":"FileCopyBasedInitialSyncer","command":{"aggregate":1,"pipeline":[{"$_backupFile":{"backupId":{"$uuid":"0e76c891-3922-4a93-9fbf-5ace9f44756e"},"file":"\\data\\db\\job0\\resmoke\\node0\\WiredTiger.backup","byteOffset":0}}],"cursor":{"batchSize":101},"readConcern":{},"writeConcern":{"w":1,"wtimeout":0},"$readPreference":{"mode":"secondaryPreferred"},"$db":"admin"},"keysExamined":0,"docsExamined":0,"cursorExhausted":true,"numYields":0,"nreturned":1,"reslen":170033,"locks":{},"readConcern":{"provenance":"implicitDefault"},"writeConcern":{"w":1,"wtimeout":0,"provenance":"clientSupplied"},"remote":"127.0.0.1:54852","protocol":"op_msg","durationMillis":14}



 Comments   
Comment by Haley Connelly [ 01/Mar/22 ]

Pausing this to focus on project work.

One possible solution is:
Rather than hitting an invariant in the destructor when the session fails to close, queue the session on a list of dead sessions in the WiredTigerSessionCache. This approach needs some exploration: there is a potential deadlock between shutting down the cache and re-queuing sessions that fail to close.
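
A minimal sketch of that idea, to make the shape of the proposal concrete. This is not the actual server code: `FakeSession`, `SessionCache`, `SessionHandle`, `parkDeadSession`, and `retryDeadSessions` are hypothetical names invented for illustration, and `FakeSession::close()` only imitates a session close that returns 16 (busy) while the backup file is still open. To reduce the deadlock risk mentioned above, the sketch never calls `close()` while holding the cache mutex.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for a storage session; NOT the real WiredTiger API.
struct FakeSession {
    bool fileInUse = false;  // models a backup file still held open by a reader
    bool closed = false;

    // Returns 0 on success, 16 (busy) while the file is still in use.
    int close() {
        if (fileInUse) return 16;
        closed = true;
        return 0;
    }
};

// Hypothetical cache that parks sessions whose close() failed.
class SessionCache {
public:
    void parkDeadSession(std::unique_ptr<FakeSession> s) {
        std::lock_guard<std::mutex> lk(_mutex);
        _dead.push_back(std::move(s));
    }

    // Retries close() on parked sessions; returns how many remain parked.
    std::size_t retryDeadSessions() {
        // Swap the list out so close() runs without holding the mutex.
        std::vector<std::unique_ptr<FakeSession>> pending;
        {
            std::lock_guard<std::mutex> lk(_mutex);
            pending.swap(_dead);
        }
        std::vector<std::unique_ptr<FakeSession>> stillBusy;
        for (auto& s : pending) {
            if (s->close() != 0) stillBusy.push_back(std::move(s));
        }
        std::lock_guard<std::mutex> lk(_mutex);
        for (auto& s : stillBusy) _dead.push_back(std::move(s));
        return _dead.size();
    }

    std::size_t deadCount() {
        std::lock_guard<std::mutex> lk(_mutex);
        return _dead.size();
    }

private:
    std::mutex _mutex;
    std::vector<std::unique_ptr<FakeSession>> _dead;
};

// RAII owner: a failed close in the destructor parks the session
// on the cache instead of aborting the process.
class SessionHandle {
public:
    SessionHandle(SessionCache& cache, std::unique_ptr<FakeSession> s)
        : _cache(cache), _session(std::move(s)) {}

    ~SessionHandle() {
        if (_session && _session->close() != 0) {
            _cache.parkDeadSession(std::move(_session));
        }
    }

    FakeSession* get() { return _session.get(); }

private:
    SessionCache& _cache;
    std::unique_ptr<FakeSession> _session;
};
```

In this sketch a periodic task or the shutdown path would call `retryDeadSessions()` until it returns 0. Whether the real cache can safely do that during shutdown is exactly the open question raised in the comment.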
