[SERVER-56625] validate cachedir race condition when cleaning bad cachefile Created: 04/May/21  Updated: 29/Oct/23  Resolved: 21/Jul/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 5.0.2, 5.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Daniel Moody Assignee: Daniel Moody
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
is related to SERVER-58020 cache-dir: prevent same buildsig with... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v5.0
Sprint: Dev Platform 2021-06-14, Dev Platform 2021-06-28, Dev Platform 2021-07-12, Dev Platform 2021-07-26
Participants:
Linked BF Score: 53

 Description   

Sometimes when an invalid checksum event happens, there are issues around reading or accessing the related files. There may be a race condition with the deletion of the bad cachefile.

 



 Comments   
Comment by Vivian Ge (Inactive) [ 06/Oct/21 ]

Updating the fixversion since branching activities occurred yesterday. This ticket will be in rc0 when it’s been triggered. For more active release information, please keep an eye on #server-release. Thank you!

Comment by Githook User [ 21/Jul/21 ]

Author:

{'name': 'Daniel Moody', 'email': 'daniel.moody@mongodb.com', 'username': 'dmoody256'}

Message: SERVER-56625 use lazy delete for bad cachefile race condition

(cherry picked from commit 958e563b46e31ee0316be0a5f4b0cb8e0f878a9c)
Branch: v5.0
https://github.com/mongodb/mongo/commit/3d81bf163bf9e3a7e0158accff8c4835e32ea9d7

Comment by Githook User [ 21/Jul/21 ]

Author:

{'name': 'Daniel Moody', 'email': 'daniel.moody@mongodb.com', 'username': 'dmoody256'}

Message: SERVER-56625 use lazy delete for bad cachefile race condition
Branch: master
https://github.com/mongodb/mongo/commit/958e563b46e31ee0316be0a5f4b0cb8e0f878a9c

Comment by Githook User [ 06/May/21 ]

Author:

{'name': 'Daniel Moody', 'email': 'daniel.moody@mongodb.com', 'username': 'dmoody256'}

Message: SERVER-56625 turning off validate cachedir invalid checksum clean up
Branch: master
https://github.com/mongodb/mongo/commit/ffde170147ef83d31bb583720d4fd9c51af6b7e9

Comment by Daniel Moody [ 04/May/21 ]

I propose adding some exception handling around the deletion of the bad cachefile. Also, there seem to be many cases of using rmdir on Windows specifically where issues like this happen. I suspect Windows is doing something funny behind the scenes. Something like this may help:

diff --git a/site_scons/site_tools/validate_cache_dir.py b/site_scons/site_tools/validate_cache_dir.py
index bf2075daa4..9e6c6ea7e4 100644
--- a/site_scons/site_tools/validate_cache_dir.py
+++ b/site_scons/site_tools/validate_cache_dir.py
@@ -24,6 +24,7 @@ import json
 import os
 import pathlib
 import shutil
+import stat
 
 import SCons
 
@@ -51,6 +52,10 @@ class UnsupportedError(SCons.Errors.BuildError):
     def __str__(self):
         return self.message
 
+def remove_readonly(func, path, _):
+    "Clear the readonly bit and reattempt the removal"
+    os.chmod(path, stat.S_IWRITE)
+    func(path)
 class CacheDirValidate(SCons.CacheDir.CacheDir):
 
     def __init__(self, path):
@@ -193,8 +198,30 @@ class CacheDirValidate(SCons.CacheDir.CacheDir):
         cksum_dir = pathlib.Path(self.cachepath(node)[1]).parent
         if cksum_dir.is_dir():
             rm_path = f"{cksum_dir}.{SCons.CacheDir.cache_tmp_uuid}.del"
-            cksum_dir.replace(rm_path)
-            shutil.rmtree(rm_path)
+
+            try:
+                cksum_dir.replace(rm_path)
+            except OSError as ex:
+                failed_rename_msg = f"Failed to rename {cksum_dir} to {rm_path}: {ex}"
+                print(failed_rename_msg)
+                self.CacheDebug(failed_rename_msg + cache_debug_suffix, node, cksum_dir)
+                self.CacheDebugJson({
+                    'type': 'error',
+                    'error': failed_rename_msg
+                }, node, cksum_dir)
+                return
+
+            try:
+                shutil.rmtree(rm_path, onerror=remove_readonly)
+            except OSError as ex:
+                failed_rmtree_msg = f"Failed to rmtree {rm_path}: {ex}"
+                print(failed_rmtree_msg)
+                self.CacheDebug(failed_rmtree_msg + cache_debug_suffix, node, cksum_dir)
+                self.CacheDebugJson({
+                    'type': 'error',
+                    'error': failed_rmtree_msg
+                }, node, cksum_dir)
+                return
 
             clean_msg = f"Removed bad cachefile {cksum_dir} from cache."
             print(clean_msg)
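
For trying the idea outside of SCons, here is a minimal standalone sketch of the same rename-then-delete pattern. It is not the code that was committed for this ticket. The helper name `remove_bad_cache_dir`, the `uuid`-based temporary suffix (standing in for `SCons.CacheDir.cache_tmp_uuid`), and the boolean return value are assumptions made for illustration, and the `CacheDebug` logging is reduced to `print`.

```python
import os
import pathlib
import shutil
import stat
import sys
import uuid


def _clear_readonly_and_retry(func, path, _exc):
    """Clear the read-only bit, then retry the operation that failed."""
    os.chmod(path, stat.S_IWRITE)
    func(path)


def remove_bad_cache_dir(cksum_dir):
    """Hypothetical helper: return True if the directory was removed, else False."""
    cksum_dir = pathlib.Path(cksum_dir)
    if not cksum_dir.is_dir():
        # Another process may already have cleaned this entry up.
        return False

    # Rename first so concurrent readers stop finding the bad entry.
    # The uuid suffix is an assumption standing in for cache_tmp_uuid.
    rm_path = cksum_dir.with_name(f"{cksum_dir.name}.{uuid.uuid4().hex}.del")
    try:
        cksum_dir.replace(rm_path)
    except OSError as ex:
        print(f"Failed to rename {cksum_dir} to {rm_path}: {ex}")
        return False

    # shutil.rmtree accepts the handler as onexc on Python 3.12+, onerror before.
    handler_arg = "onexc" if sys.version_info >= (3, 12) else "onerror"
    try:
        shutil.rmtree(rm_path, **{handler_arg: _clear_readonly_and_retry})
    except OSError as ex:
        print(f"Failed to rmtree {rm_path}: {ex}")
        return False
    return True
```

Calling `remove_bad_cache_dir(path)` a second time on the same path returns `False` rather than raising, which is the behavior wanted when two build processes notice the same bad checksum at roughly the same time.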

Generated at Thu Feb 08 05:39:46 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.