[SERVER-38480] Make Map-Reduce fully interruptible Created: 07/Dec/18 Updated: 29/Oct/23 Resolved: 31/Jan/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | MapReduce, Querying |
| Affects Version/s: | None |
| Fix Version/s: | 4.1.8 |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Siyuan Zhou | Assignee: | Justin Seyster |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | prepare_interruptibility | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||
| Sprint: | Query 2019-01-14, Query 2019-01-28, Query 2019-02-11 | ||||||||
| Participants: | |||||||||
| Description |
|
We disallow interruptions in Map-Reduce both on a single node and on shards. These uninterruptible lock acquisitions will conflict with prepared transactions on stepdown and shutdown. We can either make Map-Reduce interruptible or use the weaker IX and IS locks instead. dropTempCollections() is also protected by UninterruptibleLockGuard. If the temp collections are only in local database as done by |
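To illustrate the behavior the description refers to, here is a minimal toy model of a guard that makes lock acquisition ignore interrupts while it is alive. Every name in it (`ToyOpCtx`, `ToyUninterruptibleGuard`, `toyAcquireLock`, `Interrupted`, `checkToyGuard`) is hypothetical and is not the real server code; it only assumes that an interrupted operation normally fails to acquire a lock unless such a guard is active.

```cpp
#include <cassert>
#include <stdexcept>

// Toy stand-in for an operation context; NOT the real mongo::OperationContext.
struct ToyOpCtx {
    bool killed = false;           // set when stepdown or shutdown interrupts the operation
    int uninterruptibleDepth = 0;  // greater than zero while a guard is alive
};

// Hypothetical exception type raised when an interrupted operation waits for a lock.
struct Interrupted : std::runtime_error {
    Interrupted() : std::runtime_error("operation interrupted") {}
};

// RAII guard (assumed shape): while it is alive, lock waits ignore interrupts.
class ToyUninterruptibleGuard {
public:
    explicit ToyUninterruptibleGuard(ToyOpCtx& ctx) : _ctx(ctx) {
        ++_ctx.uninterruptibleDepth;
    }
    ~ToyUninterruptibleGuard() {
        --_ctx.uninterruptibleDepth;
    }
    ToyUninterruptibleGuard(const ToyUninterruptibleGuard&) = delete;
    ToyUninterruptibleGuard& operator=(const ToyUninterruptibleGuard&) = delete;

private:
    ToyOpCtx& _ctx;
};

// Hypothetical lock acquisition: throws if the operation was killed,
// unless a guard is active. Returns true when the lock is "acquired".
bool toyAcquireLock(ToyOpCtx& ctx) {
    if (ctx.killed && ctx.uninterruptibleDepth == 0) {
        throw Interrupted();
    }
    return true;
}

// Usage example: the same killed operation fails without the guard
// and succeeds with it.
void checkToyGuard() {
    ToyOpCtx ctx;
    ctx.killed = true;

    bool threw = false;
    try {
        toyAcquireLock(ctx);
    } catch (const Interrupted&) {
        threw = true;
    }
    assert(threw);

    {
        ToyUninterruptibleGuard guard(ctx);
        assert(toyAcquireLock(ctx));
    }
    assert(ctx.uninterruptibleDepth == 0);
}
```

In this sketch, an operation holding the guard never notices that it was killed. That is the property that conflicts with stepdown and shutdown needing operations to give way.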
| Comments |
| Comment by Githook User [ 31/Jan/19 ] |
|
Author: Justin Seyster (jseyster, justin.seyster@mongodb.com) Message: |
| Comment by Justin Seyster [ 03/Jan/19 ] |
|
I spent some time inspecting the MapReduceCommand and MapReduceFinishCommand run methods, and except for two easily fixable ON_BLOCK_EXIT blocks, I don't see any destructor actions that have the potential to cause a problem (by throwing a double-fault exception) during exception unwinding if the OperationContext is in an interrupted state. The exceptions are the lock acquisitions on line 1183 and line 1503. It's not surprising that the unit tests didn't trip these lines, because an interrupt has to occur within a narrow window to cause them to execute: during the time that the command gives up its collection lock, either here or here. My understanding is that The dropTempCollections() method also gets called as part of destruction, but the destructor wraps this call in a try block that catches all exceptions, so there is no risk of a double-fault exception. Interrupting a map-reduce that has a temp collection will leave the temp collection in place, which is probably the desired behavior. It keeps interrupt cleanup fast, and the temp collections still get cleaned up later. (The log message says that they get cleaned up when the mongod is restarted; perhaps we should consider cleaning them up more frequently than that, to be safe.) |
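The destructor pattern described in the comment above can be sketched as follows. This is a toy model under stated assumptions, not the code in mr.cpp: `ToyMapReduceState`, `markInterrupted`, `runToyMapReduce` and `checkToyUnwinding` are hypothetical names, and the "lock acquisition" is simulated by a flag. It shows a cleanup call that may throw during exception unwinding being wrapped in a catch-all, so the exception never escapes the destructor.

```cpp
#include <cassert>
#include <stdexcept>

// Toy model only; names are hypothetical, not MongoDB's real classes.
class ToyMapReduceState {
public:
    explicit ToyMapReduceState(bool& tempDropped) : _tempDropped(tempDropped) {}

    void markInterrupted() {
        _interrupted = true;
    }

    // Cleanup needs a lock; in this toy, an interrupted operation cannot take it.
    void dropTempCollections() {
        if (_interrupted) {
            throw std::runtime_error("interrupted while acquiring lock");
        }
        _tempDropped = true;
    }

    ~ToyMapReduceState() {
        try {
            dropTempCollections();
        } catch (...) {
            // Swallow the error: the temp collection stays in place for later
            // cleanup, and nothing escapes the destructor during unwinding.
        }
    }

private:
    bool& _tempDropped;
    bool _interrupted = false;
};

// Returns true if the toy command completed, false if it was interrupted.
bool runToyMapReduce(bool interruptMidway, bool& tempDropped) {
    try {
        ToyMapReduceState state(tempDropped);
        if (interruptMidway) {
            state.markInterrupted();
            throw std::runtime_error("operation interrupted");
        }
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}

// Usage example: a normal run drops the temp collection; an interrupted run
// leaves it in place and still unwinds cleanly.
void checkToyUnwinding() {
    bool dropped = false;
    assert(runToyMapReduce(false, dropped));
    assert(dropped);

    dropped = false;
    assert(!runToyMapReduce(true, dropped));
    assert(!dropped);
}
```

Without the catch-all, the second exception would escape a destructor while the first is still propagating, and the C++ runtime would call `std::terminate`. With it, the interrupted run just skips the drop.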
| Comment by Siyuan Zhou [ 10/Dec/18 ] |
|
david.storch, I talked with louis.williams. Quoting him on adding UninterruptibleLockGuard in mr.cpp.
I've run three patch builds: one to remove the guard on single node, one to remove the guard on shards, and one to remove all three, including the guard in dropTempCollections(). Surprisingly, I haven't seen any failure related to map-reduce. This may be because we don't have good test coverage of map-reduce concurrency, or because recent changes to map-reduce / query have already made it resilient to interrupts. A passing Evergreen run doesn't make me comfortable removing them blindly. I think the Query team has more context than the Replication team to investigate the impact of the removal and fix the issues as they come up. |
| Comment by David Storch [ 10/Dec/18 ] |
|
siyuan.zhou, my understanding was that the work to investigate why mapReduce's strong lock acquisitions are not interruptible would fall to the replication team. Did you already complete that investigation? |