[SERVER-38480] Make Map-Reduce fully interruptible Created: 07/Dec/18  Updated: 29/Oct/23  Resolved: 31/Jan/19

Status: Closed
Project: Core Server
Component/s: MapReduce, Querying
Affects Version/s: None
Fix Version/s: 4.1.8

Type: Task Priority: Major - P3
Reporter: Siyuan Zhou Assignee: Justin Seyster
Resolution: Fixed Votes: 0
Labels: prepare_interruptibility
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-35365 MapReduce temporary inc collections s... Closed
Backwards Compatibility: Fully Compatible
Sprint: Query 2019-01-14, Query 2019-01-28, Query 2019-02-11
Participants:

 Description   

We disallow interruptions in Map-Reduce on a single node and on shards. The resulting uninterruptible lock acquisitions will conflict with prepared transactions on stepdown and shutdown. We can either make Map-Reduce interruptible or have it use weaker IX and IS locks instead.

dropTempCollections() is also protected by an UninterruptibleLockGuard. If the temp collections live only in the local database, as done by SERVER-35365, they won't conflict with prepared transactions. This ticket should also audit that.
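As a rough illustration of what "uninterruptible" means here, the sketch below models a lock request that normally fails when the operation has been interrupted, and an RAII guard that suppresses that check for its scope. FakeLocker, FakeUninterruptibleGuard, and acquireObservesInterrupt are hypothetical names invented for this sketch; they are not the server's Locker or UninterruptibleLockGuard, and the real lock manager is considerably more involved.

```cpp
#include <stdexcept>

// Hypothetical, single-threaded model of a locker. Not the server's API.
struct FakeLocker {
    bool interrupted = false;      // set by stepdown/shutdown in this model
    int uninterruptibleDepth = 0;  // > 0 while at least one guard is alive

    // Models a lock request that has to wait. An interrupted operation
    // gives up with an exception unless a guard is currently active.
    void lockWithWait() {
        if (interrupted && uninterruptibleDepth == 0) {
            throw std::runtime_error("interrupted while waiting for lock");
        }
        // Otherwise the request keeps waiting until the lock is granted.
    }
};

// Hypothetical RAII guard: lock requests made while it is alive
// ignore the interrupted flag.
class FakeUninterruptibleGuard {
public:
    explicit FakeUninterruptibleGuard(FakeLocker& locker) : _locker(locker) {
        ++_locker.uninterruptibleDepth;
    }
    ~FakeUninterruptibleGuard() {
        --_locker.uninterruptibleDepth;
    }
    FakeUninterruptibleGuard(const FakeUninterruptibleGuard&) = delete;
    FakeUninterruptibleGuard& operator=(const FakeUninterruptibleGuard&) = delete;

private:
    FakeLocker& _locker;
};

// Returns true if an already-interrupted operation notices the interrupt
// when it requests a lock, with or without the guard in scope.
bool acquireObservesInterrupt(bool useGuard) {
    FakeLocker locker;
    locker.interrupted = true;
    try {
        if (useGuard) {
            FakeUninterruptibleGuard guard(locker);
            locker.lockWithWait();
        } else {
            locker.lockWithWait();
        }
    } catch (const std::runtime_error&) {
        return true;
    }
    return false;
}
```

In this model, acquireObservesInterrupt(true) returns false: the guarded request keeps waiting even though the operation has been interrupted, which is the behavior that gets in the way of stepdown and shutdown.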



 Comments   
Comment by Githook User [ 31/Jan/19 ]

Author:

{'name': 'Justin Seyster', 'email': 'justin.seyster@mongodb.com', 'username': 'jseyster'}

Message: SERVER-38480 Make Map-Reduce fully interruptible
Branch: master
https://github.com/mongodb/mongo/commit/57b22a11d206272a78124ee03c5a6cf26b3e1105

Comment by Justin Seyster [ 03/Jan/19 ]

I spent some time inspecting the run methods of MapReduceCommand and MapReduceFinishCommand. Except for two easily fixable ON_BLOCK_EXIT blocks, I don't see any destructor actions that could cause a problem (by throwing a double-fault exception) during exception unwinding if the OperationContext is in an interrupted state.

The exceptions are the lock acquisitions on line 1183 and line 1503. It's not surprising that the unit tests didn't trip these lines, because an interrupt has to occur within a narrow window to cause them to execute: during the time that the command gives up its collection lock, either here or here.
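The narrow window described above can be modeled with a small scope guard. This is a sketch under stated assumptions: ScopeGuard, FakeOpCtx, and exitBlockWouldDoubleFault are hypothetical stand-ins rather than the server's ON_BLOCK_EXIT or OperationContext, and instead of actually throwing from the exit block (which would terminate the process) the guard only records whether it ran in the dangerous state. It relies on std::uncaught_exceptions(), so it assumes C++17.

```cpp
#include <exception>
#include <functional>
#include <stdexcept>
#include <utility>

// Hypothetical stand-in for an operation context.
struct FakeOpCtx {
    bool interrupted = false;
};

// Minimal scope guard that runs a callback when the scope exits,
// including during exception unwinding.
class ScopeGuard {
public:
    explicit ScopeGuard(std::function<void()> fn) : _fn(std::move(fn)) {}
    ~ScopeGuard() {
        _fn();
    }
    ScopeGuard(const ScopeGuard&) = delete;
    ScopeGuard& operator=(const ScopeGuard&) = delete;

private:
    std::function<void()> _fn;
};

// Returns true if the exit block ran during unwinding while the operation
// was interrupted. That is the state in which an interruptible lock
// acquisition inside the exit block would throw a second exception.
bool exitBlockWouldDoubleFault(bool interruptInWindow) {
    FakeOpCtx opCtx;
    bool hazard = false;
    try {
        ScopeGuard reacquire([&] {
            // The real exit block would reacquire a lock here. This sketch
            // only records whether doing so would have thrown mid-unwind.
            hazard = opCtx.interrupted && std::uncaught_exceptions() > 0;
        });

        // Window in which the command has given up its collection lock.
        if (interruptInWindow) {
            opCtx.interrupted = true;
            throw std::runtime_error("operation was interrupted");
        }
    } catch (const std::runtime_error&) {
        // The interrupt propagates to the caller in the real command.
    }
    return hazard;
}
```

Only the call with interruptInWindow set to true reports the hazard, which is consistent with tests not hitting these lines unless an interrupt happens to land inside the window.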

My understanding is that SERVER-37449 makes these lock acquisitions unnecessary, and that they are on the chopping block anyway as part of SERVER-37453. As part of this work, I'll delete both ON_BLOCK_EXIT blocks along with the UninterruptibleLockGuards, after confirming my understanding of SERVER-37449 with david.storch.

The dropTempCollections() method also gets called as part of destruction, but the destructor wraps this call in a try block that catches all exceptions, so there is no risk of a double-fault exception. Interrupting a map-reduce that has a temp collection will leave the temp collection in place, which is probably the desired behavior. It will keep interrupt cleanup fast, and the temp collections still get cleaned up later. (The log message says that they get cleaned up when the mongod is restarted; perhaps we should consider a more frequent interval than that to be safe.)
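The destructor pattern described above can be sketched as follows. FakeOpCtx, FakeMapReduceState, and runInterrupted are hypothetical names made up for this illustration and are not taken from mr.cpp; the point is only that an exception thrown by cleanup during unwinding is harmless as long as it never escapes the destructor.

```cpp
#include <stdexcept>

// Hypothetical stand-in for an interruptible operation context.
struct FakeOpCtx {
    bool interrupted = false;
    void checkForInterrupt() const {
        if (interrupted) {
            throw std::runtime_error("operation was interrupted");
        }
    }
};

// Hypothetical stand-in for the map-reduce state object.
class FakeMapReduceState {
public:
    FakeMapReduceState(FakeOpCtx& opCtx, bool& tempDropped)
        : _opCtx(opCtx), _tempDropped(tempDropped) {}

    ~FakeMapReduceState() {
        try {
            dropTempCollections();
        } catch (...) {
            // Swallow everything: the temp collection stays in place and
            // is left for a later cleanup pass.
        }
    }

    void dropTempCollections() {
        _opCtx.checkForInterrupt();  // models an interruptible lock acquisition
        _tempDropped = true;
    }

private:
    FakeOpCtx& _opCtx;
    bool& _tempDropped;
};

// Runs a fake map-reduce that is interrupted mid-flight. Returns true if
// the interrupt reached the caller as an ordinary exception.
bool runInterrupted(bool& tempDropped) {
    FakeOpCtx opCtx;
    try {
        FakeMapReduceState state(opCtx, tempDropped);
        opCtx.interrupted = true;
        opCtx.checkForInterrupt();  // throws; ~FakeMapReduceState runs during unwinding
    } catch (const std::runtime_error&) {
        return true;
    }
    return false;
}
```

In this sketch the interrupt still propagates to the caller, and the temp collection is not dropped, which mirrors the trade-off noted above: cleanup on interrupt stays fast, at the cost of leaving the temp collection behind until something else removes it.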

Comment by Siyuan Zhou [ 10/Dec/18 ]

david.storch, I talked with louis.williams. Quoting him on adding UninterruptibleLockGuard in mr.cpp.

"We decided to put that there after seemingly endless bugs with map reduce getting interrupted. It was way too complicated to bother identifying every single location. Instead we put that there to ensure nothing bad happened until we decided to investigate the failures."

I've run three patch builds: one removing the guard on a single node, one removing the guard on shards, and one removing all three guards, including the one in dropTempCollections(). Surprisingly, I haven't seen any failures related to map-reduce. This may be because we don't have good test coverage of map-reduce concurrency, or because recent changes to map-reduce and query have already made it resilient to interrupts. A passing Evergreen run doesn't make me comfortable removing the guards blindly. I think the Query team has more context than the Replication team to investigate the impact of the removal and to fix issues as they come up.

Comment by David Storch [ 10/Dec/18 ]

siyuan.zhou, my understanding was that the work to investigate why mapReduce's strong lock acquisitions are not interruptible would fall to the replication team. Did you already complete that investigation?
