[SERVER-35567] Concurrency simultaneous replication tests - unblacklist snapshot_read_kill_operations.js and remove snapshot_read_kill_op_only.js Created: 12/Jun/18 Updated: 06/Nov/19 Resolved: 04/Mar/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Testing Infrastructure |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Jonathan Abrahams | Assignee: | Anton Korshunov |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | open_todo_in_code, todo_in_code | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||
| Sprint: | Query 2019-02-25, Query 2019-03-11 | ||||||||||||||||
| Participants: | |||||||||||||||||
| Description |
|
The test snapshot_read_kill_operations.js is blacklisted in the concurrency_simultaneous_replication suite, due to states that are not supported in 4.0. An extended test was created, snapshot_read_kill_op_only.js. The extended test and blacklist can be removed once the global cursor management has been implemented. |
| Comments |
| Comment by Githook User [ 06/Nov/19 ] | ||||
|
Author: {'username': 'antkorsh', 'email': 'anton.korshunov@mongodb.com', 'name': 'Anton Korshunov'} Message: | ||||
| Comment by Anton Korshunov [ 04/Mar/19 ] | ||||
|
So, after all, it was a combination of a test issue and the design of the killCursors locking protocol. The hang that I was observing was caused by the implementation of the FSM killSessions state. At the beginning of the test we start a new transaction and insert a session ID into a collection. We need to store session IDs in a collection because all FSM threads need access to this information in order to randomly pick a session ID to kill. However, the FSM framework doesn't provide a global working area that all threads can access, so FSM workloads can only operate on local structures allocated per thread. In the killSessions state we execute a find command (which takes locks) to select the session ID from the collection, and this is where we get stuck due to the following sequence of lock acquisitions:
To work around this issue I modified the test to store the session IDs in a separate mongod instance. Then I hit another problem with the test. Here is how we pick a session to kill:
A session ID in this test is formed as "sessionId" + tid, where tid is a thread ID passed to the test by the FSM workload manager. If we were running 5 threads with thread IDs in the range [60..64], we'd never be able to exit the loop above, as the Math.floor(...) expression would return a value in the range [0..4]. I don't know why this part was written the way it was, as a less error-prone approach would be to find all documents in the collection holding the session ID values, and then randomly select one element of the returned array. This is how I rewrote it and it worked fine. Then I hit another deadlock, and this time it was caused by the code path I mentioned earlier. When the killSessions command tries to kill a cursor, we acquire a Global IS lock in AutoStatsTracker, and we hang for the same reason I described earlier. For that reason, the test cannot be re-enabled. Also, after talking to david.storch we decided that supporting this particular scenario is not worth the additional server changes to avoid taking the lock there. At least, not at this time. Instead, we'll create a separate ticket to address this issue in the future: see SERVER-39939.
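The session-selection problem described in this comment can be sketched outside the server as follows. This is only an illustration under stated assumptions, not the test's actual code: the function names `pickByIndex` and `pickFromStored`, the plain array standing in for the collection of session IDs, and the injectable random source are all hypothetical.

```javascript
// Hypothetical stand-in for the collection that stores session IDs.
// Thread IDs 60..64 mirror the example given in the comment.
const tids = [60, 61, 62, 63, 64];
const storedSessionIds = tids.map((tid) => "sessionId" + tid);

// Error-prone approach: build a candidate ID from a random index in
// [0, numThreads). With thread IDs 60..64 no candidate ever matches a
// stored ID, so a retry loop around this lookup would never terminate.
function pickByIndex(stored, numThreads, random = Math.random) {
    const candidate = "sessionId" + Math.floor(random() * numThreads);
    return stored.includes(candidate) ? candidate : null;
}

// More robust approach: choose uniformly among the IDs that actually exist.
function pickFromStored(stored, random = Math.random) {
    if (stored.length === 0) {
        return null;
    }
    return stored[Math.floor(random() * stored.length)];
}
```

In this sketch `pickByIndex(storedSessionIds, 5)` always returns `null`, while `pickFromStored(storedSessionIds)` always returns one of the stored IDs, regardless of how the thread IDs are numbered.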
| ||||
| Comment by Craig Homa [ 28/Feb/19 ] | ||||
|
Removed this from the All Cursors Globally Managed epic as further testing showed that the work done in the project did not enable the extra testing. | ||||
| Comment by Tess Avitabile (Inactive) [ 26/Feb/19 ] | ||||
|
I was referring to the find in the killSessions FSM state. Can you see the JS stack traces for the FSM states? | ||||
| Comment by Anton Korshunov [ 26/Feb/19 ] | ||||
|
No, I don't see the killSessions command in the hang analyzer output, which means it completed successfully. I'll send you the hang analyzer output if you want to take a look at it. | ||||
| Comment by Tess Avitabile (Inactive) [ 26/Feb/19 ] | ||||
|
Without looking at the hang analyzer output myself, it's hard to say for sure what is going on. However, I do see one thing in the killSessions FSM state that is suspicious. We select the session to kill by doing a find, which takes locks. If there is a DDL operation that is blocked by an open transaction, then this could cause the find to hang. Do you see that the killSessions state is getting stuck at this point? If that doesn't help diagnose the problem, I'd be happy to look at the hang analyzer output with you. | ||||
| Comment by Anton Korshunov [ 26/Feb/19 ] | ||||
|
A further update. We had a theory that the deadlock could still be related to the killCursors command acquiring a lock in AutoStatsTracker, but I couldn't find any evidence that it was the cause of the deadlock. I actually found the opposite: neither killOp nor killCursors is the cause of the hang. The suite only fails when killSessions comes into play. So, if the issue were with the cursor manager and collection locks, it would also manifest in killCursors, which is not the case. That said, it could be something in killSessions that leads to a deadlock (e.g., a transaction is aborted without proper cleanup). I wonder if someone from the Replication team could comment on this matter. | ||||
| Comment by Tess Avitabile (Inactive) [ 05/Jul/18 ] | ||||
|
The test snapshot_read_kill_operations.js alternates running transaction operations, killOp, killCursors, and killSessions. If there are concurrent DDL operations happening in the suite that are blocked by transactions, then the test can hang in the killCursors or killSessions state, since killing cursors requires taking collection locks. Since the test hangs in killCursors or killSessions, it will not progress to a state where it would abort/commit the transaction, which would allow the DDL operation to proceed. In Evergreen, we increase the transaction expiration deadline to 2 hours, so these hangs will not be resolved by the transaction reaper. When we no longer require taking collection locks to kill cursors, it should be possible to re-enable this test in the concurrency_simultaneous_replication suite, since it should no longer hang in killCursors or killSessions. | ||||
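The hang described in this comment can be illustrated with a toy model of a single collection lock. This is a sketch under assumptions, not the server's lock manager: the `ToyLock` class, the reduced set of lock modes, the IS mode chosen for the kill operation, and the rule that a new request queues behind any conflicting waiter are all simplifications made for the example.

```javascript
// Toy conflict table for the three modes modelled here (S is omitted).
const CONFLICTS = {
    IS: ["X"],
    IX: ["X"],
    X: ["IS", "IX", "X"],
};

// Toy collection lock with a FIFO wait queue. Assumption for this sketch:
// a request waits if it conflicts with a granted request OR with a request
// that is already waiting, so later readers queue behind a pending X.
class ToyLock {
    constructor() {
        this.granted = [];
        this.waiting = [];
    }

    request(owner, mode) {
        const conflictsWith = (other) => CONFLICTS[mode].includes(other.mode);
        if (this.granted.some(conflictsWith) || this.waiting.some(conflictsWith)) {
            this.waiting.push({owner, mode});
            return false;  // caller would block here
        }
        this.granted.push({owner, mode});
        return true;
    }
}

const lock = new ToyLock();
// The open transaction holds an intent lock on the collection.
const txnGranted = lock.request("transaction", "IX");
// A concurrent DDL operation needs exclusive access and waits for the transaction.
const ddlGranted = lock.request("ddl", "X");
// The kill operation now needs a collection lock and queues behind the DDL.
const killGranted = lock.request("killCursors", "IS");
```

In this model the transaction's request is granted while the DDL and kill requests both wait. Because the FSM thread that is stuck in the kill state is the one that would later commit or abort the transaction, nothing releases the first lock and the queue never drains.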
| Comment by Jonathan Abrahams [ 05/Jul/18 ] | ||||
|
I believe tess.avitabile should have more context on that. I believe that random killSessions and killCursors created deadlocks when subsequently committing or aborting the associated transaction. | ||||
| Comment by David Storch [ 05/Jul/18 ] | ||||
|
jonathan.abrahams, can you elaborate on why this test is blacklisted from concurrency_simultaneous_replication? That may help me understand why the query team's planned work around cursor management will allow us to unblacklist the test. | ||||
| Comment by Jonathan Abrahams [ 05/Jul/18 ] | ||||
|
david.storch Yeah, that is a bit confusing! When the work to support the globally managed cursors is implemented, this workload will no longer need to be blacklisted. tess.avitabile Is there a particular SERVER ticket that this should be dependent on? | ||||
| Comment by David Storch [ 03/Jul/18 ] | ||||
|
jonathan.abrahams, this ticket is in an odd state. Its component is "Testing Infrastructure", yet it is assigned to the replication team, but it is also inside an Epic for a query team project. Can you clarify? Which team do you imagine will do this work? |