[SERVER-35567] Concurrency simultaneous replication tests - unblacklist snapshot_read_kill_operations.js and remove snapshot_read_kill_op_only.js Created: 12/Jun/18  Updated: 06/Nov/19  Resolved: 04/Mar/19

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Jonathan Abrahams Assignee: Anton Korshunov
Resolution: Won't Fix Votes: 0
Labels: open_todo_in_code, todo_in_code
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-43479 Complete TODO listed in SERVER-35567 Closed
related to SERVER-44212 Complete TODO listed in SERVER-35567 Closed
related to SERVER-39939 Avoid taking any lock manager locks i... Backlog
Sprint: Query 2019-02-25, Query 2019-03-11
Participants:

 Description   

The test snapshot_read_kill_operations.js is blacklisted in the concurrency_simultaneous_replication suite because it uses states that are not supported in 4.0. An extended test, snapshot_read_kill_op_only.js, was created.

The extended test and the blacklist entry can be removed once global cursor management has been implemented.



 Comments   
Comment by Githook User [ 06/Nov/19 ]

Author:

{'username': 'antkorsh', 'email': 'anton.korshunov@mongodb.com', 'name': 'Anton Korshunov'}

Message: SERVER-44212 Complete TODO listed in SERVER-35567
Branch: master
https://github.com/mongodb/mongo/commit/246e119f7258784535b2eee9fc8645fd382a2a51

Comment by Anton Korshunov [ 04/Mar/19 ]

So, after all, it was a combination of a test issue and the design of the killCursors locking protocol. The hang I was observing was caused by the implementation of the FSM killSessions state. At the beginning of the test we start a new transaction and insert a session ID into a collection. We need to store the session IDs in a collection because every FSM thread needs access to this information in order to randomly pick a session ID to kill; the FSM framework doesn't provide a global working area shared by all threads, so workloads can only operate on local structures allocated per thread. In the killSessions state we execute a find command (which takes locks) to select a session ID from the collection, and this is where we get stuck, due to the following sequence of lock acquisitions (a simplified sketch of these two states follows the list below):

  • Thread 1 starts a transaction and takes a Global IX lock to write the session ID
  • Thread 2, in a parallel test, executes a DDL command and queues a Global X lock
  • Thread 1 enters the killSessions state, runs a find, and queues a Global IS lock

Thread 1's IS request queues behind Thread 2's pending X request, which in turn cannot be granted while Thread 1 still holds the IX lock for its open transaction, so neither thread can make progress.
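
For illustration, here is a simplified, hypothetical sketch of the two FSM states involved; it is not the actual snapshot_read_kill_operations.js workload, and the state and field names are invented. It only shows where the locks come from: the transaction state writes the session document (Global IX, held until commit/abort), and the killSessions state runs a find (Global IS) to pick a victim.

var $config = (function() {
    var states = {
        startTxn: function startTxn(db, collName) {
            this.session = db.getMongo().startSession({causalConsistency: false});
            this.sessionDb = this.session.getDatabase(db.getName());
            this.session.startTransaction({readConcern: {level: "snapshot"}});
            // Record this thread's session ID so other threads can pick it as a
            // kill target. This write takes the Global IX lock, which stays held
            // while the transaction remains open.
            assert.writeOK(this.sessionDb[collName].insert(
                {_id: "sessionDoc" + this.tid, id: this.session.getSessionId().id}));
        },
        killSessions: function killSessions(db, collName) {
            // Picking a victim requires a find, which queues a Global IS lock and
            // can block behind a DDL command's pending Global X lock.
            const doc = db[collName].findOne({_id: {$ne: "sessionDoc" + this.tid}});
            if (doc) {
                assert.commandWorked(db.runCommand({killSessions: [{id: doc.id}]}));
            }
        },
    };
    var transitions = {startTxn: {killSessions: 1}, killSessions: {startTxn: 1}};
    return {
        threadCount: 5,
        iterations: 10,
        startState: "startTxn",
        states: states,
        transitions: transitions,
        data: {}
    };
})();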

To work around this issue, I modified the test to store the session IDs in a separate mongod instance. Then I hit another problem with the test. Here is how the test picked a session to kill:

let idToKill, sessionDocToKill;
while (!sessionDocToKill || idToKill == "sessionDoc" + this.tid) {
    idToKill = "sessionDoc" + Math.floor(Math.random() * this.threadCount);
    sessionDocToKill = db[collName].findOne({"_id": idToKill});
}

A session document ID in this test is formed as "sessionDoc" + tid, where tid is the thread ID passed to the test by the FSM workload manager. If we were running 5 threads with thread IDs in the range [60..64], we'd never be able to exit the loop above, because the Math.floor(...) expression only returns values in the range [0..4]. I don't know why this part was written the way it was; a less error-prone approach is to fetch all documents in the collection holding the session IDs and then randomly select one element from the returned array. That is how I rewrote it, and it worked fine.
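
Here is a rough sketch of that rewritten selection, assuming the session documents keep the "sessionDoc" + tid naming shown above (illustrative only, not the exact committed code):

// Fetch every session document except this thread's own and pick one at random.
// Unlike the loop above, this terminates no matter which thread IDs the FSM
// workload manager assigned to this workload.
const candidates =
    db[collName].find({_id: {$ne: "sessionDoc" + this.tid}}).toArray();
if (candidates.length > 0) {
    const sessionDocToKill =
        candidates[Math.floor(Math.random() * candidates.length)];
    // ... kill the session recorded in sessionDocToKill ...
}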

Then I hit another deadlock, and this time it was caused by the code path I mentioned earlier: when the killSessions command tries to kill a cursor, we acquire a Global IS lock in AutoStatsTracker and hang for the same reason described above.

For that reason, the test cannot be re-enabled. Also, after talking to david.storch, we decided that supporting this particular scenario is not worth the additional server changes needed to avoid taking the lock there, at least not at this time. Instead, we'll address this in a separate ticket: see SERVER-39939.

 

Comment by Craig Homa [ 28/Feb/19 ]

Removed this from the All Cursors Globally Managed epic as further testing showed that the work done in the project did not enable the extra testing.

Comment by Tess Avitabile (Inactive) [ 26/Feb/19 ]

I was referring to the find in the killSessions FSM state. Can you see the JS stack traces for the FSM states?

Comment by Anton Korshunov [ 26/Feb/19 ]

No, I don't see the killSessions command in the hang analyzer output, which means it completed successfully. I'll send you the hang analyzer output if you want to take a look at it.

Comment by Tess Avitabile (Inactive) [ 26/Feb/19 ]

Without looking at the hang analyzer output myself, it's hard to say for sure what is going on. However, I do see one thing in the killSessions FSM state that is suspicious. We select the session to kill by doing a find, which takes locks. If there is a DDL operation that is blocked by an open transaction, then this could cause the find to hang. Do you see that the killSessions state is getting stuck at this point?

If that doesn't help diagnose the problem, I'd be happy to look at the hang analyzer output with you.

Comment by Anton Korshunov [ 26/Feb/19 ]

A further update. We had a theory that the deadlock could still be related to the killCursors command acquiring a lock in AutoStatsTracker, but I couldn't find any evidence that it was the cause of the deadlock. In fact, I found the opposite: neither killOp nor killCursors is the cause of the hang. The suite only fails when killSessions comes into play. So, if the issue were with the cursor manager and collection locks, it would also manifest with killCursors, which is not the case. That said, it could be something in killSessions that leads to a deadlock (e.g., a transaction aborted without proper cleanup). I wonder if someone from the Replication team could comment on this matter.

Comment by Tess Avitabile (Inactive) [ 05/Jul/18 ]

The test snapshot_read_kill_operations.js alternates running transaction operations, killOp, killCursors, and killSessions. If there are concurrent DDL operations happening in the suite that are blocked by transactions, then the test can hang in the killCursors or killSessions state, since killing cursors requires taking collection locks. Since the test hangs in killCursors or killSessions, it will not progress to a state where it would abort/commit the transaction, which would allow the DDL operation to proceed. In Evergreen, we increase the transaction expiration deadline to 2 hours, so these hangs will not be resolved by the transaction reaper. When we no longer require taking collection locks to kill cursors, it should be possible to re-enable this test in the concurrency_simultaneous_replication suite, since it should no longer hang in killCursors or killSessions.
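
For illustration, here is a minimal, hypothetical shell sketch of that blocking chain (the collection name and DDL command are invented, and this is not the FSM workload itself):

// Connection 1: open a snapshot transaction and leave a cursor open, so the
// transaction holds collection locks until it commits or aborts.
const session = db.getMongo().startSession();
const sessionDb = session.getDatabase("test");
session.startTransaction({readConcern: {level: "snapshot"}});
const res = assert.commandWorked(
    sessionDb.runCommand({find: "coll", batchSize: 1}));  // assumes "coll" holds >1 doc
const cursorId = res.cursor.id;

// Connection 2: a concurrent DDL operation, e.g. db.coll.createIndex({x: 1}),
// now queues for the collection X lock behind the open transaction.

// Connection 1 (or another FSM state): killing the cursor also requires the
// collection lock, so it queues behind the pending DDL request. The workload
// never reaches the state that would commit or abort the transaction, and with
// the transaction expiration deadline raised to 2 hours in Evergreen, nothing
// breaks the cycle.
db.getSiblingDB("test").runCommand({killCursors: "coll", cursors: [cursorId]});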

Comment by Jonathan Abrahams [ 05/Jul/18 ]

I believe tess.avitabile has more context on that. My understanding is that random killSessions and killCursors created deadlocks when subsequently committing or aborting the associated transaction.

Comment by David Storch [ 05/Jul/18 ]

jonathan.abrahams, can you elaborate on why this test is blacklisted from concurrency_simultaneous_replication? That may help me understand why the query team's planned work around cursor management will allow us to unblacklist the test.

Comment by Jonathan Abrahams [ 05/Jul/18 ]

david.storch Yeah, that is a bit confusing! When the work to support globally managed cursors is implemented, this workload no longer needs to be blacklisted.

tess.avitabile Is there a particular SERVER ticket that this should be dependent on?

Comment by David Storch [ 03/Jul/18 ]

jonathan.abrahams, this ticket is in an odd state. Its component is "Testing Infrastructure", it is assigned to the replication team, and it is also inside an Epic for a query team project. Can you clarify? Which team do you imagine will do this work?
