[SERVER-21582] Sharding tests' run time has doubled Created: 20/Nov/15  Updated: 14/Apr/16  Resolved: 20/Nov/15

Status: Closed
Project: Core Server
Component/s: Performance, WiredTiger
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Kaloian Manassiev Assignee: Alexander Gorrod
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: enterprise-rhel62_sharding.png (PNG), linux64-sharding_WT.png (PNG)
Issue Links:
Duplicate
duplicates SERVER-21078 Segfault from race between getlasterr... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

The run time of the sharding suite has doubled between these two commits:

SERVER-20877 - fix linting errors - took 49 min

SERVER-21078 - Fix race conditions by using the session cache more within the wtkv engine code - took 2hrs+

The increase in time is not in any specific test, but is across the board. Tests that took 1 minute to run now seem to take about 2 minutes. Longer test run times have gone up as well.



 Comments   
Comment by Daniel Pasette (Inactive) [ 20/Nov/15 ]

Fixed by the change in SERVER-21078.

Comment by Alexander Gorrod [ 20/Nov/15 ]

After digging into this, it appears as though an applyOps cursor is being left open for a while. The new code in https://github.com/mongodb/mongo/commit/1944e28410ee687c7314e848d96582d5a9d54ff6 now waits during shutdown for that session to be released.

The original issue this change was fixing involved a workload that constantly runs getLastError with fsync:true, which could race with a WiredTiger connection close. The race was possible because a clean MongoDB shutdown takes the global lock, but getLastError does not take that lock.

I have coded up a patch that changes how the race is avoided: the WiredTiger storage engine will wait for any in-flight fsync operation to complete before shutting down, rather than waiting for all sessions to be returned to the session cache.
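In rough terms, that approach could look like the minimal sketch below (not the actual patch; all names such as FsyncGate, enter, leave, and waitForDrain are hypothetical): track in-flight fsync operations in a small gate object, refuse new ones once shutdown begins, and have shutdown drain only that gate.

{code:cpp}
// Minimal sketch of the approach described above, NOT the actual patch.
// All names here (FsyncGate, enter/leave/waitForDrain) are hypothetical.
#include <condition_variable>
#include <mutex>

class FsyncGate {
public:
    // Called on the getLastError/fsync path before it touches the
    // WiredTiger connection. Returns false once shutdown has begun,
    // so the caller can bail out instead of racing with close.
    bool enter() {
        std::lock_guard<std::mutex> lk(_mutex);
        if (_shuttingDown)
            return false;
        ++_inFlight;
        return true;
    }

    // Called when the fsync operation finishes.
    void leave() {
        std::lock_guard<std::mutex> lk(_mutex);
        if (--_inFlight == 0)
            _drained.notify_all();
    }

    // Called once at shutdown, before WT_CONNECTION::close():
    // refuses new fsyncs and waits for in-flight ones to complete.
    void waitForDrain() {
        std::unique_lock<std::mutex> lk(_mutex);
        _shuttingDown = true;
        _drained.wait(lk, [this] { return _inFlight == 0; });
    }

private:
    std::mutex _mutex;
    std::condition_variable _drained;
    int _inFlight = 0;
    bool _shuttingDown = false;
};
{code}

The point of the change is that shutdown then blocks only on work already in flight (any running fsync), instead of on every checked-out session, so a long-lived session such as the applyOps cursor no longer stalls it.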

My testing shows that fixes the testing performance regression, and I haven't been able to reproduce the failure described in SERVER-21078.

Comment by Michael Cahill (Inactive) [ 20/Nov/15 ]

Moving to SERVER now that we have confirmed this is real. Background syncs appear to hold sessions open while they are waiting for activity. There may also be other cases.

Current thinking is to revert part of SERVER-21078 and check inside WiredTiger that there is no flush in progress when WT_CONNECTION::close is called.
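Purely as an illustration of that invariant (WiredTiger itself is written in C, and everything below is invented for the sketch), the check might amount to:

{code:cpp}
// Hypothetical sketch of the proposed invariant, not WiredTiger code:
// close() asserts that no flush is still in progress, instead of the
// storage layer draining the entire session cache to guarantee it.
#include <atomic>
#include <cassert>

std::atomic<int> flushesInProgress{0};  // bumped around each flush/fsync

void connectionClose() {
    // If this fires, a flush raced with shutdown -- the underlying bug
    // that SERVER-21078 worked around by waiting on all sessions.
    assert(flushesInProgress.load() == 0 && "flush in progress during close");
    // ... proceed with the real WT_CONNECTION::close() ...
}
{code}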

Comment by Kaloian Manassiev [ 20/Nov/15 ]

michael.cahill:

I suspect there is some sharding operation that keeps a session active...

Sharding tests enable the snapshot thread on the config server instances in order to use read concern majority, so this could be what is keeping WT sessions active.

Comment by Max Hirschhorn [ 20/Nov/15 ]

Interestingly, there is no increase in time on the linux-64 suite.

It seems like the execution time has increased across all build variants (see attached screenshots). Perhaps the sharding and sharding_WT tasks hadn't run on Linux 64-bit when you checked. It might be obvious, but I'd like to mention explicitly that this affects both the sharding and sharding_WT tasks, because CSRS (config server replica sets) run with --storageEngine=wiredTiger.

Also, we are hitting the Evergreen 6-hour task execution time limit with the Windows DEBUG builders.

Comment by Michael Cahill (Inactive) [ 20/Nov/15 ]

The marker in the log is "Waiting for WiredTiger operations to complete" – that is definitely kicking in across all of the sharding tests. I'll take a look today and let you know what I find.

Comment by Michael Cahill (Inactive) [ 20/Nov/15 ]

That change makes shutdown wait for any in-progress WiredTiger operations, because otherwise mongod can segfault.

It defines "in progress" as having a WiredTiger active (checked out of the cache). I suspect there is some sharding operation that keeps a session active, which will make the shutdown very slow (it does eventually give up).

We'll take a look and try to catch the state when this happens.

Comment by Kaloian Manassiev [ 20/Nov/15 ]

Interestingly, there is no increase in time on the linux-64 suite. ernie.hershey, is it possible that this is due to a hardware change?

Comment by Kaloian Manassiev [ 20/Nov/15 ]

Aggregation test run times have gone up as well, but that's probably because of the aggregation tests that exercise sharding: https://evergreen.mongodb.com/task_timing/mongodb-mongo-master#/project=mongodb-mongo-master&buildVariant=enterprise-rhel-62-64-bit&taskName=aggregation
