[SERVER-26308] Decrease number of jobs for sharding-related suites on Windows DEBUG and PPC variants Created: 23/Sep/16  Updated: 05/Apr/17  Resolved: 28/Dec/16

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: 3.4.2, 3.5.2

Type: Task Priority: Critical - P2
Reporter: Charlie Swanson Assignee: Daniel Pasette (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Duplicate
is duplicated by SERVER-26456 Avoid overloading Windows test machines Closed
Related
is related to SERVER-27605 Reduce concurrency for jsCore_small_o... Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v3.4, v3.2
Participants:
Linked BF Score: 0

 Description   

If you look at the task history, you can see that some particular suites have a very low success rate historically. For example:

  1. sharding in windows 2k8 debug (note sharding_WT does not have the same problem historically).
  2. sharding_WT and sharding_auth on enterprise-ubuntu1604-ppc64le.
  3. sharding_WT and sharding_auth on arm64.
  4. replicasets on windows 2k8 DEBUG. (Note again, replicasets_WT is unaffected).

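The mitigation being considered is to cap the number of concurrent resmoke jobs on these variants rather than letting each one default to one job per CPU. A minimal sketch of that kind of cap (the cap value and the helper are illustrative assumptions, not the numbers used in the eventual fix):

    import multiprocessing

    # Illustrative sketch only: cap the concurrent resmoke jobs on the slow
    # variants instead of defaulting to one job per CPU core. The cap value is
    # an assumption, not the number chosen in the actual change.
    SLOW_VARIANT_JOB_CAP = 8

    def resmoke_jobs(is_slow_variant):
        default_jobs = multiprocessing.cpu_count()
        if is_slow_variant:
            return min(default_jobs, SLOW_VARIANT_JOB_CAP)
        return default_jobs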

 Comments   
Comment by Githook User [ 13/Jan/17 ]

Author:

Dan Pasette (monkey101) <dan@mongodb.com>

Message: SERVER-26308 Reduce number of jobs for all replicasets and sharding tasks on Windows and ARM variants

(cherry picked from commit 3f64fb082c4e2a3c5750a2f0bb8dfffbabe4d06e)
Branch: v3.4
https://github.com/mongodb/mongo/commit/211c01760688b7864f0c06cec073a5854b594870

Comment by Githook User [ 28/Dec/16 ]

Author:

Dan Pasette (monkey101) <dan@mongodb.com>

Message: SERVER-26308 Reduce number of jobs for all replicasets and sharding tasks on Windows and ARM variants
Branch: master
https://github.com/mongodb/mongo/commit/3f64fb082c4e2a3c5750a2f0bb8dfffbabe4d06e

Comment by Daniel Pasette (Inactive) [ 28/Dec/16 ]

Following up on the original description: the ARM and Windows tasks are still not passing reliably, but the PPC tasks now are. I'm going to halve the number of jobs used on both ARM and Windows.

Ubuntu 1604 ARM tasks:

Windows DEBUG tasks:

Ubuntu 1604 PPC tasks are now passing:

Comment by Eric Milkie [ 29/Nov/16 ]

For evidence, I present exhibit A:
https://logkeeper.mongodb.org/build/561bb94cb25ffde45adf0fc390dad8ac/test/583ca1599041301924037542
(running from https://evergreen.mongodb.com/task/mongodb_mongo_master_windows_64_2k8_debug_replicasets_2068c42aa2179902d4a96941fcfc7cd577a4c2a9_16_11_28_20_13_22 )

After 21:29:00, the test hangs (due to a logic error). Incredibly, even after the test hangs, we still see the replica set struggle to stay up. At one point, ftdc reports that it took 12 seconds to run the serverStatus command, all on an idle three-node replica set. The culprit is clearly that we are running too many jobs on one machine: 16 jobs, each driving its own replica set, means dozens of mongod processes contending for the same disk. Running 16 jobs on one Windows machine completely overwhelms the IO subsystem, and we'll continue to see many build failures because of it.
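A hypothetical way to watch this happen (not taken from this ticket) is to time serverStatus from a small script while the suite runs; latency climbing into the seconds suggests the disk is saturated by the concurrent jobs. This sketch assumes pymongo is installed and a node is listening on the default port:

    import time
    from pymongo import MongoClient  # assumes pymongo is installed

    # Time serverStatus once a second against a local node while the suite
    # is running; a command this cheap taking multiple seconds points at an
    # overloaded host rather than the test itself.
    client = MongoClient("localhost", 27017)
    for _ in range(60):
        start = time.time()
        client.admin.command("serverStatus")
        print("serverStatus took %.2fs" % (time.time() - start))
        time.sleep(1)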

Comment by Ernie Hershey [ 27/Sep/16 ]

acm - When I said RHEL was "working better," I just meant that tasks on RHEL are running much faster and failing much less. I haven't seen other evidence of different behavior between the two distros.

Comment by Ernie Hershey [ 27/Sep/16 ]

dan@10gen.com - Brian made the change. It's in BUILD-2166. We also decommissioned old hosts using the windows-vs2015-large distro, so any new tasks will be on the bigger hosts, starting now.

Comment by Daniel Pasette (Inactive) [ 27/Sep/16 ]

Great. I'll stand down. Please update this ticket when we can take a look. I'll do a patch build that decreases the number of jobs on the mmap sharding suite.

Comment by Ernie Hershey [ 27/Sep/16 ]

We can increase the windows distro size today. It's easy. brian.mccarthy or I will do it.

Comment by Daniel Pasette (Inactive) [ 27/Sep/16 ]

I'm going to do some exploring of the win debug issues.

Comment by Daniel Pasette (Inactive) [ 24/Sep/16 ]

cc: ernie.hershey/ramon.fernandez

A few questions:

  • For Windows:
    • Have we determined whether this is the storage system being overwhelmed by concurrent mmapv1 allocations?
    • How likely is either increasing the size of the box or lowering the number of concurrent workers to help? Can we try this out? (A sketch of the second option follows this list.)
  • For ppc64le:
    • It is curious that ubuntu1604 fails consistently but rhel7.1 does not. I believe they're running on the same hardware.
  • For ARM:
    • It appears the timing is bimodal, which implies we're running on different hardware. The slow runs are failing (though not timing out).
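One way to try the "lower the number of concurrent workers" option locally is to run the affected suite with the job count halved and compare failure rates and runtimes against a default run. A rough sketch, assuming buildscripts/resmoke.py with its --suites and --jobs options and pre-built binaries:

    import multiprocessing
    import subprocess

    # Run the sharding suite with half the usual job count; compare the result
    # against a run that uses the default (one job per CPU).
    jobs = max(1, multiprocessing.cpu_count() // 2)
    subprocess.check_call([
        "python", "buildscripts/resmoke.py",
        "--suites=sharding",
        "--jobs=%d" % jobs,
    ])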