[SERVER-10111] buildbot-special: sharding/remove2.js failing on Nightly Linux 64-bit SSL Amazon AMI
Created: 05/Jul/13  Updated: 11/Jul/16  Resolved: 13/Jul/13

| Status: | Closed |
| Project: | Core Server |
| Component/s: | Testing Infrastructure |
| Affects Version/s: | None |
| Fix Version/s: | 2.5.1 |
| Type: | Bug |
| Priority: | Major - P3 |
| Reporter: | Matt Kangas |
| Assignee: | Greg Studer |
| Resolution: | Done |
| Votes: | 0 |
| Labels: | buildbot |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | buildbot-special: Nightly Linux 64-bit SSL Amazon AMI |
| Issue Links: | |
| Operating System: | ALL |
| Steps To Reproduce: | scons -j1 --no-glibc-check --ssl --distmod=amzn64-ssl --sharedclient --release mongosTest smokeSharding |
| Participants: | |
| Description |

Failed on the two most recent builds:
Build #541 (July 05) fails in the same way
| Comments |
| Comment by J Rassi [ 13/Jul/13 ] |

Confirmed that the

Looking further into the failure, I discovered that slaveTracking::_slaves on the primary would sometimes list the secondary as one oplog entry behind the primary, long after the secondary had fully caught up. As such, the cleanup thread was never signaled to continue. Per above, this drastically slowed down the secondary-throttle migration mechanism, and in the end resulted in extremely slow migrations.

It seems that the
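To make the failure mode concrete, here is a minimal C++ sketch, not MongoDB's actual slaveTracking code (the names SlaveTracker, updateSlaveOptime, and waitForReplication are hypothetical). The primary-side tracker records each secondary's last reported optime and wakes replication waiters through a condition variable; if the recorded optime is stale (one entry behind), the waiters' wake-up predicate never becomes true, so each notify just puts them back to sleep until their timeout.

```cpp
// Sketch of the reported failure mode, not the real slaveTracking code:
// waiters only make progress when the recorded optime for some secondary
// reaches their target, so a stale _slaves entry starves them.
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <map>
#include <mutex>
#include <string>

class SlaveTracker {
public:
    // Called when a secondary reports its replication progress.
    void updateSlaveOptime(const std::string& slaveId, uint64_t optime) {
        std::lock_guard<std::mutex> lk(_mutex);
        _slaves[slaveId] = optime;  // a stale value here starves the waiters
        _threadsWaitingForReplication.notify_all();
    }

    // Called by the migration cleanup thread (secondary throttle): wait until
    // some secondary has replicated up to 'targetOptime', or give up after
    // 'timeout'. Returns true if replication was confirmed in time.
    template <class Rep, class Period>
    bool waitForReplication(uint64_t targetOptime,
                            std::chrono::duration<Rep, Period> timeout) {
        std::unique_lock<std::mutex> lk(_mutex);
        return _threadsWaitingForReplication.wait_for(lk, timeout, [&] {
            for (const auto& kv : _slaves)
                if (kv.second >= targetOptime)
                    return true;
            return false;
        });
    }

private:
    std::mutex _mutex;
    std::condition_variable _threadsWaitingForReplication;
    std::map<std::string, uint64_t> _slaves;  // slave id -> last recorded optime
};
```

Under a pattern like this, a stale _slaves entry produces exactly the symptom above: the secondary throttle waits out its full timeout on every chunk cleanup, so migrations become extremely slow even though the secondary caught up long ago.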
| Comment by J Rassi [ 12/Jul/13 ] |

My analysis: while the cleanup is waiting for secondary-throttle replication, the thread queues itself up on the _threadsWaitingForReplication condition variable, but sometimes doesn't get signaled even when the secondary has caught up. The thread only gets woken up after the one-minute timed wait, which exceeds the 30-second assert.soon() default, generating the test failure. I would guess that this failure is related to the recent slaveTracking changes. It's not clear to me yet whether the lack of signaling is due to a deadlock or some other timing-related bug.

I've taken the bs-e-amzn64-2 buildslave process offline, and I can reproduce the issue on that machine. I redefined the assert.soon() macro in this test to hang on failure, so I'm able to drop into a gdb session for further work (and, helpfully, the test still fails on the debug build).
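For illustration only, the timing described here can be reproduced with a plain condition variable; this is a standalone sketch with hypothetical names, not the test or server code. When the expected notification is missed, a 60-second timed wait only returns at its deadline (barring spurious wakeups), which is well past a 30-second assert.soon()-style limit, so the test gives up first.

```cpp
// Standalone demonstration of a missed condition-variable signal: the
// "cleanup" thread waits up to 60 seconds, the flag is set without a
// notify, and the waiter resumes only when the timed wait expires.
// Note: running this takes roughly a minute by design.
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

int main() {
    std::mutex m;
    std::condition_variable cv;
    bool caughtUp = false;

    // Simulated cleanup thread: waits up to 60s for a signal.
    std::thread cleanup([&] {
        std::unique_lock<std::mutex> lk(m);
        auto start = std::chrono::steady_clock::now();
        cv.wait_for(lk, std::chrono::seconds(60), [&] { return caughtUp; });
        auto waited = std::chrono::duration_cast<std::chrono::seconds>(
            std::chrono::steady_clock::now() - start);
        std::cout << "cleanup thread resumed after " << waited.count() << "s\n";
    });

    // Simulated secondary: catches up after one second, but the flag is set
    // without notifying the condition variable (the missed-signal bug).
    std::this_thread::sleep_for(std::chrono::seconds(1));
    {
        std::lock_guard<std::mutex> lk(m);
        caughtUp = true;
        // cv.notify_all() intentionally omitted.
    }

    cleanup.join();
    return 0;
}
```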
| Comment by Matt Kangas [ 11/Jul/13 ] |

Same builder, different error message? Nightly Linux 64-bit SSL Amazon AMI Build #548
| Comment by Greg Studer [ 08/Jul/13 ] |

Looks like a deadlock in removeRange - discussing as part of another fix.
| Comment by Matt Kangas [ 08/Jul/13 ] |

Ditto on Nightly Linux 64-bit Subscription Ubuntu 1204 build #157 (July 07): http://buildbot-special.10gen.com/builders/Nightly%20Linux%2064-bit%20Subscription%20Ubuntu%201204/builds/157/steps/shell_3/logs/stdio