[SERVER-11262] sharding/sync6.js failed on Win 32 Created: 18/Oct/13  Updated: 11/Jul/16  Resolved: 04/Dec/13

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: 2.5.5

Type: Bug Priority: Major - P3
Reporter: Matt Kangas Assignee: Randolph Tan
Resolution: Done Votes: 0
Labels: build-failure, buildbot
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File sync6.log    
Issue Links:
Duplicate
is duplicated by SERVER-11749 sharding/sync3.js failing on linux-32 Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

78a5cc81af22 Windows 32-bit sharding (logfile)

 m30004| 2013-10-18T02:10:37.145+0000 [LockPinger] cluster localhost:30000,localhost:30001,localhost:30002 pinged successfully at Mon Oct 21 15:44:38 2013 by distributed lock pinger 'localhost:30000,localhost:30001,localhost:30002/EC2AMAZ-ZVRDQB5:30004:1382062221:41:thread-5:41', sleeping for 133ms
 m30004| 2013-10-18T02:10:37.145+0000 [thread-6] could not force lock 'TimeSkewFailNewTest_lock_8/EC2AMAZ-ZVRDQB5:30004:1382062221:41:thread-3:41' because elapsed time 1947 <= takeover time 4000
 m30004| 2013-10-18T02:10:37.147+0000 [LockPinger] cluster localhost:30000,localhost:30001,localhost:30002 pinged successfully at Tue Oct 22 06:10:37 2013 by distributed lock pinger 'localhost:30000,localhost:30001,localhost:30002/EC2AMAZ-ZVRDQB5:30004:1382062221:41:thread-1:41', sleeping for 133ms
 m30004| 2013-10-18T02:10:37.148+0000 [thread-7] could not force lock 'TimeSkewFailNewTest_lock_8/EC2AMAZ-ZVRDQB5:30004:1382062221:41:thread-3:41' because elapsed time 1950 <= takeover time 4000
 m30004| 2013-10-18T02:10:37.148+0000 [thread-1] could not force lock 'TimeSkewFailNewTest_lock_8/EC2AMAZ-ZVRDQB5:30004:1382062221:41:thread-3:41' because elapsed time 1951 <= takeover time 4000
 m30004| 2013-10-18T02:10:37.148+0000 [thread-1] created new distributed lock for TimeSkewFailNewTest_lock_8 on localhost:30000,localhost:30001,localhost:30002 ( lock timeout : 4000, ping interval : 133, process : 1 )
 m30004| 2013-10-18T02:10:37.149+0000 [LockPinger] cluster localhost:30000,localhost:30001,localhost:30002 pinged successfully at Tue Oct 15 10:57:41 2013 by distributed lock pinger 'localhost:30000,localhost:30001,localhost:30002/EC2AMAZ-ZVRDQB5:30004:1382062221:41:thread-4:41', sleeping for 133ms
 m30004| 2013-10-18T02:10:37.149+0000 [LockPinger] cluster localhost:30000,localhost:30001,localhost:30002 pinged successfully at Sat Oct 19 00:48:31 2013 by distributed lock pinger 'localhost:30000,localhost:30001,localhost:30002/EC2AMAZ-ZVRDQB5:30004:1382062221:41:thread-6:41', sleeping for 133ms
 m30004| 2013-10-18T02:10:26.798+0000 BackgroundJob starting: ConnectBG
 m30004| 2013-10-18T02:10:37.123+0000 [LockPinger] cluster localhost:30000,localhost:30001,localhost:30002 pinged successfully at Tue Oct 22 00:18:17 2013 by distributed lock pinger 'localhost:30000,localhost:30001,localhost:30002/EC2AMAZ-ZVRDQB5:30004:1382062221:41:thread-7:41', sleeping for 133ms
 m30004| 2013-10-18T02:10:37.158+0000 [thread-5] *** unhandled exception 0xE06D7363 at 0x7659FC2E, terminating
 m30004| 2013-10-18T02:10:37.158+0000 [thread-5] *** stack trace for unhandled exception:
 m30004| 2013-10-18T02:10:37.196+0000 [LockPinger] cluster localhost:30000,localhost:30001,localhost:30002 pinged successfully at Mon Oct 21 14:13:16 2013 by distributed lock pinger 'localhost:30000,localhost:30001,localhost:30002/EC2AMAZ-ZVRDQB5:30004:1382062221:41:thread-2:41', sleeping for 133ms
 m30004| 2013-10-18T02:10:37.199+0000 [LockPinger] cluster localhost:30000,localhost:30001,localhost:30002 pinged successfully at Wed Oct 16 03:27:22 2013 by distributed lock pinger 'localhost:30000,localhost:30001,localhost:30002/EC2AMAZ-ZVRDQB5:30004:1382062221:41:thread-8:41', sleeping for 133ms
 m30004| 2013-10-18T02:10:37.348+0000 [LockPinger] cluster localhost:30000,localhost:30001,localhost:30002 pinged successfully at Mon Oct 21 15:44:38 2013 by distributed lock pinger 'localhost:30000,localhost:30001,localhost:30002/EC2AMAZ-ZVRDQB5:30004:1382062221:41:thread-5:41', sleeping for 133ms
 m30004| 2013-10-18T02:10:37.367+0000 [LockPinger] cluster localhost:30000,localhost:30001,localhost:30002 pinged successfully at Tue Oct 22 00:18:17 2013 by distributed lock pinger 'localhost:30000,localhost:30001,localhost:30002/EC2AMAZ-ZVRDQB5:30004:1382062221:41:thread-7:41', sleeping for 133ms
 m30004| 2013-10-18T02:10:37.368+0000 [LockPinger] cluster localhost:30000,localhost:30001,localhost:30002 pinged successfully at Sat Oct 19 00:48:31 2013 by distributed lock pinger 'localhost:30000,localhost:30001,localhost:30002/EC2AMAZ-ZVRDQB5:30004:1382062221:41:thread-6:41', sleeping for 133ms
 m30004| 2013-10-18T02:10:37.376+0000 [LockPinger] cluster localhost:30000,localhost:30001,localhost:30002 pinged successfully at Tue Oct 22 06:10:37 2013 by distributed lock pinger 'localhost:30000,localhost:30001,localhost:30002/EC2AMAZ-ZVRDQB5:30004:1382062221:41:thread-1:41', sleeping for 133ms
 m30004| 2013-10-18T02:10:37.376+0000 [LockPinger] cluster localhost:30000,localhost:30001,localhost:30002 pinged successfully at Tue Oct 15 10:57:42 2013 by distributed lock pinger 'localhost:30000,localhost:30001,localhost:30002/EC2AMAZ-ZVRDQB5:30004:1382062221:41:thread-4:41', sleeping for 133ms
...
 m30004| 2013-10-18T02:10:42.119+0000 [LockPinger] cluster localhost:30000,localhost:30001,localhost:30002 pinged successfully at Sat Oct 19 00:48:36 2013 by distributed lock pinger 'localhost:30000,localhost:30001,localhost:30002/EC2AMAZ-ZVRDQB5:30004:1382062221:41:thread-6:41', sleeping for 133ms
 m30004| 2013-10-18T02:10:42.178+0000 [LockPinger] cluster localhost:30000,localhost:30001,localhost:30002 pinged successfully at Wed Oct 16 03:27:27 2013 by distributed lock pinger 'localhost:30000,localhost:30001,localhost:30002/EC2AMAZ-ZVRDQB5:30004:1382062221:41:thread-8:41', sleeping for 133ms
 m30004| 2013-10-18T02:10:42.178+0000 [LockPinger] cluster localhost:30000,localhost:30001,localhost:30002 pinged successfully at Tue Oct 15 10:57:47 2013 by distributed lock pinger 'localhost:30000,localhost:30001,localhost:30002/EC2AMAZ-ZVRDQB5:30004:1382062221:41:thread-4:41', sleeping for 133ms
 m30004| 2013-10-18T02:10:42.196+0000 [thread-5] kernel32.dll                                                                        RaiseException+0x58
 m30004| 2013-10-18T02:10:42.196+0000 [thread-5] mongod.exe    f:\dd\vctools\crt_bld\self_x86\crt\prebuild\eh\throw.cpp(157)         _CxxThrowException+0x48
 m30004| 2013-10-18T02:10:42.196+0000 [thread-5] mongod.exe    ...\src\mongo\client\connpool.cpp(249)                                mongo::DBConnectionPool::get+0x24e
 m30004| 2013-10-18T02:10:42.196+0000 [thread-5] mongod.exe    ...\src\mongo\client\connpool.h(248)                                  mongo::ScopedDbConnection::ScopedDbConnection+0x7f
 m30004| 2013-10-18T02:10:42.196+0000 [thread-5] mongod.exe    ...\src\mongo\client\distlock.cpp(379)                                mongo::DistributedLock::remoteTime+0x118
 m30004| 2013-10-18T02:10:42.197+0000 [thread-5] mongod.exe    ...\src\mongo\client\distlock.cpp(645)                                mongo::DistributedLock::lock_try+0x187f
 m30004| 2013-10-18T02:10:42.197+0000 [thread-5] mongod.exe    ...\src\mongo\client\distlock_test.cpp(255)                           mongo::TestDistLockWithSkew::runThread+0xaf7
 m30004| 2013-10-18T02:10:42.197+0000 [thread-5] mongod.exe    ...\src\third_party\boost\boost\thread\detail\thread.hpp(63)          boost::detail::thread_data<boost::_bi::bind_t<void,boost::_mfi::mf5<void,mongo::TestDistLockWithSkew,mongo::ConnectionString &,unsigned int,unsigned int,mongo::BSONObj &,mongo::BSONObjBuilder &>,boost::_bi::list6<boost::_bi::value<mongo::TestDistLockWithSkew *>,boost::_bi::value<mongo::ConnectionString>,boost::_bi::value<unsigned int>,boost::_bi::value<unsigned int>,boost::reference_wrapper<mongo::BSONObj>,boost::reference_wrapper<mongo::BSONObjBuilder> > > >::run+0x21
 m30004| 2013-10-18T02:10:42.197+0000 [thread-5] mongod.exe    ...\src\third_party\boost\libs\thread\src\win32\thread.cpp(16707566)  boost::`anonymous namespace'::thread_start_function+0x7b
 m30004| 2013-10-18T02:10:42.197+0000 [thread-5] mongod.exe    f:\dd\vctools\crt_bld\self_x86\crt\src\threadex.c(314)                _callthreadstartex+0x1b
 m30004| 2013-10-18T02:10:42.197+0000 [thread-5] mongod.exe    f:\dd\vctools\crt_bld\self_x86\crt\src\threadex.c(292)                _threadstartex+0x64
 m30004| 2013-10-18T02:10:42.197+0000 [thread-5] kernel32.dll                                                                        BaseThreadInitThunk+0x12
 m30004| 2013-10-18T02:10:42.197+0000 [thread-5] ntdll.dll                                                                           RtlInitializeExceptionChain+0x63
 m30004| 2013-10-18T02:10:42.198+0000 [thread-5] 
 m30004| 2013-10-18T02:10:42.198+0000 [thread-5] writing minidump diagnostic file mongo.dmp
 m30004| 2013-10-18T02:10:42.986+0000 [thread-5] *** immediate exit due to unhandled exception

Another example:
1a90352d5f09 Windows 32-bit sharding



 Comments   
Comment by Githook User [ 08/Dec/13 ]

Author:

{u'username': u'gregstuder', u'name': u'Greg Studer', u'email': u'greg@10gen.com'}

Message: SERVER-11262 explicitly init delay for compiler
Branch: master
https://github.com/mongodb/mongo/commit/daa2acbba25387621326f708ec0a7afbb746664a

Comment by Githook User [ 08/Dec/13 ]

Author:

{u'username': u'gregstuder', u'name': u'Greg Studer', u'email': u'greg@10gen.com'}

Message: SERVER-11262 sync6.js sync3.js correctly wrap connectivity exceptions in remote time check
Branch: master
https://github.com/mongodb/mongo/commit/8e2d6557f8b3e17b7230377a89b6dece9e87e461

Comment by Githook User [ 04/Dec/13 ]

Author:

{u'username': u'renctan', u'name': u'Randolph Tan', u'email': u'randolph@10gen.com'}

Message: SERVER-11262 sharding/sync6.js failed on Win 32
Branch: master
https://github.com/mongodb/mongo/commit/051d74136ee5f920dbab72150a33a31405757ab9

Comment by Randolph Tan [ 03/Dec/13 ]

What's failing:
The atomic integer that a thread was supposed to have exclusive access to (via the virtue of owning the distributed lock) got incremented when it went to sleep.

How it happened:
Another thread was able to takeover the lock because the lock ping was too old. If you examine the attached log, you will also notice that there is no activity between 04:54:26 and 04:54:32 (test machine hanged? or descheduled?), and this is way over the takeover timeout (which was set to 4 secs in the test).

sample log walkthrough:

// acquired lock
 m30004| 2013-11-30T04:54:26.842+0000 [thread-2] distributed lock 'TimeSkewFailNewTest_lock_8/ip-10-142-140-212:30004:1385787265:1723589040:thread-2:53688872' acquired, ts : 52996f82c675855fd6008963
...
 
 m30004| 2013-11-30T04:54:26.842+0000 [thread-2] **** Locked for thread 2 with ts ts: ObjectId('52996f82c675855fd6008963')
...
 
// thread0 overtook the lock thread2 had
 m30004| 2013-11-30T04:54:32.273+0000 [thread-0] lock update won, completing lock propagation for 'TimeSkewFailNewTest_lock_8/ip-10-142-140-212:30004:1385787265:1723589040:thread-0:990002550'
...
 
// thread2 woke and noticed that the value changed
 m30004| 2013-11-30T04:54:32.473+0000 [thread-2] **** !Bad increment while sleeping with lock for: 285ms

Comment by Greg Studer [ 02/Dec/13 ]

I'll go back over these, but it looks like hangs of 4s during the test caused forcing of locks in the most recent failure. Might need to re-tune the test settings for MCI.

m30004| 2013-11-29T23:17:39.862-0500 [LockPinger] cluster localhost:30000,localhost:30001,localhost:30002 pinged successfully at Thu Nov 28 00:34:25 2013 by distributed lock pinger 'localhost:30000,localhost:30001,localhost:30002/mci-osx108-7.build.10gen.cc:30004:1385785046:1602179737:thread-8:1821814659', sleeping for 133ms
m30004| 2013-11-29T23:17:43.638-0500 [thread-0] checkin

Comment by Adinoyi Omuya [ 02/Dec/13 ]

Another new failure on MCI (buildlogger here).

Comment by Greg Studer [ 18/Oct/13 ]

New issue with the test itself. The exception handling in TestDistLockForSkew needs to catch connection problems as well - I suspect this showed up now because connection problems in the tests were more likely.

Comment by Matt Kangas [ 18/Oct/13 ]

Greg, triage plz? Same issue as MCI-507?

Other recent sync6.js tickets: SERVER-10688, SERVER-10201

Generated at Thu Feb 08 03:25:20 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.