[SERVER-10688] sync6.js - verify failed in DistributedLock:lock_try on Linux 32-bit debug Created: 05/Sep/13  Updated: 11/Jul/16  Resolved: 12/Sep/13

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: 2.5.3

Type: Bug Priority: Major - P3
Reporter: Matt Kangas Assignee: Randolph Tan
Resolution: Done Votes: 0
Labels: buildbot
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Buildbot Linux 32-bit debug


Attachments: File sync6.log    
Issue Links:
Duplicate
is duplicated by SERVER-10572 V2.2 Linux 32-bit ABORT during shardi... Closed
Operating System: ALL
Participants:

 Description   

Builder Linux 32-bit debug Build #3260 Thu Sep 5 (rev 4b4fd431f5)

BUILD FAILED: failed mongosTest_smokeSharding

http://buildlogs.mongodb.org/Linux%2032-bit%20debug/builds/3260/test/sharding/sync6.js

 m30004| 2013-09-05T14:29:42.241-0500 [thread-2] trying to acquire new distributed lock for TimeSkewFailNewTest_lock_8 on localhost:30000,localhost:30001,localhost:30002 ( lock timeout : 4000, ping interval : 133, process : tp2.10gen.cc:30004:1378409369:154154505:thread-2:1323470036 )
 m30004| 2013-09-05T14:29:42.241-0500 [thread-2] checking last ping for lock 'TimeSkewFailNewTest_lock_8/tp2.10gen.cc:30004:1378409369:154154505:thread-3:913546223' against process tp2.10gen.cc:30004:1378409369:154154505:thread-3:913546223 and ping Tue Sep  3 05:30:32 2013
 m30004| 2013-09-05T14:29:42.241-0500 [thread-7] trying to acquire new distributed lock for TimeSkewFailNewTest_lock_8 on localhost:30000,localhost:30001,localhost:30002 ( lock timeout : 4000, ping interval : 133, process : tp2.10gen.cc:30004:1378409369:154154505:thread-7:581265785 )
 m30004| 2013-09-05T14:29:42.241-0500 [thread-8] trying to acquire new distributed lock for TimeSkewFailNewTest_lock_8 on localhost:30000,localhost:30001,localhost:30002 ( lock timeout : 4000, ping interval : 133, process : tp2.10gen.cc:30004:1378409369:154154505:thread-8:995717321 )
 m30004| 2013-09-05T14:29:42.241-0500 [thread-7] checking last ping for lock 'TimeSkewFailNewTest_lock_8/tp2.10gen.cc:30004:1378409369:154154505:thread-3:913546223' against process tp2.10gen.cc:30004:1378409369:154154505:thread-3:913546223 and ping Tue Sep  3 05:30:32 2013
 m30004| 2013-09-05T14:29:42.241-0500 [thread-6] 0x8e36091 0x8e15df4 0x8e03376 0x89063c4 0x89163e1 0x891ca30 0x891c92e 0x891c844 0x891c6f6 0x8e7335e 0xd5d919 0xca6d4e 
 m30004|  /home/yellow/buildslave/Linux_32bit_debug/mongo/mongod(_ZN5mongo15printStackTraceERSo+0x2d) [0x8e36091]
 m30004|  /home/yellow/buildslave/Linux_32bit_debug/mongo/mongod(_ZN5mongo10logContextEPKc+0x95) [0x8e15df4]
 m30004|  /home/yellow/buildslave/Linux_32bit_debug/mongo/mongod(_ZN5mongo12verifyFailedEPKcS1_j+0x10c) [0x8e03376]
 m30004|  /home/yellow/buildslave/Linux_32bit_debug/mongo/mongod(_ZN5mongo15DistributedLock8lock_tryERKSsbPNS_7BSONObjEd+0x718e) [0x89063c4]
 m30004|  /home/yellow/buildslave/Linux_32bit_debug/mongo/mongod(_ZN5mongo20TestDistLockWithSkew9runThreadERNS_16ConnectionStringEjjRNS_7BSONObjERNS_14BSONObjBuilderE+0x99f) [0x89163e1]
 m30004|  /home/yellow/buildslave/Linux_32bit_debug/mongo/mongod(_ZNK5boost4_mfi3mf5IvN5mongo20TestDistLockWithSkewERNS2_16ConnectionStringEjjRNS2_7BSONObjERNS2_14BSONObjBuilderEEclEPS3_S5_jjS7_S9_+0x62) [0x891ca30]
 m30004|  /home/yellow/buildslave/Linux_32bit_debug/mongo/mongod(_ZN5boost3_bi5list6INS0_5valueIPN5mongo20TestDistLockWithSkewEEENS2_INS3_16ConnectionStringEEENS2_IjEES9_NS_17reference_wrapperINS3_7BSONObjEEENSA_INS3_14BSONObjBuilderEEEEclINS_4_mfi3mf5IvS4_RS7_jjRSB_RSD_EENS0_5list0EEEvNS0_4typeIvEERT_RT0_i+0xe2) [0x891c92e]
 m30004|  /home/yellow/buildslave/Linux_32bit_debug/mongo/mongod(_ZN5boost3_bi6bind_tIvNS_4_mfi3mf5IvN5mongo20TestDistLockWithSkewERNS4_16ConnectionStringEjjRNS4_7BSONObjERNS4_14BSONObjBuilderEEENS0_5list6INS0_5valueIPS5_EENSE_IS6_EENSE_IjEESI_NS_17reference_wrapperIS8_EENSJ_ISA_EEEEEclEv+0x48) [0x891c844]
 m30004|  /home/yellow/buildslave/Linux_32bit_debug/mongo/mongod(_ZN5boost6detail11thread_dataINS_3_bi6bind_tIvNS_4_mfi3mf5IvN5mongo20TestDistLockWithSkewERNS6_16ConnectionStringEjjRNS6_7BSONObjERNS6_14BSONObjBuilderEEENS2_5list6INS2_5valueIPS7_EENSG_IS8_EENSG_IjEESK_NS_17reference_wrapperISA_EENSL_ISC_EEEEEEE3runEv+0x22) [0x891c6f6]
 m30004|  /home/yellow/buildslave/Linux_32bit_debug/mongo/mongod() [0x8e7335e]
 m30004|  /lib/libpthread.so.0() [0xd5d919]
 m30004|  /lib/libc.so.6(clone+0x5e) [0xca6d4e]
 m30004| 2013-09-05T14:29:42.241-0500 [thread-2] could not force lock 'TimeSkewFailNewTest_lock_8/tp2.10gen.cc:30004:1378409369:154154505:thread-3:913546223' because elapsed time 28 <= takeover time 4000
 m30004| 2013-09-05T14:29:42.241-0500 [thread-6] dev: lastError==0 won't report:indUpdate[LocksType::state()].numberInt() > 0
 m30004| 2013-09-05T14:29:42.241-0500 [thread-6] 
 m30004| 
 m30004| ***aborting after verify() failure as this is a debug/test build



 Comments   
Comment by auto [ 10/Sep/13 ]

Author:

Randolph Tan (renctan) <randolph@10gen.com>

Message: SERVER-10688 sync6.js - verify failed in DistributedLock:lock_try

Gracefully give up trying to acquire lock when detected unexpected lock state change.
Branch: master
https://github.com/mongodb/mongo/commit/2c1a91e5c0570d2a00cba8675b8b4b85afd4cd96
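
For readers following the analysis below, here is a minimal sketch of what the commit's "gracefully give up" amounts to, relative to the verify() at src/mongo/client/distlock.cpp:888 that fired in the log. It illustrates only the shape of the behavioral change, not the actual diff; the types and function names are stand-ins, not MongoDB internals.

    #include <cassert>

    struct LockDoc { int state; };   // stand-in for the re-read config.locks document

    // Old behavior: the debug build aborts when the re-read document unexpectedly shows
    // state 0 again (the verify(indUpdate[LocksType::state()].numberInt() > 0) in the log).
    bool lockTryOld(const LockDoc& reread) {
        assert(reread.state > 0);
        return true;
    }

    // Fixed behavior (sketch): treat the unexpected state change as a lost race and report
    // failure to the caller instead of asserting.
    bool lockTryNew(const LockDoc& reread) {
        if (reread.state <= 0)
            return false;            // lock was reset under us; give up gracefully
        return true;
    }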

Comment by Randolph Tan [ 09/Sep/13 ]

Description of race:

Two threads try to grab the lock:
Thread2.ts > Thread1.ts

State after both tried to grab the lock:
config 1: Thread1's update applied
config 2: Thread2's update applied
config 3: Thread3 still holds the lock here (it has already unlocked configs 1 & 2 and is about to unlock this one).

Sequence that will lead to the verify failure (a simplified sketch of the per-config-server acquire step follows this list):
1. Both Thread1 and Thread2 realize they were not able to update all of the config servers, so each tries to force the lock on them one by one.
2. Thread1 and Thread2 are both about to check config3; each reads Thread3.ts and tries to force it. Thread1 gets in first.
3. Thread3 now tries to unlock config3, but since Thread1 has already forced it, its update matches nothing (hence the "unlocked (messily)" message).
4. Thread1 finds out that Thread2 has a higher ts, so it declares that it lost the update. This also registers Thread1.ts in the list of ts values the LockPinger will unlock later on.
5. The LockPinger unlocks the lock entry with Thread1.ts, setting its state back to 0.
6. When Thread2 checks the lock state (note: the query does not contain ts), it is now 0, so the verify fails.
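
For context, the acquire step that makes steps 1-6 possible is effectively a per-config-server compare-and-swap on the config.locks document: the conditional update only applies where the document still has state 0 and the expected ts, which is why each thread can win on a different config server. A minimal, self-contained sketch of the interleaving (plain C++, illustrative names only, not the distlock.cpp internals):

    #include <cstdio>
    #include <string>

    struct LockDoc { int state; std::string ts; };

    // Conditional update: applies only while the document still matches what the caller saw.
    bool tryAcquire(LockDoc& doc, const std::string& expectedTs, const std::string& newTs) {
        if (doc.state != 0 || doc.ts != expectedTs)
            return false;          // reported as "n: 0" by getLastError
        doc.state = 1;
        doc.ts = newTs;
        return true;               // reported as "n: 1"
    }

    int main() {
        // Two of the three config servers, both just unlocked by the previous holder.
        LockDoc cfg1 = {0, "old-ts"};
        LockDoc cfg2 = {0, "old-ts"};

        // Thread1 and Thread2 race; each one's conditional update wins on a different server.
        tryAcquire(cfg1, "old-ts", "thread1-ts");   // applies
        tryAcquire(cfg1, "old-ts", "thread2-ts");   // fails, cfg1 already taken by Thread1
        tryAcquire(cfg2, "old-ts", "thread2-ts");   // applies
        tryAcquire(cfg2, "old-ts", "thread1-ts");   // fails, cfg2 already taken by Thread2

        // Steps 4-5 above: Thread1 declares its update lost, and the LockPinger later clears
        // the entry carrying Thread1.ts, putting config 1 back to state 0.
        cfg1.state = 0;
        cfg1.ts = "";

        // Step 6: Thread2 re-checks the lock state (the query carries no ts) and sees 0 --
        // exactly the condition the verify() in DistributedLock::lock_try treated as impossible.
        if (cfg1.state == 0)
            std::printf("re-read saw state 0: the debug build aborts here\n");
        return 0;
    }

The point is only that a conditional write plus an out-of-band reset by the LockPinger can legitimately leave the document at state 0 between a thread's write and its re-read; it is not a reproduction of the test.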

Comment by Randolph Tan [ 06/Sep/13 ]

Looks like a race in the distLock:

Thread 4 is about to unlock.

 m30004| 2013-09-05T14:29:42.117-0500 [thread-4] **** Unlocking for thread 4 with ts ts: ObjectId('5228dba4f601126a3fc57c43')

Thread 3 tries to get the lock.

 m30004| 2013-09-05T14:29:42.167-0500 [thread-3] about to acquire distributed lock 'TimeSkewFailNewTest_lock_8/tp2.10gen.cc:30004:1378409369:154154505:thread-3:913546223:
 m30004| { "state" : 1,
 m30004|   "who" : "tp2.10gen.cc:30004:1378409369:154154505:thread-3:913546223",
 m30004|   "process" : "tp2.10gen.cc:30004:1378409369:154154505:thread-3:913546223",
 m30004|   "when" : { "$date" : "Tue Sep  3 05:30:32 2013" },
 m30004|   "why" : "Testing distributed lock with skew.",
 m30004|   "ts" : { "$oid" : "5228dba6f601126a3fc57c49" } }
 m30004| { "_id" : "TimeSkewFailNewTest_lock_8",
 m30004|   "state" : 0,
 m30004|   "ts" : { "$oid" : "5228dba4f601126a3fc57c43" } }

Thread 6 tries to get the lock as well

 m30004| 2013-09-05T14:29:42.167-0500 [thread-6] about to acquire distributed lock 'TimeSkewFailNewTest_lock_8/tp2.10gen.cc:30004:1378409369:154154505:thread-6:637885744:
 m30004| { "state" : 1,
 m30004|   "who" : "tp2.10gen.cc:30004:1378409369:154154505:thread-6:637885744",
 m30004|   "process" : "tp2.10gen.cc:30004:1378409369:154154505:thread-6:637885744",
 m30004|   "when" : { "$date" : "Fri Sep  6 13:07:36 2013" },
 m30004|   "why" : "Testing distributed lock with skew.",
 m30004|   "ts" : { "$oid" : "5228dba6f601126a3fc57c4a" } }

Thread 3 detected that it was not successful in getting the lock on all config servers: its conditional update applied on the first config (gle1: n: 1) but not on the second (gle2: n: 0). The fact that it applied at all implies the lock's state was already back to 0, at least on the first config.

 m30004| 2013-09-05T14:29:42.191-0500 [thread-3] warning: distributed lock 'TimeSkewFailNewTest_lock_8/tp2.10gen.cc:30004:1378409369:154154505:thread-3:913546223 did not propagate properly. :: caused by :: 8017 update not consistent  ns: config.locks query: { _id: "TimeSkewFailNewTest_lock_8", state: 0, ts: ObjectId('5228dba4f601126a3fc57c43') } update: { $set: { state: 1, who: "tp2.10gen.cc:30004:1378409369:154154505:thread-3:913546223", process: "tp2.10gen.cc:30004:1378409369:154154505:thread-3:913546223", when: new Date(1378204232655), why: "Testing distributed lock with skew.", ts: ObjectId('5228dba6f601126a3fc57c49') } } gle1: { updatedExisting: true, n: 1, connectionId: 11, fsyncFiles: 4, err: null, ok: 1.0 } gle2: { updatedExisting: false, n: 0, connectionId: 11, fsyncFiles: 4, err: null, ok: 1.0 }

Thread 6 likewise detected that it was not successful in getting the lock on all config servers; its update applied only on the second config (gle1: n: 0, gle2: n: 1), so the state there was already 0 as well.

 m30004| 2013-09-05T14:29:42.197-0500 [thread-6] warning: distributed lock 'TimeSkewFailNewTest_lock_8/tp2.10gen.cc:30004:1378409369:154154505:thread-6:637885744 did not propagate properly. :: caused by :: 8017 update not consistent  ns: config.locks query: { _id: "TimeSkewFailNewTest_lock_8", state: 0, ts: ObjectId('5228dba4f601126a3fc57c43') } update: { $set: { state: 1, who: "tp2.10gen.cc:30004:1378409369:154154505:thread-6:637885744", process: "tp2.10gen.cc:30004:1378409369:154154505:thread-6:637885744", when: new Date(1378490856134), why: "Testing distributed lock with skew.", ts: ObjectId('5228dba6f601126a3fc57c4a') } } gle1: { updatedExisting: false, n: 0, connectionId: 9, fsyncFiles: 4, err: null, ok: 1.0 } gle2: { updatedExisting: true, n: 1, connectionId: 9, fsyncFiles: 4, err: null, ok: 1.0 }
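
Reading the two warnings together (assuming gle1 and gle2 correspond to the first and second config server in the order they were contacted), the n values show the single logical lock split across the config servers, which is exactly the starting state in the race description in the 09/Sep comment above. A trivial illustration of that reading:

    #include <cstdio>

    int main() {
        // n values from { gle1, gle2 } in the two warnings above.
        const int thread3[2] = {1, 0};   // thread-3's update applied on config 1 only
        const int thread6[2] = {0, 1};   // thread-6's update applied on config 2 only
        for (int cfg = 0; cfg < 2; ++cfg) {
            const char* holder = thread3[cfg] ? "thread-3"
                               : thread6[cfg] ? "thread-6" : "nobody";
            std::printf("config %d: lock document now claims %s\n", cfg + 1, holder);
        }
        return 0;
    }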

Some other thread tries to acquire the lock but sees that thread 3 already has it.

 m30004| 2013-09-05T14:29:42.220-0500 [thread-8] could not force lock 'TimeSkewFailNewTest_lock_8/tp2.10gen.cc:30004:1378409369:154154505:thread-3:913546223' because elapsed time 7 <= takeover time 4000

Thread 3 concludes that it was not able to grab the lock (despite its update having partially applied). This inserts a new entry for a deferred force-unlock that the LockPinger thread will pick up.

 m30004| 2013-09-05T14:29:42.228-0500 [thread-3] lock update lost, lock 'TimeSkewFailNewTest_lock_8/tp2.10gen.cc:30004:1378409369:154154505:thread-3:913546223' not propagated.
 m30004| 2013-09-05T14:29:42.228-0500 [thread-3] distributed lock 'TimeSkewFailNewTest_lock_8/tp2.10gen.cc:30004:1378409369:154154505:thread-3:913546223' was not acquired.

The LockPinger then sets the lock state back to 0:

 m30004| 2013-09-05T14:29:42.228-0500 [LockPinger] trying to delete 1 old lock entries for process tp2.10gen.cc:30004:1378409369:154154505:thread-4:1056387920

Thread 4 was not able to set the state to 0 cleanly on all config servers, most likely because of the interleaving with the updates from threads 3 and 6.

 m30004| 2013-09-05T14:29:42.230-0500 [thread-4] distributed lock 'TimeSkewFailNewTest_lock_8/tp2.10gen.cc:30004:1378409369:154154505:thread-4:1056387920' unlocked (messily). 
 m30004| 2013-09-05T14:29:42.230-0500 [thread-4] warning: distributed lock 'TimeSkewFailNewTest_lock_8/tp2.10gen.cc:30004:1378409369:154154505:thread-4:1056387920' couldn't consummate unlock request. lock may be taken over after 0 minutes timeout.

Thread 6 got confused because it had already seen the state go to 1, and now it is 0 again:

 m30004| 2013-09-05T14:29:42.230-0500 [thread-6] Assertion failure indUpdate[LocksType::state()].numberInt() > 0 src/mongo/client/distlock.cpp 888

We can see from the logs below that the "ts" the LockPinger used when setting the lock back to state 0 matches thread-3's "ts".

 m30004| 2013-09-05T14:29:42.238-0500 [thread-2] could not force lock 'TimeSkewFailNewTest_lock_8/tp2.10gen.cc:30004:1378409369:154154505:thread-3:913546223' because elapsed time 24 <= takeover time 4000
 m30004| 2013-09-05T14:29:42.238-0500 [thread-2] created new distributed lock for TimeSkewFailNewTest_lock_8 on localhost:30000,localhost:30001,localhost:30002 ( lock timeout : 4000, ping interval : 133, process : 1 )
 m30004| 2013-09-05T14:29:42.238-0500 [LockPinger] partially removed old distributed lock with ts 5228dba6f601126a3fc57c49
