|
Looks like a race in the distLock:
Thread 4 is about to unlock.
m30004| 2013-09-05T14:29:42.117-0500 [thread-4] **** Unlocking for thread 4 with ts ts: ObjectId('5228dba4f601126a3fc57c43')
|
Thread 3 tries to get the lock.
m30004| 2013-09-05T14:29:42.167-0500 [thread-3] about to acquire distributed lock 'TimeSkewFailNewTest_lock_8/tp2.10gen.cc:30004:1378409369:154154505:thread-3:913546223:
|
m30004| { "state" : 1,
|
m30004| "who" : "tp2.10gen.cc:30004:1378409369:154154505:thread-3:913546223",
|
m30004| "process" : "tp2.10gen.cc:30004:1378409369:154154505:thread-3:913546223",
|
m30004| "when" : { "$date" : "Tue Sep 3 05:30:32 2013" },
|
m30004| "why" : "Testing distributed lock with skew.",
|
m30004| "ts" : { "$oid" : "5228dba6f601126a3fc57c49" } }
|
m30004| { "_id" : "TimeSkewFailNewTest_lock_8",
|
m30004| "state" : 0,
|
m30004| "ts" : { "$oid" : "5228dba4f601126a3fc57c43" } }
|
Thread 6 tries to get the lock as well
m30004| 2013-09-05T14:29:42.167-0500 [thread-6] about to acquire distributed lock 'TimeSkewFailNewTest_lock_8/tp2.10gen.cc:30004:1378409369:154154505:thread-6:637885744:
|
m30004| { "state" : 1,
|
m30004| "who" : "tp2.10gen.cc:30004:1378409369:154154505:thread-6:637885744",
|
m30004| "process" : "tp2.10gen.cc:30004:1378409369:154154505:thread-6:637885744",
|
m30004| "when" : { "$date" : "Fri Sep 6 13:07:36 2013" },
|
m30004| "why" : "Testing distributed lock with skew.",
|
m30004| "ts" : { "$oid" : "5228dba6f601126a3fc57c4a" } }
|
Thread 3 detects that its lock update did not propagate to all of the config servers. The per-config getLastError results show the conditional update matched on the first config (gle1: n: 1, so the lock's state was still 0 there) but not on the second (gle2: n: 0).
m30004| 2013-09-05T14:29:42.191-0500 [thread-3] warning: distributed lock 'TimeSkewFailNewTest_lock_8/tp2.10gen.cc:30004:1378409369:154154505:thread-3:913546223 did not propagate properly. :: caused by :: 8017 update not consistent ns: config.locks query: { _id: "TimeSkewFailNewTest_lock_8", state: 0, ts: ObjectId('5228dba4f601126a3fc57c43') } update: { $set: { state: 1, who: "tp2.10gen.cc:30004:1378409369:154154505:thread-3:913546223", process: "tp2.10gen.cc:30004:1378409369:154154505:thread-3:913546223", when: new Date(1378204232655), why: "Testing distributed lock with skew.", ts: ObjectId('5228dba6f601126a3fc57c49') } } gle1: { updatedExisting: true, n: 1, connectionId: 11, fsyncFiles: 4, err: null, ok: 1.0 } gle2: { updatedExisting: false, n: 0, connectionId: 11, fsyncFiles: 4, err: null, ok: 1.0 }
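|
For reference, the acquisition step shown in that warning is a compare-and-swap style conditional update run independently against each config server. A rough mongo-shell reconstruction of thread 3's attempt, taken from the query/update printed in the log line above (a sketch only, not the actual distlock.cpp code):

    // Run once per config server (localhost:30000/30001/30002 in this test).
    // Only matches while the lock is still free (state: 0) and ts still points
    // at thread 4's old lock document.
    db.getSiblingDB("config").locks.update(
        { _id: "TimeSkewFailNewTest_lock_8",
          state: 0,
          ts: ObjectId("5228dba4f601126a3fc57c43") },            // thread 4's ts
        { $set: { state: 1,
                  who: "tp2.10gen.cc:30004:1378409369:154154505:thread-3:913546223",
                  process: "tp2.10gen.cc:30004:1378409369:154154505:thread-3:913546223",
                  when: new Date(1378204232655),
                  why: "Testing distributed lock with skew.",
                  ts: ObjectId("5228dba6f601126a3fc57c49") } }   // thread 3's new ts
    )

gle1 (n: 1) versus gle2 (n: 0) means the swap won on the first config server but lost on the second, which is exactly the "update not consistent" condition reported above.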
|
Thread 6 likewise sees that its update did not propagate, but in the opposite pattern: the conditional update missed on the first config (gle1: n: 0, presumably because thread 3's update had already flipped state to 1 there) and matched on the second (gle2: n: 1).
m30004| 2013-09-05T14:29:42.197-0500 [thread-6] warning: distributed lock 'TimeSkewFailNewTest_lock_8/tp2.10gen.cc:30004:1378409369:154154505:thread-6:637885744 did not propagate properly. :: caused by :: 8017 update not consistent ns: config.locks query: { _id: "TimeSkewFailNewTest_lock_8", state: 0, ts: ObjectId('5228dba4f601126a3fc57c43') } update: { $set: { state: 1, who: "tp2.10gen.cc:30004:1378409369:154154505:thread-6:637885744", process: "tp2.10gen.cc:30004:1378409369:154154505:thread-6:637885744", when: new Date(1378490856134), why: "Testing distributed lock with skew.", ts: ObjectId('5228dba6f601126a3fc57c4a') } } gle1: { updatedExisting: false, n: 0, connectionId: 9, fsyncFiles: 4, err: null, ok: 1.0 } gle2: { updatedExisting: true, n: 1, connectionId: 9, fsyncFiles: 4, err: null, ok: 1.0 }
|
Another thread (thread 8) tries to acquire the lock but sees that thread 3 already holds it, and it cannot force a takeover because only 7 ms have elapsed against the 4000 ms takeover timeout.
m30004| 2013-09-05T14:29:42.220-0500 [thread-8] could not force lock 'TimeSkewFailNewTest_lock_8/tp2.10gen.cc:30004:1378409369:154154505:thread-3:913546223' because elapsed time 7 <= takeover time 4000
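|
The force/takeover decision in that message is just an elapsed-time comparison against the lock timeout (4000 ms in this test); roughly, as a sketch of the condition rather than the actual code (the variable names are illustrative):

    // Illustrative values from the log line above.
    var takeoverMs = 4000;   // lock timeout in this test
    var elapsedMs  = 7;      // ms since the lock was taken
    if (elapsedMs <= takeoverMs) {
        // cannot force the lock; the current holder is still considered live
        print("could not force lock, elapsed " + elapsedMs + " <= takeover " + takeoverMs);
    }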
|
Thread 3 concludes that it was not able to grab the lock?????????? It then queues a 'deferred' force-unlock entry for its ts that the LockPinger thread will pick up.
m30004| 2013-09-05T14:29:42.228-0500 [thread-3] lock update lost, lock 'TimeSkewFailNewTest_lock_8/tp2.10gen.cc:30004:1378409369:154154505:thread-3:913546223' not propagated.
|
m30004| 2013-09-05T14:29:42.228-0500 [thread-3] distributed lock 'TimeSkewFailNewTest_lock_8/tp2.10gen.cc:30004:1378409369:154154505:thread-3:913546223' was not acquired.
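|
The net effect of that deferred entry, once the LockPinger processes it, is roughly the following (a sketch of the behaviour described here, assuming the pinger resets the lock document by the queued ts; not the literal distlock.cpp code):

    // LockPinger, some time later: put any lock document still carrying
    // thread 3's ts back to "free".
    db.getSiblingDB("config").locks.update(
        { _id: "TimeSkewFailNewTest_lock_8",
          ts: ObjectId("5228dba6f601126a3fc57c49") },   // thread 3's queued ts
        { $set: { state: 0 } }
    )

If this runs between thread 6's partially successful update and thread 6's re-check of the lock state, thread 6 finds state 0 where it expects 1 - which is what the assertion further down shows.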
|
lockPinger sets the lock state to 0!!!!!!
m30004| 2013-09-05T14:29:42.228-0500 [LockPinger] trying to delete 1 old lock entries for process tp2.10gen.cc:30004:1378409369:154154505:thread-4:1056387920
|
Thread 4 was not able to set the state to 0 cleanly on all config servers, most likely because of the interleaving of updates from threads 3 and 6.
m30004| 2013-09-05T14:29:42.230-0500 [thread-4] distributed lock 'TimeSkewFailNewTest_lock_8/tp2.10gen.cc:30004:1378409369:154154505:thread-4:1056387920' unlocked (messily).
|
m30004| 2013-09-05T14:29:42.230-0500 [thread-4] warning: distributed lock 'TimeSkewFailNewTest_lock_8/tp2.10gen.cc:30004:1378409369:154154505:thread-4:1056387920' couldn't consummate unlock request. lock may be taken over after 0 minutes timeout.
|
Thread 6 then hits an assertion: it expected the lock's state to still be 1 (it had just set it on the second config), but the state had gone back to 0 again:
m30004| 2013-09-05T14:29:42.230-0500 [thread-6] Assertion failure indUpdate[LocksType::state()].numberInt() > 0 src/mongo/client/distlock.cpp 888
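|
The assertion appears to sit in the post-acquisition verification step: after an inconsistent update, the thread re-reads the lock from the config servers and expects it to still show state > 0 (held by somebody) while the conflict gets resolved. A sketch of the check that fires here, under that assumption (indUpdate named after the variable in the assertion message):

    // Re-read this config server's copy of the lock after the inconsistent update.
    var indUpdate = db.getSiblingDB("config").locks.findOne(
        { _id: "TimeSkewFailNewTest_lock_8" });
    // distlock.cpp:888 asserts this; for thread 6 it fails because the state
    // has already been flipped back to 0.
    assert(indUpdate.state > 0);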
|
We can see from the logs below that the "ts" of the lock entry the LockPinger set back to state 0 matches thread 3's "ts" (5228dba6f601126a3fc57c49).
m30004| 2013-09-05T14:29:42.238-0500 [thread-2] could not force lock 'TimeSkewFailNewTest_lock_8/tp2.10gen.cc:30004:1378409369:154154505:thread-3:913546223' because elapsed time 24 <= takeover time 4000
|
m30004| 2013-09-05T14:29:42.238-0500 [thread-2] created new distributed lock for TimeSkewFailNewTest_lock_8 on localhost:30000,localhost:30001,localhost:30002 ( lock timeout : 4000, ping interval : 133, process : 1 )
|
m30004| 2013-09-05T14:29:42.238-0500 [LockPinger] partially removed old distributed lock with ts 5228dba6f601126a3fc57c49
|
|