[SERVER-6303] shouldRetry needs a write lock Created: 05/Jul/12  Updated: 11/Jul/16  Resolved: 12/Jul/12

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 2.2.0-rc0

Type: Bug Priority: Critical - P2
Reporter: Azat Khuzhin Assignee: Eric Milkie
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

git version: 873e06d16edfe8f6cdc0030e924e3b55087afd6d


Issue Links:
Related
is related to SERVER-6446 replication fails with empty host Closed
Operating System: ALL
Participants:

 Description   

Replication fails with:

Wed Jul  4 20:10:48 [conn550] SocketException handling request, closing client connection: 9001 socket exception [2] server [192.168.1.21:34033] 
Wed Jul  4 20:10:48 [conn646] command config.$cmd command: { serverStatus: 1 } ntoreturn:1 keyUpdates:0 locks(micros) r:1293 reslen:2549 82344ms
Wed Jul  4 20:10:48 [conn558] SocketException handling request, closing client connection: 9001 socket exception [2] server [192.168.1.21:44962] 
Wed Jul  4 20:10:48 [conn614] command config.$cmd command: { serverStatus: 1 } ntoreturn:1 keyUpdates:0 locks(micros) r:1691 reslen:2549 562401ms
Wed Jul  4 20:10:48 [conn646] end connection 192.168.1.21:54959 (10 connections now open)
Wed Jul  4 20:10:48 [conn514] SocketException handling request, closing client connection: 9001 socket exception [2] server [192.168.1.21:39918] 
Wed Jul  4 20:10:48 [conn614] SocketException handling request, closing client connection: 9001 socket exception [2] server [192.168.1.21:38702] 
Wed Jul  4 20:10:48 [conn602] serverStatus was very slow: { after basic: 0, middle of mem: 0, after mem: 0, after connections: 0, after extra info: 0, after counters: 0, after repl: 0, after asserts: 0, after dur: 734417, at end: 734417 }
Wed Jul  4 20:10:48 [conn602] command config.$cmd command: { serverStatus: 1 } ntoreturn:1 keyUpdates:0 locks(micros) r:52 reslen:2549 742522ms
Wed Jul  4 20:10:48 [conn622] serverStatus was very slow: { after basic: 0, middle of mem: 0, after mem: 0, after connections: 0, after extra info: 0, after counters: 0, after repl: 0, after asserts: 0, after dur: 437590, at end: 437590 }
Wed Jul  4 20:10:48 [conn610] serverStatus was very slow: { after basic: 0, middle of mem: 0, after mem: 0, after connections: 0, after extra info: 0, after counters: 0, after repl: 0, after asserts: 0, after dur: 615416, at end: 615416 }
Wed Jul  4 20:10:48 [conn602] SocketException handling request, closing client connection: 9001 socket exception [2] server [192.168.1.21:49500] 
Wed Jul  4 20:10:48 [conn622] command config.$cmd command: { serverStatus: 1 } ntoreturn:1 keyUpdates:0 locks(micros) r:26183 reslen:2549 442455ms
Wed Jul  4 20:10:48 [conn610] command config.$cmd command: { serverStatus: 1 } ntoreturn:1 keyUpdates:0 locks(micros) r:26216 reslen:2549 622220ms
Wed Jul  4 20:10:48 [conn622] SocketException handling request, closing client connection: 9001 socket exception [2] server [192.168.1.21:49885] 
Wed Jul  4 20:10:48 [conn610] SocketException handling request, closing client connection: 9001 socket exception [2] server [192.168.1.21:33146] 
Wed Jul  4 20:10:49 [rsSync] replSet initialSyncOplogApplication applied 1124 operations, synced to Jul  4 17:25:28:32
Wed Jul  4 20:10:49 [conn650] end connection 192.168.1.21:33004 (3 connections now open)
Wed Jul  4 20:10:49 [repl writer worker] error expected my_db.ad to be locked 
Wed Jul  4 20:10:49 [repl writer worker] lock status: unlocked
Wed Jul  4 20:10:49 [repl writer worker] Assertion: 16104:expected to be read locked for my_db.ad
0x72b5f3 0x8ac744 0x8ac82c 0xb0af56 0xa88c7b 0x99d669 0x99d872 0x734f64 0x5e8e1f 0x95428a 0x7fe6be 0x7f38dcb40b50 0x7f38dbee46dd 
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x23) [0x72b5f3]
 /usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x94) [0x8ac744]
 /usr/bin/mongod() [0x8ac82c]
 /usr/bin/mongod(_ZN5mongo4Lock23assertAtLeastReadLockedERKNS_10StringDataE+0x146) [0xb0af56]
 /usr/bin/mongod(_ZN5mongo14DatabaseHolder11getOrCreateERKSsS2_Rb+0x10b) [0xa88c7b]
 /usr/bin/mongod(_ZN5mongo6Client7Context11_finishInitEb+0x49) [0x99d669]
 /usr/bin/mongod(_ZN5mongo6Client7ContextC2ERKSsSsbb+0x92) [0x99d872]
 /usr/bin/mongod(_ZN5mongo4Sync11shouldRetryERKNS_7BSONObjE+0xb4) [0x734f64]
 /usr/bin/mongod(_ZN5mongo7replset21multiInitialSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x4f) [0x5e8e1f]
 /usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x2a) [0x95428a]
 /usr/bin/mongod() [0x7fe6be]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x6b50) [0x7f38dcb40b50]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f38dbee46dd]
Wed Jul  4 20:10:49 [repl writer worker] ERROR: exception: expected to be read locked for my_db.ad on: { ts: Timestamp 1341437174000|22, h: -8912815504339682628, op: "u", ns: "my_db.ad", o2: { _id: ObjectId('4ff216da347caf99133789bb') }, o: { $set: { count: 29 }, $set: { ts_end: new Date(1341437174303) } } }
Wed Jul  4 20:10:49 [repl writer worker]   Fatal Assertion 16361
0x72b5f3 0x8ac493 0x5e8f1f 0x95428a 0x7fe6be 0x7f38dcb40b50 0x7f38dbee46dd 
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x23) [0x72b5f3]
 /usr/bin/mongod(_ZN5mongo13fassertFailedEi+0x63) [0x8ac493]
 /usr/bin/mongod(_ZN5mongo7replset21multiInitialSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x14f) [0x5e8f1f]
 /usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x2a) [0x95428a]
 /usr/bin/mongod() [0x7fe6be]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x6b50) [0x7f38dcb40b50]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f38dbee46dd]
Wed Jul  4 20:10:49 [repl writer worker] 
 
***aborting after fassert() failure
 
 
Wed Jul  4 20:10:49 Got signal: 6 (Aborted).
 
Wed Jul  4 20:10:49 Backtrace:
0x72b5f3 0x574c74 0x7f38dbe3e4f0 0x7f38dbe3e475 0x7f38dbe416f0 0x8ac4ce 0x5e8f1f 0x95428a 0x7fe6be 0x7f38dcb40b50 0x7f38dbee46dd 
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x23) [0x72b5f3]
 /usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x3e4) [0x574c74]
 /lib/x86_64-linux-gnu/libc.so.6(+0x324f0) [0x7f38dbe3e4f0]
 /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35) [0x7f38dbe3e475]
 /lib/x86_64-linux-gnu/libc.so.6(abort+0x180) [0x7f38dbe416f0]
 /usr/bin/mongod(_ZN5mongo13fassertFailedEi+0x9e) [0x8ac4ce]
 /usr/bin/mongod(_ZN5mongo7replset21multiInitialSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x14f) [0x5e8f1f]
 /usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x2a) [0x95428a]
 /usr/bin/mongod() [0x7fe6be]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x6b50) [0x7f38dcb40b50]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f38dbee46dd]

When starting mongod one more time:

Thu Jul  5 01:18:31 [repl writer worker]   Fatal Assertion 16359
0x72b5f3 0x8ac493 0x5e8d11 0x95428a 0x7fe6be 0x7f07e55aab50 0x7f07e494e6dd 
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x23) [0x72b5f3]
 /usr/bin/mongod(_ZN5mongo13fassertFailedEi+0x63) [0x8ac493]
 /usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x51) [0x5e8d11]
 /usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x2a) [0x95428a]
 /usr/bin/mongod() [0x7fe6be]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x6b50) [0x7f07e55aab50]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f07e494e6dd]
Thu Jul  5 01:18:31 [repl writer worker] 
 
***aborting after fassert() failure
 
 
Thu Jul  5 01:18:31 Got signal: 6 (Aborted).
 
Thu Jul  5 01:18:31 Backtrace:
0x72b5f3 0x574c74 0x7f07e48a84f0 0x7f07e48a8475 0x7f07e48ab6f0 0x8ac4ce 0x5e8d11 0x95428a 0x7fe6be 0x7f07e55aab50 0x7f07e494e6dd 
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x23) [0x72b5f3]
 /usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x3e4) [0x574c74]
 /lib/x86_64-linux-gnu/libc.so.6(+0x324f0) [0x7f07e48a84f0]
 /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35) [0x7f07e48a8475]
 /lib/x86_64-linux-gnu/libc.so.6(abort+0x180) [0x7f07e48ab6f0]
 /usr/bin/mongod(_ZN5mongo13fassertFailedEi+0x9e) [0x8ac4ce]
 /usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x51) [0x5e8d11]
 /usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x2a) [0x95428a]
 /usr/bin/mongod() [0x7fe6be]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x6b50) [0x7f07e55aab50]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f07e494e6dd]



 Comments   
Comment by Eric Milkie [ 12/Jul/12 ]

Excellent. Thanks for testing this out.

Comment by Azat Khuzhin [ 12/Jul/12 ]

It seems this commit fixes the issue.

I tried configuring the replSet, and the initial sync completed without errors.
The indexes were also created.

Thanks.

Comment by Eric Milkie [ 11/Jul/12 ]

The commit will fix the first fassert failure. It will give us more debugging information for the second one, if it should happen again.

Comment by Azat Khuzhin [ 11/Jul/12 ]

Also, see my comment on the commit.

Comment by Azat Khuzhin [ 11/Jul/12 ]

So your last commit should fix both fassert failures?

Comment by auto [ 10/Jul/12 ]

Author: Eric Milkie <milkie@10gen.com>
Date: 2012-07-10T10:51:19-07:00

Message: SERVER-6303 correct locking for initial sync retrying

With multithreaded replication, the call site for retrying if Cloner misses documents
was changed. We now must grab a write lock before calling shouldRetry().
I also changed the logging to always log when a retry situation is present.
These situations can result if Cloner misses a document because it was updated, grew,
and moved backward on disk.
Branch: master
https://github.com/mongodb/mongo/commit/9284a9a3db7ed4e94827ad1ba323a08ed559d5f4
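
For context, the commit message above describes the shape of the fix: the retry check has to run while holding the database write lock, instead of with no lock held (which is what triggered assertion 16104 in the trace). The snippet below is a minimal, self-contained C++ sketch of that pattern, using a std::shared_mutex as a stand-in for the server's per-database lock; the names (dbLock, shouldRetry, applyOp, retryUnderWriteLock) are illustrative placeholders, not the actual server code.

#include <iostream>
#include <mutex>
#include <shared_mutex>
#include <string>

// Stand-in for the per-database lock; the real server uses its own Lock classes.
std::shared_mutex dbLock;

// Hypothetical stand-in for Sync::shouldRetry(): re-fetch a document the
// cloner missed and decide whether the oplog entry should be reapplied.
// Opening a database context (Client::Context in the trace) asserts that the
// caller already holds at least a read lock, hence the failure when it was
// reached with "lock status: unlocked".
bool shouldRetry(const std::string& op) {
    std::cout << "re-fetching missing document for op: " << op << "\n";
    return true;  // pretend the document was fetched and the op should be retried
}

// Hypothetical stand-in for reapplying the oplog operation.
void applyOp(const std::string& op) {
    std::cout << "applying op: " << op << "\n";
}

// The pattern the commit describes: acquire the write (exclusive) lock
// *before* calling shouldRetry(), and keep holding it while reapplying.
void retryUnderWriteLock(const std::string& op) {
    std::unique_lock<std::shared_mutex> writeLock(dbLock);
    if (shouldRetry(op)) {
        applyOp(op);
    }
}

int main() {
    retryUnderWriteLock("{ op: \"u\", ns: \"my_db.ad\", ... }");
    return 0;
}

Presumably a write lock (rather than a read lock) is needed because the retry path may write the re-fetched document locally before the operation is reapplied.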

Comment by Azat Khuzhin [ 07/Jul/12 ]

Thanks.

Comment by Eric Milkie [ 06/Jul/12 ]

Thanks for reporting this problem. I have found an issue with initial sync that will need to be fixed up and tested.

Comment by Azat Khuzhin [ 05/Jul/12 ]

Sorry, but I have already reconfigured this replica set.

Before that:
The secondary was in state SECONDARY.
If I restarted the secondary, it worked for 5-10 seconds and then failed with those messages.

The indexes may not have been created because replication failed, since I simply turned replication off afterwards.
Then I did the following: I reverted mongod on the secondary to version "d15b244ccabe1d8d0d1e93f2cb8de03d6b1d09d8" and continued replication. It replicated fine, but without the indexes.

Replication works fine with the older version

        "version" : "2.1.2-pre-",
        "gitVersion" : "d15b244ccabe1d8d0d1e93f2cb8de03d6b1d09d8",

since I have now deleted the old node and created a new one (with mongod compiled from source at git hash "d15b244ccabe1d8d0d1e93f2cb8de03d6b1d09d8"), and replication works fine and all indexes are created.

Could you describe step by step what replication does and how? (It would be great.)

Comment by Eric Milkie [ 05/Jul/12 ]

Can you post the entire log from the primary and the secondary?
Also, can you show the output of rs.conf() from the primary?
What activity did the primary have when you experienced the failure; any interestingly large or high volume writes?

Comment by Azat Khuzhin [ 05/Jul/12 ]

Also, when I configure the replica-set nodes, only the "_id" index is created on the SECONDARY, even though there are other indexes as well.

Comment by Azat Khuzhin [ 05/Jul/12 ]

The latest
f662d00f1331c85aba533b58b3714611beb9f415

Comment by Eliot Horowitz (Inactive) [ 05/Jul/12 ]

What version? I can't find anything with that git hash.

Comment by Azat Khuzhin [ 05/Jul/12 ]

I think this is because of SERVER-5685.
