[SERVER-11119] mongod connections hang because of assertion failure Created: 10/Oct/13  Updated: 10/Jun/14  Resolved: 10/Jun/14

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.4.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Andreas Heck Assignee: Victor Hooi
Resolution: Duplicate Votes: 0
Labels: heartbeat
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 12.04


Attachments: File mongo-logs.tar.gz    
Issue Links:
Duplicate
duplicates SERVER-6302 Race condition when multiple fsyncLoc... Closed
duplicates SERVER-1423 reads often aren't possible while in ... Closed
Operating System: ALL
Steps To Reproduce:

Happens randomly but seems to be more likely when you do repairDatabase or fsync lock in one of the OTHER nodes of a replica set.

Participants:

 Description   

I got into a situation where all new connections with the mongo shell hang and never reach the prompt. The log says that an assertion failed but the mongod process does not terminate so that this problem prevails until I manually restart mongod.

Log messages:

Thu Oct 10 08:34:10.016 [conn44524] end connection 10.240.29.54:56207 (100 connections now open)
Thu Oct 10 08:34:10.017 [initandlisten] connection accepted from 10.240.29.54:56211 #44528 (101 connections now open)
Thu Oct 10 08:34:10.065 [conn44528]  authenticate db: local { authenticate: 1, nonce: "c055bc811fe10892", user: "__system", key: "deefa05ea4f51cd6cfdf43219f339c3b" }
Thu Oct 10 08:34:14.059 [rsHealthPoll]   warning assertion failure d.size() < 1024 src/mongo/util/concurrency/task.cpp 121
0xb942f3 0xb57191 0xb5f652 0xa355ce 0xa3676c 0xb5f29c 0xb5af25 0xb5b89b 0xbddc5b 0x7fabf8fa5e9a 0x7fabf7e87ccd
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x23) [0xb942f3]
 /usr/bin/mongod(_ZN5mongo9wassertedEPKcS1_j+0x121) [0xb57191]
 /usr/bin/mongod(_ZN5mongo4task6Server4sendEN5boost8functionIFvvEEE+0x182) [0xb5f652]
 /usr/bin/mongod(_ZN5mongo21ReplSetHealthPollTask2upERKNS_7BSONObjERNS_13HeartbeatInfoE+0x88e) [0xa355ce]
 /usr/bin/mongod(_ZN5mongo21ReplSetHealthPollTask6doWorkEv+0xec) [0xa3676c]
 /usr/bin/mongod(_ZN5mongo4task4Task3runEv+0x2c) [0xb5f29c]
 /usr/bin/mongod(_ZN5mongo13BackgroundJob7jobBodyEN5boost10shared_ptrINS0_9JobStatusEEE+0xb5) [0xb5af25]
 /usr/bin/mongod(_ZN5boost6detail11thread_dataINS_3_bi6bind_tIvNS_4_mfi3mf1IvN5mongo13BackgroundJobENS_10shared_ptrINS7_9JobStatusEEEEENS2_5list2INS2_5valueIPS7_EENSD_ISA_EEEEEEE3runEv+0x7b) [0xb5b89b]
 /usr/bin/mongod() [0xbddc5b]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7fabf8fa5e9a]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fabf7e87ccd]
Thu Oct 10 08:34:15.592 [initandlisten] connection accepted from 127.0.0.1:43664 #44529 (102 connections now open)



 Comments   
Comment by Thomas Rueckstiess [ 10/Jun/14 ]

Hi Andreas,

Apologies that it has taken us so long to get back on this ticket. After further inspection of the log files, I believe the issue you're seeing is a combination of two known issues: SERVER-6302 and SERVER-1423 (and duplicate ticket SERVER-5533, which more accurately describes your situation).

I found that the server at 10.36.* was fsyncLocked twice in a row without unlock in between, which can exhibit unexpected behavior and lead to a node that remains locked (see SERVER-6302).

$ grep -e "CMD fsync:" -e "command: unlock requested" mongodb-10.36.169.114.log
...
Wed Oct  9 06:46:29.031 [conn27615] CMD fsync: sync:1 lock:1
Wed Oct  9 06:49:45.683 [conn27651] command: unlock requested
Wed Oct  9 07:50:30.603 [conn28326] CMD fsync: sync:1 lock:1
Wed Oct  9 07:53:45.315 [conn28362] command: unlock requested
Wed Oct  9 09:21:32.443 [conn29252] CMD fsync: sync:1 lock:1    <---
Wed Oct  9 09:27:57.022 [conn29334] CMD fsync: sync:1 lock:1    <---
Wed Oct  9 09:31:10.904 [conn29369] command: unlock requested
Wed Oct  9 10:52:03.914 [conn29252] command: unlock requested

The following replica set reconfiguration would have blocked on the thread holding the fsync lock.

Thu Oct 10 08:15:05.830 [rsMgr] replSet info saving a newer config version to local.system.replset

We can also see the stack traces and warning shortly after. You were then unable to connect to the node, because the authentication requests were also blocked behind the write (SERVER-1423, SERVER-5533) until after a restart of that node.

The repair on 10.210.* may have been unrelated.

As we are already tracking both of these issues, I'm going to close this one as duplicate.

Regards,
Thomas

Comment by Andreas Heck [ 10/Oct/13 ]

Logs from all three MongoDB nodes

Yes I removed and readded the other secondary a few minutes before the problem occured

Comment by Scott Hernandez (Inactive) [ 10/Oct/13 ]

Can you please attach the full logs from all members at this time? The snippet you have included is just a warning not an error; it pertains to internal replication heartbeats and by itself doesn't show any problem.

Were you doing a replication reconfigure during this time as well?

Generated at Thu Feb 08 03:24:56 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.