[SERVER-9868] heartbeats not responded to during mmap flushing on Windows Created: 06/Jun/13  Updated: 16/Nov/21  Resolved: 08/Jul/14

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.4.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: David Verdejo Assignee: Mark Benvenuto
Resolution: Duplicate Votes: 2
Labels: DBClientCursor, rsHealthPoll
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Thu Jun 06 13:21:10.788 [initandlisten] MongoDB starting : pid=7356 port=27017 dbpath=e:\mongodb_rssc01\data\db 64-bit host=LOG-MNGSC21
Thu Jun 06 13:21:10.788 [initandlisten] db version v2.4.3
Thu Jun 06 13:21:10.788 [initandlisten] git version: fe1743177a5ea03e91e0052fb5e2cb2945f6d95f
Thu Jun 06 13:21:10.788 [initandlisten] build info: windows sys.getwindowsversion(major=6, minor=1, build=7601, platform=2, service_pack='Service Pack 1') BOOST_LIB_VERSION=1_49


Attachments: Zip Archive mongo_logmngsc11.zip     Zip Archive mongo_logmngsc21.zip    
Issue Links:
Depends
depends on SERVER-13681 MongoDB stalls during background flus... Closed
depends on SERVER-13725 Reads & Writes are blocked during Mem... Closed
depends on SERVER-13729 Reads & Writes are blocked during dat... Closed
Related
is related to SERVER-12880 Server pauses on requests every 60 se... Closed
is related to SERVER-13444 Long locked flush without inserts and... Closed
Operating System: Windows
Participants:

 Description   

We have an environment with 2 nodes and 1 arbiter, with the following configuration:

{
    "_id" : "rssc01",
    "version" : 3,
    "members" : [
        { "_id" : 0, "host" : "LOG-MNGSC11:27017" },
        { "_id" : 1, "host" : "log-mngsc21:27017" },
        { "_id" : 2, "host" : "log-mngsc22:27018", "arbiterOnly" : true }
    ]
}

LOG-MNGSC11 is the primary and LOG-MNGSC21 is the secondary.
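
For reference, a minimal mongo shell sketch (illustrative only, not the reporter's actual procedure) of how a set with this topology is initiated and how per-member heartbeat state can be checked; the hostnames are taken from the configuration above:

// Illustrative only: bring up a set matching the configuration above.
// Run once against the node intended to become primary.
rs.initiate({
    _id: "rssc01",
    members: [
        { _id: 0, host: "LOG-MNGSC11:27017" },
        { _id: 1, host: "log-mngsc21:27017" },
        { _id: 2, host: "log-mngsc22:27018", arbiterOnly: true }
    ]
});

// Heartbeat view of each member as seen from the node the shell is connected to.
rs.status().members.forEach(function (m) {
    print(m.name + " " + m.stateStr + " lastHeartbeat=" + m.lastHeartbeat);
});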

Suddenly, the replication fails with the following message on secondary:

Thu Jun 06 11:56:50.656 [rsHealthPoll] replset info LOG-MNGSC11:27017 thinks that we are down
Thu Jun 06 11:56:52.310 [rsHealthPoll] replset info log-mngsc22:27018 thinks that we are down
Thu Jun 06 11:56:52.481 [conn12160] command admin.$cmd command: { writebacklisten: ObjectId('51afe4406127103ae52a9fc0') } ntoreturn:1 keyUpdates:0 reslen:44 300005ms
Thu Jun 06 11:56:52.668 [rsHealthPoll] replset info LOG-MNGSC11:27017 thinks that we are down
Thu Jun 06 11:56:52.793 [rsBackgroundSync] Socket recv() timeout 172.29.106.92:27017
Thu Jun 06 11:56:52.793 [rsBackgroundSync] SocketException: remote: 172.29.106.92:27017 error: 9001 socket exception [3] server [172.29.106.92:27017]
Thu Jun 06 11:56:52.793 [rsBackgroundSync] DBClientCursor::init call() failed

At primary, I see the following messages:

Thu Jun 06 11:56:52.524 [initandlisten] connection accepted from 172.29.106.95:56714 #16239 (66 connections now open)
Thu Jun 06 11:56:55.847 [rsHealthPoll] DBClientCursor::init call() failed
Thu Jun 06 11:56:57.329 [conn16236] query local.oplog.rs query: { ts: { $gte: Timestamp 1370511563000|895 } } cursorid:479480611067557781 ntoreturn:0 ntoskip:0 nscanned:102 keyUpdates:0 numYields: 2264 locks(micros) r:727945 nreturned:101 reslen:12039 35319ms
Thu Jun 06 11:56:57.329 [conn16236] end connection 172.29.106.95:56704 (65 connections now open)



 Comments   
Comment by Thomas Rueckstiess [ 04/Apr/14 ]

I believe the original poster's replica set failover was due to very long flushes

Fri Jun  6 11:48:34.274 [DataFileSync] flushing mmaps took 439956ms for 17 files

during which heartbeats and other requests were not processed (Windows platform). The flushes were caused by a multi remove on the primary (log entry below). The timing of the "thinks that we are down" messages correlates directly with the period of the removes.

Fri Jun  6 13:21:19.677 [conn16132] remove cachehoteles.Cache_AGHSmart_XMLHotel ndeleted:3876379 keyUpdates:0 numYields: 17683 locks(micros) w:5953725444 6558335ms

This is similar to what we see on SERVER-12880 and SERVER-13444.
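
To make the trigger concrete, here is a hypothetical mongo shell reconstruction of the kind of operation behind the remove log line above; the filter is invented for illustration (the ticket does not show the original predicate), and only the namespace and the ~3.9M deleted documents come from the log:

// Hypothetical reconstruction of the multi-document remove quoted above.
// The query predicate here is invented; only the namespace comes from the log.
var cache = db.getSiblingDB("cachehoteles");
cache.Cache_AGHSmart_XMLHotel.remove({ expiresAt: { $lt: new Date() } });
// On 2.4 / mmapv1, deleting ~3.9 million documents in one pass dirties a large
// number of memory-mapped pages, and the subsequent background flush can stall
// other work on Windows, including the rsHealthPoll heartbeats.

Breaking such a remove into smaller batches (or running it off-peak) is one way to keep each flush cycle short, though the ticket itself does not settle on a workaround.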

Comment by sam flint [ 23/Oct/13 ]

I am experiencing this issue in production. I have 6TB of data on this system, so re-syncing the data isn't as easy as just doing it. I am running 2.4.4 and the replication lag is now about 57 hours. Please advise ASAP.
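
For anyone measuring lag like this, a sketch using standard 2.4-era shell helpers (connected to any data-bearing member; assumes a primary is currently up):

// Per-secondary lag relative to the primary's oplog.
rs.printSlaveReplicationInfo();

// Or compute it from rs.status() directly.
var s = rs.status();
var primary = s.members.filter(function (m) { return m.stateStr === "PRIMARY"; })[0];
s.members.forEach(function (m) {
    if (m.stateStr === "SECONDARY") {
        print(m.name + " lags the primary by " + (primary.optimeDate - m.optimeDate) / 1000 + "s");
    }
});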

Comment by Siddhartha Goyal [ 09/Oct/13 ]

Hi, yes we are actually running that version. The problem seems quite random, and in the current setup I have all reads going to the primary, so it's not like there's some long-running query that might cause this. Is there anything I can do to figure out what causes the issue? When the issue does happen I always see this sequence of messages:

Wed Oct 9 20:16:58.563 [rsBackgroundSync] Socket recv() timeout 172.17.0.85:27017
Wed Oct 9 20:16:58.563 [rsBackgroundSync] SocketException: remote: 172.17.0.85:27017 error: 9001 socket exception [RECV_TIMEOUT] server [172.17.0.85:27017]
Wed Oct 9 20:16:58.563 [rsBackgroundSync] DBClientCursor::init call() failed

Comment by Daniel Pasette (Inactive) [ 09/Oct/13 ]

The underlying bug referred to in this ticket was resolved in the latest version of MongoDB (v2.4.6). Are you able to try running with this version?
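
A quick way to confirm which version each member is actually running, shown here for convenience using standard shell helpers:

// Run against each member to confirm the binary version in use.
print(db.version());                      // e.g. "2.4.6"
print(db.serverBuildInfo().gitVersion);   // exact build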

Comment by Siddhartha Goyal [ 03/Oct/13 ]

Hi,

I've run into the exact same problem multiple times in my setup, which is also 2 replicas and 1 arbiter. I hit it on the secondary quite often, and it requires an entire resync to fix. I'm running FreeBSD 9.2-RC4 on a ZFS filesystem, if that helps. Is there an equivalent gdb command to run on FreeBSD to gather thread dumps?

Comment by David Verdejo [ 26/Jul/13 ]

Hello,

Please, could you send me the command to execute on Windows?

Comment by sam.helman@10gen.com [ 25/Jul/13 ]

Hello,

Sorry for the delayed response. The issue that both of you are seeing looks like it may be SERVER-10085, which causes heartbeat requests to time out when a long-running oplog request is pending. If you can gather some more information and post or attach it here, we can take a look and see whether it points to the issue we expect.

The information we need can be obtained by running

gdb --pid $MONGODB_PID --batch --eval-command='thread apply all bt'

(replacing $MONGODB_PID with the pid of the running process). This will provide gdb stack traces for all currently running threads. Ideally, we would like this information both from times when the process is healthy and reachable and from times when it is unreachable, so we can compare.

Thanks!
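
As a supplement to the gdb stacks requested above (not part of the original request), the mmapv1 flush timings and any long-running operations can also be inspected from a mongo shell on the affected node while a stall is happening; a sketch using standard serverStatus/currentOp fields:

// Background flush statistics reported by mmapv1 (durations in milliseconds).
var flush = db.serverStatus().backgroundFlushing;
print("flushes=" + flush.flushes + " last_ms=" + flush.last_ms + " average_ms=" + flush.average_ms);

// Operations that have been running (or waiting) for more than 30 seconds.
db.currentOp().inprog.forEach(function (op) {
    if (op.secs_running && op.secs_running > 30) {
        printjson({ opid: op.opid, op: op.op, ns: op.ns, secs_running: op.secs_running });
    }
});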

Comment by James Robb [ 18/Jul/13 ]

I am also experiencing this exact same issue. Any resolution as of yet?

Comment by David Verdejo [ 06/Jun/13 ]

Secondary server

Comment by David Verdejo [ 06/Jun/13 ]

Primary server

Comment by David Verdejo [ 06/Jun/13 ]

2 notes:

  • When I receive this message, I try to make a telnet connection to port 27017 on the primary, and it works fine.
  • To resolve the issue, I first tried stopping and starting the service, but it kept failing. I then stopped the service, deleted all files in the dbpath, and started the service again; it is now working fine.

I will send you the logs from the servers.
