[SERVER-9868] heartbeats not responded to during mmap flushing on Windows
| Created: | 06/Jun/13 | Updated: | 16/Nov/21 | Resolved: | 08/Jul/14 |
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.4.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | David Verdejo | Assignee: | Mark Benvenuto |
| Resolution: | Duplicate | Votes: | 2 |
| Labels: | DBClientCursor, rsHealthPoll |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Thu Jun 06 13:21:10.788 [initandlisten] MongoDB starting : pid=7356 port=27017 dbpath=e:\mongodb_rssc01\data\db 64-bit host=LOG-MNGSC21 |
| Attachments: | |
| Issue Links: | |
| Operating System: | Windows |
| Participants: | |
| Description |
|
We have an environment with 2 nodes and 1 arbiter, with the following configuration (the first two member entries are truncated in this excerpt):

    { , , { "_id" : 2, "host" : "log-mngsc22:27018", "arbiterOnly" : true } ]

LOG-MNGSC11 is the primary and LOG-MNGSC21 is the secondary. Suddenly, replication fails with the following message on the secondary:

    Thu Jun 06 11:56:50.656 [rsHealthPoll] replset info LOG-MNGSC11:27017 thinks that we are down
    ntoreturn:1 keyUpdates:0 reslen:44 300005ms

On the primary, I see the following messages:

    Thu Jun 06 11:56:52.524 [initandlisten] connection accepted from 172.29.106.95:56714 #16239 (66 connections now open)
    } cursorid:479480611067557781 ntoreturn:0 ntoskip:0 nscanned:102 keyUpdates:0 numYields: 2264 locks(micros) r:727945 nreturned:101 reslen:12039 35319ms
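For reference, here is a minimal sketch of how a two-data-node plus arbiter replica set along these lines would be initiated from the mongo shell. The hostnames follow the ones mentioned above, but the set name, the _id values, and the ports of the first two members are assumptions, since they are truncated in the excerpt:

    # Hypothetical reconstruction of a 2-node + arbiter replica set configuration.
    # Member 0 and 1 details and the set name "rssc01" are assumptions; each mongod
    # would need to be started with --replSet rssc01 for this to succeed.
    mongo --host log-mngsc11 --port 27017 --eval '
      rs.initiate({
        _id: "rssc01",
        members: [
          { _id: 0, host: "log-mngsc11:27017" },                   // primary
          { _id: 1, host: "log-mngsc21:27017" },                   // secondary
          { _id: 2, host: "log-mngsc22:27018", arbiterOnly: true } // arbiter (as in the excerpt)
        ]
      })'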
| Comments |
| Comment by Thomas Rueckstiess [ 04/Apr/14 ] | ||
|
I believe the original poster's replica set failover was due to very long flushes on the Windows platform, during which heartbeats and other requests were not processed. The flushes here were caused by a multi remove (see the log entry from the primary above). The timing of the "thinks that we are down" messages correlates directly with the period of the removes.
This is similar to what we see on
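As a rough illustration of the kind of operation described above (not the original poster's actual command), a multi-document remove issued from the mongo shell could look like the following; the database, collection, and predicate are hypothetical:

    # Hypothetical multi-document remove; in 2.4, remove() deletes every matching
    # document by default, dirtying many memory-mapped pages that must later be
    # flushed to disk -- the kind of flush activity described in this comment.
    mongo log-mngsc11:27017/appdb --eval '
      db.events.remove({ createdAt: { $lt: new Date("2013-01-01") } })
    '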
| Comment by sam flint [ 23/Oct/13 ] | ||
|
I am experiencing this issue in production, and I have 6TB of data on this system, so re-syncing the data isn't as easy as just doing it. I am running 2.4.4, and the replication lag is now about 57 hours. Please advise asap.
| Comment by Siddhartha Goyal [ 09/Oct/13 ] | ||
|
Hi, yes, we are actually running that version. The problem seems quite random, and in the current setup I have all reads going to the primary, so it's not as if some long-running query might cause this. Is there something I can do to figure out what causes the issue? When the issue does happen, I always see this sequence of messages:

    Wed Oct 9 20:16:58.563 [rsBackgroundSync] Socket recv() timeout 172.17.0.85:27017
| Comment by Daniel Pasette (Inactive) [ 09/Oct/13 ] | ||
|
The underlying bug referred to in this ticket was resolved in the latest version of MongoDB (v2.4.6). Are you able to try running with this version?
| Comment by Siddhartha Goyal [ 03/Oct/13 ] | ||
|
Hi, I've run into the exact same problem multiple times in my setup, which is also a setup with 2 replicas and 1 arbiter. I run into this problem on the secondary quite often, and it requires an entire resync to fix. I'm running on FreeBSD 9.2-RC4 on a ZFS filesystem, if that helps. Is there an equivalent gdb command to run on FreeBSD to gather thread dumps?
| Comment by David Verdejo [ 26/Jul/13 ] | ||
|
Hello, could you please send me the command to execute on Windows?
| Comment by sam.helman@10gen.com [ 25/Jul/13 ] | ||
|
Hello, sorry for the delayed response. The issue that both of you are seeing looks like it may be a known issue. The information we need can be obtained by running gdb against the mongod process (replacing $MONGODB_PID with the pid of the running process); a sketch of such a command is shown after this comment. This will provide stack trace information from gdb on all of the currently running threads. Ideally, we would like this information both from times when the process is healthy and reachable and from times when it is unreachable, so we can compare. Thanks!
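For reference, a common gdb invocation that captures a backtrace of every thread of a running process looks like the following; the exact flags the support team had in mind may differ, and the output file name is only an example:

    # Attach to the running mongod, dump a backtrace of every thread, then detach.
    # Replace $MONGODB_PID with the pid of the running mongod process.
    gdb --batch -p $MONGODB_PID -ex "thread apply all bt" > mongod-threads.txt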
| Comment by James Robb [ 18/Jul/13 ] | ||
|
I am also experiencing this exact same issue. Any resolution as of yet?
| Comment by David Verdejo [ 06/Jun/13 ] | ||
|
Secondary server
| Comment by David Verdejo [ 06/Jun/13 ] | ||
|
Primary server
| Comment by David Verdejo [ 06/Jun/13 ] | ||
|
2 notes:
I will send you the logs from the servers.