[SERVER-8189] Unstable connectivity in replica sets environment Created: 16/Jan/13  Updated: 15/Feb/13  Resolved: 22/Jan/13

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.2.2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Gusev Petr Assignee: Tad Marshall
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Debian: Linux elba-mongo3 2.6.32-5-amd64 #1 SMP Sun May 6 04:00:17 UTC 2012 x86_64 GNU/Linux


Attachments: File log     Text File mongo1.log     Text File mongo2.log     Text File mongo3.log     Text File rsStatus.txt     File strace.out    
Issue Links:
Related
related to SERVER-6889 Refuse to start on NUMA hardware if "... Closed
Operating System: Linux
Participants:

 Description   

Hello! We have a production setup of three nodes in a replica set: elba-mongo1 in one rack (the primary) and elba-mongo2 and elba-mongo3 in another (the secondaries).
Periodically we see entries similar to the following in our mongo logs:

Wed Jan 16 11:59:13 [rsBackgroundSync] replSet syncing to: elba-mongo1:27017
Wed Jan 16 11:59:16 [rsHealthPoll] DBClientCursor::init call() failed
Wed Jan 16 11:59:16 [rsHealthPoll] replSet info elba-mongo1:27017 is down (or slow to respond): DBClientBase::findN: transport error: elba-mongo1:27017 ns: admin.$cmd query: { replSetHeartbeat: "rs0", v: 6, pv: 1, checkEmpty: false, from: "elba-mongo3:27017", $auth: {} }

After this point a new primary is elected. elba-mongo1 pings successfully from both elba-mongo2 and elba-mongo3, but we can't connect to the elba-mongo1 instance from the elba-mongo1 machine itself using the mongo console - it just hangs on "connecting to":

root@elba-mongo1:~# telnet localhost 27017
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
^]
telnet> quit
Connection closed.
root@elba-mongo1:~# mongo
MongoDB shell version: 2.2.2
connecting to: test



 Comments   
Comment by Tad Marshall [ 22/Jan/13 ]

Hi Dmitry,

I had assumed that actually connecting from the shell in order to do something was a goal, but your timing approach is interesting as well.

I'm glad that you're up-and-running now. I'll close this ticket, and thanks for letting us know the resolution!

Tad

Comment by Tad Marshall [ 22/Jan/13 ]

Hi Gusev,

Thanks for letting us know what the problem was. I'm sorry that I didn't latch onto NUMA myself, since it is right there in the log you posted. Since you noted connectivity problems and two primaries, I didn't think to look for alternate explanations.

Version 2.4 will display any startup warnings from the server when connecting to it from the mongo shell, which will help to raise the visibility of issues such as NUMA. A proposal I made in SERVER-6889 to require an explicit command line switch to ignore the NUMA warning has not garnered any votes and is marked as "features we're not sure of".

Tad

Comment by Dmitry Karmazin [ 22/Jan/13 ]

Oh, yeah... that was a typo in the command. My mistake. Regarding the omitted "shell" option - I was just trying to measure connect time by running three commands in a row (one for each node), so I thought whether the client stays connected or not isn't so important, is it?

Nevertheless, we installed numactl last night. Mongod now starts without warnings and works without any issues.
We'll keep checking the logs for a couple of days; if the errors come back, I'll post a comment. And I'll keep your advice on using serverStatus and currentOp in mind.
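
For reference, a minimal sketch of the NUMA-interleaved startup suggested by the mongod warning (the binary and config path are the ones visible in this ticket; adjust to your environment):

# Start mongod with memory interleaved across NUMA nodes.
numactl --interleave=all /usr/bin/mongod --config /etc/mongodb.conf

# Optionally disable zone reclaim as well, which is commonly recommended
# alongside interleaving on NUMA hosts.
echo 0 > /proc/sys/vm/zone_reclaim_mode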

Comment by Gusev Petr [ 22/Jan/13 ]

As it turns out, the root of the problem was NUMA. We didn't pay much attention to it initially because the symptoms looked different from what has been described here: http://www.wireclub.com/development/TqnkQwQ8CxUYTVT90/read. But this article,
http://blog.jcole.us/2010/09/28/mysql-swap-insanity-and-the-numa-architecture, greatly clarified the NUMA-related behaviour. The real cause of the unstable server response times (which resulted in replication cycles) was exhausted physical memory on node-0 (one of our two NUMA nodes) and the correspondingly long paging intervals.
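
For anyone hitting the same symptoms, the per-node imbalance is easy to spot with standard numactl tooling (a minimal check; the exact output layout depends on the numactl version):

# Per-NUMA-node memory totals; a node whose "free" is near zero while the
# other node still has plenty is the pattern described above.
numactl --hardware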
So it might make sense to make the NUMA warning on server startup stricter - it is very unlikely that the current implementation of the mongo daemon can operate on a NUMA architecture without strange and hard-to-reproduce instabilities.
Many thanks to Tad Marshall and the whole mongo team, you guys do a great job!

Comment by Tad Marshall [ 21/Jan/13 ]

If this is really how you spelled "prompt" (sic. "propmt"), then you only set a new variable and didn't change the actual prompt. Changing the prompt to a fixed string is intended to sidestep any issues related to prompt negotiation; setting the "prompt" variable will accomplish this. Setting "propmt" just sets a variable, and not one that means anything to the shell.

You omitted the "--shell" option in the examples above ... is this an actual log, or are you retyping text from another source? You need "--shell" to make the shell continue running after it executes the "--eval" option.

If you have a working connection from elba-mongo2 to elba-mongo1, you should be able to use the serverStatus and currentOp commands to see what state it is in.
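
For example, from the shell session that is still connected to elba-mongo1 (a minimal sketch; the fields picked out here are just the ones most likely to be interesting for a hang like this):

db.serverStatus().connections   // current vs. available connection counts
db.serverStatus().globalLock    // lock state and queued readers/writers
db.currentOp()                  // operations currently in progress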

Are your ulimit settings sufficient to allow the number of connections you are trying to have?
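
A quick way to check (run as the user mongod runs under; the limit values you need depend on your connection count):

ulimit -n                                        # max open file descriptors for new shells
grep "open files" /proc/$(pidof mongod)/limits   # limit actually applied to the running mongod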

Comment by Dmitry Karmazin [ 21/Jan/13 ]

Hi Tad! I'm another of Petr's co-workers and one of the admins of the datacenter where the above servers are located.
We looked carefully at what happens on the server when a node stops responding. Here's what we found this time.

Mongod continues to accept new connections from both local and remote nodes (verified by telnet), and there is no packet loss while the nodes ping each other, but the
mongo console can't connect to the local instance even with a custom prompt:

root@elba-mongo1:~# time mongo --eval "propmt='>'" --host elba-mongo1
MongoDB shell version: 2.2.2
connecting to: elba-mongo1:27017/test
>

real 1m4.098s
user 0m0.024s
sys 0m0.004s

but it can connect to the second node from the same host:

root@elba-mongo1:~# time mongo --eval "propmt='>'" --host elba-mongo2
MongoDB shell version: 2.2.2
connecting to: elba-mongo2:27017/test
>

real 0m0.028s
user 0m0.020s
sys 0m0.008s

At the same time, if I leave a mongo client on elba-mongo2 connected to elba-mongo1, it can communicate with the server, while new clients from other nodes (including localhost) can't connect. Here's rs.status() from elba-mongo1 at that moment:

rs0:PRIMARY> rs.status()
{
"set" : "rs0",
"date" : ISODate("2013-01-21T09:58:29Z"),
"myState" : 1,
"members" : [

{ "_id" : 0, "name" : "elba-mongo1:27017", "health" : 1, "state" : 1, "stateStr" : "PRIMARY", "uptime" : 355478, "optime" : Timestamp(1358762294000, 4), "optimeDate" : ISODate("2013-01-21T09:58:14Z"), "self" : true }

,

{ "_id" : 1, "name" : "elba-mongo2:27017", "health" : 1, "state" : 2, "stateStr" : "SECONDARY", "uptime" : 478, "optime" : Timestamp(1358754311000, 7), "optimeDate" : ISODate("2013-01-21T07:45:11Z"), "lastHeartbeat" : ISODate("2013-01-21T09:58:27Z"), "pingMs" : 300 }

,

{ "_id" : 2, "name" : "elba-mongo3:27017", "health" : 1, "state" : 1, "stateStr" : "PRIMARY", "uptime" : 9145, "optime" : Timestamp(1358762308000, 7), "optimeDate" : ISODate("2013-01-21T09:58:28Z"), "lastHeartbeat" : ISODate("2013-01-21T09:58:29Z"), "pingMs" : 0 }

],
"ok" : 1
}

I noticed that the mongod process is often in an uninterruptible sleep state:
mongodb 17568 0.5 35.1 161820972 17377724 ? Dl Jan17 33:28 /usr/bin/mongod --config /etc/mongodb.conf
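
For reference, a couple of ways to see where a D-state process is blocked in the kernel (needs root; the wchan column alone is often enough to distinguish paging waits from network waits):

ps -o pid,stat,wchan:30,cmd -p 17568
cat /proc/17568/stack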

I'm attaching strace output for mongod from elba-mongo1. It was started slightly before the last heartbeat from elba-mongo1 was received by elba-mongo2 and runs until a mongo client connects to elba-mongo1.

I also see from the logs that mongod is running with NUMA enabled. Could that be the cause of this issue?

Comment by Tad Marshall [ 17/Jan/13 ]

Hi Renat,

The cases where a replica set has two primaries are caused by a network partition, particularly an asymmetrical partition where node 1 can see node 2 but node 2 cannot see node 1. As you saw, these conditions are temporary, because eventually the original primary will see that there is no longer a majority and will step down.

It may require more exhaustive tests of your network than you have performed so far to narrow down the problem. Can every node ping every other node continuously while you are seeing these problems? Are the pings equally reliable with large packets?
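
For example (a minimal sketch; the packet size and count are arbitrary):

# Large-packet ping between every pair of nodes; watch for loss or latency
# spikes while the replica set is misbehaving.
ping -s 1400 -c 200 elba-mongo2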

We have been able to reproduce issues similar to yours by simulating delays on the network that exceed the heartbeat frequency and we are looking into ways to increase reliability in the presence of network problems.
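
(For illustration, one common way to simulate such delays on Linux is tc/netem; the interface name and delay value here are placeholders:)

# Inject artificial latency on eth0, then remove it again afterwards.
tc qdisc add dev eth0 root netem delay 2000ms
tc qdisc del dev eth0 root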

But we are unable to see these issues on a fully functional network. Your logs show replication cycles that begin, proceed for a while and then fail with socket errors, over and over again. It seems that something is wrong at the network layer; this could be a configuration problem, an intermittent segment, router or switch, a network card or interface that sometimes corrupts data, or almost anything else that prevents packets from being sent and received error-free between all nodes in a timely manner.

Tad

Comment by Renat Khayretdinov [ 17/Jan/13 ]

(I'm Petr's co-worker.) Today we observed another issue. We had elba-mongo1 as PRIMARY and elba-mongo2 and elba-mongo3 as SECONDARY. Suddenly the second and third nodes stopped seeing the first instance (i.e. rs.status() reported it as not reachable/healthy). But the first node continued to send and receive heartbeats to the second and third and kept its PRIMARY state! After a while the second node became primary and the replica set began to accept client requests again. This situation (elba-mongo1 seeing the others while the others couldn't see it, with two primaries at once) continued for several minutes. There was also some further election churn, which you can see in the logs (mongo1.log, etc.). It continued until I restarted the mongo instance on elba-mongo1. So it seems unlikely to be a network problem; it is probably an issue with the heartbeat listener.
How can there have been two primaries, with elba-mongo1 not stepping down when it saw another primary? I've attached logs from all machines and the rs.status() output from during the incident.

Comment by Gusev Petr [ 16/Jan/13 ]

Hi Tad,

1) I will try your suggestion about the mongo shell next time we experience the problem. It usually happens under daily load.
2) Can we definitively determine whether this is a networking problem or a server-side listening problem? We ran some networking tests and they report OK.

Comment by Tad Marshall [ 16/Jan/13 ]

Hi Gusev,

Your log file shows a number of rollback attempts, some of which succeed but most of which fail. It looks like there were network problems during this period; there are several errors like "exception: 10278 dbclient error communicating with server: elba-mongo2:27017" and "replSet info elba-mongo2:27017 is down (or slow to respond): socket exception [SEND_ERROR] for 192.168.116.150:27017".

The mongo shell attempts some communication with the server to determine the prompt to display (e.g. "PRIMARY> " or "SECONDARY> ") and this may be where the shell is appearing to hang.

Can you try running the shell with a command line to disable this prompt calculation and see if this enables you to connect? Try this command at the bash prompt:

mongo --eval "prompt='>>> ' " --shell

This will force the prompt to be ">>> " (you can substitute anything you want), which will enable you to see if you can connect to the server even if the prompt negotiation is not working for some reason due to the state of the server. (The "--eval" part passes a JavaScript command to the shell, "prompt='text' " sets the "prompt" variable and "--shell" tells the shell not to exit after executing the JavaScript).

Tad
