Loading...

XML

Word

Printable

JSON

Type: Bug
Resolution: Done
Priority: Minor - P4
Fix Version/s: 3.0.0
Affects Version/s: None
Component/s: Networking
Labels:
None
Environment:
Dell XPS13 laptop, Ubuntu 12.04, MongoDB 2.4.5 (10gen repo)

Backwards Compatibility:
Fully Compatible
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

We were playing with heartbeatTimeoutSecs with Edouard, and noticed that the behavior in socket timeouts is different when you have a mongos as part of the cluster vs when there is only a plain replica set (only mongod/arbitrator servers).

Consider the following test:

MongoD is running on ports 27001, 27002 and 27003 - it's easiest to run them in the foreground in a terminal (no --fork).
A MongoS is running on port 27000, connected to 1 config server. (config server isn't shown in attached logs as I didn't use --syslog for it)
Symptoms are the same whether i run some operations or if the cluster is completely idle.
Attached logs are with -vvvv verbosity

Tip: On a linux system I can use --syslog to get a single log of all processes. Apparently on a mac the syslog will not distinguish between the source of each log entry, so it becomes impossible to know which node wrote what.

The test is simply to stop the PRIMARY with Ctrl-Z and measure failover time until a new PRIMARY is active and writable. In this ticket, we focus on the time until we get network socket timeout, which looks like this in logs:

--> 27001 is PRIMARY. The last log entry from 27001 signifies the time I pressed Ctrl-Z:

Jul 16 17:05:29 hingo-sputnik mongod.27001[14713]: Tue Jul 16 17:05:29.845 [conn107] getmore local.oplog.rs query: { ts: { $gte: Timestamp 1373976850000|1 } } cursorid:6149432603763897 ntoreturn:0 keyUpdates:0 locks(micros) r:383 nreturned:0 reslen:20 5011ms

--> 27003 notices that 27001 is not responding:

Jul 16 17:05:33 hingo-sputnik mongod.27003[13420]: Tue Jul 16 17:05:33.472 [rsHealthPoll] Socket recv() timeout 127.0.1.1:27001

I have called out these occurences in the attached logs, so they are easy to find.

The default behavior is that this timeout is 10 seconds (plus some milliseconds).

Now, we configure heartbeatTimeoutSecs

myrs:PRIMARY> config = rs.config()
{
        "_id" : "myrs",
        "version" : 14,
        "members" : [
                {
                        "_id" : 0,
                        "host" : "hingo-sputnik:27001"
                },
                {
                        "_id" : 1,
                        "host" : "hingo-sputnik:27002"
                },
                {
                        "_id" : 2,
                        "host" : "hingo-sputnik:27003"
                }
        ]
}
myrs:PRIMARY> config["settings"] = { heartbeatTimeoutSecs : 2 }
{ "heartbeatTimeoutSecs" : 2 }
myrs:PRIMARY> config
{
        "_id" : "myrs",
        "version" : 14,
        "members" : [
                {
                        "_id" : 0,
                        "host" : "hingo-sputnik:27001"
                },
                {
                        "_id" : 1,
                        "host" : "hingo-sputnik:27002"
                },
                {
                        "_id" : 2,
                        "host" : "hingo-sputnik:27003"
                }
        ],
        "settings" : {
                "heartbeatTimeoutSecs" : 2
        }
}
myrs:PRIMARY> rs.reconfig(config)
{ "ok" : 1 }
// Just to verify it sticks:
myrs:PRIMARY> rs.config()
{
        "_id" : "myrs",
        "version" : 15,
        "members" : [
                {
                        "_id" : 0,
                        "host" : "hingo-sputnik:27001"
                },
                {
                        "_id" : 1,
                        "host" : "hingo-sputnik:27002"
                },
                {
                        "_id" : 2,
                        "host" : "hingo-sputnik:27003"
                }
        ],
        "settings" : {
                "heartbeatTimeoutSecs" : 2
        }
}
myrs:PRIMARY>

And then we repeat the test.

Actual result

We find that with a cluster that includes one mongos and one config server, the socket timeout becomes shorter, as expected. (3.5 secs in attached log, but still in the right ballpark.) If I shut down the mongos and config server, then the socket timeout remains at 10+ secs regardless of what I set heartbeatTimeoutSecs to be.

Expected result

Expected behavior would be that the socket timeout also becomes shorter when there is no mongos present.

Workarounds

Apparently most of our users are using sharding, so they would see the correct behavior (should they ever try to use heartbeatTimeoutSecs). Also, for someone for whom this timeout is important, it is feasible to simply add a mongos process to the cluster as a workaround. Hence I set priority to minor.

More importantly, as a first step we would just want to understand why mongos has any influence over what is happening.

- - Sort By Name
  - Sort By Date
  - Ascending
  - Descending
  - Thumbnails
  - List
  - Download All

heartbeatTimeoutSecs-with-mongos.log
401 kB
Jul 20 2013 01:19:52 AM UTC
heartbeatTimeoutSecs-without-mongos.log
158 kB
Jul 20 2013 01:19:52 AM UTC

Assignee:: Matt Dannenberg (Inactive)
Reporter:: Henrik Ingo (Inactive)
Participants:: Andy Schwerin, Henrik Ingo, Matt Dannenberg
Votes:: 0 Vote for this issue
Watchers:: 8 Start watching this issue

Created:: Jul 20 2013 01:19:52 AM UTC
Updated:: Nov 06 2017 12:36:32 AM UTC
Resolved:: May 26 2015 12:17:44 PM UTC

Details

Description

Actual result

Expected result

Workarounds

Attachments

Attachments

Activity

People

Dates