|
Recently, we had some issues with replica set failover, due to the servers not being able to cope with the high I/O load caused by spiky migrations (see SERVER-7686 for a more detailed description of the original problem) – mongod intermittently stops responding to network connections because it's busy doing I/O, so other RS members believe it's dead.
While this problem by itself is bad enough, the failover result we got was, well, very bad – let me describe it in detail.
The cluster in question hosts a MongoDB database containing GridFS data, which currently has about 30 million files occupying 1.5TiB of data. It has two mongos routers (b1, b2), three config servers (c1, c2, c3) and four shards that are each comprised of two data servers (d1.sN, d2.sN) and an arbiter (a.sN), where N is the shard number (1-4).
Writes to the database are done via the mongos routers (using the C# driver, but that is irrelevant since we already ruled out that problem), with SafeMode.true, but without w:majority (yes, we know that w:majority is required for consistent failover, but we have not yet been able to implement it).
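For reference, the rough shell equivalent of what each write would need is something like the following (just an illustration of the concept; we actually use the C# driver's SafeMode settings, not the shell):
db.runCommand({ getlasterror: 1, w: "majority", wtimeout: 10000 })   // request majority acknowledgement for the preceding write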
When the problem hit us, I first thought that everything went more or less okay – after I got a notification from our Zabbix monitoring, I checked the "rollback" folders on all servers and found that there were none. Then I checked the logs and saw that there were repeated "failures" in quick succession, so I thought that the client code wasn't able to pick up the new master, hence the absence of a rollback. A few days later I was checking the logs of the other server (the one that only reads from the database) and found that some files could not be read from GridFS. Surprisingly, the fs.files document was intact, but there were no associated documents in the fs.chunks collection. Then I did a quick check of the number of files and chunks, expecting the count of GridFS chunks to be a bit larger than the number of files, and to my horror discovered that we're missing at least 500,000 GridFS chunks for non-zero-length files.
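For reference, the count check itself was roughly along these lines (run via mongos against the GridFS database, here called "a" as in the oplog entries quoted below):
db.getSiblingDB("a").getCollection("fs.files").count({ length: { $gt: 0 } })   // non-zero-length files
db.getSiblingDB("a").getCollection("fs.chunks").count()                        // expected to be at least as large as the number above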
From this point on, I had three primary suspects:
1. database corruption due to external reasons
2. mongos chunk migration gone awry (either failed migration mistakenly recorded as successful or inconsistent shard min/max values in chunk map)
3. ?
I quickly ruled out #1, because there were no power failures, the disks are fine (I checked SMART attributes, did a patrol read and a self-test), syslog/dmesg are clean, etc.
Then I checked chunk migration. While it was the original cause of the disk overload, the missing files were not in the ranges of chunks that were transferred (I checked the primary logs along with the config.changelog collection of the config database).
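For completeness, the changelog check was along these lines (the ns value comes from the oplog entry shown below; the query itself is only illustrative):
db.getSiblingDB("config").changelog.find({ ns: "a.fs.chunks", what: /moveChunk/ }).sort({ time: -1 })   // moveChunk history for the GridFS chunks collection, newest first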
Then I decided to rule out user error, so I checked the oplog for the related file. Since I knew the fs.files id (NumberLong("1113206594855763968")), I executed the following on all shard replica sets (connected directly to the "local" databases, not via mongos):
db.oplog.rs.find({"o.files_id":NumberLong("1113206594855763968")}).sort({$natural:-1}).limit(1)<code>
|
which found exactly one "insert" operation on exactly one replica set:
{ "ts" : Timestamp(1352980035000, 28), "h" : NumberLong("2879588865588641652"), "v" : 2, "op" : "i", "ns" : "a.fs.chunks", "o" : { "_id" : ObjectId("50a4d6366b8dda1458505dbb"), "files_id" : NumberLong("1113206594855763968"), "n" : 0, "data" : BinData(0,"...")
|
After this, I knew the _id of the fs.chunks document, so I checked whether it was deleted, partly to make sure it wasn't chunk migration and partly to completely rule out errors in the client code (which was working fine), with:
db.oplog.rs.find({"o._id":ObjectId("50a4d6366b8dda1458505dbb")}).sort({$natural:-1}).limit(1)
|
Unfortunately, it returned the same "insert" operation, so as far as I understand how the oplog is structured, the document was inserted and never deleted (which completely rules out a migration failure, since I checked each RS directly), yet it is not present in the database.
Reading the logs was my next step. The arbiter didn't have anything interesting aside from the fact that servers were going "up" and "down" (I would like to remind the reader that the mongod processes were running fine; they were just so blocked on I/O that they could not accept connections in a timely fashion). The primary had almost the same logs, with the addition of migration-related info (moveChunk started, committing, etc.).
Thu Nov 15 15:26:53 [rsHealthPoll] DBClientCursor::init call() failed
|
Thu Nov 15 15:26:53 [rsHealthPoll] replSet info d1.s1.fs.drive.bru:27021 is down (or slow to respond): DBClientBase::findN: transpor
|
Thu Nov 15 15:26:53 [rsHealthPoll] replSet member d1.s1.fs.drive.bru:27021 is now in state DOWN
|
Thu Nov 15 15:26:53 [rsMgr] not electing self, we are not freshest
|
Thu Nov 15 15:26:59 [rsMgr] not electing self, we are not freshest
|
Thu Nov 15 15:27:05 [rsHealthPoll] DBClientCursor::init call() failed
|
Thu Nov 15 15:27:05 [rsMgr] replSet info electSelf 1
|
Thu Nov 15 15:27:17 [rsHealthPoll] DBClientCursor::init call() failed
|
Thu Nov 15 15:27:25 [rsMgr] replSet PRIMARY
|
Thu Nov 15 15:27:26 [rsBackgroundSync] replset 1134533 ops were not applied from buffer, this should cause a rollback on the former
|
Thu Nov 15 15:27:27 [rsHealthPoll] replSet member d1.s1.fs.drive.bru:27021 is up
|
Thu Nov 15 15:27:27 [rsHealthPoll] replSet member d1.s1.fs.drive.bru:27021 is now in state PRIMARY
|
Thu Nov 15 15:27:27 [rsMgr] replSet relinquishing primary state
|
Thu Nov 15 15:27:27 [rsMgr] replSet SECONDARY
|
Thu Nov 15 15:27:27 [rsMgr] replSet closing client sockets after relinquishing primary
|
Thu Nov 15 15:27:28 [rsBackgroundSync] replSet syncing to: d1.s1.fs.drive.bru:27021
|
Thu Nov 15 15:27:28 [rsHealthPoll] replSet info a.s1.fs.drive.bru:27021 is down (or slow to respond): socket exception [SEND_ERROR]
|
Thu Nov 15 15:27:28 [rsHealthPoll] replSet member a.s1.fs.drive.bru:27021 is now in state DOWN
|
Thu Nov 15 15:27:29 [rsHealthPoll] replSet info d1.s1.fs.drive.bru:27021 is down (or slow to respond): socket exception [SEND_ERROR]
|
Thu Nov 15 15:27:29 [rsHealthPoll] replSet member d1.s1.fs.drive.bru:27021 is now in state DOWN
|
Thu Nov 15 15:27:29 [rsMgr] replSet can't see a majority, will not try to elect self
|
Thu Nov 15 15:27:31 [rsHealthPoll] replSet member d1.s1.fs.drive.bru:27021 is up
|
Thu Nov 15 15:27:31 [rsHealthPoll] replSet member d1.s1.fs.drive.bru:27021 is now in state PRIMARY
|
Thu Nov 15 15:27:31 [rsHealthPoll] replSet member a.s1.fs.drive.bru:27021 is up
|
Thu Nov 15 15:27:31 [rsHealthPoll] replSet member a.s1.fs.drive.bru:27021 is now in state ARBITER
|
Thu Nov 15 15:27:32 [rsSyncNotifier] Socket say send() errno:9 Bad file descriptor 10.3.1.12:27021
|
Thu Nov 15 15:27:32 [rsSyncNotifier] replset tracking exception: exception: 9001 socket exception [SEND_ERROR] for 10.3.1.12:27021
|
On the other hand, the secondary (which, of course, became primary at some point and then went back) had not only similar lines about lost connections and election-related messages, but also two very interesting lines:
Thu Nov 15 15:27:26 [rsBackgroundSync] replset 1134533 ops were not applied from buffer, this should cause a rollback on the former
|
...
|
Thu Nov 15 15:41:15 [rsBackgroundSync] replset 7806611 ops were not applied from buffer, this should cause a rollback on the former
|
So it appears that there should have been a rollback, but there is no "rollback" directory (permissions are fine and the logs are silent on both servers), and there are no rollback-related records on the primary, so it appears that the data just vanished.
Also, analyzing historical monitoring graphs and matching missing fs.chunks documents with upload timestamps, we found that the missing data always correlates with a failover caused by I/O overload triggered by shard chunk migration, so this problem existed in 2.0.7 too.
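(One way to recover an upload time in the shell is from the chunk's ObjectId, which encodes its creation time; this is only an illustration, since the fs.files document's uploadDate field serves the same purpose:)
ObjectId("50a4d6366b8dda1458505dbb").getTimestamp()   // insert time of the chunk, compared against the failover windows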
Now, let me get to the point.
I understand that w:majority is a must for robust operation where one cannot afford rollbacks and the manual reconciliation usually associated with conflicts that might be impossible to resolve correctly. Still, since this database only gets inserts and (occasional) deletes, I presumed that even if a rollback happened, I would just run a mongorestore on it as usual. Also, the existing documentation points out that a rollback does not by itself lead to data loss (see http://www.mongodb.org/display/DOCS/Replica+Sets+-+Rollbacks), only to manual reconciliation (which in our case would just mean adding the rolled-back data back to the database). Unfortunately, in our case it appears that the rollback failed to occur properly for an unknown reason, leading to a large amount of lost data (also, I'm not 100% sure that w:majority would be 100% safe if this is a bug).
So I guess it's either a misunderstanding of how MongoDB works on our side, a documentation deficiency, or a very nasty bug.
Of course, my analysis might be completely wrong. I still have the logs (I can back them up before they get rotated out; tell me if you would like to have a look) and can make a dump of the config servers. Dumping the database itself or its oplog would be problematic though: while it doesn't contain private data, it's too large to move over anything except a LAN.
As for recovery – while I know what steps I should take (manually replaying the oplog with some kind of pre-scan optimization, like not trying to replay inserts that get deleted later), I think it's unfeasible, since our current oplog is about 200GB per shard RS member and matching about nine million oplog records would overload the server. Maybe you have some advice?
|
|
No problem! Just let us know if you need anything else.
|
|
Many thanks for the detailed explanation! I'll review our older logs for the clock skew messages just to check. Also, it appears that we did in fact have some network issues due to a faulty second switch, so SERVER-1929 would be useful should similar problems arise.
|
|
Sorry for the delay.
There are two scenarios that pop up in your logs; here's what causes each of them and what the implications are:
- When a secondary becomes a primary (d2, say), it clears any ops it got from the former primary that have not yet been applied. This triggers the "should cause rollback" message. There are two possibilities at that point, either:
- d2 will do some of its own writes, in which case they'll be rolled back when it realizes d1 is still primary or
- d2 will step down before doing any writes, in which case the cleared ops will be refetched from the primary. Inefficient, but not dangerous.
- If the message does not appear when d2 becomes primary, and d2 does some writes and then steps down again, then d2 now has some writes that don't exist anywhere else (which is probably not what you want). The indication that this has happened is that you'll see warnings about "clock skew" in the logs, because d2 is applying ops fetched from d1 after ops from its own stint at being primary, so the ops look like they're going "back in time." So you may want to be on the lookout for the clock skew warning after a flap: it means the member got some writes that may not be present elsewhere.
The clock skew errors look like this:
Thu Nov 15 15:47:28 [rsSync] replSet error possible failover clock skew issue? 50a4d64f:4
|
You have one clock skew message in the November 27th logs: around 18:00:18 in d2 & d3's log, from a quick failover when d3 became primary. We can help you come up with a way of finding unreplicated ops when you get a clock skew error, if you'd like.
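As a rough, unpolished sketch of what such a check could look like (run in the shell on the member that may have the extra writes; the host name and starting timestamp below are placeholders, not values from your deployment):
// compare this member's recent oplog entries against its sync source and print ops that exist only here
var since = Timestamp(1354039200, 0);                    // placeholder: a point shortly before the flap
var source = new Mongo("d1.s3.fs.drive.bru:27023");      // placeholder: the member it normally syncs from
db.getSiblingDB("local").oplog.rs.find({ ts: { $gte: since } }).forEach(function (op) {
    if (source.getDB("local").oplog.rs.findOne({ ts: op.ts, h: op.h }) === null) {
        print("op not found on sync source: " + tojson({ ts: op.ts, op: op.op, ns: op.ns }));
    }
});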
SERVER-7759 should fix both of these cases, and I hope to have it done in the next couple of days (for 2.4.0). SERVER-1929 should also help by limiting failovers in the future.
|
|
Sorry for pestering you, did you have time to check the logs?
|
|
Attaching logs for November 27.
|
|
Sure, let me package them. I'll send you uncut versions for the entire day so that I don't accidentally delete something important.
|
|
Sorry for being unclear.
The "should be a rollback" message without a rollback does NOT, on its own, mean there is any loss or inconsistency.
Can you send the logs for the last segment so we can verify?
|
|
By the way, does this mean that the secondaries and the primary not only lost some writes completely, but also have different data, so we need to resync the secondaries from scratch to get a consistent replica set?
|
|
But there was no rollback on d2 (the former primary) at all.
Also, I grepped for "rollback" in the logs; this line is logged after every failover, and in most cases there is not only no rollback on the former primary, but no rollback on any member of the set.
|
|
Sorry for the confusion, I got it backwards. That line is okay.
On the d1, d2, d3 clarification: if d2 was primary while d1 was a secondary, then yes. It's referring to "the other primary" in the system.
|
|
I would also like to ask for some clarification.
There are three nodes: d1, d2, d3:
d1[P] d2[S] d3[S]
There is a failover on d1:
d1[X] d2[P] d3[S]
Then, d1 returns to normal, performing a rollback before becoming primary:
d1[R] d2[P] d3[S]
...
d1[P] d2[S] d3[S]
And then d1 (after becoming primary) logs the "... ops were not applied from buffer..."
Where should the mentioned "rollback on the former primary" occur in this case? On d2?
|
|
What would be the best course of action for us to prevent data loss on failover? Rolling back to 2.0? Also, how can we help diagnose the issue?
|
|
For example, in the latest logs I posted, there was no rollback on the secondary.
|
|
Well, it appears that we have this problem often enough.
|
|
Yes: the "should cause a rollback" line is a problem. If you see the rollback line and there was no rollback, then those operations were skipped on the secondary.
|
|
It's the "should cause a rollback" line when there is no rollback on the other servers that worries me. We had the same line in the original case and lost some data. Unfortunately, I don't know enough of the internals to decide whether it's an indicator of a real problem.
|
|
d1.s3 doesn't have anything important except the "rollback should be on the former primary" line. Attaching a part of the log from startup to normal operation (there was a rollback, but as far as I understand, "former primary" refers to d3.s3 in this case).
|
|
Looks normal, but can you send some more of the log from s1.s3 to confirm?
|
|
We had a failover today (due to a network reconfiguration). After the primary returned to its primary status, it wrote the following to the log.
d1.s3:
15:06:00 [rsHealthPoll] replSet member d3.s3.fs.drive.bru:27023 is now in state SECONDARY
|
15:06:00 [rsMgr] not electing self, d3.s3.fs.drive.bru:27023 would veto
|
15:06:05 [rsMgr] replSet info electSelf 0
|
15:06:05 [rsMgr] replSet PRIMARY
|
15:06:08 [rsBackgroundSync] replset 6962 ops were not applied from buffer, this should cause a rollback on the former primary
|
There is no sign of rollback on the secondaries though.
d2.s3:
15:05:59 [rsSyncNotifier] replset tracking exception: exception: 10278 dbclient error communicating with server: d3.s3.fs.drive.bru:27023
|
15:05:59 [rsBackgroundSync] replSet db exception in producer: 10278 dbclient error communicating with server: d3.s3.fs.drive.bru:27023
|
15:06:00 [conn636] d3.s3.fs.drive.bru:27023 is trying to elect itself but d3.s3.fs.drive.bru:27023 is already primary and more up-to-date
|
15:06:01 [rsHealthPoll] replSet member d3.s3.fs.drive.bru:27023 is now in state SECONDARY
|
15:06:01 [rsMgr] not electing self, d1.s3.fs.drive.bru:27023 would veto
|
15:06:06 [conn635] replSet info voting yea for d1.s3.fs.drive.bru:27023 (0)
|
15:06:07 [rsMgr] not electing self, d3.s3.fs.drive.bru:27023 would veto
|
15:06:07 [rsHealthPoll] replSet member d1.s3.fs.drive.bru:27023 is now in state PRIMARY
|
15:06:09 [rsBackgroundSync] replSet syncing to: d1.s3.fs.drive.bru:27023
|
15:06:09 [rsSyncNotifier] replset setting oplog notifier to d1.s3.fs.drive.bru:27023
|
d3.s3:
15:05:59 [conn1678] replSet info stepping down as primary secs=1
|
15:05:59 [conn1678] replSet relinquishing primary state
|
15:05:59 [conn1678] replSet SECONDARY
|
15:05:59 [conn1678] replSet closing client sockets after relinquishing primary
|
15:06:00 [rsHealthPoll] replSet info d1.s3.fs.drive.bru:27023 is down (or slow to respond): socket exception [SEND_ERROR] for 10.3.1.12:27023
|
15:06:00 [rsHealthPoll] replSet member d1.s3.fs.drive.bru:27023 is now in state DOWN
|
15:06:00 [rsMgr] replSet info electSelf 3
|
15:06:00 [rsMgr] replSet couldn't elect self, only received -9999 votes
|
15:06:02 [rsHealthPoll] replSet member d1.s3.fs.drive.bru:27023 is up
|
15:06:02 [rsHealthPoll] replSet member d1.s3.fs.drive.bru:27023 is now in state SECONDARY
|
15:06:02 [rsMgr] not electing self, d2.s3.fs.drive.bru:27023 would veto
|
15:06:06 [conn1677] replSet info voting yea for d1.s3.fs.drive.bru:27023 (0)
|
15:06:08 [rsHealthPoll] replSet member d1.s3.fs.drive.bru:27023 is now in state PRIMARY
|
15:06:08 [rsBackgroundSync] replSet syncing to: d1.s3.fs.drive.bru:27023
|
15:06:08 [rsSyncNotifier] replset setting oplog notifier to d1.s3.fs.drive.bru:27023
|
Is this normal, or are we losing data again?
|
|
Yes, it looks like syncdelay would limit the write problem. But this is more of a solution for SERVER-7686 (which we already tried to fix by enabling write caching, moving to the deadline scheduler, and enabling _secondaryThrottle – I just can't tell you whether that helped, since we haven't had fs.files migrations since the last failure).
|
|
One thing to try would be to lower syncdelay a lot (from 60 to 5) and see if that smooths things out.
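(If memory serves, it can also be changed at runtime without a restart, something along these lines:)
db.adminCommand({ setParameter: 1, syncdelay: 5 })   // takes effect immediately; reverts to the configured value after a restart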
|
|
Have to go now, it's late here. Thanks for the attention to this case, I look forward to seeing you at MongoSV.
|
|
My theory is close to this: http://serverfault.com/questions/126413/limit-linux-background-flush-dirty-pages
|
|
I also checked the CPU graphs: iowait peaks at ~30%, user at 3-5%, and everything else (system, irq, etc.) is below 1% on both servers.
|
|
Also, please note that 150-170 write IOPS is actually pretty much the sustained maximum for this class of drives with write caching disabled.
|
|
The stats for the physical drives that comprise this RAID array are similar to those of the array itself.
|
|
Attaching IOPS and I/O rate graphs. It looks like MongoDB isn't the only application that stops responding – there are gaps in the graphs, which mean that the zabbix agent was also unreachable at the time.
|
|
We don't use authentication at all (these servers are behind NAT/firewall and can be only accessed through VPN).
|
|
As a side note, I want to tell you that I wrote a small tool that walks the oplog backwards, checks every insert against the database, and collects deletes to prevent deleted items from being re-added. If it finds an insert that is not in the database and had no subsequent delete, it re-issues the insert through mongos. While the oplog is only 20+ days long (and we have had these data-losing failovers for some time now), it would allow us to recover at least the data lost during the last month. It's halfway through the oplog on three of the four shards (the fourth was added recently, so its oplog is shorter).
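A simplified mongo shell sketch of the idea (not the tool itself; the mongos address is illustrative, and the database name "a" is taken from the oplog entries earlier in this ticket):
// walk one shard's oplog backwards: remember deletes, re-insert inserts that are missing and were never deleted
var seenDeletes = {};
var router = new Mongo("b1:27017");                               // illustrative mongos address
var chunks = router.getDB("a").getCollection("fs.chunks");
db.getSiblingDB("local").oplog.rs.find({ ns: "a.fs.chunks" }).sort({ $natural: -1 }).forEach(function (op) {
    var key = tojson(op.o._id);
    if (op.op === "d") {
        seenDeletes[key] = true;                                  // deleted later, do not re-add
    } else if (op.op === "i" && !seenDeletes[key] && chunks.findOne({ _id: op.o._id }) === null) {
        chunks.insert(op.o);                                      // present in the oplog, missing from the database: re-insert via mongos
    }
});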
|
|
It's very weird that there are reachability errors correlated with disk saturation.
Are you using auth? (Just trying to collect data and make connections)
Do you have disk monitoring you can share?
|
|
kern.log and dmesg for d1 and d2 at this time are completely empty.
I checked syslog on both machines; there are only the usual lines about cron, puppet, etc. (no crontabs or backup scripts are scheduled to do any work at this time, and the puppet configuration was unchanged):
Nov 15 15:00:37 pisces crontab[16705]: (root) LIST (root)
|
Nov 15 15:00:37 pisces crontab[16707]: (root) LIST (mongodb)
|
Nov 15 15:00:41 pisces puppet-agent[19846]: Finished catalog run in 4.95 seconds
|
Nov 15 15:10:48 pisces crontab[19311]: (root) LIST (root)
|
Nov 15 15:10:48 pisces crontab[19314]: (root) LIST (mongodb)
|
Nov 15 15:10:52 pisces puppet-agent[19846]: Finished catalog run in 5.58 seconds
|
Nov 15 15:17:01 pisces CRON[22970]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
|
Nov 15 15:20:59 pisces crontab[28542]: (root) LIST (root)
|
Nov 15 15:20:59 pisces crontab[28544]: (root) LIST (mongodb)
|
Nov 15 15:21:03 pisces puppet-agent[19846]: Finished catalog run in 4.72 seconds
|
Nov 15 15:31:09 pisces crontab[10930]: (root) LIST (root)
|
Nov 15 15:31:09 pisces crontab[10932]: (root) LIST (mongodb)
|
Nov 15 15:31:13 pisces puppet-agent[19846]: Finished catalog run in 4.33 seconds
|
Nov 15 15:41:20 pisces crontab[25912]: (root) LIST (root)
|
Nov 15 15:41:20 pisces crontab[25914]: (root) LIST (mongodb)
|
Nov 15 15:41:24 pisces puppet-agent[19846]: Finished catalog run in 4.68 seconds
|
Nov 15 15:51:32 pisces crontab[10896]: (root) LIST (root)
|
Nov 15 15:51:32 pisces crontab[10898]: (root) LIST (mongodb)
|
Nov 15 15:51:36 pisces puppet-agent[19846]: Finished catalog run in 5.66 seconds
|
Nov 15 16:01:43 pisces crontab[24907]: (root) LIST (root)
|
Nov 15 16:01:43 pisces crontab[24909]: (root) LIST (mongodb)
|
Nov 15 16:01:47 pisces puppet-agent[19846]: Finished catalog run in 5.19 seconds
|
Nov 15 16:11:56 pisces crontab[7497]: (root) LIST (root)
|
Nov 15 16:11:56 pisces crontab[7499]: (root) LIST (mongodb)
|
Nov 15 16:12:01 pisces puppet-agent[19846]: Finished catalog run in 5.58 seconds
|
Nov 15 16:17:01 pisces CRON[14739]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
|
Nov 15 16:22:09 pisces crontab[22443]: (root) LIST (root)
|
Nov 15 16:22:09 pisces crontab[22445]: (root) LIST (mongodb)
|
Nov 15 16:22:13 pisces puppet-agent[19846]: Finished catalog run in 4.67 seconds
|
Nov 15 16:32:20 pisces crontab[1758]: (root) LIST (root)
|
Nov 15 16:32:20 pisces crontab[1760]: (root) LIST (mongodb)
|
Nov 15 16:32:24 pisces puppet-agent[19846]: Finished catalog run in 4.75 seconds
|
Nov 15 16:42:30 pisces crontab[8749]: (root) LIST (root)
|
Nov 15 16:42:30 pisces crontab[8751]: (root) LIST (mongodb)
|
Nov 15 16:42:35 pisces puppet-agent[19846]: Finished catalog run in 5.17 seconds
|
Nov 15 16:52:42 pisces crontab[11379]: (root) LIST (root)
|
Nov 15 16:52:42 pisces crontab[11381]: (root) LIST (mongodb)
|
Nov 15 16:52:46 pisces puppet-agent[19846]: Finished catalog run in 5.36 seconds
|
|
|
Also, the reachability problems usually manifest themselves when disk I/O is high (I guess the VMM just can't cope with a large flush queue, so everything stalls).
|
|
Yeah, bonding can do that.
Nothing in system logs around that time?
|
|
I see there are dropped packets, but I guess these come from the balance-alb algorithm since it rewrites MAC addresses on the fly to allow transparent port balancing.
|
|
No, unfortunately we monitor only ops/sec and RS lag.
Please note that these counters might not reflect the problem, because both servers were restarted about one day ago (not simultaneously) as part of my diagnostic attempts (I checked the HBA disk health information along with the ME event logs, etc.).
netstat -s @d1
Ip:
|
201165938 total packets received
|
3639 with invalid addresses
|
0 forwarded
|
0 incoming packets discarded
|
201162299 incoming packets delivered
|
190485556 requests sent out
|
Icmp:
|
3356 ICMP messages received
|
2224 input ICMP message failed.
|
ICMP input histogram:
|
destination unreachable: 3347
|
timeout in transit: 6
|
echo requests: 3
|
3371 ICMP messages sent
|
0 ICMP messages failed
|
ICMP output histogram:
|
destination unreachable: 3368
|
echo replies: 3
|
IcmpMsg:
|
InType3: 3347
|
InType8: 3
|
InType11: 6
|
OutType0: 3
|
OutType3: 3368
|
Tcp:
|
74233 active connections openings
|
1831918 passive connection openings
|
1252 failed connection attempts
|
2157 connection resets received
|
1972 connections established
|
200546175 segments received
|
334410694 segments send out
|
119987 segments retransmited
|
0 bad segments received.
|
2882 resets sent
|
Udp:
|
619143 packets received
|
113 packets to unknown port received.
|
0 packet receive errors
|
607001 packets sent
|
UdpLite:
|
TcpExt:
|
36 resets received for embryonic SYN_RECV sockets
|
1797144 TCP sockets finished time wait in fast timer
|
3351962 delayed acks sent
|
2275 delayed acks further delayed because of locked socket
|
Quick ack mode was activated 5470 times
|
790 times the listen queue of a socket overflowed
|
790 SYNs to LISTEN sockets dropped
|
165252261 packets directly queued to recvmsg prequeue.
|
3540748772 bytes directly in process context from backlog
|
2353661999 bytes directly received in process context from prequeue
|
119142543 packet headers predicted
|
20271994 packets header predicted and directly queued to user
|
15384610 acknowledgments not containing data payload received
|
126389342 predicted acknowledgments
|
7894 times recovered from packet loss by selective acknowledgements
|
84 bad SACK blocks received
|
Detected reordering 8 times using FACK
|
Detected reordering 18 times using SACK
|
Detected reordering 21 times using time stamp
|
22 congestion windows fully recovered without slow start
|
49 congestion windows partially recovered using Hoe heuristic
|
887 congestion windows recovered without slow start by DSACK
|
9186 congestion windows recovered without slow start after partial ack
|
7653 TCP data loss events
|
TCPLostRetransmit: 418
|
656 timeouts after SACK recovery
|
16 timeouts in loss state
|
57519 fast retransmits
|
4270 forward retransmits
|
32686 retransmits in slow start
|
22184 other TCP timeouts
|
1462 SACK retransmits failed
|
7 times receiver scheduled too late for direct processing
|
5902 DSACKs sent for old packets
|
22 DSACKs sent for out of order packets
|
10292 DSACKs received
|
25 DSACKs for out of order packets received
|
222 connections reset due to unexpected data
|
1382 connections reset due to early user close
|
8 connections aborted due to timeout
|
TCPDSACKIgnoredNoUndo: 79
|
TCPSpuriousRTOs: 1
|
TCPSackShifted: 63945
|
TCPSackMerged: 45521
|
TCPSackShiftFallback: 44090
|
TCPBacklogDrop: 235
|
TCPDeferAcceptDrop: 1525080
|
IpExt:
|
InBcastPkts: 6311
|
OutBcastPkts: 243
|
InOctets: -1747362436
|
OutOctets: 562190328
|
InBcastOctets: 661583
|
OutBcastOctets: 54834
|
ifconfig -a @d1
eth0 Link encap:Ethernet HWaddr 00:25:90:61:c4:c4
|
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
|
RX packets:51621933 errors:0 dropped:1 overruns:0 frame:0
|
TX packets:183475419 errors:0 dropped:0 overruns:0 carrier:0
|
collisions:0 txqueuelen:1000
|
RX bytes:34874034607 (34.8 GB) TX bytes:143781937078 (143.7 GB)
|
Memory:faee0000-faf00000
|
|
eth1 Link encap:Ethernet HWaddr 00:25:90:61:c4:c5
|
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
|
RX packets:276753438 errors:0 dropped:188354 overruns:0 frame:0
|
TX packets:130823553 errors:0 dropped:0 overruns:0 carrier:0
|
collisions:0 txqueuelen:1000
|
RX bytes:247358369701 (247.3 GB) TX bytes:118896407143 (118.8 GB)
|
Memory:fae60000-fae80000
|
|
lan Link encap:Ethernet HWaddr 00:25:90:61:c4:c4
|
inet addr:10.3.1.12 Bcast:10.3.1.255 Mask:255.255.255.0
|
inet6 addr: fe80::225:90ff:fe61:c4c4/64 Scope:Link
|
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
|
RX packets:328375369 errors:0 dropped:188355 overruns:0 frame:0
|
TX packets:314298969 errors:0 dropped:0 overruns:0 carrier:0
|
collisions:0 txqueuelen:0
|
RX bytes:282232403974 (282.2 GB) TX bytes:262678343788 (262.6 GB)
|
|
lo Link encap:Local Loopback
|
inet addr:127.0.0.1 Mask:255.0.0.0
|
inet6 addr: ::1/128 Scope:Host
|
UP LOOPBACK RUNNING MTU:16436 Metric:1
|
RX packets:18572916 errors:0 dropped:0 overruns:0 frame:0
|
TX packets:18572916 errors:0 dropped:0 overruns:0 carrier:0
|
collisions:0 txqueuelen:0
|
RX bytes:93322878598 (93.3 GB) TX bytes:93322878598 (93.3 GB)
|
netstat -s @d2
Ip:
|
327408276 total packets received
|
3638 with invalid addresses
|
0 forwarded
|
0 incoming packets discarded
|
327404638 incoming packets delivered
|
286104959 requests sent out
|
Icmp:
|
3357 ICMP messages received
|
2222 input ICMP message failed.
|
ICMP input histogram:
|
destination unreachable: 3351
|
timeout in transit: 6
|
3364 ICMP messages sent
|
0 ICMP messages failed
|
ICMP output histogram:
|
destination unreachable: 3364
|
IcmpMsg:
|
InType3: 3351
|
InType11: 6
|
OutType3: 3364
|
Tcp:
|
74585 active connections openings
|
1620545 passive connection openings
|
3179 failed connection attempts
|
2254 connection resets received
|
1928 connections established
|
326802810 segments received
|
856773151 segments send out
|
173128 segments retransmited
|
0 bad segments received.
|
4566 resets sent
|
Udp:
|
592058 packets received
|
106 packets to unknown port received.
|
0 packet receive errors
|
592217 packets sent
|
UdpLite:
|
TcpExt:
|
1 invalid SYN cookies received
|
22 resets received for embryonic SYN_RECV sockets
|
1568636 TCP sockets finished time wait in fast timer
|
6475830 delayed acks sent
|
3882 delayed acks further delayed because of locked socket
|
Quick ack mode was activated 13948 times
|
730 times the listen queue of a socket overflowed
|
730 SYNs to LISTEN sockets dropped
|
203165314 packets directly queued to recvmsg prequeue.
|
761690672 bytes directly in process context from backlog
|
2392950864 bytes directly received in process context from prequeue
|
115348790 packet headers predicted
|
83624452 packets header predicted and directly queued to user
|
23146741 acknowledgments not containing data payload received
|
158822397 predicted acknowledgments
|
14260 times recovered from packet loss by selective acknowledgements
|
17 bad SACK blocks received
|
Detected reordering 16 times using FACK
|
Detected reordering 1220 times using SACK
|
Detected reordering 21 times using time stamp
|
799 congestion windows fully recovered without slow start
|
416 congestion windows partially recovered using Hoe heuristic
|
7318 congestion windows recovered without slow start by DSACK
|
20688 congestion windows recovered without slow start after partial ack
|
4889 TCP data loss events
|
TCPLostRetransmit: 826
|
1157 timeouts after SACK recovery
|
30 timeouts in loss state
|
68584 fast retransmits
|
26065 forward retransmits
|
39782 retransmits in slow start
|
35009 other TCP timeouts
|
1412 SACK retransmits failed
|
120 times receiver scheduled too late for direct processing
|
17867 DSACKs sent for old packets
|
26 DSACKs sent for out of order packets
|
54702 DSACKs received
|
26 DSACKs for out of order packets received
|
280 connections reset due to unexpected data
|
1391 connections reset due to early user close
|
9 connections aborted due to timeout
|
TCPDSACKIgnoredOld: 450
|
TCPDSACKIgnoredNoUndo: 225
|
TCPSackShifted: 474048
|
TCPSackMerged: 188891
|
TCPSackShiftFallback: 911357
|
TCPBacklogDrop: 4
|
TCPDeferAcceptDrop: 1302164
|
IpExt:
|
InBcastPkts: 6307
|
InOctets: -49320814
|
OutOctets: 1081770300
|
InBcastOctets: 660667
|
ifconfig -a @d2
eth0 Link encap:Ethernet HWaddr 00:25:90:60:33:44
|
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
|
RX packets:113090371 errors:2 dropped:188201 overruns:0 frame:1
|
TX packets:418567871 errors:0 dropped:0 overruns:0 carrier:0
|
collisions:0 txqueuelen:1000
|
RX bytes:148850003011 (148.8 GB) TX bytes:487299854067 (487.2 GB)
|
Memory:faee0000-faf00000
|
|
eth1 Link encap:Ethernet HWaddr 00:25:90:60:33:45
|
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
|
RX packets:321402656 errors:0 dropped:8 overruns:8 frame:0
|
TX packets:413332453 errors:0 dropped:0 overruns:0 carrier:0
|
collisions:0 txqueuelen:1000
|
RX bytes:198971160226 (198.9 GB) TX bytes:474174366599 (474.1 GB)
|
Memory:fae60000-fae80000
|
|
lan Link encap:Ethernet HWaddr 00:25:90:60:33:45
|
inet addr:10.3.1.13 Bcast:10.3.1.255 Mask:255.255.255.0
|
inet6 addr: fe80::225:90ff:fe60:3345/64 Scope:Link
|
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
|
RX packets:434493027 errors:2 dropped:188209 overruns:8 frame:1
|
TX packets:831900324 errors:0 dropped:0 overruns:0 carrier:0
|
collisions:0 txqueuelen:0
|
RX bytes:347821163237 (347.8 GB) TX bytes:961474220666 (961.4 GB)
|
|
lo Link encap:Local Loopback
|
inet addr:127.0.0.1 Mask:255.0.0.0
|
inet6 addr: ::1/128 Scope:Host
|
UP LOOPBACK RUNNING MTU:16436 Metric:1
|
RX packets:22235277 errors:0 dropped:0 overruns:0 frame:0
|
TX packets:22235277 errors:0 dropped:0 overruns:0 carrier:0
|
collisions:0 txqueuelen:0
|
RX bytes:137115007492 (137.1 GB) TX bytes:137115007492 (137.1 GB)
|
|
|
Can you send
netstat -s
|
/sbin/ifconfig -a
|
|
|
Also, we rarely use slaveOk for this database, so increased read traffic on d1 is expected. I'm not sure why the incoming traffic on d2 is larger than the outgoing, though.
|
|
Some of the network asymmetry might come from the fact that these servers also run the mongos instances used by our applications over the same link, and I'm not sure whether those are properly balanced.
|
|
You don't happen to have mongostat or other per-second monitoring, do you?
I'm asking because there is asymmetry somewhere that caused the issue.
We need to handle that better, but I also want to understand exactly what happened.
It looks like d1 was able to ping d2, but d2 was not able to ping d1.
Really want to figure out why.
|
|
Both computers are almost the same:
2xXeon 5520, 48GiB ECC DDR3, LSI 2008-based SATA3 HBA + SuperMicro L24X 12-disk backplane
OS, swap and journal drive: mdadm RAID1 over a pair of Intel 320 Series SSDs
Each shard RS data directory: mdadm RAID1 over a pair of Seagate SATA3 disks (total eight disks)
So everything is attached locally, the only place where network is involved is between RS members.
|
|
What kind of hardware is this running on?
It seems there was some sort of network asymmetry for a moment.
Is the disk local or network based?
|
|
Attaching graphs for network I/O ("lan" is a balance-alb bond of two 1Gbps interfaces that are plugged into two stacked switches).
|
|
Ok, I think we have a clue.
To confirm, do you have network monitoring/bandwidth/anything during this time period?
|
|
Yes, the oplog query was done on d2.
d1.s1:
>mongo d1.s1.fs.drive.bru:27021
|
MongoDB shell version: 2.2.1
|
connecting to: d1.s1.fs.drive.bru:27021/test
|
driveFS-1:PRIMARY> rs.conf()
|
{
|
"_id" : "driveFS-1",
|
"version" : 1,
|
"members" : [
|
{
|
"_id" : 0,
|
"host" : "d1.s1.fs.drive.bru:27021",
|
"priority" : 10
|
},
|
{
|
"_id" : 1,
|
"host" : "d2.s1.fs.drive.bru:27021"
|
},
|
{
|
"_id" : 2,
|
"host" : "a.s1.fs.drive.bru:27021",
|
"arbiterOnly" : true
|
}
|
]
|
}
|
driveFS-1:PRIMARY> rs.status()
|
{
|
"set" : "driveFS-1",
|
"date" : ISODate("2012-11-21T14:47:36Z"),
|
"myState" : 1,
|
"members" : [
|
{
|
"_id" : 0,
|
"name" : "d1.s1.fs.drive.bru:27021",
|
"health" : 1,
|
"state" : 1,
|
"stateStr" : "PRIMARY",
|
"uptime" : 66043,
|
"optime" : Timestamp(1353509256000, 9),
|
"optimeDate" : ISODate("2012-11-21T14:47:36Z"),
|
"self" : true
|
},
|
{
|
"_id" : 1,
|
"name" : "d2.s1.fs.drive.bru:27021",
|
"health" : 1,
|
"state" : 2,
|
"stateStr" : "SECONDARY",
|
"uptime" : 66041,
|
"optime" : Timestamp(1353509254000, 6),
|
"optimeDate" : ISODate("2012-11-21T14:47:34Z"),
|
"lastHeartbeat" : ISODate("2012-11-21T14:47:35Z"),
|
"pingMs" : 0
|
},
|
{
|
"_id" : 2,
|
"name" : "a.s1.fs.drive.bru:27021",
|
"health" : 1,
|
"state" : 7,
|
"stateStr" : "ARBITER",
|
"uptime" : 66041,
|
"lastHeartbeat" : ISODate("2012-11-21T14:47:35Z"),
|
"pingMs" : 0
|
}
|
],
|
"ok" : 1
|
}
|
d2.s1:
>mongo d2.s1.fs.drive.bru:27021
|
MongoDB shell version: 2.2.1
|
connecting to: d2.s1.fs.drive.bru:27021/test
|
driveFS-1:SECONDARY> rs.conf()
|
{
|
"_id" : "driveFS-1",
|
"version" : 1,
|
"members" : [
|
{
|
"_id" : 0,
|
"host" : "d1.s1.fs.drive.bru:27021"
|
"priority" : 10
|
},
|
{
|
"_id" : 1,
|
"host" : "d2.s1.fs.drive.bru:27021"
|
},
|
{
|
"_id" : 2,
|
"host" : "a.s1.fs.drive.bru:27021",
|
"arbiterOnly" : true
|
}
|
]
|
}
|
driveFS-1:SECONDARY> rs.status()
|
{
|
"set" : "driveFS-1",
|
"date" : ISODate("2012-11-21T14:47:57Z"),
|
"myState" : 2,
|
"syncingTo" : "d1.s1.fs.drive.bru:27021",
|
"members" : [
|
{
|
"_id" : 0,
|
"name" : "d1.s1.fs.drive.bru:27021",
|
"health" : 1,
|
"state" : 1,
|
"stateStr" : "PRIMARY",
|
"uptime" : 66063,
|
"optime" : Timestamp(1353509276000, 13),
|
"optimeDate" : ISODate("2012-11-21T14:47:56Z"),
|
"lastHeartbeat" : ISODate("2012-11-21T14:47:56Z"),
|
"pingMs" : 0
|
},
|
{
|
"_id" : 1,
|
"name" : "d2.s1.fs.drive.bru:27021",
|
"health" : 1,
|
"state" : 2,
|
"stateStr" : "SECONDARY",
|
"uptime" : 66069,
|
"optime" : Timestamp(1353509274000, 5),
|
"optimeDate" : ISODate("2012-11-21T14:47:54Z"),
|
"self" : true
|
},
|
{
|
"_id" : 2,
|
"name" : "a.s1.fs.drive.bru:27021",
|
"health" : 1,
|
"state" : 7,
|
"stateStr" : "ARBITER",
|
"uptime" : 66065,
|
"lastHeartbeat" : ISODate("2012-11-21T14:47:56Z"),
|
"pingMs" : 1
|
}
|
],
|
"ok" : 1
|
}
|
|
|
Can you send rs.conf() and rs.status() from every node?
|
|
The oplog query you did was on d2?
|
|
Attaching log from d2.s1 (the secondary of the 1st shard) that logged "should cause rollback".
|
|
Well, I guess that the absence of the rollback is the problem itself. The "primary.log" is from d1.s1 (that was primary at the start of the problems).
|
|
What node is the log from?
Not seeing any rollback at all in there.
What node did you check the oplog for?
|
|
Some network topology notes to prevent possible confusion when analyzing logs where host names are not available (socket errors, etc.) – there are three physical servers:
10.3.1.10: c1 (SSD), a.sX (ports 27021-27024)
10.3.1.12: c2 (SSD), d1.sX (ports 27021-27024, dedicated RAID1 array for each shard/RS member, a total of 4), b1
10.3.1.13: c3 (SSD), d2.sX (ports 27021-27024, dedicated RAID1 array for each shard/RS member, a total of 4), b2
|
|
Attaching the log from the primary for two hours around the problem.
|
|
Can you attach the logs from the primary where the rollback should have happened?
|
|
"I quickly ruled out #3" should read "I quickly ruled out #1"
|