[SERVER-7732] Failover + [broken?] rollback causes data loss Created: 20/Nov/12  Updated: 11/Jul/16  Resolved: 18/Jan/13

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.0.7, 2.2.1
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Aristarkh Zagorodnikov Assignee: Kristina Chodorow (Inactive)
Resolution: Done Votes: 1
Labels: replay, replicaset, rollback, sharding
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux x86-64 (Ubuntu 12.04.1)


Attachments: PNG File d1-iops.png     PNG File d1-rate.png     PNG File d1.png     File d1.s3.log.gz     PNG File d2-iops.png     PNG File d2-rate.png     PNG File d2.png     File d2.s1.log.gz     File d2.s3.log.gz     File d3.s3.log.gz     Text File mongodb-driveFS-3.log     File primary.log.gz    
Issue Links:
Related
is related to SERVER-7759 Don't purge bgbuffer after SECONDARY-... Closed
Operating System: ALL
Participants:

 Description   

Recently, we had some issues with replica set failover, due to the server not being able to cope with the high I/O load caused by spiky migrations (see SERVER-7686 for a more detailed description of the original problem) – mongod intermittently stops responding to network connections because it's busy doing I/O, so the other RS members believe it's dead.
While this problem by itself is bad enough, the failover result we got was, well, very bad – let me describe it in detail.

The cluster in question hosts a MongoDB database containing GridFS data, which currently has about 30 million files occupying 1.5TiB of data. It has two mongos routers (b1, b2), three config servers (c1, c2, c3) and four shards that are each comprised of two data servers (d1.sN, d2.sN) and an arbiter (a.sN), where N is the shard number (1-4).
Writes to the database are done via the mongos routers (using the C# driver, but that is irrelevant since we already ruled out that problem), with SafeMode.true but without w:majority (yes, we know that w:majority is required for consistent failover, but we have not yet been able to implement it).
When the problem hit us, I first thought that everything went more or less okay – after I got a notification from our Zabbix monitoring, I went and checked the "rollback" folders on all servers and found out that there were none. Then I checked the logs and saw that there were repeated "failures" in quick succession, so I thought that the client code wasn't able to pick up the new master, hence the absence of the rollback. A few days later I was checking the logs of the other server (the one which only reads from the database) and found out that some files could not be read from GridFS. Surprisingly, the fs.files document was intact, but there were no associated documents in the fs.chunks collection. Then I did a quick check of the number of files and chunks, expecting the count of GridFS chunks to be somewhat larger than the number of files, and to my horror discovered that we're missing at least 500,000 GridFS chunks for non-zero-length files.
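A quick consistency check of this kind can be sketched in the mongo shell roughly as follows (illustrative only; the database name "a" is taken from the oplog entries quoted below, and fs.files/fs.chunks are the standard GridFS collections):

// run via a mongos so the whole sharded collection is visible
var gridfs = db.getSiblingDB("a")               // GridFS database, per the oplog ns "a.fs.chunks"
gridfs.fs.files.count()                          // total number of GridFS file documents
gridfs.fs.files.count({ length: { $gt: 0 } })    // non-empty files
// every non-empty file has at least one chunk, so for healthy data this
// count should be at least as large as the non-empty file count
gridfs.fs.chunks.count()
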
From this point on I had three primary suspects:
1. database corruption due to external reasons
2. mongos chunk migration gone awry (either failed migration mistakenly recorded as successful or inconsistent shard min/max values in chunk map)
3. ?
I quickly ruled out #3, because there were no power failures, disks are fine (checked SMART attributes, did a patrol read and a selftest), syslog/dmesg are clean, etc.
Then I checked chunk migration. While it was the original cause of the disk overload, the missing files were not in the ranges of chunks that were transferred (I checked the primary logs along with the config.changelog collection of the config database).
Then I decided to rule out user error, so I checked the oplog for the file in question. Since I knew the fs.files id (NumberLong("1113206594855763968")), I executed the following against every shard replica set (connected directly to the "local" databases, not via mongos):

db.oplog.rs.find({"o.files_id":NumberLong("1113206594855763968")}).sort({$natural:-1}).limit(1)<code>

which found exactly one "insert" operation on exactly one replica set:

{ "ts" : Timestamp(1352980035000, 28), "h" : NumberLong("2879588865588641652"), "v" : 2, "op" : "i", "ns" : "a.fs.chunks", "o" : { "_id" : ObjectId("50a4d6366b8dda1458505dbb"), "files_id" : NumberLong("1113206594855763968"), "n" : 0, "data" : BinData(0,"...")

After this, I knew the _id of the fs.chunks document, so I checked whether it was deleted – partly to ensure it wasn't chunk migration, partly to completely rule out errors in the client code (which was working fine) – with:

db.oplog.rs.find({"o._id":ObjectId("50a4d6366b8dda1458505dbb")}).sort({$natural:-1}).limit(1)

Unfortunately, it returned the same "insert" operation, so, as far as I understand how the oplog is structured, the document was inserted and never deleted (which completely rules out a migration failure, since I checked each RS directly), yet it is not present in the database.
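For completeness, the direct presence check against a data member of that shard looks roughly like this (a sketch; the namespace a.fs.chunks and the _id come from the oplog entry above):

// connected directly to the replica set member, not via mongos
db.getSiblingDB("a").fs.chunks.findOne({ _id: ObjectId("50a4d6366b8dda1458505dbb") })
// per the findings above, this returns nothing on any member, even though
// the oplog records the insert and no subsequent delete
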
Reading the logs was my next step. The arbiter didn't have anything interesting aside from the fact that servers were going "up" and "down" (I would like to remind the reader that the mongod processes were running fine, just blocked on I/O so heavily that they could not accept connections in a timely fashion). The primary had almost the same logs, with the addition of migration-related info (moveChunk started, committing, etc.):

Thu Nov 15 15:26:53 [rsHealthPoll] DBClientCursor::init call() failed
Thu Nov 15 15:26:53 [rsHealthPoll] replSet info d1.s1.fs.drive.bru:27021 is down (or slow to respond): DBClientBase::findN: transpor
Thu Nov 15 15:26:53 [rsHealthPoll] replSet member d1.s1.fs.drive.bru:27021 is now in state DOWN
Thu Nov 15 15:26:53 [rsMgr] not electing self, we are not freshest
Thu Nov 15 15:26:59 [rsMgr] not electing self, we are not freshest
Thu Nov 15 15:27:05 [rsHealthPoll] DBClientCursor::init call() failed
Thu Nov 15 15:27:05 [rsMgr] replSet info electSelf 1
Thu Nov 15 15:27:17 [rsHealthPoll] DBClientCursor::init call() failed
Thu Nov 15 15:27:25 [rsMgr] replSet PRIMARY
Thu Nov 15 15:27:26 [rsBackgroundSync] replset 1134533 ops were not applied from buffer, this should cause a rollback on the former
Thu Nov 15 15:27:27 [rsHealthPoll] replSet member d1.s1.fs.drive.bru:27021 is up
Thu Nov 15 15:27:27 [rsHealthPoll] replSet member d1.s1.fs.drive.bru:27021 is now in state PRIMARY
Thu Nov 15 15:27:27 [rsMgr] replSet relinquishing primary state
Thu Nov 15 15:27:27 [rsMgr] replSet SECONDARY
Thu Nov 15 15:27:27 [rsMgr] replSet closing client sockets after relinquishing primary
Thu Nov 15 15:27:28 [rsBackgroundSync] replSet syncing to: d1.s1.fs.drive.bru:27021
Thu Nov 15 15:27:28 [rsHealthPoll] replSet info a.s1.fs.drive.bru:27021 is down (or slow to respond): socket exception [SEND_ERROR]
Thu Nov 15 15:27:28 [rsHealthPoll] replSet member a.s1.fs.drive.bru:27021 is now in state DOWN
Thu Nov 15 15:27:29 [rsHealthPoll] replSet info d1.s1.fs.drive.bru:27021 is down (or slow to respond): socket exception [SEND_ERROR]
Thu Nov 15 15:27:29 [rsHealthPoll] replSet member d1.s1.fs.drive.bru:27021 is now in state DOWN
Thu Nov 15 15:27:29 [rsMgr] replSet can't see a majority, will not try to elect self
Thu Nov 15 15:27:31 [rsHealthPoll] replSet member d1.s1.fs.drive.bru:27021 is up
Thu Nov 15 15:27:31 [rsHealthPoll] replSet member d1.s1.fs.drive.bru:27021 is now in state PRIMARY
Thu Nov 15 15:27:31 [rsHealthPoll] replSet member a.s1.fs.drive.bru:27021 is up
Thu Nov 15 15:27:31 [rsHealthPoll] replSet member a.s1.fs.drive.bru:27021 is now in state ARBITER
Thu Nov 15 15:27:32 [rsSyncNotifier] Socket say send() errno:9 Bad file descriptor 10.3.1.12:27021
Thu Nov 15 15:27:32 [rsSyncNotifier] replset tracking exception: exception: 9001 socket exception [SEND_ERROR] for 10.3.1.12:27021

On the other hand, the secondary (which of course became primary at some point and then went back) had not only similar lines about lost connections and election-related messages, but also two very interesting lines:

Thu Nov 15 15:27:26 [rsBackgroundSync] replset 1134533 ops were not applied from buffer, this should cause a rollback on the former
...
Thu Nov 15 15:41:15 [rsBackgroundSync] replset 7806611 ops were not applied from buffer, this should cause a rollback on the former

So it appears that there should have been a rollback, but there is no "rollback" directory (permissions are fine and the logs are silent on both servers), and there are no rollback-related records on the primary, so it appears that the data just vanished.

Also, analyzing historical monitoring graphs and matching missing fs.chunks with upload timestamps, we found that missing data always correlates with failovers caused by I/O overload triggered by shard chunk migration, so this problem existed in 2.0.7 too.

Now, let me get to the point.
I understand that w:majority is a must for robust operation where one cannot afford rollbacks and the manual reconciliation usually associated with conflicts that might be impossible to resolve correctly. Still, since this database only gets inserts and (occasional) deletes, I presumed that even if a rollback happened, I would just run a mongorestore on it as usual. Also, the existing documentation points out that a rollback does not lead to data loss by itself (see http://www.mongodb.org/display/DOCS/Replica+Sets+-+Rollbacks), only to manual reconciliation (which in our case would just be adding the rolled-back data back to the database). Unfortunately, in our case it appears that the rollback failed to occur properly for an unknown reason, leading to a large amount of lost data (also, I'm not 100% sure that w:majority would be 100% safe in case this is a bug).
So I guess it's either a misunderstanding on our side of how MongoDB works, a documentation deficiency, or a very nasty bug.
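For reference, a minimal sketch of what a majority-acknowledged write looks like from the 2.2-era mongo shell (the C# driver exposes an equivalent option; chunkDoc and the wtimeout value are placeholders):

// in 2.2 the write and its acknowledgement are separate steps
db.fs.chunks.insert(chunkDoc)
// block until a majority of the replica set has the write (or 10s pass);
// without this, a write acknowledged only by the primary can disappear if
// that primary's ops are never replicated before a failover
db.runCommand({ getLastError: 1, w: "majority", wtimeout: 10000 })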

Of course, my analysis might be completely wrong. I still have the logs (I can back them up before they get rotated out – tell me if you would like to have a look) and can make a dump of the mongos config servers. Dumping the database itself or its oplog would be problematic, though: while it doesn't contain private data, it's too large to move over anything except a LAN.

As for recovery – while I know what steps I should take (manually replaying the oplog with some kind of pre-scan optimization, like not trying to replay inserts that get deleted later), I think it's infeasible, since our current oplog is about 200GB per shard RS member and matching about nine million oplog records would overload the server. Do you have any advice?



 Comments   
Comment by Kristina Chodorow (Inactive) [ 11/Dec/12 ]

No problem! Just let us know if you need anything else.

Comment by Aristarkh Zagorodnikov [ 11/Dec/12 ]

Many thanks for the detailed explanation! I'll review our older logs for the clock skew messages just to check. Also, it appears that we did in fact have some network issues due to a faulty second switch, so SERVER-1929 would be useful should similar problems arise.

Comment by Kristina Chodorow (Inactive) [ 11/Dec/12 ]

Sorry for the delay.

There are two scenarios that pop up in your logs; here's what causes each of them and what the implications are:

  1. When a secondary becomes a primary (d2, say), it clears any ops it got from the former primary that have not yet been applied. This triggers the "should cause rollback" message. There are two possibilities at that point, either:
    1. d2 will do some of its own writes, in which case they'll be rolled back when it realizes d1 is still primary, or
    2. d2 will step down before doing any writes, in which case the cleared ops will be refetched from the primary. Inefficient, but not dangerous.
  2. If the message does not appear when d2 becomes primary, and d2 does some writes and then steps down again, d2 now has some writes that don't exist anywhere else (which is probably not what you want). The indication that this has happened is that you'll see warnings about "clock skew" in the logs, because d2 is applying ops fetched from d1 after ops from its own stint as primary, so the ops look like they're going "back in time." So you may want to be on the lookout for the clock-skew warning after a flap: it means the member got some writes that may not be present elsewhere.

The clock skew errors look like this:

Thu Nov 15 15:47:28 [rsSync] replSet error possible failover clock skew issue? 50a4d64f:4

You have one clock skew message in the November 27th logs: around 18:00:18 in d2 & d3's log, from a quick failover when d3 became primary. We can help you come up with a way of finding unreplicated ops when you get a clock skew error, if you'd like.
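(Not an official procedure, just a rough sketch of one way such unreplicated ops could be hunted down, assuming a direct shell connection to the member that logged the warning; the primary host and the scan window are placeholders:)

// run in a mongo shell connected directly to the member that flapped
var primary = new Mongo("d1.s3.fs.drive.bru:27023").getDB("local");   // current primary, host illustrative
// walk this member's own oplog backwards over a window covering the failover
// and report any op the primary does not have (matching ts and hash)
db.getSiblingDB("local").oplog.rs.find().sort({ $natural: -1 }).limit(100000).forEach(function (op) {
    if (primary.oplog.rs.findOne({ ts: op.ts, h: op.h }) === null) {
        print("op not on primary: " + tojson({ ts: op.ts, h: op.h, ns: op.ns, op: op.op }));
    }
});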

SERVER-7759 should fix both of these cases and I hope to have it done in the next couple of days (for 2.4.0). SERVER-1929 should also help by limiting failovers in the future.

Comment by Aristarkh Zagorodnikov [ 30/Nov/12 ]

Sorry for pestering you, did you have time to check the logs?

Comment by Aristarkh Zagorodnikov [ 28/Nov/12 ]

Attaching logs for November 27.

Comment by Aristarkh Zagorodnikov [ 28/Nov/12 ]

Sure, let me package them. I'll send you uncut versions for the entire day so I don't accidentally delete something important.

Comment by Eliot Horowitz (Inactive) [ 28/Nov/12 ]

Sorry for being unclear.
The "should be a rollback" message without a rollback does NOT by itself mean there is any loss or inconsistency.
Can you send the logs for the last segment so we can verify?

Comment by Aristarkh Zagorodnikov [ 28/Nov/12 ]

By the way, does this mean that the secondaries and the primary not only lost some writes completely but also have different data, so we need to resync the secondaries from scratch to get a consistent replica set?

Comment by Aristarkh Zagorodnikov [ 28/Nov/12 ]

But there was no rollback on d2 (the former primary) at all.
Also, I grepped for "rollback" in the logs; this line is logged every time after a failover, and in most cases there is not only no rollback on the former primary, but no rollback on any member of the set.

Comment by Kristina Chodorow (Inactive) [ 28/Nov/12 ]

Sorry for the confusion, I got it backwards. That line is okay.

On the d1, d2, d3 clarification: if d2 was primary while d1 was a secondary, then yes. It's referring to "the other primary" in the system.

Comment by Aristarkh Zagorodnikov [ 28/Nov/12 ]

I would also ask for some clarification.
There are three nodes: d1, d2, d3:
d1[P] d2[S] d3[S]
There is a failover on d1:
d1[X] d2[P] d3[S]
Then, d1 returns to normal, performing a rollback before becoming primary:
d1[R] d2[P] d3[S]
...
d1[P] d2[S] d3[S]
And then d1 (after becoming primary) logs the "... ops were not applied from buffer..."
Where should the mentioned "rollback on the former primary" occur in this case? On d2?

Comment by Aristarkh Zagorodnikov [ 28/Nov/12 ]

What would be the best course of action for us to prevent data loss on failover? Rolling back to 2.0? Also, how can we help diagnose the issue?

Comment by Aristarkh Zagorodnikov [ 28/Nov/12 ]

For example, in the latest logs I posted there was no rollback on the secondary.

Comment by Aristarkh Zagorodnikov [ 28/Nov/12 ]

Well, it appears that we have this problem often enough.

Comment by Kristina Chodorow (Inactive) [ 28/Nov/12 ]

Yes: the "should cause a rollback" line is a problem. If you see the rollback line and there was no rollback, then those operations were skipped on the secondary.

Comment by Aristarkh Zagorodnikov [ 27/Nov/12 ]

It's the "should cause a rollback" line when there is no rollback on the other servers that worries me. We had the same line in the original case and lost some data. Unfortunately, I don't know enough of the internals to decide whether it's an indicator of a real problem.

Comment by Aristarkh Zagorodnikov [ 27/Nov/12 ]

d1.s3 doesn't have anything important except the "should cause a rollback on the former primary" line. Attaching a part of the log from startup to normal operation (there was a rollback, but as far as I understand, "former primary" refers to d3.s3 in this case).

Comment by Eliot Horowitz (Inactive) [ 27/Nov/12 ]

Looks normal, but can you send some more log from d1.s3 to confirm?

Comment by Aristarkh Zagorodnikov [ 27/Nov/12 ]

We had a failover today (due to a network reconfiguration). After the primary returned to its primary status, it wrote the following to the log.
d1.s3:

15:06:00 [rsHealthPoll] replSet member d3.s3.fs.drive.bru:27023 is now in state SECONDARY
15:06:00 [rsMgr] not electing self, d3.s3.fs.drive.bru:27023 would veto
15:06:05 [rsMgr] replSet info electSelf 0
15:06:05 [rsMgr] replSet PRIMARY
15:06:08 [rsBackgroundSync] replset 6962 ops were not applied from buffer, this should cause a rollback on the former primary

There is no sign of rollback on the secondaries though.
d2.s3:

15:05:59 [rsSyncNotifier] replset tracking exception: exception: 10278 dbclient error communicating with server: d3.s3.fs.drive.bru:27023
15:05:59 [rsBackgroundSync] replSet db exception in producer: 10278 dbclient error communicating with server: d3.s3.fs.drive.bru:27023
15:06:00 [conn636] d3.s3.fs.drive.bru:27023 is trying to elect itself but d3.s3.fs.drive.bru:27023 is already primary and more up-to-date
15:06:01 [rsHealthPoll] replSet member d3.s3.fs.drive.bru:27023 is now in state SECONDARY
15:06:01 [rsMgr] not electing self, d1.s3.fs.drive.bru:27023 would veto
15:06:06 [conn635] replSet info voting yea for d1.s3.fs.drive.bru:27023 (0)
15:06:07 [rsMgr] not electing self, d3.s3.fs.drive.bru:27023 would veto
15:06:07 [rsHealthPoll] replSet member d1.s3.fs.drive.bru:27023 is now in state PRIMARY
15:06:09 [rsBackgroundSync] replSet syncing to: d1.s3.fs.drive.bru:27023
15:06:09 [rsSyncNotifier] replset setting oplog notifier to d1.s3.fs.drive.bru:27023

d3.s3:

15:05:59 [conn1678] replSet info stepping down as primary secs=1
15:05:59 [conn1678] replSet relinquishing primary state
15:05:59 [conn1678] replSet SECONDARY
15:05:59 [conn1678] replSet closing client sockets after relinquishing primary
15:06:00 [rsHealthPoll] replSet info d1.s3.fs.drive.bru:27023 is down (or slow to respond): socket exception [SEND_ERROR] for 10.3.1.12:27023
15:06:00 [rsHealthPoll] replSet member d1.s3.fs.drive.bru:27023 is now in state DOWN
15:06:00 [rsMgr] replSet info electSelf 3
15:06:00 [rsMgr] replSet couldn't elect self, only received -9999 votes
15:06:02 [rsHealthPoll] replSet member d1.s3.fs.drive.bru:27023 is up
15:06:02 [rsHealthPoll] replSet member d1.s3.fs.drive.bru:27023 is now in state SECONDARY
15:06:02 [rsMgr] not electing self, d2.s3.fs.drive.bru:27023 would veto
15:06:06 [conn1677] replSet info voting yea for d1.s3.fs.drive.bru:27023 (0)
15:06:08 [rsHealthPoll] replSet member d1.s3.fs.drive.bru:27023 is now in state PRIMARY
15:06:08 [rsBackgroundSync] replSet syncing to: d1.s3.fs.drive.bru:27023
15:06:08 [rsSyncNotifier] replset setting oplog notifier to d1.s3.fs.drive.bru:27023

Is this normal, or are we losing data again?

Comment by Aristarkh Zagorodnikov [ 22/Nov/12 ]

Yes, it looks like syncdelay would limit the write problem. But this is more of a solution for SERVER-7686 (which we already tried to fix by enabling write caching, moving to the deadline scheduler, and enabling _secondaryThrottle – I just can't tell whether that helped, since we haven't had any fs.files migrations since the last failure).
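
For reference, in 2.2 the balancer-wide _secondaryThrottle switch lives in the config database and is enabled roughly like this (run from a mongo shell connected to a mongos):

// makes chunk migrations wait for secondary acknowledgement of each document,
// reducing the I/O spikes caused by migrations
var cfg = db.getSiblingDB("config")
cfg.settings.update({ _id: "balancer" }, { $set: { _secondaryThrottle: true } }, true)
cfg.settings.find({ _id: "balancer" })          // verify the setting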

Comment by Eliot Horowitz (Inactive) [ 21/Nov/12 ]

One thing to try would be to lower syncdelay a lot (from 60 -> 5) and see if that smooths things out.
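
(For reference, a sketch of how syncdelay can be lowered, assuming the 2.2 startup option and its setParameter support:)

// at startup: mongod --syncdelay 5   (or "syncdelay = 5" in the config file)
// at runtime, from a mongo shell connected to the mongod:
db.adminCommand({ setParameter: 1, syncdelay: 5 })
db.adminCommand({ getParameter: 1, syncdelay: 1 })   // check the current value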

Comment by Aristarkh Zagorodnikov [ 21/Nov/12 ]

Have to go now, it's late here. Thanks for the attention to this case, I look forward to seeing you at MongoSV.

Comment by Aristarkh Zagorodnikov [ 21/Nov/12 ]

My theory is close to this: http://serverfault.com/questions/126413/limit-linux-background-flush-dirty-pages

Comment by Aristarkh Zagorodnikov [ 21/Nov/12 ]

I also checked the CPU graphs: iowait peaks at ~30%, user at 3-5%, and all the others (system, irq, etc.) are below 1% for both servers.

Comment by Aristarkh Zagorodnikov [ 21/Nov/12 ]

Also, please note that 150-170 write IOPS is actually pretty much the sustained maximum for this class of drives with write caching disabled.

Comment by Aristarkh Zagorodnikov [ 21/Nov/12 ]

The stats for the physical drives that comprise this RAID array are similar to those for the array itself.

Comment by Aristarkh Zagorodnikov [ 21/Nov/12 ]

Attaching IOPS and I/O rate graphs. It looks like MongoDB isn't the only application that stops responding – there are gaps in the graphs, which means the Zabbix agent was also unreachable at the time.

Comment by Aristarkh Zagorodnikov [ 21/Nov/12 ]

We don't use authentication at all (these servers are behind NAT/firewall and can be only accessed through VPN).

Comment by Aristarkh Zagorodnikov [ 21/Nov/12 ]

As a side note, I want to tell you that I wrote a small tool that walks the oplog backwards, checks every insert against the database, and collects deletes to prevent deleted items from being re-added. If it finds an insert that is not in the database and had no subsequent delete, it replays it through mongos. While the oplog is only 20+ days long (and we have had these data-losing failovers for some time now), it would allow us to recover at least the data lost during the last month. It's halfway through the oplog on three of the four shards (the fourth was added recently, so its oplog is shorter).
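
A rough mongo shell sketch of the approach described above (not the actual tool; the member and mongos addresses are placeholders, and keeping the whole delete set in memory is a simplification):

// walk the shard member's oplog backwards; remember deletes so that documents
// deleted later are not re-added, and re-insert via mongos any insert whose
// document is no longer present in the database
var memberConn = new Mongo("d2.s1.fs.drive.bru:27021");      // shard RS member, host illustrative
memberConn.setSlaveOk();                                      // allow reading if it is a secondary
var oplog  = memberConn.getDB("local").oplog.rs;
var router = new Mongo("b1.fs.drive.bru:27017").getDB("a");   // mongos address illustrative
var deleted = {};                                             // _ids seen in later delete ops
// newest-to-oldest walk: a delete is always seen before the insert it undoes,
// so anything recorded in "deleted" must not be re-added
oplog.find({ ns: "a.fs.chunks" }).sort({ $natural: -1 }).forEach(function (op) {
    if (op.op === "d") {
        deleted[op.o._id.str] = true;
    } else if (op.op === "i" && !deleted[op.o._id.str]) {
        if (router.fs.chunks.findOne({ _id: op.o._id }) === null) {
            router.fs.chunks.insert(op.o);                    // restore the lost chunk via mongos
        }
    }
});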

Comment by Eliot Horowitz (Inactive) [ 21/Nov/12 ]

It's very weird that there are reachability errors correlated with disk saturation.
Are you using auth? (Just trying to collect data and make connections.)
Do you have disk monitoring you can share?

Comment by Aristarkh Zagorodnikov [ 21/Nov/12 ]

kern.log and dmesg for d1 and d2 at this time are completely empty.
I checked syslog on both machines; there are only the usual lines about cron, puppet, etc. (no crontabs or backup scripts are scheduled to do any work at this time, and the puppet configuration was unchanged):

Nov 15 15:00:37 pisces crontab[16705]: (root) LIST (root)
Nov 15 15:00:37 pisces crontab[16707]: (root) LIST (mongodb)
Nov 15 15:00:41 pisces puppet-agent[19846]: Finished catalog run in 4.95 seconds
Nov 15 15:10:48 pisces crontab[19311]: (root) LIST (root)
Nov 15 15:10:48 pisces crontab[19314]: (root) LIST (mongodb)
Nov 15 15:10:52 pisces puppet-agent[19846]: Finished catalog run in 5.58 seconds
Nov 15 15:17:01 pisces CRON[22970]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Nov 15 15:20:59 pisces crontab[28542]: (root) LIST (root)
Nov 15 15:20:59 pisces crontab[28544]: (root) LIST (mongodb)
Nov 15 15:21:03 pisces puppet-agent[19846]: Finished catalog run in 4.72 seconds
Nov 15 15:31:09 pisces crontab[10930]: (root) LIST (root)
Nov 15 15:31:09 pisces crontab[10932]: (root) LIST (mongodb)
Nov 15 15:31:13 pisces puppet-agent[19846]: Finished catalog run in 4.33 seconds
Nov 15 15:41:20 pisces crontab[25912]: (root) LIST (root)
Nov 15 15:41:20 pisces crontab[25914]: (root) LIST (mongodb)
Nov 15 15:41:24 pisces puppet-agent[19846]: Finished catalog run in 4.68 seconds
Nov 15 15:51:32 pisces crontab[10896]: (root) LIST (root)
Nov 15 15:51:32 pisces crontab[10898]: (root) LIST (mongodb)
Nov 15 15:51:36 pisces puppet-agent[19846]: Finished catalog run in 5.66 seconds
Nov 15 16:01:43 pisces crontab[24907]: (root) LIST (root)
Nov 15 16:01:43 pisces crontab[24909]: (root) LIST (mongodb)
Nov 15 16:01:47 pisces puppet-agent[19846]: Finished catalog run in 5.19 seconds
Nov 15 16:11:56 pisces crontab[7497]: (root) LIST (root)
Nov 15 16:11:56 pisces crontab[7499]: (root) LIST (mongodb)
Nov 15 16:12:01 pisces puppet-agent[19846]: Finished catalog run in 5.58 seconds
Nov 15 16:17:01 pisces CRON[14739]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Nov 15 16:22:09 pisces crontab[22443]: (root) LIST (root)
Nov 15 16:22:09 pisces crontab[22445]: (root) LIST (mongodb)
Nov 15 16:22:13 pisces puppet-agent[19846]: Finished catalog run in 4.67 seconds
Nov 15 16:32:20 pisces crontab[1758]: (root) LIST (root)
Nov 15 16:32:20 pisces crontab[1760]: (root) LIST (mongodb)
Nov 15 16:32:24 pisces puppet-agent[19846]: Finished catalog run in 4.75 seconds
Nov 15 16:42:30 pisces crontab[8749]: (root) LIST (root)
Nov 15 16:42:30 pisces crontab[8751]: (root) LIST (mongodb)
Nov 15 16:42:35 pisces puppet-agent[19846]: Finished catalog run in 5.17 seconds
Nov 15 16:52:42 pisces crontab[11379]: (root) LIST (root)
Nov 15 16:52:42 pisces crontab[11381]: (root) LIST (mongodb)
Nov 15 16:52:46 pisces puppet-agent[19846]: Finished catalog run in 5.36 seconds

Comment by Aristarkh Zagorodnikov [ 21/Nov/12 ]

Also, the reachability problems usually manifest themselves when disk I/O is high (I guess the VMM just can't cope with a large flush queue, so everything stalls).

Comment by Eliot Horowitz (Inactive) [ 21/Nov/12 ]

Yeah, bonding can do that.
Nothing in system logs around that time?

Comment by Aristarkh Zagorodnikov [ 21/Nov/12 ]

I see there are dropped packets, but I guess these come from the balance-alb algorithm since it rewrites MAC addresses on the fly to allow transparent port balancing.

Comment by Aristarkh Zagorodnikov [ 21/Nov/12 ]

No, unfortunately we monitor only ops/sec and RS lag.
Please take note that these counters might not reflect the problem, because both servers were restarted about one day ago (not simultaneously) as part of my diagnostic attempts (I checked HBA disk health information along with ME event logs, etc.).

netstat -s @d1

Ip:
    201165938 total packets received
    3639 with invalid addresses
    0 forwarded
    0 incoming packets discarded
    201162299 incoming packets delivered
    190485556 requests sent out
Icmp:
    3356 ICMP messages received
    2224 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 3347
        timeout in transit: 6
        echo requests: 3
    3371 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 3368
        echo replies: 3
IcmpMsg:
        InType3: 3347
        InType8: 3
        InType11: 6
        OutType0: 3
        OutType3: 3368
Tcp:
    74233 active connections openings
    1831918 passive connection openings
    1252 failed connection attempts
    2157 connection resets received
    1972 connections established
    200546175 segments received
    334410694 segments send out
    119987 segments retransmited
    0 bad segments received.
    2882 resets sent
Udp:
    619143 packets received
    113 packets to unknown port received.
    0 packet receive errors
    607001 packets sent
UdpLite:
TcpExt:
    36 resets received for embryonic SYN_RECV sockets
    1797144 TCP sockets finished time wait in fast timer
    3351962 delayed acks sent
    2275 delayed acks further delayed because of locked socket
    Quick ack mode was activated 5470 times
    790 times the listen queue of a socket overflowed
    790 SYNs to LISTEN sockets dropped
    165252261 packets directly queued to recvmsg prequeue.
    3540748772 bytes directly in process context from backlog
    2353661999 bytes directly received in process context from prequeue
    119142543 packet headers predicted
    20271994 packets header predicted and directly queued to user
    15384610 acknowledgments not containing data payload received
    126389342 predicted acknowledgments
    7894 times recovered from packet loss by selective acknowledgements
    84 bad SACK blocks received
    Detected reordering 8 times using FACK
    Detected reordering 18 times using SACK
    Detected reordering 21 times using time stamp
    22 congestion windows fully recovered without slow start
    49 congestion windows partially recovered using Hoe heuristic
    887 congestion windows recovered without slow start by DSACK
    9186 congestion windows recovered without slow start after partial ack
    7653 TCP data loss events
    TCPLostRetransmit: 418
    656 timeouts after SACK recovery
    16 timeouts in loss state
    57519 fast retransmits
    4270 forward retransmits
    32686 retransmits in slow start
    22184 other TCP timeouts
    1462 SACK retransmits failed
    7 times receiver scheduled too late for direct processing
    5902 DSACKs sent for old packets
    22 DSACKs sent for out of order packets
    10292 DSACKs received
    25 DSACKs for out of order packets received
    222 connections reset due to unexpected data
    1382 connections reset due to early user close
    8 connections aborted due to timeout
    TCPDSACKIgnoredNoUndo: 79
    TCPSpuriousRTOs: 1
    TCPSackShifted: 63945
    TCPSackMerged: 45521
    TCPSackShiftFallback: 44090
    TCPBacklogDrop: 235
    TCPDeferAcceptDrop: 1525080
IpExt:
    InBcastPkts: 6311
    OutBcastPkts: 243
    InOctets: -1747362436
    OutOctets: 562190328
    InBcastOctets: 661583
    OutBcastOctets: 54834

ifconfig -a @d1

eth0      Link encap:Ethernet  HWaddr 00:25:90:61:c4:c4  
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:51621933 errors:0 dropped:1 overruns:0 frame:0
          TX packets:183475419 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:34874034607 (34.8 GB)  TX bytes:143781937078 (143.7 GB)
          Memory:faee0000-faf00000 
 
eth1      Link encap:Ethernet  HWaddr 00:25:90:61:c4:c5  
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:276753438 errors:0 dropped:188354 overruns:0 frame:0
          TX packets:130823553 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:247358369701 (247.3 GB)  TX bytes:118896407143 (118.8 GB)
          Memory:fae60000-fae80000 
 
lan       Link encap:Ethernet  HWaddr 00:25:90:61:c4:c4  
          inet addr:10.3.1.12  Bcast:10.3.1.255  Mask:255.255.255.0
          inet6 addr: fe80::225:90ff:fe61:c4c4/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:328375369 errors:0 dropped:188355 overruns:0 frame:0
          TX packets:314298969 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:282232403974 (282.2 GB)  TX bytes:262678343788 (262.6 GB)
 
lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:18572916 errors:0 dropped:0 overruns:0 frame:0
          TX packets:18572916 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:93322878598 (93.3 GB)  TX bytes:93322878598 (93.3 GB)

netstat -s @d2

Ip:
    327408276 total packets received
    3638 with invalid addresses
    0 forwarded
    0 incoming packets discarded
    327404638 incoming packets delivered
    286104959 requests sent out
Icmp:
    3357 ICMP messages received
    2222 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 3351
        timeout in transit: 6
    3364 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 3364
IcmpMsg:
        InType3: 3351
        InType11: 6
        OutType3: 3364
Tcp:
    74585 active connections openings
    1620545 passive connection openings
    3179 failed connection attempts
    2254 connection resets received
    1928 connections established
    326802810 segments received
    856773151 segments send out
    173128 segments retransmited
    0 bad segments received.
    4566 resets sent
Udp:
    592058 packets received
    106 packets to unknown port received.
    0 packet receive errors
    592217 packets sent
UdpLite:
TcpExt:
    1 invalid SYN cookies received
    22 resets received for embryonic SYN_RECV sockets
    1568636 TCP sockets finished time wait in fast timer
    6475830 delayed acks sent
    3882 delayed acks further delayed because of locked socket
    Quick ack mode was activated 13948 times
    730 times the listen queue of a socket overflowed
    730 SYNs to LISTEN sockets dropped
    203165314 packets directly queued to recvmsg prequeue.
    761690672 bytes directly in process context from backlog
    2392950864 bytes directly received in process context from prequeue
    115348790 packet headers predicted
    83624452 packets header predicted and directly queued to user
    23146741 acknowledgments not containing data payload received
    158822397 predicted acknowledgments
    14260 times recovered from packet loss by selective acknowledgements
    17 bad SACK blocks received
    Detected reordering 16 times using FACK
    Detected reordering 1220 times using SACK
    Detected reordering 21 times using time stamp
    799 congestion windows fully recovered without slow start
    416 congestion windows partially recovered using Hoe heuristic
    7318 congestion windows recovered without slow start by DSACK
    20688 congestion windows recovered without slow start after partial ack
    4889 TCP data loss events
    TCPLostRetransmit: 826
    1157 timeouts after SACK recovery
    30 timeouts in loss state
    68584 fast retransmits
    26065 forward retransmits
    39782 retransmits in slow start
    35009 other TCP timeouts
    1412 SACK retransmits failed
    120 times receiver scheduled too late for direct processing
    17867 DSACKs sent for old packets
    26 DSACKs sent for out of order packets
    54702 DSACKs received
    26 DSACKs for out of order packets received
    280 connections reset due to unexpected data
    1391 connections reset due to early user close
    9 connections aborted due to timeout
    TCPDSACKIgnoredOld: 450
    TCPDSACKIgnoredNoUndo: 225
    TCPSackShifted: 474048
    TCPSackMerged: 188891
    TCPSackShiftFallback: 911357
    TCPBacklogDrop: 4
    TCPDeferAcceptDrop: 1302164
IpExt:
    InBcastPkts: 6307
    InOctets: -49320814
    OutOctets: 1081770300
    InBcastOctets: 660667

ifconfig -a @d2

eth0      Link encap:Ethernet  HWaddr 00:25:90:60:33:44  
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:113090371 errors:2 dropped:188201 overruns:0 frame:1
          TX packets:418567871 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:148850003011 (148.8 GB)  TX bytes:487299854067 (487.2 GB)
          Memory:faee0000-faf00000 
 
eth1      Link encap:Ethernet  HWaddr 00:25:90:60:33:45  
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:321402656 errors:0 dropped:8 overruns:8 frame:0
          TX packets:413332453 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:198971160226 (198.9 GB)  TX bytes:474174366599 (474.1 GB)
          Memory:fae60000-fae80000 
 
lan       Link encap:Ethernet  HWaddr 00:25:90:60:33:45  
          inet addr:10.3.1.13  Bcast:10.3.1.255  Mask:255.255.255.0
          inet6 addr: fe80::225:90ff:fe60:3345/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:434493027 errors:2 dropped:188209 overruns:8 frame:1
          TX packets:831900324 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:347821163237 (347.8 GB)  TX bytes:961474220666 (961.4 GB)
 
lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:22235277 errors:0 dropped:0 overruns:0 frame:0
          TX packets:22235277 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:137115007492 (137.1 GB)  TX bytes:137115007492 (137.1 GB)

Comment by Eliot Horowitz (Inactive) [ 21/Nov/12 ]

Can you send

netstat -s
/sbin/ifconfig -a

Comment by Aristarkh Zagorodnikov [ 21/Nov/12 ]

Also, we rarely use slaveOk for this database, so increased read traffic on d1 is expected. I'm not sure why the incoming traffic on d2 is larger than the outgoing, though.

Comment by Aristarkh Zagorodnikov [ 21/Nov/12 ]

Some network asymmetry might come from the fact that these servers also run the mongos instances used by our applications over the same link, and I'm not sure whether those are properly balanced.

Comment by Eliot Horowitz (Inactive) [ 21/Nov/12 ]

You don't happen to have mongostat or other per-second monitoring, do you?
I'm asking because there is asymmetry somewhere that caused the issue.
We need to handle that better, but I also want to understand exactly what happened.

It looks like d1 was able to ping d2, but d2 was not able to ping d1.
Really want to figure out why.

Comment by Aristarkh Zagorodnikov [ 21/Nov/12 ]

Both computers are almost the same:
2xXeon 5520, 48GiB ECC DDR3, LSI 2008-based SATA3 HBA + SuperMicro L24X 12-disk backplane
OS, swap and journal drive: mdadm RAID1 over a pair of Intel 320 Series SSDs
Each shard RS data directory: mdadm RAID1 over a pair of Seagate SATA3 disks (total eight disks)

So everything is attached locally; the only place where the network is involved is between RS members.

Comment by Eliot Horowitz (Inactive) [ 21/Nov/12 ]

What kind of hardware is this running on?
It seems there was some sort of network asymmetry for a moment.
Is the disk local or network based?

Comment by Aristarkh Zagorodnikov [ 21/Nov/12 ]

Attaching graphs for network I/O ("lan" is a balance-alb bond of two 1Gbps interfaces that are plugged into two stacked switches).

Comment by Eliot Horowitz (Inactive) [ 21/Nov/12 ]

Ok, I think we have a clue.
To confirm, do you have network monitoring/bandwidth/anything during this time period?

Comment by Aristarkh Zagorodnikov [ 21/Nov/12 ]

Yes, the oplog query was done on d2.
d1.s1:

>mongo d1.s1.fs.drive.bru:27021
MongoDB shell version: 2.2.1
connecting to: d1.s1.fs.drive.bru:27021/test
driveFS-1:PRIMARY> rs.conf()
{
        "_id" : "driveFS-1",
        "version" : 1,
        "members" : [
                {
                        "_id" : 0,
                        "host" : "d1.s1.fs.drive.bru:27021",
                        "priority" : 10
                },
                {
                        "_id" : 1,
                        "host" : "d2.s1.fs.drive.bru:27021"
                },
                {
                        "_id" : 2,
                        "host" : "a.s1.fs.drive.bru:27021",
                        "arbiterOnly" : true
                }
        ]
}
driveFS-1:PRIMARY> rs.status()
{
        "set" : "driveFS-1",
        "date" : ISODate("2012-11-21T14:47:36Z"),
        "myState" : 1,
        "members" : [
                {
                        "_id" : 0,
                        "name" : "d1.s1.fs.drive.bru:27021",
                        "health" : 1,
                        "state" : 1,
                        "stateStr" : "PRIMARY",
                        "uptime" : 66043,
                        "optime" : Timestamp(1353509256000, 9),
                        "optimeDate" : ISODate("2012-11-21T14:47:36Z"),
                        "self" : true
                },
                {
                        "_id" : 1,
                        "name" : "d2.s1.fs.drive.bru:27021",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                        "uptime" : 66041,
                        "optime" : Timestamp(1353509254000, 6),
                        "optimeDate" : ISODate("2012-11-21T14:47:34Z"),
                        "lastHeartbeat" : ISODate("2012-11-21T14:47:35Z"),
                        "pingMs" : 0
                },
                {
                        "_id" : 2,
                        "name" : "a.s1.fs.drive.bru:27021",
                        "health" : 1,
                        "state" : 7,
                        "stateStr" : "ARBITER",
                        "uptime" : 66041,
                        "lastHeartbeat" : ISODate("2012-11-21T14:47:35Z"),
                        "pingMs" : 0
                }
        ],
        "ok" : 1
}

d2.s1:

>mongo d2.s1.fs.drive.bru:27021
MongoDB shell version: 2.2.1
connecting to: d2.s1.fs.drive.bru:27021/test
driveFS-1:SECONDARY> rs.conf()
{
        "_id" : "driveFS-1",
        "version" : 1,
        "members" : [
                {
                        "_id" : 0,
                        "host" : "d1.s1.fs.drive.bru:27021"
                        "priority" : 10
                },
                {
                        "_id" : 1,
                        "host" : "d2.s1.fs.drive.bru:27021"
                },
                {
                        "_id" : 2,
                        "host" : "a.s1.fs.drive.bru:27021",
                        "arbiterOnly" : true
                }
        ]
}
driveFS-1:SECONDARY> rs.status()
{
        "set" : "driveFS-1",
        "date" : ISODate("2012-11-21T14:47:57Z"),
        "myState" : 2,
        "syncingTo" : "d1.s1.fs.drive.bru:27021",
        "members" : [
                {
                        "_id" : 0,
                        "name" : "d1.s1.fs.drive.bru:27021",
                        "health" : 1,
                        "state" : 1,
                        "stateStr" : "PRIMARY",
                        "uptime" : 66063,
                        "optime" : Timestamp(1353509276000, 13),
                        "optimeDate" : ISODate("2012-11-21T14:47:56Z"),
                        "lastHeartbeat" : ISODate("2012-11-21T14:47:56Z"),
                        "pingMs" : 0
                },
                {
                        "_id" : 1,
                        "name" : "d2.s1.fs.drive.bru:27021",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                        "uptime" : 66069,
                        "optime" : Timestamp(1353509274000, 5),
                        "optimeDate" : ISODate("2012-11-21T14:47:54Z"),
                        "self" : true
                },
                {
                        "_id" : 2,
                        "name" : "a.s1.fs.drive.bru:27021",
                        "health" : 1,
                        "state" : 7,
                        "stateStr" : "ARBITER",
                        "uptime" : 66065,
                        "lastHeartbeat" : ISODate("2012-11-21T14:47:56Z"),
                        "pingMs" : 1
                }
        ],
        "ok" : 1
}

Comment by Eliot Horowitz (Inactive) [ 21/Nov/12 ]

Can you send rs.conf() and rs.status() from every node?

Comment by Eliot Horowitz (Inactive) [ 21/Nov/12 ]

The oplog query you did was on d2?

Comment by Aristarkh Zagorodnikov [ 21/Nov/12 ]

Attaching the log from d2.s1 (the secondary of the 1st shard) that logged "should cause rollback".

Comment by Aristarkh Zagorodnikov [ 21/Nov/12 ]

Well, I guess that the absence of the rollback is the problem itself. The "primary.log" is from d1.s1 (which was primary at the start of the problems).

Comment by Eliot Horowitz (Inactive) [ 21/Nov/12 ]

What node is the log from?
Not seeing any rollback at all in there.
What node did you check the oplog for?

Comment by Aristarkh Zagorodnikov [ 21/Nov/12 ]

Some network topology notes to prevent possible confusion when analyzing logs where host names are not available (socket errors, etc.) – there are three physical servers:
10.3.1.10: c1 (SSD), a.sX (ports 27021-27024)
10.3.1.12: c2 (SSD), d1.sX (ports 27021-27024, dedicated RAID1 array for each shard/RS member, a total of 4), b1
10.3.1.13: c3 (SSD), d2.sX (ports 27021-27024, dedicated RAID1 array for each shard/RS member, a total of 4), b2

Comment by Aristarkh Zagorodnikov [ 21/Nov/12 ]

Attaching the log from the primary for two hours around the problem.

Comment by Eliot Horowitz (Inactive) [ 21/Nov/12 ]

Can you attach the logs from the primary where the rollback should have happened?

Comment by Aristarkh Zagorodnikov [ 21/Nov/12 ]

"I quickly ruled out #3" should read "I quickly ruled out #1"
