[SERVER-10595] Replicating many array operations is excessively slow Created: 21/Aug/13  Updated: 15/Nov/21  Resolved: 06/Jan/14

Status: Closed
Project: Core Server
Component/s: Performance, Replication, Write Ops
Affects Version/s: 2.4.6
Fix Version/s: 2.5.5

Type: Bug Priority: Major - P3
Reporter: Nelson Elhage Assignee: Andrew Morrow (Inactive)
Resolution: Done Votes: 2
Labels: array, oplog
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Tested on Linux amd64 Ubuntu 13.04 Raring


Issue Links:
Depends
depends on SERVER-10717 New implementation of $push should lo... Closed
Related
related to SERVER-12258 Repeated $push to end of an array is ... Closed
related to SERVER-10159 Evaluate performance of new update fr... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

Clone https://github.com/nelhage/mongod-tests/ and run push.py

Observe that long after the 10,000 $push ops have been applied, both secondaries are still spinning at 100% CPU and 100% lock percentage. On my machine, with mongo 2.4.6, it takes around 90s for the secondaries to catch up.
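
For readers who don't want to pull the repo, the workload push.py drives is roughly the following. This is only an illustrative sketch written against the modern mongocxx C++ driver (the actual reproduction script is Python); the connection URI is an assumption, while the test.test_push namespace and the "nums" field follow the oplog entry quoted in the comments below.

// Illustrative sketch only: roughly the workload push.py drives, not the real script.
#include <bsoncxx/builder/basic/document.hpp>
#include <bsoncxx/builder/basic/kvp.hpp>
#include <bsoncxx/oid.hpp>
#include <bsoncxx/types.hpp>
#include <mongocxx/client.hpp>
#include <mongocxx/instance.hpp>
#include <mongocxx/uri.hpp>

int main() {
    using bsoncxx::builder::basic::kvp;
    using bsoncxx::builder::basic::make_document;

    mongocxx::instance inst{};
    mongocxx::client client{
        mongocxx::uri{"mongodb://localhost:27017/?replicaSet=rs_5586"}};  // URI is illustrative
    auto coll = client["test"]["test_push"];

    // One document; $push creates the "nums" array on the first update.
    bsoncxx::oid id;
    coll.insert_one(make_document(kvp("_id", bsoncxx::types::b_oid{id})));

    // 10,000 individual $push ops: cheap on the primary, but on 2.4.6 the
    // secondaries spend ~90s at 100% CPU replaying the resulting
    // positional-$set oplog entries.
    for (int i = 0; i < 10000; ++i) {
        coll.update_one(
            make_document(kvp("_id", bsoncxx::types::b_oid{id})),
            make_document(kvp("$push", make_document(kvp("nums", i)))));
    }
    return 0;
}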

Participants:

 Description   

The code that applies operations to arrays is excessively inefficient in several ways. In addition to causing general performance problems, there are many ops that are fast to apply on the primary but extremely slow on secondaries, which can result in replication lagging massively and the secondaries becoming useless (spinning at 100% CPU and 100% write lock).

There are at least two causes for slowness:

  • BSONArrayIteratorSorted performs an unnecessary and expensive sort on the array just to walk it in index order.
  • ModSetState::createNewFromMods calls compareDottedFieldNames for every element of the array, even though we already know the keys are integers in increasing order.

The second issue is exacerbated by the number of allocations involved in compareDottedFieldNames (see the sketch after this list):

  • update_internal allocates two std::strings to pass in, since it only has char*s. One of them is a concatenation of root and e.fieldName(), even though we know that m->second->m->fieldName must already begin with root, so we do an unnecessary concatenation and re-comparison of the prefix.
  • compareDottedFieldNames calls std::string::substr(), which must allocate memory for each substring it passes to LexNumCmp, even though LexNumCmp accepts a StringData, which could be constructed in O(1) for the relevant component.
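
To make the allocation pattern concrete, here is a minimal, self-contained sketch. The names compareDottedFieldNames and lexNumCmp mirror the report, while compareAgainstMod is an invented stand-in for the createNewFromMods call site; this is illustrative pseudocode of the shape of the problem, not the actual server source.

// Illustrative sketch of the allocation-heavy pattern described above.
#include <string>

// Simplified stand-in for LexNumCmp: pure digit runs compare numerically
// (a shorter digit run is the smaller number, assuming no leading zeros);
// anything else compares as a plain string.
static int lexNumCmp(const std::string& a, const std::string& b) {
    bool na = !a.empty() && a.find_first_not_of("0123456789") == std::string::npos;
    bool nb = !b.empty() && b.find_first_not_of("0123456789") == std::string::npos;
    if (na && nb && a.size() != b.size()) return a.size() < b.size() ? -1 : 1;
    int c = a.compare(b);
    return c < 0 ? -1 : (c > 0 ? 1 : 0);
}

// Roughly the 2.4-era calling pattern: every dotted component is carved out
// with substr(), which allocates a fresh std::string just to feed the comparator.
static int compareDottedFieldNames(const std::string& l, const std::string& r) {
    size_t ls = 0, rs = 0;
    while (true) {
        size_t a = l.find('.', ls);
        size_t b = r.find('.', rs);
        int x = lexNumCmp(l.substr(ls, a - ls),   // heap allocation per component
                          r.substr(rs, b - rs));  // heap allocation per component
        if (x != 0) return x;
        if (a == std::string::npos) return b == std::string::npos ? 0 : -1;
        if (b == std::string::npos) return 1;
        ls = a + 1;
        rs = b + 1;
    }
}

// Caller side: both arguments are materialized as std::strings, one of them a
// concatenation whose root prefix is already known to match, so the prefix is
// rebuilt and re-compared for every element of the array.
static int compareAgainstMod(const std::string& root, const char* elemFieldName,
                             const char* modFieldName /* starts with root */) {
    return compareDottedFieldNames(root + elemFieldName,        // allocation + copy
                                   std::string(modFieldName));  // allocation + copy
}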

Changing BSONArrayIteratorSorted to simply delegate to a plain BSONObjIterator would fix the unneeded sort.

Changing compareDottedFieldNames to accept and use StringData would fix most of the allocations, and changing createNewFromMods to compare e.fieldName() against m->second->m->fieldName + root.size() (instead of building the concatenation) would fix the rest. Both are sketched below.
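
A sketch of the shape of the proposed fix follows, using std::string_view as a stand-in for the server's StringData type; as above, compareAgainstMod is an invented illustration of the createNewFromMods call site, not the actual patch.

// Illustrative sketch of the proposed fix: views instead of owning strings.
#include <string>
#include <string_view>

// Same simplified comparator as in the previous sketch, now taking views,
// so neither side is copied.
static int lexNumCmp(std::string_view a, std::string_view b) {
    bool na = !a.empty() && a.find_first_not_of("0123456789") == std::string_view::npos;
    bool nb = !b.empty() && b.find_first_not_of("0123456789") == std::string_view::npos;
    if (na && nb && a.size() != b.size()) return a.size() < b.size() ? -1 : 1;
    int c = a.compare(b);
    return c < 0 ? -1 : (c > 0 ? 1 : 0);
}

// Components become O(1) sub-views of the caller's buffers: no substr()
// allocations anywhere in the loop.
static int compareDottedFieldNames(std::string_view l, std::string_view r) {
    while (true) {
        size_t a = l.find('.');
        size_t b = r.find('.');
        int x = lexNumCmp(l.substr(0, a), r.substr(0, b));  // views, no allocation
        if (x != 0) return x;
        if (a == std::string_view::npos) return b == std::string_view::npos ? 0 : -1;
        if (b == std::string_view::npos) return 1;
        l.remove_prefix(a + 1);
        r.remove_prefix(b + 1);
    }
}

// Caller side: since the mod's field name is known to begin with root, skip the
// shared prefix by offset instead of concatenating root onto the element name.
static int compareAgainstMod(const std::string& root, const char* elemFieldName,
                             const char* modFieldName /* starts with root */) {
    return compareDottedFieldNames(elemFieldName, modFieldName + root.size());
}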



 Comments   
Comment by Andrew Morrow (Inactive) [ 06/Jan/14 ]

Hi Nelson -

Thanks for the quick turnaround on testing. I'm going to mark this as resolved; please feel free to ping the ticket if you need anything else.

Andrew

Comment by Nelson Elhage [ 06/Jan/14 ]

Hi Andrew,

Testing against the 2.5.4 release, I concur that the problem seems to be resolved. Thanks!

Comment by Andrew Morrow (Inactive) [ 06/Jan/14 ]

Hi Nelson -

I've recently re-run your push.py test. While I can reproduce your long replication delay against 2.4.8, I've found that the replication delay does not occur when using the 2.5.4 development release.

I'd be interested to know if you can still reproduce the delay in your development environment with the latest 2.5 release (currently 2.5.4) or 2.5 nightly.

Thanks,
Andrew

Comment by Zach Lipton [ 26/Sep/13 ]

Thanks, Asya.

Most of the arrays have fewer than 100 elements, but the largest I've encountered so far has 18,000. Is that simply too large? I'd expect slower performance when documents need to be moved, but not this slow, with 100% CPU usage for extended periods of time. It's the 'cookies' collection in group 'dijit' in MMS if you're following along at home. While my initial theory was that the slow $push operations were happening only on the large arrays, I think I've seen a mix in the profiler data. Is there a way to get more visibility into what's going on with these operations where nscanned=1 and nmoved=0, but timeLockedMicros is 80+ seconds?

This is Amazon Linux on EC2. I've used both m1.large and m3.xlarge instances. Disks are a RAID configuration as recommended in the documentation, but the problem seems to be CPU-bound.

Happy to break this out into another ticket if I'm hijacking an unrelated issue.

Comment by Asya Kamsky [ 26/Sep/13 ]

zach@dijit.com, how large are the arrays we're talking about here?

Also, what OS/arch is this and is NUMA possibly involved?

Comment by Zach Lipton [ 25/Sep/13 ]

So I think this is the same issue. I'm currently on 2.4.6, but I don't know the server internals well enough to be certain. In my application, I'm collecting analytics-type data on a per-user basis. I have a collection of users, and each user document contains an array of events, so each document looks more or less like:

{
    _id: ObjectId("..."),
    name: "John Smith",
    hits: [{
        "_id": ObjectId("..."),
        type: "request"
        timestamp: Date("..."),
        path: "/foo/bar",
        partner: "acmecorp"
    }, {
        _id: ObjectId("..."),
        type: "registration"
        timestamp: Date("..."),
        path: "/foo/bar/quaz",
        partner: "aperturescience"
    }]
}

As events come in, the application performs an upsert on the relevant user, creating the user if it does not already exist and $pushing the event onto the array. The server is currently configured with two-node replication (plus an arbiter), running on EC2, reporting to MMS under group name "dijit." (If you look at MMS, ignore the recent downtime on the secondary, as I took it offline for a little while to troubleshoot).
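
For concreteness, the per-event write described above looks roughly like the following; this is sketched with the mongocxx C++ driver purely for illustration (the application's actual language and driver aren't stated), and the collection and field names follow the sample document above rather than real application code.

// Illustration of the per-event upsert + $push pattern described above.
#include <chrono>
#include <string>
#include <bsoncxx/builder/basic/document.hpp>
#include <bsoncxx/builder/basic/kvp.hpp>
#include <bsoncxx/oid.hpp>
#include <bsoncxx/types.hpp>
#include <mongocxx/collection.hpp>
#include <mongocxx/options/update.hpp>

void recordHit(mongocxx::collection users, const std::string& name,
               const std::string& type, const std::string& path,
               const std::string& partner) {
    using bsoncxx::builder::basic::kvp;
    using bsoncxx::builder::basic::make_document;

    mongocxx::options::update opts;
    opts.upsert(true);  // create the user document if it doesn't exist yet

    // Each event is appended to the ever-growing "hits" array with $push.
    users.update_one(
        make_document(kvp("name", name)),
        make_document(kvp("$push", make_document(kvp(
            "hits",
            make_document(kvp("_id", bsoncxx::types::b_oid{bsoncxx::oid{}}),
                          kvp("type", type),
                          kvp("timestamp",
                              bsoncxx::types::b_date{std::chrono::system_clock::now()}),
                          kvp("path", path),
                          kvp("partner", partner)))))),
        opts);
}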

What I'm seeing is sporadic periods in which CPU usage spikes to 100%, writes to this collection are held up waiting for locks, and individual writes can take 75+ seconds. This occurs both on the primary and the secondary, though the secondary is practically useless, often pinned at 100% CPU. While I initially suspected a delay when documents needed to be moved as they expanded (though that certainly shouldn't be this slow), the profiler logs show "nmoved" : 0 for many of the slow writes. I did try setting usePowerOf2Sizes on the collection, which seems to have reduced the frequency of these lockups a little, but not much, and only on the primary.

So my question really is: is the issue I'm seeing here the same root problem reported in this bug report? And are there any known mitigation strategies for 2.4.x?

Comment by Andrew Morrow (Inactive) [ 23/Aug/13 ]

Hi -

Thank you for the detailed (and automated) test case and performance analysis. I was immediately able to reproduce the behavior you described.

You should know that we are in the process of entirely reworking the update code path for the next major release, 2.6. You can experiment with this in the just-released 2.5.2 development release. However, the new implementation has not yet reached performance parity with the old one in several cases. In particular, keeping pace with the master on your test case relies on writing positional $sets to the oplog for $push, which the new update framework does not yet do. I'll be using your test case to tune that particular operation for 2.5.3.

We will need to separately consider what changes we want to make for 2.4. We will also verify that the new update framework does not reintroduce the issues you have identified above.

Thanks,
Andrew

Comment by Nelson Elhage [ 21/Aug/13 ]

Oh, one extra comment – contrary to rumors I've heard, the slowness comes from `$push` being translated to `$set: { 'array.NNN': ... }`, rather than from the entire array being replicated in the oplog:

rs_5586:PRIMARY> db.oplog.rs.find().sort({$natural: -1}).limit(1).next()
{
	"ts" : {
		"t" : 1377121580,
		"i" : 144
	},
	"h" : NumberLong("-1841530108187566546"),
	"v" : 2,
	"op" : "u",
	"ns" : "test.test_push",
	"o2" : {
		"_id" : ObjectId("5215352996bae515d2a264cc")
	},
	"o" : {
		"$set" : {
			"nums.2316" : 2316
		}
	}
}

So there is in fact no fundamental reason for this to be slow.
