[SERVER-27344] splitVector should be under a different built-in authorization role Created: 08/Dec/16  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Security
Affects Version/s: 3.2.11
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Andre Spiegel Assignee: Backlog - Security Team
Resolution: Unresolved Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-10117 expose splitVector functionality Closed
Assigned Teams:
Server Security
Participants:
Case:

Description

The unofficial, semi-documented splitVector command has its own privilege action and is included in the clusterAdmin role by default. Unfortunately, this means that splitVector cannot be used in Atlas, since Atlas does not make the clusterAdmin role available to users, and does not allow creation of user-defined roles.

I'd suggest putting splitVector under a different built-in role such as read, readWrite, or dbAdmin.
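For reference, the only workaround outside Atlas today is a user-defined role that grants just the splitVector action, which is exactly what Atlas disallows. A minimal sketch (role and user names are hypothetical):

// Grant only the splitVector action, on all non-system collections,
// via a user-defined role. Not possible in Atlas, which is the point.
var admin = db.getSiblingDB("admin");
admin.createRole({
    role: "splitVectorOnly",    // hypothetical role name
    privileges: [
        { resource: { db: "", collection: "" }, actions: [ "splitVector" ] }
    ],
    roles: []
});
admin.grantRolesToUser("appUser", [ "splitVectorOnly" ]);   // hypothetical user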



Comments
Comment by Charlie Swanson [ 12/Dec/16 ]

Hmm, I'm not sure I buy that argument.

I scanned the implementation of splitVector with david.storch, and it seems pretty naive. It looks like it does some back-of-the-envelope calculations to determine that roughly X documents should fit into each range, and then takes every Xth value out of an index scan. Can you elaborate on how you are using this? It seems entirely possible this would give you a very skewed distribution, and it offers no guarantees on how accurately the ranges are split, which is the same lack of guarantees that $sample has. I will admit that a simple index or collection scan followed by a $bucketAuto stage would introduce significant memory and perhaps computational overhead compared to the splitVector command.

A $sample followed by a $bucketAuto actually seems like the most efficient way to do this that I can imagine. If you have a sample of size 1000, you will look at ~1000 index keys (not exact, since you might get duplicates) and get a fairly good split of your collection. There is one unfortunate detail: we currently place a FETCH stage on top of the random cursor used for $sample unconditionally, but in the case where we've determined we only need the _id, we can convert this to a covered index query as well. I've filed SERVER-27389 to track that optimization. I would be surprised if the $bucketAuto stage introduced very much overhead to compute the buckets, but it is possible. If you're really looking for this to be super quick, you could do away with the $bucketAuto and just use a $sample of the desired size, as sketched below. You'd lose a bit of accuracy in the distribution, though.
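For example, the $sample-only variant could look like this (collection name and sizes are just for illustration):

// Approximate split points from a sorted sample alone; trades accuracy
// for speed by skipping $bucketAuto entirely.
var points = db.foo.aggregate([
    {$sample: {size: 9}},       // 9 interior boundaries => ~10 ranges
    {$project: {_id: 1}},
    {$sort: {_id: 1}}
]).toArray().map(d => d._id);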

More generally, the server team has been making an effort to reduce the number of commands we expose, thus reducing the number of inevitable bugs of the sort "the X command forgot to profile itself", or "the Y command doesn't take the required steps to prevent reading orphaned/unowned data" (fun fact: it looks like splitVector has this problem at first glance). I think we would prefer a solution to this problem that doesn't expose the splitVector command, and instead uses the aggregation framework to achieve the same goal. Perhaps we could introduce an aggregation stage that does a similar "get every Nth value from this stream" operation, which appears to be what is happening inside the splitVector command.
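To make that concrete, a rough client-side stand-in for such a stage (collection name and N are made up), which unlike splitVector still streams every key to the client:

// Covered scan of the _id index, keeping every Nth key as a boundary.
var boundaries = [];
var n = 1000, i = 0;
db.foo.find({}, {_id: 1}).sort({_id: 1}).forEach(function(doc) {
    if (i++ % n === 0) boundaries.push(doc._id);
});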

I'll also just throw out here that it seems like the parallelCollectionScan command is related, and maybe what we really want is SERVER-17688 and possibly SERVER-23162?

Comment by Andre Spiegel [ 12/Dec/16 ]

charlie.swanson As far as I understand, splitVector doesn't actually read documents and put them into buckets. It walks the index on the collection and determines the split boundaries based on that. So even if $sample reduces the work of the bucket-building (trading accuracy for speed), it is still a second-best guess compared to walking the index and computing boundaries without ever reading even a fraction of the documents involved.
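For reference, a typical invocation looks roughly like this (the command is semi-documented, so the option names may vary across server versions):

// Ask the server for index-key boundaries targeting ~32MB per range.
// No documents are fetched; only the {_id: 1} index is walked.
db.adminCommand({
    splitVector: "test.foo",
    keyPattern: {_id: 1},
    maxChunkSizeBytes: 32 * 1024 * 1024
})
// => { "splitKeys" : [ { "_id" : ... }, ... ], "ok" : 1 }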

Comment by Charlie Swanson [ 12/Dec/16 ]

I agree; I think $sample is what you're looking for. $sample combined with $bucketAuto makes it pretty easy to get roughly equal distributions, with a configurable sample size to tune the accuracy of the distribution:

> db.foo.drop()
true
> var bulk = db.foo.initializeUnorderedBulkOp();
> for (var i = 0; i < 10000; i++) { bulk.insert({}); }
> bulk.execute()
BulkWriteResult({
	"writeErrors" : [ ],
	"writeConcernErrors" : [ ],
	"nInserted" : 10000,
	"nUpserted" : 0,
	"nMatched" : 0,
	"nModified" : 0,
	"nRemoved" : 0,
	"upserted" : [ ]
})
>  db.foo.aggregate([{$sample: {size: 1000}}, {$bucketAuto: {groupBy: "$_id", buckets: 10}}]).table();
╔═══════════════════════════════════════════════════════════════════════════════════╤═══════╗
║ _id                                                                               │ count ║
║     ┌──────────────────────────────────────┬──────────────────────────────────────┤       ║
║     │ _id.min                              │ _id.max                              │       ║
╠═════╪══════════════════════════════════════╪══════════════════════════════════════╪═══════╣
║     │ ObjectId("584f06626b5bdc71cf22bc43") │ ObjectId("584f06626b5bdc71cf22bf99") │ 100   ║
╟─────┼──────────────────────────────────────┼──────────────────────────────────────┼───────╢
║     │ ObjectId("584f06626b5bdc71cf22bf99") │ ObjectId("584f06626b5bdc71cf22c330") │ 100   ║
╟─────┼──────────────────────────────────────┼──────────────────────────────────────┼───────╢
║     │ ObjectId("584f06626b5bdc71cf22c330") │ ObjectId("584f06626b5bdc71cf22c773") │ 100   ║
╟─────┼──────────────────────────────────────┼──────────────────────────────────────┼───────╢
║     │ ObjectId("584f06626b5bdc71cf22c773") │ ObjectId("584f06626b5bdc71cf22cb22") │ 100   ║
╟─────┼──────────────────────────────────────┼──────────────────────────────────────┼───────╢
║     │ ObjectId("584f06626b5bdc71cf22cb22") │ ObjectId("584f06626b5bdc71cf22cefa") │ 100   ║
╟─────┼──────────────────────────────────────┼──────────────────────────────────────┼───────╢
║     │ ObjectId("584f06626b5bdc71cf22cefa") │ ObjectId("584f06626b5bdc71cf22d29a") │ 100   ║
╟─────┼──────────────────────────────────────┼──────────────────────────────────────┼───────╢
║     │ ObjectId("584f06626b5bdc71cf22d29a") │ ObjectId("584f06626b5bdc71cf22d71b") │ 100   ║
╟─────┼──────────────────────────────────────┼──────────────────────────────────────┼───────╢
║     │ ObjectId("584f06626b5bdc71cf22d71b") │ ObjectId("584f06626b5bdc71cf22db55") │ 100   ║
╟─────┼──────────────────────────────────────┼──────────────────────────────────────┼───────╢
║     │ ObjectId("584f06626b5bdc71cf22db55") │ ObjectId("584f06626b5bdc71cf22df39") │ 100   ║
╟─────┼──────────────────────────────────────┼──────────────────────────────────────┼───────╢
║     │ ObjectId("584f06626b5bdc71cf22df39") │ ObjectId("584f06626b5bdc71cf22e33b") │ 100   ║
╚═════╧══════════════════════════════════════╧══════════════════════════════════════╧═══════╝
> var buckets = db.foo.aggregate([{$sample: {size: 1000}}, {$bucketAuto: {groupBy: "$_id", buckets: 10}}]).toArray();
> buckets.forEach((bucket, i) => { printjson(db.foo.aggregate([{$match: {_id: {$gte: bucket._id.min, $lt: bucket._id.max}}}, {$count: "nResults"}, {$addFields: {bucket: i}}]).toArray()[0]); })
{ "nResults" : 997, "bucket" : 0 }
{ "nResults" : 942, "bucket" : 1 }
{ "nResults" : 1122, "bucket" : 2 }
{ "nResults" : 878, "bucket" : 3 }
{ "nResults" : 941, "bucket" : 4 }
{ "nResults" : 1095, "bucket" : 5 }
{ "nResults" : 1039, "bucket" : 6 }
{ "nResults" : 1011, "bucket" : 7 }
{ "nResults" : 1056, "bucket" : 8 }
{ "nResults" : 916, "bucket" : 9 }

What would a separate command give you that would be better than that?

Comment by Andre Spiegel [ 12/Dec/16 ]

spencer.jackson luke.lovett@mongodb.com Yes, the $sample idea looks practical, although access to a function that performs an efficient index traversal on the server side would be preferable. So I'd still vote for having access to splitVector or something equivalent, but I'll try the $sample idea as a second-best option.

Comment by Spencer Jackson [ 12/Dec/16 ]

andre.spiegel Does the advice on $sample seem applicable to you?

Comment by Luke Lovett [ 09/Dec/16 ]

I made a Splitter in the Hadoop connector based on the $sample aggregation operator: https://github.com/mongodb/mongo-hadoop/blob/r2.0.1/core/src/main/java/com/mongodb/hadoop/splitter/SampleSplitter.java. The goal of this splitter is to partition a collection into pieces of roughly equal size. So far, I haven't had any complaints about it, so I assume that it works well enough. It's nice because it doesn't require extra privileges to run, as the splitVector command does. This won't work for older server versions that don't have $sample; however, moving splitVector into its own role of course won't help with older server versions, either.

Comment by Spencer Brody (Inactive) [ 09/Dec/16 ]

From a security perspective, I don't think there's anything wrong with putting splitVector under the 'read' role. However, like Andy, I am worried about making splitVector easier to use, as it was never intended to be a user-facing command. In SERVER-10117 there was discussion around whether an aggregation with $sample is a viable alternative to splitVector for use cases that are trying to partition the collection into units for parallel processing. luke.lovett mentioned he would attempt to create a Hadoop partitioner based on $sample - andre.spiegel, you may want to follow up with him to see how that experiment went and determine whether that approach could work for your framework as well.

Comment by Andre Spiegel [ 09/Dec/16 ]

My framework needs to find the boundaries (split points) between buckets, in order to create a table of work units for asynchronous use by multiple threads.

The $bucket stage requires that I already know what the boundaries are, and will then create the buckets for me. However, in my framework, each thread only needs to process the documents of a single bucket, so the $bucket stage does not help here. Also, each thread needs to see all documents from the bucket in their entirety, so they cannot be passed along as a group in an aggregation stage.

The $bucketAuto stage does seem to compute the bucket boundaries (I wasn't fully aware of that before), but it seems to do so by performing the actual aggregation, i.e. it reads all documents from the pipeline, actually creates the buckets, and then derives the boundaries from that. This is the same thing that my framework does client-side if the splitVector command is not available. However, for just finding the bucket boundaries, splitVector is much more efficient, since it only walks the index rather than actually reading and grouping the documents. So I believe doing the grouping server-side with $bucketAuto offers little advantage over what I already do client-side. If $bucketAuto used the same mechanism as splitVector (or actually used splitVector) to compute the boundaries, that would indeed be what I need - it would effectively be a public interface to the splitVector mechanism. But I don't think it does that today. Or does it?
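For what it's worth, extracting just the boundaries from Charlie's pipeline client-side is easy enough; it's the server-side document reads I'd like to avoid:

// Keep only the bucket boundaries from the $bucketAuto output; each worker
// then queries its own half-open range (the final max is inclusive).
var buckets = db.foo.aggregate([
    {$sample: {size: 1000}},
    {$bucketAuto: {groupBy: "$_id", buckets: 10}}
]).toArray();
var boundaries = buckets.map(b => b._id.min);
boundaries.push(buckets[buckets.length - 1]._id.max);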

Comment by Andre Spiegel [ 09/Dec/16 ]

I am using splitVector in a framework to partition a collection into units of work, so that parallel threads can each work on different parts of the data. The Hadoop connector does something similar. It is much more efficient to do this server-side, although I do have alternate code that performs this partitioning in the application, and my framework actually falls back to that if splitVector is not available.

I have argued in the past that it would be nice to have splitVector elevated to an official command, or that some kind of official command be introduced to fulfill its purpose. The recently added $bucket and $bucketAuto stages seem to serve a similar purpose, but they don't quite do what I need. If you're interested in the details, have a look at my SplitFinder class that uses splitVector (or falls back to client-side splitting if necessary): https://github.com/10gen/single-view/blob/master/src/com/mongodb/single/workers/SplitFinder.java

That being said, I agree that putting splitVector under the same authorization as find seems sufficient.
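The fallback pattern described above, sketched in shell terms (the helper name, chunk size, and sample size are illustrative):

// Try splitVector first; fall back to a client-side sample split if the
// command is unauthorized or unavailable (hypothetical helper).
function findSplitPoints(coll) {
    var res;
    try {
        res = db.adminCommand({
            splitVector: coll.getFullName(),
            keyPattern: {_id: 1},
            maxChunkSizeBytes: 32 * 1024 * 1024
        });
    } catch (e) {
        res = {ok: 0};      // newer shells throw on authorization failure
    }
    if (res.ok) {
        return res.splitKeys.map(k => k._id);
    }
    // Client-side fallback: approximate boundaries from a sorted sample.
    return coll.aggregate([
        {$sample: {size: 99}},
        {$sort: {_id: 1}}
    ]).toArray().map(d => d._id);
}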
