[SERVER-10117] expose splitVector functionality Created: 06/Jul/13  Updated: 06/Dec/22  Resolved: 23/Aug/18

Status: Closed
Project: Core Server
Component/s: Sharding, Usability
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Antoine Girbal Assignee: [DO NOT USE] Backlog - Sharding Team
Resolution: Won't Fix Votes: 8
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-23917 splitVector can't be run against seco... Closed
Duplicate
is duplicated by SERVER-19170 Add splitVector permission to cluster... Closed
Related
related to DOCS-1686 document the splitVector command Closed
related to SPARK-54 Create a pagination partitioner Closed
is related to SERVER-27344 splitVector should be under a differe... Backlog
Assigned Teams:
Sharding
Participants:

 Description   

splitVector is used by sharding and returns the split points for a collection. It shows impressive performance while doing so. There are many cases where it can be extremely useful to know the split points of a collection, for example:

  • to subdivide workload across application threads
  • to subdivide workload for map-reduce

There is no easy alternative for an application that is not aware of the distribution of a key.
In the context of sharding, mongos could just return the chunk ranges.
This should be made available to a 'read' or 'readWrite' application user.
The Hadoop connector also currently relies on this functionality, so it should be made properly supported.
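For reference, a minimal sketch of how the command is invoked today (pymongo; the namespace, key pattern, and chunk size are illustrative, and the field names follow the current internal command, which may change):

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")

    # splitVector is an admin command today. The namespace, index key
    # pattern, and target chunk size (in MB) are passed as command fields;
    # the reply lists the computed split points under "splitKeys".
    result = client.admin.command(
        "splitVector",
        "test.mycollection",      # full namespace (illustrative)
        keyPattern={"_id": 1},    # index to compute split points over
        maxChunkSize=64,          # desired chunk size in MB
    )

    for point in result.get("splitKeys", []):
        print(point)

As discussed in the comments below, running this currently requires clusterAdmin-level privileges, which is what this ticket asks to relax.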



 Comments   
Comment by Kaloian Manassiev [ 23/Aug/18 ]

Thanks ross.lawley. I am closing this ticket as Won't Fix.

Comment by Ross Lawley [ 23/Aug/18 ]

kaloian.manassiev, this is no longer a pain point. $sample does the job well and has been the default approach in the Spark Connector's collection partitioner without issue.

Comment by Kaloian Manassiev [ 23/Aug/18 ]

ross.lawley, is this still a pain point for you? It looks like $sample does the job, and we would really like not to start supporting splitVector as a first-class citizen.

Comment by Luke Lovett [ 25/May/16 ]

dan@10gen.com, this sounds like a good alternative for users on MongoDB 3.2+. I'll make a HADOOP ticket for creating a new splitter based on $sample.

Comment by Ross Lawley [ 25/May/16 ]

dan@10gen.com, I'm still trying to fully grok the code, but it should work. I'll try creating a new partitioner using $sample and see how it goes.

Comment by Daniel Pasette (Inactive) [ 25/May/16 ]

ross.lawley / luke.lovett, what if we move the Hadoop and Spark connectors to use the $sample aggregation stage to calculate split points instead of the internal splitVector command? It requires only read privileges and is exactly what we use to calculate split points in the oplog on WiredTiger. Here's the code: https://github.com/mongodb/mongo/blob/r3.3.6/src/mongo/db/storage/wiredtiger/wiredtiger_record_store.cpp#L343. Would this work? Note that $sample is only available in 3.2+.
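For illustration, a rough sketch of that approach in Python/pymongo; the sample size, partition count, and collection name are assumptions for the example, not part of any connector:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    coll = client["test"]["mycollection"]

    SAMPLE_SIZE = 10000     # documents to sample (assumption)
    NUM_PARTITIONS = 16     # desired number of partitions (assumption)

    # Draw a random sample of partition-key values with the $sample stage
    # (MongoDB 3.2+); this only needs read privileges, unlike splitVector.
    sampled = coll.aggregate([
        {"$sample": {"size": SAMPLE_SIZE}},
        {"$project": {"_id": 1}},
    ])

    keys = sorted(doc["_id"] for doc in sampled)

    # Use evenly spaced sampled keys as approximate split points.
    step = max(len(keys) // NUM_PARTITIONS, 1)
    split_points = keys[step::step][:NUM_PARTITIONS - 1]
    print(split_points)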

Comment by Ross Lawley [ 24/May/16 ]

splitVector is also used by the new Spark connector as a mechanism for partitioning a collection for users on a non-sharded system.
Having it assigned to the same roles as collStats would be ideal, since users with read permissions would then be able to use the connector.

Comment by Ben McCann [ 29/Jun/15 ]

+1 to adding splitVector to the clusterMonitor role. It's already exposed in the clusterAdmin role and other admin roles. I feel that the only thing being accomplished by leaving it off of clusterMonitor is encouraging folks to assign higher permission levels than necessary.

You already require lots of customers using your Mongo Hadoop Connector to assign the permission. You can leave it undocumented so that folks aren't encouraged to use it beyond that, but it seems really silly to tell people they need to use it and then make the easiest way of doing so the assignment of a less secure role than necessary. In fact, by not adding it to clusterMonitor you end up exposing it even more: people are now creating custom roles that grant splitVector instead of just using the built-in roles, which you control and can change in future releases.
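For example, the kind of custom role people end up defining looks roughly like this (Python/pymongo sketch; the role and user names are made up for the example):

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")

    # Custom role granting only the splitVector action on all
    # non-system collections (role name is illustrative).
    client.admin.command(
        "createRole",
        "splitVectorOnly",
        privileges=[{
            "resource": {"db": "", "collection": ""},
            "actions": ["splitVector"],
        }],
        roles=[],
    )

    # Grant it, along with read, to an application user so the Hadoop
    # connector can compute splits without clusterAdmin (user is made up).
    client.admin.command(
        "grantRolesToUser",
        "hadoopUser",
        roles=[
            {"role": "splitVectorOnly", "db": "admin"},
            {"role": "read", "db": "test"},
        ],
    )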

To see where the Mongo Hadoop Connector uses it look at https://github.com/mongodb/mongo-hadoop/blob/master/core/src/main/java/com/mongodb/hadoop/splitter/StandaloneMongoSplitter.java

Comment by Lars Francke [ 17/Jul/14 ]

I can open a separate issue for this, but I think it's also worthwhile to allow splitVector to run on secondary servers, especially for the Hadoop use case: we don't want to point the customer's Hadoop cluster at the primary node.

See CS-13607 and HADOOP-150 for more information.

Comment by Spencer Brody (Inactive) [ 08/Jul/13 ]

Updated the title of this ticket and put it into needs triage. I'm not sure how important it is to change which system role this belongs to, given the upcoming ability for users to define custom roles.

Comment by Antoine Girbal [ 08/Jul/13 ]

I will open a ticket to document it then, because right now there is nothing:
http://docs.mongodb.org/manual/reference/command/splitVector/

I think requiring the clusterAdmin role is too harsh here, since:

  • it only works for unsharded collections on a single server
  • it is read-only
  • it is backed by an index, so there is no risk of a table scan

One point of this ticket is to make it available to an application's 'read' or 'readWrite' users.

Comment by Spencer Brody (Inactive) [ 08/Jul/13 ]

It is officially supported; you need the "clusterAdmin" role to use it.
