[SERVER-22571] splitVector should return a cursor so that it can return more than 16Mb worth of split points Created: 11/Feb/16  Updated: 06/Dec/22  Resolved: 15/Nov/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.0.9
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Dmitry Ryabtsev Assignee: [DO NOT USE] Backlog - Sharding NYC
Resolution: Won't Do Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Assigned Teams:
Sharding NYC
Sprint: Sharding 17 (07/15/16), Sharding 18 (08/05/16)
Participants:
Case:

 Description   

Currently the 16 MB response size limit prevents sharding an existing large collection if doing so would generate too many initial split points. Changing splitVector to return a cursor would allow us to shard arbitrarily large existing collections without needing to increase the chunk size.
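For illustration only, a cursor-shaped response might have looked something like the following (purely hypothetical: the change was never implemented, and the field names are simply modeled on how find and aggregate return their results):

    db.runCommand({splitVector: "test.mgendata", keyPattern: {_id:1}, maxChunkSizeBytes: 1*1024})
    {
        "cursor" : {
            "firstBatch" : [
                { "_id" : ObjectId("56a088e086c70418eb3e75fa") },
                <...>
            ],
            "id" : NumberLong("123456789"),
            "ns" : "test.mgendata"
        },
        "ok" : 1
    }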

Original Summary: splitVector fails if the document containing the split points exceeds the 16 MB limit

Original Description: Generally this happens when the initial split is performed on a large collection and/or the chunk size is configured to be small.

The symptoms when it happens during autosplit are:

2016-01-25T19:40:37.720-0500 W SHARDING [conn64] could not autosplit collection database1.col2 :: caused by :: 13345 splitVector command failed: { timeMillis: 2314, errmsg: "exception: BSONObj size: 20242689 (0x134E101) is invalid. Size must be between 0 and 16793600(16MB) First element: 0: { region: "Europe", randomP...", code: 10334, ok: 0.0, $gleStats: { lastOpTime: Timestamp 0|0, electionId: ObjectId('569684c288621bdbc786b963') } }

Can be reproduced manually as:

  1. Insert a million documents into a collection with default (implicitly created) _id values (a population sketch is shown after the log output below).
  2. Run splitVector:

    db.runCommand({splitVector: "test.mgendata", keyPattern: {_id:1}, maxChunkSizeBytes: 1*1024})
    {
    	"timeMillis" : 1127,
    	"errmsg" : "exception: BSONObj size: 19888875 (0x12F7AEB) is invalid. Size must be between 0 and 16793600(16MB) First element: 0: { _id: ObjectId('56a088e086c70418eb3e75fa') }",
    	"code" : 10334,
    	"ok" : 0
    }
    > show log
    <...>
    2016-01-21T19:17:28.232+1100 I SHARDING [conn3] request split points lookup for chunk test.mgendata { : MinKey } -->> { : MaxKey }
    2016-01-21T19:17:29.360+1100 W SHARDING [conn3] Finding the split vector for test.mgendata over { _id: 1.0 } keyCount: 2 numSplits: 666666 lookedAt: 2 took 1127ms
    2016-01-21T19:17:29.470+1100 I -        [conn3] Assertion: 10334:BSONObj size: 19888875 (0x12F7AEB) is invalid. Size must be between 0 and 16793600(16MB) First element: 0: { _id: ObjectId('56a088e086c70418eb3e75fa') }
    2016-01-21T19:17:29.477+1100 I CONTROL  [conn3]
     0xf840d2 0xf2b679 0xf0ef7e 0xf0f02c 0x8122eb 0x946664 0xe21f79 0x9c6e61 0x9c7e6c 0x9c8b3b 0xba5b9b 0xab2a1a 0x7ea855 0xf3ffa9 0x7f14e578e182 0x7f14e488f47d
    ----- BEGIN BACKTRACE -----
    {"backtrace":[{"b":"400000","o":"B840D2","s":"_ZN5mongo15printStackTraceERSo"},{"b":"400000","o":"B2B679","s":"_ZN5mongo10logContextEPKc"},{"b":"400000","o":"B0EF7E","s":"_ZN5mongo11msgassertedEiPKc"},{"b":"400000","o":"B0F02C"},{"b":"400000","o":"4122EB","s":"_ZNK5mongo7BSONO
    2016-01-21T19:17:29.491+1100 I COMMAND  [conn3] command test.$cmd command: splitVector { splitVector: "test.mgendata", keyPattern: { _id: 1.0 }, maxChunkSizeBytes: 1024.0 } keyUpdates:0 writeConflicts:0 numYields:15625 reslen:239 locks:{ Global: { acquireCount: { r: 31252 } }, Database: { acquireCount: { r: 15626 } }, Collection: { acquireCount: { r: 15626 } } } 1258ms
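For reference, step 1 can be done with a shell loop along these lines (a minimal sketch: the collection name matches the repro above, the document shape is arbitrary, and a shell with insertMany is assumed):

    // Populate test.mgendata with 1,000,000 small documents; only the
    // implicitly generated _id values matter for this reproduction.
    var coll = db.getSiblingDB("test").mgendata;
    var batch = [];
    for (var i = 0; i < 1000000; i++) {
        batch.push({ i: i });
        if (batch.length === 1000) {    // insert in batches of 1,000
            coll.insertMany(batch);
            batch = [];
        }
    }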

There are two problems with the existing behaviour:
1) The error message is misleading as it may be interpreted as corruption-related
2) splitVector fails outright, which should not be necessary: sharding should be able to carry on even if the list of split points returned is incomplete (truncated)

If we only address the first problem, we could make the error message clearer:
“Could not split this chunk because it results in too many small chunks. Increase the max chunks size. For more info see …."

For problem (2), we could make splitVector truncate the list of split points so that the response fits within the 16 MB limit.
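Until something along those lines exists, a caller can approximate the truncated behaviour from the client side by asking for split points in bounded batches. This is a rough sketch only: it assumes the splitVector options min, max and maxSplitPoints and the splitKeys response field behave as they do for the internal command today, and it omits error handling:

    // Gather split points in batches so no single splitVector response has to
    // carry all of them at once (assumes min/max/maxSplitPoints and splitKeys
    // work as in the internal command; run against the node holding the data).
    var allSplitPoints = [];
    var lowerBound = { _id: MinKey };
    while (true) {
        var res = db.getSiblingDB("test").runCommand({
            splitVector: "test.mgendata",
            keyPattern: { _id: 1 },
            maxChunkSizeBytes: 1 * 1024,
            min: lowerBound,
            max: { _id: MaxKey },
            maxSplitPoints: 10000       // keep each response well below 16 MB
        });
        assert.commandWorked(res);
        if (res.splitKeys.length === 0) {
            break;                      // no more split points in this range
        }
        allSplitPoints = allSplitPoints.concat(res.splitKeys);
        lowerBound = res.splitKeys[res.splitKeys.length - 1];
    }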



 Comments   
Comment by Kaloian Manassiev [ 15/Nov/21 ]

This ticket has been open for more than 5 years and due to competing priorities we have not been able to schedule it. While it may be a good idea for some narrow use cases, given the amount of work required, we will not fix it in the foreseeable future. The way it works now is sufficient for our internal use cases.

Comment by Andrew Ryder (Inactive) [ 08/Jun/16 ]

schwerin, that makes sense. Leveraging $sample for other tasks is great. In the case of splitVector (and probably the hadoop-connector too), the sample size just needs to be somewhat larger than the expected output so the distribution comes out fairly flat.

Comment by Andy Schwerin [ 08/Jun/16 ]

To answer your question, andrew.ryder, I had thought to have current users of splitVector instead use a $sample aggregation, but oversample as you suggest in your second concept. However, the work of choosing which of those returned samples to use as split points would be the caller's responsibility or the job of a later stage in the aggregation pipeline that contained the $sample. The goal would be to remove the use of the splitVector command by sharding. This has the advantage of reducing the amount of reading done by splitVector, as you mentioned.

The impediment to this approach is that splitVector also does chunk size and document count estimation, today, and finding a substitute for that functionality would be difficult in the short term. That might argue for keeping splitVector around for a little longer.
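For concreteness, the oversample-and-thin idea might look roughly like the following (a sketch only; the 10x oversampling factor, the pipeline stages and the client-side thinning are assumptions, not an agreed design):

    // Oversample with $sample, sort the sampled shard keys, then keep every
    // 10th one as a candidate split point (factor and thinning are assumptions).
    var desiredSplitPoints = 1000;
    var oversample = 10;
    var sampled = db.getSiblingDB("test").mgendata.aggregate([
        { $sample: { size: desiredSplitPoints * oversample } },
        { $project: { _id: 1 } },
        { $sort: { _id: 1 } }
    ]).toArray();

    var splitPoints = [];
    for (var i = oversample; i < sampled.length; i += oversample) {
        splitPoints.push(sampled[i]);
    }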

Comment by Max Hirschhorn [ 07/Jun/16 ]

dmitry.ryabtsev, andrew.ryder, it is my understanding from pasette and spencer that the usage of the "splitVector" command in both the Spark and Hadoop connectors is undesirable; the "splitVector" command is intended to be an internal command for supporting sharded clusters.

Regarding the Spark and Hadoop connectors

https://docs.mongodb.com/ecosystem/tutorial/getting-started-with-hadoop/#mongodb

The documentation page doesn't list any assumptions about the behavior of the "splitVector" command - it doesn't even really explain to the user what the command is going to be used for. I would prefer any discussion about the requirements for the Spark and Hadoop connectors to be done in SERVER-24274 because a separate ticket has already been filed to try and address the permissions issue with the "splitVector" command approach.

Regarding an earlier comment in this ticket

"I don't think using $sample is a good idea as it won't guarantee an even distribution of split points across the data range."

I think it might be worth clarifying what it is exactly you want an even distribution of.

  1. Is it approximately an equal number of documents in each range?
  2. Is it approximately an equal number of (uncompressed) bytes in each range?
  3. Something else?

I might be missing some details about how chunk ranges and balancing work, but my impression is that the system would be self-correcting. For example, even if the initial assignments of chunks to shards were slightly uneven via some bias in the storage engine's random cursor, then the system would eventually move chunks from shards with more and/or larger chunks to shards with fewer and/or smaller chunks.

Comment by Andrew Ryder (Inactive) [ 07/Jun/16 ]

schwerin, there was some debate/confusion here over how you propose to utilize $sample. Do you mean:

  1. Return the results of $sample in place of the current splitVector
  2. Use a large $sample to provide a smaller set of seed values for splitVector to select from so it can avoid reading the majority of the collection.
  3. Something else.

#1 I think would suck.
#2 I think would be fine, but we'd need to ensure at least 5 (or 10?) samples per expected split point to ensure a somewhat even distribution.

Comment by Andrew Ryder (Inactive) [ 07/Jun/16 ]

schwerin, I agree with Dmitry. Also note that our own hadoop-connector expects splitVector to provide a flat distribution in order to parcel up work evenly: https://docs.mongodb.com/ecosystem/tutorial/getting-started-with-hadoop/#mongodb

Comment by Dmitry Ryabtsev [ 07/Jun/16 ]

Hi schwerin, I don't think using $sample is a good idea as it won't guarantee an even distribution of split points across the data range.

Comment by Andy Schwerin [ 06/Jun/16 ]

An alternative might be for us to start using a $sample aggregation to establish the split points. That would get us a cursor for free, and might have better performance than splitVector, since it wouldn't need to do a whole index scan.

dmitry.ryabtsev, do you have other uses of splitVector that would preclude this approach?
