[JAVA-2410] Create custom Spliterator implementation for MongoIterable Created: 14/Dec/16  Updated: 30/Mar/22

Status: Backlog
Project: Java Driver
Component/s: Query Operations
Affects Version/s: None
Fix Version/s: None

Type: New Feature
Priority: Major - P3
Reporter: Charles DuBose
Assignee: Unassigned
Resolution: Unresolved
Votes: 2
Labels: Java8
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Heroku Linux


 Description   

I've been playing around with using a spliterator to process a result set in parallel, and I've discovered that it doesn't behave the way I expect.

What I expect to happen:

  • findIterable.spliterator().characteristics() returns SUBSIZED
  • findIterable.spliterator().trySplit() returns a spliterator over a number of records equal to the batch size (if set)
  • streams are able to process records in parallel
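For comparison, here is roughly the expected behavior, illustrated with a hypothetical standalone demo using ArrayList (not driver code): a collection with a proper sized spliterator reports SUBSIZED and trySplit() divides the remaining elements evenly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Spliterator;

// Hypothetical demo, not driver code: ArrayList implements a sized
// spliterator, so characteristics() includes SUBSIZED and trySplit()
// divides the remaining elements evenly, which is the behavior expected above.
public class ExpectedSplitDemo {
    public static void main(String[] args) {
        List<Integer> docs = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            docs.add(i);
        }

        Spliterator<Integer> s = docs.spliterator();
        System.out.println((s.characteristics() & Spliterator.SUBSIZED) != 0); // true

        Spliterator<Integer> firstHalf = s.trySplit();
        System.out.println(firstHalf.estimateSize()); // 50
        System.out.println(s.estimateSize());         // 50
    }
}
```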

What I'm finding to happen:

  • findIterable.spliterator().characteristics() returns 0
  • findIterable.spliterator().trySplit() returns spliterators of inconsistent sizes, starting with 1024 results. The next split gives 2048 results. I'm not sure what the subsequent trySplit() gives, as I run out of memory before it returns.
  • when used with a stream, as far as I can tell, it burns through a large number of batches to fill the first and second splits, processes for a while, then fails with an OOM error once it tries to get the third split

Pseudojava for what I'm trying to accomplish:

      import java.util.Spliterator;
      import java.util.stream.Stream;
      import java.util.stream.StreamSupport;
      import org.bson.Document;
      import com.mongodb.client.FindIterable;

      // MongoStore.filterAll is an application helper returning a FindIterable<Document>
      FindIterable<Document> allRecords = MongoStore.filterAll(databaseName, collectionName, startDate, endDate);

      Spliterator<Document> spliterator = allRecords.spliterator();
      // the second argument requests a parallel stream
      Stream<Document> docStream = StreamSupport.stream(spliterator, true);

      docStream.forEach(document -> {
        // process each document
      });

Is this just a case of the spliterator not being implemented in mongo, or am I using it wrong?



 Comments   
Comment by Ian Whalen (Inactive) [ 07/Jan/19 ]

Not certain that we want to do this, but will revisit in the post 4.0 time frame.

Comment by Charles DuBose [ 15/Dec/16 ]

I think there's value in splitting on the batch. Each trySplit could return the next batch, retrieved from the cursor. Once the batch is realized, then it can be processed in parallel. Say you set a batch size of 50 and are running 4 streams. The stream would pull the cursor 4 times, process the records, then pull the cursor again each time the batch finishes processing. Mongo itself doesn't need to be parallel, just provide a sane split for the actual parallel stream.

I could probably write something custom to do the same thing, but I'm not certain that FindIterable provides that level of control over pulling each batch. I guess I could do it the old-fashioned way: push iterator.next() into an array batchSize times, return that as the trySplit() result, and keep that up until there's nothing left.
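The batch-per-trySplit idea could be sketched roughly as follows. This is hypothetical standalone code, not driver code: `BatchingSpliterator` is an invented name, and the integer iterator stands in for a cursor such as MongoCursor<Document>.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Spliterator;
import java.util.function.Consumer;
import java.util.stream.IntStream;
import java.util.stream.StreamSupport;

// Hypothetical sketch of the batch-per-split idea above: trySplit()
// realizes one batch from the underlying cursor and hands it off as a
// SIZED spliterator, while the cursor itself is only ever advanced by
// the thread that currently owns this spliterator.
public class BatchingSpliteratorDemo {

    public static class BatchingSpliterator<T> implements Spliterator<T> {
        private final Iterator<T> cursor; // e.g. a MongoCursor<Document>
        private final int batchSize;

        public BatchingSpliterator(Iterator<T> cursor, int batchSize) {
            this.cursor = cursor;
            this.batchSize = batchSize;
        }

        @Override
        public boolean tryAdvance(Consumer<? super T> action) {
            if (!cursor.hasNext()) {
                return false;
            }
            action.accept(cursor.next());
            return true;
        }

        @Override
        public Spliterator<T> trySplit() {
            // Realize at most one batch in memory, then let another
            // thread process it via the list's SIZED spliterator.
            List<T> batch = new ArrayList<>(batchSize);
            while (batch.size() < batchSize && cursor.hasNext()) {
                batch.add(cursor.next());
            }
            return batch.isEmpty() ? null : batch.spliterator();
        }

        @Override
        public long estimateSize() {
            return Long.MAX_VALUE; // total result size is unknown
        }

        @Override
        public int characteristics() {
            return ORDERED;
        }
    }

    public static void main(String[] args) {
        // Stand-in for a cursor: 1000 integers pulled one at a time.
        Iterator<Integer> cursor = IntStream.range(0, 1000).boxed().iterator();
        long sum = StreamSupport
                .stream(new BatchingSpliterator<>(cursor, 50), true)
                .mapToLong(Integer::longValue)
                .sum();
        System.out.println(sum); // 499500
    }
}
```

Memory usage stays bounded by the batches in flight rather than the whole result set, which is the tradeoff described above.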

Comment by Jeffrey Yemin [ 15/Dec/16 ]

The driver does not implement any special support for Spliterator, so your application is ending up with the default implementation from java.lang.Iterable, which returns:

    Spliterators.spliteratorUnknownSize(iterator(), 0)
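This fallback can be reproduced without the driver. In this hypothetical demo, `iterableOnly` stands in for any Iterable (like MongoIterable) that does not override spliterator():

```java
import java.util.Arrays;
import java.util.List;
import java.util.Spliterator;

// Hypothetical demo, not driver code: an Iterable that does not override
// spliterator() falls back to java.lang.Iterable's default, which reports
// no characteristics and an unknown size, matching the characteristics() == 0
// observation in the description.
public class DefaultSpliteratorDemo {
    public static void main(String[] args) {
        List<String> docs = Arrays.asList("a", "b", "c");
        Iterable<String> iterableOnly = docs::iterator; // hides List's sized spliterator

        Spliterator<String> split = iterableOnly.spliterator();
        System.out.println(split.characteristics()); // 0
        System.out.println(split.estimateSize());    // Long.MAX_VALUE (unknown)
    }
}
```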

I'm not all that familiar with Spliterator implementations, but it strikes me that supporting a parallel Spliterator in the driver may not be possible, as MongoDB cursors return batches one at a time. So in terms of memory usage, it would be no better than, say:

    ArrayList<Document> documents = allRecords.into(new ArrayList<>());
    Stream<Document> docStream = StreamSupport.stream(documents.spliterator(), true);

This will require more research, but I'm interested in any initial feedback you may have.

Generated at Thu Feb 08 08:57:08 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.