Type: Task
Resolution: Done
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 2.3.2
Component/s: Performance
Labels: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Link: None
Goal Name(s): None

I've got a couple of questions here, but the main one is the following.

I'm trying to write a DataFrame of around 90k documents by performing one upsert per element inside a .foreach:

val mongoConnector = MongoConnector(writeConfig.asOptions)
writeData.foreach(item =>
  mongoConnector.withCollectionDo(writeConfig, { collection: MongoCollection[Document] =>
      collection.updateOne( /** Has upsert == true **/)
  })
)

I observe huge CPU peaks of 100% on the primary whenever I perform these upserts. The collection lacks indexes, which is most probably the main issue.

However, when I perform the same operation with .bulkWrite the following way...

val writeData = dataDf
  .collect // Runs out of memory here without a proper config when there are many more than 90k elements.
  .map(
    item =>
      UpdateOneModel( /** Has upsert == true **/ ))
  
val mongoConnector = MongoConnector(writeConfig.asOptions)

mongoConnector.withCollectionDo(writeConfig, { collection: MongoCollection[Document] =>
    collection.bulkWrite(
      writeData.toList.asJava
    )
})

... the CPU graph seems more linear. The load is clearly higher than when idle, but nowhere near 100%.

Both approaches take approximately 40 minutes to complete, so there is no time gain.

Is there any optimization I could add, or a better way to perform bulk writes?
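One direction that might be worth trying is to batch the writes per partition instead of collecting everything on the driver. The sketch below shows only the batching step in plain Scala, so it runs without Spark or MongoDB. `BatchedWrites`, `writeInBatches`, and the `write` callback are hypothetical names introduced for illustration; they are not part of the connector.

```scala
object BatchedWrites {
  // Splits `items` into groups of at most `batchSize` elements and hands each
  // group to `write`. Only one group is materialized at a time, so the full
  // data set never has to fit in memory. Returns the number of batches written.
  // `write` is a stand-in (assumption) for whatever performs the actual bulk call.
  def writeInBatches[A](items: Iterator[A], batchSize: Int)(write: Seq[A] => Unit): Int = {
    require(batchSize > 0, "batchSize must be positive")
    var batches = 0
    items.grouped(batchSize).foreach { batch =>
      write(batch)
      batches += 1
    }
    batches
  }
}
```

With the snippets above, this would presumably be called inside dataDf.rdd.foreachPartition, with `write` wrapping collection.bulkWrite(batch.asJava) inside withCollectionDo, which avoids the .collect on the driver. Whether it shortens the 40 minutes is not certain; if the missing index is the real bottleneck, an index on the fields used in the upsert filter may matter more than the batching.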

Assignee: Ross Lawley
Reporter: Eddy H
Reviewers: None
Votes: 0
Watchers: 2

Created: Jan 20 2020 04:39:16 PM UTC
Updated: Sep 22 2021 06:48:55 PM UTC
Resolved: Jan 29 2020 03:45:14 PM UTC
