Type: Task
Resolution: Done
Priority: Major - P3
Affects Version/s: 2.3.2
Component/s: Performance
I've got a couple of questions here, but the main one is the following.
I'm trying to insert a DataFrame of around 90k documents, performing upserts inside a .foreach over the elements.
val mongoConnector = MongoConnector(writeConfig.asOptions)
writeData.foreach(item =>
  mongoConnector.withCollectionDo(writeConfig, { collection: MongoCollection[Document] =>
    collection.updateOne( /** Has upsert == true **/ )
  })
)
I observe huge 100% CPU peaks on the primary whenever I perform these upserts. The collection lacks indexes, which is most probably the main issue there.
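(For what it's worth, I assume I could create the index with something like the sketch below, reusing the connector from above; "filterField" is a placeholder for whatever field my upsert filter actually matches on.)

import com.mongodb.client.MongoCollection
import com.mongodb.client.model.Indexes
import org.bson.Document

// "filterField" stands in for the real upsert filter key.
mongoConnector.withCollectionDo(writeConfig, { collection: MongoCollection[Document] =>
  collection.createIndex(Indexes.ascending("filterField"))
})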
However, when I perform the same operation with .bulkWrite in the following way...
import scala.collection.JavaConverters._

val writeData = dataDf
  .collect // Happens to run out of memory here without a proper config if I have a lot of elements, way more than 90k.
  .map(item => UpdateOneModel( /** Has upsert == true **/ ))

val mongoConnector = MongoConnector(writeConfig.asOptions)
mongoConnector.withCollectionDo(writeConfig, { collection: MongoCollection[Document] =>
  collection.bulkWrite(writeData.toList.asJava)
})
... the CPU graph looks much more linear. The sustained load is clearly higher, but nowhere near 100%.
Also, both approaches take approximately 40 minutes to complete, so there is no time gain either way.
Any clue about optimizations I could apply, or a much better way to perform bulk writes?
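For context, the direction I was considering is batching the upserts per partition, so nothing has to be collected to the driver. This is only a sketch, not something I've validated: the batch size of 1000, the "_id" and "value" column names, and the unordered bulk option are all assumptions on my part, and dataDf / writeConfig are the same values as in the snippets above.

import scala.collection.JavaConverters._

import com.mongodb.client.MongoCollection
import com.mongodb.client.model.{BulkWriteOptions, Filters, UpdateOneModel, UpdateOptions, Updates, WriteModel}
import com.mongodb.spark.MongoConnector
import org.bson.Document

dataDf.rdd.foreachPartition { partition =>
  // The connector is built on each executor, so nothing is collected to the driver.
  val mongoConnector = MongoConnector(writeConfig.asOptions)
  // Bound executor memory by writing in fixed-size batches; 1000 is a guess.
  partition.grouped(1000).foreach { batch =>
    val models: Seq[WriteModel[Document]] = batch.map { row =>
      // "_id" and "value" are hypothetical column names; the real upsert key goes here.
      new UpdateOneModel[Document](
        Filters.eq("_id", row.getAs[String]("_id")),
        Updates.set("value", row.getAs[String]("value")),
        new UpdateOptions().upsert(true))
    }
    mongoConnector.withCollectionDo(writeConfig, { collection: MongoCollection[Document] =>
      // ordered(false) lets the server apply independent upserts without
      // stopping the whole batch at the first error.
      collection.bulkWrite(models.asJava, new BulkWriteOptions().ordered(false))
    })
  }
}

The thinking behind it: each executor builds its own bounded batches, so driver memory stays flat regardless of the DataFrame size, and the unordered option avoids serializing independent upserts on the server. Is this a reasonable direction, or is there a better pattern?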