Spark Connector / SPARK-266

No performance difference between bulkWrite and upsert

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Done
    • Affects Version/s: 2.3.2
    • Fix Version/s: None
    • Component/s: Performance
    • Labels: None

      Description

      I've got a couple of questions here, but the main one is the following.

      I'm trying to write a DataFrame of around 90k documents by performing an upsert for each element inside a .foreach:

       

       

      val mongoConnector = MongoConnector(writeConfig.asOptions)
      writeData.foreach(item =>
        mongoConnector.withCollectionDo(writeConfig, { collection: MongoCollection[Document] =>
            collection.updateOne( /** Has upsert == true **/)
        })
      )
      

      I observe CPU spikes to 100% on the primary whenever I perform these upserts. The collection has no indexes, which is most probably the main cause.

       

      However, when I perform the same operation with .bulkWrite as follows...

       

      val writeData = dataDf
        .collect // Runs out of memory here without a proper config when there are many more than 90k elements.
        .map(
          item =>
            UpdateOneModel( /** Has upsert == true **/ ))

      val mongoConnector = MongoConnector(writeConfig.asOptions)

      mongoConnector.withCollectionDo(writeConfig, { collection: MongoCollection[Document] =>
          collection.bulkWrite(
            writeData.toList.asJava
          )
      })
      

      ... the CPU graph seems more linear. The load is clearly higher, but nowhere near 100%.

      Also, both approaches take approximately 40 minutes to complete, so there is no time gain.

      Any clue about an optimization I could add, or a much better way to perform bulk writes?
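
      One possible direction, offered as a sketch under assumptions rather than as the connector's documented method: instead of collecting the whole DataFrame on the driver, each partition's iterator can be drained in fixed-size groups, so that only one batch is held in memory at a time and each group becomes a single bulkWrite call. The helper below is plain Scala with no Spark or MongoDB dependency. The names BatchedWrites and writeInBatches, and any particular batch size, are hypothetical and chosen for illustration.

```scala
object BatchedWrites {

  /** Drains `items` lazily in groups of at most `batchSize` elements,
    * hands each group to `write`, and returns how many groups were written.
    *
    * `write` is a stand-in (an assumption of this sketch) for whatever
    * performs the actual bulk operation, e.g. one bulkWrite call per group.
    */
  def writeInBatches[A](items: Iterator[A], batchSize: Int)(write: Seq[A] => Unit): Int = {
    require(batchSize > 0, "batchSize must be positive")
    var batches = 0
    // Iterator.grouped materializes only one group at a time,
    // so the full data set is never held in memory at once.
    items.grouped(batchSize).foreach { batch =>
      write(batch)
      batches += 1
    }
    batches
  }
}
```

      In the job above, this would presumably be called inside dataDf.foreachPartition, with write wrapping collection.bulkWrite(batch.map(toUpdateModel).asJava), where toUpdateModel is a hypothetical function that builds the UpdateOneModel for one row. For a single partition of 90,000 elements and an assumed batch size of 512, that is 176 bulk calls rather than 90,000 single updates. Batching alone does not remove the cost of matching each upsert filter against a collection without an index, so an index on the filter field is likely still needed.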

       

            People

            Assignee:
            ross.lawley Ross Lawley
            Reporter:
            edgarherrero@protonmail.com Eddy H