Uploaded image for project: 'Spark Connector'
  1. Spark Connector
  2. SPARK-152

can not save DataFrame correctly after join() action

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Cannot Reproduce
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: Writes
    • Labels:
    • Environment:
    • # Replies:
      4
    • Last comment by Customer:
      false

      Description

      0. run command

      ```
      ./bin/spark-submit \
      --master "spark://YOUR_HOST_NAME:7077" \
      --deploy-mode client \
      --executor-memory 3G \
      --num-executors 2 \
      --conf "spark.mongodb.input.uri=mongodb://127.0.0.1:27017/test" \
      --conf "spark.mongodb.output.uri=mongodb://127.0.0.1:27017/test" \
      --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 \
      /YOUR/PATH/TO/THIS/joinErr.py
      ```

      1. insert some data into test db

      ```
      db.calc.drop()

      db.calc.insert([

      {devId: "001", prdId: "product1", size: 10}

      ,

      {devId: "002", prdId: "product2", size: 20}

      ])

      db.calc.find()

      { "_id" : ObjectId("59f284a8a22c47422c2059d2"), "devId" : "001", "prdId" : "product1", "size" : 10 } { "_id" : ObjectId("59f284a8a22c47422c2059d3"), "devId" : "002", "prdId" : "product2", "size" : 20 }

      ```

      2. run the command in section 0, its output show as below

      ```
      ==> collect sumDF: [Row(devId=u'001', prdId=u'product1', size=10), Row(devId=u'002', prdId=u'product2', size=20)]
      ==> collect dpDF: [Row(devId=u'001', prdId=u'product1', size=1), Row(devId=u'002', prdId=u'product2', size=2)]
      ==> final sumDF.collect: [Row(devId=u'002', prdId=u'product2', size=22), Row(devId=u'001', prdId=u'product1', size=11)]
      ```
      Yep!, the infal sumDF is what I wanted. but look in the mongo shell:

      ```
      db.calc.find()

      { "_id" : ObjectId("59f2855896b438192f213ecf"), "devId" : "001", "prdId" : "product1", "size" : NumberLong(1) } { "_id" : ObjectId("59f2855896b438192f213ece"), "devId" : "002", "prdId" : "product2", "size" : NumberLong(2) }

      ```
      it seems that sumDF is replaced by dpDF.

      3. if sumDF is create by hand or read throud Pymongo, it works as intended. just as test-case2 and test-case3 indicated.

        Attachments

          Activity

            People

            • Votes:
              1 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:
                Days since reply:
                1 year, 50 weeks, 6 days ago
                Date of 1st Reply: