  1. Python Integrations
  2. INTPYTHON-253

Schema questions for pymongoarrow when converting from pymongo

    • Type: Question
    • Resolution: Unresolved
    • Priority: Minor - P4
    • Affects Version/s: None
    • Component/s: None
    • Python Drivers

      Via https://github.com/mongodb-labs/mongo-arrow/issues/239

       

      We have 3 MongoDB collections that each hold a permissions object, and all 3 are slightly different in structure.

      ```
      "permissions": [
          {
              "activity": "never"
          },
          {
              "pushNotifications": "always",
              "location": "foreground"
          }
      ],
      ```

      ```
      "permissions": {
          "geolocation": "prompt"
      },
      ```

      ```
      "permissions": [
          {
              "activity": "never"
          },
          {
              "location": "foreground"
          },
          {
              "pushNotifications": "always"
          }
      ],
      ```

      In pymongo I could just project the object and then convert the pandas column to a string: ```df[c] = df[c].astype(pd.StringDtype())```

      Then, using fastparquet as the engine, I write to Parquet and get output like this:

      ```[{'location': 'notRequested'}, {'activity': 'never'}, {'pushNotifications': 'never'}, {'backgroundAuthStatus': 'permitted'}, {'att': 'denied'}, {'isPrecise': 'notRequested'}, {'adPersonalisation': 'true'}]```
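For reference, the single-quoted output above has the same shape that Python's `str()` produces for a list of dicts. That the pandas string conversion falls back to `str()` for such objects is an assumption here, not something confirmed in this ticket. A minimal standard-library sketch, with illustrative sample values and a hypothetical helper `is_valid_json`, comparing that form with `json.dumps`:

```python
import json

# Illustrative sample, modelled on the permission objects shown above.
permissions = [{"activity": "never"}, {"location": "foreground"}]

as_repr = str(permissions)         # Python repr: single-quoted keys and values
as_json = json.dumps(permissions)  # strict JSON: double-quoted keys and values


def is_valid_json(text):
    """Return True if `text` parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False
```

Only the `json.dumps` form parses as strict JSON, which may matter if the stored strings are later parsed by JSON functions in queries.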

      I am having issues when converting to pymongoarrow. If I set the schema entry to ```"permissions": pa.list_(pa.string()),```
      then I get null/None. I have also tried pa.struct, but then I get empty values for the keys that are missing from a given object.
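The empty values seen with a struct type are consistent with how a fixed struct behaves: every element carries every declared field, so keys absent from a given object come back as null. The following is a plain-Python analogy of that behaviour, not pymongoarrow code. `FIELDS` and `as_struct_row` are hypothetical names, and the field list is simply the set of keys that appear in the samples above.

```python
# Hypothetical field list: the keys seen in the sample documents above.
FIELDS = ["activity", "pushNotifications", "location", "geolocation"]


def as_struct_row(item, fields=FIELDS):
    """Mimic a fixed struct: every field is present, absent ones become None."""
    return {name: item.get(name) for name in fields}


# Two permission objects from the first sample collection.
sample = [
    {"activity": "never"},
    {"pushNotifications": "always", "location": "foreground"},
]
rows = [as_struct_row(item) for item in sample]
```

Each row ends up with all four keys, most of them `None`, which is the sparse result described above.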

      Currently the projection in my query is:
      ```
      'permissions': {
          '$map': {
              'input': '$os.permissions',
              'as': 'permission',
              'in': {
                  '$function': {
                      'body': 'function(perm) { return JSON.stringify(perm); }',
                      'args': [
                          '$$permission'
                      ],
                      'lang': 'js'
                  }
              }
          }
      },
      ```
      with a schema element of ```"permissions": pa.list_(pa.string()),```
      but I then need to convert the column with
      ```df['permissions'] = df['permissions'].apply(list).astype(str).str.replace("'", "").str.replace('"', "'")```
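The chained `str.replace` calls above would also rewrite any quote characters that occur inside the values themselves. One possible alternative, sketched with the standard library only, is to parse each JSON string and let Python build the text. `permissions_to_text` is a hypothetical helper name and the sample values are illustrative.

```python
import json


def permissions_to_text(items):
    """Parse a list of JSON strings and return the repr of the list of dicts."""
    if items is None:
        # Keep missing values as None instead of failing on iteration.
        return None
    return str([json.loads(item) for item in items])


# Illustrative input: the kind of strings JSON.stringify returns per element.
sample = ['{"activity":"never"}', '{"location":"foreground"}']
text = permissions_to_text(sample)
```

Assuming each cell of the column holds a list of JSON strings, a helper like this could be applied per row (for example with `Series.apply`) in place of the replace chain.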

      There must be an easier way to deal with these JSON objects as strings. Ultimately they end up in Redshift, so they can be parsed in queries. Any help or suggestions would be appreciated for something I thought would be quite simple.

      I have spent 3 days working with the Mongo data and converting a migration to pymongoarrow. The other collections have been a breeze: memory consumption has come down and there is a speed improvement.

      John

            Assignee: Unassigned
            Reporter: Alex Clark (alex.clark@mongodb.com)
            Votes: 0
            Watchers: 1

              Created:
              Updated: