Keep track of previously uploaded object keys in S3EmitOperator


    • Type: Task
    • Resolution: Fixed
    • Priority: Major - P3
    • 8.2.0-rc0
    • Affects Version/s: None
    • Component/s: None
    • Atlas Streams
    • Fully Compatible
    • 3

      Background

       

      For background, the S3EmitOperator will write files to S3 with the following schema for the key:

       

      <path>/<wall time>-<taskId>-<part number>.<extension>

       

      For example:

      myStaticPath/1739652142669-1ac2-0000.json

       

      The wall time has millisecond precision, so two files can share the same wall time value; the part number disambiguates them. A sink writer increments the part number whenever it sees that it is writing two files with the same wall time.
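
As a rough illustration of the key layout described above, the following Python sketch assembles a key from its parts. The function name `format_key`, the hyphen separators, and the four-digit zero-padded part number are assumptions inferred from the example key, not confirmed details of S3EmitOperator.

```python
def format_key(path: str, wall_time_ms: int, task_id: str,
               part: int, extension: str) -> str:
    """Build an object key of the form <path>/<wall time>-<taskId>-<part>.<ext>.

    Assumption: the part number is zero-padded to four digits, as in the
    example key "myStaticPath/1739652142669-1ac2-0000.json".
    """
    return f"{path}/{wall_time_ms}-{task_id}-{part:04d}.{extension}"


# Two files written in the same millisecond differ only in the part number.
first = format_key("myStaticPath", 1739652142669, "1ac2", 0, "json")
second = format_key("myStaticPath", 1739652142669, "1ac2", 1, "json")
```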

       

       

      Suggested Implementation

       

      The sink writer should keep a set (or map) of the object keys it has already uploaded.

       

      When generating an S3 object key, the sink writer should check the set to see whether it has already used that key. If so, it should increment the part number and repeat the check. Once it finds a key that has not been uploaded before, it records the key in the set and uses it for the upload.
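
The check-and-increment loop described above could look roughly like the Python sketch below. It is only an illustration under stated assumptions: the class name `UploadedKeyTracker`, the method `next_key`, the hyphen separators, and the four-digit part number are hypothetical and are not taken from the actual S3EmitOperator code.

```python
class UploadedKeyTracker:
    """Hypothetical helper: remembers every key handed out for upload."""

    def __init__(self) -> None:
        # Keys this sink writer has already used.
        self._uploaded = set()

    def next_key(self, path: str, wall_time_ms: int,
                 task_id: str, extension: str) -> str:
        """Return a key not used before, bumping the part number on collision."""
        part = 0
        while True:
            # Assumed layout: <path>/<wall time>-<taskId>-<part>.<extension>
            key = f"{path}/{wall_time_ms}-{task_id}-{part:04d}.{extension}"
            if key not in self._uploaded:
                # Record the key before returning so later calls skip it.
                self._uploaded.add(key)
                return key
            part += 1


tracker = UploadedKeyTracker()
a = tracker.next_key("myStaticPath", 1739652142669, "1ac2", "json")
b = tracker.next_key("myStaticPath", 1739652142669, "1ac2", "json")  # same millisecond
c = tracker.next_key("myStaticPath", 1739652142670, "1ac2", "json")  # next millisecond
```

In this sketch the set grows with every upload; whether and when old entries can be dropped is left open here.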

              Assignee:
              Andrew Chen
              Reporter:
              Andrew Chen