Type: Task
Resolution: Fixed
Priority: Major - P3
Affects Version/s: None
Component/s: None
Atlas Streams
Fully Compatible
3
Background
The S3EmitOperator writes files to S3 using the following schema for the object key:
<path>/<wall time>-<taskId>-<part number>.<extension>
For example:
myStaticPath/1739652142669-1ac2-0000.json
The wall time has millisecond precision, so the part number is used to disambiguate between two files with the same wall time value. The sink writer increments the part number whenever it sees that it is writing two files with the same wall clock time.
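As a rough illustration of the key schema above, here is a minimal Python sketch. The function name `format_s3_key` and its parameters are hypothetical, and the hyphen separators and four-digit zero-padded part number are assumptions inferred from the example key, not confirmed details of the S3EmitOperator.

```python
def format_s3_key(path: str, wall_time_ms: int, task_id: str,
                  part_number: int, extension: str) -> str:
    """Build an object key of the form
    <path>/<wall time>-<taskId>-<part number>.<extension>.

    Assumption: the part number is zero-padded to four digits,
    as in the example key in this ticket.
    """
    return f"{path}/{wall_time_ms}-{task_id}-{part_number:04d}.{extension}"


# Reproduces the example key from this ticket.
example_key = format_s3_key("myStaticPath", 1739652142669, "1ac2", 0, "json")
```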
Suggested Implementation
The sink writer should hold a set or map that tracks the filenames it has previously uploaded.
When generating an S3 object key, the writer should check that set/map to see whether it has already used the name. If so, it should increment the part number and repeat the check. Once it finds a name that has not previously been uploaded, it records that name in the set/map and uses it for the upload.
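The check-and-increment loop described above could look roughly like the following Python sketch. The class name `S3KeyGenerator`, its method names, and the key formatting (hyphen separators, four-digit part number) are illustrative assumptions, not the actual sink writer code.

```python
class S3KeyGenerator:
    """Hypothetical helper that hands out object keys a writer has not used yet."""

    def __init__(self, path: str, task_id: str, extension: str):
        self._path = path
        self._task_id = task_id
        self._extension = extension
        # Names this writer has already handed out for upload.
        self._used_keys = set()

    def _format(self, wall_time_ms: int, part_number: int) -> str:
        # Assumed layout: <path>/<wall time>-<taskId>-<part number>.<extension>
        return (f"{self._path}/{wall_time_ms}-{self._task_id}-"
                f"{part_number:04d}.{self._extension}")

    def next_key(self, wall_time_ms: int) -> str:
        part_number = 0
        while True:
            key = self._format(wall_time_ms, part_number)
            if key not in self._used_keys:
                # Record the name before using it for the upload.
                self._used_keys.add(key)
                return key
            # Name already used: bump the part number and check again.
            part_number += 1


gen = S3KeyGenerator("myStaticPath", "1ac2", "json")
first = gen.next_key(1739652142669)
second = gen.next_key(1739652142669)  # same wall time, so the part number advances
third = gen.next_key(1739652142670)   # new wall time, so the part number starts at 0 again
```

In this sketch the set grows for the lifetime of the writer; whether and when old entries can be dropped is left open here, since the ticket does not specify it.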