- Type: Task
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Atlas Streams
When restoring from a large state checkpoint, the restore fails, and the cause appears to be this error:
(ExceededMemoryLimit) $push used too much memory and cannot spill to disk. Memory limit: 104857600 bytes
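For reference, the limit reported in the error message works out to exactly 100 MiB:

```javascript
// The memory limit from the error message, in bytes.
const limitBytes = 104857600;

// Convert to MiB (1 MiB = 1024 * 1024 bytes).
const limitMiB = limitBytes / (1024 * 1024);
console.log(limitMiB); // → 100
```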
The full logs for this can be obtained from this query in Splunk:
This stream processor (SP) first created a fairly large checkpoint (~5.8 GB) and was then stopped. When it was later started, the start failed with the error above. The query the SP is running is:
const pipeline = [
  { $source: {
      connectionName: "Cluster0",
      db: "mk-testdb",
      coll: "inputColl",
      timeField: { $toDate: "$fullDocument.ts" }
  } },
  { $replaceRoot: { newRoot: "$fullDocument" } },
  { $project: { value: { $range: [1, "$idx"] }, ts: "$ts" } },
  { $unwind: "$value" },
  { $addFields: {
      customerId: { $mod: ["$value", 50] },
      max: "$value",
      idarray0: ["$_id", "$_id", "$_id", "$_$id", "$_id", "$_id"],
      idarray1: ["$_id", "$_id", "$_id", "$_$id", "$_id", "$_id"],
      idarray2: ["$_id", "$_id", "$_id", "$_$id", "$_id", "$_id"],
      idarray3: ["$_id", "$_id", "$_id", "$_$id", "$_id", "$_id"]
  } },
  { $tumblingWindow: {
      interval: { size: NumberInt(3), unit: "hour" },
      allowedLateness: { size: NumberInt(0), unit: "second" },
      pipeline: [
        { $group: { _id: "$customerId", customerDocs: { $push: "$$ROOT" }, max: { $max: "$max" } } }
      ]
  } },
  { $project: { customerId: "$_id", max: "$max" } },
  { $merge: { into: { connectionName: "Cluster0", db: "mk-testdb", coll: "outputColl" } } }
];
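The `$push: "$$ROOT"` accumulator in the windowed `$group` retains every input document for its group, so per-group state grows linearly with the volume of input in the 3-hour window; that accumulated state is presumably what exceeds the 100 MiB limit on restore. A minimal sketch of the accumulation pattern in plain Node.js (not mongosh; the document shapes and counts here are illustrative, not taken from the actual workload):

```javascript
// Hypothetical sketch: simulates the $group stage's accumulators.
// $push keeps every whole document per group, while $max keeps only a scalar.
function simulateGroupPush(docs) {
  const groups = new Map();
  for (const doc of docs) {
    const key = doc.customerId;
    if (!groups.has(key)) groups.set(key, { customerDocs: [], max: -Infinity });
    const g = groups.get(key);
    g.customerDocs.push(doc);         // $push: "$$ROOT" — unbounded per-group growth
    g.max = Math.max(g.max, doc.max); // $max — constant-size state
  }
  return groups;
}

// 50 customer ids (value % 50), mirroring the $mod in the pipeline above.
const docs = [];
for (let value = 1; value < 10000; value++) {
  docs.push({ customerId: value % 50, max: value });
}

const groups = simulateGroupPush(docs);
console.log(groups.size); // → 50 groups, each retaining every matching document
```

The fixed-size `max` accumulator restores cheaply regardless of input volume; it is the `customerDocs` arrays that scale with the window's contents.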
The interesting thing about this failure is that it did not occur while taking the checkpoint, only when restoring from it.