Details
- Type: Bug
- Resolution: Unresolved
- Priority: Major - P3
- Component: Atlas Streams
- Affects Version/s: ALL
Description
When restoring from a large state checkpoint, the restore fails, and it appears to be due to this error:
(ExceededMemoryLimit) $push used too much memory and cannot spill to disk. Memory limit: 104857600 bytes
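For context, 104857600 bytes is 100 MB, which appears to be the per-stage memory limit that accumulators such as $push are subject to inside $group. A rough sketch of the same class of failure on an ordinary cluster (collection and field names are hypothetical, and this goes through db.collection.aggregate() rather than a stream processor):

// Hypothetical repro: a $group whose $push accumulator collects more than
// ~100 MB of documents fails with ExceededMemoryLimit when it may not spill to disk.
db.inputColl.aggregate(
    [
        { $group: { _id: "$customerId", customerDocs: { $push: "$$ROOT" } } }
    ],
    { allowDiskUse: false } // forbid spilling, mirroring the "cannot spill to disk" case above
);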
The full logs for this can be obtained from this query in Splunk:
This stream processor (SP) first created a largish checkpoint (~5.8 GB) and was then stopped. When it was later restarted, the start failed with the error above. The pipeline the SP is running is:
const pipeline = [
    {
        $source: {
            connectionName: "Cluster0",
            db: "mk-testdb",
            coll: "inputColl",
            timeField: { $toDate: "$fullDocument.ts" }
        }
    },
    { $replaceRoot: { newRoot: "$fullDocument" } },
    {
        $project: {
            value: { $range: [1, "$idx"] },
            ts: "$ts"
        }
    },
    { $unwind: "$value" },
    {
        $addFields: {
            "customerId": { $mod: ["$value", 50] },
            "max": "$value",
            "idarray0": ["$_id", "$_id", "$_id", "$_id", "$_id", "$_id"],
            "idarray1": ["$_id", "$_id", "$_id", "$_id", "$_id", "$_id"],
            "idarray2": ["$_id", "$_id", "$_id", "$_id", "$_id", "$_id"],
            "idarray3": ["$_id", "$_id", "$_id", "$_id", "$_id", "$_id"]
        }
    },
    {
        $tumblingWindow: {
            interval: { size: NumberInt(3), unit: "hour" },
            allowedLateness: { size: NumberInt(0), unit: "second" },
            pipeline: [
                { $group: { _id: "$customerId", customerDocs: { $push: "$$ROOT" }, max: { $max: "$max" } } }
            ]
        }
    },
    { $project: { customerId: "$_id", max: "$max" } },
    {
        $merge: {
            into: { connectionName: "Cluster0", db: "mk-testdb", coll: "outputColl" }
        }
    }
];
The interesting thing about this failure is that the SP did not fail when taking the checkpoint, only when trying to restore from it.
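For reference, a rough sketch of the sequence that hit this, assuming the Atlas Stream Processing shell interface (sp.createStreamProcessor() / start() / stop()) and a hypothetical processor name; the pipeline is the array defined above:

// Hypothetical processor name; "pipeline" is the array defined above.
sp.createStreamProcessor("mkTestProcessor", pipeline);
sp.mkTestProcessor.start();   // runs and builds up a ~5.8 GB state checkpoint
sp.mkTestProcessor.stop();    // stopping succeeds; the checkpoint had already been taken
sp.mkTestProcessor.start();   // restarting fails while restoring from the checkpoint:
                              // (ExceededMemoryLimit) $push used too much memory and cannot spill to disk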