[SERVER-30952] Initial (re)sync never completes, stuck in a loop Created: 05/Sep/17  Updated: 07/Apr/23  Resolved: 05/Sep/17

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.4.7
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Robert Beekman Assignee: Kelsey Schubert
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Duplicate
duplicates SERVER-4941 collection rename may not replicate /... Closed
Operating System: ALL
Steps To Reproduce:
  • Run a 3-node cluster
  • Run map/reduce jobs that require a temp collection (a minimal sketch of such a job follows this list)
  • Re-sync one of the nodes; the re-sync will fail and start again
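For illustration only, a minimal mongo shell sketch of such a job (collection and field names are placeholders, not taken from this ticket); any plain, non-merge out target makes mapReduce write to a tmp.mr.* collection and then rename it into place, which is the operation initial sync rejects:

db.events.mapReduce(
  function () { emit(this.key, 1); },                      // map
  function (key, values) { return Array.sum(values); },    // reduce
  { out: "event_counts" }                                  // plain output target => temp collection + rename
);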
Participants:

 Description   

We have a 3-node replica set running v3.4.7. The primary runs 6 map/reduce jobs every minute or so, and due to circumstances we had to re-sync one of the secondary nodes. However, at one of the last steps of the re-sync we get the following error:

2017-09-05T17:56:04.348+0000 E REPL     [repl writer worker 5] Error applying operation: OplogOperationUnsupported: Applying renameCollection not supported in initial sync: { ts: Timestamp 1504588941000|592, h: 4948566672906734558, v: 2, op: "c", ns: "graphs.$cmd", o: { renameCollection: "graphs.tmp.agg_out.989", to: "graphs.graphs_temp", stayTemp: false, dropTarget: true } } ({ ts: Timestamp 1504588941000|592, h: 4948566672906734558, v: 2, op: "c", ns: "graphs.$cmd", o: { renameCollection: "graphs.tmp.agg_out.989", to: "graphs.graphs_temp", stayTemp: false, dropTarget: true } })
2017-09-05T17:56:04.348+0000 E REPL     [replication-168] Failed to apply batch due to 'OplogOperationUnsupported: error applying batch: Applying renameCollection not supported in initial sync: { ts: Timestamp 1504588941000|592, h: 4948566672906734558, v: 2, op: "c", ns: "graphs.$cmd", o: { renameCollection: "graphs.tmp.agg_out.989", to: "graphs.graphs_temp", stayTemp: false, dropTarget: true } }'
2017-09-05T17:56:04.348+0000 I ASIO     [NetworkInterfaceASIO-RS-0] Ending connection to host graphs1-mongo3:27017 due to bad connection status; 2 connections to that host remain open
2017-09-05T17:56:04.348+0000 I REPL     [replication-167] Finished fetching oplog during initial sync: CallbackCanceled: Callback canceled. Last fetched optime and hash: { ts: Timestamp 1504634159000|4672, t: -1 }[-4041766555669726456]
2017-09-05T17:56:04.348+0000 I REPL     [replication-168] Initial sync attempt finishing up.

After this, MongoDB cleans up the files and starts the re-sync again; it's now basically stuck in a very big loop. This used to work fine with 2.x, where we re-synced quite a few times.

I'm not sure what to do about this; we can't stop the map/reduce jobs for the duration of the resync, as it takes about 8 hours to get to this point.



 Comments   
Comment by Spencer Brody (Inactive) [ 07/Sep/17 ]

Hi thijs,
I agree with you that this was poorly messaged. I have updated the ticket description to include more information and a link to the underlying issue it is trying to avoid. I have also filed DOCS-10763 to update our documentation and release notes to make this compatibility change clear. I also agree that it was a pretty big change to make in a point release, but at the time it was decided that avoiding potential data corruption was more important than preserving every user's ability to perform initial syncs. The discovery of this issue was a large motivator behind spending a significant portion of the 3.6 development effort on adding UUIDs to collections so that collection renames could be handled properly.

For now I encourage you to try Ramon's workaround; it seems the lowest-impact way to work around this issue. Unfortunately, it only avoids the renames in mapReduce's $out; it doesn't work for aggregations with $out.

If you'd like help figuring out the index issue that is preventing you from completing initial sync on 3.2.11, please let us know and we'd be happy to work with you to understand what's happening there.

Sorry again for the difficulties this issue has caused you.

-Spencer

Comment by Ramon Fernandez Marina [ 07/Sep/17 ]

I think there's a better workaround than downgrading (which may have other issues). You can output to a collection with an action. For example:

db.outputcollection.drop()
// The output collection can't be empty, so insert a marker document
db.outputcollection.insert({marker:1})
db.mycollection.mapReduce(myMapFunction, myReduceFunction, { out: { merge : "outputcollection" } })

Using this approach avoids the implicit renameCollection operation used by the out option of mapReduce altogether and should help with the initial sync issue originally reported.

Regards,
Ramón.

Comment by Thijs Cadier [ 07/Sep/17 ]

Unfortunately we can't seem to sync a 3.2.11 node; the sync fails with an error message about missing indexes. We've started work on a solution to completely move away from using aggregations in our processing pipeline.

I think this change was communicated extremely poorly. It was introduced in 3.2.12 with the benign-sounding description "renameCollection ‘c’ op should restart initial sync upon application". It's also not mentioned on the overview page here (https://docs.mongodb.com/manual/release-notes/3.2/); you have to click through to the detailed list to even find it.

There is no mention at all that this change would break initial syncs when you use big aggregations. This is a change with huge impact. I really don't understand why this change was made in a point release without a huge warning in the change log.

Comment by Daniel Pasette (Inactive) [ 06/Sep/17 ]

Glad you have an escape valve. Please look through the caveats Spencer called out and remember that the version must be 3.2.11 or lower.

Comment by Thijs Cadier [ 06/Sep/17 ]

Thanks for the thoughtful reply Spencer.

Great news that this can be mitigated by running 3.2; we were under the impression that this issue was also present in 3.2. We can still run 3.2 since we haven't performed the auth upgrade yet. We're trying the downgrade approach now and will let you know how it goes.

Comment by Spencer Brody (Inactive) [ 06/Sep/17 ]

Hi thijs and robert@80beans.com,
To provide some more context here: the underlying problem is that aggregation and mapReduce with $out perform a collection rename internally, to move the output collection from a temporary name to its final name. Unfortunately, due to SERVER-4941, any collection that is renamed while it is being initial synced can wind up with incomplete or incorrect data on the node that did the initial sync. Earlier versions of MongoDB (< 3.2.12) would allow initial sync to complete in these cases, so different members of the replica set could potentially wind up with different data for the same collection. In SERVER-26117 we changed initial sync to no longer process a renameCollection and to instead fail and restart, to avoid possible data corruption. 3.6 will avoid this problem by using new features from the storage engine to ensure that initial sync can properly handle renameCollection without losing data.
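To make the mechanism concrete, here is a small mongo shell sketch (not part of the original report) that lists the most recent renameCollection command entries in the primary's oplog; these have the same shape as the entry quoted in the description and are what the syncing node refuses to apply:

var oplog = db.getSiblingDB("local").getCollection("oplog.rs");
oplog.find(
  { op: "c", "o.renameCollection": { $exists: true } },    // command ops emitted by $out / mapReduce output
  { ts: 1, ns: 1, o: 1 }
).sort({ $natural: -1 }).limit(5).pretty();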

Until 3.6 is released, however, I understand you are in a tricky situation. The best thing to do is one of the workarounds that Thomas describes: either stop running the aggregations and mapReduces that cause the problem for the duration of the initial sync, or rework the app so that the agg or m/r no longer uses $out to write directly to the output collection, but instead streams the results back to your app, which then writes the data out to the output collection itself.
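As a rough sketch of that second option for the aggregation case (pipeline, collection, and field names below are placeholders, not from this ticket): drop the final $out stage so no temp collection or renameCollection is created, and have the client write the results out itself:

db.events.aggregate(
  [
    { $group: { _id: "$graphId", total: { $sum: "$value" } } }
    // ...the pipeline's final $out stage is simply omitted
  ],
  { allowDiskUse: true }
).forEach(function (doc) {
  // Plain upserts from the client replicate as ordinary oplog entries; no rename is involved.
  db.graphs_temp.update({ _id: doc._id }, doc, { upsert: true });
});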

If neither of those is an option, you can downgrade the initially syncing node to 3.2.11, or wait for SERVER-29772 to be completed, backported, and released. Both options will allow the initial sync to complete even if it encounters a renameCollection. In that case, however, the data for any collections that were renamed during the initial sync may be corrupt (see SERVER-4941 for more details). So if you do this, you need to make sure to drop any collections generated by an agg or m/r as soon as the initial sync completes, and to ensure that you're not reading from any of those collections, to guarantee the integrity of the data returned. Once the initial sync is complete and all the agg and m/r output collections have been dropped, it is safe to run new aggregations or mapReduces and use the resulting data.
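A hedged sketch of that cleanup step, assuming the "graphs" database and the collection names seen in the log lines above (adjust the filter for your own output collections): once the 3.2.11 node finishes its initial sync, drop every agg/m-r output or temp collection that may have been renamed mid-sync, so the next job run rebuilds it from scratch:

var graphsDb = db.getSiblingDB("graphs");
graphsDb.getCollectionNames()
  .filter(function (name) {
    // leftover mapReduce/aggregation temp collections plus the $out target from the logs
    return /^tmp\.(mr|agg_out)\./.test(name) || name === "graphs_temp";
  })
  .forEach(function (name) {
    print("dropping " + name);
    graphsDb.getCollection(name).drop();
  });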

I hope this clarifies the issue and the steps to work around it. I recognize that this is a difficult bug to work around and can cause real operational hurdles for you, and for that I sincerely apologize on behalf of MongoDB. We are working hard to get this fixed for 3.6.

-Spencer

Comment by Thijs Cadier [ 06/Sep/17 ]

Hi,

We're in a very tight spot at the moment because of this issue:

  • We cannot stop the aggregations and/or map reduces that are creating the temporary collections.
  • When we disallow disk usage on aggregations, our jobs cannot complete.
  • Making a filesystem snapshot takes more than an hour, which also breaks everything.

Basically we need a fix for this; otherwise we have to do an extremely hasty project to take this entire system out of commission. Do you have any other suggestions? Are nodes running WiredTiger also affected by this?

To me it seems like this issue is not a duplicate of SERVER-4941. We're not renaming any collections or databases. The aggregation system creates temporary collections, which I imagine don't even need to be synced to the secondaries. I therefore feel it makes more sense to keep this issue open.

Why is a fix for this issue, which causes a real risk of data loss, not given more priority?

Thijs

Comment by Kelsey Schubert [ 05/Sep/17 ]

Hi robert@80beans.com,

Unfortunately, there is no simple workaround for this issue. There are two possible workarounds; the first you've already identified and ruled out:

  1. Disable whatever application/process runs the map-reduce commands until the initial sync can complete.
  2. Update the application/process that's running the map-reduce not to use out, but instead to get the results via a cursor and manually write them back out to the output collection (a sketch follows this list).
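A rough sketch of workaround 2 for mapReduce (function and collection names are placeholders): inline output never creates a temp collection or issues a renameCollection, so the application can write the results back out itself; note that inline results must fit within the 16MB BSON document limit:

var res = db.mycollection.mapReduce(
  myMapFunction,
  myReduceFunction,
  { out: { inline: 1 } }            // no temp collection, no rename
);
res.results.forEach(function (doc) {
  // the application performs the writes to the output collection
  db.outputcollection.update({ _id: doc._id }, doc, { upsert: true });
});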

Kind regards,
Kelsey

Comment by Robert Beekman [ 05/Sep/17 ]

Will do, but that issue looks unresolved. In the meantime we're down one replica set member. Is there any workaround to get an initial sync to complete?

Comment by Kelsey Schubert [ 05/Sep/17 ]

Hi robert@80beans.com,

Thanks for the report, this issue is being tracked under SERVER-4941 and the fix will be included in MongoDB 3.6.

Kind regards,
Kelsey

Comment by Robert Beekman [ 05/Sep/17 ]

Additional information: the map/reduce jobs are preceded by an aggregation that also creates a temp collection, because the aggregation is a lot faster than the map/reduce jobs.
