[SERVER-23274] Collections created with the $out aggregation pipeline in MongoDB 3.2 get dropped on replica set election Created: 21/Mar/16  Updated: 28/Aug/18  Resolved: 24/Mar/16

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.1.9
Fix Version/s: 3.2.5, 3.3.4

Type: Bug Priority: Critical - P2
Reporter: Paul Reed Assignee: Benjamin Murphy
Resolution: Done Votes: 0
Labels: code-and-test
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to SERVER-23299 Remove temp flag on all collections i... Closed
is related to SERVER-23514 Remove code and tests from SERVER-23299 Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Completed:
Steps To Reproduce:

example:

SOURCE (collection)
{ "_id" : 1, "srcindex" : 1, "N" : 1 }
{ "_id" : 2, "srcindex" : 2, "N" : 2 }
{ "_id" : 3, "srcindex" : 3, "N" : 3 }

command:

db.SOURCE.aggregate([
  { $group: { "_id" : { "B" : '$B', "CCG" : '$srcindex' }, "N" : { "$sum" : "$N" } } },
  { $out: "byChapter" }
]);

gives:

{ "_id" : { "CCG" : 3 }, "N" : 3 }
{ "_id" : { "CCG" : 2 }, "N" : 2 }
{ "_id" : { "CCG" : 1 }, "N" : 1 }

All replica set members give the same result; if I add another item to the collection, it persists across all members.

If I then create a new collection with a simple

db.another.insert({})

that is also present.

now:

rs.stepDown()

machines switch around, and my primary steps down. I get this logging:

2016-03-21T19:39:54.876+0000 I COMMAND  [conn349] Attempting to step down in response to replSetStepDown command
2016-03-21T19:39:54.876+0000 I REPL     [ReplicationExecutor] transition to SECONDARY
2016-03-21T19:39:54.876+0000 I NETWORK  [conn353] end connection ?.?.?.? (18 connections now open)
...
2016-03-21T19:39:54.878+0000 I NETWORK  [conn349] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [127.0.0.1:58627] 
2016-03-21T19:39:54.885+0000 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:58645 #373 (5 connections now open)
2016-03-21T19:39:55.180+0000 I REPL     [ReplicationExecutor] replSetElect voting yea for R3:27017 (2)
2016-03-21T19:39:55.897+0000 I NETWORK  [conn366] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [?.?.?.?:59354] 
2016-03-21T19:39:55.897+0000 I NETWORK  [conn369] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [?.?.?.?:63795] 
2016-03-21T19:39:56.525+0000 I REPL     [ReplicationExecutor] Member R3:27017 is now in state PRIMARY
2016-03-21T19:39:57.491+0000 I REPL     [ReplicationExecutor] syncing from: R3:27017
2016-03-21T19:39:57.498+0000 I NETWORK  [SyncSourceFeedback] Socket say send() errno:10038 An operation was attempted on something that is not a socket. ?.?.?.?:27017
2016-03-21T19:39:57.498+0000 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update: socket exception [SEND_ERROR] for ?.?.?.?:27017
2016-03-21T19:39:57.499+0000 I REPL     [SyncSourceFeedback] updateUpstream failed: Location9001: socket exception [SEND_ERROR] for ?.?.?.?:27017, will retry
2016-03-21T19:39:57.504+0000 I ASIO     [NetworkInterfaceASIO-0] Successfully connected to R3:27017
2016-03-21T19:39:57.510+0000 I COMMAND  [repl writer worker 15] CMD: drop data_1601.byChapter
2016-03-21T19:39:57.513+0000 I REPL     [ReplicationExecutor] could not find member to sync from
 
----

So the aggregated out collection is dropped, but the inserted one is not.
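
To spot this kind of drop after the fact, one option is to scan the mongod log for replicated drop commands. The sketch below is only illustrative: `findReplicatedDrops` is a hypothetical helper, and it assumes log lines shaped like the excerpt above (a `[repl writer worker N]` thread logging `CMD: drop <namespace>`).

```javascript
// Hypothetical helper: collects namespaces from "CMD: drop" lines that were
// logged by a replication writer thread, as seen in the log excerpt above.
function findReplicatedDrops(logLines) {
  var pattern = /\[repl writer worker \d+\] CMD: drop (\S+)/;
  var dropped = [];
  logLines.forEach(function (line) {
    var match = pattern.exec(line);
    if (match) {
      dropped.push(match[1]); // captured namespace, e.g. "db.collection"
    }
  });
  return dropped;
}

// Sample input copied from the log excerpt in this ticket.
var sampleLog = [
  "2016-03-21T19:39:57.504+0000 I ASIO     [NetworkInterfaceASIO-0] Successfully connected to R3:27017",
  "2016-03-21T19:39:57.510+0000 I COMMAND  [repl writer worker 15] CMD: drop data_1601.byChapter",
  "2016-03-21T19:39:57.513+0000 I REPL     [ReplicationExecutor] could not find member to sync from"
];
var droppedNamespaces = findReplicatedDrops(sampleLog);
console.log(droppedNamespaces);
```

For the sample lines, the only namespace reported is data_1601.byChapter. Other server versions may word these log lines differently, so the pattern may need adjusting.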

Sprint: Query 12 (04/04/16)
Participants:
Case:
Linked BF Score: 0

 Description   


Issue Status as of Apr 14, 2016

ISSUE SUMMARY
On MongoDB 3.2, collections created using the $out operator in the aggregation pipeline are incorrectly marked as temporary collections.

In a replica set, when an election takes place, all temporary collections are removed from the dataset.
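
A rough way to find collections at risk is to look for the temporary flag in the collection metadata. The sketch below assumes the input has the shape returned by the mongo shell's db.getCollectionInfos() (documents with `name` and `options`), and that an affected collection shows `temp: true` inside `options`. Both the helper name `findTempCollections` and the sample data are hypothetical.

```javascript
// Hypothetical helper: returns the names of collections whose metadata
// carries the temporary flag.
// Assumption: each entry looks like { name: <string>, options: { ... } },
// as returned by db.getCollectionInfos() in the mongo shell.
function findTempCollections(collectionInfos) {
  return collectionInfos
    .filter(function (info) {
      return Boolean(info.options) && info.options.temp === true;
    })
    .map(function (info) {
      return info.name;
    });
}

// Illustrative data only, modelled on the collections named in this ticket.
var sampleInfos = [
  { name: "byChapter", options: { temp: true } }, // created via $out (assumed shape)
  { name: "another", options: {} },
  { name: "SOURCE", options: {} }
];
var atRisk = findTempCollections(sampleInfos);
console.log(atRisk);
```

In the mongo shell, the same function could be applied to the live metadata with findTempCollections(db.getCollectionInfos()), replacing console.log with print.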

USER IMPACT
All collections created on MongoDB 3.2.0 through 3.2.4 via the $out stage of an aggregation pipeline are removed if an election takes place on the replica set.

These collections must be re-created by re-running the aggregation pipeline used to create them originally. To prevent them from being dropped again, see the WORKAROUNDS section below or upgrade to MongoDB 3.2.5.

WORKAROUNDS
After an aggregation command has successfully finished and created a collection (e.g.: agg_out), users can rename the collection with the renameCollection command to avoid running into this issue:

use admin
db.runCommand( { renameCollection: "dbname.created_with_$out", to: "dbname.some_other_name" } )

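Upon renaming the collection, its temporary flag is cleared, so a future replica set election will not drop the collection. The required name can be restored by executing another renameCollection command, with dropTarget omitted or set to false (renaming with dropTarget set to true triggers the same bug; see the comments below).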
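
The rename-away-and-back sequence can be sketched as a pair of command documents. This is a minimal illustration, not an official procedure: `buildRenameRoundTrip` and the `_renamed` suffix are hypothetical, and dropTarget is deliberately left out because, according to the comments on this ticket, setting it to true triggers the same bug.

```javascript
// Hypothetical helper: builds two renameCollection command documents that
// rename a collection to a scratch name and then back to its original name.
// dropTarget is intentionally omitted (it defaults to false).
function buildRenameRoundTrip(dbName, collName, scratchSuffix) {
  var original = dbName + "." + collName;
  var scratch = original + scratchSuffix;
  return [
    { renameCollection: original, to: scratch },
    { renameCollection: scratch, to: original }
  ];
}

// "dbname" and "agg_out" are the placeholder names used in this section.
var commands = buildRenameRoundTrip("dbname", "agg_out", "_renamed");
console.log(JSON.stringify(commands, null, 2));
```

In the mongo shell, each document could then be run against the admin database, for example with commands.forEach(function (c) { printjson(db.adminCommand(c)); }). The scratch name must not already exist, since no target is dropped.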

AFFECTED VERSIONS
Only collections created in MongoDB 3.2 via the $out operator from the aggregation pipeline are affected by this issue.

Collections created using earlier versions of MongoDB that are now hosted on a MongoDB 3.2 replica set are not affected by this issue.

FIX VERSION
The fix is included in the 3.2.5 production release.

Original description

Any collection created using an aggregate operation will be dropped when the replica set primary steps down.

I thought it was to do with $lookup, but after removing that stage, I find that it affects all aggregations.



 Comments   
Comment by Adam Schwartz [ 04/Jul/16 ]

emoshe, I am sorry to hear this bug caused you significant data loss. We will work with you via a support ticket to minimize the impact and help you recover the data if possible.

Please understand that we review every server bug and assess if it warrants a Critical Advisory. In this case, we decided not to issue an advisory. We fixed the bug, wrote a detailed summary (which describes impact and workarounds), added a note in the release announcement, and notified our support team. We will re-assess our prioritization and alert processes in light of your feedback.

We appreciate the many JIRA tickets you have opened and contributed to over the past couple of years. Your feedback helps us improve and become more responsive to customer needs.

Comment by Elad Moshe [ 03/Jul/16 ]

Unfortunately, we were affected by this bug, which caused us significant data loss.
I can't even start describing how disappointed I am by the way this issue was handled by MongoDB Inc. The single thing a database must do right is to ensure that the data is actually kept safe. As a developer I definitely understand that critical bugs might happen.
However, in this case I would expect MongoDB Inc. to do its very best to let its users know in time, so they can react (e.g. by renaming collections) before data loss has already occurred!

Comment by Ramon Fernandez Marina [ 14/Apr/16 ]

paul.reed, this is to let you know that we've just released the next version in the 3.2 series, 3.2.5, containing a fix for this bug; it's available for download here.

Please note that published releases can't be modified, so it is not possible to fix this issue in versions 3.2.4 down to 3.2.0 (3.0.x and 2.6.x versions are not affected) – users affected by this bug should upgrade to 3.2.5 as soon as possible.

Thanks again for reporting the issue.

Cheers,
Ramón.

Comment by Paul Reed [ 25/Mar/16 ]

Seems a pretty big issue to not be retrofitted for 3.2.4 and further back.
My code is simple to work around - so I am fine(ish). I ensure I drop before I rename, but surely there must be some users who may still download 3.2.4 and fall into this caveat - I can expect it could be catastrophic for some of them, especially as the collection is present on all RS's until a cycle drops it !!

At the very least, highlight the issue as known and dangerous!

Paul

Comment by Ramon Fernandez Marina [ 25/Mar/16 ]

paul.reed, 3.2.4 was already released, so as Benjamin pointed out the first stable release to include a fix will be 3.2.5, currently scheduled for mid-April.

As for your last question above, this is a manifestation of the same issue: setting dropTarget to true triggers the bug, so to work around it you'll need to issue the renameCollection with dropTarget set to false (or omitted, since false is the default).

Regards,
Ramón.

Comment by Benjamin Murphy [ 25/Mar/16 ]

paul.reed, it will be part of 3.2.5, as well as 3.3.4, which is a development release. Both are in the works! In the meantime, the workaround you identified will serve to prevent this from happening to a collection created with $out.

Comment by Paul Reed [ 24/Mar/16 ]

Will this fix now be within 3.2.4 ?
if not when will 3.2.5 be going out ?

Comment by Githook User [ 24/Mar/16 ]

Author: Benjamin Murphy (benjaminmurphy) <benjamin_murphy@me.com>

Message: SERVER-23274 renameCollection on a temporary collection correctly replicates.

[cherry-picked from commit a19406fdedac2bff515a0b162c8d496b11f4e455]
Branch: v3.2
https://github.com/mongodb/mongo/commit/4e3b4bdf625353e3aced9193682b66b2af2f3de4

Comment by Githook User [ 24/Mar/16 ]

Author: Benjamin Murphy (benjaminmurphy) <benjamin_murphy@me.com>

Message: SERVER-23274 renameCollection on a temporary collection correctly replicates.
Branch: master
https://github.com/mongodb/mongo/commit/a19406fdedac2bff515a0b162c8d496b11f4e455

Comment by Paul Reed [ 24/Mar/16 ]

I am seeing another oddity with this when I aggregate using the C# driver:

collection.Aggregate().Group($"{{ _id:{{ {idclause} }} }}").Project($"{{ {project} }}").Out(outCollection+"_AGGFIX");
collection.Database.RenameCollection(outCollection + "_AGGFIX", outCollection, new RenameCollectionOptions() { DropTarget = true });

when I step down - the outCollection gets dropped.
If I run this instead

collection.Aggregate().Group($"{{ _id:{{ {idclause} }} }}").Project($"{{ {project} }}").Out(outCollection+"_AGGFIX");
collection.Database.RenameCollection(outCollection + "_AGGFIX", outCollection);
it doesn't.

In both cases the outCollection does not exist prior to operation.

Is this the same issue ? Same fix ?

Comment by Paul Reed [ 24/Mar/16 ]

No problem

Is there a scenario, with say certain machine rotations, which would have cleared the erroneous drop? Wondering why I or no-one else had spotted this earlier.

btw, there is nothing quite as horrid as watching logs go by which proceed to drop a 30 hour aggregation process in a matter of seconds.

Comment by Ramon Fernandez Marina [ 23/Mar/16 ]

paul.reed, this is to let you know that we've identified the source of the problem and a fix is on code review now. As you already found out, renaming the collection created by the aggregation pipeline is a suitable workaround to prevent it from being dropped. Thanks for reporting this issue.

Comment by Paul Reed [ 21/Mar/16 ]

Also: renaming the collection prior to stepDown - will prevent the erroneous drop.

Comment by Ramon Fernandez Marina [ 21/Mar/16 ]

Thanks for your report paul.reed, we can reproduce this behavior and we're investigating.
