[SERVER-19644] Seg Fault on cloneCollection (specifically gridfs) Created: 29/Jul/15  Updated: 19/Sep/15  Resolved: 04/Aug/15

Status: Closed
Project: Core Server
Component/s: GridFS, Replication
Affects Version/s: 2.4.9, 3.0.5, 3.1.6
Fix Version/s: 3.0.6, 3.1.7

Type: Bug Priority: Critical - P2
Reporter: Amit Lichtenberg Assignee: Scott Hernandez (Inactive)
Resolution: Done Votes: 1
Labels: RF
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File mongod_partial.log    
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Completed:
Steps To Reproduce:

Run cloneCollection from a remote gridfs chunks collection

Sprint: RPL 7 08/10/15
Participants:

 Description   

The following error started occurring right after we started deploying the new v3.0.5:

2015-07-29T10:57:50.272+0300 I COMMAND  [conn353] cloneCollection.  db:my_coll collection:my_coll.fs.chunks from: remote_ip:27017 query: { _id: { $in: [ ObjectId('55b886f90429ea0918c63f8a') ] } } 
2015-07-29T10:57:50.275+0300 F -        [conn353] Invalid access at address: 0xd8
2015-07-29T10:57:50.761+0300 F -        [conn353] Got signal: 11 (Segmentation fault).
 
 0xf5ba59 0xf5b322 0xf5b67e 0x7f9e352f4340 0x922b9a 0x922cb6 0x911769 0x939bf2 0x85801d 0x934fa8 0x936a92 0x949266 0x9d5f64 0x9d6eed 0x9d7bfb 0xb9c776 0xab2b50 0x80e06d 0xf0e9fb 0x7f9e352ec182 0x7f9e33db500d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"B5BA59"},{"b":"400000","o":"B5B322"},{"b":"400000","o":"B5B67E"},{"b":"7F9E352E4000","o":"10340"},{"b":"400000","o":"522B9A"},{"b":"400000","o":"522CB6"},{"b":"400000","o":"511769"},{"b":"400000","o":"539BF2"},{"b":"400000","o":"45801D"},{"b":"400000","o":"534FA8"},{"b":"400000","o":"536A92"},{"b":"400000","o":"549266"},{"b":"400000","o":"5D5F64"},{"b":"400000","o":"5D6EED"},{"b":"400000","o":"5D7BFB"},{"b":"400000","o":"79C776"},{"b":"400000","o":"6B2B50"},{"b":"400000","o":"40E06D"},{"b":"400000","o":"B0E9FB"},{"b":"7F9E352E4000","o":"8182"},{"b":"7F9E33CBA000","o":"FB00D"}],"processInfo":{ "mongodbVersion" : "3.0.5", "gitVersion" : "8bc4ae20708dbb493cb09338d9e7be6698e4a3a3", "uname" : { "sysname" : "Linux", "release" : "3.13.0-24-generic", "version" : "#46-Ubuntu SMP Thu Apr 10 19:11:08 UTC 2014", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "695FC6828398A9DB1F99718671147885B5ED116D" }, { "b" : "7FFFD8AFE000", "elfType" : 3, "buildId" : "6755FAD2CADACDF1667E5B57FF1EDFC28DD1C976" }, { "b" : "7F9E352E4000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "9318E8AF0BFBE444731BB0461202EF57F7C39542" }, { "b" : "7F9E35086000", "path" : "/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "70F9A7F3734C01FF8D442C21A03B631588C2FECD" }, { "b" : "7F9E34CAC000", "path" : "/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "8D6FFA819931E68666BCEF4424BA6289838309D7" }, { "b" : "7F9E34AA4000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "92FCF41EFE012D6186E31A59AD05BDBB487769AB" }, { "b" : "7F9E348A0000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "C1AE4CB7195D337A77A3C689051DABAA3980CA0C" }, { "b" : "7F9E3459C000", "path" : "/usr/lib/x86_64-linux-gnu/libstdc++.so.6", "elfType" : 3, "buildId" : "19EFDDAB11B3BF5C71570078C59F91CF6592CE9E" }, { "b" : "7F9E34296000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "1D76B71E905CB867B27CEF230FCB20F01A3178F5" }, { "b" : "7F9E34080000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "CC0D578C2E0D86237CA7B0CE8913261C506A629A" }, { "b" : "7F9E33CBA000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "882AD7AAD54790E2FA6EF64CA2E6188F06BF9207" }, { "b" : "7F9E35502000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "9F00581AB3C73E3AEA35995A0C50D24D59A01D47" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf5ba59]
 mongod(+0xB5B322) [0xf5b322]
 mongod(+0xB5B67E) [0xf5b67e]
 libpthread.so.0(+0x10340) [0x7f9e352f4340]
 mongod(_ZN5mongo12IndexCatalog13IndexIteratorC1EPNS_16OperationContextEPKS0_b+0xA) [0x922b9a]
 mongod(_ZNK5mongo12IndexCatalog11findIdIndexEPNS_16OperationContextE+0x16) [0x922cb6]
 mongod(_ZN5mongo10Collection14insertDocumentEPNS_16OperationContextERKNS_7BSONObjEb+0x49) [0x911769]
 mongod(_ZN5mongo6Cloner3FunclERNS_27DBClientCursorBatchIteratorE+0x262) [0x939bf2]
 mongod(_ZN5mongo18DBClientConnection5queryESt8functionIFvRNS_27DBClientCursorBatchIteratorEEERKSsNS_5QueryEPKNS_7BSONObjEi+0x1AD) [0x85801d]
 mongod(_ZN5mongo6Cloner4copyEPNS_16OperationContextERKSsRKNS_15NamespaceStringERKNS_7BSONObjES7_bbbbbNS_5QueryE+0x5C8) [0x934fa8]
 mongod(_ZN5mongo6Cloner14copyCollectionEPNS_16OperationContextERKSsRKNS_7BSONObjERSsbbbb+0x652) [0x936a92]
 mongod(_ZN5mongo18CmdCloneCollection3runEPNS_16OperationContextERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x546) [0x949266]
 mongod(_ZN5mongo12_execCommandEPNS_16OperationContextEPNS_7CommandERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x34) [0x9d5f64]
 mongod(_ZN5mongo7Command11execCommandEPNS_16OperationContextEPS0_iPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0xC1D) [0x9d6eed]
 mongod(_ZN5mongo12_runCommandsEPNS_16OperationContextEPKcRNS_7BSONObjERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x28B) [0x9d7bfb]
 mongod(_ZN5mongo8runQueryEPNS_16OperationContextERNS_7MessageERNS_12QueryMessageERKNS_15NamespaceStringERNS_5CurOpES3_+0x746) [0xb9c776]
 mongod(_ZN5mongo16assembleResponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0xB10) [0xab2b50]
 mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0xDD) [0x80e06d]
 mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x34B) [0xf0e9fb]
 libpthread.so.0(+0x8182) [0x7f9e352ec182]
 libc.so.6(clone+0x6D) [0x7f9e33db500d]
-----  END BACKTRACE  -----

It has reproduced every time since.



 Comments   
Comment by Githook User [ 07/Aug/15 ]

Author:

{u'username': u'scotthernandez', u'name': u'Scott Hernandez', u'email': u'scotthernandez@gmail.com'}

Message: SERVER-19644: change dassert to verify
Branch: master
https://github.com/mongodb/mongo/commit/c43b0e5105b95904848a5a9b3d5c3d238464fd9d

Comment by Githook User [ 05/Aug/15 ]

Author:

{u'username': u'scotthernandez', u'name': u'Scott Hernandez', u'email': u'scotthernandez@gmail.com'}

Message: SERVER-19644: ensure valid collection creation when there are no collection options

(cherry picked from commit 66abb09a28b2d9ec6955f454b7ff86824ebc56c1)
Branch: v3.0
https://github.com/mongodb/mongo/commit/34b733c0ca7d231949bd7317bab4079a24189875

Comment by Amit Lichtenberg [ 04/Aug/15 ]

Thanks!

Comment by Githook User [ 04/Aug/15 ]

Author:

{u'username': u'scotthernandez', u'name': u'Scott Hernandez', u'email': u'scotthernandez@gmail.com'}

Message: SERVER-19644: ensure valid collection creation when there are no collection options
Branch: master
https://github.com/mongodb/mongo/commit/66abb09a28b2d9ec6955f454b7ff86824ebc56c1

Comment by Scott Hernandez (Inactive) [ 31/Jul/15 ]

Looks like there were two interactions when using those versions which caused the problem starting in 3.0.5. This was due to the format of the old data versus the new data coming back from the listCollections command – specifically, the new command always returns an "options" field, which is missing from pre-2.6 servers and caused a problem during cloning.
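
For illustration only, here is a minimal pymongo-style sketch of the kind of defensive defaulting this implies; the actual fix is in the server's C++ cloner, and the helper below and its names are hypothetical:

```
# Hypothetical illustration only -- the real fix lives in the server's C++ cloner.
# The idea: a 3.0-style clone expects each collection-info document to carry an
# "options" subdocument, but a pre-2.6 source may omit it, so the clone should
# fall back to empty options instead of dereferencing a missing value.
def create_collection_like_source(dest_db, name, coll_info):
    options = coll_info.get("options") or {}   # absent in 2.4.x responses
    return dest_db.create_collection(name, **options)
```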

This is good news for you: once we fix these two things you should be able to keep running without upgrading from 2.4.x, for a while longer at least.

Comment by Amit Lichtenberg [ 30/Jul/15 ]

I see. Indeed it is probably about time we upgrade the older dbs. So far everything worked so we didn't have a good-enough excuse, but now we do.

About your suggestion to copy the documents one at a time - the number of documents in each clone might be large (hundreds or more), so iterating over and copying them in Python takes (too much) time.

Thank you anyway for your attention. I would be glad to hear further updates if and when you have them.

Comment by Scott Hernandez (Inactive) [ 30/Jul/15 ]

amitlicht, unfortunately using a pre-2.6.x server, like 2.4.9, as a remote source is not supported for 3.0.x. This is covered in various places in the release notes, but not called out specifically for cloneCollection, as it is basically the same for replication and sharding: http://docs.mongodb.org/manual/release-notes/3.0-upgrade/#upgrade-a-replica-set-to-3-0

We will work on finding the root cause of the seg fault, but it is unlikely that the fix will make it work with 2.4, since that was not a design goal between those versions. I should have more information in the next few days if you are interested in the underlying technical issues.

If you upgrade to 2.6.x it will work correctly, of course. Also, it might be just as easy for you to use the python client to copy the documents manually, from one server to the other, and not use cloneCollection at all – that will work on all server versions as long as you use a version of the python driver with support for them (and yes, the drivers have a much larger range of supported server versions).
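
For reference, a rough sketch of that per-document copy (assuming pymongo 3.x; the hosts, database name, id list and helper name below are placeholders, not a driver API):

```
# Rough sketch of the manual copy suggested above: stream matching documents
# from the remote source and bulk-insert them into the local server.
# Hosts, db name and the helper itself are illustrative placeholders.
from pymongo import MongoClient

src = MongoClient("remote_host", 27017)["my_db"]   # e.g. the 2.4.x source
dst = MongoClient("localhost", 27017)["my_db"]     # e.g. the 3.0.x destination

def copy_by_ids(collection_name, ids, batch_size=500):
    batch = []
    for doc in src[collection_name].find({"_id": {"$in": ids}}):
        batch.append(doc)
        if len(batch) >= batch_size:
            dst[collection_name].insert_many(batch, ordered=False)
            batch = []
    if batch:
        dst[collection_name].insert_many(batch, ordered=False)
```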

Comment by Amit Lichtenberg [ 30/Jul/15 ]

That's great! Waiting for updates.

Comment by Ramon Fernandez Marina [ 30/Jul/15 ]

Hi amitlicht, using 2.4.9 as the source database (and 3.0.5 as the target) I'm able to reproduce the problem. I'm guessing it may have to do with listIndexes, but that's a first hunch – we're investigating.

Comment by Amit Lichtenberg [ 30/Jul/15 ]

Got your request about the -vv, I will make sure to use it when trying to replicate it again.
Having checked again (while trying to explain our flow better), I see now that there was a mistake in one of my previous comments - the source DB version is 2.4.9 and not 2.6.

About your request for snippets - I'd rather not share our actual source code. I will try to describe in more detail what we are trying to do; tell me if you need more details and I'll try to share what I can:
We basically have two DBs: the remote (source) db, which is installed on one or more lightweight machines, and the local (destination) db, which is used as a centralized storage server. In the setup where this crash occurred, the versions are 3.0.5 for the local/centralized DB and 2.4.9 for the remote/lightweight DB.

On both DBs we use gridfs to save small-to-medium sized files (usually no more than a few hundred KB). To do this, we have 3 collections:

  • db.fs.files, db.fs.chunks - the regular gridfs structure
  • db.file - contains a collection of file metadata documents, each holding the id of a gridfs file & additional metadata (e.g. filename, type, etc.).

We use pymongo to clone; mongoengine has almost nothing to do with it.

The logic is as follows: we collect 3 lists of object ids / uuids: files, fs_chunks & fs_files. The files list contains the uuids of all the file metadata documents which should be cloned. The fs_files list contains the object ids of the gridfs files associated with the file metadata documents whose ids are in the files list. The fs_chunks list contains the object ids of the gridfs chunks associated with those same files. Next, we use pymongo to issue the following clone command:
```
for collection_name in copied_collections:
    destination_pymongo_db.command('cloneCollection', '%s.%s' % (my_db, collection_name),
                                   **{'from': fromhost,
                                      'query': {'_id': {'$in': copied_ids[collection_name]}},
                                      'to': '%s.%s' % (destination_pymongo_db.name, collection_name)})
```

At the end of all three clones, we delete the documents in all three lists from the source db using a similar iteration logic.
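
A minimal sketch of that cleanup step, under the same assumptions as the clone snippet above (source_pymongo_db is a hypothetical handle to the remote db; delete_many assumes pymongo 3.x):

```
# Sketch of the post-clone cleanup described above; assumes a pymongo handle to
# the source (remote) db and the same copied_collections / copied_ids structures
# used in the clone loop.
for collection_name in copied_collections:
    source_pymongo_db[collection_name].delete_many(
        {'_id': {'$in': copied_ids[collection_name]}})
```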

If I had to point out a single thing about our use case that might sound unusual, it is the way we clone the gridfs files by accessing the fs.files & fs.chunks collections directly (i.e. our clone operates directly on db.fs.files & db.fs.chunks). This sounds like a hack-ish way to do it, but it has been working so far.

Please update me if this helps in any way or if you need more data. As I said, I hope I'll manage to replicate this on Sunday in code & dbs I can share.

Thanks,
Amit

Comment by Ramon Fernandez Marina [ 30/Jul/15 ]

Thanks amitlicht. I tried again to reproduce using pymongo, unsuccessfully. I very much doubt it's an issue in mongoengine, but I'm not familiar with it so that hasn't been part of my testing. If you can share the relevant snippets on how collections are being cloned from your application I'll try those steps on my end.

We'll wait for your attempts on Sunday. Dan's request above (using the -vv switch) will yield more verbose logs, which hopefully will give us some useful hints.

Regards,
Ramón.

Comment by Daniel Pasette (Inactive) [ 30/Jul/15 ]

Hi Amit. If you are able to run the experiment again on Sunday, can you start the mongod server with verbose logging, i.e., mongod -vv

Comment by Amit Lichtenberg [ 30/Jul/15 ]

Hi again.
Uploading the original data is unfortunately not an option. I will try to think of a way to reproduce it outside of our application. This will probably be only on Sunday (I'm working in Israel time, weekend is almost here). Will update then.

Thanks again for your help,
Amit

Comment by Ramon Fernandez Marina [ 29/Jul/15 ]

amitlicht, thanks for all your help so far. Unfortunately we haven't been able to reproduce this issue on our end, so here are some options going forward:

  1. Would you consider sharing your fs.chunks collection with us? If the answer is "yes" I'll create an upload portal so you can send this data securely and privately
  2. If uploading this data is not an option, would you be able to set up a test machine with 3.0.5 and see if the problem reproduces again? If it does we could send you a modified mongod binary that would collect additional information to track down the source of the problem

Would you be able to accommodate either option?

Thanks,
Ramón.

Comment by Amit Lichtenberg [ 29/Jul/15 ]

Yes, the machine on which the error happened was downgraded to v3.0.4 (by replacing only the mongodb-org-server package). After that point the error did not happen again. The database was not dropped in between - a simple downgrade plus starting the server again seems to have solved it.

Comment by J Rassi [ 29/Jul/15 ]

Thanks. When you said "had to downgrade it to an older mongo version to keep working", can you confirm that you downgraded the newly-deployed instance to 3.0.4, and that that resolved the issue? To help isolate the problem, I'm particularly interested in whether this issue is reproducible with all 3.0.x versions, or just 3.0.5.

Comment by Amit Lichtenberg [ 29/Jul/15 ]

  • This happened on a newly deployed machine which only ever had v3.0.5. We've been using 3.0.4 on other deployments (in which exactly the same logic has been working for the last year or so).
  • Yes, it is automatically created on application start.
  • We did not drop the db between the two crashes. We did have clone commands on other (non-gridfs) collections in the same DB which worked perfectly before and in between the crashes. We don't have any code which drops this db.

Comment by J Rassi [ 29/Jul/15 ]

Thanks for the information. Please do let us know when you're able to reproduce this again.

In the meantime, I have a few more follow-up questions:

  • What version of the server were you previously running, before you upgraded this instance to 3.0.5 (and what version did you downgrade to, to avoid the issue)?
  • Did the database "mydb" exist on the local instance at the time of clone?
  • After the first crash, did you drop the database or collection before retrying? And, do you have any application code that would have dropped this database or collection while it was running?

Thanks again.

Comment by Amit Lichtenberg [ 29/Jul/15 ]

As per your other question - the clone command is executed from our application. We use mongoengine on top of pymongo, if you think it has anything to do with it. Specifically, the error occurred while trying to replicate data from a remote DB to our main storage; we do it by running cloneCollection selectively on the gridfs content (chunks & files) according to self-implemented availability criteria.

Comment by Amit Lichtenberg [ 29/Jul/15 ]

Reproducing will require some time as I no longer have that setup on which the command failed (had to downgrade it to an older mongo version to keep working). I'll see if we can redeploy and reproduce it. The following output was generated on the same remote db, but after running some other tests on it so it's not guaranteed to represent the status at the time of error.
Meanwhile:

  • db.fs.chunks.exists(): right now the output is { "name" : "my_db.fs.chunks" }. At the moment of the failure that db & collection definitely existed.

  • db.fs.chunks.stats(): again this is output from now, not from the time of error:
    ```
    {
        "ns" : "my_db.fs.chunks",
        "count" : 9,
        "size" : 468768,
        "avgObjSize" : 52085.333333333336,
        "storageSize" : 2252800,
        "numExtents" : 3,
        "nindexes" : 2,
        "lastExtentSize" : 2023424,
        "paddingFactor" : 1,
        "systemFlags" : 1,
        "userFlags" : 0,
        "totalIndexSize" : 16352,
        "indexSizes" : { "_id_" : 8176, "files_id_1_n_1" : 8176 },
        "ok" : 1
    }
    ```

The other commands are no longer relevant on current DB data.

Comment by J Rassi [ 29/Jul/15 ]

I'd like to ask for additional information to further diagnose the issue:

  • What version of the server were you previously running, before you upgraded this instance to 3.0.5?
  • Could you please provide output from db.fs.chunks.exists() on the remote collection, in addition to db.fs.chunks.stats()?
  • Could you please provide output from running the following query on the remote collection:
    • db.fs.chunks.find({ _id: { $in: [ ObjectId('55b87e0f0429ea379c610524'), ObjectId('55b886f90429ea0918c63f8a') ] } })
  • Would you please attempt to reproduce this issue, and then run the following commands immediately after restarting the crashed mongod process, on the local collection:
    • db.fs.chunks.getIndexes()
    • db.fs.chunks.stats()
    • db.fs.chunks.exists()

Thanks.

Comment by Daniel Pasette (Inactive) [ 29/Jul/15 ]

Also, the log output shows the command is passing a query to cloneCollection in both instances (but with different queries). Is this from your application or from the shell?

2015-07-29T10:13:44.018+0300 I COMMAND  [conn3655] cloneCollection.  db:my_db collection:my_db.fs.chunks from: 172.16.100.101:27017 query: { _id: { $in: [ ObjectId('55b87e0f0429ea379c610524') ] } } 
....
....
2015-07-29T10:57:50.272+0300 I COMMAND  [conn353] cloneCollection.  db:my_db collection:my_db.fs.chunks from: 172.16.100.101:27017 query: { _id: { $in: [ ObjectId('55b886f90429ea0918c63f8a') ] } } 

Comment by Daniel Pasette (Inactive) [ 29/Jul/15 ]

Can you paste the output from db.fs.chunks.stats() in the mongo shell on the remote collection please

Comment by Amit Lichtenberg [ 29/Jul/15 ]

I cannot add the full logs due to company privacy issues. I added partial data including boot info & crash data from two different crashes we had on the same instance in a short time.

Comment by Ramon Fernandez Marina [ 29/Jul/15 ]

amitlicht, can you please upload full logs for this node since the last restart? I'm looking specifically for startup options and binary information that appear under the [initandlisten] thread at the beginning of the log.

Thanks,
Ramón.

Comment by Amit Lichtenberg [ 29/Jul/15 ]

Additional info: local DB version (the one running the clone) is 3.0.5. Remote DB version is 2.6. Another clone which operated on a regular (non-gridfs) collection worked fine.
