[SERVER-6692] GridFS: mongod crashed when saving files with many processes Created: 02/Aug/12  Updated: 08/Mar/13  Resolved: 05/Sep/12

Status: Closed
Project: Core Server
Component/s: GridFS, Internal Client
Affects Version/s: 2.0.0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: barongwang Assignee: Spencer Brody (Inactive)
Resolution: Done Votes: 0
Labels: crash, driver, insert
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: Linux
Participants:

 Description   

We start several clients that send data to the same mongod process. When the disk write speed reaches approximately 4 MB/s, the mongod process crashes. We can no longer see the mongod process with top under Linux. However, netstat shows that the ports mongod listens on are still being listened on, and every command directed at mongod blocks. What's more, the data files backing the collections are locked (we can't access these files!).



 Comments   
Comment by Spencer Brody (Inactive) [ 05/Sep/12 ]

I'm closing this ticket due to lack of activity. If you'd like to continue investigating this, please reopen the ticket and add answers to the questions from my last post.

Comment by Spencer Brody (Inactive) [ 09/Aug/12 ]

If you are using a ScopedDbConnection you can call done() on it when you are finished which will return the connection to the internal connection pool so that it can be reused on a future request.

The problem with setChunkSize is a known bug, SERVER-5720, which has been fixed for 2.2. You can see the fix we made in the SERVER-5720 ticket; it looks like you had the right idea in changing the assert to "size != 0".

Is the tail of the log you attached the tail from the very end of your test run, after the mongod crashed? That log just seems to end: there is no error message, stack trace, or shutdown message. How can you tell the server has crashed? What behavior do you see when you try to connect to the server after the crash?

Comment by barongwang [ 09/Aug/12 ]

It seems that the C++ driver doesn't support changing the GridFS chunk size: there is an assert in "setChunkSize(unsigned int size)" that requires size to equal zero.
I changed this assert to "size != 0" and rebuilt the driver. Since I only save files smaller than 16 KB, I set the chunk size to 16 KB using the rebuilt version.

Comment by barongwang [ 09/Aug/12 ]

It is running on a single mongod!

These days I have been using the C++ driver to save files smaller than 16 KB into GridFS, and mongod crashes frequently. The number of clients is 200.

// When I use the mongo shell to connect to mongod, the shell blocks.
Tencent64:/home/baronwang/mongodb/mongodb-linux-x86_64-static-legacy-2.0.6/bin # ./mongo 10.6.11.104:28810
MongoDB shell version: 2.0.6
connecting to: 10.6.11.104:28810/test // hangs here: never errors out, never continues.

When I try to connect to mongod with the C++ driver, the connection is refused.

//tail -50 of log
c->nextSafe():

{ _id: ObjectId('50233206ba73d3605cfa17b2'), files_id: ObjectId('5023320640bbf4b33d3dc70a'), n: 0, data: BinData }

c->nextSafe():

{ _id: ObjectId('50233206ba73d3605cfa17c5'), files_id: ObjectId('5023320640bbf4b33d3dc70a'), n: 0, data: BinData }

c->nextSafe():

{ _id: ObjectId('50233206ba73d3605cfa17df'), files_id: ObjectId('5023320640bbf4b33d3dc70a'), n: 0, data: BinData }

c->nextSafe():

{ _id: ObjectId('50233206ba73d3605cfa17e0'), files_id: ObjectId('5023320640bbf4b33d3dc70a'), n: 0, data: BinData }

c->nextSafe():

{ _id: ObjectId('50233206ba73d3605cfa17e3'), files_id: ObjectId('5023320640bbf4b33d3dc70a'), n: 0, data: BinData }

c->nextSafe():

{ _id: ObjectId('50233206ba73d3605cfa17fc'), files_id: ObjectId('5023320640bbf4b33d3dc70a'), n: 0, data: BinData }

Thu Aug 9 11:44:06 [conn35291] should have chunk: 1 have:0
c->nextSafe():

{ _id: ObjectId('502331ffba73d3605cfa15ef'), files_id: ObjectId('502331ff40bbf4b33d3dc722'), n: 0, data: BinData }

c->nextSafe():

{ _id: ObjectId('502331ffba73d3605cfa1616'), files_id: ObjectId('502331ff40bbf4b33d3dc722'), n: 0, data: BinData }

c->nextSafe():

{ _id: ObjectId('502331ffba73d3605cfa1661'), files_id: ObjectId('502331ff40bbf4b33d3dc722'), n: 0, data: BinData }

c->nextSafe():

{ _id: ObjectId('502331ffba73d3605cfa1683'), files_id: ObjectId('502331ff40bbf4b33d3dc722'), n: 0, data: BinData }

c->nextSafe():

{ _id: ObjectId('50233206ba73d3605cfa16e0'), files_id: ObjectId('502331ff40bbf4b33d3dc722'), n: 0, data: BinData }

c->nextSafe():

{ _id: ObjectId('50233206ba73d3605cfa1742'), files_id: ObjectId('502331ff40bbf4b33d3dc722'), n: 0, data: BinData }

c->nextSafe():

{ _id: ObjectId('50233206ba73d3605cfa176a'), files_id: ObjectId('502331ff40bbf4b33d3dc722'), n: 0, data: BinData }

c->nextSafe():

{ _id: ObjectId('50233206ba73d3605cfa17b0'), files_id: ObjectId('502331ff40bbf4b33d3dc722'), n: 0, data: BinData }

c->nextSafe():

{ _id: ObjectId('50233206ba73d3605cfa17fe'), files_id: ObjectId('502331ff40bbf4b33d3dc722'), n: 0, data: BinData }

Thu Aug 9 11:44:06 [conn35298] end connection 10.6.11.104:50831
Thu Aug 9 11:44:06 [conn35343] should have chunk: 1 have:0
c->nextSafe():

{ _id: ObjectId('50233206ba73d3605cfa172c'), files_id: ObjectId('5023320640bbf4b33d3dc71f'), n: 0, data: BinData }

c->nextSafe():

{ _id: ObjectId('50233206ba73d3605cfa173f'), files_id: ObjectId('5023320640bbf4b33d3dc71f'), n: 0, data: BinData }

c->nextSafe():

{ _id: ObjectId('50233206ba73d3605cfa1757'), files_id: ObjectId('5023320640bbf4b33d3dc71f'), n: 0, data: BinData }

c->nextSafe():

{ _id: ObjectId('50233206ba73d3605cfa177f'), files_id: ObjectId('5023320640bbf4b33d3dc71f'), n: 0, data: BinData }

c->nextSafe():

{ _id: ObjectId('50233206ba73d3605cfa17b7'), files_id: ObjectId('5023320640bbf4b33d3dc71f'), n: 0, data: BinData }

c->nextSafe():

{ _id: ObjectId('50233206ba73d3605cfa17c8'), files_id: ObjectId('5023320640bbf4b33d3dc71f'), n: 0, data: BinData }

c->nextSafe():

{ _id: ObjectId('50233206ba73d3605cfa17cf'), files_id: ObjectId('5023320640bbf4b33d3dc71f'), n: 0, data: BinData }

c->nextSafe():

{ _id: ObjectId('50233206ba73d3605cfa17e2'), files_id: ObjectId('5023320640bbf4b33d3dc71f'), n: 0, data: BinData }

c->nextSafe():

{ _id: ObjectId('50233206ba73d3605cfa1800'), files_id: ObjectId('5023320640bbf4b33d3dc71f'), n: 0, data: BinData }

Thu Aug 9 11:44:06 [conn35448] should have chunk: 1 have:0
c->nextSafe():

{ _id: ObjectId('50233206ba73d3605cfa1727'), files_id: ObjectId('5023320640bbf4b33d3dc70b'), n: 0, data: BinData }

c->nextSafe():

{ _id: ObjectId('50233206ba73d3605cfa177d'), files_id: ObjectId('5023320640bbf4b33d3dc70b'), n: 0, data: BinData }

c->nextSafe():

{ _id: ObjectId('50233206ba73d3605cfa17b8'), files_id: ObjectId('5023320640bbf4b33d3dc70b'), n: 0, data: BinData }

c->nextSafe():

{ _id: ObjectId('50233206ba73d3605cfa17bc'), files_id: ObjectId('5023320640bbf4b33d3dc70b'), n: 0, data: BinData }

c->nextSafe():

{ _id: ObjectId('50233206ba73d3605cfa17c2'), files_id: ObjectId('5023320640bbf4b33d3dc70b'), n: 0, data: BinData }

c->nextSafe():

{ _id: ObjectId('50233206ba73d3605cfa1801'), files_id: ObjectId('5023320640bbf4b33d3dc70b'), n: 0, data: BinData }

Thu Aug 9 11:44:06 [conn35258] should have chunk: 1 have:0
c->nextSafe():

{ _id: ObjectId('502331ffba73d3605cfa165e'), files_id: ObjectId('502331ff40bbf4b33d3dc72a'), n: 0, data: BinData }

c->nextSafe():

{ _id: ObjectId('502331ffba73d3605cfa1663'), files_id: ObjectId('502331ff40bbf4b33d3dc72a'), n: 0, data: BinData }

c->nextSafe():

{ _id: ObjectId('502331ffba73d3605cfa1687'), files_id: ObjectId('502331ff40bbf4b33d3dc72a'), n: 0, data: BinData }

c->nextSafe():

{ _id: ObjectId('50233206ba73d3605cfa1724'), files_id: ObjectId('502331ff40bbf4b33d3dc72a'), n: 0, data: BinData }

c->nextSafe():

{ _id: ObjectId('50233206ba73d3605cfa179e'), files_id: ObjectId('502331ff40bbf4b33d3dc72a'), n: 0, data: BinData }

c->nextSafe():

{ _id: ObjectId('50233206ba73d3605cfa17ff'), files_id: ObjectId('502331ff40bbf4b33d3dc72a'), n: 0, data: BinData }

Thu Aug 9 11:44:06 [conn35274] should have chunk: 1 have:0
Thu Aug 9 11:44:06 [conn35246] end connection 10.6.11.104:50779
c->nextSafe():

{ _id: ObjectId('50233206ba73d3605cfa16dd'), files_id: ObjectId('5023320640bbf4b33d3dc724'), n: 0, data: BinData }

c->nextSafe():

{ _id: ObjectId('50233206ba73d3605cfa1802'), files_id: ObjectId('5023320640bbf4b33d3dc724'), n: 0, data: BinData }

Thu Aug 9 11:44:07 [initandlisten] connection accepted from 10.6.11.104:51057 #35518
Thu Aug 9 11:44:07 [conn35252] end connection 10.6.11.104:50785
Thu Aug 9 11:44:07 [conn35158] end connection 10.6.11.104:50691
Thu Aug 9 11:44:07 [conn35474] end connection 10.6.11.104:51007
Thu Aug 9 11:44:07 [conn35305] end connection 10.6.11.104:50839

//head 40 of log
Wed Aug 8 19:54:48 [initandlisten] MongoDB starting : pid=8867 port=28810 dbpath=/data/aoi 64-bit host=Tencent64
Wed Aug 8 19:54:48 [initandlisten] db version v2.0.6, pdfile version 4.5
Wed Aug 8 19:54:48 [initandlisten] git version: e1c0cbc25863f6356aa4e31375add7bb49fb05bc
Wed Aug 8 19:54:48 [initandlisten] build info: Linux domU-12-31-39-16-30-A2 2.6.21.7-2.fc8xen #1 SMP Fri Feb 15 12:34:28 EST 2008 x86_64 BOOST_LIB_VERSION=1_45
Wed Aug 8 19:54:48 [initandlisten] options:

{ bind_ip: "10.6.11.104", dbpath: "/data/aoi", fork: true, logpath: "/data/aoi/mongodb.log", port: 28810 }

Wed Aug 8 19:54:48 [initandlisten] journal dir=/data/aoi/journal
Wed Aug 8 19:54:48 [initandlisten] recover begin
Wed Aug 8 19:54:48 [initandlisten] recover lsn: 3042183
Wed Aug 8 19:54:48 [initandlisten] recover /data/aoi/journal/j._1
Wed Aug 8 19:54:48 [initandlisten] recover skipping application of section seq:579088 < lsn:3042183
Wed Aug 8 19:54:48 [initandlisten] recover skipping application of section seq:627899 < lsn:3042183
Wed Aug 8 19:54:48 [initandlisten] recover skipping application of section seq:827799 < lsn:3042183
Wed Aug 8 19:54:48 [initandlisten] recover skipping application of section seq:877809 < lsn:3042183
Wed Aug 8 19:54:48 [initandlisten] recover skipping application of section seq:927532 < lsn:3042183
Wed Aug 8 19:54:48 [initandlisten] recover skipping application of section seq:977411 < lsn:3042183
Wed Aug 8 19:54:48 [initandlisten] recover skipping application of section seq:1027195 < lsn:3042183
Wed Aug 8 19:54:49 [initandlisten] recover skipping application of section seq:1074363 < lsn:3042183
Wed Aug 8 19:54:49 [initandlisten] recover skipping application of section seq:1123988 < lsn:3042183
Wed Aug 8 19:54:49 [initandlisten] recover skipping application of section more...
Wed Aug 8 19:54:52 [initandlisten] recover cleaning up
Wed Aug 8 19:54:52 [initandlisten] removeJournalFiles
Wed Aug 8 19:54:52 [initandlisten] recover done
Wed Aug 8 19:54:52 [initandlisten] waiting for connections on port 28810
Wed Aug 8 19:54:52 [websvr] admin web console waiting for connections on port 29810
Wed Aug 8 19:55:00 [initandlisten] connection accepted from 10.6.11.104:46805 #1
Wed Aug 8 19:55:52 [clientcursormon] mem (MB) res:29 virt:8214 mapped:4047
Wed Aug 8 19:57:49 [initandlisten] connection accepted from 10.6.11.104:37506 #2
Wed Aug 8 19:57:49 [conn2] end connection 10.6.11.104:37506
Wed Aug 8 19:57:49 [initandlisten] connection accepted from 10.6.11.104:37507 #3
Wed Aug 8 19:57:49 [conn3] should have chunk: 1 have:0
c->nextSafe():

{ _id: ObjectId('5022543dba73d3605ceb0d7a'), files_id: ObjectId('5022543d6a8b885752a74c16'), n: 0, data: BinData }

c->nextSafe():

{ _id: ObjectId('5022543dba73d3605ceb0d79'), files_id: ObjectId('5022543d6a8b885752a74c16'), n: 0, data: BinData }

Wed Aug 8 19:57:49 [conn3] end connection 10.6.11.104:37507
Wed Aug 8 19:57:49 [initandlisten] connection accepted from 10.6.11.104:37508 #4
Wed Aug 8 19:57:49 [conn4] should have chunk: 1 have:0
c->nextSafe():

{ _id: ObjectId('5022543dba73d3605ceb0d7c'), files_id: ObjectId('5022543d6a8b885752a74c16'), n: 0, data: BinData }

c->nextSafe():

{ _id: ObjectId('5022543dba73d3605ceb0d7a'), files_id: ObjectId('5022543d6a8b885752a74c16'), n: 0, data: BinData }

c->nextSafe():

{ _id: ObjectId('5022543dba73d3605ceb0d79'), files_id: ObjectId('5022543d6a8b885752a74c16'), n: 0, data: BinData }

Wed Aug 8 19:57:49 [conn4] should have chunk: 1 have:0
c->nextSafe():

{ _id: ObjectId('5022543dba73d3605ceb0d7d'), files_id: ObjectId('5022543d6a8b885752a74c17'), n: 0, data: BinData }

In addition, I want to know whether the C++ driver has any method to disconnect connections.

Comment by Spencer Brody (Inactive) [ 03/Aug/12 ]

Can you post the mongod logs from a full run of a test that triggers the crash?

Is this running on a single mongod, a replica set, or a sharded cluster?

Comment by barongwang [ 03/Aug/12 ]

// mongoAdapter.h -- wraps the C++ driver's GridFS interface
// (include guard added: the original snippet had a trailing #endif
// with no matching #ifndef)
#ifndef MONGO_ADAPTER_H
#define MONGO_ADAPTER_H

#include "mongotmphead.h"
using namespace std;
using namespace mongo;

class mongoadapter
{
private:
        static string hostIp;
        static string proPort;
        static string dbCol;
        static string tmpDir;
public:
        mongoadapter();
        ~mongoadapter();
public:
        bool init();
        bool putfile(const string& filename, const char *addr, size_t sumLen);
private:
        DBClientConnection conn;
        GridFS *fileOpt;
};
 
#endif // MONGO_ADAPTER_H
 
//mongoAdapter.cpp       this class just wraps the GridFS interface
#include <string.h>
#include "mongoAdapter.h"
using namespace mongo;
using namespace std;
 
string mongoadapter::hostIp = "xxxxxxxxxx";
string mongoadapter::proPort = "xxxxx";
string mongoadapter::dbCol = "xxxx";
string mongoadapter::tmpDir = "/home/baronwang/";  // never used
 
bool mongoadapter::init()
{
	string errmsg;
	if ( ! conn.connect( mongoadapter::hostIp + ":" + mongoadapter::proPort , errmsg ) ) 
	{
		cout << "couldn't connect : " << errmsg << endl;
		return false;
	}
	fileOpt = new GridFS(conn, dbCol);
	return true;
}
 
bool mongoadapter::putfile(const string& filename,const char* addr,size_t sumLen)
{
	BSONObj resultObj = fileOpt->storeFile(addr, sumLen, filename);
	return !resultObj.isEmpty();
}
 
 
 
 
//our calling code  // we use a while loop to ensure that once a file needs to be saved, the code never returns until the file has been inserted into GridFS
myFlag = true;
while(myFlag)
{
     if ( myFileOpt.init() )
     {
	myFlag = false;
     }
}
 
myFlag = true;
while(myFlag)
{
    try
    {
        // sometimes we can't save the file to GridFS
        iRet = (myFileOpt.putfile(tmp_file_ptr->filename, tmp_file_ptr->content, tmp_file_ptr->cur_len) ? MySoapShared::Request_OK : MySoapShared::Request_FailFileWrite);
        printf("Inserted file %s into mongodb successfully! filesize: %d\n", tmp_file_ptr->filename, tmp_file_ptr->cur_len);
        myFlag = false;
    }
    catch(...)
    {
        printf("Insert file into mongodb failed! Retrying!\n");
    }
}
 
 
//The number of clients is about 100 to 200, and every client saves 10 files into GridFS. What's more, under these conditions the mongod process crashed a few times, not every time!
mongodb version:  mongodb-linux-x86_64-static-legacy-2.0.6
 

Comment by barongwang [ 03/Aug/12 ]

The version I used is mongodb-linux-x86_64-static-legacy-2.0.6.

Comment by Spencer Brody (Inactive) [ 02/Aug/12 ]

Do you have a reproducible test case you could attach?

Can you check if this problem still exists in 2.0.6? There were many stability enhancements to mongod between 2.0.0 and 2.0.6.

Generated at Thu Feb 08 03:12:26 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.