[SERVER-8253] MongoD is spamming with hundreds of gigabytes of logs Created: 21/Jan/13  Updated: 11/Jul/16  Resolved: 28/May/13

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Anton V. Volokhov Assignee: James Wahlin
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-9791 Move "replication batch size is" log ... Closed
Operating System: ALL
Participants:

 Description   

Hi!
I'm using MongoDB 2.2.1.
I tried to run an initial sync for one of the secondaries, but it failed with the following exceptions:

Mon Jan 21 07:22:07 [rsSync] SocketException: remote: 192.168.1.101:27017 error: 9001 socket exception [0] server [192.168.1.101:27017]
Mon Jan 21 07:22:07 [rsSync] Socket recv() conn closed? 192.168.1.101:27017
Mon Jan 21 07:22:07 [rsSync] SocketException: remote: 192.168.1.101:27017 error: 9001 socket exception [0] server [192.168.1.101:27017]
Mon Jan 21 07:22:07 [rsSync] Socket recv() conn closed? 192.168.1.101:27017
Mon Jan 21 07:22:07 [rsSync] SocketException: remote: 192.168.1.101:27017 error: 9001 socket exception [0] server [192.168.1.101:27017]
Mon Jan 21 07:22:07 [rsSync] Socket recv() conn closed? 192.168.1.101:27017
Mon Jan 21 07:22:07 [rsSync] SocketException: remote: 192.168.1.101:27017 error: 9001 socket exception [0] server [192.168.1.101:27017]
Mon Jan 21 07:22:07 [rsSync] Socket recv() conn closed? 192.168.1.101:27017
Mon Jan 21 07:22:07 [rsSync] SocketException: remote: 192.168.1.101:27017 error: 9001 socket exception [0] server [192.168.1.101:27017]
Mon Jan 21 07:22:07 [rsSync] Socket recv() conn closed? 192.168.1.101:27017
Mon Jan 21 07:22:07 [rsSync] SocketException: remote: 192.168.1.101:27017 error: 9001 socket exception [0] server [192.168.1.101:27017]
Mon Jan 21 07:22:07 [rsSync] Socket recv() conn closed? 192.168.1.101:27017
Mon Jan 21 07:22:07 [rsSync] SocketException: remote: 192.168.1.101:27017 error: 9001 socket exception [0] server [192.168.1.101:27017]
Mon Jan 21 07:22:07 [rsSync] Socket recv() conn closed? 192.168.1.101:27017
Mon Jan 21 07:22:07 [rsSync] SocketException: remote: 192.168.1.101:27017 error: 9001 socket exception [0] server [192.168.1.101:27017]
Mon Jan 21 07:22:07 [rsSync] Socket recv() conn closed? 192.168.1.101:27017

The log file grew to 200 GB in 6 hours, with logLevel 0.
The machine at 192.168.1.101 is now the primary.



 Comments   
Comment by James Wahlin [ 28/May/13 ]

Resolving ticket. Anton, feel free to reopen with any further questions.

Thanks,
James

Comment by Eric Milkie [ 28/May/13 ]

Replication batch size is unrelated to ulimits.

Comment by Anton V. Volokhov [ 28/May/13 ]

Thanks!
One little question before closing the thread.
Could you briefly explain how replication batch size is related to ulimits?

Comment by James Wahlin [ 28/May/13 ]

Hi Anton,

I have created SERVER-9791, requesting that the "replication batch size is ..." message be moved to log level 2. I have made you the requestor on it. Please track SERVER-9791 for status.

Let me know if you have any further questions on this ticket; otherwise let's close it out, given that your initial problem appears to have been ulimit-related and we now have a new ticket for the replication batch size message.

Thanks,
James

Comment by Anton V. Volokhov [ 24/May/13 ]

We use a more frequent logrotate.
Debug data can be useful, so log level 0 is undesirable.
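For reference, mongod's log can also be rotated on demand, either via the logRotate command or by sending SIGUSR1 to the process (a rough sketch; it assumes a local mongod on the default port):
$ mongo admin --eval 'db.adminCommand({ logRotate: 1 })'
$ # or, equivalently, signal the process directly:
$ kill -SIGUSR1 $(pidof mongod)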

Comment by James Wahlin [ 24/May/13 ]

Hi Anton,

Sorry about the delay on this. The "replication batch size" message in 2.2.2 is only printed at log level 1. If you change logging back to level 0, this message will go away.
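For example, the verbosity can be dropped back to 0 at runtime from the mongo shell, and checked afterwards (a sketch; it assumes you run it against the affected mongod):
$ mongo admin --eval 'db.adminCommand({ setParameter: 1, logLevel: 0 })'
$ mongo admin --eval 'printjson(db.adminCommand({ getParameter: 1, logLevel: 1 }))'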

Can you give me an idea on your current status? Can we still be of help on this?

Thanks,
James

Comment by Anton V. Volokhov [ 24/Jan/13 ]

One more issue with logging: we upgraded to MongoDB 2.2.2 and are trying to run an initial resync for one instance.
We now have millions of duplicate "replication batch size is 1" lines:
$ grep 'replication batch size is 1' mongodb.log | wc -l
4854584
$ grep 'Thu\ Jan\ 24\ 08:45:06\ [rsSync]\ replication\ batch\ size\ is\ 1' mongodb.log | wc -l
862

Comment by Anton V. Volokhov [ 21/Jan/13 ]

We take a dump of the whole cluster and use it only if the whole cluster is corrupted. We had hoped that the secondaries could recover without any manual tuning.
We have about 200 GB across two shards, and only 15 GB has been replicated so far. The graph of dataSize (the dataSize field of db.stats() for the corresponding database) looks weird:
http://yadi.sk/d/LyYVRmjl20vlL
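(For reference, the dataSize we graph is sampled roughly like this; the host and database name below are placeholders:)
$ mongo <shard-host>:27017/<dbname> --eval 'print(db.stats().dataSize)'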
The cluster architecture is as follows:
1st shard: cs-mongodb01* machines
2nd shard: cs-mongodb02* machines
The second shard is the one that is lagging.

Comment by Gregor Macadam [ 21/Jan/13 ]

Hi
Initial sync over the network can be slow. How large is your database? Are you in MMS?
Have you considered seeding the secondary with a backup of the data files?
Thanks
Gregor
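
To sketch what seeding would look like: copy a recent, consistent snapshot of the data files from another member into the new secondary's dbpath while mongod is stopped, then start it with the same replica set name so it only has to replay the oplog tail. Very roughly (the paths and host below are placeholders, and the donor must be cleanly shut down or snapshotted):
$ # on the new secondary, with mongod stopped:
$ rsync -a <donor-host>:/var/lib/mongodb/ /var/lib/mongodb/
$ # then start mongod with the usual --replSet option and let it catch up from the oplog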

Comment by Anton V. Volokhov [ 21/Jan/13 ]

The initial sync is in progress now, but it is crawling along slowly.
The load average on the primary and the other secondary jumps up to 2000.

Comment by Anton V. Volokhov [ 21/Jan/13 ]

Yes, connectivity is OK.
There were problems with open connections (a maximum of 819 open connections with a ulimit of 1024).
I then raised the ulimits and restarted the machines, and the issue went away. But still, 200 GB of logs looks scary.
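For the record, the open-files limit was checked and raised roughly like this (the user name and limit value below are just examples, not a recommendation):
$ ulimit -n    # current open-files limit for this shell/user
$ # raised persistently for the mongod user in /etc/security/limits.conf, e.g.:
$ # mongodb  soft  nofile  64000
$ # mongodb  hard  nofile  64000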

Comment by Eliot Horowitz (Inactive) [ 21/Jan/13 ]

It seems like there may be a networking issue. Have you checked connectivity?
