[SERVER-10377] mongodump times-out intermittently in the sharded environment (ec2) Created: 30/Jul/13  Updated: 03/Jan/14  Resolved: 19/Sep/13

Status: Closed
Project: Core Server
Component/s: Networking, Sharding
Affects Version/s: 2.4.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Pavlo Grinchenko Assignee: David Hows
Resolution: Won't Fix Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu, EC2


Attachments: File SERVER-10377.tar.gz     File SH01.log.gz     File db.currentOps.log.gz     File mongodump_verbose.log.gz     File mongosniff.log.gz     File mongosniff_console_output.log.gz    
Operating System: ALL
Steps To Reproduce:

Nothing special. Just simple mongodump.

Participants:

 Description   

We have consistent problems with doing backups of our Reporting database.

  • OS: Ubuntu, EC2
  • Simple sharded environment with just 1 shard
  • Multiple databases - it consistently breaks with >50% probability on the largest one, around 5GB (still very small)

Output looks like this:

Dumping whole database...
Tue Jul 30 12:40:01.447 creating new connection to:localhost:27017
Tue Jul 30 12:40:01.448 BackgroundJob starting: ConnectBG
Tue Jul 30 12:40:01.448 connected connection!
connected to: localhost
Tue Jul 30 12:40:01.448 DATABASE: reportsRaw to /mnt/backup/ReportsRawDB/dump/reportsRaw
Tue Jul 30 12:40:01.453 reportsRaw.system.indexes to /mnt/backup/ReportsRawDB/dump/reportsRaw/system.indexes.bson
Tue Jul 30 12:40:01.457 4 objects
Tue Jul 30 12:40:01.457 skipping collection: reportsRaw.video.plays.$id
Tue Jul 30 12:40:01.457 skipping collection: reportsRaw.ads.errors.$id
Tue Jul 30 12:40:01.457 skipping collection: reportsRaw.ads.starts.$id
Tue Jul 30 12:40:01.457 skipping collection: reportsRaw.bid.errors.raw.$id
Tue Jul 30 12:40:01.457 reportsRaw.video.plays to /mnt/backup/ReportsRawDB/dump/reportsRaw/video.plays.bson
Tue Jul 30 12:40:04.054 Collection File Writing Progress: 850900/10958683 7% (objects)
Tue Jul 30 12:40:07.039 Collection File Writing Progress: 1955900/10958683 17% (objects)
Tue Jul 30 12:40:10.006 Collection File Writing Progress: 3060700/10958683 27% (objects)
Tue Jul 30 12:40:13.007 Collection File Writing Progress: 4228100/10958683 38% (objects)
Tue Jul 30 12:40:16.035 Collection File Writing Progress: 5376800/10958683 49% (objects)
Tue Jul 30 12:40:19.042 Collection File Writing Progress: 6429700/10958683 58% (objects)
Tue Jul 30 12:40:22.042 Collection File Writing Progress: 7490800/10958683 68% (objects)
Tue Jul 30 12:40:25.001 Collection File Writing Progress: 8534300/10958683 77% (objects)
Tue Jul 30 12:40:28.053 Collection File Writing Progress: 9533700/10958683 86% (objects)
Tue Jul 30 12:40:31.006 Collection File Writing Progress: 10577200/10958683 96% (objects)
Tue Jul 30 12:40:32.073 10959317 objects
Tue Jul 30 12:40:32.073 Metadata for reportsRaw.video.plays to /mnt/backup/ReportsRawDB/dump/reportsRaw/video.plays.metadata.json
Tue Jul 30 12:40:32.073 reportsRaw.ads.errors to /mnt/backup/ReportsRawDB/dump/reportsRaw/ads.errors.bson
Tue Jul 30 12:40:32.993 200000 objects
Tue Jul 30 12:40:32.993 Metadata for reportsRaw.ads.errors to /mnt/backup/ReportsRawDB/dump/reportsRaw/ads.errors.metadata.json
Tue Jul 30 12:40:32.994 reportsRaw.ads.starts to /mnt/backup/ReportsRawDB/dump/reportsRaw/ads.starts.bson
Tue Jul 30 12:40:35.008 Collection File Writing Progress: 475100/12558979 3% (objects)
Tue Jul 30 12:40:38.040 Collection File Writing Progress: 1275300/12558979 10% (objects)
Tue Jul 30 12:40:41.001 Collection File Writing Progress: 2108000/12558979 16% (objects)
Tue Jul 30 12:40:44.008 Collection File Writing Progress: 2931900/12558979 23% (objects)
Tue Jul 30 12:40:47.036 Collection File Writing Progress: 3749000/12558979 29% (objects)
Tue Jul 30 13:00:48.073 Socket recv() errno:104 Connection reset by peer 127.0.0.1:27017
Tue Jul 30 13:00:48.074 SocketException: remote: 127.0.0.1:27017 error: 9001 socket exception [1] server [127.0.0.1:27017]
Tue Jul 30 13:00:48.074 User Assertion: 10278:dbclient error communicating with server: localhost:27017
assertion: 10278 dbclient error communicating with server: localhost:27017
Can't create dump of reportsRaw

We had to move from a 24-hour backup schedule to a 6-hour one, so that mongodump gets 4 attempts per day to succeed.
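For reference, that schedule amounts to a crontab entry along these lines; the script name is hypothetical, and the mongodump flags are only inferred from the log above (mongos on localhost:27017, output under /mnt/backup/ReportsRawDB/dump):

# hypothetical crontab entry: run the backup every 6 hours
0 */6 * * * /usr/local/bin/backup_reports.sh

# backup_reports.sh, roughly: dump the reportsRaw database via the local mongos
mongodump --host localhost --port 27017 --db reportsRaw --out /mnt/backup/ReportsRawDB/dump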

We upgraded everything to the latest version, 2.4.5.



 Comments   
Comment by Pavlo Grinchenko [ 18/Sep/13 ]

For some reason I cannot close it myself - please CLOSE it.

Comment by Pavlo Grinchenko [ 18/Sep/13 ]

We identified that when we roll out our DB to a newly provisioned host, everything works fine, so we have started replacing the hosts in our cluster. This doesn't look like a MongoDB issue, but rather something in the software stack of our AMI. Unfortunately we don't have the resources to dig into it, so we are going the replacement route.

I will close this issue.

Comment by Vladimir Poluyaktov [ 10/Sep/13 ]

I'm working on this

Comment by Daniel Pasette (Inactive) [ 09/Sep/13 ]

This issue does not appear to me to be related to the size of the data as many other users are using mongodump on larger collections with v2.4.5. Can you try a couple more experiments?

1. try dumping just this collection
2. if possible, take a copy of the data, start up a v2.2.6 mongod on the copy, and use a v2.2.6 mongodump on the same data.

If these experiments don't yield any results, I'd say we will need to get an actual copy of your data files to progress this issue.
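A rough sketch of both experiments; the dbpath, ports, and binary locations below are placeholders, and only the database/collection names come from the logs in this ticket:

# 1. dump only the problematic collection
mongodump --host localhost --port 27018 --db reportsRaw --collection video.plays --out /data/backup/dump-single

# 2. copy the data files (with mongod stopped, or from a consistent snapshot),
#    then start a v2.2.6 mongod on the copy and dump it with a v2.2.6 mongodump
cp -a /data/db /data/db-copy
/opt/mongodb-2.2.6/bin/mongod --dbpath /data/db-copy --port 27028 --fork --logpath /tmp/mongod-2.2.6.log
/opt/mongodb-2.2.6/bin/mongodump --host localhost --port 27028 --db reportsRaw --collection video.plays --out /data/backup/dump-2.2.6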

Comment by Pavlo Grinchenko [ 04/Sep/13 ]

In our previous project we were able to dump collections 3x the size of the one we are having problems with now, though that was with an older MongoDB version.

Comment by Vladimir Poluyaktov [ 03/Sep/13 ]

Oops - I forgot to redirect the stderr stream to mongosniff.log, so you can't see the error messages I mentioned above in that log file.
I've captured the last 1000 lines of the mongosniff output from my console and attached them as mongosniff_console_output.log.gz. Take a look at that file to see these errors.

Comment by Vladimir Poluyaktov [ 03/Sep/13 ]

Asya, David,
I ran mongodump with the --forceTableScan and -vvvvv flags. It failed again (see the mongodump log below).
I also ran mongosniff in parallel, and I ran db.currentOp(true) a few times during the process - right after the hang started, 5 and 15 minutes after that, and right after I got the mongodump error message 'Connection reset by peer 127.0.0.1:27018' (see the attached log files mongosniff.log.gz and db.currentOps.log.gz).

I see a lot of warning messages like this in the mongosniff output:

Warning: sequence # mismatch, there may be dropped packets
Invalid message start, skipping packet.

but it's not clear to me what caused these messages, because the lo network interface statistics show no errors or dropped packets (see my previous post).
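Worth noting as a cross-check suggestion (not something captured in this ticket): those mongosniff warnings can also mean that the capture itself dropped packets (e.g. libpcap buffer overruns on a busy interface) rather than the network, and the per-interface ifconfig counters above don't cover TCP-level retransmissions either. A couple of quick checks:

# TCP-level retransmissions and resets, which ifconfig does not report
netstat -s | grep -iE 'retrans|reset'

# compare with a plain tcpdump capture of the same traffic; on exit it reports
# how many packets were dropped by the kernel during the capture itself
sudo tcpdump -i lo -c 100000 -w /tmp/lo-27018.pcap port 27018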

mongodump output:

ubuntu@ReportingDB-use1b-01:/data/backup$ sudo mongodump --host localhost --port 27018 --forceTableScan -vvvvv
 
Tue Sep  3 20:32:43.355 creating new connection to:localhost:27018
Tue Sep  3 20:32:43.355 BackgroundJob starting: ConnectBG
Tue Sep  3 20:32:43.355 connected connection!
connected to: localhost:27018
Tue Sep  3 20:32:43.356 all dbs
Tue Sep  3 20:32:43.357 DATABASE: reportsOps	 to 	dump/reportsOps
Tue Sep  3 20:32:43.357 	reportsOps.system.indexes to dump/reportsOps/system.indexes.bson
Tue Sep  3 20:32:43.358 		 5 objects
Tue Sep  3 20:32:43.358 	skipping collection: reportsOps.parser.files.$_id_
Tue Sep  3 20:32:43.358 	skipping collection: reportsOps.bid.errors.raw.$_id_
Tue Sep  3 20:32:43.358 	skipping collection: reportsOps.parser.pixels.$_id_
Tue Sep  3 20:32:43.358 	skipping collection: reportsOps.parser.errors.$_id_
Tue Sep  3 20:32:43.358 	skipping collection: reportsOps.parser.errors.raw.$_id_
Tue Sep  3 20:32:43.358 	reportsOps.parser.files to dump/reportsOps/parser.files.bson
Tue Sep  3 20:32:43.936 		 555969 objects
Tue Sep  3 20:32:50.366 	Metadata for reportsOps.parser.files to dump/reportsOps/parser.files.metadata.json
Tue Sep  3 20:32:50.366 	reportsOps.bid.errors.raw to dump/reportsOps/bid.errors.raw.bson
Tue Sep  3 20:32:50.367 		 0 objects
Tue Sep  3 20:32:50.367 	Metadata for reportsOps.bid.errors.raw to dump/reportsOps/bid.errors.raw.metadata.json
Tue Sep  3 20:32:50.367 	reportsOps.parser.pixels to dump/reportsOps/parser.pixels.bson
Tue Sep  3 20:32:50.370 		 3103 objects
Tue Sep  3 20:32:50.370 	Metadata for reportsOps.parser.pixels to dump/reportsOps/parser.pixels.metadata.json
Tue Sep  3 20:32:50.370 	reportsOps.parser.errors to dump/reportsOps/parser.errors.bson
Tue Sep  3 20:32:50.376 		 9498 objects
Tue Sep  3 20:32:50.404 	Metadata for reportsOps.parser.errors to dump/reportsOps/parser.errors.metadata.json
Tue Sep  3 20:32:50.404 	reportsOps.parser.errors.raw to dump/reportsOps/parser.errors.raw.bson
Tue Sep  3 20:32:50.679 		 200000 objects
Tue Sep  3 20:32:53.993 	Metadata for reportsOps.parser.errors.raw to dump/reportsOps/parser.errors.raw.metadata.json
Tue Sep  3 20:32:53.994 DATABASE: reportsAds	 to 	dump/reportsAds
Tue Sep  3 20:32:53.994 	reportsAds.system.indexes to dump/reportsAds/system.indexes.bson
Tue Sep  3 20:32:53.994 		 8 objects
Tue Sep  3 20:32:53.994 	skipping collection: reportsAds.ads.errors.$_id_
Tue Sep  3 20:32:53.995 	skipping collection: reportsAds.ads.mainStats.$_id_
Tue Sep  3 20:32:53.995 	skipping collection: reportsAds.ads.requests.$_id_
Tue Sep  3 20:32:53.995 	skipping collection: reportsAds.ads.fallbacks.$_id_
Tue Sep  3 20:32:53.995 	skipping collection: reportsAds.vrp-184.ads.mainStats.$_id_
Tue Sep  3 20:32:53.995 	skipping collection: reportsAds.ads.mainStats.$cid_1_bid_1_pid_1_bcid_1_vid_1_avid_1_at_1_date_1
Tue Sep  3 20:32:53.995 	skipping collection: reportsAds.ads.requests.$cid_1_bid_1_pid_1_bcid_1_vid_1_at_1_date_1
Tue Sep  3 20:32:53.995 	skipping collection: reportsAds.ads.byEngineName.$_id_
Tue Sep  3 20:32:53.995 	reportsAds.ads.errors to dump/reportsAds/ads.errors.bson
Tue Sep  3 20:32:54.175 		 200000 objects
Tue Sep  3 20:32:56.189 	Metadata for reportsAds.ads.errors to dump/reportsAds/ads.errors.metadata.json
Tue Sep  3 20:32:56.189 	reportsAds.ads.mainStats to dump/reportsAds/ads.mainStats.bson
Tue Sep  3 20:32:56.255 		 75273 objects
Tue Sep  3 20:32:56.900 	Metadata for reportsAds.ads.mainStats to dump/reportsAds/ads.mainStats.metadata.json
Tue Sep  3 20:32:56.900 	reportsAds.ads.requests to dump/reportsAds/ads.requests.bson
Tue Sep  3 20:32:57.021 		 34911 objects
Tue Sep  3 20:32:57.058 	Metadata for reportsAds.ads.requests to dump/reportsAds/ads.requests.metadata.json
Tue Sep  3 20:32:57.058 	reportsAds.ads.fallbacks to dump/reportsAds/ads.fallbacks.bson
Tue Sep  3 20:32:57.059 		 356 objects
Tue Sep  3 20:32:57.059 	Metadata for reportsAds.ads.fallbacks to dump/reportsAds/ads.fallbacks.metadata.json
Tue Sep  3 20:32:57.059 	reportsAds.vrp-184.ads.mainStats to dump/reportsAds/vrp-184.ads.mainStats.bson
Tue Sep  3 20:32:57.081 		 19440 objects
Tue Sep  3 20:32:57.257 	Metadata for reportsAds.vrp-184.ads.mainStats to dump/reportsAds/vrp-184.ads.mainStats.metadata.json
Tue Sep  3 20:32:57.257 	reportsAds.ads.byEngineName to dump/reportsAds/ads.byEngineName.bson
Tue Sep  3 20:32:57.259 		 1106 objects
Tue Sep  3 20:32:57.259 	Metadata for reportsAds.ads.byEngineName to dump/reportsAds/ads.byEngineName.metadata.json
Tue Sep  3 20:32:57.259 DATABASE: reportsCore	 to 	dump/reportsCore
Tue Sep  3 20:32:57.260 	reportsCore.system.indexes to dump/reportsCore/system.indexes.bson
Tue Sep  3 20:32:57.260 		 14 objects
Tue Sep  3 20:32:57.260 	skipping collection: reportsCore.bid.mainStats.$_id_
Tue Sep  3 20:32:57.260 	skipping collection: reportsCore.dashboard.mainStats.$_id_
Tue Sep  3 20:32:57.260 	skipping collection: reportsCore.dashboard.totalStats.$_id_
Tue Sep  3 20:32:57.260 	skipping collection: reportsCore.player.mainStats.$_id_
Tue Sep  3 20:32:57.260 	skipping collection: reportsCore.video.mainStats.$_id_
Tue Sep  3 20:32:57.260 	skipping collection: reportsCore.adPackage.mainStats.$_id_
Tue Sep  3 20:32:57.260 	skipping collection: reportsCore.vrp-184.player.mainStats.$_id_
Tue Sep  3 20:32:57.260 	skipping collection: reportsCore.vrp-184.video.mainStats.$_id_
Tue Sep  3 20:32:57.260 	skipping collection: reportsCore.bid.mainStats.$cid_1_bid_1_bcid_1_date_1
Tue Sep  3 20:32:57.260 	skipping collection: reportsCore.player.mainStats.$cid_1_bid_1_pid_1_bcid_1_date_1
Tue Sep  3 20:32:57.260 	skipping collection: reportsCore.video.mainStats.$cid_1_bid_1_pid_1_bcid_1_vid_1_date_1
Tue Sep  3 20:32:57.260 	skipping collection: reportsCore.adPackage.mainStats.$apid_1_cid_1_bid_1_pid_1_bcid_1_date_1
Tue Sep  3 20:32:57.260 	skipping collection: reportsCore.macro.mainStats.$_id_
Tue Sep  3 20:32:57.260 	skipping collection: reportsCore.macro.mainStats.$cid_1_bid_1_pid_1_date_1_bcid_1_mn_1_mv_1
Tue Sep  3 20:32:57.260 	reportsCore.bid.mainStats to dump/reportsCore/bid.mainStats.bson
Tue Sep  3 20:32:57.268 		 10871 objects
Tue Sep  3 20:32:57.293 	Metadata for reportsCore.bid.mainStats to dump/reportsCore/bid.mainStats.metadata.json
Tue Sep  3 20:32:57.329 	reportsCore.dashboard.mainStats to dump/reportsCore/dashboard.mainStats.bson
Tue Sep  3 20:32:57.332 		 3099 objects
Tue Sep  3 20:32:57.332 	Metadata for reportsCore.dashboard.mainStats to dump/reportsCore/dashboard.mainStats.metadata.json
Tue Sep  3 20:32:57.332 	reportsCore.dashboard.totalStats to dump/reportsCore/dashboard.totalStats.bson
Tue Sep  3 20:32:57.332 		 1 objects
Tue Sep  3 20:32:57.332 	Metadata for reportsCore.dashboard.totalStats to dump/reportsCore/dashboard.totalStats.metadata.json
Tue Sep  3 20:32:57.332 	reportsCore.player.mainStats to dump/reportsCore/player.mainStats.bson
Tue Sep  3 20:32:57.349 		 17867 objects
Tue Sep  3 20:32:57.463 	Metadata for reportsCore.player.mainStats to dump/reportsCore/player.mainStats.metadata.json
Tue Sep  3 20:32:57.463 	reportsCore.video.mainStats to dump/reportsCore/video.mainStats.bson
Tue Sep  3 20:32:57.485 		 20597 objects
Tue Sep  3 20:32:57.635 	Metadata for reportsCore.video.mainStats to dump/reportsCore/video.mainStats.metadata.json
Tue Sep  3 20:32:57.635 	reportsCore.adPackage.mainStats to dump/reportsCore/adPackage.mainStats.bson
Tue Sep  3 20:32:57.636 		 40 objects
Tue Sep  3 20:32:57.636 	Metadata for reportsCore.adPackage.mainStats to dump/reportsCore/adPackage.mainStats.metadata.json
Tue Sep  3 20:32:57.636 	reportsCore.vrp-184.player.mainStats to dump/reportsCore/vrp-184.player.mainStats.bson
Tue Sep  3 20:32:57.639 		 3530 objects
Tue Sep  3 20:32:57.639 	Metadata for reportsCore.vrp-184.player.mainStats to dump/reportsCore/vrp-184.player.mainStats.metadata.json
Tue Sep  3 20:32:57.639 	reportsCore.vrp-184.video.mainStats to dump/reportsCore/vrp-184.video.mainStats.bson
Tue Sep  3 20:32:57.643 		 3867 objects
Tue Sep  3 20:32:57.673 	Metadata for reportsCore.vrp-184.video.mainStats to dump/reportsCore/vrp-184.video.mainStats.metadata.json
Tue Sep  3 20:32:57.673 	reportsCore.macro.mainStats to dump/reportsCore/macro.mainStats.bson
Tue Sep  3 20:32:57.856 		 192233 objects
Tue Sep  3 20:32:59.600 	Metadata for reportsCore.macro.mainStats to dump/reportsCore/macro.mainStats.metadata.json
Tue Sep  3 20:32:59.600 DATABASE: reportsSites	 to 	dump/reportsSites
Tue Sep  3 20:32:59.601 	reportsSites.system.indexes to dump/reportsSites/system.indexes.bson
Tue Sep  3 20:32:59.601 		 4 objects
Tue Sep  3 20:32:59.601 	skipping collection: reportsSites.domain.mainStats.$_id_
Tue Sep  3 20:32:59.601 	skipping collection: reportsSites.site.mainStats.$_id_
Tue Sep  3 20:32:59.601 	skipping collection: reportsSites.site.mainStats.$cid_1_bid_1_pid_1_bcid_1_site_1_date_1
Tue Sep  3 20:32:59.601 	skipping collection: reportsSites.domain.mainStats.$cid_1_bid_1_pid_1_bcid_1_domain_1_date_1
Tue Sep  3 20:32:59.601 	reportsSites.domain.mainStats to dump/reportsSites/domain.mainStats.bson
Tue Sep  3 20:33:00.467 		 357273 objects
Tue Sep  3 20:33:03.969 	Metadata for reportsSites.domain.mainStats to dump/reportsSites/domain.mainStats.metadata.json
Tue Sep  3 20:33:03.969 	reportsSites.site.mainStats to dump/reportsSites/site.mainStats.bson
Tue Sep  3 20:33:04.442 		 485776 objects
Tue Sep  3 20:33:08.880 	Metadata for reportsSites.site.mainStats to dump/reportsSites/site.mainStats.metadata.json
Tue Sep  3 20:33:08.880 DATABASE: reportsRaw	 to 	dump/reportsRaw
Tue Sep  3 20:33:08.880 	reportsRaw.system.indexes to dump/reportsRaw/system.indexes.bson
Tue Sep  3 20:33:08.881 		 4 objects
Tue Sep  3 20:33:08.881 	skipping collection: reportsRaw.video.plays.$_id_
Tue Sep  3 20:33:08.881 	skipping collection: reportsRaw.ads.errors.$_id_
Tue Sep  3 20:33:08.881 	skipping collection: reportsRaw.ads.starts.$_id_
Tue Sep  3 20:33:08.881 	skipping collection: reportsRaw.bid.errors.raw.$_id_
Tue Sep  3 20:33:08.881 	reportsRaw.video.plays to dump/reportsRaw/video.plays.bson
Tue Sep  3 20:33:12.004 		Collection File Writing Progress: 7894200/111702967	7%	(objects)
Tue Sep  3 20:33:15.022 		Collection File Writing Progress: 9344600/111702967	8%	(objects)
Tue Sep  3 20:33:18.440 		Collection File Writing Progress: 9831300/111702967	8%	(objects)
Tue Sep  3 20:33:21.612 		Collection File Writing Progress: 9954100/111702967	8%	(objects)
Tue Sep  3 20:33:24.105 		Collection File Writing Progress: 10036700/111702967	8%	(objects)
Tue Sep  3 20:33:27.525 		Collection File Writing Progress: 10145400/111702967	9%	(objects)
Tue Sep  3 20:33:30.618 		Collection File Writing Progress: 10253400/111702967	9%	(objects)
Tue Sep  3 20:33:33.819 		Collection File Writing Progress: 10361300/111702967	9%	(objects)
Tue Sep  3 20:33:36.204 		Collection File Writing Progress: 10442400/111702967	9%	(objects)
Tue Sep  3 20:33:39.288 		Collection File Writing Progress: 10550400/111702967	9%	(objects)
Tue Sep  3 20:33:42.422 		Collection File Writing Progress: 10658000/111702967	9%	(objects)
Tue Sep  3 20:33:45.718 		Collection File Writing Progress: 10765800/111702967	9%	(objects)
Tue Sep  3 20:33:48.093 		Collection File Writing Progress: 10846300/111702967	9%	(objects)
Tue Sep  3 20:33:51.237 		Collection File Writing Progress: 10954300/111702967	9%	(objects)
Tue Sep  3 20:33:54.299 		Collection File Writing Progress: 11061600/111702967	9%	(objects)
Tue Sep  3 20:33:57.625 		Collection File Writing Progress: 11167800/111702967	9%	(objects)
Tue Sep  3 20:34:00.071 		Collection File Writing Progress: 11248200/111702967	10%	(objects)
Tue Sep  3 20:34:03.163 		Collection File Writing Progress: 11355800/111702967	10%	(objects)
Tue Sep  3 20:34:06.022 		Collection File Writing Progress: 12204300/111702967	10%	(objects)
Tue Sep  3 20:34:09.029 		Collection File Writing Progress: 13616400/111702967	12%	(objects)
Tue Sep  3 20:34:12.662 		Collection File Writing Progress: 14040700/111702967	12%	(objects)
Tue Sep  3 20:34:15.725 		Collection File Writing Progress: 14146000/111702967	12%	(objects)
Tue Sep  3 20:34:18.097 		Collection File Writing Progress: 14225100/111702967	12%	(objects)
Tue Sep  3 20:34:21.422 		Collection File Writing Progress: 14329800/111702967	12%	(objects)
Tue Sep  3 20:54:22.796 Socket recv() errno:104 Connection reset by peer 127.0.0.1:27018
Tue Sep  3 20:54:22.797 SocketException: remote: 127.0.0.1:27018 error: 9001 socket exception [1] server [127.0.0.1:27018]
Tue Sep  3 20:54:22.798 User Assertion: 16465:recv failed while exhausting cursor
assertion: 16465 recv failed while exhausting cursor
ubuntu@ReportingDB-use1b-01:/data/backup$

Comment by Vladimir Poluyaktov [ 03/Sep/13 ]

@Dan: I don't think this could be an iptables-related issue, because this is an EC2 host in its standard configuration, which means I didn't set up any additional firewall besides the EC2 Security Group rules. In my tests I ran mongodump directly on the primary MongoDB host against localhost:27018, so I don't think EC2 Security Groups can cause any problem here.

Here is current iptables configuration and network interfaces statistics:

 
ubuntu@ReportingDB-use1b-01:~$ sudo iptables -L -v -n
Chain INPUT (policy ACCEPT 609 packets, 69975 bytes)
 pkts bytes target     prot opt in     out     source               destination
 
Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination
 
Chain OUTPUT (policy ACCEPT 529 packets, 115K bytes)
 pkts bytes target     prot opt in     out     source               destination
ubuntu@ReportingDB-use1b-01:~$

ubuntu@ReportingDB-use1b-01:~$ sudo ufw status
Status: inactive
ubuntu@ReportingDB-use1b-01:~$

ubuntu@ReportingDB-use1b-01:~$ sudo ifconfig -v
eth0      Link encap:Ethernet  HWaddr 22:00:0a:91:c9:3a
          inet addr:10.145.201.58  Bcast:10.145.201.63  Mask:255.255.255.192
          inet6 addr: fe80::2000:aff:fe91:c93a/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1465108164 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1309414905 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:363073741081 (363.0 GB)  TX bytes:770517143796 (770.5 GB)
          Interrupt:48
 
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:2416653 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2416653 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:10942134154 (10.9 GB)  TX bytes:10942134154 (10.9 GB)
 
ubuntu@ReportingDB-use1b-01:~$
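Even with empty iptables chains, a loopback connection can still be reset by the host itself - for example by a full connection-tracking table (if the nf_conntrack modules are loaded) or by keepalive/idle-timeout behavior. A few extra checks that could be run on the shard host (suggested commands, not output from this ticket):

# is connection tracking in use, and how full is its table?
lsmod | grep -i conntrack
cat /proc/sys/net/netfilter/nf_conntrack_count /proc/sys/net/netfilter/nf_conntrack_max 2>/dev/null

# any conntrack "table full" or out-of-memory complaints around the failure time?
dmesg | grep -iE 'conntrack|out of memory'

# kernel TCP keepalive settings (relevant to connections that sit idle for ~20 minutes)
sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes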

Comment by Pavlo Grinchenko [ 02/Sep/13 ]

Thank you for all the suggestions - we will try them ASAP and post the results here.

Comment by David Hows [ 02/Sep/13 ]

Hi Pavel,

As you can see, something stops after that last getmore outlined by Dan above; beyond that point, no further requests appear to come from the mongodump process.

I suspect that the socket closure is due to a timeout as data stopped being sent over that connection.

If the suggestions from Dan and Asya fail to pan out, would you be able to run mongosniff on the connection between mongodump and mongod? In addition to mongosniff, could you also increase the verbosity of mongodump by adding -vvvvv to the arguments you use to initiate the backup? This should give us a solid idea of what data is traveling between the two and whether there are any blockages on the mongodump side of the processing.
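Something along these lines should capture both sides at once (interface and port as in the direct-to-shard runs elsewhere in this ticket; file names are just examples):

# wire-level capture of the mongodump <-> mongod traffic on the loopback interface
sudo mongosniff --source NET lo 27018 > mongosniff.log 2>&1 &

# re-run the dump with maximum tool-side verbosity
mongodump --host localhost --port 27018 -vvvvv 2>&1 | tee mongodump_verbose.log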

Thanks,
David

Comment by Asya Kamsky [ 31/Aug/13 ]

Pavel,

One more thing you could try for us: mongodump takes an option '--forceTableScan' - could you try that and let us know if it hangs the same way? Again, db.currentOp() output from when the hang starts would be useful - you can also run db.currentOp(true), which gives much larger output, if you can attach it. We can move this ticket to the community private area if there is potentially sensitive information in the output.
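A minimal way to capture those snapshots while the dump is hanging (host/port and file names below are illustrative):

# take a db.currentOp(true) snapshot every 5 minutes, four times
for i in 1 2 3 4; do
  mongo --quiet localhost:27017 --eval 'printjson(db.currentOp(true))' > currentOp.$i.json
  sleep 300
done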

Comment by Daniel Pasette (Inactive) [ 31/Aug/13 ]

Looking through the mongod logs you shared, we see the following sequence of events in the attached log:

mongodump connection established:

Thu Aug 29 18:53:42.406 [initandlisten] connection accepted from 127.0.0.1:53681 #2495168 (38 connections now open)

Last logged operation on the mongodump connection (conn2495168) before timeout:

Thu Aug 29 18:56:06.307 [conn2495168] getmore reportsRaw.video.plays query: { query: {}, $snapshot: true } cursorid:5161678121468035488 ntoreturn:0 exhaust:1 keyUpdates:0 numYields: 1 locks(micros) r:148426 nreturned:30694 reslen:4194289 78ms

Connection times out:

Thu Aug 29 19:11:45.483 [conn2495168] Socket say send() errno:110 Connection timed out 127.0.0.1:53681
Thu Aug 29 19:11:45.483 [conn2495168] SocketException handling request, closing client connection: 9001 socket exception [2] server [127.0.0.1:53681]

mongodump finally dies.

Thu Aug 29 19:16:06.615 Socket recv() errno:104 Connection reset by peer 127.0.0.1:27018

I'm aware of one other case which exhibited behavior similar to what you're describing. In that case, the user resolved the issue, saying that new iptables rules had caused it. Unfortunately, they never explained exactly what the problem was. Does this perhaps point you to any changes in your environment that coincide with this failure?

Could you run db.currentOp() when the mongodump begins to hang?

Comment by Vladimir Poluyaktov [ 30/Aug/13 ]

Actually, I did run it with the 'true' argument. Look at the first line of the validate log in one of my previous comments:

TpParser-use1b-02(mongos-2.4.5)[mongos] reportsRaw> db.video.plays.validate(true)
...

Comment by Asya Kamsky [ 30/Aug/13 ]

It's going to be time-consuming, but could you run validate again with the argument true? We may get more information from that.

Comment by Pavlo Grinchenko [ 30/Aug/13 ]

Asya and David,

This is a big problem now. We cannot do backups of our database (and it's a relatively small one). We tried to back up from localhost (removing the network aspect) and it failed again.

Comment by Vladimir Poluyaktov [ 29/Aug/13 ]

I also re-ran the failing mongodump with log level 4, as David suggested (see the result below).
I ran it directly on the PRIMARY shard host (not via mongos) to exclude possible network issues.

Here is the mongodump log:

ubuntu@ReportingDB-use1b-01:/data/backup$ mongo localhost:27018
MongoDB shell version: 2.4.5
connecting to: localhost:27018/test
SH01:PRIMARY> db.adminCommand({setParameter:1, logLevel:4})
{ "was" : 0, "ok" : 1 }
SH01:PRIMARY>
SH01:PRIMARY> exit
bye
 
 
ubuntu@ReportingDB-use1b-01:/data/backup$ sudo mongodump --host localhost --port 27018
connected to: localhost:27018
Thu Aug 29 18:53:42.407 all dbs
Thu Aug 29 18:53:42.408 DATABASE: reportsOps	 to 	dump/reportsOps
Thu Aug 29 18:53:42.409 	reportsOps.system.indexes to dump/reportsOps/system.indexes.bson
Thu Aug 29 18:53:42.409 		 5 objects
Thu Aug 29 18:53:42.409 	reportsOps.parser.files to dump/reportsOps/parser.files.bson
Thu Aug 29 18:53:43.907 		 526598 objects
Thu Aug 29 18:53:49.414 	Metadata for reportsOps.parser.files to dump/reportsOps/parser.files.metadata.json
Thu Aug 29 18:53:49.414 	reportsOps.bid.errors.raw to dump/reportsOps/bid.errors.raw.bson
Thu Aug 29 18:53:49.500 		 0 objects
Thu Aug 29 18:53:49.500 	Metadata for reportsOps.bid.errors.raw to dump/reportsOps/bid.errors.raw.metadata.json
Thu Aug 29 18:53:49.501 	reportsOps.parser.pixels to dump/reportsOps/parser.pixels.bson
Thu Aug 29 18:53:49.608 		 2980 objects
Thu Aug 29 18:53:49.609 	Metadata for reportsOps.parser.pixels to dump/reportsOps/parser.pixels.metadata.json
Thu Aug 29 18:53:49.609 	reportsOps.parser.errors to dump/reportsOps/parser.errors.bson
Thu Aug 29 18:53:49.650 		 6761 objects
Thu Aug 29 18:53:49.651 	Metadata for reportsOps.parser.errors to dump/reportsOps/parser.errors.metadata.json
Thu Aug 29 18:53:49.651 	reportsOps.parser.errors.raw to dump/reportsOps/parser.errors.raw.bson
Thu Aug 29 18:53:50.212 		 199299 objects
Thu Aug 29 18:53:50.212 	Metadata for reportsOps.parser.errors.raw to dump/reportsOps/parser.errors.raw.metadata.json
Thu Aug 29 18:53:50.212 DATABASE: reportsAds	 to 	dump/reportsAds
Thu Aug 29 18:53:50.212 	reportsAds.system.indexes to dump/reportsAds/system.indexes.bson
Thu Aug 29 18:53:50.213 		 7 objects
Thu Aug 29 18:53:50.213 	reportsAds.ads.errors to dump/reportsAds/ads.errors.bson
Thu Aug 29 18:53:50.543 		 200000 objects
Thu Aug 29 18:53:51.993 	Metadata for reportsAds.ads.errors to dump/reportsAds/ads.errors.metadata.json
Thu Aug 29 18:53:51.993 	reportsAds.ads.mainStats to dump/reportsAds/ads.mainStats.bson
Thu Aug 29 18:53:53.518 		 66417 objects
Thu Aug 29 18:53:53.809 	Metadata for reportsAds.ads.mainStats to dump/reportsAds/ads.mainStats.metadata.json
Thu Aug 29 18:53:53.839 	reportsAds.ads.requests to dump/reportsAds/ads.requests.bson
Thu Aug 29 18:53:55.605 		 29695 objects
Thu Aug 29 18:53:55.606 	Metadata for reportsAds.ads.requests to dump/reportsAds/ads.requests.metadata.json
Thu Aug 29 18:53:55.606 	reportsAds.ads.fallbacks to dump/reportsAds/ads.fallbacks.bson
Thu Aug 29 18:53:55.680 		 125 objects
Thu Aug 29 18:53:55.681 	Metadata for reportsAds.ads.fallbacks to dump/reportsAds/ads.fallbacks.metadata.json
Thu Aug 29 18:53:55.681 	reportsAds.vrp-184.ads.mainStats to dump/reportsAds/vrp-184.ads.mainStats.bson
Thu Aug 29 18:53:55.710 		 19440 objects
Thu Aug 29 18:54:00.508 	Metadata for reportsAds.vrp-184.ads.mainStats to dump/reportsAds/vrp-184.ads.mainStats.metadata.json
Thu Aug 29 18:54:00.508 DATABASE: reportsCore	 to 	dump/reportsCore
Thu Aug 29 18:54:00.508 	reportsCore.system.indexes to dump/reportsCore/system.indexes.bson
Thu Aug 29 18:54:00.509 		 13 objects
Thu Aug 29 18:54:00.509 	reportsCore.bid.mainStats to dump/reportsCore/bid.mainStats.bson
Thu Aug 29 18:54:00.634 		 9837 objects
Thu Aug 29 18:54:00.634 	Metadata for reportsCore.bid.mainStats to dump/reportsCore/bid.mainStats.metadata.json
Thu Aug 29 18:54:00.634 	reportsCore.dashboard.mainStats to dump/reportsCore/dashboard.mainStats.bson
Thu Aug 29 18:54:00.693 		 2976 objects
Thu Aug 29 18:54:00.693 	Metadata for reportsCore.dashboard.mainStats to dump/reportsCore/dashboard.mainStats.metadata.json
Thu Aug 29 18:54:00.693 	reportsCore.dashboard.totalStats to dump/reportsCore/dashboard.totalStats.bson
Thu Aug 29 18:54:00.694 		 1 objects
Thu Aug 29 18:54:00.694 	Metadata for reportsCore.dashboard.totalStats to dump/reportsCore/dashboard.totalStats.metadata.json
Thu Aug 29 18:54:00.694 	reportsCore.player.mainStats to dump/reportsCore/player.mainStats.bson
Thu Aug 29 18:54:00.757 		 14819 objects
Thu Aug 29 18:54:00.757 	Metadata for reportsCore.player.mainStats to dump/reportsCore/player.mainStats.metadata.json
Thu Aug 29 18:54:00.757 	reportsCore.video.mainStats to dump/reportsCore/video.mainStats.bson
Thu Aug 29 18:54:00.899 		 17617 objects
Thu Aug 29 18:54:00.900 	Metadata for reportsCore.video.mainStats to dump/reportsCore/video.mainStats.metadata.json
Thu Aug 29 18:54:00.900 	reportsCore.adPackage.mainStats to dump/reportsCore/adPackage.mainStats.bson
Thu Aug 29 18:54:00.900 		 40 objects
Thu Aug 29 18:54:00.901 	Metadata for reportsCore.adPackage.mainStats to dump/reportsCore/adPackage.mainStats.metadata.json
Thu Aug 29 18:54:00.901 	reportsCore.vrp-184.player.mainStats to dump/reportsCore/vrp-184.player.mainStats.bson
Thu Aug 29 18:54:00.906 		 3530 objects
Thu Aug 29 18:54:00.906 	Metadata for reportsCore.vrp-184.player.mainStats to dump/reportsCore/vrp-184.player.mainStats.metadata.json
Thu Aug 29 18:54:00.906 	reportsCore.vrp-184.video.mainStats to dump/reportsCore/vrp-184.video.mainStats.bson
Thu Aug 29 18:54:00.913 		 3867 objects
Thu Aug 29 18:54:00.913 	Metadata for reportsCore.vrp-184.video.mainStats to dump/reportsCore/vrp-184.video.mainStats.metadata.json
Thu Aug 29 18:54:00.932 	reportsCore.macro.mainStats to dump/reportsCore/macro.mainStats.bson
Thu Aug 29 18:54:01.149 		 31984 objects
Thu Aug 29 18:54:01.287 	Metadata for reportsCore.macro.mainStats to dump/reportsCore/macro.mainStats.metadata.json
Thu Aug 29 18:54:01.287 DATABASE: reportsSites	 to 	dump/reportsSites
Thu Aug 29 18:54:01.287 	reportsSites.system.indexes to dump/reportsSites/system.indexes.bson
Thu Aug 29 18:54:01.288 		 4 objects
Thu Aug 29 18:54:01.288 	reportsSites.domain.mainStats to dump/reportsSites/domain.mainStats.bson
Thu Aug 29 18:54:04.003 		Collection File Writing Progress: 293000/315362	92%	(objects)
Thu Aug 29 18:54:04.118 		 315362 objects
Thu Aug 29 18:54:06.593 	Metadata for reportsSites.domain.mainStats to dump/reportsSites/domain.mainStats.metadata.json
Thu Aug 29 18:54:06.593 	reportsSites.site.mainStats to dump/reportsSites/site.mainStats.bson
Thu Aug 29 18:54:07.750 		 418294 objects
Thu Aug 29 18:54:07.750 	Metadata for reportsSites.site.mainStats to dump/reportsSites/site.mainStats.metadata.json
Thu Aug 29 18:54:07.750 DATABASE: reportsRaw	 to 	dump/reportsRaw
Thu Aug 29 18:54:07.751 	reportsRaw.system.indexes to dump/reportsRaw/system.indexes.bson
Thu Aug 29 18:54:07.751 		 4 objects
Thu Aug 29 18:54:07.751 	reportsRaw.video.plays to dump/reportsRaw/video.plays.bson
Thu Aug 29 18:54:10.024 		Collection File Writing Progress: 850900/93537719	0%	(objects)
Thu Aug 29 18:54:13.238 		Collection File Writing Progress: 1185300/93537719	1%	(objects)
Thu Aug 29 18:54:16.418 		Collection File Writing Progress: 1307800/93537719	1%	(objects)
Thu Aug 29 18:54:19.458 		Collection File Writing Progress: 1419200/93537719	1%	(objects)
Thu Aug 29 18:54:22.027 		Collection File Writing Progress: 1606700/93537719	1%	(objects)
Thu Aug 29 18:54:25.007 		Collection File Writing Progress: 2899100/93537719	3%	(objects)
Thu Aug 29 18:54:28.034 		Collection File Writing Progress: 4228100/93537719	4%	(objects)
Thu Aug 29 18:54:31.038 		Collection File Writing Progress: 5592400/93537719	5%	(objects)
Thu Aug 29 18:54:34.138 		Collection File Writing Progress: 6701900/93537719	7%	(objects)
Thu Aug 29 18:54:37.029 		Collection File Writing Progress: 7625500/93537719	8%	(objects)
Thu Aug 29 18:54:40.021 		Collection File Writing Progress: 8568200/93537719	9%	(objects)
Thu Aug 29 18:54:43.041 		Collection File Writing Progress: 9506400/93537719	10%	(objects)
Thu Aug 29 18:54:46.131 		Collection File Writing Progress: 10442400/93537719	11%	(objects)
Thu Aug 29 18:54:49.130 		Collection File Writing Progress: 11355800/93537719	12%	(objects)
Thu Aug 29 18:54:52.211 		Collection File Writing Progress: 11766800/93537719	12%	(objects)
Thu Aug 29 18:54:55.717 		Collection File Writing Progress: 11964200/93537719	12%	(objects)
Thu Aug 29 18:54:58.518 		Collection File Writing Progress: 12070500/93537719	12%	(objects)
Thu Aug 29 18:55:01.527 		Collection File Writing Progress: 12177400/93537719	13%	(objects)
Thu Aug 29 18:55:04.651 		Collection File Writing Progress: 12284600/93537719	13%	(objects)
Thu Aug 29 18:55:07.562 		Collection File Writing Progress: 12391700/93537719	13%	(objects)
Thu Aug 29 18:55:10.360 		Collection File Writing Progress: 12498600/93537719	13%	(objects)
Thu Aug 29 18:55:13.310 		Collection File Writing Progress: 12604600/93537719	13%	(objects)
Thu Aug 29 18:55:16.216 		Collection File Writing Progress: 12710100/93537719	13%	(objects)
Thu Aug 29 18:55:19.038 		Collection File Writing Progress: 12815500/93537719	13%	(objects)
Thu Aug 29 18:55:22.023 		Collection File Writing Progress: 12920900/93537719	13%	(objects)
Thu Aug 29 18:55:25.739 		Collection File Writing Progress: 13054200/93537719	13%	(objects)
Thu Aug 29 18:55:28.016 		Collection File Writing Progress: 13134300/93537719	14%	(objects)
Thu Aug 29 18:55:31.361 		Collection File Writing Progress: 13241600/93537719	14%	(objects)
Thu Aug 29 18:55:34.024 		Collection File Writing Progress: 13493900/93537719	14%	(objects)
Thu Aug 29 18:55:37.046 		Collection File Writing Progress: 14746800/93537719	15%	(objects)
Thu Aug 29 18:55:40.373 		Collection File Writing Progress: 15345200/93537719	16%	(objects)
Thu Aug 29 18:55:43.303 		Collection File Writing Progress: 15468100/93537719	16%	(objects)
Thu Aug 29 18:55:46.003 		Collection File Writing Progress: 15600500/93537719	16%	(objects)
Thu Aug 29 18:55:49.428 		Collection File Writing Progress: 15714900/93537719	16%	(objects)
Thu Aug 29 18:55:52.583 		Collection File Writing Progress: 15822800/93537719	16%	(objects)
Thu Aug 29 18:55:55.461 		Collection File Writing Progress: 16627700/93537719	17%	(objects)
Thu Aug 29 18:55:58.128 		Collection File Writing Progress: 17675600/93537719	18%	(objects)
Thu Aug 29 18:56:01.009 		Collection File Writing Progress: 18481500/93537719	19%	(objects)
Thu Aug 29 18:56:04.451 		Collection File Writing Progress: 18723800/93537719	20%	(objects)
Thu Aug 29 19:16:06.615 Socket recv() errno:104 Connection reset by peer 127.0.0.1:27018
Thu Aug 29 19:16:06.699 SocketException: remote: 127.0.0.1:27018 error: 9001 socket exception [1] server [127.0.0.1:27018]
assertion: 16465 recv failed while exhausting cursor
ubuntu@ReportingDB-use1b-01:/data/backup$

The mongod log is attached (see the SH01.log.gz file).

Comment by Vladimir Poluyaktov [ 29/Aug/13 ]

I've just run a validate() command on the reportsRaw.video.plays collection, as Asya Kamsky suggested on mongodb-user@googlegroups.com.

Here is the result (no errors found):

TpParser-use1b-02(mongos-2.4.5)[mongos] reportsRaw> db.video.plays.validate(true)
{
  "raw": {
    "SH01/ReportingDB-use1a-01.vidible.tv:27018,ReportingDB-use1b-01.vidible.tv:27018,ReportingDB-use1d-01.vidible.tv:27018": {
      "ns": "reportsRaw.video.plays",
      "firstExtent": "0:2000 ns:reportsRaw.video.plays",
      "lastExtent": "30:2000 ns:reportsRaw.video.plays",
      "extentCount": 27,
      "extents": [
        {
          "loc": "0:2000",
          "xnext": "0:11000",
          "xprev": "null",
          "nsdiag": "reportsRaw.video.plays",
          "size": 12288,
          "firstRecord": "0:20b0",
          "lastRecord": "0:4f60"
        },
        {
          "loc": "0:11000",
          "xnext": "3:962f000",
          "xprev": "0:2000",
          "nsdiag": "reportsRaw.video.plays",
          "size": 49152,
          "firstRecord": "0:110b0",
          "lastRecord": "0:1cef0"
        },
        {
          "loc": "3:962f000",
          "xnext": "3:97a7000",
          "xprev": "0:11000",
          "nsdiag": "reportsRaw.video.plays",
          "size": 196608,
          "firstRecord": "3:962f0b0",
          "lastRecord": "3:965ef10"
        },
        {
          "loc": "3:97a7000",
          "xnext": "3:9d87000",
          "xprev": "3:962f000",
          "nsdiag": "reportsRaw.video.plays",
          "size": 786432,
          "firstRecord": "3:97a70b0",
          "lastRecord": "3:9866f20"
        },
        {
          "loc": "3:9d87000",
          "xnext": "3:aafe000",
          "xprev": "3:97a7000",
          "nsdiag": "reportsRaw.video.plays",
          "size": 3145728,
          "firstRecord": "3:9d870b0",
          "lastRecord": "3:a086f10"
        },
        {
          "loc": "3:aafe000",
          "xnext": "3:d362000",
          "xprev": "3:9d87000",
          "nsdiag": "reportsRaw.video.plays",
          "size": 12582912,
          "firstRecord": "3:aafe0b0",
          "lastRecord": "3:b6fdf50"
        },
        {
          "loc": "3:d362000",
          "xnext": "3:10bd6000",
          "xprev": "3:aafe000",
          "nsdiag": "reportsRaw.video.plays",
          "size": 16990208,
          "firstRecord": "3:d3620b0",
          "lastRecord": "3:e395ef0"
        },
        {
          "loc": "3:10bd6000",
          "xnext": "3:163ac000",
          "xprev": "3:d362000",
          "nsdiag": "reportsRaw.video.plays",
          "size": 22937600,
          "firstRecord": "3:10bd60b0",
          "lastRecord": "3:121b5ef0"
        },
        {
          "loc": "3:163ac000",
          "xnext": "3:1c5af000",
          "xprev": "3:10bd6000",
          "nsdiag": "reportsRaw.video.plays",
          "size": 30965760,
          "firstRecord": "3:163ac0b0",
          "lastRecord": "3:18133ec0"
        },
        {
          "loc": "3:1c5af000",
          "xnext": "1:2000",
          "xprev": "3:163ac000",
          "nsdiag": "reportsRaw.video.plays",
          "size": 41803776,
          "firstRecord": "3:1c5af0b0",
          "lastRecord": "3:1ed8ced0"
        },
        {
          "loc": "1:2000",
          "xnext": "4:5086000",
          "xprev": "3:1c5af000",
          "nsdiag": "reportsRaw.video.plays",
          "size": 56438784,
          "firstRecord": "1:20b0",
          "lastRecord": "1:35d4f10"
        },
        {
          "loc": "4:5086000",
          "xnext": "4:1c4ee000",
          "xprev": "1:2000",
          "nsdiag": "reportsRaw.video.plays",
          "size": 76193792,
          "firstRecord": "4:50860b0",
          "lastRecord": "4:992fed0"
        },
        {
          "loc": "4:1c4ee000",
          "xnext": "4:262d5000",
          "xprev": "4:5086000",
          "nsdiag": "reportsRaw.video.plays",
          "size": 102862848,
          "firstRecord": "4:1c4ee0b0",
          "lastRecord": "4:22706f10"
        },
        {
          "loc": "4:262d5000",
          "xnext": "5:10b75000",
          "xprev": "4:1c4ee000",
          "nsdiag": "reportsRaw.video.plays",
          "size": 138866688,
          "firstRecord": "4:262d50b0",
          "lastRecord": "4:2e743f30"
        },
        {
          "loc": "5:10b75000",
          "xnext": "5:3862d000",
          "xprev": "4:262d5000",
          "nsdiag": "reportsRaw.video.plays",
          "size": 187469824,
          "firstRecord": "5:10b750b0",
          "lastRecord": "5:1be3df20"
        },
        {
          "loc": "5:3862d000",
          "xnext": "6:2000",
          "xprev": "5:10b75000",
          "nsdiag": "reportsRaw.video.plays",
          "size": 253087744,
          "firstRecord": "5:3862d0b0",
          "lastRecord": "5:47789ee0"
        },
        {
          "loc": "6:2000",
          "xnext": "6:484cd000",
          "xprev": "5:3862d000",
          "nsdiag": "reportsRaw.video.plays",
          "size": 341671936,
          "firstRecord": "6:20b0",
          "lastRecord": "6:145d9ee0"
        },
        {
          "loc": "6:484cd000",
          "xnext": "7:461cb000",
          "xprev": "6:2000",
          "nsdiag": "reportsRaw.video.plays",
          "size": 461258752,
          "firstRecord": "6:484cd0b0",
          "lastRecord": "6:63cb0ed0"
        },
        {
          "loc": "7:461cb000",
          "xnext": "8:4af46000",
          "xprev": "6:484cd000",
          "nsdiag": "reportsRaw.video.plays",
          "size": 622702592,
          "firstRecord": "7:461cb0b0",
          "lastRecord": "7:6b3a5ef0"
        },
        {
          "loc": "8:4af46000",
          "xnext": "9:2000",
          "xprev": "7:461cb000",
          "nsdiag": "reportsRaw.video.plays",
          "size": 840650752,
          "firstRecord": "8:4af460b0",
          "lastRecord": "8:7d0faf00"
        },
        {
          "loc": "9:2000",
          "xnext": "11:2000",
          "xprev": "8:4af46000",
          "nsdiag": "reportsRaw.video.plays",
          "size": 1134878720,
          "firstRecord": "9:20b0",
          "lastRecord": "9:43a4ff40"
        },
        {
          "loc": "11:2000",
          "xnext": "13:2000",
          "xprev": "9:2000",
          "nsdiag": "reportsRaw.video.plays",
          "size": 1532088320,
          "firstRecord": "11:20b0",
          "lastRecord": "11:5b51ef50"
        },
        {
          "loc": "13:2000",
          "xnext": "17:2000",
          "xprev": "11:2000",
          "nsdiag": "reportsRaw.video.plays",
          "size": 2068320256,
          "firstRecord": "13:20b0",
          "lastRecord": "13:7b482ec0"
        },
        {
          "loc": "17:2000",
          "xnext": "21:2000",
          "xprev": "13:2000",
          "nsdiag": "reportsRaw.video.plays",
          "size": 2146426864,
          "firstRecord": "17:20b0",
          "lastRecord": "17:7feffed0"
        },
        {
          "loc": "21:2000",
          "xnext": "24:2000",
          "xprev": "17:2000",
          "nsdiag": "reportsRaw.video.plays",
          "size": 2146426864,
          "firstRecord": "21:20b0",
          "lastRecord": "21:7fefff50"
        },
        {
          "loc": "24:2000",
          "xnext": "30:2000",
          "xprev": "21:2000",
          "nsdiag": "reportsRaw.video.plays",
          "size": 2146426864,
          "firstRecord": "24:20b0",
          "lastRecord": "24:7fefff00"
        },
        {
          "loc": "30:2000",
          "xnext": "null",
          "xprev": "24:2000",
          "nsdiag": "reportsRaw.video.plays",
          "size": 2146426864,
          "firstRecord": "30:20b0",
          "lastRecord": "30:71755260"
        }
      ],
      "datasize": 14792138048,
      "nrecords": 93537719,
      "lastExtentSize": 2146426864,
      "padding": 1,
      "firstExtentDetails": {
        "loc": "0:2000",
        "xnext": "0:11000",
        "xprev": "null",
        "nsdiag": "reportsRaw.video.plays",
        "size": 12288,
        "firstRecord": "0:20b0",
        "lastRecord": "0:4f60"
      },
      "lastExtentDetails": {
        "loc": "30:2000",
        "xnext": "null",
        "xprev": "24:2000",
        "nsdiag": "reportsRaw.video.plays",
        "size": 2146426864,
        "firstRecord": "30:20b0",
        "lastRecord": "30:71755260"
      },
      "objectsFound": 93537719,
      "invalidObjects": 0,
      "bytesWithHeaders": 16288741552,
      "bytesWithoutHeaders": 14792138048,
      "deletedCount": 23,
      "deletedSize": 242922624,
      "nIndexes": 1,
      "keysPerIndex": {
        "reportsRaw.video.plays.$_id_": 93537719
      },
      "valid": true,
      "errors": [ ],
      "ok": 1
    }
  },
  "valid": true,
  "ok": 1
}
 

Comment by David Hows [ 29/Aug/13 ]

Hi Vladimir,

Thanks for taking the time and following up with this issue.

I've been looking into the logs you uploaded and the details you've attached and agree that something strange seems to be going on.

To that end, would it be possible for you to attempt to re-run the failing mongodump with log level 4 set on your MongoS, and then upload the logs to this ticket after the execution ends (fails)? I'd like to see what is going on inside the MongoS and whether this can shed light on what the problem is.

The command to set logLevel to 4 is:

db.adminCommand({setParameter:1, logLevel:4})
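Since logLevel 4 is very verbose, it may also be worth restoring the previous value once the run has failed; the shell output quoted earlier in this ticket shows the previous value here was 0:

db.adminCommand({setParameter:1, logLevel:0})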

Thanks,
David

Comment by Vladimir Poluyaktov [ 29/Aug/13 ]

I've just performed one more experiment - I stopped all loaders (so there was no load on the DB at all) and ran mongodump without any parameters on one of the secondary shard hosts against localhost and port 27018.
The mongodump failed with this error:
Wed Aug 28 23:56:03.195 SocketException: remote: 127.0.0.1:27018 error: 9001 socket exception [1] server [127.0.0.1:27018]
assertion: 16465 recv failed while exhausting cursor

Here is the log:

Wed Aug 28 23:35:59.944         Metadata for reportsAds.ads.mainStats to dump/reportsAds/ads.mainStats.metadata.json
Wed Aug 28 23:35:59.944         reportsAds.ads.requests to dump/reportsAds/ads.requests.bson
Wed Aug 28 23:36:00.066                  28983 objects
Wed Aug 28 23:36:00.066         Metadata for reportsAds.ads.requests to dump/reportsAds/ads.requests.metadata.json
Wed Aug 28 23:36:00.066         reportsAds.ads.fallbacks to dump/reportsAds/ads.fallbacks.bson
Wed Aug 28 23:36:00.067                  88 objects
Wed Aug 28 23:36:00.067         Metadata for reportsAds.ads.fallbacks to dump/reportsAds/ads.fallbacks.metadata.json
Wed Aug 28 23:36:00.067         reportsAds.vrp-184.ads.mainStats to dump/reportsAds/vrp-184.ads.mainStats.bson
Wed Aug 28 23:36:00.093                  19440 objects
Wed Aug 28 23:36:00.093         Metadata for reportsAds.vrp-184.ads.mainStats to dump/reportsAds/vrp-184.ads.mainStats.metadata.json
Wed Aug 28 23:36:00.093 DATABASE: reportsCore    to     dump/reportsCore
Wed Aug 28 23:36:00.094         reportsCore.system.indexes to dump/reportsCore/system.indexes.bson
Wed Aug 28 23:36:00.094                  13 objects
Wed Aug 28 23:36:00.094         reportsCore.bid.mainStats to dump/reportsCore/bid.mainStats.bson
Wed Aug 28 23:36:00.108                  9670 objects
Wed Aug 28 23:36:00.108         Metadata for reportsCore.bid.mainStats to dump/reportsCore/bid.mainStats.metadata.json
Wed Aug 28 23:36:00.108         reportsCore.dashboard.mainStats to dump/reportsCore/dashboard.mainStats.bson
Wed Aug 28 23:36:00.111                  2958 objects
Wed Aug 28 23:36:00.111         Metadata for reportsCore.dashboard.mainStats to dump/reportsCore/dashboard.mainStats.metadata.json
Wed Aug 28 23:36:00.112         reportsCore.dashboard.totalStats to dump/reportsCore/dashboard.totalStats.bson
Wed Aug 28 23:36:00.112                  1 objects
Wed Aug 28 23:36:00.112         Metadata for reportsCore.dashboard.totalStats to dump/reportsCore/dashboard.totalStats.metadata.json
Wed Aug 28 23:36:00.112         reportsCore.player.mainStats to dump/reportsCore/player.mainStats.bson
Wed Aug 28 23:36:00.136                  14395 objects
Wed Aug 28 23:36:00.136         Metadata for reportsCore.player.mainStats to dump/reportsCore/player.mainStats.metadata.json
Wed Aug 28 23:36:00.136         reportsCore.video.mainStats to dump/reportsCore/video.mainStats.bson
Wed Aug 28 23:36:00.164                  17213 objects
Wed Aug 28 23:36:00.164         Metadata for reportsCore.video.mainStats to dump/reportsCore/video.mainStats.metadata.json
Wed Aug 28 23:36:00.164         reportsCore.adPackage.mainStats to dump/reportsCore/adPackage.mainStats.bson
Wed Aug 28 23:36:00.164                  40 objects
Wed Aug 28 23:36:00.164         Metadata for reportsCore.adPackage.mainStats to dump/reportsCore/adPackage.mainStats.metadata.json
Wed Aug 28 23:36:00.165         reportsCore.vrp-184.player.mainStats to dump/reportsCore/vrp-184.player.mainStats.bson
Wed Aug 28 23:36:00.170                  3530 objects
Wed Aug 28 23:36:00.170         Metadata for reportsCore.vrp-184.player.mainStats to dump/reportsCore/vrp-184.player.mainStats.metadata.json
Wed Aug 28 23:36:00.170         reportsCore.vrp-184.video.mainStats to dump/reportsCore/vrp-184.video.mainStats.bson
Wed Aug 28 23:36:00.176                  3867 objects
Wed Aug 28 23:36:00.176         Metadata for reportsCore.vrp-184.video.mainStats to dump/reportsCore/vrp-184.video.mainStats.metadata.json
Wed Aug 28 23:36:00.176         reportsCore.macro.mainStats to dump/reportsCore/macro.mainStats.bson
Wed Aug 28 23:36:00.191                  1938 objects
Wed Aug 28 23:36:00.191         Metadata for reportsCore.macro.mainStats to dump/reportsCore/macro.mainStats.metadata.json
Wed Aug 28 23:36:00.191 DATABASE: reportsSites   to     dump/reportsSites
Wed Aug 28 23:36:00.192         reportsSites.system.indexes to dump/reportsSites/system.indexes.bson
Wed Aug 28 23:36:00.192                  4 objects
Wed Aug 28 23:36:00.192         reportsSites.domain.mainStats to dump/reportsSites/domain.mainStats.bson
Wed Aug 28 23:36:00.848                  307600 objects
Wed Aug 28 23:36:00.848         Metadata for reportsSites.domain.mainStats to dump/reportsSites/domain.mainStats.metadata.json
Wed Aug 28 23:36:00.848         reportsSites.site.mainStats to dump/reportsSites/site.mainStats.bson
Wed Aug 28 23:36:01.701                  406203 objects
Wed Aug 28 23:36:01.701         Metadata for reportsSites.site.mainStats to dump/reportsSites/site.mainStats.metadata.json
Wed Aug 28 23:36:01.702 DATABASE: reportsRaw     to     dump/reportsRaw
Wed Aug 28 23:36:01.702         reportsRaw.system.indexes to dump/reportsRaw/system.indexes.bson
Wed Aug 28 23:36:01.702                  4 objects
Wed Aug 28 23:36:01.702         reportsRaw.video.plays to dump/reportsRaw/video.plays.bson
Wed Aug 28 23:56:03.195 Socket recv() errno:104 Connection reset by peer 127.0.0.1:27018
Wed Aug 28 23:56:03.195 SocketException: remote: 127.0.0.1:27018 error: 9001 socket exception [1] server [127.0.0.1:27018]
assertion: 16465 recv failed while exhausting cursor

The reportsRaw.video.plays collection is fairly big (~15 GB), but dumping it shouldn't be a problem, right?

TpParser-use1b-02(mongos-2.4.5)[mongos] reportsRaw> db.video.plays.stats()
{
  "sharded": false,
  "primary": "SH01",
  "ns": "reportsRaw.video.plays",
  "count": 91025282,
  "size": 14393205552,
  "avgObjSize": 158.12316354043264,
  "storageSize": 16531668928,
  "numExtents": 27,
  "nindexes": 1,
  "lastExtentSize": 2146426864,
  "paddingFactor": 1,
  "systemFlags": 1,
  "userFlags": 0,
  "totalIndexSize": 2953277488,
  "indexSizes": {
    "_id_": 2953277488
  },
  "ok": 1
}

Any ideas why the mongodump failed and how to fix this?

Comment by Pavlo Grinchenko [ 16/Aug/13 ]

We did the experiment where we removed all the load - still no successful mongodump. We would appreciate any suggestions on how we can try to resolve this.

Comment by Vladimir Poluyaktov [ 31/Jul/13 ]

The mongodump log and the logs from all servers in our cluster have been attached (see the SERVER-10377.tar.gz file).

Comment by Vladimir Poluyaktov [ 31/Jul/13 ]

Right, we saw this behavior - mongodump just freezes at an arbitrary moment (it can be a different collection and a different percentage each time), nothing happens for a while, and then we get the 'Connection reset by peer' error.

I'll run the backup script manually and attach the logs from all servers to this ticket in a few minutes.

Comment by Daniel Pasette (Inactive) [ 30/Jul/13 ]

The times in the two log snippets are quite far apart (13:00:48 vs 16:53:36).

Also, there is a 20 minute (almost exactly) pause in the first log snippet:

Tue Jul 30 12:40:47.036 Collection File Writing Progress: 3749000/12558979 29% (objects)
Tue Jul 30 13:00:48.073 Socket recv() errno:104 Connection reset by peer 127.0.0.1:27017

There's no information in the mongos log at that time? What about in the shards?

Comment by Vladimir Poluyaktov [ 30/Jul/13 ]

The only error I see in the mongos log is:

Tue Jul 30 16:53:36.641 [Balancer] could not acquire lock 'balancer/ReportingService-use1b-01.vidible.tv:27017:1375149603:1804289383' (another update won)
Tue Jul 30 16:53:36.642 [Balancer] distributed lock 'balancer/ReportingService-use1b-01.vidible.tv:27017:1375149603:1804289383' was not acquired.

No errors in the system log; free memory, disk space, and limits are all OK.
I tried stopping the balancer during the backup window (sh.setBalancerState(false)) - no luck, the same error.

We have another, absolutely identical MongoDB setup; the only difference is the DB version (2.4.3). But we have never had such errors on that database.
