[SERVER-2681] replica member never syncs up and starts from scratch multiple times. Created: 04/Mar/11 Updated: 12/Jul/16 Resolved: 14/Mar/11 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 1.7.5 |
| Fix Version/s: | 1.9.0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Reinaldo Giudici | Assignee: | Kristina Chodorow (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
ubuntu, |
||
| Attachments: |
|
| Operating System: | ALL |
| Participants: |
| Description |
|
It seems the member is not able to initialize fully and starts over from scratch during the initial sync process. log0x:PRIMARY> db.printReplicationInfo() Attached is the log from log12, the member trying to sync up. |
| Comments |
| Comment by Kristina Chodorow (Inactive) [ 14/Mar/11 ] |
|
Yay! Looks like it finally made it. The fix I made should be able to make it a bit more robust (so it doesn't have to keep starting over) in the future. Glad it finally worked out. |
| Comment by Reinaldo Giudici [ 14/Mar/11 ] |
|
I cleaned up the local database: shut down mongod on log11 and ran rm local.*
cat /var/log/mongodb/mongod.node1.log | grep 13127 | wc -l
Attached is the log. In all previous attempts, initialSyncOplogApplication never appeared in the log at all.
Mon Mar 14 04:57:46 [replica set sync] replSet initial sync initial oplog application |
| Comment by auto [ 11/Mar/11 ] |
|
Author: Kristina (kchodorow) <kristina@10gen.com> Message: handle cursor timeouts during initial sync |
| Comment by Kristina Chodorow (Inactive) [ 10/Mar/11 ] |
|
That's an error code. Initial sync should handle that; I'll work on fixing it so that it doesn't get stopped by that error. |
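To illustrate the kind of handling described in this comment, here is a minimal Python sketch of resuming after a cursor timeout instead of restarting the whole sync. It is not the server fix itself: `CursorTimeout`, `fetch_batch`, `apply_op`, and `apply_with_resume` are hypothetical names, and the real initial sync logic lives inside mongod.

```python
class CursorTimeout(Exception):
    """Hypothetical stand-in for the cursor-timeout error discussed in this ticket."""


def apply_with_resume(fetch_batch, apply_op, start, max_retries=5):
    """Apply operations in order, reopening the cursor on timeout.

    fetch_batch(position) is assumed to yield (position, op) pairs for every
    operation after `position`, and may raise CursorTimeout at any point.
    Returns the position of the last operation applied.
    """
    position = start
    retries = 0
    while True:
        try:
            for pos, op in fetch_batch(position):
                apply_op(op)
                position = pos   # remember how far we got
                retries = 0      # progress was made, so reset the retry budget
            return position
        except CursorTimeout:
            retries += 1
            if retries > max_retries:
                raise            # give up only after repeated failures without progress
            # otherwise loop and reopen the cursor at the last applied position
```

The point of the sketch is that a timeout costs one reopened cursor, not the work already applied.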
| Comment by Reinaldo Giudici [ 10/Mar/11 ] |
|
No killing of connections that I know of...
Wed Mar 9 19:45:37 [replica set sync] replSet initial sync initial oplog application
And the timeout call seems to show the same error. Is that an oplog position or an error code?
[13:49:16 rgiudici@cloud-log12:~]$ cat mongod.log | grep oplog | grep failing |
| Comment by Kristina Chodorow (Inactive) [ 10/Mar/11 ] |
|
Could you have some sort of router or switch that is killing connections after some length of time? |
| Comment by Kristina Chodorow (Inactive) [ 10/Mar/11 ] |
|
The "no user in local.system.users to use for authentication" is harmless. |
| Comment by Kristina Chodorow (Inactive) [ 10/Mar/11 ] |
|
The problem is that you keep having network blips: Sun Mar 6 05:23:04 [replica set sync] MessagingPort recv() errno:104 Connection reset by peer 10.17.100.71:27017 } It can't reach the server it's trying to sync to. As I said above, you could try doing a fastsync. You could also see if you could do anything to fix your network or sync to a "closer" machine, if you're syncing over a WAN at the moment. Initial sync will be getting more tolerant of network failures, but not until 1.8.1 (at least). |
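To get a feel for how often such blips occur, the connection errors in a mongod log can be tallied. The following Python sketch assumes the line shape seen in this ticket (`errno:<n> <message> <ip>:<port>`); the regular expression and the function name `tally_network_errors` are illustrative, not part of any MongoDB tool.

```python
import re
from collections import Counter

# Matches fragments like "errno:104 Connection reset by peer 10.17.100.71:27017".
ERRNO_RE = re.compile(r"errno:(\d+) (.+?) (\d+\.\d+\.\d+\.\d+:\d+)")


def tally_network_errors(lines):
    """Count (errno, message, peer) triples found in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = ERRNO_RE.search(line)
        if match:
            key = (int(match.group(1)), match.group(2), match.group(3))
            counts[key] += 1
    return counts
```

It can be fed an open log file directly, for example `tally_network_errors(open("mongod.log"))`, and the most frequent entries show which peer and which error dominate.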
| Comment by Reinaldo Giudici [ 10/Mar/11 ] |
|
Attached is a new log. This one is from after repairing all databases on the primary (except local and admin) and after dropping some data from the primary. Thu Mar 10 11:42:37 [replica set sync] replSet initial sync initial oplog application I see a few of these; we do not have authentication enabled: The REST interface shows that it is never able to catch up on replication. The optime on the secondary moves really slowly: sometimes it does not change for 20-30 seconds, and when it does change, it changes by 1. For example, now (1 minute later) it is at 4d790a52:37e. Member id Up cctime Last heartbeat Votes Priority State Messages optime skew |
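An optime such as 4d790a52:37e can be turned into something easier to compare. The Python sketch below assumes the value is formatted as hexadecimal seconds since the Unix epoch, a colon, and a hexadecimal counter within that second; that reading is an assumption about the status page format, and `decode_optime` and `optime_lag_seconds` are hypothetical helper names.

```python
from datetime import datetime, timezone


def decode_optime(optime):
    """Split an optime like '4d790a52:37e' into (epoch_seconds, increment)."""
    secs_hex, inc_hex = optime.strip().split(":")
    return int(secs_hex, 16), int(inc_hex, 16)


def optime_lag_seconds(primary_optime, secondary_optime):
    """Whole seconds by which the secondary's optime trails the primary's."""
    return decode_optime(primary_optime)[0] - decode_optime(secondary_optime)[0]


secs, inc = decode_optime("4d790a52:37e")
when = datetime.fromtimestamp(secs, tz=timezone.utc)
print(when.isoformat(), inc)
```

Comparing the decoded seconds for the primary and the secondary a minute apart shows whether the gap is shrinking or growing.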
| Comment by Kristina Chodorow (Inactive) [ 09/Mar/11 ] |
|
Apparently it's cosmetic: https://jira.mongodb.org/browse/SERVER-2669. From the logs, it looks like MongoDB lost the connection to the server it was syncing from... it isn't very tolerant of blippy networks at the moment. Thu Mar 3 18:12:36 [replica set sync] MessagingPort say send() errno:32 Broken pipe 10.17.100.71:27017 If you have a backup of log11, you could try starting up using --fastsync with it, which should make it pull less over the network. |
| Comment by Reinaldo Giudici [ 09/Mar/11 ] |
|
Another interesting find, maybe: it seems the size of the oplog is negative. Or is that just an overflow? log0x:PRIMARY> > db.oplog.rs.stats() }, |
| Comment by Reinaldo Giudici [ 07/Mar/11 ] |
|
I have not run repairDatabase on all of them, only on a few. This is a "log server": we had set up a 50GB capped collection per log collection, and then realized that only production logs needed that much. Will repairDatabase() try to reindex the collections? If so, that will bring down the server, as indexing on the secondaries seems to take multiple hours, which will be a problem in our setup. |
| Comment by Kristina Chodorow (Inactive) [ 07/Mar/11 ] |
|
Running out of disk space on the master could definitely cause this on the slave. Have you run repair on every database that was handling writes when you ran out of disk space? |
| Comment by Reinaldo Giudici [ 04/Mar/11 ] |
|
The master, log11, was having issues with disk space until yesterday, when we changed the size of some capped collections and ran repairDatabase on a few databases. |
| Comment by Kristina Chodorow (Inactive) [ 04/Mar/11 ] |
|
Is it trying to sync from the master that you're having corruption issues with? |
| Comment by Reinaldo Giudici [ 04/Mar/11 ] |
|
This log is actually the result of that. We did try multiple things, but yesterday we removed everything and started to sync from scratch. I think this is the last restart info. We did remove the data and start from scratch (no data in the dir). We also upped the memory to 96GB and still no joy: none of the attempts has been able to finish. Each test takes more than 12h.
Thu Mar 3 05:08:59 [initandlisten] MongoDB starting : pid=5555 port=27017 dbpath=/var/lib/mongodb/node1 64-bit
Thu Mar 3 05:08:59 [initandlisten] db version v1.7.5, pdfile version 4.5 |
| Comment by Kristina Chodorow (Inactive) [ 04/Mar/11 ] |
|
It looks like you guys killed mongod multiple times, sometimes in the middle of the initial sync. I think this may have confused it. Can you stop mongod, remove everything from the data directory, and start the initial sync over again? |