[SERVER-7286] Can I access data that seems to be stored in my local.* files Created: 06/Oct/12  Updated: 08/Mar/13  Resolved: 12/Nov/12

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.0.4
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Robert Poskevich Assignee: Gregor Macadam
Resolution: Incomplete Votes: 0
Labels: crash, oplog, replication
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux Debian Squeeze


Attachments: Text File crash_log.txt    
Participants:

 Description   

I urgently need to access the replication oplog after serious data corruption that occurred while trying to copy databases on the same server.

I have the following 5 local files in the current db path.

64M local.0
128M local.1
2.0G local.2
2.0G local.3
2.0G local.4

When I access the "local" db via the mongo shell and query its "oplog.rs" collection, I only see 3 entries. Prior to the crash, I had upwards of 600,000 entries, so I am hoping they have been orphaned in the local.1 - local.4 files.

Is there any way that I can verify whether these entries still exist and, if so, how might I access them?
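
For reference, this is roughly the kind of shell session I am using to check the collection - a sketch, since I didn't save the exact commands:

use local
db.oplog.rs.count()                              // currently returns only 3
db.oplog.rs.find().sort({$natural: 1}).limit(1)  // oldest entry still visible
db.oplog.rs.find().sort({$natural: -1}).limit(1) // newest entry still visible
db.oplog.rs.stats()                              // capped collection stats vs. the 600,000+ entries I expect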

Thanks for any help you can provide.

-Robert Poskevich



 Comments   
Comment by Gregor Macadam [ 12/Nov/12 ]

Closing this issue out. Please feel free to reopen if you have more questions or when you have time to upload the logs.
Thanks
Gregor

Comment by Gregor Macadam [ 31/Oct/12 ]

Apologies for the delay in coming back to you - in the log we can see that mongos1r1.qmobix.net:27018 becomes PRIMARY - is this the SECONDARY member that you are referring to? I think the reason it has been elected PRIMARY is that the host domU-12-31-39-07-B6-7C.compute-1.internal has an empty oplog. I haven't been able to repro this, unfortunately - could you attach more of the log from prior to the initial sync? Perhaps it shows what happened to the oplog.
Thanks
Gregor

Comment by Robert Poskevich [ 23/Oct/12 ]

Gregor,

In performing the copy, Y was on the same host.

Do you have any idea what could be the cause of my last stated issue? "The SECONDARY member which became the MASTER in this scenario was 3 months behind."

Comment by Gregor Macadam [ 18/Oct/12 ]

A resync takes place if the local oplog is empty - the code that does this is here.
That doesn't explain how the oplog came to be empty. I want to try to repro this here - when you did the copy database, was Y on the same host or on a different host?

Comment by Robert Poskevich [ 11/Oct/12 ]

Gregor,

The above portion of the log occurred after the copy command. The copy command was not recorded in the log, but you can see, in the very first line of the attached log, when I attempted to kill the mongod process.

What are the scenarios in which a replica set member will undergo an initial sync? Can this only occur if the dbpath directory is empty? Are there other scenarios in which this might occur?

Also, at the time, the SECONDARY member which became the MASTER in this scenario was 3 months behind. Can you think of why this might have been? That's the problem I really need to solve, so that I can avoid it happening and losing my data again.

Thanks,
Robert P.

Comment by Gregor Macadam [ 11/Oct/12 ]

Hi - the log shows that an initial sync is taking place for this node

Fri Oct  5 23:23:15 [rsSync] replSet initial sync pending
Fri Oct  5 23:23:15 [rsSync] replSet syncing to: mongos1r1.qmobix.net:27018
Fri Oct  5 23:23:15 [rsSync] build index local.me { _id: 1 }
Fri Oct  5 23:23:15 [rsSync] build index done 0 records 0 secs
Fri Oct  5 23:23:15 [rsSync] replSet initial sync drop all databases
Fri Oct  5 23:23:15 [rsSync] dropAllDatabasesExceptLocal 1
Fri Oct  5 23:23:15 [rsSync] replSet initial sync clone all databases

How is this related to the problems you had with copyDatabase? I'm not clear what you are doing.

Comment by Robert Poskevich [ 10/Oct/12 ]

In examining the log, it looks as if all databases were dropped in the syncing process. However, when performing the mongodump with journaling enabled (as journal files are present), I only get 6 entries, all dated after the incident.

Comment by Robert Poskevich [ 10/Oct/12 ]

MongoDB Log - Log activity prior to the first line pasted seems normal and consists of authentication checks. Log activity after the last line pasted is a sync of all dbs with the MASTER node, which looks to be successful. After the sync, the node assumes SECONDARY status and proceeds to make authentication requests. Unfortunately, at the time, the MASTER was 3 months out of date. How can this be?
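
For what it's worth, I clearly should have been watching the lag; my understanding is that these shell helpers are the usual way to check it (a sketch, not output from the incident):

db.printReplicationInfo()       // on the PRIMARY: oplog size and the time window it covers
db.printSlaveReplicationInfo()  // on the PRIMARY: how far behind each SECONDARY is
rs.status()                     // per-member state and optimeDate across the set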

Comment by Gregor Macadam [ 10/Oct/12 ]

Do you have the mongod log from where you ran the copyDatabase?
As for recovering the oplog entries - can you try to start a standalone mongod instance and point it at the data files?

./mongod --dbpath <path_to_datafiles>

and then run a

mongodump --db local --repair

to dump out the local database and paste the output please.
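
Something along these lines - the output directory is only an example, and the bsondump step is optional, just for a quick look at what gets recovered:

./mongod --dbpath <path_to_datafiles>
mongodump --db local --repair --out /tmp/local-dump
bsondump /tmp/local-dump/local/oplog.rs.bson | head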

Comment by Robert Poskevich [ 06/Oct/12 ]

Eliot,
Thank you; I have copied the files. The data corruption occurred when attempting to copy a database - let's say "X" is the original and "Y" is the clone.

db.fsyncLock()
db.copyDatabase(X, Y)

I let the operation run for ~20 minutes (much longer than other copy operations on the same dataset) and, at that point, closed my SSH connection. Upon reconnecting and accessing the shell, I found that both X and Y were empty, save for the "system.indexes" and "system.users" collections. I then attempted to shut down the server and launch from a backup (I am running on Amazon EC2), only to find that the data in my backup was incomplete and was missing ~3 months of activity. Storage is not the issue here, as I've only consumed some 14% of the available space.

I am running a single-shard system, with 2 replica nodes and an arbiter, as well as 3 config server instances. The node that crashed was the current PRIMARY, while the snapshots were taken off of the SECONDARY. The SECONDARY node, at this time and throughout the snapshot history, is missing X entirely, as well as any oplog history within the last 3 months. I hadn't checked the replication status in some time, though the last time I did, all three nodes were healthy.

My intention in using the fsync lock was to disallow any writes to X while I was copying to Y, but I think this could have been the cause of the initial process hang. I have disconnected the server in question from the replica set (the one with the listed local.* files) and have launched it as a standalone instance. Perhaps that is the reason I have two local.* files smaller than 2 GB.
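
My understanding now is that the lock applies to the whole mongod rather than just X, so the writes that build Y would also have been blocked until an explicit unlock - roughly this pattern:

db.fsyncLock()    // flushes to disk and blocks ALL writes on this mongod, including the writes that create Y
// ... copy or snapshot the data files externally while writes are blocked ...
db.fsyncUnlock()  // releases the lock and allows writes again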

I hope that description can help. Thanks for your attention.

Comment by Eliot Horowitz (Inactive) [ 06/Oct/12 ]

Can you describe what happened to cause the issue and what steps you took?
I would also recommend making a copy of the files immediately.
