[SERVER-7077] Failure to journal list changes in NamespaceDetails::__stdAlloc Created: 19/Sep/12  Updated: 11/Jul/16  Resolved: 09/Nov/12

Status: Closed
Project: Core Server
Component/s: Internal Code
Affects Version/s: 2.0.7, 2.2.0
Fix Version/s: 2.3.1

Type: Bug Priority: Major - P3
Reporter: Tad Marshall Assignee: Tad Marshall
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

The effects differ between Linux and Windows, but the issue affects both


Issue Links:
Related
is related to SERVER-7068 Windows access violation in Namespace... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

The NamespaceDetails::__stdAlloc() routine, which starts at line 321 of src/mongo/db/namespace_details.cpp in today's master branch (line 308 of db/namespace.cpp in today's 2.0 branch), allocates space for documents in non-capped collections. It checks for a bad link either in NamespaceDetails::deletedList[] or in a deleted record pointed to by that list. On finding a bad pointer, it logs a warning and prints a stack trace. It then incorrectly attempts to change both the bad pointer and the pointer that pointed to it.

There are two parts to "incorrectly":
1) These pointers live in memory-mapped files, and the code that changes them (lines 339 and 340 of src/mongo/db/namespace_details.cpp in today's master branch) does not use the journaling mechanism (e.g. getDur().writingDiskLoc()) to change these values. On Linux or Windows, this means that inconsistent values may be written to disk, and replaying the journal will not correct them because the journal never recorded the change. On Windows there is the additional risk of an access violation, because the private view is mapped read-only unless the page has been made PAGE_WRITECOPY (copy-on-write) by another (correct) journaled write to the page within the last 100 milliseconds. This is the access violation seen in SERVER-7068.
2) This code should not attempt to patch a broken chain at all, because the breakage could consist of valid data that is incorrectly pointed to; it should fassert and stop mongod before any damage is done.
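
To illustrate point 1, here is a minimal toy model in C++ of why a change that bypasses the journal cannot be restored by replay. It is not MongoDB's durability code: ToyStore, journaledWrite, rawWrite, crashAndRecover, and bothUpdatesSurvive are all hypothetical names invented for this sketch, and the model assumes the simplest failure, where an unjournaled change never reaches the data file at all. In the real server the unjournaled change might also reach disk partially, which is the inconsistency described above.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy model only: none of these types are MongoDB's. "disk" stands in for the
// data file, "view" for the private memory-mapped view, and "journal" for the
// list of (offset, value) changes that recovery is able to replay.
struct ToyStore {
    std::vector<int> disk;
    std::vector<int> view;
    std::vector<std::pair<std::size_t, int>> journal;

    explicit ToyStore(std::size_t n) : disk(n, 0), view(n, 0) {}

    // Journaled write: the change is recorded before the view is modified.
    void journaledWrite(std::size_t off, int value) {
        journal.emplace_back(off, value);
        view[off] = value;
    }

    // Unjournaled write: the view changes, but the journal has no record of it.
    void rawWrite(std::size_t off, int value) { view[off] = value; }

    // Simulated crash followed by recovery: the old view is discarded and only
    // the journaled changes are replayed onto the data file.
    void crashAndRecover() {
        for (const auto& entry : journal) disk[entry.first] = entry.second;
        journal.clear();
        view = disk;
    }
};

// Updates two "pointers" and reports whether both updates survive recovery.
// The first update is always journaled; the second one only on request.
bool bothUpdatesSurvive(bool journalSecondWrite) {
    ToyStore store(2);
    store.journaledWrite(0, 7);
    if (journalSecondWrite) {
        store.journaledWrite(1, 9);
    } else {
        store.rawWrite(1, 9);
    }
    store.crashAndRecover();
    return store.disk[0] == 7 && store.disk[1] == 9;
}
```

In this model, bothUpdatesSurvive(true) holds, while bothUpdatesSurvive(false) leaves the two values out of step after recovery: one update is restored and the other is lost.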



 Comments   
Comment by auto [ 09/Nov/12 ]

Author: Tad Marshall <tad@10gen.com> (2012-11-09T13:35:31Z)

Message: SERVER-7077 fassert after logging message on corrupted deletedList

When finding an invalid DiskLoc pointer in the chain pointed to by
a deletedList bucket in __stdAlloc(), log a message describing the
corruption and then call fassertFailed() to kill the server. Test
the file offset as well as the file number in checking for invalid
DiskLoc pointers.
Branch: master
https://github.com/mongodb/mongo/commit/857699a1870d45d8b89534a934612c04df4701f9
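
The approach described in the commit message (validate both parts of the pointer, then stop instead of patching) can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the commit: ToyLoc, ToyLimits, kNull, isNull, isPlausible, and countChain are hypothetical names, the bounds used for validation are assumed, and throwing std::runtime_error stands in for logging the corruption and calling fassertFailed().

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a DiskLoc: a file number plus a byte offset.
struct ToyLoc {
    int fileNo;
    int offset;
};

// Assumed limits for the illustration; real limits depend on the data files.
struct ToyLimits {
    int numFiles;
    int fileLength;
};

const ToyLoc kNull = {-1, 0};  // hypothetical end-of-chain marker

bool isNull(const ToyLoc& loc) { return loc.fileNo == -1; }

// Checks the file number AND the offset. Checking only the file number would
// accept a pointer into a valid file at an impossible position.
bool isPlausible(const ToyLoc& loc, const ToyLimits& lim) {
    return loc.fileNo >= 0 && loc.fileNo < lim.numFiles &&
           loc.offset >= 0 && loc.offset < lim.fileLength;
}

// Counts the entries in a chain stored in a single toy file, where
// nextByOffset[i] is the "next" pointer of the record at offset i.
// On a bad pointer it refuses to continue and never tries to repair the
// chain; the throw stands in for logging and then killing the server.
// Cycle detection is omitted to keep the sketch short.
int countChain(ToyLoc head, const std::vector<ToyLoc>& nextByOffset) {
    const ToyLimits lim = {1, static_cast<int>(nextByOffset.size())};
    int count = 0;
    for (ToyLoc cur = head; !isNull(cur); cur = nextByOffset[cur.offset]) {
        if (!isPlausible(cur, lim)) {
            throw std::runtime_error("corrupted deleted-record chain");
        }
        ++count;
    }
    return count;
}
```

A pointer such as ToyLoc{0, 99} into a three-record file has a valid file number but an invalid offset, so only the combined check rejects it.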
