[JAVA-594] 2.7.3 corrupts files saved with 2.6 on read Created: 03/Jul/12 Updated: 19/Oct/16 Resolved: 19/Jul/12 |
|
| Status: | Closed |
| Project: | Java Driver |
| Component/s: | None |
| Affects Version/s: | 2.6, 2.7.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Andreas Janson | Assignee: | Daniel Gottlieb (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | corrupt, driver | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Description |
|
Hi,

after upgrading from 2.6 to 2.7.3 we are getting data corruption when opening PDFs that were saved with 2.6. Somehow the PDF gets altered so that embedded fonts can't be rendered anymore. When saving and reading with the same version (tried with both 2.6 and 2.7.3) everything works fine.

I have attached an intact and a corrupt file. The corrupt file doesn't display properly in Adobe Acrobat (you will also get an error message there) and DiffPdf will find visual differences (text that is using embedded fonts isn't rendered). It will display properly in Chrome though. The intact file displays fine everywhere. The difference is visible on the first page on the line starting with '1.1'.

DiffPdf doesn't find a difference in the files using word-by-word comparison. There are some minor binary differences though, e.g. in the first line there are some additional characters (see attached WinMerge screenshot).

In case this is relevant: during my research I stumbled upon a thread where someone described the same problem with PDFs. As it turned out, he was uploading PDFs via FTP in ASCII mode. Switching to binary uploads solved the problem for him. http://www.macuser.de/forum/f17/eingebettete-schrift-konnte-352152/

Let me know if you need any more info. As for the priority: this isn't a blocker for us because only a handful of files on our production server are affected by this issue.

Cheers,
| Comments |
| Comment by Daniel Gottlieb (Inactive) [ 12/Jul/12 ] |
|
Hand compiled jar from the r2.6 tag.
| Comment by Daniel Gottlieb (Inactive) [ 12/Jul/12 ] |
|
I'm glad the explanation helped, and I hope you now know how to at least work around the problem if something similar comes up in the future. I am still up for tracking down the problem though.

I wish I could definitively say the problem was with the 2.6 driver saving corrupted files. Unfortunately I still can't reproduce any problems with the driver. Given the fact that the actual chunk size is 4 bytes larger than the default size, and having looked extensively at the 2.6 source code, there exist very good arguments both for why the problem can't be in your application code and for why it also cannot be in the 2.6 driver code! I do however have a couple of leads to go on if you're still willing to help figure out what happened.

So if you don't mind, I'm curious if you can look up another value for me. I actually should have asked this with the previous inquiry, as you may have deleted the corrupted files from your database by now. What I'm interested in knowing is whether the `chunkSize` value in the `fs.files` collection for the corrupted file we're discussing is reported as `262,144` or `262,148`.

I'm also curious whether, now that you have more context about how GridFS works, you can still reproduce the problem. This is the code I've used to try to corrupt a file while saving it to GridFS, but I have still failed so far (using the Mongo Java Driver v2.6 and Apache Commons IOUtils v2.0.1):
To also rule out the possibility that you might actually be using a different version of the Java driver than v2.6, I'll ask if you can also download the jar I'm using to try to reproduce the issue (it should be listed up top, just under the ticket description). If you can still reproduce the problem with my 2.6 jar, can you modify my function to better match how you're saving the PDF file into GridFS, so that I can try using that to reproduce the problem on my side? Alternatively, if you can only reproduce the problem with your jar, could you attach it so I can examine it and try to work out what version it really is?

Thanks for helping me narrow this down, and I'm sorry for any inconvenience this weird bug has caused.
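The workaround is not spelled out in this ticket. One way it might look, assuming each affected chunk's `data` starts with its own length as a 4-byte little-endian integer (as observed in the corrupt PDF), is sketched below. `ChunkPrefixRepair` and `stripLengthPrefix` are hypothetical names for illustration and are not part of the Mongo Java Driver.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical helper, not driver code: removes a leading 4-byte
// little-endian length prefix from a chunk's raw data, if one is present.
final class ChunkPrefixRepair {

    static byte[] stripLengthPrefix(byte[] chunk) {
        if (chunk.length < 4) {
            return chunk; // too short to carry a prefix
        }
        int declared = ByteBuffer.wrap(chunk, 0, 4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getInt();
        // Only strip when the first four bytes encode exactly the number
        // of bytes that follow them. Assumption: an intact chunk does not
        // start with such a value by coincidence.
        if (declared == chunk.length - 4) {
            return Arrays.copyOfRange(chunk, 4, chunk.length);
        }
        return chunk;
    }

    public static void main(String[] args) {
        byte[] prefixed = {3, 0, 0, 0, 'a', 'b', 'c'};
        byte[] repaired = stripLengthPrefix(prefixed);
        System.out.println(new String(repaired, java.nio.charset.StandardCharsets.US_ASCII));
    }
}
```

The check is deliberately conservative: data that does not match the prefix pattern is returned unchanged, so running it over already intact chunks should be harmless under the stated assumption.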
| Comment by Andreas Janson [ 11/Jul/12 ] |
|
Thanks for your detailed instructions and explanations, they sure helped a lot. I checked the file's chunk lengths (saved with 2.6) and they are `262148, 262148, 132298` == incorrect. So the problem is that 2.6 corrupted the files when saving? I also did the same test with the intact file uploaded with 2.7.3. As expected, all the chunks are 4 bytes smaller, so we can be pretty sure that this is no issue for us anymore. We'd be glad to help track down the problem anyway. Thanks & cheers,
| Comment by Daniel Gottlieb (Inactive) [ 10/Jul/12 ] |
|
I'm still unable to reproduce this. I'm going to ask you for more information about the PDF as it is saved in the database. I would like to know if the data chunks themselves contain these integers that represent the length. In case you need a little guide, I'll explain how you can look at this via the mongo shell. GridFS records the files it splits up in two different collections inside the database that the java GridFS object was instantiated with. The collections are `fs.files` and `fs.chunks`. The `fs.files` collection contains meta information about the file saved while `fs.chunks` contains the actual data. First we search the `fs.files` collection by filename:
Then we take the `_id` field returned in that query `ObjectId("4ffb29bce4b0b48ac03d349e")` and search for chunks with a matching `files_id`:
However, because the data field is quite large, I've omitted it from that query. What I want instead is the length of the data field, so for each chunk I'll do:
So if you can, give that a shot on your GridFS database (with the appropriately matching `files_id`). The lengths, however, should exactly match the returned `262144, 262144, 132294`. If the lengths are instead `262148, 262148, 132298`, the data portion of each chunk is incorrectly prefixed with its length. Incorrect lengths would at least allow us to conclude that the 2.7.3 driver is operating correctly, though it wouldn't explain why the 2.6 driver removes those lengths when reading the data.
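For reference, the arithmetic behind those two sets of numbers can be sketched in Java. This is a hypothetical helper, not driver code: it assumes a file is split into full chunks of `chunkSize` bytes followed by one shorter final chunk, and it uses a file length of 656,582 bytes, which is simply the sum of the three correct lengths quoted above.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; names are not from the driver.
final class ChunkLengthCheck {

    // Lengths of the chunks a file of fileLength bytes splits into.
    static int[] expectedChunkLengths(long fileLength, int chunkSize) {
        if (fileLength < 0 || chunkSize <= 0) {
            throw new IllegalArgumentException("invalid length or chunk size");
        }
        int count = (int) ((fileLength + chunkSize - 1) / chunkSize);
        int[] lengths = new int[count];
        long remaining = fileLength;
        for (int i = 0; i < count; i++) {
            lengths[i] = (int) Math.min(chunkSize, remaining);
            remaining -= lengths[i];
        }
        return lengths;
    }

    // True when every observed chunk is exactly 4 bytes longer than
    // expected, the pattern a 4-byte length prefix would produce.
    static boolean looksLengthPrefixed(int[] observed, int[] expected) {
        if (observed.length == 0 || observed.length != expected.length) {
            return false;
        }
        for (int i = 0; i < observed.length; i++) {
            if (observed[i] != expected[i] + 4) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] expected = expectedChunkLengths(656582L, 262144);
        int[] observed = {262148, 262148, 132298};
        System.out.println(Arrays.toString(expected));
        System.out.println(looksLengthPrefixed(observed, expected));
    }
}
```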
| Comment by Daniel Gottlieb (Inactive) [ 10/Jul/12 ] |
|
Unfortunately I do not speak German.

So I found the pattern of what's going wrong, though I'm still going through the code to figure out why. Some context about GridFS for the explanation: GridFS is just a driver feature that breaks big files down into chunks (default size: 256K) and saves each one as a separate MongoDB document.

What I found in the corrupt PDF file you sent is that when the bytes got stitched together, each chunk was preceded by the number of bytes in that chunk. The `00 00 04 00` bytes turn into 256K when the order is reversed to `00 04 00 00`, i.e. when they are read as a little-endian integer. Those bytes are repeated right at the 256K (+ 4 bytes) offset into the PDF file (the beginning of the second chunk). At the 512K (+ 8 bytes) offset we have the bytes `c6 04 02 00`, which translates to 132,294. Summed up, the three lengths exactly equal the file size of the intact PDF. This means that the length of each chunk in bytes is preceding the actual data when the PDF is stitched back together. I will let you know when I know more.
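The byte arithmetic above can be checked with a few lines of Java. This is only an illustrative sketch using `java.nio.ByteBuffer`; the class and method names are made up for the example and the byte values are the ones quoted in this comment.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative only: decodes the 4-byte prefixes quoted in the comment.
final class LengthPrefixDemo {

    // Reads 4 bytes starting at offset as a little-endian 32-bit integer.
    static int readLittleEndianInt(byte[] data, int offset) {
        return ByteBuffer.wrap(data, offset, 4)
                .order(ByteOrder.LITTLE_ENDIAN)
                .getInt();
    }

    public static void main(String[] args) {
        byte[] fullChunkPrefix = {0x00, 0x00, 0x04, 0x00};
        byte[] lastChunkPrefix = {(byte) 0xc6, 0x04, 0x02, 0x00};

        int full = readLittleEndianInt(fullChunkPrefix, 0); // 0x00040000
        int last = readLittleEndianInt(lastChunkPrefix, 0); // 0x000204c6

        System.out.println(full);             // 262144
        System.out.println(last);             // 132294
        System.out.println(2L * full + last); // 656582, two full chunks plus the last one
    }
}
```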
| Comment by Andreas Janson [ 10/Jul/12 ] |
|
3. Yes, there's exactly one document matching the findOne() query.
| Comment by Andreas Janson [ 10/Jul/12 ] |
|
Thanks for looking into this! Since your name sounds rather German, I'll post untranslated code snippets. If I'm wrong or a translation would be helpful anyway, let me know.
1. We are using the copy() method from commons-io-2.0.1.
2. setMetaData adjusts the inputDokument before saving as follows:
"Bezeichner" is the document's name that the end user sees.
3. I'll check it out later.
4. GridFsOutputDokument is just a small wrapper. GetInputStream() is used in DownloadServlet.
5. We aren't directly subclassing any of the com.mongodb.gridfs classes. We wrap GridFSFile to enrich the document with metadata.
| Comment by Daniel Gottlieb (Inactive) [ 09/Jul/12 ] |
|
I haven't been able to reproduce this with the naive attempt of merely saving the non-corrupt PDF to GridFS with the 2.6 driver and reading that with the 2.7.3 driver. I have a few questions, some of which probably won't have anything to do with the problem. I'm just trying to be thorough.
| Comment by Andreas Janson [ 04/Jul/12 ] |
|
Ok, I hope I got all relevant code snippets. I also translated some of the names into English (or something closely related).
| Comment by Scott Hernandez (Inactive) [ 03/Jul/12 ] |
|
Can you attach the code you are using to save and retrieve the file from mongo? |