[JAVA-594] 2.7.3 corrupts files saved with 2.6 on read Created: 03/Jul/12  Updated: 19/Oct/16  Resolved: 19/Jul/12

Status: Closed
Project: Java Driver
Component/s: None
Affects Version/s: 2.6, 2.7.3
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Andreas Janson Assignee: Daniel Gottlieb (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: corrupt, driver
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PDF File corrupt --- saved with 2.6 - opened with 2.7.3.pdf     PDF File intact ---- saved with 2.6 - opened with 2.6.pdf     PNG File merge result.png     File mongo-java-driver-2.6.tar.gz    

 Description   

Hi,

After upgrading from 2.6 to 2.7.3 we are getting data corruption when opening PDFs that were saved with 2.6. Somehow the PDF gets altered so that embedded fonts can't be rendered anymore. When saving and reading with the same version (tried with both 2.6 and 2.7.3), everything works fine.

I have attached an intact and a corrupt file. The corrupt file doesn't display properly in Adobe Acrobat (you will also get an error message there), and DiffPdf finds visual differences (text that uses embedded fonts isn't rendered). It does display properly in Chrome, though. The intact file displays fine everywhere. The difference is visible on the first page, on the line starting with '1.1'.

DiffPdf doesn't find any difference between the files using word-by-word comparison. There are some minor binary differences, though: e.g. in the first line there are some additional characters (see the attached WinMerge screenshot).

In case this is relevant: during my research I stumbled upon a thread where someone described the same problem with PDFs. As it turned out, he was uploading PDFs via FTP in ASCII mode; switching to binary uploads solved the problem for him. http://www.macuser.de/forum/f17/eingebettete-schrift-konnte-352152/

Let me know if you need any more info. As for the priority: this isn't a blocker for us, because only a handful of files on our production server are affected by this issue.

Cheers,
Andi



 Comments   
Comment by Daniel Gottlieb (Inactive) [ 12/Jul/12 ]

Hand-compiled jar from the r2.6 tag.

Comment by Daniel Gottlieb (Inactive) [ 12/Jul/12 ]

I'm glad the explanation helped, and I hope you now know how to at least work around the problem if similar issues come up in the future. I am still up for tracking down the problem, though.

I wish I could definitively say the problem was with the 2.6 driver saving corrupted files. Unfortunately, I still can't reproduce any problems with the driver. Given that the actual chunk size is 4 bytes larger than the default size, and having looked extensively at the 2.6 source code, there are very good arguments both for why the problem can't be in your application code and for why it also can't be in the 2.6 driver code! I do, however, have a couple of leads to go on if you're still willing to help figure out what happened.

So if you don't mind, I'm curious whether you can look up another value for me. I actually should have asked this with the previous inquiry, as you may have deleted the corrupted files from your database. What I'm interested in knowing is whether the `chunkSize` value in the `fs.files` document for the corrupted file we're discussing is reported as `262144` or `262148`.
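
If it's easier to check from Java than from the mongo shell, here is a minimal sketch of the same lookup (the host, the database name `mydb`, and the filename are placeholders; this also assumes the default `fs` bucket):

    // imports: com.mongodb.Mongo, com.mongodb.DB, com.mongodb.DBObject, com.mongodb.BasicDBObject
    DB db = new Mongo("localhost").getDB("mydb");
    // look up the file's metadata document by filename
    DBObject fileDoc = db.getCollection("fs.files")
            .findOne(new BasicDBObject("filename", "corrupt.pdf"));
    // 262144 is the default chunk size; 262148 would mean the 4-byte prefix got counted
    System.out.println(fileDoc.get("chunkSize"));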

I'm also curious, now that you have more context about how GridFS works, whether you can still reproduce the problem. This is the code I've used to try to corrupt a file while saving it to GridFS, so far without success (using the Mongo Java Driver v2.6 and Apache Commons IOUtils v2.0.1):

    static void insertPdf(GridFS gridFS) throws Exception {
        File goodPdf = new File("./good.pdf");
        FileInputStream input = new FileInputStream(goodPdf);

        // read the entire file into memory, one byte at a time
        byte[] pdfBytes = new byte[(int) goodPdf.length()];
        for (int idx = 0; idx < pdfBytes.length; ++idx)
            pdfBytes[idx] = (byte) input.read();
        input.close();

        // stream the bytes into GridFS; closing the output stream
        // flushes the final chunk and saves the file document
        GridFSInputFile gridFile = gridFS.createFile("good.pdf");
        OutputStream out = gridFile.getOutputStream();
        IOUtils.copy(new ByteArrayInputStream(pdfBytes), out);
        out.close();
    }

To also rule out the possibility that you might actually be using a different version of the Java driver than v2.6, can you download the jar I'm using to try to reproduce the issue (it should be listed up top, just under the ticket description)? If you can still reproduce the problem with my 2.6 jar, can you modify my function to better match how you're saving the PDF file into GridFS, so that I can use it to try to reproduce the problem on my side? Alternatively, if you can only reproduce the problem with your own jar, please attach it and I'll examine it to figure out which version it really is.

Thanks for helping me narrow this down and I'm sorry for any inconvenience this weird bug has caused.
Dan

Comment by Andreas Janson [ 11/Jul/12 ]

Thanks for your detailed instructions and explanations, they sure helped a lot. I checked the file's chunk lengths (saved with 2.6) and they are `262148, 262148, 132298` == incorrect.

So the problem is that 2.6 corrupted the files when saving?

I also did the same test with the intact file uploaded with 2.7.3. As expected, all the chunks are 4 bytes smaller, so we can be pretty sure this is no longer an issue for us. We'd be glad to help track down the problem anyway.

Thanks & cheers,
Andi

Comment by Daniel Gottlieb (Inactive) [ 10/Jul/12 ]

I'm still unable to reproduce this, so I'm going to ask you for more information about the PDF as it is saved in the database. I would like to know whether the data chunks themselves contain these integers that represent the length. In case you need a little guidance, I'll explain how you can look at this via the mongo shell.

GridFS records the files it splits up in two collections inside the database that the Java GridFS object was instantiated with: `fs.files` and `fs.chunks`. The `fs.files` collection contains meta information about the saved file, while `fs.chunks` contains the actual data. First we search the `fs.files` collection by filename:

> db.fs.files.find({filename: "good.pdf"}).pretty()
{
	"_id" : ObjectId("4ffb29bce4b0b48ac03d349e"),
	"chunkSize" : NumberLong(262144),
	"length" : NumberLong(656582),
	"md5" : "0a69c449b0e2128c1f517b29ac51ab2e",
	"filename" : "good.pdf",
	"contentType" : null,
	"uploadDate" : ISODate("2012-07-09T18:58:04.690Z"),
	"aliases" : null
}

Then we take the `_id` field returned in that query, `ObjectId("4ffb29bce4b0b48ac03d349e")`, and search for chunks with a matching `files_id`:

> db.fs.chunks.find({files_id: ObjectId("4ffb29bce4b0b48ac03d349e")}, {data: 0}).pretty()
{
	"_id" : ObjectId("4ffb29bce4b0b48ac03d349f"),
	"files_id" : ObjectId("4ffb29bce4b0b48ac03d349e"),
	"n" : 0
}
{
	"_id" : ObjectId("4ffb29bce4b0b48ac03d34a0"),
	"files_id" : ObjectId("4ffb29bce4b0b48ac03d349e"),
	"n" : 1
}
{
	"_id" : ObjectId("4ffb29bce4b0b48ac03d34a1"),
	"files_id" : ObjectId("4ffb29bce4b0b48ac03d349e"),
	"n" : 2
}

Because the data field is quite large, I've omitted it from that query. What I actually want is the length of the data field, so for each chunk I'll do:

> db.fs.chunks.findOne({files_id: ObjectId("4ffb29bce4b0b48ac03d349e"), n:0}).data.length()
262144
> db.fs.chunks.findOne({files_id: ObjectId("4ffb29bce4b0b48ac03d349e"), n:1}).data.length()
262144
> db.fs.chunks.findOne({files_id: ObjectId("4ffb29bce4b0b48ac03d349e"), n:2}).data.length()
132294

So if you can, give that a shot on your GridFS database (with the appropriately matching `files_id`). The chunk lengths should exactly match the `262144, 262144, 132294` returned above. If the lengths are instead `262148, 262148, 132298`, the data portion of each chunk is incorrectly prefixed with its length. Incorrect lengths would at least allow us to conclude that the 2.7.3 driver is operating correctly, though they wouldn't explain why the 2.6 driver strips those lengths when reading the data back.
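
The same check is possible from Java if the shell is inconvenient; a minimal sketch, where the host, the database name `mydb`, and the `ObjectId` are placeholders for your values:

    // imports: com.mongodb.Mongo, com.mongodb.DB, com.mongodb.DBObject, com.mongodb.BasicDBObject, org.bson.types.ObjectId
    DB db = new Mongo("localhost").getDB("mydb");
    ObjectId fileId = new ObjectId("4ffb29bce4b0b48ac03d349e"); // substitute your file's _id
    for (int n = 0; n < 3; ++n) {
        DBObject chunk = db.getCollection("fs.chunks")
                .findOne(new BasicDBObject("files_id", fileId).append("n", n));
        // the driver decodes the BSON binary data field as a byte[]
        byte[] data = (byte[]) chunk.get("data");
        System.out.println("chunk " + n + ": " + data.length + " bytes");
    }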

Comment by Daniel Gottlieb (Inactive) [ 10/Jul/12 ]

Unfortunately, I do not speak German. Fortunately, I don't think a translation will be necessary at this point.

So I found the pattern of what's going wrong, though I'm still going through the code to figure out why. Some context about GridFS for the explanation:

GridFS is just a driver feature that breaks big files down into chunks (default size, 256K) and saves each one as a separate MongoDB document. What I found in the corrupt PDF file you sent is that when the bytes got stitched back together, each chunk was preceded by the number of bytes in that chunk. The bytes `00 00 04 00`, read as a little-endian integer (i.e. with the order reversed to `00 04 00 00`), are 256K. Those bytes are repeated right at the 256K (+ 4 bytes) offset into the PDF file (the beginning of the second chunk). At the 512K (+ 8 bytes) offset we have the bytes `c6 04 02 00`, which translates to 132,294; the three lengths summed up exactly equal the file size of the intact PDF. In other words, the length of each chunk in bytes precedes the actual data when the PDF is stitched back together.
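
To make the byte-order arithmetic concrete, here is a standalone sketch (plain Java, not driver code) that decodes those 4-byte little-endian prefixes:

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    public class PrefixDecode {
        public static void main(String[] args) {
            // the bytes found at the start of the first two chunks
            byte[] first = { 0x00, 0x00, 0x04, 0x00 };
            // the bytes found at the 512K (+ 8 bytes) offset
            byte[] last = { (byte) 0xc6, 0x04, 0x02, 0x00 };
            int a = ByteBuffer.wrap(first).order(ByteOrder.LITTLE_ENDIAN).getInt(); // 262144 (256K)
            int b = ByteBuffer.wrap(last).order(ByteOrder.LITTLE_ENDIAN).getInt();  // 132294
            // 262144 + 262144 + 132294 == 656582, exactly the intact PDF's length
            System.out.println(a + " " + b);
        }
    }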

I will let you know when I know more.

Comment by Andreas Janson [ 10/Jul/12 ]

3. Yes, there's exactly one document matching the findOne() query.

Comment by Andreas Janson [ 10/Jul/12 ]

Thanks for looking into this! Since your name sounds rather German, I'll post untranslated code snippets. If I'm wrong, or a translation would be helpful anyway, let me know.

1. We are using the copy() method from commons-io-2.0.1

2. setMetaData adjusts the inputDokument before saving as follows:

  
private void setMetaData(InputDokument inputDokument, String fileName, MimeType mimeType) {
    FilenameBezeichnerAnpasser anpasser = new FilenameBezeichnerAnpasser(fileName, fileName, mimeType);
    inputDokument.setMimeType(mimeType.getMimeTyp());
    inputDokument.setFilename(anpasser.getAngepasstenFilename());
    inputDokument.setName(anpasser.getAngepasstenBezeichner());
}

"Bezeichner" is the document's name that the end user sees.

FilenameBezeichnerAnpasser

  public static final String UNIX_PATH_SEPERATOR = "/";
  public static final String WINDOWS_PATH_SEPERATOR = "\\";
  public String filename;
  public String bezeichner;
 
  public FilenameBezeichnerAnpasser(String filename, String bezeichner, MimeType mimeType) {
 
    String fileEnding = getFileEnding(filename, mimeType);
 
    if (isNotBlank(bezeichner)) {
      this.filename = bezeichner + (fileEnding == null ? "" : "." + fileEnding);
      this.bezeichner = bezeichner;
    }
    else {
      this.filename = createFilenameWithEnding(filename, mimeType);
      this.bezeichner = removeFileEnding(this.filename);
    }
  }
 
  private String removeFileEnding(String filename) {
    if (isBlank(filename)) {
      return null;
    }
 
    if (hasFileEnding(filename)) {
      return filename.substring(0, filename.lastIndexOf("."));
    }
 
    return filename;
  }
 
  private String getFileEnding(String filename, MimeType mimeType) {
    if (hasFileEnding(filename)) {
      return getFileEnding(filename);
    }
    else {
      return mimeType == null ? null : mimeType.getFileEnding();
    }
  }
 
  static String getFileEnding(String filename) {
    return filename.substring(filename.lastIndexOf(".") + 1, filename.length());
  }
 
  static String createFilenameWithEnding(String filename, MimeType mimeType) {
    if (isBlank(filename)) {
      return null;
    }
 
    filename = removePath(filename);
 
    if (mimeType == null) {
      return filename;
    }
 
    if (hasFileEnding(filename)) {
      return filename;
    }
    else {
      return filename + "." + mimeType.getFileEnding();
    }
  }
 
  static String removePath(String filename) {
    if (isBlank(filename)) {
      return null;
    }
    return removePath(removePath(filename, UNIX_PATH_SEPERATOR), WINDOWS_PATH_SEPERATOR);
  }
 
  private static String removePath(String filename, String pathSeperator) {
    if (filename.contains(pathSeperator)) {
      return filename.substring(filename.lastIndexOf(pathSeperator) + 1, filename.length());
    }
    else {
      return filename;
    }
  }
 
  static boolean hasFileEnding(String filename) {
    if (isBlank(filename)) {
      return false;
    }
    return filename.matches(".*\\.[a-zA-Z]{3,}$");
  }
 
  public String getAngepasstenFilename() {
    return filename;
  }
 
  public String getAngepasstenBezeichner() {
    return bezeichner;
  }

3. I'll check it out later.

4. GridFsOutputDokument is just a small wrapper. getInputStream() is used in DownloadServlet.

GridFsOutputDokument

public class GridFsOutputDokument extends GridFsDokument<GridFSDBFile> implements OutputDokument {
 
  public GridFsOutputDokument(GridFSDBFile file) {
    super(file);
  }
 
  @Override
  public InputStream getInputStream() {
    return getFile().getInputStream();
  }
}

5. We aren't directly subclassing any of the com.mongodb.gridfs classes. We wrap GridFSFile to enrich the document with metadata.

GridFsDokument

public abstract class GridFsDokument<T extends GridFSFile> implements DokumentMetadaten {
 
  private T file;
  protected static final String METADATA = "metadata";
  {...definition of metadata keys...}
 
  protected GridFsDokument(T file) {
    if (file == null) {
      throw new IllegalArgumentException("File should not be null");
    }
    this.file = file;
  }
 
  @Override
  public String getAnzeigeName() {
    if (isBlank(getBezeichner())) {
      return getFilename();
    }
    else {
      return getBezeichner();
    }
  }
 
  @Override
  public DokumentId getId() {
    return new DokumentId(file.getId().toString());
  }
 
  @Override
  public String getErstellenderBenutzer() {
    return (String) getMetadata().get(ERSTELLENDER_BENUTZER);
  }
 
  @Override
  public OrgaEinheitId getErstellendeOrganisationsEinheitId() {
    String value = (String) getMetadata().get(ERSTELLENDE_ORGANISATIONS_EINHEIT_ID);
    return value == null ? null : new OrgaEinheitId(value);
  }
 
  @Override
  public Date getErstellungsDatum() {
    return (Date) getMetadata().get(ERSTELLUNGS_DATUM);
  }
 
  @Override
  public String getMimeType() {
    return getFile().getContentType();
  }
 
  @Override
  public String getBezeichner() {
    return (String) getMetadata().get(BEZEICHNER);
  }
 
  @Override
  public String getFilename() {
    return getFile().getFilename();
  }
 
  @Override
  public OeffentlicherDokumentenSchluessel getOeffentlicherSchluessel() {
    String value = (String) getMetadata().get(OEFFENTLICHER_SCHLUESSEL);
    return new OeffentlicherDokumentenSchluessel(value);
  }
 
  private DBObject getMetadata() {
    return getFile().getMetaData();
  }
 
  protected T getFile() {
    return file;
  }
 
  @Override
  public Long getSize() {
    return getFile().getLength();
  }
}

Comment by Daniel Gottlieb (Inactive) [ 09/Jul/12 ]

I haven't been able to reproduce this with the naive attempt of merely saving the non-corrupt PDF to GridFS with the 2.6 driver and reading that with the 2.7.3 driver. I have a few questions, some of which probably won't have anything to do with the problem. I'm just trying to be thorough.

  1. The IOUtils.copy method: I assume that's the Apache Commons function call? What version of the Commons IO library are you using?
  2. setMetaData(InputDocument, String, MimeType): Is that data also being inserted into GridFS/MongoDB? If so can I see the code for that function?
  3. In DokumentenRepositoryMongo.getDokument there's a constructed query that findOne() is called with. Can you confirm exactly one document matches that query?
  4. Can I see the code to GridFsOutputDokument?
  5. Are you subclassing any of the classes inside com.mongodb.gridfs?

Comment by Andreas Janson [ 04/Jul/12 ]

Ok, I hope I got all the relevant code snippets. I also translated some of the names into English (or something closely related). I hope the code's understandable. Let me know if you need any more snippets.

DokumentSaveService

 
  public DokumentSaveResponse save(String fileName, byte[] inputBytes, MimeType mimeType) {
    InputDokument inputDokument = dokumentenRepository.createDokument();
    setMetaData(inputDokument, fileName, mimeType);
    IOUtils.copy(new ByteArrayInputStream(inputBytes), inputDokument.getOutputStream());
    inputDokument.save();
    return new DokumentSaveResponse(inputDokument.getId(), inputDokument.getDocumentKey());
  }

GridFsInputDokument

 
 public class GridFsInputDokument extends GridFsDokument<GridFSInputFile> implements InputDokument {
 
  protected GridFsInputDokument(GridFSInputFile file) {
    super(file);
    file.setMetaData(new BasicDBObject());
  }
 
  @Override
  public OutputStream getOutputStream() {
    return getFile().getOutputStream();
  }
 
 {....}
 
  @Override
  public void save() {
    try {
      getOutputStream().close();
    }
    catch (IOException e) {
      throw new RuntimeException(e);
    }
  }
}

DokumentenRepositoryMongo

 
  @Override
  public InputDokument createDokument() {
    GridFSInputFile gridFSFile = database.getGridFs().createFile();
    GridFsInputDokument inputDokument = new GridFsInputDokument(gridFSFile);
    {...}
    return inputDokument;
  }
 
  @Override
  public OutputDokument getDokument(OeffentlicherDokumentenSchluessel documentKey) throws NoSuchElementException {
    DBObject query = createQuery(...);
    GridFSDBFile file = database.getGridFs().findOne(query);
    if (file != null) {
      GridFsOutputDokument outputDokument = new GridFsOutputDokument(file);
      return outputDokument;
    }
    {...}
  }

DownloadServlet

 
  @Override
  protected void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
  {...}
      authenticate(request);
      findDokument(request, response, documentKey);
  {...}
  }
 
private void findDokument(HttpServletRequest request, HttpServletResponse response, String documentKey) throws IOException, InterruptedException, NoSuchElementException, ServletException {
    OutputDokument outputDokument = dokumentenRepository.getDokument(new DocumentKey(documentKey));
    sendDokumentContent(request, response, outputDokument);
    {... retry...}
  }
 
  private void sendDokumentContent(HttpServletRequest request, HttpServletResponse response, OutputDokument outputDokument) throws IOException, ServletException {
    response.setContentType(outputDokument.getMimeType());
    response.setContentLength(outputDokument.getSize().intValue());
    response.setCharacterEncoding("UTF-8");
    response.addHeader("Content-Disposition", "inline; filename=\"" + outputDokument.getFilename() + "\"");
    IOUtils.copy(outputDokument.getInputStream(), response.getOutputStream());
  }
 

Comment by Scott Hernandez (Inactive) [ 03/Jul/12 ]

Can you attach the code you are using to save and retrieve the file from mongo?
