[JAVA-2416] BSONReader mark - add support to unset a mark (in contrast to resetting), and add support for multiple marks Created: 22/Dec/16  Updated: 28/Jan/18  Resolved: 15/Mar/17

Status: Closed
Project: Java Driver
Component/s: BSON, Codecs
Affects Version/s: None
Fix Version/s: 3.5.0

Type: New Feature Priority: Major - P3
Reporter: Michael Weber Assignee: Jeffrey Yemin
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

n/a


Issue Links:
Duplicate
is duplicated by JAVA-2417 Add uset a mark and multiple marks in... Closed
Related
related to JAVA-2754 using multiple marks on reader causes... Closed

 Description   

It would be great if a mark set on a org.bson.BsonReader could be unset. Right now only resetiing to the mark(ed) position is possible.

My use case(s):
use case 1)
I wrote a reflection based org.bson.codecs.Codec that uses the new MongoDB java driver. Within the mongo database any inserted entity holds a field with the type (or name or...) of the correct Java-Pojo this entity maps to. When reading entities from the mongo I first need to read that type-property, then I create the java-Pojo instance and then I reset the mark to re-read all remaining fields into the newly created Java-instance. I am aware of the fact that I could have used the DocumentCodec provided by the mongo driver for that purpose, reading the whole entity as Document and then start to look out for the needed information within that Map-structure. This would imply double conversion hence double work hence less performance. And I would like to avoid this behaviour.
Now, in most cases the first field read is actually the type and I do not need to reset the mark. In this case I would like to unset the mark instead.

use case 2)
While decoding any decoder may internally mark the reader itself in order to apply logic as described above. Unfortunately the BsonReader is not capable of setting more than one mark at a time. An exception is being thrown if a mark is set more often. My desired behaviour would be: Any mark set can be reset AND unset.

The main issue here is, that I cannot be sure that I don’t know if the nested decoder uses a mark itself.

I hope I did not misunderstood the concept behind the BSONReader logic in general. If so, please make me aware of my wrong thinking.



 Comments   
Comment by Githook User [ 15/Mar/17 ]

Author:

{u'username': u'jyemin', u'name': u'Jeff Yemin', u'email': u'jeff.yemin@10gen.com'}

Message: JAVA-2416: Support multiple marks on a BsonReader by introducing a method that returns a BsonReaderMark.
Branch: master
https://github.com/mongodb/mongo-java-driver/commit/27fcfbd4a92aa2550c08ec37a0c152ef0d2d8651

Comment by Jeffrey Yemin [ 04/Jan/17 ]

Hi Michael,

That's an interesting use case, and not one I've encountered before. Thanks for taking the time to think it through.

I'll tentatively schedule this for the next release.

Comment by Michael Weber [ 04/Jan/17 ]

Hi Jeff,

it's actually a real use case for us.
Let's assume we have millions of entities in our database and we need to process all of them to create a list of all entries. There may be garbage/erroneous data in the database (instead an assumed string property there is a double or so...). In such a case the decoder would fail (if no explicit exception handling performed for every single read).
For us it actually does not really make a huge difference if ALL entities can be decoded. If almost all get encoded, we are happy. What certainly should not happen is a stop of the application when decoding a single entity fails. Instead we would log such issues and someone can take care later and repair the database entitiy or fix the code that failed decoding it.

Comment by Jeffrey Yemin [ 02/Jan/17 ]

I think what you'd need to do is set the mark at the start of the document, and if an exception is thrown then reset to the mark and then skip the entire document (via skipValue). In that case, you would need more than one mark. But is that a real use case? Would your application function properly if one or more documents can't be decoded?

Comment by Michael Weber [ 02/Jan/17 ]

Ok, cool.
One additional use case for unsetting a mark came just to my mind.

I was asking myself how to implement proper exception handling in case of reading multiple entities in one go.
If the decode method of a codec would throw an exception this would leave the reader in an somehow arbitrary state. It woud not be easy to recover the reader to point to the next entity in the stream. So a solution I first thought of was to set a mark whenever I start reading an entity and if the current entity could be properly decoded simply unset the mark or even better, just forget about its existence.

Is there a better way of implementing a faukt tolerant exception handling? I know I could wrap any reader.readXYZ with a try catch. But this seems somehow odd to me.

Comment by Jeffrey Yemin [ 02/Jan/17 ]

That's true about regular expressions but that type is rarely stored in documents and decoded. It's almost always used in queries and therefore only encoded by the driver.

Comment by Michael Weber [ 02/Jan/17 ]

Hi Jeff,

thanks for clearing things up for me.
Just for completion. The following lines of code within org.bson.BsonBinaryReader made me feel that skipping may be expensive in some cases:

            case REGULAR_EXPRESSION:
                bsonInput.skipCString();
                bsonInput.skipCString();

It seems in case of regular expressions the byte buffer is read until a null byte is found.

I agree that an unoptimized algorithm makes the code much easier and I might opt for this solution. Though I feel that other users of the java driver may stumble upon the "mark cannot be unset or set multiple times" problematic as well in different contexts in the future. I may be mistaken though.

Thank a lot once more.

Comment by Jeffrey Yemin [ 02/Jan/17 ]

My second point is maybe a lack of understanding of the reader itself. The reader what in my understanding is a stream of bytes cannot be garbage collected if a mark is set in the beginning, because I will need to go back to the marked position in order to decode the fields later, hence memory will be consumed and hold back from being garbage collected. (at maximum 16 megabytes as I remember correctly)

In the current implementation, there is a single buffer underlying the entire response from the server, and therefore it is not garbage collected until decoding of the entire response is complete. Whether the mark is set has no effect on this.

I quickly had a look at BsonReader.skipValue() and it seems, depending on the type of the field this method may be expensive for certain types as sizes need to be calculated?! Am I wrong?

It's actually not expensive, as the sizes are stored directly in the BSON structure and therefore don't need to be calculated. See http://bsonspec.org/spec.html for details.

Let me know if this information changes the situation sufficiently to justify the unoptimized algorithm. Since the type field is likely to be one of the first fields, usually the decoder won't have to skip much at all. In the worst case it will have to skip each metadata field as well as the payload field, but the payload field is cheap to skip since, as indicated above, its size is encoded into the BSON structure.

Comment by Michael Weber [ 02/Jan/17 ]

Hi Jeff,

the algorithm you describe is more or less how I implemented the reflection codec in the first run.
My concern was more related to performance as I need to fully read through all properties to determine the type even though I skip all values that are not the type field.
My second point is maybe a lack of understanding of the reader itself. The reader what in my understanding is a stream of bytes cannot be garbage collected if a mark is set in the beginning, because I will need to go back to the marked position in order to decode the fields later, hence memory will be consumed and hold back ffrom being garbage collected. (at maximum 16 megabytes as I remember correctly)
I quickly had a look at BsonReader.skipValue() and it seems, depending on the type of the field this method may be expensive for certain types as sizes need to be calculated?! Am I wrong?

Let me explain my use case in more detail in order to demonstrate the need for a more fine grained mark/unmark/reset control.
Assume you have the following data structure stored in mongo.

We decided to split any entity into meta data and the actual payload whereby the meta data are properties stored as first level properties and the payload is stored in the field payload. All entites have identical meta data fields ("realId", "version", "metaInfoA", "type", etc.) We could have put all meta data fields into a dedicated field to compose them, but for some reason that I do not recall (maybe some mongo internals like partial index or so??) we put them into level zero. Depending on the field "type" the payload contains different properties.

{
  "_id": ObjectId("57beb711ed91159e1e3857e1"),
  "realId": ObjectId("57beb711ed91159e1e3857e2"),
  "version": 1,
  "metaInfoA": [],
  "metaInfoB": [],
  "current": true,
  "type": "ARTICLE",
  "payload": {
    "propertyA" : 123,
    "propertyB" : 123
  }
}

The resulting java-Pojo has a different structure as follows:

public class MetaData {
    ObjectId id;
    ObjectId realId;
    List<String> metaInfoA;
    List<String> metaInfoB;
}
 
public class BaseType {
    MetaData metaData;
}
 
public class Article extends BaseType {
	int propertyA;
	int ropertyB;
}

The java code to decode this looks as follows:

    public T decode(BsonReader reader, DecoderContext decoderContext) {
        // first look ahead to find "type" property
        Type type = null;
        T newInstance = null;
        MetaData documentMeta = documentMetaCodec.newInstance();
        boolean dataMarkSet = false;
        reader.readStartDocument();
        while (reader.readBsonType() != BsonType.END_OF_DOCUMENT) {
            String fieldName = reader.readName();
            switch (fieldName) {
                case "type":
                    type = Type.getForName(reader.readString());
                    break;
                case "payload":
                    if (type != null) {
                        newInstance = decodeDataInstance(reader, decoderContext, type);
                    } else {
                        // ok we need to read later... :(
                        dataMarkSet = true;
                        reader.mark();
                        reader.skipValue();
                    }
                    break;
                // any other field is assumed to belong to MetaData
                default:
                    MappedField field = documentMetaCodec.getMappedField(fieldName);
                    if (field != null) {
                        field.decode(reader, documentMeta, decoderContext);
                    } else {
                        reader.skipValue();
                    }
            }
        }
 
        if (newInstance == null) {
 
            if (dataMarkSet) {
                LOGGER.warn("Performance issue: type property is provided after data property.");
                reader.reset();
                if (type == null) {
                    LOGGER.warn("Data-Inconsistency issue: no or unknown type found. Skipping entity. Returning null");
                    reader.skipValue();
                } else {
                    // now read data
                    newInstance = decodeDataInstance(reader, decoderContext, type);
                }
                //skip all remaining fields
                while (reader.readBsonType() != BsonType.END_OF_DOCUMENT) {
                    reader.skipName();
                    reader.skipValue();
                }
            }
            // no payload attribute found in mongo
            else {
                // create empty entity
                newInstance = newInstance(type);
            }
        }
 
        reader.readEndDocument();
        if (newInstance != null) {
            documentMetaMappedField.setFieldValue(newInstance, documentMeta);
        }
        return newInstance;
    }

As MetaData is a known structure for all entities we optimized the decoding in a way that all meta data is read independently from the resulting type. In the end when the type is known and the payload is decoded, the meta data is injected into that java object.
Admitted this is quite complex. The point to demonstrate here is, that these optimizations lead to using the reader in different codecs a the same time depending on what property is being read currently.
This works as long as no decoder being used "marks" the reader. It may happen that we reach the code afterline "// ok we need to read later... "

I am aware I could follow your suggested algorithm as described in your previous comment to avoid this issue.
If I wanted to open souce my codec I would actually need to distribute the unoptimized version as I do not know if other codecs (third party codecs) will use marks itself.

I hope you can follow my motivation for implementing a fine grained control over marks.

Cheers

Michael

Comment by Jeffrey Yemin [ 29/Dec/16 ]

Hi Michael,

I don't think it's necessary to have more than one mark to build a reflection-based codec. The pseudocode I imagine would look something like this:

T decode(reader)
   String type = findType(reader)
   return decodeType(type, reader) 
 
String findType(reader)
   reader.mark()
 
   try  
      while reader.readBsonType() != BsonType.END_OF_DOCUMENT
           if reader.readName().equals("_type")  // or whatever the name of the type field is...
               return reader.readString()
           reader.skipValue()
 
      throw SomeException("Unable to determine type")
   finally
      reader.reset()

So the mark is only set when finding the type, then immediately reset. By the time you actually decode the specific type, which is where the recursion happens, the mark has already been reset. I believe that takes care of use case 2.

As for use case 1, the decoder already has to handle the case where the type field is not the first, and ignore it. So while it would be a nice optimization to not have to do that when the type field has to be first, it's not strictly necessary for correctness. And by using BsonReader.skipValue() when that field is encountered during decoding, you can avoid the cost of decoding the type string if you'd like.

My thinking about this is informed by looking at our .NET driver, which has a full reflection-based codec included with it. While it's BsonReader has the more extensive bookmarking capabilities that you're requesting, those capabilities are not actually used by its own decoder.

Let me know what you think.

Comment by Michael Weber [ 23/Dec/16 ]

Hi Jeff,

thanks for your answers.
I try to make my second use case more clear before answering to the proposed solution for issue one (unsetting a mark) because these topics are related.

Assuming you would write a Codec similar to org.bson.codecs.DocumentCodec
In the flow of decode(...) within
https://github.com/mongodb/mongo-java-driver/blob/master/bson/src/main/org/bson/codecs/DocumentCodec.java#L135
additional "nested" decoders are called in these lines..
https://github.com/mongodb/mongo-java-driver/blob/master/bson/src/main/org/bson/codecs/DocumentCodec.java#L210
https://github.com/mongodb/mongo-java-driver/blob/master/bson/src/main/org/bson/codecs/DocumentCodec.java#L212

If now (a) mark(s) would be set (they are not used in DocumentCodec currently) these "nested" Codecs would throw the following error in case they would rely on setting a mark themself.
-> org.bson.BSONException: A mark already exists; it needs to be reset before creating a new one

Fortunately I haven't seen the use of mark() in the Codecs that are shipped with the mongo driver, so the problem does not exist there.

In my case I would love to use marks more often. And I actually nest my ReflectionCodec, hence I nest mark()s

A solotion whereby I could simply overwrite the last set mark (as you proposed in your first comment) is not feasable since I do not know, if the wrapping decode needs to unset the reader to a previous remembered state.

Either marks are put on a stack structure and they need to be unset/reset in the order they are put on the readers stack or as you pointed out handles to set marks are returned in order to have fine grained control.

I hope you now understand my use case? If not, let me know and I can assemble a code example.

Comment by Jeffrey Yemin [ 22/Dec/16 ]

For reference, here is prior art from the C# driver:

https://github.com/mongodb/mongo-csharp-driver/blob/ee97df791d0a42853295bcc5cd791e37b8bb076d/src/MongoDB.Bson/IO/BsonReader.cs#L119-L119
https://github.com/mongodb/mongo-csharp-driver/blob/ee97df791d0a42853295bcc5cd791e37b8bb076d/src/MongoDB.Bson/IO/BsonReader.cs#L350
https://github.com/mongodb/mongo-csharp-driver/blob/c735a8c0ff5283fffd3d935e248bd7aacf07b3b8/src/MongoDB.Bson/IO/BsonReaderBookmark.cs

Rather than requiring a stack, it exposes each bookmark as an object whose position can be returned to in any order.

Comment by Jeffrey Yemin [ 22/Dec/16 ]

Hi Michael,

I understand use case 1 and it seems that it would be enough to simply relax the current restriction that a mark can only be set if there is no active mark, and the only way to clear a mark is to reset it. In other words, this should be allowed:

reader.mark();
// do something will reader
reader.mark()  

Alternatively, or in addition, we could add BsonReader.clearMark(), which has the advantage of communicating to the BsonReader implementation that the current mark no longer has to be remembered.

For the second use case, I'm less clear on what you need. Would a stack of marks be sufficient, or do you need more control than that?

Generated at Thu Feb 08 08:57:09 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.