[CSHARP-710] Implement PreSerialize hook on objects Created: 22/Mar/13  Updated: 11/Mar/19  Resolved: 26/Mar/13

Status: Closed
Project: C# Driver
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Vladimir Perevalov Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

All



 Description   

BSON serialization allows objects to implement ISupportInitialize, and the driver will call the appropriate methods around deserialization. But I have not found any counterpart for serialization. I think it would be great to have a custom interface, e.g.:

interface IBsonSerializationHook
{
    void BeforeSerialize();
    void AfterDeserialize();
}

That would make it easy to do some work before serialization and after deserialization.
I have a use case for this:
There is a document per user: {_id:123, ReadArticleIds:[1,2,3,4,5.... 10000]}
ReadArticleIds may contain thousands of ids. If I just save/load it as-is, it takes forever (more than a second per record) to load.
But I know that I need ReadArticleIds only in memory; I will not be doing any queries against it. So I serialize it with BinaryFormatter and save it as BsonBinaryData. Now it loads/saves at least 100 times faster, even with the overhead of BinaryFormatter.
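
For illustration, here is a sketch of how a class might implement the proposed hook. The interface does not exist in the driver, so this is purely hypothetical, and the class and property names are made up for the example:

public class UserReadArticles : IBsonSerializationHook // hypothetical interface from above
{
    [BsonId]
    public long UserId { get; set; }

    // Kept only in memory; never queried, so it need not be stored as a BSON array.
    [BsonIgnore]
    public long[] ReadArticleIds { get; set; }

    // The compact blob that actually gets persisted.
    public BsonBinaryData ReadArticleIdsBlob { get; set; }

    public void BeforeSerialize()
    {
        using (var stream = new MemoryStream())
        {
            new BinaryFormatter().Serialize(stream, ReadArticleIds);
            ReadArticleIdsBlob = new BsonBinaryData(stream.ToArray());
        }
    }

    public void AfterDeserialize()
    {
        if (ReadArticleIdsBlob == null) return;
        using (var stream = new MemoryStream(ReadArticleIdsBlob.Bytes))
        {
            ReadArticleIds = (long[])new BinaryFormatter().Deserialize(stream);
        }
    }
}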

I tried implementing IBsonSerializable, doing my own work inside, and then calling BsonSerializer.Serialize(). But that obviously caused a stack overflow.

Also, I just don't like such hacky customization. The code would be much easier to understand with appropriate interface/method names.



 Comments   
Comment by Vladimir Perevalov [ 27/Mar/13 ]

Here are some more detailed test results. First of all, I'll describe my records and data. Here's my record class:
public class UserReadArticlesNative
{
    [BsonId]
    public long UserId { get; set; }

    public Dictionary<long, HashSet<long>> ReadArticleIds { get; set; }
}
It usually holds 20-60 entries in the dictionary and 1000-20000 ids in each HashSet. Article ids are mostly sequential (an identity column in SQL) and the values are around 7000000 and growing.
I've selected the 10 most active users; they have 5000-30000 ids in the dictionary in total (ReadArticleIds.Sum(x => x.Value.Count)).
I created an empty test database. First I run write tests (writing only the 10 users to a separate collection, but repeating (overwriting) many times). Then I run read tests, again many times, reading each user's record in a loop.
For this test I have one mongod on the local 1Gbit network.

Following are the results for different serialization schemes (numbers are operations per second):

BinFormatterGzip  write: 46.5   read: 208.6
BinFormatterLZ4   write: 119.2  read: 380.1
BinFormatter      write: 74.5   read: 284.0
Protobuf          write: 95.0   read: 320.1
ProtobufLZ4       write: 123.6  read: 311.0
Native            write: 27.7   read: 107.6

The names should be self-explanatory (I used either just a serializer or a serializer plus a compressor; Native means no special serializers - the driver default).
This test ran for 10 seconds for each operation (write and read separately).
Results vary a bit from run to run, but the relative picture is the same every time. I see two clear leaders: .NET BinaryFormatter compressed with LZ4Sharp, and Protobuf (https://code.google.com/p/protobuf-net/) + LZ4Sharp (https://github.com/stangelandcl/LZ4Sharp).
They have almost the same write speed, both 4+ times faster than the native driver implementation, and read speeds 3+ times faster.
Right now I plan to perform more tests; then I will choose a winner and will have to convert my database.
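
(For context, the ops/sec figures above come from a timed loop. The sketch below is only an illustration of how such a measurement could be done, not the actual harness; `collection` and `users` are assumed to be a MongoCollection<UserReadArticlesNative> and the 10 prepared test records.)

// Illustration only, not the actual test harness.
static double MeasureOpsPerSecond<T>(IEnumerable<T> items, Action<T> operation, TimeSpan duration)
{
    var stopwatch = Stopwatch.StartNew();
    long count = 0;
    while (stopwatch.Elapsed < duration)
    {
        foreach (var item in items)
        {
            operation(item);
            count++;
        }
    }
    return count / stopwatch.Elapsed.TotalSeconds;
}

// var writesPerSecond = MeasureOpsPerSecond(users, u => collection.Save(u), TimeSpan.FromSeconds(10));
// var readsPerSecond  = MeasureOpsPerSecond(users, u => collection.FindOneById(u.UserId), TimeSpan.FromSeconds(10));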

Comment by Craig Wilson [ 26/Mar/13 ]

I think that is a great solution for cross-platform compatibility. Our "Binary" type is just that: a bunch of bytes. We have special subtypes for UUIDs and MD5, but the general subtype is arbitrary and can hold any type of data. So standardizing your format on protocol buffers is perfectly fine and is cross-platform compatible.

I'm fairly certain that protocol buffers require "compile time" support because their binary format is not self-describing. In other words, it isn't dynamic-schema compatible because it requires both parties to know the format of the messages.

Anyways, thanks for the suggestion. I'm going to close this ticket as "Works as Designed."

Comment by Vladimir Perevalov [ 26/Mar/13 ]

No problem with the driver.
It is only regarding Robert's comment "One other observation worth making about storing your data this way is that applications written with other platforms/languages/drivers will not be able to use this data since it is just a binary BLOB in the database and only .NET knows how to deal with that binary BLOB."
Currently I use BinaryFormatter compressed with GZipStream. The two together yield the best loading speed in my tests. Even though GZip takes a lot of time, it is still faster than transferring more data over the network.
But now I'm looking into other serialization techniques and found Google Protocol Buffers (https://developers.google.com/protocol-buffers/). It is an open standard and has implementations for lots of languages, including C# (I used protobuf-net). Tests showed that using protobuf-net without any additional compression resulted in twice the write performance and 10% better read performance.
Of course, all numbers are for my specific cluster/network/data, etc., so they shouldn't be taken too seriously.
So using protobuf achieves two goals at once: it improves performance and makes the data available to other languages.
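
For what it's worth, a protobuf-based version could be plugged in via the same custom-serializer mechanism Robert shows in his 22/Mar comment below. The following is only a sketch: the class name is made up, and whether protobuf-net handles Dictionary<long, HashSet<long>> out of the box should be verified.

public class ReadArticleIdsProtobufSerializer : BsonBaseSerializer
{
    public override object Deserialize(BsonReader bsonReader, Type nominalType, Type actualType, IBsonSerializationOptions options)
    {
        // Read the stored blob and let protobuf-net rebuild the dictionary.
        var bytes = bsonReader.ReadBytes();
        using (var stream = new MemoryStream(bytes))
        {
            return ProtoBuf.Serializer.Deserialize<Dictionary<long, HashSet<long>>>(stream);
        }
    }

    public override void Serialize(BsonWriter bsonWriter, Type nominalType, object value, IBsonSerializationOptions options)
    {
        // Encode the dictionary with protobuf-net and store it as a single binary field.
        using (var stream = new MemoryStream())
        {
            ProtoBuf.Serializer.Serialize(stream, (Dictionary<long, HashSet<long>>)value);
            bsonWriter.WriteBytes(stream.ToArray());
        }
    }
}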

Comment by Craig Wilson [ 26/Mar/13 ]

Vladimir,
Could you expand upon your "multiplatform" and "open standard" comment? I'm unclear as to what you mean.

Comment by Vladimir Perevalov [ 26/Mar/13 ]

By the way, it is easy to make this multiplatform if you use an open standard for binary serialization like Google Protocol Buffers.

Comment by Vladimir Perevalov [ 26/Mar/13 ]

Robert, your approach seems to work quite well. Thanks again for the tip!

Comment by Vladimir Perevalov [ 22/Mar/13 ]

I understand this. For now this is not a problem.

Comment by Robert Stam [ 22/Mar/13 ]

You're welcome.

One other observation worth making about storing your data this way is that applications written with other platforms/languages/drivers will not be able to use this data since it is just a binary BLOB in the database and only .NET knows how to deal with that binary BLOB.

Comment by Vladimir Perevalov [ 22/Mar/13 ]

It looks really good, but I have to test it. I'll post my results probably next week.
Thanks for helping.

Comment by Robert Stam [ 22/Mar/13 ]

Here's how you could choose to serialize your array of integers using the BinaryFormatter in the current version of the driver:

Declare your class like this:

public class D
{
    public int Id { get; set; }
    [BsonSerializer(typeof(IntegerArrayBinarySerializer))]
    public int[] A { get; set; }
}

And write this short custom serializer:

public class IntegerArrayBinarySerializer: BsonBaseSerializer
{
    public override object Deserialize(BsonReader bsonReader, Type nominalType, Type actualType, IBsonSerializationOptions options)
    {
        var bytes = bsonReader.ReadBytes();
        using (var stream = new MemoryStream(bytes))
        {
            var formatter = new BinaryFormatter();
            return (int[])formatter.Deserialize(stream);
        }
    }
 
    public override void Serialize(BsonWriter bsonWriter, Type nominalType, object value, IBsonSerializationOptions options)
    {
        var a = (int[])value;
 
        using (var stream = new MemoryStream())
        {
            var formatter = new BinaryFormatter();
            formatter.Serialize(stream, a);
            var bytes = stream.ToArray();
            bsonWriter.WriteBytes(bytes);
        }
    }
}

When writing a custom serializer you implement Serialize and Deserialize (instead of BeforeSerialize and AfterDeserialize in your proposed IBsonSerializationHook), with the additional advantage that you don't have to add a temporary field like ABlob that isn't really needed.
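
For completeness, a usage sketch of the class above (the database/collection setup and the values are assumptions, using the 1.x MongoCollection API):

// Assumes `database` is an existing MongoDatabase instance.
var collection = database.GetCollection<D>("docs");
collection.Insert(new D { Id = 1, A = Enumerable.Range(0, 10000).ToArray() });

// The custom serializer rebuilds the int[] from the stored bytes on the way back out.
var roundTripped = collection.FindOneById(1);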

Also, a custom serializer can be used in many places, so if you had other classes that also had large integer arrays you could use the same custom serializer with them instead of having to implement IBsonSerializationHook all over again for every class with a large integer array.

Would an approach like this meet your needs?

Comment by Vladimir Perevalov [ 22/Mar/13 ]

OK, maybe I quoted the wrong number.
But look at your result: 25 ms per record is already quite a lot - only 40 records per second, not counting network overhead.
Also, I am not complaining about deserialization speed. I did my tests and found a solution with BinaryWriter. I only want a convenient way to execute some code right before serialization.

Here's my sample code:

public class C : ISupportInitialize
{
    public int Id { get; set; }

    [BsonIgnore]
    public int[] A { get; set; }

    public BsonBinaryData ABlob { get; set; }

    static BinaryFormatter f = new BinaryFormatter();

    public void Serialize()
    {
        var ms = new MemoryStream();
        f.Serialize(ms, A);
        var result = new byte[ms.Position];
        Buffer.BlockCopy(ms.GetBuffer(), 0, result, 0, (int)ms.Position);
        ABlob = result;
    }

    public void EndInit()
    {
        if (ABlob == null) return;
        var ms = new MemoryStream(ABlob.Bytes);
        A = (int[])f.Deserialize(ms);
    }

    public void BeginInit() { }
}

You only have to call Serialize before inserting into Mongo.
Also, please test both cases against a real DB, not just serialization in isolation.
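
In other words, the calling sequence looks roughly like this (the `collection` variable and the values are assumptions):

var doc = new C { Id = 1, A = Enumerable.Range(0, 10000).ToArray() };
doc.Serialize();               // must be called manually: packs A into ABlob
collection.Insert(doc);

// On load, EndInit() runs automatically (via ISupportInitialize) and rebuilds A from ABlob.
var loaded = collection.FindOneById(1);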

Comment by Robert Stam [ 22/Mar/13 ]

I am unable to reproduce your claim that deserialization takes more than a second. When I serialize the following class with 10000 elements:

public class C
{
    public int Id { get; set; }
    public int[] A { get; set; }
}

This is how long it took on my machine:

Serialize took 0.0726004 seconds
Deserialize took 0.025541 seconds
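
(For reference, timings like these can be reproduced with a Stopwatch around ToBson and BsonSerializer.Deserialize. The sketch below is illustrative and not necessarily the exact harness used here.)

var doc = new C { Id = 1, A = Enumerable.Range(0, 10000).ToArray() };

var stopwatch = Stopwatch.StartNew();
var bson = doc.ToBson();                            // serialize to a BSON byte[]
stopwatch.Stop();
Console.WriteLine("Serialize took {0} seconds", stopwatch.Elapsed.TotalSeconds);

stopwatch.Restart();
var roundTripped = BsonSerializer.Deserialize<C>(bson);
stopwatch.Stop();
Console.WriteLine("Deserialize took {0} seconds", stopwatch.Elapsed.TotalSeconds);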
