v4; motivation and initial thoughts #951

Draft: wants to merge 114 commits into base: main

@mgravell mgravell commented Sep 6, 2022

This PR covers some initial exploration into v4

Key Motivations

  1. improve AOT support
  2. improve performance
  3. support additional memory usage scenarios
  4. smaller outputs

2 and 3 are most likely by way of a new reader/writer API with additional optimizations; 1 is most likely via new build tools which integrate with the outputs from 2 and 3

Improve AOT Support

Currently the core engine is focused on runtime reflection-based IL emit. The library conceptually supports AOT scenarios, including library separation of the core and reflection-based aspects, and attribute-based annotation support for manually-written serializers, but none of the tools currently generate code-based serializers. We aim to provide both code-first and contract-first AOT scenarios, typically using Roslyn generators (based either on the discovered code model, or on the parsed .proto files - the machinery ahead of these bits already exists).

Additionally:

  1. runtime reflection-based emit is slow at the initial usage, requiring lots of additional system assemblies, lots of type discovery, and consideration of a complicated system, and the actual emit; this impacts cold-start performance, particularly relevant for serverless scenarios where the process is typically short-lived
  2. runtime reflection discovery and emit is not well supported on all platforms, in particular impacting "unity" etc (also: IL2CPP doesn't support all IL scenarios, and isn't perfect in some of the relevant cases)
  3. runtime reflection discovery and emit demands a wide graph of system assemblies; this impacts "pruning", meaning either we need to retain a lot of libraries, or it won't work properly; this impacts "blazor" in particular
  4. runtime reflection discovery and emit is hard to debug, maintain, and extend; if we want to add radical new features (a new core reader/writer API, async, etc) it is prohibitively expensive to implement this in the existing design, and demands very niche skills (reducing the ability of people to contribute)

Improve performance

Profiling has shown that the existing API is sub-optimal; discovery work has been done ahead of this PR to investigate a "from first principles" re-imagining of the core reader/writer API. It is fundamentally not possible to achieve all of the aims here without a new API, although it may be possible to reuse the new API from within the older API, with the older API acting as a wrapper layer.

These changes include:

  • reworking the data buffer to reduce all unnecessary optimizations
  • using CPU primitives where profiling shows it to be useful
  • using better generated serializer code to reduce operations
  • exploiting framework features like list-span access
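
As an illustration of the last point, here is a minimal sketch that assumes "list-span access" refers to `CollectionsMarshal.AsSpan` (available from .NET 5). The `ListSpanDemo` class is hypothetical and is not taken from the PR; it only shows how a `List<T>` can be walked as a span over its backing storage.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

public static class ListSpanDemo
{
    // Sums a list by walking its backing storage as a span,
    // rather than going through the List<T> indexer or enumerator.
    public static long Sum(List<int> values)
    {
        // CollectionsMarshal.AsSpan exposes the list's internal array;
        // the list must not be resized while the span is in use.
        Span<int> span = CollectionsMarshal.AsSpan(values);
        long total = 0;
        foreach (int value in span)
        {
            total += value;
        }
        return total;
    }
}
```

Whether this helps in practice depends on the element type and the loop body, so it would need profiling like any of the other items in the list.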

Support Additional Memory Usage Scenarios

Some models are inherently "allocatey"; consider, for example, a model with a repeated chunk of multiple sub-items, each of which has a bytes payload, resulting in large numbers of small byte[] chunks. The idea is to support more efficient handling of such models; e.g. we could generate ReadOnlyMemory<byte> instead of byte[], and allow multiple leaf levels to be slices of the same underlying oversized buffer. The existing PR explores this scenario. Note, however, that profiling is mixed on the outcome of this. We want to enable this, but as an option, allowing us to play with multiple options with real data.
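
To make the shared-buffer idea concrete, here is a minimal sketch. `SharedBufferDemo` and `Pack` are hypothetical names, not the PR's implementation; the sketch only shows several payloads being exposed as `ReadOnlyMemory<byte>` slices over a single array, so that N payloads cost one array allocation instead of N.

```csharp
using System;
using System.Collections.Generic;

public static class SharedBufferDemo
{
    // Copies every payload into one oversized buffer and returns
    // a slice of that buffer for each payload.
    public static List<ReadOnlyMemory<byte>> Pack(IReadOnlyList<byte[]> payloads)
    {
        int total = 0;
        foreach (byte[] payload in payloads)
        {
            total += payload.Length;
        }

        byte[] shared = new byte[total];
        var slices = new List<ReadOnlyMemory<byte>>(payloads.Count);
        int offset = 0;
        foreach (byte[] payload in payloads)
        {
            payload.CopyTo(shared, offset);
            // Each slice is a view over the shared array; no per-item array is kept.
            slices.Add(new ReadOnlyMemory<byte>(shared, offset, payload.Length));
            offset += payload.Length;
        }
        return slices;
    }
}
```

The trade-off is lifetime: any surviving slice keeps the whole shared buffer alive, which may be one reason the profiling results are mixed.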

Smaller Outputs

Right now the runtime library needs to contain code paths for everything it might need, including niche code paths for obscure and esoteric models. Because this discovery is done via reflection, these edge cases are largely not trimmable (in the AOT sense), because determining whether they are reached or not is basically impossible. By moving to an AOT path, without all the reflection gunk, it is very clear at build time what code is reached. This means we don't need all the reflection dependencies, and we don't need all the dependencies for all the stuff that isn't used by the model. This saving can be significant.


Likely implementation

We need to consider code-first and contract-first separately here. Let's consider a simple scenario:

syntax = "proto3";

message Foo {
    repeated string bar = 1;
}

Currently, this can be used to generate something akin to the same contract, as seen from a code-first perspective:

[ProtoContract]
public partial class Foo
{
    [ProtoMember(1)]
    public List<string> Bars { get; } = new();
}

What we want to achieve is that whether starting code-first or contract-first, we generate code that includes the actual serialization code, either at the same time as generating the code (contract-first), or in an additional partial-class (code-first). Typical output code is shown in the exploration work in the PR.
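
For orientation only, here is a hand-written sketch of the bytes a serializer for `Foo` has to produce under the standard protobuf wire format. This is not the PR's generated output and does not use the new reader/writer API; `FooWireSketch` and its members are hypothetical names. Each `bar` value is written as field 1 with wire type 2 (length-delimited): a tag byte, a varint byte count, then the UTF-8 bytes.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class FooWireSketch
{
    // Writes each string as field 1, wire type 2 (length-delimited).
    public static byte[] Serialize(IEnumerable<string> bars)
    {
        using var stream = new MemoryStream();
        foreach (string bar in bars)
        {
            byte[] utf8 = Encoding.UTF8.GetBytes(bar);
            stream.WriteByte((1 << 3) | 2); // field number 1, wire type 2 => 0x0A
            WriteVarint(stream, (uint)utf8.Length);
            stream.Write(utf8, 0, utf8.Length);
        }
        return stream.ToArray();
    }

    // Base-128 varint: 7 payload bits per byte, low bits first,
    // with the high bit set on every byte except the last.
    private static void WriteVarint(Stream stream, uint value)
    {
        while (value >= 0x80)
        {
            stream.WriteByte((byte)(value | 0x80));
            value >>= 7;
        }
        stream.WriteByte((byte)value);
    }
}
```

For example, `Serialize(new[] { "a", "bc" })` yields the bytes `0A 01 61 0A 02 62 63`. A generated serializer would presumably do the equivalent without the intermediate `byte[]` per string.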

The key point here, though, is that code-first and contract-first start from completely different code models - contract-first (and the existing code-gen) starts from the FileDescriptorSet view, whereas code-first starts from a Roslyn view. The actual code-gen should not have to contend with this, and we do not intend duplication, so: the proposal instead is to create a new source-agnostic API that the new code-gen tools should use, and populate the source-agnostic API from the specific scenarios.

For example, we could have:

class CodeGenerationModel
  List<CodeGenerationFile> Files

class CodeGenerationFile
  string Name
  List<CodeGenerationType> Types


class CodeGenerationType
  string Name, OriginalName // takes Name when null
  string Namespace
  ReadOnlyMemory<string> ParentTypes

  List<CodeGenerationMember> Members
  // flags and other helpers; is it an enum? value-type?
  // what are we generating for this type? members? serializer?

  // note: we expect inbuilt primitives to exist as CodeGenerationType,
  // for example, maybe `static CodeGenerationType.String`

class CodeGenerationMember
  string Name, OriginalName // takes Name when null
  string BackingMember
  int FieldNumber
  CodeGenerationType Type
  // data format? wire-type?
  // repeated? if so, what kind? other flags?

So here, we would generate the equivalent of

CodeGenerationModel model = new() {
  Files = {
    new() {
        Name = "my.generated.cs",
        Types = {
          new() {
            Name = "Foo",
            Generate = /* serializer+members for contract-first; serializer for code-first */,
            Members = {
              new() {
                FieldNumber = 1,
                Name = "Bars", OriginalName = "bar",
                Type = CodeGenerationType.String,
                MemberType = Repeated
              }
            }
          }
        }
    }
  }
};
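
A compilable version of this sketch might look like the following. It is only a minimal illustration of the shape described above: the `MemberKind` and `GenerateOptions` enums, the `ModelDemo` helper, and the exact property layout are assumptions made for the example, not the PR's actual types, and several of the sketched members (`Namespace`, `ParentTypes`, `BackingMember`) are left out for brevity.

```csharp
using System;
using System.Collections.Generic;

public enum MemberKind { Single, Repeated }           // hypothetical

[Flags]
public enum GenerateOptions { None = 0, Members = 1, Serializer = 2 } // hypothetical

public sealed class CodeGenerationType
{
    // Built-in primitive, following the "static CodeGenerationType.String" idea.
    public static readonly CodeGenerationType String = new() { Name = "string" };

    private string _originalName;
    public string Name { get; set; } = "";
    // Falls back to Name when no original name was supplied.
    public string OriginalName { get => _originalName ?? Name; set => _originalName = value; }
    public GenerateOptions Generate { get; set; }
    public List<CodeGenerationMember> Members { get; } = new();
}

public sealed class CodeGenerationMember
{
    private string _originalName;
    public string Name { get; set; } = "";
    public string OriginalName { get => _originalName ?? Name; set => _originalName = value; }
    public int FieldNumber { get; set; }
    public CodeGenerationType Type { get; set; }
    public MemberKind MemberType { get; set; }
}

public sealed class CodeGenerationFile
{
    public string Name { get; set; } = "";
    public List<CodeGenerationType> Types { get; } = new();
}

public sealed class CodeGenerationModel
{
    public List<CodeGenerationFile> Files { get; } = new();
}

public static class ModelDemo
{
    // Builds the model for the Foo example; contract-first emits members and
    // serializer, code-first emits only the serializer.
    public static CodeGenerationModel BuildFoo(bool contractFirst) => new()
    {
        Files =
        {
            new()
            {
                Name = "my.generated.cs",
                Types =
                {
                    new()
                    {
                        Name = "Foo",
                        Generate = contractFirst
                            ? GenerateOptions.Serializer | GenerateOptions.Members
                            : GenerateOptions.Serializer,
                        Members =
                        {
                            new()
                            {
                                FieldNumber = 1,
                                Name = "Bars", OriginalName = "bar",
                                Type = CodeGenerationType.String,
                                MemberType = MemberKind.Repeated,
                            },
                        },
                    },
                },
            },
        },
    };
}
```

In this sketch both front ends (Roslyn and FileDescriptorSet) would end up calling something like `BuildFoo`, and the emitter would only ever see `CodeGenerationModel`.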

So; the initial work items:

  1. define a rough skeleton model for the above new API
  2. parse the Roslyn code-first model to populate the new model
  3. parse the FileDescriptorSet contract-first model to populate the new model
  4. emit new model+serializer code from the new model, against the new serializer API
  5. implement the new serializer API

It is not a goal of the current stage to emit code for the old serializer API from the new model; while that might be a nice feature in the future, it is not seen as solving an immediate need, and will only add support costs.


High level tasks

  • setup test skeleton
    • parse .proto to FileDescriptorSet
    • parse C# to Roslyn model
  • setup new working model
  • populate working model from FileDescriptorSet
  • populate working model from Roslyn model
  • basic DTO output from working model
  • serializer output from working model
  • complete the reader/writer API

Test skeleton; somehow setup multi-input test (folder-based?) that takes a corpus of examples

# Conflicts:
#	protobuf-net.sln
#	src/Benchmark/Benchmark.csproj
#	src/BenchmarkBaseline/BenchmarkBaseline.csproj
#	src/BuildToolsUnitTests/BuildToolsUnitTests.csproj
#	src/Directory.Build.props
#	src/Examples/Examples.csproj
#	src/LongDataTests/LongDataTests.csproj
#	src/NativeGoogleTests/NativeGoogleTests.csproj
#	src/protobuf-net.AspNetCore/protobuf-net.AspNetCore.csproj
#	src/protobuf-net.BuildTools.Legacy/protobuf-net.BuildTools.Legacy.csproj
#	src/protobuf-net.BuildTools/protobuf-net.BuildTools.csproj
#	src/protobuf-net.Core/protobuf-net.Core.csproj
#	src/protobuf-net.FSharp.Test/protobuf-net.FSharp.Test.fsproj
#	src/protobuf-net.FSharp/protobuf-net.FSharp.csproj
#	src/protobuf-net.MSBuild.Test/protobuf-net.MSBuild.Test.csproj
#	src/protobuf-net.MSBuild/protobuf-net.MSBuild.csproj
#	src/protobuf-net.MessagePipes/protobuf-net.MessagePipes.csproj
#	src/protobuf-net.NodaTime/protobuf-net.NodaTime.csproj
#	src/protobuf-net.Protogen/protobuf-net.Protogen.csproj
#	src/protobuf-net.Reflection.Test/protobuf-net.Reflection.Test.csproj
#	src/protobuf-net.ServiceModel/protobuf-net.ServiceModel.csproj
#	src/protobuf-net.Test/protobuf-net.Test.csproj
#	src/protobuf-net/protobuf-net.csproj
#	src/protogen.site/protogen.site.csproj
#	src/protogen/protogen.csproj
@listepo

listepo commented Aug 28, 2023

Hey @mgravell thanks for your work, is there any news about it?

@Dona278

Dona278 commented Feb 23, 2024

Hi @mgravell , I know that you have a lot of work + family + combat criminals at night but I think this is the best protobuf library for dotnet, and Microsoft since net8 pushes a lot on performance + trimming + AOT + source generator, so I wanna ask:

  • After years, is there any ETA for this work?
  • Is there any chance of getting help from Microsoft to support this project, as already happened with Grpc.AspNetCore?

Anyway thank you for your work!

@mgravell
Member Author

Hi; no hard ETA, but definitely still in progress; I'm very aware of the AOT work, and the hope is for the Dapper.AOT learnings to lead into the protobuf-net work; there exists an AOT branch for the analyzer pieces, but I think a lot of it will need some significant rework, but: I'm also a little distracted by Google's recent discussion of "edition 2024", and the "group" changes, which I also want to integrate (parser now works, so... yay!). This is relevant because the "editions" work and the "AOT" work need to interact, so understanding both pieces at the same time is essential.

As for MSFT time: my MSFT time is focused on cache work at the moment, but: let's see how it goes a little later in the year.

@michaldobrodenka

About AOT: it seems that AssemblyBuilder.Save will work in .NET 9. I know generating C# code is the better solution, but would this be supported? Generating serializer assemblies for AOT in some "model.csproj" after-build step?

@mgravell
Member Author

@michaldobrodenka if AssemblyBuilder.Save starts working, I'll happily light up that API, and if that unblocks some scenarios: great! However, that will be unrelated to and tangential to the intended AOT route, which I hope to be codegen based

@tuga001-sme

Any news?

@PanzerFowst

First off, thank you for your work! It is great!

I know this is not a rushed change (family, day job, etc.), but I was curious what could be done to help this PR along? Are there API improvements of code generators in .NET 9 that can be taken advantage of now?

@mgravell
Member Author

The APIs haven't changed hugely (I don't think interceptors give us much); but I do need to revisit this from the ground up, using our learnings here as a foundation - the object model needs a lot of rework based on my learnings from Roslyn incremental generators over the last few years; the approach here is naive. Doable: yes. But it needs dedicated time.

@PanzerFowst

Thanks for getting back so quickly, Marc!

Ah, I see. So then would there be an issue / milestone with TODOs etc. to give a roadmap of what needs to be done so that we could help contribute where able?

@michaldobrodenka

michaldobrodenka commented Apr 16, 2025

I started to play with generators and created a demo for protobuf generated serializers/deserializers from protobuf-net attributes.

https://github.com/michaldobrodenka/GProtobuf

It's far from usable; only deserialization is supported, with only a handful of types. Not tested/used. Just a proof of concept. Maybe I will return to it sometime. But when it's working, deserialization is crazy fast.

@PanzerFowst

That's neat, @michaldobrodenka!

I am working on converting some code to be NativeAoT compliant and unfortunately haven't found a way to keep the NativeAoT runtime from trimming away protobuf-net. The only thing I have found so far is to use Google.Protobuf and manually create a .proto file for my DTOs, and it just ends up really messy...

But it did give me the idea (I haven't looked too deeply at this repo to see how feasible it is)--what if the [ProtoContract] and [ProtoMember(n)] attributes could create the .proto files automatically and add the <Protobuf Include="car.proto" /> to the .csproj to generate the Google.Protobuf code that can then be used to automagically accomplish the same behavior in a NativeAoT context?

I am sure there are reasons that this wouldn't work, but with .NET 9 giving full NativeAoT support for iOS, I am seeing a lot of movement towards NativeAoT to get off of MonoAoT.

@mgravell
Member Author

Eesh, I should just dust this off and ship something, even if it is incomplete. My plans are wider than my calendar, it seems.

@KybernetikGames

Is there any chance v4 could bring back support for AsReference that was in v2 which allowed a full object graph to be serialized with multiple fields referencing the same object?

I'm trying to find a good serializer for Unity and ProtoBuf v2 is the only one I've found which meets all my needs except that I can't seem to use it in Android builds due to IL2CPP requiring AOT compilation so it would be a huge shame to find a solution to that problem only to lose such a useful feature.

@Dona278

Dona278 commented Apr 26, 2025

@KybernetikGames did you look at the cysharp repos? They develop games with Unity and they are the creators of R3 (observables) and [Message/Memory]Pack (serializers), both developed to be compatible with Unity.

@michaldobrodenka

michaldobrodenka commented Apr 26, 2025

> Is there any chance v4 could bring back support for AsReference that was in v2 which allowed a full object graph to be serialized with multiple fields referencing the same object?
>
> I'm trying to find a good serializer for Unity and ProtoBuf v2 is the only one I've found which meets all my needs except that I can't seem to use it in Android builds due to IL2CPP requiring AOT compilation so it would be a huge shame to find a solution to that problem only to lose such a useful feature.

If you need a solution now, you can check my protobuf-net 2 fork - with precompile you can prepare the serializer in a post-build step as a dll. I'm using it in production. And you don't need the old .NET Framework. It works with net6+ https://github.com/michaldobrodenka/protobuf-net

@KybernetikGames

@Dona278 I briefly tried MessagePack and MemoryPack but ran into issues with each of them (here and here) which would have required me to refactor quite a bit of my code base. ProtoBuf v2 seemed like a silver bullet which handled everything I need to do with it right up until I tried to use it in a runtime build. But if I can't get it going then I'll definitely be revisiting the cysharp systems.

@michaldobrodenka I found your repo earlier today and have been trying to get it to work in Unity with no success so far and there's no Issues page so I wasn't sure how to contact you. Do you have a preferred contact method?

@michaldobrodenka

@KybernetikGames have you checked aot-net6 branch?
I have added issues to this project, but I don't plan to maintain this project much further; I'm just using it until I find a replacement. It works on all my projects and I'm looking for more modern solution - using Span and code generated. Something like my GProtobuf which is only a proof of concept now.

@mgravell
Member Author

I genuinely do have plans to revisit the AOT work. I just need the world to switch to a 36 hour day so I have enough hours in each...

@PanzerFowst

Well, I just wanted to ask if you maybe had an outline of the work (that you know of so far) that needs to be done, so that anyone who has the time and could contribute (I have been looking into IIncrementalGenerator and experimenting) would be able to help?

I know I am certainly interested in contributing.
