State Divergence Postmortems

Below are the postmortems for almost all the state divergence bugs we ran into at Galvanic, copied verbatim from our internal documentation. We started keeping track some months after I joined, so we are missing the solutions to some of the earliest ones found by the team.

Wherever you see a portion of text starting with wizgun-production, it indicates a link to a pull request with the code used to solve the divergence. These are inaccessible to the public, so when the pull request’s name is not self-explanatory, I’ve inserted my own explanation surrounded by square brackets.

Read the companion article here.

Dialogue Mismatch - 10252023

Symptom: Received a lot of user reports about checksum mismatches during boss fights, occurring when you talk to the boss to start the fight, as well as during various other dialogues.

Root Cause: We have a characterCount variable in a component that we use for a few different things: mostly for deciding whether you’re advancing a conversation or auto-completing it, but also for deciding whether it’s time to kick off the boss fight post-conversation. That character count wasn’t deterministic if the two users had different languages.

Solution: Use a deterministic character count (the longest across all languages) for the boss delay, and send the player’s intent (auto-completing vs. advancing the conversation) as part of the command, so we’re no longer relying on non-deterministic numbers.
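
[To make this concrete, here’s a minimal sketch of the deterministic count. The types are invented for illustration - this is not the actual wizgun code:]

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical type for illustration only.
public sealed class LocalizedLine
{
    // language code -> translated text
    public Dictionary<string, string> Translations = new Dictionary<string, string>();
}

public static class DialogueTiming
{
    // Deterministic character count: take the longest translation so every
    // client computes the same boss-fight delay regardless of local language.
    public static int DeterministicCharacterCount(LocalizedLine line)
    {
        return line.Translations.Values.Max(text => text.Length);
    }
}
```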

How we tracked it down:

We had several boss replays that were all diverging the moment the boss went into combat after a conversation. I figured it had something to do with dialogue timing, especially since we had other checksum mismatch reports around dialogues, so I went into the DialogueInteractionSystem and just started looking through it to see if I could spot anything that wouldn’t be deterministic. I noticed that we were using character counts for a wait time, and realized that wouldn’t be the same if folks were using different languages.

How can we make this faster?

Honestly it wasn’t too bad - the only thing that might have made this faster and easier to verify is if loading the replay had also applied the original player’s settings. That would have changed my language so that the replay would have actually diverged for me. Not sure if it’s worth the effort, though.

Charm Mismatch - 10182023

Symptom: Received a user report of a checksum mismatch when charming two enemies at once with a shotgun charm.

Root Cause: We were collecting all charmed entities for the frame, putting them in a hashmap keyed by ECSEntities, and then pulling a key-value array from that hashmap. Since ECSEntities aren’t deterministic, the ordering of that array wasn’t either.

Solution: Store charmed entities as a list instead of a hashmap so we could iterate over them deterministically.
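
[A minimal sketch of the change, with an invented Entity type standing in for our ECSEntities:]

```csharp
using System.Collections.Generic;

// Illustrative only - "Entity" stands in for the project's ECSEntity handle.
public struct Entity
{
    public int Index;
    public int Version;
}

public sealed class CharmCollector
{
    // Before: charmed entities went into a hashmap whose key-value array order
    // depends on non-deterministic entity indexes. After: an insertion-ordered
    // list, with a set used only for de-duplication and never iterated.
    private readonly List<Entity> _charmedThisFrame = new List<Entity>();
    private readonly HashSet<Entity> _seen = new HashSet<Entity>();

    public void Add(Entity entity)
    {
        if (_seen.Add(entity))
            _charmedThisFrame.Add(entity); // preserves the order charms landed
    }

    public IReadOnlyList<Entity> CharmedInApplicationOrder => _charmedThisFrame;
}
```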

How we tracked it down:

Luckily the report on this one was really specific. I watched the replay and fortunately the divergence was captured, so I was able to validate the solution. Since I knew it had to do with charm code and charming multiple things, I knew where to look. I just looked through that code and noticed we were pulling the key-value array of a hashmap keyed by ECSEntities, which I knew weren’t deterministic.

How can we make this faster?

This one was pretty fast, I don’t think it could have gone much faster. I had all the information I needed thanks to our bug reporting and replay tools.

Swedish Mismatch - 10162023

Symptom: Received 3 or 4 user reports that mentioned checksum mismatches that occurred very shortly after beginning gameplay.

Root Cause: Players who had the Swedish language set as their system language would sort ECS types differently than those who did not.

Solution: Switched to a non-locale-based sort: wizgun-production: #5429 [#13145] Checksum Mismatch : String Sorting
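
[The underlying .NET gotcha: culture-aware string comparison depends on the system locale - traditional Swedish collation treats “V” and “W” as the same letter - while ordinal comparison is byte-for-byte identical everywhere. A tiny demonstration:]

```csharp
using System;
using System.Globalization;

class LocaleSortDemo
{
    static void Main()
    {
        string[] typeNames = { "WeaponComp", "VelocityComp" };

        // Locale-dependent: the result can differ between a sv-SE machine
        // and an en-US machine.
        Array.Sort(typeNames, (a, b) =>
            string.Compare(a, b, CultureInfo.CurrentCulture, CompareOptions.None));

        // Deterministic fix: ordinal comparison is the same on every machine.
        Array.Sort(typeNames, StringComparer.Ordinal);

        Console.WriteLine(string.Join(", ", typeNames));
    }
}
```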

How we tracked it down:

Folks started reporting this bug through Discord. We got a few of them to submit user reports from both the client and host side. We ran the replays a few times and noted that we would always get the same checksum as one side or the other. Soon, we realized that the system language in the user report metadata was set to Swedish for one player on all of the reports. We then switched the Windows system language to Swedish, re-ran some of the replays, and noted that the checksum received was exactly the other side’s.

Finally, we ran the replay twice with checksum traces on - once under Swedish and once under English - and noticed that the traces differed in the order and ID of two components.

How can we make this faster?

This ended up being about as fast as possible. Now that we know that this is an issue, we can keep an eye out for it in the future.

Prediction Divergence - 091123

Symptom: During a multiplayer playtest, players diverged. They both captured logs.

Root Cause: Component versions are incremented during prediction changes and not rolled back, which could cause an EntityQuery outside of prediction to detect a change when none had occurred. That can lead to simulation changes, and since prediction changes only happen on one client, those versions would only be incremented on that client, leading to a mismatch.

Solution: Roll back component versions when we roll back their data. wizgun-production: #5125 [#12493] Fix: Checksum Mismatch
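
[A minimal sketch of the fix’s shape. This invented chunk type is not our actual ArchetypeChunk:]

```csharp
// Illustrative only - not the real ArchetypeChunk.
public sealed class ChunkSketch
{
    public byte[] ComponentData = new byte[1024];
    public uint Version; // bumped on every write; change-filtered queries compare it

    public (byte[] Data, uint Version) Snapshot()
        => ((byte[])ComponentData.Clone(), Version);

    public void Rollback((byte[] Data, uint Version) snapshot)
    {
        snapshot.Data.CopyTo(ComponentData, 0);
        // The bug: restoring the data but not the version left non-prediction
        // queries seeing a "change" that only ever happened inside prediction.
        Version = snapshot.Version;
    }
}
```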

How we tracked it down:

I kept a running log while I was tracking this down. Hopefully not too much information:

  1. Made a Windows debug build so that I could load the replay in there - I was getting checksum mismatches when playing in the editor (will track down as a separate issue)
  2. Pulled traces from host and client around the mismatch, scrolled through to mismatch frame
  3. Noticed that the Number of Queued Actions, which refers to the number of commands queued to the entity command buffer, was different just before the mismatch. The client had two fewer actions queued.
  4. Looked through the queued actions, noticed that the two extra actions were setting the liquid component and removing the frozen component, both of which happen in the RemoveFrozenStatus method in FreezeHelper
  5. Ran the host replay with more logging to see what called RemoveFrozenStatus
  6. Found out it was CheckEntityFrozen in FreezeSystem, put some logs in to track the timer for client vs host replays
  7. Discovered that for a particular entity’s frozen status, the client and host had different ideas of how much time was remaining at time of mismatch.
  8. Added more logging and went back to when the entity was frozen to watch the entire process. Found out that the frozen countdowns are refilled by LiquidEvaporationSystem during AddLiquidJob - on the host it stops adding time at some point, whereas on the client it continues. Also noticed that the amount of water they were attempting to apply was inconsistent.
  9. More logging to find out what was contributing the triggers to the AddLiquidJob. Traced back to the SpreadGroundLiquidJob, soaking entities on water tiles. Added some logging to the code describing tile liquid amount and amount spread to entity.
  10. Discovered that at some point the client has one tile with water in it that continues adding water to the entity, whereas on the host a different tile starts applying it to the entity. More logging to figure out where the water was coming from and where it was going. Took a few rounds of new logs.
  11. Some logging around tile occupation revealed that it wasn’t that tiles had different amounts of water, rather that the divergence occurred around the entity occupying one tile on host and another on the client. Added logging around entity occupation.
  12. Confirmed, on host the entity moves over one tile where it does not in the client. More logging to try and figure out why.
  13. Noticed that at a certain frame, the GatherEntitiesToUpdate job in the EntitiesOccupyingSystem stops running for the entity at the same rate between the two replays. The host continues to check every frame; the client stops checking for a while. Eventually, a check that the host does but the client doesn’t leads to the entity switching tiles, which causes the divergence. Going to focus first on why the job runs more for one than the other, which means logging all the changes to position for that entity
  14. Noticed that I’m seeing a lot of positional updates for the frozen entity coming from prediction, which wouldn’t happen on the client since the client doesn’t predict that entity. Starting to suspect that changes made during prediction to a component can cause a FilterOnChanged query to trigger during sim. This… seems bad.
  15. Digging a bit further into why the occupying tile switched at that point if no position had actually changed. It looks like it changed because at that moment a tile bumped up in priority: it became unfrozen and became the new highest-priority tile. So we’re back to the root of the issue being that GatherEntitiesToUpdateJob, which is triggered by a change to PositionComp, ran on that particular frame for the host but not for the client.
  16. Cleared out a bunch of old logging, threw in a bunch of new ones trying to capture why a physics sync was being triggered for the entity in host but not client
  17. To confirm that this is a prediction issue, turned off prediction for host replay. Doing so causes a checksum divergence right after the client disconnects, and logging matches up with the client. I believe this is sufficient evidence to confirm that prediction changes are triggering the physics sync out of prediction.
  18. Ok. Sitting down and noodling on it, I’m pretty sure what happened is that both host and client are predicting forward, both are making changes to their respective entities, and those changes increment the version. The changes to those components are rolled back, but the version is not. The next time the query comes through (outside of prediction) it reads the component as changed. Since the players are in separate chunks, it only triggers the PhysicsSync buffer to run for their own entities, which can cause a divergence. In this case, the tile under the host player had lost its frozen status, and the host player picked that up and updated its occupied entity position, whereas the client player did not.
  19. I updated the SerializeComponents and DeserializeComponents in the ArchetypeChunk to store out not only the data, but the version as well. I ran into an issue immediately because I didn’t realize this was also used for other serializations, including loading the replay. Since as far as I can tell we only need this for prediction/rollback, I gated it behind a bool parameter.
  20. If this was the cause, replaying the replay should now result in the host player not switching to the new tile and should also diverge at the time of the mismatch. Unfortunately, neither happened.
  21. I spent a long time here throwing in a bunch of logs in various places trying to figure out what was incrementing the version and tracing back why rolling back the version wouldn’t work.
  22. At some point during this mad logging I happened to notice that in the physics world’s PredictionSystem, the predicting entity’s components were individually rolled back manually. It was not using the ArchetypeChunk’s rollback system. I was tempted to convert it to use the same thing, but that felt too destabilizing. I instead stored out the component versions and set them when the data was set.
  23. Played through the replay again - the host player did not move over to a new tile, and the checksum mismatched at the point of divergence.

How can we make this faster?

Only thing I can think of is to remember that if you have a replay divergence before the actual divergence, it might be worth the time to hunt that down first, even if it feels like a frustrating distraction. I could have saved time and made a lot fewer builds if I’d fixed the Editor/Replay divergence below first.

Other than that I can’t think of anything. It was a long dig, but I don’t know what would have made it faster. The logging I needed to find this was very specific. The checksum mismatch logs that led me to the beginning of this trail - specifically the ones around queued entity command buffer actions - were very good and probably saved me a bunch of time. Also, the update to replays to simulate prediction was absolutely necessary for this fix.

Editor/Build Replay Divergence - 091123

Symptom: While working on another mismatch, I was seeing mismatches in the editor before the actual mismatch, but not during a build replay.

Root Cause: Prediction was the root cause of the other divergence, and prediction wasn’t running in the physics world for the replay player because SetPredictingEntity wasn’t being called when the player was started.

Solution: Set the predicting entity when we start the replay. wizgun-production: #5125 [#12493] Fix: Checksum Mismatch

How we tracked it down:

As previously mentioned, I was tracking down a different mismatch when I came across this one. I was generally running my replays through builds because I knew the editor was mismatching early. I had a suspicion that the other checksum had to do with prediction so I already had some logging in some prediction areas. I was checking something in the editor and noticed that the prediction logging wasn’t showing. After some investigation I confirmed that physics prediction was only working after a world transition.

How can we make this faster?

This felt a bit one-off. I’m not sure there was anything else we could have done to make this easier to find; it honestly wasn’t that hard to dig out, I just had to realize what was happening.

Elemental effect expiry divergence - 07242023

Symptom: A one-off mismatch occurred while playing a multiplayer game.

Root cause: A botched transition to cached job scheduling left several EntityCommandBuffers un-updated in OnUpdate. This caused their queue indexes to always remain -1, breaking the sorting logic in CommandBuffer.ConvertBufferToSortedListJob.

Solution: wizgun-production: #4669 [#11703] Fix checksum mismatch when multiple effects expire (staging) [This solution is quite difficult to explain, and involves project-specific animation.]
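
[To make the stale -1 hazard concrete, here’s a hypothetical, much-simplified stand-in for the sort - none of these names match the real CommandBuffer code:]

```csharp
using System.Collections.Generic;

struct QueuedCommand
{
    public int QueueIndex; // which command buffer produced this command
    public int SortKey;    // order within that buffer
}

static class CommandSorting
{
    // Commands are sorted by (QueueIndex, SortKey) to get a deterministic
    // playback order. If several buffers are never updated and all report
    // QueueIndex == -1, their commands collapse into one group; any that then
    // tie on SortKey are left in thread-timing order, because List<T>.Sort is
    // not a stable sort.
    public static void Sort(List<QueuedCommand> commands)
    {
        commands.Sort((a, b) => a.QueueIndex != b.QueueIndex
            ? a.QueueIndex.CompareTo(b.QueueIndex)
            : a.SortKey.CompareTo(b.SortKey));
    }
}
```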

How we tracked it down:

I played the replay back focusing on the host and then the client, and noticed that it returned a different checksum for each. I surmised it was a prediction issue, and attempted to gather checksum traces following the same method as before. However, I was wrong, and repeated runs of that process inconsistently returned different checksums. I ran the replay a bunch of times simply following the host and noticed the world checksum took one value about 50% of the time, and another value the other 50%. At this point, it was clear that we had a good old thread timing issue on our hands.

From here, it was simply a matter of running the replay repeatedly (I took a snapshot on the frame just before the mismatch and used that) and adding logging using ChecksumBurstTracerHelper until I tracked down the root of the issue - a struct which was not getting properly updated.

How can we make this faster? This, honestly, felt like our previous investigations paying off. I was able to reuse a few things from previous checksum investigations: most notably, a replay repeater component and a bunch of extra checksum tracing around the EntityCommandBuffer that I left in. These allowed me to track down the issue pretty rapidly, in about half a day. Happy to report that I don’t think we could have made this faster!

Chaos rift spawn divergence - 04092023

Symptom: Fairly consistent checksum mismatches a few minutes into playing a multiplayer game.

Root cause: The SpawnChaosRiftsAroundPlayer system was added to the main world twice. It only took read dependencies, so it could run in parallel with itself. However, it also wrote to data within the system itself, leading to a race condition.

Additionally, the test which would have caught the issue under JobsDebugger was malfunctioning.

Solution: wizgun-production: #3761 10042:, 10037: Multiplayer checksum mismatch/duplicate rifts

[In lieu of my own explanation, I will paste the description from the pull request:]

This was a perfect storm of a few things:

  • SpawnRiftsAroundPlayerSystem was being added to the world twice - once in a system group, once directly
  • It depended only on readonly data, which means the dependency system allowed it to run in parallel, but it wrote to data within the system which was not part of the dependency system, introducing a race condition.
  • playteststate-all-fragments had “skip tutorial” unchecked, so the “all fragments” overworld test was not running the SpawnRiftsAroundPlayerSystem. This was the only test which would have run the system with jobs debugger on, which would have caught the problem

The end result was that sometimes, given the perfect timing, one CheckSpawnRiftChaosJob could be running at the same time as another one, which would spawn two chaos rifts. However, this timing was thread timing, which is not consistent across systems, leading to a checksum mismatch when the perfect conditions were met.
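
[A Unity-flavored sketch of the hazard, with invented names - not the shipped system. Every declared dependency is read-only, so the scheduler will happily run two scheduled instances of the job in parallel, but both instances write to state the dependency system doesn’t know to guard:]

```csharp
using Unity.Collections;
using Unity.Jobs;

struct CheckSpawnRiftJobSketch : IJob
{
    // The only *declared* dependency is read-only...
    [ReadOnly] public NativeArray<float> PlayerPositions;

    // ...but the job also writes here. Classic check-then-act race: with the
    // right thread timing, two parallel instances both read 0 and both spawn.
    public NativeReference<int> RiftsSpawnedThisFrame;

    public void Execute()
    {
        if (RiftsSpawnedThisFrame.Value == 0)
        {
            RiftsSpawnedThisFrame.Value = 1;
            // ...enqueue the rift spawn...
        }
    }
}
```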

How we tracked it down:

I tried a lot of things with the replay, and I saw it match the host’s checksum 99% of the time with both the client’s and the host’s replays. (I saw it diverge once, but was never able to reproduce it after that.)

After inspecting the replay, and after reproducing the issue locally, I noticed that the mismatch would occur immediately after an ambush rift spawned in the world. I did some basic analysis of the SpawnChaosRiftsAroundPlayer system, but could not find any issue.

Later, I remembered an issue that we’d encountered in the playtest where duplicate ambush rifts were spawning occasionally. I realized that this might be the issue, and started tracking it down. After realizing it was a thread timing issue, I reasoned that it must be the same problem.

How can we make this faster? I already fixed the test which would have caught the problem in the first place. I wonder whether we need to put more tests in place for other systems that run rarely (e.g. predicting systems).

Additionally, I wonder whether a dependency trace might be useful. I spent a while drilling down into the dependency stack before realizing that the system was only taking read dependencies, and thus would have been able to run in parallel with itself. I’ve done similar things before, and it’s time consuming.

Editor/Build Replay Divergence - 04032023

Symptom: Consistent, unexpected checksum mismatch trying to play back a replay in editor.

Root cause: We were processing objects in boss scenes in different orders in build vs editor.

Solution: wizgun-production: #3728 Fix replay mismatch after resetting world

(Specifically, see the changes to SceneEntityRegistrationSystem.cs)

[This pull request sorts entities that are processed from the scene by position before they are added to the ECS database.]

How we tracked it down:

While trying to track down a different mismatch, I played back the client replay and noticed that it diverged on the frame before the expected one, right after the player reset the world. I recalled that I’d seen this before, and knew that playing it back in a build would result in no checksum mismatch.

I took a trace with full trace dumping of the PositionComp in the main world before I realized it was the boss world that was mismatching. I then took a full trace dump of the PositionComp in the boss worlds. I noticed that the traces listed all the same positions in both the build and editor traces, just in a different order. I did a quick static analysis of the world generation used in the boss worlds, and noticed the call to get objects out of the scene. I then verified that the objects were obtained in different orders in build vs. editor.

Host/Client Replay Divergence - 03222023

Symptom: Consistent, unexpected checksum mismatch trying to play back a host’s replay, but no mismatch when playing back a client’s replay.

Root cause: PredictableEntityQuery was changed to only return the chunks of the local player when it had been returning all players’ chunks, under the assumption that prediction was only predicting the local player ahead. However, prediction was, in fact, predicting all players ahead, leading to a mismatch both in replay and in real play.

Solution: wizgun-production: #3682 9967: Prediction query checksum mismatch

(Specifically, see the changes to PredictableEntityQuery.cs)

[The following is the full description from the pull request:]

Turns out we had been predicting all players, not only the local player, this entire time because of a buggy PredictableEntityQuery. When I corrected the bug to actually do what it said in the comment, it broke rollback.

A more thorough explanation:

  • Every frame, we predict ahead all commands that we have, including those from other players.
  • The WorldPrediction class uses the PredictableEntityQuery to serialize all chunks that are predictable before prediction. Then, at the beginning of every frame, it writes those chunks back before running simulation.
  • PredictableEntityQuery was changed to only serialize chunks belonging to the local player. However, since all commands are handled while predicting ahead, and not just the local player’s, chunks were changed during prediction but not rolled back.
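
[The invariant that broke, sketched with invented names: the set of chunks prediction is allowed to mutate must be exactly the set that gets snapshotted and written back.]

```csharp
using System.Collections.Generic;

// Illustrative only - not the real WorldPrediction.
public sealed class PredictionRollbackSketch
{
    private readonly Dictionary<int, byte[]> _savedChunks = new Dictionary<int, byte[]>();

    // Called before predicting ahead. The fix amounts to making sure this runs
    // for every chunk prediction can touch (all players' chunks), not just the
    // local player's.
    public void Save(int chunkId, byte[] chunkBytes)
        => _savedChunks[chunkId] = (byte[])chunkBytes.Clone();

    // Called at the beginning of the next real frame, before simulation.
    public bool TryRestore(int chunkId, byte[] chunkBytes)
    {
        if (!_savedChunks.TryGetValue(chunkId, out var saved))
            return false; // the bug: remote players' chunks fell through here
        saved.CopyTo(chunkBytes, 0);
        return true;
    }
}
```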

How we tracked it down:

While trying to track down a different mismatch, I played back both the client and host replays, and noticed that the host replay diverged on frame before the expected one, while the client one did not. I ran checksum tracing on both (after correcting the code for it) on the diverging frame, and backtraced the diverging checksum from HandleForceJob down to PlayerInputSystem. Finally, it was a matter of deducing that we were processing the commands for all players, even though the PredictableEntityQuery in WorldPrediction was only rolling back the local player.

Note that this also fixed a divergence in real multiplayer, because the host/client players also play back all commands that are ready, including other players’, if available.

Resources/Addressable ID Divergence - 01172023

Symptom: Consistent checksum mismatch trying to play back a replay

Root cause: An asset loaded in using the Resources API referenced assets that were also referenced by Addressables. Whenever we attempted to look up the AssetHandles of those assets, the lookup would return invalid handles, since the assets loaded in from Resources did not match those from Addressables.

Solution: wizgun-production: #3381 A bunch of random fixes

(Specifically, see the changes to WorldGenSettings.cs)

[The fix in the pull request involves loading an asset through Unity’s Addressables API rather than the Resources API. This was tricky to find because we were using the Resources API only when running in the editor.]

How we tracked it down:

After a bunch of failed attempts to reproduce a case where the replay did not mismatch, I tried playing the replay in a build and noticed that there was no mismatch. From there, it was a matter of adding logs and repeatedly reproducing until I tracked down that the asset handles were being returned as zero.

How can we make this faster? I’ve implemented snapshotting and checksum tracing in the build replay scene, which should make similar cases faster to debug in the future. Additionally, I’m going to edit the documentation I was viewing earlier to include references to playing back replays in the build.

Static Obstacle Serialization Divergence - 07282022

Symptom: Client received a checksum mismatch shortly after joining a game.

Root cause: A bug in StaticPathfindingObstacleDatabase was causing objects to register themselves as static obstacles in a location, but not unregister themselves if they were destroyed on the same frame they were created:

  • An ambush chaos rift is spawned
  • EntitySpawnSystem enqueues the rift’s destruction to the EntityCommandBuffer because it’s marked as SpawnAndDestroyImmediately
  • StaticPathfindingObstacleSystem.UpdateOccupationJob registers the static obstacle and enqueues an addition of the IsRegisteredAsStaticObstacle component to the EntityCommandBuffer
  • The EntityCommandSystem runs.
    • It first processes the destruction, moving the rift to the destroying archetype.
    • Then, it processes the component addition. It doesn’t add the IsRegisteredAsStaticObstacle component because the rift is in a destroying archetype.
  • StaticPathfindingObstacleSystem runs its cleanup job, which doesn’t pick up the destroyed rift for cleanup because the IsRegisteredAsStaticObstacle component was never added.

Thus, the obstacle would never get unregistered, leaving behind some orphaned cells that are marked as static obstacles.

This manifested as a checksum mismatch because clients re-generate the static obstacle database upon join, and since the entity that marked those tiles as obstacles was now gone, those tiles would not be marked as obstacles on the client side. Thus, paths calculated in that area would differ on host and client.

Solution: wizgun-production: #2724 #7771: Release static obstacles when they are created/destroyed in one frame

[The pull request description recounts the same sequence of events and host/client consequences described above, then continues:]

I’ve solved it by simply getting rid of the IsRegisteredAsStaticObstacle component and moving its data to the StaticObstacleComp. This might have some performance impacts because UpdateOccupationJob will now run over all static object chunks all the time. However, this might actually get balanced out by the fact that we’re not moving objects into another archetype as soon as they’re spawned, meaning less archetype fragmentation.

How we tracked it down:

  • Received bug reports for both host and client.
  • Running the replays for both host and client, I noticed that neither reported checksum mismatches, which meant that the replay was accurately showing the world states that both had experienced (meaning that the replays were showing divergent states).
  • Ran with checksum tracing on both sides and drilled down with logs until I hit the AI movement system, which indicated that an entity was following different paths on each side.
  • I put logs into the Pathfinding System. However, I could no longer keep running from the snapshot I had taken, because the path was calculated some frames earlier. I backed up and started running from an earlier frame. The snapshot on the host’s side started checksum mismatching on the same frame that the client mismatched on.
  • At this point, I thought maybe it was an intermittent timing issue. I ran the replay using the replay repeater to try to see if I could get the replay to mismatch sometimes and match sometimes. I was unsuccessful after about 50 tries.
  • I backed up and started running from the start of the short replay and comparing the checksum traces from that against the snapshot’s checksum traces. I found that the PathfindingSystem was reporting an obstacle on the replay side, but not on the snapshot side.
  • Now, I was fairly sure it was a world serialization issue, as both the client and the snapshot were receiving the same checksums on the frame where the mismatch occurred, and both follow the same route of “serialize world, then run subsequent commands”.
  • In the host’s replay, I did a manual binary search through the frames to find the place where the obstacle was registered, and found that it was an ambush rift that registered an obstacle to some tiles when spawned, but did not deregister them after it was despawned.
  • Through analysis of the code and the checksum traces, I figured out that the order of commands buffered to the EntityCommandBuffer was causing the issue.

How can we make this faster? The log-drilling continues to be painful, as adding contexts to functions that are called from many places is supremely tedious. At some point, I’m going to invest more time into seeing if I can get a usable stacktrace on-demand in bursted code. In addition, I think I’d like to add some more checksumming for debugging through replays like this. For example, checksumming the PathfindingDatabase would have been useful here, even if it’s not trivial to do so.

3-Player Client Disconnect Divergence - 06282022

Symptom: Received bug reports from a host and two clients where one client disconnected with a checksum mismatch right as the other client disconnected with a timeout.

Root cause: There was a subtle bug in the client command queue logic which, when a client disconnected, could cause the host to think that the frame was ready when it was not. The following sequence of actions would occur:

  • A host receives commands from client 1, but not commands from client 2, for a frame that was not yet ready.
  • Client 1 disconnects for any reason.
  • The host disconnects client 1, and at the same time, incorrectly marks the frames for which client 1 sent commands as ready. (This is where the root-cause bug occurs; see the sketch after this list.)
  • The host simulates forward those frames without commands from client 2.
  • Client 2, at some point, receives the host’s commands and also simulates those frames forward. However, it has commands from both the host and itself, leading to a checksum mismatch.
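
[A rough sketch of the readiness invariant, with invented names - the real command queue is considerably more involved:]

```csharp
using System.Collections.Generic;
using System.Linq;

public sealed class FrameReadinessSketch
{
    private readonly HashSet<int> _connectedClients = new HashSet<int>();
    private readonly Dictionary<long, HashSet<int>> _received =
        new Dictionary<long, HashSet<int>>(); // frame -> clients with commands in

    public void OnClientConnected(int clientId) => _connectedClients.Add(clientId);

    public void OnCommandsReceived(long frame, int clientId)
    {
        if (!_received.TryGetValue(frame, out var clients))
            _received[frame] = clients = new HashSet<int>();
        clients.Add(clientId);
    }

    public void OnClientDisconnected(int clientId)
    {
        // Correct behavior: just forget the leaver and let IsFrameReady
        // re-evaluate. The bug lived in this step - frames the *leaver* had
        // sent commands for were marked ready outright, without re-checking
        // the remaining clients.
        _connectedClients.Remove(clientId);
    }

    // A frame is ready only when every still-connected client has commands in.
    public bool IsFrameReady(long frame)
        => _received.TryGetValue(frame, out var clients)
           && _connectedClients.All(clients.Contains);
}
```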

Solution: wizgun-production: #2583 7418: Fix client disconnect triggering a checksum mismatch

[This pull request contains some involved changes to the multiplayer cmd processing layer.]

How we tracked it down:

  • Received bug reports for the host and one of the clients.
  • Noted that the client’s replay mismatched on the frame in the logs (12450). (It’s still an open question for me as to why the replay mismatched - I would have expected the client’s replay to match the client’s view of the world.)
  • Noted that the host replay did not mismatch on frame 12450.
  • Ran the replay with checksum tracing on both sides. Started drilling down from HandleForceJob, where the traces differed.
  • Drilled down until I noticed that PlayerInputCmdSystem was running on the client side on the mismatching frame, whereas it was not on the host’s side.
  • Noticed that the client replay included a cmd that the host’s did not.
  • Given that the host replay was not mismatching during playback, I reasoned that we must be seeing the same world state during playback as the host did during real simulation, and thus, that the host had truly only simulated one command during real simulation.
    • Another theory that would have been easy, and less scary, to assume was that there was some problem writing the replay that caused the command to be omitted. However, as the replay playback emitted the same checksum that the host had seen, the evidence was in opposition to this theory.
  • I asked Nohm (QA tester, second client) to look for the replay on their side, as we couldn’t locate the bug report for the second client. We located it.
  • The replay showed that the second client was disconnecting on the frame just before the checksum mismatch occurred - 12449. The Player.log backed this up, with a timeout occurring somewhere near the bottom.
  • At this point, it was impossible to gather any more useful information from the replays, as the error had obviously occurred at a lower level - somewhere in the networking stack. It appeared to be the case that the host had thought that frame 12450 was ready when the second client disconnected, but before it had received the commands for that frame from the first client.
  • I did a thorough look over of some of the command queueing code, especially around client disconnects, and discovered the bug.

How can we make this faster?: I’d like to add more checksum tracing around triggers and commands to make the log-drilling steps faster. In addition, I’ve added the universe checksum to the Player.log so that we don’t have to run both the host and client replays to discover what each of their checksums were at the time of the mismatch.

The scarier thing about this checksum mismatch, in particular, is how much deduction and manual code analysis it required to solve. No matter how much work we do on the replay, it would never have been enough to find a problem in the networking stack like this. We should do some thinking on what we can do to mitigate problems in this realm.

Extra Worlds Replay Divergence - 06062022

Symptom: There were checksum mismatches when playing back any replays. I don’t really know if this counts, because it wasn’t really a checksum mismatch, but I guess it doesn’t hurt.

Root cause: The replay loads all additional scenes, but then calls bootstrap to do other setup, which tries to load the additional scenes again. That’s normally fine, but bootstrap wasn’t able to detect that the additional scenes were already loaded, so it loaded them again, which produced a different number of ECS worlds between replay and recording and caused a fuss.

Solution: wizgun-production: #2505 Fix for replays not working

[This pull request releases the Addressables scene handle after replays are bootstrapped.]

How we tracked it down:

  • Used checksum tracing to track down that the divergences were happening during scene entity registration
  • Discovered that there was something off with the serialized values in Scene Entity Registration; did a bunch of tests to confirm that what was being written in wasn’t the same as what was coming out
  • Finally realized that one more system was being serialized in than was being deserialized out, which led to the realization that there were duplicate scenes/worlds being loaded in

How can we make this faster?: Honestly it would have gone faster if I’d just noticed the extra scenes that were loaded in, or used the serialization tracing that already existed - I didn’t know it was there, so I ended up writing a bunch of my own. I can’t really think of anything additional here.

DropItem Divergence - 06022022

Symptom: Player received a checksum mismatch while the end of the world was occurring.

Root cause: SimulationChunkSystem was not sorting entities in the correct order. The fix from 01272022 was not actually correct.

Solution: wizgun-production: #2496 #7189: Fix sim chunk sorting checksum mismatch

[This pull request changes a sorting algorithm to fall back to array position if thread IDs are the same.]
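
[The shape of that fix, sketched with invented types: when the primary sort key can tie, fall back to something run-independent so the comparison defines a total order.]

```csharp
using System.Collections.Generic;

struct BufferedChange
{
    public int ThreadId;   // can collide, and varies between runs
    public int ArrayIndex; // stable: the position the change was recorded at
}

static class DeterministicChangeSort
{
    // Without the ArrayIndex fallback, entries with equal ThreadIds tie, and
    // an unstable sort leaves them in thread-timing order - a divergence.
    public static void Sort(List<BufferedChange> changes)
    {
        changes.Sort((a, b) => a.ThreadId != b.ThreadId
            ? a.ThreadId.CompareTo(b.ThreadId)
            : a.ArrayIndex.CompareTo(b.ArrayIndex));
    }
}
```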

How we tracked it down:

  • I tried to play back the replay with both host and client and prediction on, but no mismatch occurred.
  • Randomly noticed, on one of those playbacks, that the checksum produced on the mismatching frame did not match what I’d seen the replay produce previously. I deduced that it must be some kind of frame timing issue.
  • Wrote up a quick tool to run a replay many times in a row until a certain condition was met. Replays produced a mismatch against the replay’s checksum about 1 out of 10 times.
  • With a repro, I started checksum tracing.
    • Noticed a PositionComp mismatch after PhysicsSyncSystem. Started logging there. PhysicsSyncSystem’s operations all matched up.
    • I turned full tracing on to inspect the PositionComp. Noticed that two entities had swapped orders on either machine.
    • I took a look at the two entities through the entity viewer with the replay paused. I noticed they were both arcana that dropped at the same time. I added logs in arcana creation - no luck.
    • I added logging in EntityCommandSystem and in EntitiesData. Both entities were being created in the same order - also no luck there.
    • I deduced that the entities must have been getting put into a new archetype because of an added component. I added logging everywhere in EntitiesData where entities’ archetypes would get changed. Yep - SimulationChunkSystem adds a shared component to the arcana, and this was happening in different orders.

How can we make this faster?: I’ve already made the tool to run replays multiple times consecutively, and we should make this part of our standard checksum mismatch playbook. In addition, I kept the logging in EntitiesData around for future mismatches related to ECS structure.

Physics Replay Divergence - 04052022

Symptom: I tried to replay the replay from a crash log. I encountered a checksum mismatch in the replay at frame 22143, but only about 50% of the time.

Root cause: A section of code in physics was comparing entities by index, and assigning them to a struct based on that. Later, some arithmetic was executed on those entities’ velocity based on which struct field they were assigned to. Since entities’ indexes are non-deterministic, this led to a non-deterministic velocity.

Solution: wizgun-production: #2264 Fix physics checksum mismatch

[This PR removes a section of code in physics where entities were being reordered by index - a non-deterministic operation.]
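
[The anti-pattern in miniature, with invented names:]

```csharp
// Entity indexes are assigned in allocation order, which differs between
// host and client, so they must never influence simulation math.
struct EntityHandle
{
    public int Index;
}

struct Contact
{
    public EntityHandle First, Second;
}

static class ContactOrdering
{
    // Ordering the pair by runtime index bakes a non-deterministic value into
    // everything downstream: any velocity arithmetic that treats First and
    // Second asymmetrically now diverges between machines.
    public static Contact Make(EntityHandle a, EntityHandle b)
        => a.Index < b.Index
            ? new Contact { First = a, Second = b }
            : new Contact { First = b, Second = a };
}
```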

How we tracked it down:

  • Noticed that the replay was mismatching intermittently
  • Obtained a checksum trace for a replay run which mismatched, and one for a run which did not
  • Compared the logs to narrow down the area to look at
  • Added new logs
  • Repeated this process until I found that entity contacts sometimes had entity1 and entity2 data switched
  • Deduced that the code at CollisionUtil.cs:50 was causing the mismatch, since we knew that entity indexes are nondeterministic

How can we make this faster?: We should create a tool for running a replay multiple times, grabbing the checksum trace, and detecting if a mismatch occurs non-deterministically between multiple runs of a replay.

Snapshot vs replay divergence - 03302022

Symptom: I tried to replay the replay from a crash log. I encountered a checksum mismatch in the replay at frame 20439. I grabbed the snapshot from frame 20000 for easier repro of the mismatch. When I played that back, I was receiving mismatches at frame 20012.

Root cause: StaticObstacleDatabase was not cleared on world reset

Solution: wizgun-production: #2253 Clear static obstacle database on world reset

How we tracked it down:

  • Added runtime flag for enabling checksum trace
  • Played the replay up to frame 20000, grabbed a snapshot.
  • Repeatedly played the replay and the snapshot from frame 20000 to 20012 with checksum tracing on and with additional logging each time
  • Narrowed it down from the physics system to the pathfinding system. Found that the pathfinding system was considering a tile unoccupied in the snapshot, but occupied in the replay.

How can we make this faster?: We can create a script to play back a replay with checksum tracing on from a specific frame. Unfortunately, we still need to add logging each run to try to narrow down the problem. We should, for each checksum mismatch we encounter, consider adding additional targeted logging to the checksum trace to make it faster to narrow down the issue.

Build/Editor Replay Divergence - 03032022

Symptom: When playing back a replay, we would reliably hit a checksum mismatch at a certain frame.

Root cause: AssetHandle IDs were different in editor vs. build. The Addressables package had been updated, changing the behavior of the ordering of the assets we received from one of their APIs.

Solution: wizgun-production: #2162 Make asset IDs consistent between editor and build

[The PR orders our asset IDs based on the asset path.]
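
[A sketch of the fix pattern - an invented function, not the actual PR:]

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class AssetIdAssignment
{
    // Derive IDs from the asset paths themselves, sorted ordinally, instead of
    // from whatever enumeration order the Addressables API happens to return
    // (which changed when the package was updated).
    public static Dictionary<string, int> BuildAssetIds(IEnumerable<string> assetPaths)
    {
        var ids = new Dictionary<string, int>();
        int nextId = 0;
        foreach (var path in assetPaths.OrderBy(p => p, StringComparer.Ordinal))
            ids[path] = nextId++;
        return ids;
    }
}
```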

How we tracked it down:

  • Tried running the replay in a build vs. the editor.
  • Got lucky and noticed that, on the frame at which the mismatch occurred, a Lil Mechana was starting a heal action in editor, but was shooting in build.
  • From there, it was a matter of tracking down why a heal action would be occurring, and figuring out what had gone wrong to cause the AI state machine to think a shoot action was a heal action.

How we can make this faster: We now know that this is a class of mismatch that can occur, and have added an asset ID dump. We should consider dumping these IDs in all automation we build in the future.

Upon first receipt of any checksum mismatch, we should consider running the replay in both editor and build, and noting any mismatch as a datapoint.

World Transition Multiplayer Divergence - 01272022

Symptom: Whenever anyone would transition worlds in multiplayer, a checksum mismatch would occur.

Root cause: SimulationChunkSystem was making modifications to ECS structure in a non-deterministic order.

Solution: wizgun-production: #2044 Fix checksum mismatches when transitioning worlds

[The description of the PR is the following:]

A few issues are fixed here, but the one that was causing the mismatch was, for sure, the last one.

SimulationChunkSystem was missing a bunch of initialization and cleanup stuff. Not sure how I managed to miss that stuff. The biggest one was that entityToChunkMap wasn’t getting serialized to clients.

SimulationChunkSystem was also using the entity command buffer even though I explicitly said that SuperLateSimulationSystems aren’t allowed to use it. I’ve changed it to use a transaction instead. Should be fine because it needs to run after everything else, anyway.

Finally, the iteration order of entities within a couple of jobs in SimulationChunkSystem was non-deterministic. This is important because changes to ECS structure must be made in a deterministic order. The most vexing problem was in GatherActivatorsJob, where UnsafeNativeHashSet.GetKeyValueArrays did not guarantee any ordering of the results. I’ve changed the set to a list, with a hefty note about performance.

How we tracked it down: Repeatedly ran the repro in multiplayer in a test scene with checksum traces on and logging.

How can we make this faster?: Automating a multiplayer session will probably prove to be pretty tricky, and the repro was fairly quick in this instance. Not sure exactly how we can improve on this.

Keyframe Serialization Divergence - 10292021

Symptom: Replays were diverging before they reached the end where the divergence should have occurred.

Root Cause: The struct KeyFrame is different in the editor and in the build - the struct has one fewer field in the build than in the editor. This threw serialization off, causing a divergence.

Solution: Remove keyframe from being serialized into components. wizgun-production: #1797 [#5482] Extract Distance Curve

[The PR modifies a component in such a way that it avoids serializing the Unity Keyframe object, which has some editor-dependent fields.]
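
[The avoidance pattern looks roughly like this - a sketch, not the actual PR, which extracts a distance curve: copy only the fields you need out of UnityEngine.Keyframe into a plain struct whose layout you own.]

```csharp
using UnityEngine;

// Never serialize UnityEngine.Keyframe directly - per the postmortem above,
// its layout can differ between editor and build.
[System.Serializable]
public struct StableKeyframe
{
    public float Time;
    public float Value;
    public float InTangent;
    public float OutTangent;

    public static StableKeyframe From(Keyframe k) => new StableKeyframe
    {
        Time = k.time,
        Value = k.value,
        InTangent = k.inTangent,
        OutTangent = k.outTangent,
    };
}
```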

How we tracked it down: I don’t remember a lot, just that we eventually figured out that a component was deserializing improperly, and when we logged info on one of its fields (a curve, which contained KeyFrames) we noticed that KeyFrame had one fewer field in the build than in the editor.

How we can make this faster: I’m honestly not sure with this one. It didn’t even occur to me that a Unity struct would be a different size in the build than in the editor, so it caught me by surprise.

Predict-ahead Replay Divergence - 10052021

Symptom: Replays captured in multiplayer after a save/load would instantly checksum mismatch.

Root cause: Prediction was running for the first ten frames of a saved game when hosting, but those frames were not being rolled back.

Solution: wizgun-production: #1688 Fix checksum mismatches with replays post-multiplayer-load

[This PR removes both prediction and rollback when the host’s simulation frame is a negative value.]

How we tracked it down: Added a bunch of logging to the checksum trace for prediction and ran the repro a bunch of times.

How we can make this faster: Hard to say, since the repro was quick.

Liquid Trigger Dettest Divergence - 0512021

Symptom: The determinism test was failing about 50% of the time with a divergence.

Root cause: Liquid triggers were getting sent out in a non-deterministic order, leading to floats getting added in a different order on different runs.

Solution: wizgun-production: #1195 Fix for the break in determinism in EntityOccupyingSystem

[The PR changes a job from running in parallel with itself to a job that runs serially, avoiding the need to use a parallel container.]
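
[The underlying numerical fact, in miniature: floating-point addition isn’t associative, so the same trigger amounts summed in a different order can produce a different total.]

```csharp
using System;

class FloatOrderDemo
{
    static void Main()
    {
        float small = 0.1f, big = 1e8f, negBig = -1e8f;

        float leftToRight = (small + big) + negBig; // small is absorbed: 0
        float regrouped   = small + (big + negBig); // big terms cancel first: 0.1

        Console.WriteLine(leftToRight == regrouped); // False
    }
}
```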

How we tracked it down: Ran the determinism test a bunch of times with extensive logging to backtrace from the point of mismatch all the way down to the point where the triggers were being queued non-deterministically.

How we can make this faster: Automated determinism test runs with checksum tracing on. Additional targeted logging.