fix(main): keep node spin resilient to unrecognised message types #133
Open
dakejahl wants to merge 1 commit into
Unrecognised data type IDs (e.g. while prototyping new DSDL that isn't built into the tool) caused two distinct failures. First, they tripped the 1000-strike successive-error guard and terminated the local node after a few seconds, logging a full traceback on every 10 ms spin. Second, and more subtly: Node.spin() drains the RX queue and then runs its scheduler, but when Transfer.from_frames() raises on an undecodable transfer the drain aborts and the scheduler poll is skipped.

The node monitor's liveness is scheduler-driven (periodic stale-sweep and outstanding-request timeouts), so at high message rates the scheduler is starved: nodes flap in and out of the monitor and the UI stalls.

Wrap Node._recv_frame so an undecodable transfer is dropped in place, letting spin() finish draining the queue and schedule normally. Keep a backstop catch in the spin loop that no longer counts these towards the fatal threshold, and throttle the logging so a high-rate unknown message cannot flood the log.
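The control flow described above can be modelled with a toy stand-in. The sketch below is not the library code: `ToyNode`, `UndecodableTransfer`, `KNOWN_TYPE_IDS` and the frame representation (bare integers standing in for data type IDs) are all hypothetical. It only illustrates why an exception raised inside the drain skips the scheduler poll, and how catching it per frame lets the drain finish.

```python
from collections import deque

KNOWN_TYPE_IDS = {341, 1010}  # hypothetical set of decodable data type IDs


class UndecodableTransfer(Exception):
    """Stand-in for the error raised on an unrecognised data type ID."""


class ToyNode:
    def __init__(self, frames, tolerant):
        self.rx_queue = deque(frames)
        self.tolerant = tolerant
        self.decoded = []
        self.dropped = 0
        self.scheduler_polls = 0

    def _decode(self, type_id):
        if type_id not in KNOWN_TYPE_IDS:
            raise UndecodableTransfer(f"unknown data type ID {type_id}")
        self.decoded.append(type_id)

    def _recv_frame(self, type_id):
        if not self.tolerant:
            self._decode(type_id)  # any error propagates into spin()
            return
        try:
            self._decode(type_id)
        except UndecodableTransfer:
            self.dropped += 1  # drop in place and keep draining

    def spin(self):
        while self.rx_queue:
            self._recv_frame(self.rx_queue.popleft())
        self.scheduler_polls += 1  # only reached if the drain completes


frames = [341, 20000, 1010]  # 20000 is the unrecognised ID

fragile = ToyNode(frames, tolerant=False)
try:
    fragile.spin()
except UndecodableTransfer:
    pass  # mimics the outer spin loop catching the error
print(fragile.decoded, len(fragile.rx_queue), fragile.scheduler_polls)

robust = ToyNode(frames, tolerant=True)
robust.spin()
print(robust.decoded, robust.dropped, robust.scheduler_polls)
```

In the fragile variant the frame queued behind the unknown one is left unprocessed and the scheduler is never polled on that spin. The tolerant variant drains all three frames and polls once.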
Summary
The GUI tool terminates a few seconds after it starts receiving a DroneCAN message whose data type ID isn't in the loaded DSDL set, and in the lead-up the node list flaps and the UI stalls. That makes it impossible to prototype a new message on a live bus without first building its DSDL into the tool. This change makes the local node tolerate unrecognised transfers instead of falling over on them.
Problem
An unrecognised data type ID makes Transfer.from_frames() raise TransferError, and two things go wrong. First, _spin_node counts every such error toward the 1000-strike successive-error guard (and logs a full traceback on each 10 ms spin), so a continuously broadcast unknown message terminates the node within seconds. Second, and more subtly, Node.spin() drains the RX queue and then runs its scheduler, but the exception aborts the drain and skips the scheduler poll. Node-monitor liveness is scheduler-driven (the periodic stale-sweep and the outstanding-request timeouts), so at high message rates the scheduler is starved: nodes flap in and out of the monitor and the UI stalls.
Solution
Wrap Node._recv_frame so an undecodable transfer is reported and dropped in place, letting spin() finish draining the queue and run its scheduler exactly as it does for clean traffic. The catch in _spin_node is kept as a backstop but no longer counts these benign per-transfer errors toward the fatal threshold. Logging is throttled to once per 10 s per distinct error with a suppressed count, so a high-rate unknown message can't flood the log. Raw frames remain visible in the bus monitor, so unknown traffic is still observable while prototyping.
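The throttling behaviour can be sketched as follows. This is a minimal illustration, not the code in this PR: the `ThrottledReporter` class, its message format and the choice to key "distinct error" on the message string are all assumptions. The clock and the output sink are injectable so the example runs without waiting 10 s.

```python
import time


class ThrottledReporter:
    """Report each distinct message at most once per `interval` seconds."""

    def __init__(self, interval=10.0, clock=time.monotonic, emit=print):
        self.interval = interval
        self.clock = clock
        self.emit = emit
        self._state = {}  # message -> [time of last report, suppressed count]

    def report(self, message):
        now = self.clock()
        entry = self._state.get(message)
        if entry is not None and now - entry[0] < self.interval:
            entry[1] += 1  # inside the quiet window: count it, stay silent
            return False
        suppressed = entry[1] if entry is not None else 0
        suffix = f" ({suppressed} similar suppressed)" if suppressed else ""
        self.emit(message + suffix)
        self._state[message] = [now, 0]
        return True


# Simulate an unknown message arriving every 10 ms using a fake clock.
fake_now = [0.0]
lines = []
reporter = ThrottledReporter(
    interval=10.0, clock=lambda: fake_now[0], emit=lines.append
)
for i in range(1000):
    fake_now[0] = i * 0.01
    reporter.report("unknown data type ID 20000")

fake_now[0] = 10.0  # the quiet window has now elapsed
reporter.report("unknown data type ID 20000")
print(lines)
```

With these inputs only two lines are emitted: the first occurrence, and one more after the window elapses that carries the count of the 999 reports suppressed in between.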