fix(main): keep node spin resilient to unrecognised message types #133

Open
dakejahl wants to merge 1 commit into dronecan:master from dakejahl:fix/graceful-unknown-message-handling

@dakejahl dakejahl commented Jun 9, 2026

Summary

The GUI tool terminates a few seconds after it starts receiving a DroneCAN message whose data type ID isn't in the loaded DSDL set, and in the lead-up the node list flaps and the UI stalls. That makes it impossible to prototype a new message on a live bus without first building its DSDL into the tool. This PR makes the local node tolerate unrecognised transfers instead of falling over on them.

Problem

An unrecognised data type ID makes Transfer.from_frames() raise TransferError, and two things go wrong. First, _spin_node counts every such error toward the 1000-strike successive-error guard (and logs a full traceback on each 10 ms spin), so a continuously broadcast unknown message terminates the node within seconds. Second, and more subtly, Node.spin() drains the RX queue and then runs its scheduler, but the exception aborts the drain and skips the scheduler poll. Node-monitor liveness is scheduler-driven — the periodic stale-sweep and the outstanding-request timeouts — so at high message rates the scheduler is starved: nodes flap in and out of the monitor and the UI stalls.
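The starvation is easy to reproduce in isolation. The following is a toy model, not the library code: `ToyNode`, its fields, and the type IDs 341 and 20000 are stand-ins invented for illustration, and the only thing it borrows from the description above is the shape of `spin()` (drain the RX queue, then poll the scheduler).

```python
class TransferError(Exception):
    """Stand-in for the library's transfer decoding error (toy definition)."""


class ToyNode:
    """Hypothetical model of a node whose spin() drains RX, then polls a scheduler."""

    def __init__(self, known_type_ids):
        self.known_type_ids = set(known_type_ids)
        self.rx_queue = []
        self.handled = []
        self.scheduler_polls = 0

    def _recv_frame(self, type_id):
        # An unknown data type ID raises, as the undecodable transfer does above.
        if type_id not in self.known_type_ids:
            raise TransferError("Unknown data type ID: %d" % type_id)
        self.handled.append(type_id)

    def spin(self):
        # Step 1: drain the RX queue. An exception here aborts the drain...
        while self.rx_queue:
            self._recv_frame(self.rx_queue.pop(0))
        # Step 2: ...so this scheduler poll is never reached on that spin.
        self.scheduler_polls += 1


def run_spins(node, traffic_per_spin, spins):
    """Mimic an outer spin loop that catches the error and counts it."""
    errors = 0
    for _ in range(spins):
        node.rx_queue.extend(traffic_per_spin)
        try:
            node.spin()
        except TransferError:
            errors += 1
    return errors


node = ToyNode(known_type_ids={341})
# Every spin sees one known message followed by one unknown message.
errors = run_spins(node, traffic_per_spin=[341, 20000], spins=5)
print(errors, node.scheduler_polls, len(node.handled))
```

In this model every spin ends in an error and the scheduler is never polled, even though the known messages are still handled. That is the same pattern as the flapping node list: traffic keeps arriving, but nothing timer-driven runs.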

Solution

Wrap Node._recv_frame so an undecodable transfer is reported and dropped in place, letting spin() finish draining the queue and run its scheduler exactly as it does for clean traffic. The catch in _spin_node is kept as a backstop but no longer counts these benign per-transfer errors toward the fatal threshold. Logging is throttled to once per 10 s per distinct error with a suppressed count, so a high-rate unknown message can't flood the log. Raw frames remain visible in the bus monitor, so unknown traffic is still observable while prototyping.
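As a rough sketch of the two pieces described above (drop in place, and throttle the log), assuming nothing about the actual diff: `ThrottledReporter`, `make_tolerant`, and `strict_recv` are hypothetical names, `TransferError` is a local stand-in, and the 10 s interval is the only value taken from this description.

```python
import logging
import time

logger = logging.getLogger("toy_node")


class TransferError(Exception):
    """Stand-in for the library's transfer decoding error (toy definition)."""


class ThrottledReporter:
    """Emit each distinct message at most once per interval, counting the rest."""

    def __init__(self, interval_s=10.0, clock=time.monotonic, emit=logger.warning):
        self._interval_s = interval_s
        self._clock = clock
        self._emit = emit
        self._state = {}  # message -> [time of last emit, suppressed count]

    def report(self, message):
        now = self._clock()
        entry = self._state.get(message)
        if entry is not None and now - entry[0] < self._interval_s:
            entry[1] += 1  # still inside the window: count it, stay quiet
            return False
        suppressed = entry[1] if entry is not None else 0
        self._state[message] = [now, 0]
        if suppressed:
            self._emit("%s (%d similar suppressed)", message, suppressed)
        else:
            self._emit("%s", message)
        return True


def make_tolerant(recv_frame, reporter):
    """Wrap a frame handler so a TransferError is reported and dropped in place."""

    def tolerant_recv_frame(frame):
        try:
            recv_frame(frame)
        except TransferError as ex:
            reporter.report(str(ex))

    return tolerant_recv_frame


def strict_recv(type_id):
    # Toy handler: only type ID 341 is "known".
    if type_id != 341:
        raise TransferError("Unknown data type ID: %d" % type_id)


# Demo with a fake clock so the throttling is deterministic.
emitted = []
fake_now = [0.0]
reporter = ThrottledReporter(
    interval_s=10.0,
    clock=lambda: fake_now[0],
    emit=lambda fmt, *args: emitted.append(fmt % args),
)
tolerant_recv = make_tolerant(strict_recv, reporter)
for t in (0.0, 0.01, 0.02, 10.5):
    fake_now[0] = t
    tolerant_recv(20000)  # never raises; the caller's drain loop keeps going
print(emitted)
```

Because the wrapper swallows the error at the per-frame level, a surrounding drain loop runs to completion and whatever follows it (here, the scheduler poll) still executes. Keying the throttle on the message text is one way to get "per distinct error"; the real change may key on something else.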

Unrecognised data type IDs (e.g. while prototyping new DSDL that isn't
built into the tool) caused two distinct failures.

First, they tripped the 1000-strike successive-error guard and terminated
the local node after a few seconds, logging a full traceback on every
10 ms spin.

Second, and more subtly: Node.spin() drains the RX queue and then runs
its scheduler, but when Transfer.from_frames() raises on an undecodable
transfer the drain aborts and the scheduler poll is skipped. The node
monitor's liveness is scheduler-driven (periodic stale-sweep and
outstanding-request timeouts), so at high message rates the scheduler is
starved -- nodes flap in and out of the monitor and the UI stalls.

Wrap Node._recv_frame so an undecodable transfer is dropped in place,
letting spin() finish draining the queue and schedule normally. Keep a
backstop catch in the spin loop that no longer counts these towards the
fatal threshold, and throttle the logging so a high-rate unknown message
cannot flood the log.