Fix/1.9.6 prod observations#168
Conversation
| if client is None: | ||
| from channelfinder import ChannelFinderClient | ||
|
|
||
| client = PyCFClientAdapter( |
There was a problem hiding this comment.
I think you can just move this into main? Then you don't need the client is None guard.
There was a problem hiding this comment.
Nah, we use it with tests. So then we'd need heavier monkey patching - this seems easier IMO.
There was a problem hiding this comment.
I thought it would make less monkey patching...
f089b4e to
46712c1
Compare
recvDone called isDone(active=True) to free the connection slot, but never cleared self.active. connectionLost then called isDone(active=True) again, causing a second decrement or waiter promotion per completed upload. After N uploads NActive drifted to -N, maxActive throttling became permanently disabled. Fix: clear self.active in recvDone so connectionLost passes active=False. Guard isDone against Wait.remove on a proto that is no longer waiting.
When connection accounting is corrupted (NActive < 0), log a warning and report zero rather than the raw negative value. Prevents alerting rules like 'connections_active > connections_limit' from silently never firing when the throttle has been bypassed.
Per-IOC locks let up to maxActive commits land in parallel. The cleanOnStart sweep queried CF for active channels, then bulk-wrote Inactive over all of them — racing against commits that had already activated channels in the window between query and write. Restores a single global DeferredLock to serialise all CF writes. _ioc_channels (per-IOC channel set) is retained: without it a disconnect extends records_to_delete with all known channels rather than just the departing IOC's own.
Provides a safe manual alternative to cleanOnStart for sites that disable automatic sweeping. Marks all Active channels for a given recceiver_id Inactive. Supports --dry-run to preview the scope. Usage: recceiver-clean -f recceiver.conf [--recceiver-id ID] [--dry-run]
…channel_is_old If the IOC that last owned a channel has departed between the state update and the CF push, look it up with .get() and fall back to _orphan_channel rather than raising KeyError and silently dropping the channel from the write batch. Same guard applied to the alias path in the same function.
The commit path updates self.iocs and channel_ioc_ids before the CF push. If the push exhausts push_max_retries (_push_to_cf returns False), in-memory state says the IOC is committed but CF was never written. The divergence persists until the IOC reconnects. On retry exhaustion for a connected transaction, evict the IOC from all in-memory tracking structures. The next commit from that IOC is treated as an initial upload and re-registers all channels in CF.
46712c1 to
f6a00db
Compare
|
That is interesting. What happened with these issues when you disabled cleanOnStop and cleanOnStart? |
| while True: | ||
| channels = client.find_active_for_recceiver(cf_config.recceiver_id) | ||
| if not channels: | ||
| break | ||
| log.info( | ||
| "Found %d active channels for recceiver_id=%r%s", | ||
| len(channels), | ||
| cf_config.recceiver_id, | ||
| " (dry-run)" if dry_run else "", | ||
| ) | ||
| if not dry_run: | ||
| client.update_property( | ||
| CFProperty(CFPropertyName.PV_STATUS.value, cf_config.username, PVStatus.INACTIVE.value), | ||
| [ch.name for ch in channels], | ||
| ) | ||
| total += len(channels) | ||
| if dry_run: | ||
| break |
There was a problem hiding this comment.
The dry run option + loop here seems very weird. I guess the point is that more channels could come active? Or is this a pagination thing?
| if client is None: | ||
| from channelfinder import ChannelFinderClient | ||
|
|
||
| client = PyCFClientAdapter( | ||
| ChannelFinderClient( | ||
| BaseURL=cf_config.base_url, | ||
| username=cf_config.cf_username, | ||
| password=cf_config.cf_password, | ||
| verify_ssl=cf_config.verify_ssl, | ||
| ), | ||
| size_limit=int(cf_config.cf_query_limit), | ||
| ) |
There was a problem hiding this comment.
Do we really need to have a None argument to test this? I get that you are allowing the tests to pass a mocked object, but can't you just mock the creation of the object in the test without passing it as an argument?
| if last_ioc_info is None: | ||
| log.warning( | ||
| "Last IOC %s for channel %s not in local state; orphaning channel", | ||
| last_ioc_id, | ||
| cf_channel.name, | ||
| ) | ||
| self._orphan_channel(cf_channel, ioc_info, channels, record_info_by_name) | ||
| return |
There was a problem hiding this comment.
I wonder if this can be moved up to the calling function. This is a bit of a complicated jump-around of code, and since this block is a quick-exit that essentially jumps to the branch not chosen from the calling function, it seems like we could clean this up and make it a bit clearer.



This MR fixes a bunch of issues seen in production at ESS when deploying recCeiver 1.9.6 (and CF 5.1.0) with cleanOnStart active and freshly wiped DB.
known_iocsca 100 during incident; after manual restart climbed to expected level then crashedFrom prometheus exporter:
Also in logs:
Note that this MR adds a recceiver-clean utility. This is because we at ESS have decided to not use cleanOn* anymore - this is not, and never was, recCeiver's scope. It was a band-aid which we do not want to attempt fixing any further. We will instead use the utility as needed, and try to integrate better mechanisms in a future CF version or any potential CF replacement.