Release branch v2.0.0#266

Open
cuonglm wants to merge 141 commits into main from release-branch-v2.0.0
Conversation

@cuonglm cuonglm commented Oct 9, 2025

Collaborator

Major Release

This release contains new features, improvements and bug fixes.

Added

  • Rule Matching Engine: Implemented new modular rule matching engine infrastructure with support for configurable rule evaluation order (infrastructure ready, not yet exposed to users)
  • Modular CLI Architecture: Split monolithic CLI command structure into focused, maintainable command files
  • Context Support: Added context.Context support throughout configuration methods for better cancellation and timeout handling

Improvements

  • Logging System: Migrated from zerolog to uber zap logging package with improved performance, better structured logging, and enhanced extensibility
  • CLI Architecture: Refactored monolithic commands.go (1,397 lines) into 13 focused command files, improving maintainability and testability
  • DNS Proxy: Major refactoring of DNS proxy implementation with better code organization, improved error handling, and enhanced separation of concerns
  • Client Information System: Enhanced client discovery and information management with improved DHCP, mDNS, ARP, NDP, and hosts file parsing
  • Service Management: Improved service lifecycle management, enhanced reload functionality, and better status reporting across all platforms
  • Network/OS Abstractions: Improved network and OS abstraction layers with reduced code duplication and more consistent behavior across platforms
  • Configuration System: Added context support and improved logging throughout configuration initialization and bootstrap operations
  • Code Organization: Removed ~3,000 lines of deprecated router-specific code, focusing development on core DNS proxy functionality

Fixes

  • Improved error handling and recovery mechanisms throughout the codebase
  • Enhanced service state management and lifecycle handling

Breaking Changes

⚠️ Server and Router-Specific Integrations Removed: All server-platform and router-specific integration code has been removed, including support for:

  • DD-WRT, dnsmasq, EdgeOS, Firewalla, AsusWRT-Merlin, Netgear/Orbi/Voxel, OpenWRT, Synology, Tomato, and Ubiquiti UniFi OS routers
  • Windows Server

If you were using ctrld with any of these router platforms, you will need to use alternative deployment methods. See the migration guide for details.

Note: All other functionality remains backward compatible. Existing configuration files and CLI commands continue to work without changes.

@cuonglm cuonglm force-pushed the release-branch-v2.0.0 branch 2 times, most recently from f38c9ae to 90eddb8 Compare October 9, 2025 13:51
@cuonglm cuonglm force-pushed the release-branch-v2.0.0 branch 2 times, most recently from 1c74fc4 to f9d0263 Compare November 12, 2025 08:21
@cuonglm cuonglm changed the title [WIP] Release branch v2.0.0 Release branch v2.0.0 Nov 12, 2025
@cuonglm cuonglm force-pushed the release-branch-v2.0.0 branch 4 times, most recently from 0d4b697 to d0e66b8 Compare December 17, 2025 08:28
@cuonglm cuonglm force-pushed the release-branch-v2.0.0 branch 4 times, most recently from de415df to 1fbbb14 Compare March 10, 2026 10:42
cuonglm added 17 commits April 30, 2026 19:19
This commit reverts the changes from v1.4.5 to v1.4.7, to prepare for the
v2.0.0 branch code.

The changes in these releases are already included in the v2.0.0
branch.

Details:

Revert "feat: add --rfc1918 flag for explicit LAN client support"

This reverts commit 0e3f764.

Revert "Upgrade quic-go to v0.54.0"

This reverts commit e52402e.

Revert "docs: add known issues documentation for Darwin 15.5 upgrade issue"

This reverts commit 2133f31.

Revert "start mobile library with provision id and custom hostname."

This reverts commit a198a5c.

Revert "Add OPNsense new lease file"

This reverts commit 7af29cf.

Revert ".github/workflows: bump go version to 1.24.x"

This reverts commit ce1a165.

Revert "fix: ensure upstream health checks can handle large DNS responses"

This reverts commit fd48e6d.

Revert "refactor(prog): move network monitoring outside listener loop"

This reverts commit d71d134.

Revert "fix: correct Windows API constants to fix domain join detection"

This reverts commit 21855df.

Revert "refactor: move network monitoring to separate goroutine"

This reverts commit 66e2d3a.

Revert "refactor: extract empty string filtering to reusable function"

This reverts commit 36a7423.

Revert "cmd/cli: ignore empty positional argument for start command"

This reverts commit e616091.

Revert "Avoiding Windows runners file locking issue"

This reverts commit 0948161.

Revert "refactor: split selfUpgradeCheck into version check and upgrade execution"

This reverts commit ce29b5d.

Revert "internal/router: support Ubios 4.3+"

This reverts commit de24fa2.

Revert "internal/router: support Merlin Guest Network Pro VLAN"

This reverts commit 6663925.
So that setting up logging for the ctrld binary and ctrld packages can be
done more easily, decouple the setup required for interactive versus daemon
runs.

This is the first step toward replacing the rs/zerolog library with a
different logging library.
By adding a logger field to the "prog" struct, and using this field inside
its methods instead of always accessing the global mainLog variable. This at
least ensures more consistent usage of the logger during ctrld prog
runtime, and also helps refactor the code more easily in the future
(like replacing the logger library).
Make nameserver resolution functions more consistent and accessible:
- Rename currentNameserversFromResolvconf to CurrentNameserversFromResolvconf
- Move function to public API for better reusability
- Update all internal references to use the new public API
- Add comprehensive godoc comments for nameserver functions
- Improve code organization by centralizing DNS resolution logic

This change makes the nameserver resolution functionality more maintainable
and easier to use across different parts of the codebase.
- Add timeouts and proper cleanup in Test_osResolver_Singleflight:
  * Implement context timeout
  * Add proper PacketConn cleanup
  * Fix race conditions in error handling
  * Improve atomic value reporting

- Enhance Test_osResolver_HotCache:
  * Add proper timeout context
  * Implement more reliable cache verification
  * Fix potential resource leaks
  * Add deterministic polling intervals

- Add thread safety to Test_Edns0_CacheReply:
  * Implement proper timeout context
  * Add proper resource cleanup
  * Fix concurrent operations handling

The changes improve overall test suite reliability by addressing resource
management, timeout handling, and thread safety concerns across multiple DNS
resolver test cases.
Move client information related functions from client_info_*.go to desktop_*.go files
to better organize platform-specific code and separate desktop functionality from
shared code.

No functional changes.
Improve documentation for Test_prog_parseResolvConfNameservers to clarify that
the old implementation was removed as part of code deduplication effort. The code
for handling resolv.conf was unified into the resolvconffile package to provide
a consistent interface across the codebase.

This change provides better context for future developers about why the
refactoring was done and what benefits it brings.
Add context parameter to validInterfacesMap for better error handling and
logging. Move Windows-specific network adapter validation logic to the
ctrld package. Key changes include:

- Add context parameter to validInterfacesMap across all platforms
- Move Windows validInterfaces to ctrld.ValidInterfaces
- Improve error handling for virtual interface detection on Linux
- Update all callers to pass appropriate context

This change improves error reporting and makes the interface validation
code more maintainable across different platforms.
Move getDNS type definition from dns.go to os_linux.go where it is used.
Remove the now-empty dns.go file. This change improves code organization
by keeping platform-specific types with their implementations.
Break down the large DNS handling function into smaller, focused functions
with clear responsibilities:

- Extract handleDNSQuery from serveDNS handler function
- Create dedicated startListeners function for listener management
- Add standardQueryRequest struct to encapsulate query parameters
- Split special domain handling into separate function
- Add descriptive comments for each new function
- Improve variable names for better clarity (e.g., startTime vs t)

This refactoring improves code maintainability and readability without
changing the core DNS proxy functionality.
By looking for any additional dnsmasq configuration files under
/tmp/etc, and handling them like the default one.
This change improves compatibility with newer UniFi OS versions while
maintaining backward compatibility with UniFi OS 4.2 and earlier.
The refactoring also reduces code duplication and improves maintainability
by centralizing dnsmasq configuration path logic.
Codescribe and others added 9 commits April 30, 2026 19:19
upstreamConfigFor() used strings.Contains(":") to decide whether to
append ":53", which always evaluates true for IPv6 addresses. This left
bare addresses like "2a0d:6fc0:9b0:3600::1" without brackets or port,
causing net.Dial to reject with "too many colons in address".

Use net.JoinHostPort() which handles IPv6 bracketing automatically,
producing "[2a0d:6fc0:9b0:3600::1]:53".
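
To illustrate the difference, here is a minimal sketch. The helper names naiveAddr and upstreamAddr are hypothetical and are not taken from the ctrld source; the sketch also assumes the input is a bare host with no port attached.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// naiveAddr mirrors the heuristic described above: any ":" in the
// address is taken to mean "a port is already present". (Hypothetical name.)
func naiveAddr(host string) string {
	if strings.Contains(host, ":") {
		return host
	}
	return host + ":53"
}

// upstreamAddr always lets net.JoinHostPort build the address, so IPv6
// literals are wrapped in brackets. (Hypothetical name; assumes host has no port.)
func upstreamAddr(host string) string {
	return net.JoinHostPort(host, "53")
}

func main() {
	for _, h := range []string{"1.1.1.1", "2a0d:6fc0:9b0:3600::1"} {
		fmt.Printf("%-24s naive=%q joined=%q\n", h, naiveAddr(h), upstreamAddr(h))
	}
}
```

The naive version returns the IPv6 literal untouched, which is the form net.Dial rejects; the JoinHostPort version yields a dialable address for both families.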
- Update comment in ensurePFAnchorReference: pfctl -sn returns
  rdr-anchor only (nat-anchor not used by ctrld)
- Update nat-anchor table entry in pf-dns-intercept.md
- Add pf nuances 10-16 from investigation: cross-AF redirect,
  block return, sendmsg EINVAL, nat-on-lo0, raw sockets, DIOCNATLOOK,
  and the pragmatic IPv6 block solution
When port 53 is taken (e.g. by mDNSResponder), ctrld failed with
'could not find available listen ip and port' instead of falling back
to port 5354. Root cause: tryUpdateListenerConfig() checked the
dnsIntercept bool, which is derived in prog.run() AFTER listener
config is resolved.

Fix: check interceptMode string directly (CLI flag + config fallback)
in a new tryUpdateListenerConfigIntercept() that tries 127.0.0.1:53
then 127.0.0.1:5354.

Also updates buildPFAnchorRules() to use the actual listener IP/port
from config instead of hardcoded 127.0.0.1:53, so pf rules redirect
to wherever ctrld is actually listening.
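
The fallback order can be sketched as a simple probe over candidate addresses. This is an illustration only: firstAvailable is a hypothetical helper, not the body of tryUpdateListenerConfigIntercept(), and it probes UDP only.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// firstAvailable returns the first candidate address on which a UDP socket
// can be bound. (Hypothetical helper, for illustration.)
func firstAvailable(candidates []string) (string, error) {
	for _, addr := range candidates {
		pc, err := net.ListenPacket("udp", addr)
		if err != nil {
			continue // taken, malformed, or not permitted; try the next one
		}
		pc.Close()
		return addr, nil
	}
	return "", errors.New("could not find available listen ip and port")
}

func main() {
	// Port 53 first, then the 5354 fallback named in the commit message.
	addr, err := firstAvailable([]string{"127.0.0.1:53", "127.0.0.1:5354"})
	fmt.Println(addr, err)
}
```

A probe like this is inherently racy, since another process can take the port between the check and the real bind, so a real listener would still need to handle a bind failure.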
Pass a quic.Config with KeepAlivePeriod (15s) to DoQ dial calls instead
of nil, so pooled connections send periodic QUIC PINGs to stay alive and
detect dead paths proactively.

Also add IdleTimeoutError to the DoQ retry conditions alongside io.EOF,
so stale pooled connections trigger a transparent retry instead of
propagating as a query failure.
Replace conn.OpenStream (non-blocking) with conn.OpenStreamSync so that
the resolver waits for the server's MAX_STREAMS credit replenishment frame
instead of immediately failing when the stream limit is temporarily
exhausted. Also retry on StreamLimitReachedError as defense-in-depth for
servers that are slow or fail to send MAX_STREAMS updates.
SetSelfIP unconditionally accessed t.dhcp, but t.dhcp is only
initialized when DHCP discovery is enabled. A network change event
can fire SetSelfIP regardless of the discovery configuration,
causing a nil pointer dereference.

Guard the t.dhcp access with a nil check so the self IP is still
updated on the Table even when DHCP discovery is disabled.
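
A minimal sketch of the guard, using simplified stand-in types. The field and method names below are assumptions for illustration and do not reproduce the real Table or dhcp definitions.

```go
package main

import (
	"fmt"
	"sync"
)

// dhcp is a simplified stand-in for the DHCP discovery component.
type dhcp struct {
	mu     sync.Mutex
	selfIP string
}

func (d *dhcp) setSelfIP(ip string) {
	d.mu.Lock()
	d.selfIP = ip
	d.mu.Unlock()
}

// Table is a simplified stand-in; dhcp stays nil when DHCP discovery is disabled.
type Table struct {
	mu     sync.Mutex
	selfIP string
	dhcp   *dhcp
}

// SetSelfIP always updates the Table, and forwards to dhcp only if it exists.
func (t *Table) SetSelfIP(ip string) {
	t.mu.Lock()
	t.selfIP = ip
	t.mu.Unlock()
	if t.dhcp != nil {
		t.dhcp.setSelfIP(ip)
	}
}

func main() {
	tbl := &Table{} // DHCP discovery disabled: dhcp is nil
	tbl.SetSelfIP("192.168.1.10")
	fmt.Println(tbl.selfIP)
}
```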
README.md: fix Go version requirement (1.23 -> 1.24), update OS
support architectures (add arm64/mipsle/mips64 for Linux, arm64 for
Windows/FreeBSD, remove windows/arm), fix broken PowerShell install
path, demote H1 section headings to H2.
When multiple network changes fire in quick succession (e.g., VPN
disconnect + interface swap), the second handleRecovery() call cancels
the first but inherits stale DoH transports, causing DNS blackouts
of up to 30 seconds.

Three changes to reduce worst-case recovery from ~30s to <3s:

1. ForceReBootstrap() on recovery entry — closes dead connections and
   creates fresh transports synchronously before probing, replacing the
   lazy ReBootstrap() flag that left stale connections for probes to hit.

2. Debounce handleRecovery() for network changes (500ms window) — only
   the recovery flow is debounced; all other state updates (IP, pf
   anchor, VPN DNS, tunnel checks) still run immediately on every event.
   This eliminates the cancel-and-restart race without missing state.

3. Combined effect: ForceReBootstrap closes old in-flight connections
   (closeTransports) and builds new ones (SetupTransport) atomically,
   so recovery probes never inherit dead connections from a prior
   recovery attempt.
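
The debounce in item 2 could look roughly like the following. This is a sketch under assumptions: the debouncer type and the burst helper are hypothetical, and it fires on the trailing edge of the window, which the commit message does not specify.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// debouncer coalesces a burst of Trigger calls into a single call of fn,
// fired once no new trigger has arrived for the window duration.
type debouncer struct {
	mu     sync.Mutex
	timer  *time.Timer
	window time.Duration
	fn     func()
}

func (d *debouncer) Trigger() {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.timer != nil {
		d.timer.Stop() // drop the pending run; the newest event wins
	}
	d.timer = time.AfterFunc(d.window, d.fn)
}

// burst fires n triggers back to back and reports how many times fn ran.
func burst(n int, window time.Duration) int32 {
	var runs atomic.Int32
	d := &debouncer{window: window, fn: func() { runs.Add(1) }}
	for i := 0; i < n; i++ {
		d.Trigger()
	}
	time.Sleep(4 * window) // give the single trailing run time to fire
	return runs.Load()
}

func main() {
	// The commit uses a 500ms window; a shorter one keeps the demo quick.
	fmt.Println("recovery runs:", burst(3, 50*time.Millisecond))
}
```

Only the recovery callback goes through the debouncer here; state updates that must run on every event would be called directly, outside of Trigger.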
Add file-backed persistence to the internal logWriter so runtime logs
survive service restarts. When internal logging is enabled (CD mode,
no explicit log_path), writes are teed to both the existing in-memory
ring buffer and a rotated file on disk (ctrld.log in the home directory).

File rotation: 5MB max with 1 backup (ctrld.log.1), so max ~10MB on disk.
Log view/send now reads from the persisted files (including backup) to
provide complete history across restarts. Live tail continues to use
the in-memory subscriber mechanism unchanged.

Activation: same conditions as existing internal logging — CD mode only,
no log_path configured. No new config options or dependencies.
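
The rotation scheme (size cap, one backup) can be sketched as below. The rotatingFile type is hypothetical and not the actual logWriter code; it is not goroutine-safe, and the demo uses a 10-byte cap instead of 5MB so the rotation is easy to observe.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// rotatingFile appends to path and keeps exactly one backup at path+".1".
type rotatingFile struct {
	path    string
	maxSize int64
	size    int64
	f       *os.File
}

func openRotating(path string, maxSize int64) (*rotatingFile, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o600)
	if err != nil {
		return nil, err
	}
	st, err := f.Stat()
	if err != nil {
		f.Close()
		return nil, err
	}
	return &rotatingFile{path: path, maxSize: maxSize, size: st.Size(), f: f}, nil
}

// Write rotates first if the write would push a non-empty file past maxSize.
func (r *rotatingFile) Write(p []byte) (int, error) {
	if r.size > 0 && r.size+int64(len(p)) > r.maxSize {
		if err := r.rotate(); err != nil {
			return 0, err
		}
	}
	n, err := r.f.Write(p)
	r.size += int64(n)
	return n, err
}

// rotate renames the live file over any previous backup and starts a new one.
func (r *rotatingFile) rotate() error {
	if err := r.f.Close(); err != nil {
		return err
	}
	if err := os.Rename(r.path, r.path+".1"); err != nil {
		return err
	}
	f, err := os.OpenFile(r.path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o600)
	if err != nil {
		return err
	}
	r.f, r.size = f, 0
	return nil
}

func (r *rotatingFile) Close() error { return r.f.Close() }

// demo writes 8 then 6 bytes under a 10-byte cap and returns the sizes of
// the live file and the backup.
func demo() (int64, int64, error) {
	dir, err := os.MkdirTemp("", "rotate-demo")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "ctrld.log")
	w, err := openRotating(path, 10)
	if err != nil {
		return 0, 0, err
	}
	w.Write([]byte("aaaaaaaa"))
	w.Write([]byte("bbbbbb"))
	w.Close()
	live, err := os.Stat(path)
	if err != nil {
		return 0, 0, err
	}
	backup, err := os.Stat(path + ".1")
	if err != nil {
		return 0, 0, err
	}
	return live.Size(), backup.Size(), nil
}

func main() {
	fmt.Println(demo())
}
```

Teeing to an in-memory buffer as well could be done with io.MultiWriter over this writer and the buffer, though the commit message does not say how the tee is implemented.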
@cuonglm cuonglm force-pushed the release-branch-v2.0.0 branch from 3e53fd4 to 4753507 Compare April 30, 2026 12:21
When third-party VPN software (e.g., OpenVPN) installs WFP block filters via
block-outside-dns, all DNS traffic to non-tunnel interfaces is blocked —
including DNS to 127.0.0.1 (ctrld's NRPT target). This breaks DNS mode
interception because the NRPT catch-all rule routes queries to loopback,
but WFP blocks the connection before it reaches ctrld's listener.

Fix: after exhausting all NRPT recovery attempts, activate a minimal WFP
session with "hard permit" filters (FWPM_FILTER_FLAG_CLEAR_ACTION_RIGHT)
for DNS to localhost in a max-priority sublayer (weight 0xFFFF). This
overrides the VPN's block for loopback DNS only, while preserving the
VPN's DNS leak protection for all other (non-loopback) DNS traffic.

The loopback protect is:
- Only activated when NRPT probes fail (not preemptively)
- Harmless when no conflicting WFP blocks exist (permit-only, no blocks)
- Persistent until ctrld shutdown (survives VPN reconnect cycles)
- Cleaned up by the existing cleanupWFPFilters path on shutdown
@cuonglm cuonglm force-pushed the release-branch-v2.0.0 branch from 4753507 to 81aa6b2 Compare April 30, 2026 12:30
Codescribe and others added 18 commits May 7, 2026 19:37
When WFP loopback protect is active, the upstream.os healthcheck will
always fail because an external WFP block filter is interfering with
plain DNS. This demotes those expected failures to debug level and
returns errOsHealthcheckSuppressed so the recovery loop treats them
as non-fatal, eliminating the log spam described in #526.
Go's default is already TLS 1.2+ (since Go 1.18), but making this
explicit satisfies RFC 7858/9250 recommendations and makes the security
intent clear for auditors.
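
Making the floor explicit amounts to setting MinVersion on the tls.Config. A minimal sketch, where newTLSConfig is a hypothetical constructor rather than a ctrld function:

```go
package main

import (
	"crypto/tls"
	"fmt"
)

// newTLSConfig pins the minimum protocol version instead of relying on
// the Go default. (Hypothetical helper, for illustration.)
func newTLSConfig(serverName string) *tls.Config {
	return &tls.Config{
		ServerName: serverName,
		MinVersion: tls.VersionTLS12,
	}
}

func main() {
	cfg := newTLSConfig("dns.example.com")
	fmt.Printf("min TLS version: %#04x\n", cfg.MinVersion)
}
```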
Current code writes to a predictable path, which on systems without
`fs.protected_symlinks` (e.g. embedded routers) could allow a local
attacker with API compromise to perform symlink attacks.
Currently there is no limit on PIN attempts, allowing unlimited
brute force if an attacker gains socket access. While the socket is
root-only by default, rate limiting is cheap defense-in-depth.
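
The commit message does not describe the limiting policy, so the following is only one possible shape: a fixed number of failures followed by a lockout period. The pinLimiter type, its thresholds, and the rejectedCount helper are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// pinLimiter locks out further attempts after maxFailures wrong PINs.
type pinLimiter struct {
	mu          sync.Mutex
	maxFailures int
	lockout     time.Duration
	failures    int
	lockedUntil time.Time
}

// Allow reports whether a PIN attempt may be evaluated at time now.
func (l *pinLimiter) Allow(now time.Time) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	return !now.Before(l.lockedUntil)
}

// Record notes the outcome of an attempt that was evaluated.
func (l *pinLimiter) Record(now time.Time, ok bool) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if ok {
		l.failures = 0
		return
	}
	l.failures++
	if l.failures >= l.maxFailures {
		l.failures = 0
		l.lockedUntil = now.Add(l.lockout)
	}
}

// rejectedCount replays n wrong guesses at a single instant and counts how
// many the limiter refused to evaluate.
func rejectedCount(maxFailures, n int) int {
	l := &pinLimiter{maxFailures: maxFailures, lockout: time.Minute}
	now := time.Now()
	rejected := 0
	for i := 0; i < n; i++ {
		if !l.Allow(now) {
			rejected++
			continue
		}
		l.Record(now, false)
	}
	return rejected
}

func main() {
	fmt.Println("rejected:", rejectedCount(3, 5))
}
```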
DoQ responses are length-prefixed per RFC 9250. The resolver previously
assumed the stream always contained at least two bytes and unpacked from
buf[2:], which could panic on truncated or malicious replies.

Validate the prefix against the bytes read, return a clear error, and
retire the connection from the pool on framing failure. Unpack only the
slice declared by the prefix so a short read cannot be misinterpreted as
a full message.

Add regression coverage with a small test server that returns malformed
raw payloads (empty, one byte, prefix-only, prefix larger than payload).
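
The prefix check can be sketched as follows. RFC 9250 uses a 2-octet length field in network byte order; the function name doqPayload, the error value, and the choice to treat a zero length as malformed are assumptions for this illustration.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// errDoQFrame marks a response whose length prefix is missing or inconsistent.
var errDoQFrame = errors.New("doq: malformed length-prefixed response")

// doqPayload returns exactly the bytes declared by the 2-byte big-endian
// length prefix, or an error if the buffer cannot contain them.
func doqPayload(buf []byte) ([]byte, error) {
	if len(buf) < 2 {
		return nil, fmt.Errorf("%w: got %d bytes, need at least 2", errDoQFrame, len(buf))
	}
	n := int(binary.BigEndian.Uint16(buf[:2]))
	if n == 0 || n > len(buf)-2 {
		return nil, fmt.Errorf("%w: prefix declares %d bytes, %d available", errDoQFrame, n, len(buf)-2)
	}
	// Slice by the declared length so trailing bytes are never unpacked.
	return buf[2 : 2+n], nil
}

func main() {
	cases := [][]byte{
		{},                          // empty
		{0x00},                      // one byte
		{0x00, 0x05},                // prefix only
		{0x00, 0x05, 0x01, 0x02},    // prefix larger than payload
		{0x00, 0x02, 0xAB, 0xCD, 9}, // valid, with one trailing byte
	}
	for _, c := range cases {
		msg, err := doqPayload(c)
		fmt.Printf("%v -> %v, err=%v\n", c, msg, err)
	}
}
```

On an error like this, the caller would also stop reusing the connection, as the commit message describes.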
DoQ pools now keep a single quic.Transport and UDP socket for all dials,
so parallel dial and reconnect churn no longer allocate a new socket per
attempt or leak the winner's UDP conn when the caller owns the packet
conn.

quicParallelDialer accepts an optional transport: when set, dials use
Transport.DialEarly on that socket; when nil, behavior matches the old
per-dial ListenUDP path (losers close their sockets).

Per RFC 9250 §4.2, close the query stream's send side before reading the
response so strict upstreams see STREAM FIN before answering.

CloseIdleConnections closes the shared transport and underlying UDP
conn so checked-out connections and the OS socket are torn down.

Add a FIN-strict test server, coverage for bootstrap vs parallel-dial
paths, and a Linux-only FD churn regression test.
Stick to go1.25 for now, since using go1.26 causes a runtime panic when
building for arm platforms.
For GO-2026-5026 security fix.
The test:windows CI job intermittently failed to clean up .testbin with
"Access to the path '...cmd_cli.test.exe' is denied". This was previously
attributed to Windows Defender scanning the large unsigned test binaries,
and mitigated with Defender exclusions and cleanup retries. That was
treating a symptom.

Root cause: performUpgrade() self-upgrades by running
exec.Command(os.Executable(), "upgrade", "prod", "-vv") as a detached,
windowless child. In the real ctrld binary this re-execs ctrld and is
correct. Under `go test`, os.Executable() is the test binary itself, and
`go test` stops flag parsing at the first positional arg ("upgrade") and
ignores the rest -- so the child silently re-runs the entire test suite.
That child hits the upgrade tests again and spawns more detached children,
recursively: a fork bomb of hidden processes that pins the runner's
CPU/memory and keeps the test binary's image file locked. Windows refuses
to delete the image of a running process, hence the "Access is denied"
during after_script. Whether any children are still alive when cleanup
runs is a timing race, which is why the failure was flaky.

Two tests reached this path: Test_performUpgrade (directly) and
Test_selfUpgradeCheck (via selfUpgradeCheck -> performUpgrade on the
"upgrade allowed" case).

Fix:
- prog.go: extract the command construction into a package-level
  newUpgradeCmd var. Production behavior is unchanged.
- main_test.go: stub newUpgradeCmd once in TestMain so the whole test
  binary self-execs with `-test.run=^$` (matches no tests, exits
  immediately) instead of re-running the suite. This covers every test
  that reaches performUpgrade, present and future, while still exercising
  the cmd.Start() success path.
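
The seam described in the fix can be sketched like this. The signature of newUpgradeCmd (taking the executable path) and the stubUpgradeCmd helper are assumptions; the commit message only states that the command construction became a package-level variable that TestMain replaces.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// newUpgradeCmd builds the self-upgrade child command. It is a package-level
// variable so tests can swap it. (Signature assumed for illustration.)
var newUpgradeCmd = func(exe string) *exec.Cmd {
	return exec.Command(exe, "upgrade", "prod", "-vv")
}

// stubUpgradeCmd is what a TestMain could install: the child still execs the
// test binary, but -test.run=^$ matches no tests, so it exits immediately.
func stubUpgradeCmd() {
	newUpgradeCmd = func(exe string) *exec.Cmd {
		return exec.Command(exe, "-test.run=^$")
	}
}

func main() {
	exe, err := os.Executable()
	if err != nil {
		fmt.Println("cannot resolve executable:", err)
		return
	}
	fmt.Println("production args:", newUpgradeCmd(exe).Args[1:])
	stubUpgradeCmd()
	fmt.Println("stubbed args:   ", newUpgradeCmd(exe).Args[1:])
}
```

Because the stub still returns a real command for the running binary, a subsequent cmd.Start() succeeds as it would in production, without re-running the suite.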

2 participants