
feat(quota): per-account Apps Script quota tracking with hard-stop and UI display #1396

Open
CaptainMirage wants to merge 8 commits into therealaleph:main from CaptainMirage:feat/quota-tracking

@CaptainMirage
Contributor

@CaptainMirage CaptainMirage commented May 25, 2026

Summary

Adds a full per-account quota tracking system to protect the proxy against
Apps Script daily quota exhaustion. When an account is running low or fully
exhausted, it is blocked before the next request — not after — so the proxy
never silently fails mid-session. When all accounts are exhausted, a global
hard stop fires and returns 503 to the client immediately rather than
burning remaining quota on requests that will fail anyway.

What changed

src/quota_tracker.rs — new module

Implements AccountBucket (per-account rolling 24h window state),
QuotaState (serializable snapshot of all buckets + persistent relay counter),
and QuotaTracker (the live runtime handle shared across tasks).

Key behaviour:

  • Rolling 24h windows — each account tracks usage in a rolling window
    anchored to the first call of the day, not calendar midnight. Reset time
    is per-account, not global.
  • Safety buffer pre-check — before each relay call the tracker checks
    remaining < safety_buffer. If true, the account is hard-stopped and
    removed from the dispatch rotation. This means an account goes dark while
    it still has headroom, rather than failing at zero.
  • Startup safety check: check_all_safety_buffers() runs at load time
    so near-limit accounts from the previous session are blocked before the
    first request of the new session.
  • Persistent relay counter: total_relay_calls is written to
    quota_state.json and restored on restart. Resets at UTC midnight via
    a stored day number. Drives the "fetches today" UI counter.
  • Startup summary line: startup_summary() builds a human-readable
    line that logs quota state right after the Listening lines on startup.
  • impl Drop — saves state to disk on process exit so no in-memory
    data is lost on clean shutdown.
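
The rolling-window and safety-buffer behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in src/quota_tracker.rs: the field names, the `try_reserve` and `roll_if_expired` method names, and the use of plain Unix seconds are all assumptions made for the example.

```rust
/// Length of one rolling window in seconds (24h).
const WINDOW_SECS: u64 = 24 * 60 * 60;

/// Illustrative per-account state. Field names are assumptions,
/// not the PR's actual struct layout.
#[derive(Debug, Default)]
pub struct AccountBucket {
    /// Unix time of the first call in the current window, if any.
    pub window_start: Option<u64>,
    pub requests_used: u64,
    pub hard_stopped: bool,
}

impl AccountBucket {
    /// Reset the bucket once its own 24h window has elapsed.
    /// Each account resets independently of the others.
    pub fn roll_if_expired(&mut self, now: u64) {
        if let Some(start) = self.window_start {
            if now.saturating_sub(start) >= WINDOW_SECS {
                *self = AccountBucket::default();
            }
        }
    }

    /// Pre-check run before a relay call. Returns true if the call may
    /// proceed, false if the account is (now) hard-stopped.
    pub fn try_reserve(&mut self, now: u64, daily_limit: u64, safety_buffer: u64) -> bool {
        self.roll_if_expired(now);
        let remaining = daily_limit.saturating_sub(self.requests_used);
        if remaining < safety_buffer {
            // Blocked while there is still headroom, rather than at zero.
            self.hard_stopped = true;
            return false;
        }
        // The window is anchored to the first call, not to calendar midnight.
        self.window_start.get_or_insert(now);
        self.requests_used += 1;
        true
    }
}
```

Passing `now` in explicitly keeps the sketch deterministic to test; a real tracker would presumably read the system clock.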

src/config.rs

Added two fields with sane defaults — no TOML changes are required,
the proxy works out of the box:

| Field | Default | Meaning |
|---|---|---|
| quota_daily_limit | 20000 | Apps Script quota per account per day |
| quota_safety_buffer | 500 | Hard-stop N calls before the limit |

These can be overridden in config.toml if needed, but the defaults match
the standard Apps Script quota ceiling with a reasonable safety margin.
Documentation will be updated in a separate docs revision pass.
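
For illustration, the two defaults could be expressed like this. It is only a sketch: the `QuotaConfig` struct and the `stop_threshold` helper are hypothetical names, and the real fields live on the existing config struct in src/config.rs (most likely with deserialization defaults, which are omitted here to keep the example dependency-free).

```rust
/// Illustrative slice of the config. `QuotaConfig` is a hypothetical
/// name; only the two field names and defaults come from the PR.
#[derive(Debug, Clone, PartialEq)]
pub struct QuotaConfig {
    pub quota_daily_limit: u64,
    pub quota_safety_buffer: u64,
}

impl Default for QuotaConfig {
    fn default() -> Self {
        Self {
            quota_daily_limit: 20_000,
            quota_safety_buffer: 500,
        }
    }
}

impl QuotaConfig {
    /// Usage count at which the remaining quota equals the safety buffer.
    /// (Hypothetical helper, shown only to make the arithmetic concrete.)
    pub fn stop_threshold(&self) -> u64 {
        self.quota_daily_limit.saturating_sub(self.quota_safety_buffer)
    }
}
```

With the defaults, the remaining quota reaches the buffer at 20000 - 500 = 19500 recorded calls.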

src/domain_fronter.rs

  • record_relay() called at the top of relay(), before any early
    return. Every proxied request is counted regardless of path (exit node,
    Apps Script, or hard stop).
  • Global hard stop checked before the exit node path — exit node cannot
    bypass a fully exhausted quota state.
  • Exit node byte tracking: bytes_relayed now accumulates
    body.len() + response.len() on exit node success, not just Apps Script
    responses. This makes the "data transferred" estimate accurate across
    both paths.
  • relay_calls and relay_failures counters remain on DomainFronter
    and are used for per-session stats (not persisted).
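
The ordering described in this list (count first, then the global hard stop, then the exit node path) might look roughly like the sketch below. `FronterSketch`, `RelayOutcome`, and the idea of passing the response in as an argument are all assumptions made to keep the example small; the real relay() is async and performs actual I/O.

```rust
/// Hypothetical outcome type, used only for this illustration.
#[derive(Debug, PartialEq)]
pub enum RelayOutcome {
    HardStopped503,
    ExitNode,
    AppsScript,
}

/// Simplified stand-in for the relay state (assumed field names).
#[derive(Default)]
pub struct FronterSketch {
    pub total_relay_calls: u64,
    pub globally_hard_stopped: bool,
    pub exit_node_enabled: bool,
    pub bytes_relayed: u64,
}

impl FronterSketch {
    pub fn relay(&mut self, body: &[u8], response: &[u8]) -> RelayOutcome {
        // 1. Count the request before any early return.
        self.total_relay_calls += 1;

        // 2. Check the global hard stop before the exit node path,
        //    so the exit node cannot bypass an exhausted quota state.
        if self.globally_hard_stopped {
            return RelayOutcome::HardStopped503;
        }

        // 3. Both successful paths accumulate request + response bytes.
        self.bytes_relayed += (body.len() + response.len()) as u64;
        if self.exit_node_enabled {
            RelayOutcome::ExitNode
        } else {
            RelayOutcome::AppsScript
        }
    }
}
```

Note that a hard-stopped request still increments the call counter but adds no bytes, which matches "every proxied request is counted regardless of path".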

src/proxy_server.rs

  • Startup summary logged once after the Listening lines, showing quota
    state for all configured accounts.
  • Stats task reduced from 60s → 15s interval, fires immediately on
    start. Calls roll_expired_windows() so 24h windows reset even when
    the proxy is idle (no traffic required to trigger a reset).
  • 1-second save task — separate tokio task that flushes dirty quota
    state every second. Decoupled from the stats log so disk writes are not
    held hostage to the 15s interval.
  • Exhaustion detail logging — on the stats cycle after a global hard
    stop, logs each exhausted account's masked ID and remaining count. Uses
    a was_hard_stopped flag so this fires once on transition, not every
    15s.
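
Two of the mechanisms above, the dirty-state flush and the log-once-on-transition flag, can be sketched with standard-library types only. `DirtyFlag` and `HardStopEdge` are hypothetical names; the PR only states that a was_hard_stopped flag exists and that dirty state is flushed every second.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Marks quota state as needing a disk write (hypothetical helper).
pub struct DirtyFlag(AtomicBool);

impl DirtyFlag {
    pub const fn new() -> Self {
        Self(AtomicBool::new(false))
    }

    /// Called whenever quota state changes.
    pub fn mark(&self) {
        self.0.store(true, Ordering::Release);
    }

    /// Called by the periodic save task. Returns true at most once per
    /// batch of changes, so an idle proxy performs no disk writes.
    pub fn take(&self) -> bool {
        self.0.swap(false, Ordering::AcqRel)
    }
}

/// Edge detector for the global hard stop.
#[derive(Default)]
pub struct HardStopEdge {
    was_hard_stopped: bool,
}

impl HardStopEdge {
    /// Returns true only on the false -> true transition, so the
    /// exhaustion detail is logged once rather than on every stats tick.
    pub fn should_log(&mut self, now_stopped: bool) -> bool {
        let fire = now_stopped && !self.was_hard_stopped;
        self.was_hard_stopped = now_stopped;
        fire
    }
}
```

In the proxy, the 1-second task would call `take()` and write quota_state.json only when it returns true, while the 15s stats task would call `should_log()` with the current hard-stop state.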

src/bin/ui.rs

New Usage Today grid in the UI sidebar:

| Row | Left | Right |
|---|---|---|
| 1 | fetches today | X / Y (Z%) · resets in Xh Ym |
| 2 | relay calls | N (M failed) · cache — |
| 3 | PT day | YYYY-MM-DD · accounts N/N active |
| 4 (conditional) | data transferred | X MB / Y GB est. (shown after 5+ calls) |

  • "Fetches today" is driven by the persisted total_relay_calls, not the
    in-memory relay_calls counter, so it survives restarts.
  • "Resets in" shows the rolling 24h countdown to the earliest account reset.
  • "Data transferred" only renders after 5+ calls to avoid a meaningless
    estimate on a cold start. Clamped average per-request size of 50 KB–500 KB.
  • A red QUOTA HARD STOP banner appears above the grid when all accounts
    are exhausted.
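
A rough sketch of the two display helpers implied above. The function names are hypothetical, and treating 1 KB as 1024 bytes is an assumption; the PR does not say which convention the UI uses or exactly how the "Y GB est." figure is derived from the clamped average.

```rust
/// Assumption: 1 KB = 1024 bytes.
const KB: u64 = 1024;

/// Formats a countdown in seconds as "Xh Ym", as in "resets in Xh Ym".
pub fn format_resets_in(secs: u64) -> String {
    format!("{}h {}m", secs / 3600, (secs % 3600) / 60)
}

/// Average bytes per request, clamped to 50 KB..=500 KB.
/// Returns None below 5 calls, where the UI hides the estimate.
pub fn clamped_avg_bytes(bytes_relayed: u64, calls: u64) -> Option<u64> {
    if calls < 5 {
        return None;
    }
    Some((bytes_relayed / calls).clamp(50 * KB, 500 * KB))
}
```

The clamp keeps a handful of tiny or huge early responses from producing a wildly skewed estimate.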

Compatibility note with PR #1346

PR #1346 (large download resilience, stream timeout decoupling, compact log
timestamps) and this PR both touch src/bin/ui.rs, src/config.rs,
src/domain_fronter.rs, and src/proxy_server.rs — but in completely
separate areas of each file. #1346 adds a stream timeout config field and
log timestamp formatting; this PR adds quota fields and a new UI section.
There is no logical overlap. Whichever merges second will have a trivial
conflict that resolves cleanly by keeping both diffs.


Test plan

  • Start proxy with no quota_state.json — verify file is created within
    1 second of the first relay call
  • Stop and restart — verify total_relay_calls is restored from disk and
    "fetches today" in the UI matches the pre-restart value
  • Set quota_safety_buffer high enough to trigger a blocked account at
    startup — verify that account is excluded before the first request
  • Exhaust all accounts (or set quota_daily_limit = 1) — verify global
    hard stop fires, subsequent requests return 503, and the red banner appears
    in the UI
  • Let the proxy idle past a 24h window — verify roll_expired_windows
    resets the account without needing any traffic
  • Confirm UI grid renders: correct counts, resets-in countdown, data
    estimate only visible after 5+ calls

P.S. I had around 20 commits that I squashed down to 7, that's why the commits all have the exact same timestamp lol

@github-actions github-actions Bot added the type: feature feat: PR — auto-applied by release-drafter label May 25, 2026
@CaptainMirage
Contributor Author

Related work -- PR #1388

After opening this PR I noticed #1388 ("feat(relay): prioritize mux dispatch
and expose script health") independently implements a local rolling 24h call
ledger per Apps Script deployment inside domain_fronter.rs, along with a
Script health panel in the UI showing masked deployment IDs, usage, saturation
status, and failure quarantine.

There is conceptual overlap worth flagging so you can decide how these two interact:

| | This PR (feat/quota-tracking) | #1388 |
|---|---|---|
| Scope | Per-account (Google account) | Per-deployment (script URL) |
| Persistence | Written to quota_state.json, survives restarts | In-memory, resets on restart |
| Hard stop | Global hard stop blocks all traffic when exhausted | Steering: prefers non-saturated, falls through |
| Failure handling | Hard stop + was_hard_stopped flag, 503 to client | 429/403 -> 24h quarantine; 5xx -> cooldown |
| UI | Usage Today grid (fetches, resets, relay calls, etc) | Script health panel (per-deployment stats) |

These are not mutually exclusive -- tracking quota per-account (this PR) and
per-deployment (#1388) are complementary. But the deployment selection logic
in #1388 and the global hard stop in this PR would need to be aware of each
other if both land. Flagging it early so you can coordinate, aleph; I don't think there is any other maintainer, is there?

@CaptainMirage
Contributor Author

also one thing i noticed, when you merge PRs you tend to squash all the commits into a single one, and the message isn't always super detailed either. it would be really helpful to keep the individual commits from each PR so contributors like me can see exactly what changed at each step without having to open the PR itself, and rolling back something specific becomes a lot more precise rather than having to revert an entire PR at once. just something to think about, many thanks!

Owner

@therealaleph therealaleph left a comment

Thanks for the very detailed PR and the notes about #1388. I tested the branch locally with cargo test --all-targets --features ui, and the suite is green, but I cannot merge this one yet because I found one quota-state correctness bug.

QuotaTracker::is_globally_hard_stopped() computes the secondary aggregate check from st.buckets.values(), while total_cap is based only on the currently configured script_ids. Since load() preserves buckets for script IDs that were removed from config, a user who rotates away old exhausted IDs can still carry those stale requests_used / quota_error_count values in quota_state.json. That can trip the global hard stop even when the currently configured IDs are fresh.

Please either filter the aggregate sums to self.script_ids or prune removed buckets on load, and add a regression test for this shape: one stale exhausted persisted bucket that is no longer configured, plus one fresh configured ID, should not globally hard-stop.

Small follow-up while you are in there: the hard-stop response currently returns 502, while the PR description says 503. 503 Service Unavailable fits this case better.

On commit preservation: for user-facing release history I usually squash, but for a feature-sized PR like this I can keep a more detailed merge commit/message once the blocker is fixed.


Answered via LLM, Supervised @therealaleph

@CaptainMirage
Contributor Author

I see, I'll fix them right away.

And on commit preservation: the JSON-to-TOML change would have been a nice one to keep, since it touched many docs and added a lot of stuff. I did make sure everything is automatically translated, but I mostly mean for others, in case they are doing something specific that I happened to touch. Sure, blame exists, but I mean more for a version-rewind situation than for seeing who changed what and when.

…Ds, fix 503 response

is_globally_hard_stopped() was summing quota_error_count and requests_used
over all persisted buckets, including stale ones from script IDs removed from
config. A user rotating away exhausted IDs would still carry their usage in
quota_state.json, causing the aggregate check to falsely trip a global hard
stop against fresh accounts.

Fixed by filtering both sums to self.script_ids only. The all_stopped primary
check was already correct (iterates script_ids, not bucket values).

Also corrects the hard-stop HTTP response from 502 Bad Gateway to 503
Service Unavailable, which is the accurate status for a deliberately refused
request due to resource exhaustion.

Regression test: one stale exhausted persisted bucket not in the current
config plus one fresh configured bucket must not trigger a global hard stop.
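
The shape of the bug and of the fix can be illustrated with a small stand-alone sketch. It is not the code from src/quota_tracker.rs: `TrackerSketch`, the method names, and the exact exhaustion condition (`used >= total_cap`) are assumptions, and only the requests_used sum is shown (the same filter would apply to the quota_error_count sum).

```rust
use std::collections::HashMap;

#[derive(Debug, Default, Clone)]
pub struct Bucket {
    pub requests_used: u64,
}

/// Simplified stand-in for the tracker (assumed layout).
pub struct TrackerSketch {
    /// IDs currently present in the config.
    pub script_ids: Vec<String>,
    /// Persisted buckets; may still hold IDs that were removed from config.
    pub buckets: HashMap<String, Bucket>,
    pub daily_limit: u64,
}

impl TrackerSketch {
    fn total_cap(&self) -> u64 {
        self.daily_limit * self.script_ids.len() as u64
    }

    /// Buggy shape: sums over every persisted bucket, stale ones included,
    /// while total_cap only reflects the configured IDs.
    pub fn aggregate_exhausted_unfiltered(&self) -> bool {
        let used: u64 = self.buckets.values().map(|b| b.requests_used).sum();
        self.total_cap() > 0 && used >= self.total_cap()
    }

    /// Fixed shape: only configured IDs contribute to the sum.
    pub fn aggregate_exhausted(&self) -> bool {
        let used: u64 = self
            .script_ids
            .iter()
            .filter_map(|id| self.buckets.get(id))
            .map(|b| b.requests_used)
            .sum();
        self.total_cap() > 0 && used >= self.total_cap()
    }
}
```

With one stale exhausted bucket and one fresh configured ID, the unfiltered variant reports exhaustion while the filtered one does not, which is the regression case described above.
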
@CaptainMirage
Contributor Author

I wanted to keep those for a later update I'm working on, but it makes more sense to just remove them here so they don't confuse others, and add them back together with the other stuff.

Owner

@therealaleph therealaleph left a comment

Thanks, this fixes the blocker I found.

I rechecked head f05601619ac550484f70300801fb2b45077b29bf: is_globally_hard_stopped() now filters the aggregate quota/error sums to the currently configured script_ids, the regression test covers the stale-removed-bucket case, and the hard-stop HTTP response is now 503.

Local verification:

cargo test --all-targets --features ui
249 passed

Approved. I am not merging it this minute only because the v1.9.35 release workflow is currently running from the previous merge; I do not want to stack another main change while that release is mid-flight.


Answered via LLM, Supervised @therealaleph

@CaptainMirage
Contributor Author

alrighty then, in the meantime, could we discuss a way to communicate better? Could we have the Discussions tab open on this repo so people stop using the Issues tab for help? Also, what are your thoughts on a Discord channel? And maybe a contributor chat too, so we can communicate a little easier and don't do extra work when someone is already working on something.

@maybeknott


Hey Mirage, thanks for flagging this clearly.

If you agree, I would like to keep your quota tracker as the canon. Persistent per-account quota state, safety-buffer hard stops, startup restoration, and the 503 global stop form a proper, cleanly implemented bundle, and QuotaTracker is a better home for that than the lightweight in-memory ledger I had.

My plan is to not push #1388 as-is. I’ll split it up and remove the overlapping quota pieces:

  • keep the TunnelMux interactive-priority work as a separate small PR, since that is transport scheduling and does not depend on quota tracking;
  • drop/supersede the local rolling 24h quota ledger from #1388, since your quota_state.json tracker covers the durable quota/account model better;
  • rework the failure-classification part, if still useful, so quota-like failures feed into QuotaTracker instead of maintaining a second quota path;
  • keep transient deployment/route failures separate from quota hard-stops, so a bad route can cool down without marking the account exhausted;
  • rename any future UI concept from “quota/script health” toward “deployment route health” if it only shows transient network route state, cooldowns, timeout strikes, and last failure class.

So the intended layering would be:

  • Quota/account truth: your QuotaTracker
  • Dispatch hard stop: your global hard-stop / per-account hard-stop checks
  • Transient deployment health: a small route-health layer only for non-quota network failures, if needed later
  • Mux scheduling: separate TunnelMux priority PR, independent from quota

That avoids two ledgers trying to answer the same question.
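
As a sketch of what the quota-vs-transient boundary could look like, the status split below follows the comparison table earlier in this thread (429/403 treated as quota-like, 5xx as transient). The `FailureClass` enum and `classify_status` function are hypothetical names, and a real classifier would probably also inspect the response body rather than the status code alone.

```rust
/// Hypothetical failure classes for the layering described above.
#[derive(Debug, PartialEq)]
pub enum FailureClass {
    /// Should feed into the quota tracker (account may be exhausted).
    QuotaLike,
    /// Route-level problem; cool the route down without touching quota state.
    Transient,
    /// Anything else; not a signal for either layer.
    Other,
}

pub fn classify_status(status: u16) -> FailureClass {
    match status {
        429 | 403 => FailureClass::QuotaLike,
        500..=599 => FailureClass::Transient,
        _ => FailureClass::Other,
    }
}
```

Keeping the two classes distinct is what lets a bad route cool down without marking the account exhausted.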

If you agree, I can build a small follow-up directly on top of your feat/quota-tracking branch instead of waiting for it to land. I would keep it narrow and additive: probably extra quota-vs-transient failure classification coverage, a reusable helper cleanup if needed, or a tiny integration point that makes later route-health work consume QuotaTracker cleanly. Then I will separately commit TunnelMux priority, clearer quota-vs-transient failure boundaries, and possibly a later deployment route-health view that only describes network health rather than quota capacity.

So I think the clean path is: your PR owns quota truth, I split my PR into smaller non-overlapping parts, and any quota-adjacent follow-up is either built on your branch with your okay or stacked after your PR lands.

@CaptainMirage
Contributor Author

CaptainMirage commented May 26, 2026

It would make more sense if you wait for my PR to be merged, then from your own fork make a branch off main/upstream for your TunnelMux PR. Keeps both our histories clean, and I'd rather it that way - no stacking.

so please don't branch off my PR or build on top of it without my go-ahead. Wait for it to merge, then work from main.

@CaptainMirage
Contributor Author

alrighty then, in the meantime, could we discuss a way to communicate better? Could we have the Discussions tab open on this repo so people stop using the Issues tab for help? Also, what are your thoughts on a Discord channel? And maybe a contributor chat too, so we can communicate a little easier and don't do extra work when someone is already working on something.

Also, aleph, I would love an answer to this. And if you are able to, it would be nice to merge this PR by the end of the day: I'm working on some optimizations to the relay logic and I need this merged before I change a few things, since the merge diff would be a mess if I edit the files this PR has touched. Thank you!
