feat(discord): internal user token extraction and per-channel incremental export #283

Open
leostar0412 wants to merge 5 commits into cppalliance:develop from leostar0412:feat/discord-token-extraction

Conversation

@leostar0412

@leostar0412 leostar0412 commented Jun 11, 2026

Collaborator

Summary

Adds a compliance-gated workflow to extract and persist Discord user tokens from a dedicated Chrome profile (workspace JSON + extract_discord_tokens), with Docker/noVNC login targets and Makefile helpers mirroring the Slack session flow. run_discord_activity_tracker can load tokens from that JSON at runtime and re-extract on DiscordChatExporter auth failures when ALLOW_INTERNAL_DISCORD_TOKENS is enabled.
Also improves scheduled export reliability:

  • Per-channel incremental lower bounds — each channel resumes from the UTC day start of its own latest stored message (not the guild-wide max), so quiet channels are not skipped.
  • Per-channel per-UTC-day exports — DiscordChatExporter runs once per channel per calendar day in the resolved window.
  • Raw archive merge — daily JSON under raw/discord_activity_tracker/<server_id>/<channel_id>/YYYY-MM-DD.json merges by message id.
    Docs, .env.example, SECURITY.md, and Docker/Makefile ops are updated for the new session profile and token paths.
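
A minimal sketch of the per-channel lower bound and per-UTC-day enumeration described in the bullets above. The helper names (`utc_day_start`, `channel_lower_bound`, `iter_export_days`), the inclusive upper bound, and the sample channels are assumptions for illustration; they are not the code in exporter_window.py.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterator, Optional


def utc_day_start(moment: datetime) -> datetime:
    """Return 00:00 UTC of the day containing `moment` (naive input is treated as UTC)."""
    if moment.tzinfo is None:
        moment = moment.replace(tzinfo=timezone.utc)
    moment = moment.astimezone(timezone.utc)
    return moment.replace(hour=0, minute=0, second=0, microsecond=0)


def channel_lower_bound(latest_stored: Optional[datetime], now: datetime) -> datetime:
    """Resume from the channel's own latest message day, or today UTC if it has none."""
    return utc_day_start(latest_stored if latest_stored is not None else now)


def iter_export_days(lower: datetime, upper: datetime) -> Iterator[datetime]:
    """Yield each UTC day start from `lower`'s day through `upper`'s day (inclusive here)."""
    day = utc_day_start(lower)
    last = utc_day_start(upper)
    while day <= last:
        yield day
        day += timedelta(days=1)


# Hypothetical sample data: latest stored message per channel.
now = datetime(2026, 6, 11, 15, 30, tzinfo=timezone.utc)
latest_by_channel = {
    "busy": datetime(2026, 6, 11, 9, 0, tzinfo=timezone.utc),
    "quiet": datetime(2026, 6, 9, 22, 45, tzinfo=timezone.utc),
    "empty": None,
}
for name, latest in latest_by_channel.items():
    lower = channel_lower_bound(latest, now)
    days = [d.date().isoformat() for d in iter_export_days(lower, now)]
    print(name, days)
```

With a guild-wide maximum, the "quiet" channel would resume from 2026-06-11 and miss June 9 and 10; with its own bound it replays all three days.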

Apps touched

  • discord_activity_tracker
  • config (settings)
  • docker-compose, Makefile, scripts (ops)
  • docs (Docker, Workspace, discord_chat_exporter, service_api)

Test plan

  • python -m pytest discord_activity_tracker/tests/ -v
  • uv run pyright (if typed code changed)
  • lint-imports (if imports or cross-app coupling changed)
  • App command smoke-tested (if collector/command changed):
# Token extraction (stop discord-chromium first to avoid LevelDB lock)
make discord-tokens-refresh
# or: make extract-discord-tokens
python manage.py extract_discord_tokens
# Incremental export (per-channel bounds when --since omitted)
python manage.py run_discord_activity_tracker --dry-run
python manage.py run_discord_activity_tracker

Docs / coupling

  • cross-app-dependencies.md updated (if FKs or cross-app imports changed)
  • python scripts/generate_service_docs.py run (if services.py or core/protocols.py changed)
  • App README or docs/ updated (if behavior or ops changed)

Closes #278

Summary by CodeRabbit

  • New Features

    • Per-channel, per-UTC-day Discord exports with incremental sync; exporter merges same-day messages into per-day archives.
    • Browser-driven Discord session tooling with commands to extract/refresh user session tokens and automatic re-extract-on-auth-failure.
  • Configuration

    • New env flags and workspace paths to enable internal token extraction and point to a Chrome profile; Docker/compose and Make targets updated to support the workflow.
  • Documentation

    • Expanded docs covering Discord session setup, token extraction, and workspace layout.
  • Tests

    • Added extensive tests for token extraction, exporter windows, archiving, CLI and command behaviors.

@coderabbitai

coderabbitai Bot commented Jun 11, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 55285c4d-3db9-4f3f-af94-17799f6d736c

📥 Commits

Reviewing files that changed from the base of the PR and between 5904a6e and 113c7d2.

📒 Files selected for processing (3)
  • discord_activity_tracker/README.md
  • discord_activity_tracker/management/commands/run_discord_activity_tracker.py
  • discord_activity_tracker/tests/test_run_command_coverage.py
✅ Files skipped from review due to trivial changes (1)
  • discord_activity_tracker/README.md
🚧 Files skipped from review as they are similar to previous changes (2)
  • discord_activity_tracker/tests/test_run_command_coverage.py
  • discord_activity_tracker/management/commands/run_discord_activity_tracker.py

📝 Walkthrough

Adds workspace-configured internal Discord user-token extraction (from the Chrome profile's LevelDB), token JSON persistence, and a re-extract flow. Refactors the exporter to produce per-channel, per-day outputs and merges exporter JSON into per-day archives by message id. Also updates orchestration, Docker/Make tooling, scripts, and docs, and adds comprehensive tests.
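
The re-extract flow can be pictured roughly as a gated, retry-once wrapper. This is a sketch under assumptions: `ExporterAuthError`, `export_with_reauth`, and the callables passed in are hypothetical names, and the real command may structure the retry differently.

```python
from typing import Callable, List


class ExporterAuthError(Exception):
    """Hypothetical error raised when the exporter rejects the token."""


def export_with_reauth(
    run_export: Callable[[str], str],
    token: str,
    re_extract: Callable[[], str],
    allow_internal_tokens: bool,
) -> str:
    """Run the export; on an auth failure, re-extract once and retry if allowed."""
    try:
        return run_export(token)
    except ExporterAuthError:
        if not allow_internal_tokens:
            raise  # gate closed: surface the auth failure unchanged
        fresh_token = re_extract()
        return run_export(fresh_token)  # a second failure propagates to the caller


# Stand-in exporter: rejects the "stale" token, accepts anything else.
calls: List[str] = []


def fake_export(tok: str) -> str:
    calls.append(tok)
    if tok == "stale":
        raise ExporterAuthError("401")
    return "ok:" + tok


result = export_with_reauth(fake_export, "stale", lambda: "fresh", allow_internal_tokens=True)
print(result, calls)
```

Retrying exactly once keeps a permanently invalid session from looping; the flag check mirrors the ALLOW_INTERNAL_DISCORD_TOKENS gate named in the PR.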

Changes

Discord Internal Token Extraction and Per-Channel Daily Export

| Layer / File(s) | Summary |
|---|---|
| Configuration and workspace paths: .env.example, config/settings.py, discord_activity_tracker/workspace.py, SECURITY.md | Adds ALLOW_INTERNAL_DISCORD_TOKENS, DISCORD_INTERNAL_TOKENS_JSON, DISCORD_CHROME_PROFILE_PATH and workspace path helpers; documents internal token file location in SECURITY.md. |
| Docker & Make session plumbing: docker-compose.yml, Makefile, scripts/wait_discord_chrome_profile.sh | Adds discord-chromium service, DISCORD_CHROME_PROFILE_PATH env propagation, Make targets for discord session/token extraction, and a script to wait for Chrome profile readiness. |
| Chrome LevelDB token extraction utilities: discord_activity_tracker/utils/discord_tokens.py, tests | Implements profile validation, LevelDB read/retry, token parsing, Discord API probes (/users/@me), exporter auth-error detection, and extract_discord_token_auto(). Tests cover parsing, probing, LevelDB and path resolution. |
| Internal token JSON store & loader: discord_activity_tracker/utils/discord_internal_tokens_store.py, tests | Atomic JSON save/load, permission best-effort, gating by ALLOW_INTERNAL_DISCORD_TOKENS, probe + re-extract fallback, and get_or_load_discord_user_token() entrypoint with tests for missing/stale/invalid flows. |
| Extract tokens management command: discord_activity_tracker/management/commands/extract_discord_tokens.py, tests | Django command to extract an internal token from Chrome profile and write workspace JSON; validates settings, profile path, and surfaces clear errors. |
| Per-channel export window helpers: discord_activity_tracker/sync/exporter_window.py, tests | UTC day normalization, per-channel latest-message lookup, incremental lower-bound derivation, and iter_channel_export_days to enumerate per-channel per-day windows. |
| ChannelDayExport contract & chat exporter refactor: discord_activity_tracker/sync/chat_exporter.py, tests | Adds ChannelDayExport dataclass; exports per channel per UTC day, handles empty exports, supports per_channel_incremental mode, and retries auth-failures after token re-extraction. |
| Raw archive merging by day & dedupe: discord_activity_tracker/sync/raw_archive.py, tests | Merges exporter JSON into per-channel YYYY-MM-DD.json archives, filters by UTC day, deduplicates/updates messages by id, refreshes metadata, and writes atomically. |
| Main run command & task changes: discord_activity_tracker/management/commands/run_discord_activity_tracker.py, tests | Loads token via get_or_load_discord_user_token(), _resolve_exporter_date_bounds returns per_channel_incremental, task_discord_sync accepts and forwards flag, consumes ChannelDayExport list, and merges staging exports into per-day archives. |
| Tests, docs, and minor scripts: many tests/*, docs/*, scripts/clean-macos.sh | Adds and updates unit tests for tokens, store, exporter window, raw archive, command integration, updates docs (Docker, Workspace, operations), and tweaks macOS cleanup script. |
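
To illustrate the "merge by message id, write atomically" row, here is a rough sketch. It assumes the daily file holds a top-level `messages` list whose entries carry `id` and `timestamp` fields; the function names are hypothetical and the metadata refresh and UTC-day filter that raw_archive.py performs are left out.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Dict, List


def merge_messages_by_id(existing: List[dict], incoming: List[dict]) -> List[dict]:
    """Merge two message lists; an incoming message replaces a stored one with the same id."""
    merged: Dict[str, dict] = {m["id"]: m for m in existing}
    for message in incoming:
        merged[message["id"]] = message
    # Sort by timestamp, then id, so repeated merges produce a stable file.
    return sorted(merged.values(), key=lambda m: (m.get("timestamp", ""), m["id"]))


def write_json_atomic(path: Path, payload: dict) -> None:
    """Write to a temp file in the target directory, then rename it over the target."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, ensure_ascii=False)
        os.replace(tmp_name, path)  # readers see either the old or the new file
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def merge_day_archive(path: Path, incoming: List[dict]) -> int:
    """Merge `incoming` into the daily archive at `path`; return the merged message count."""
    existing: List[dict] = []
    if path.exists():
        existing = json.loads(path.read_text(encoding="utf-8")).get("messages", [])
    merged = merge_messages_by_id(existing, incoming)
    write_json_atomic(path, {"messages": merged})
    return len(merged)
```

Re-running an export for the same channel and day is then idempotent: unchanged messages are rewritten as-is, edited ones are updated, and new ones are appended in timestamp order.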

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • jonathanMLDev
  • snowfox1003
  • wpak-ai

Poem

🐰 In LevelDB burrowed deep I found a key,

I nudged it gently, set the token free.
Per-channel days, each message kept in line,
Merged and deduped — a tidy archive fine.
Hop, hop, the rabbit logs, the sync ran true.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/service_api/discord_activity_tracker.md (1)

85-97: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Document the per-channel incremental lower bound.

This still reads like a guild-wide “latest DB message” resume path, but run_discord_activity_tracker now resumes each channel from its own latest stored message (or today UTC when empty). The current wording can send operators back to the quiet-channel skip behavior this PR is fixing.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/service_api/discord_activity_tracker.md` around lines 85 - 97, Update
the docs text around run_discord_activity_tracker/DiscordChatExporter to clarify
that the resume lower bound is per-channel, not guild-wide: replace the phrase
"the lower bound is the latest stored message time for this guild (and channel
allowlist)" with wording that each channel resumes from its own latest stored
message time (and if a channel has no stored rows, only today UTC for that
channel is exported), and also ensure the note about merging raw archives
(`YYYY-MM-DD.json`) and the behavior when both --since and --until are set still
applies per-channel; reference run_discord_activity_tracker and
DiscordChatExporter in the doc so readers know where the behavior is
implemented.
🧹 Nitpick comments (1)
discord_activity_tracker/sync/raw_archive.py (1)

92-98: ⚡ Quick win

Fix docstring indentation.

The closing line of the docstring (line 98) is indented with 2 spaces instead of 4, inconsistent with Python docstring conventions.

📝 Proposed fix
     Returns the number of messages written to the merged file.
-  """
+    """
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@discord_activity_tracker/sync/raw_archive.py` around lines 92 - 98, The
docstring block in discord_activity_tracker/sync/raw_archive.py that documents
merging exporter JSON has its closing triple-quote indented by 2 spaces; adjust
the closing triple-quote to use 4-space indentation so it lines up with the
other lines of the docstring (making the entire docstring consistently indented
and PEP-style), i.e., move the closing """ to the same indentation level as the
opening docstring in that function/module.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@discord_activity_tracker/sync/chat_exporter.py`:
- Around line 546-553: The current logic sets explicit_after = after_date if not
per_channel_incremental else None which drops a caller-supplied --since when
per_channel_incremental is True; change this so explicit_after always carries
the provided after_date (i.e., do not null it based on per_channel_incremental)
and pass that explicit_after through to resolve_channel_export_after(guild_id,
ch_id, explicit_after=explicit_after) so per-channel incremental uses each
channel's checkpoint only when no explicit --since was supplied. Ensure
resolve_channel_export_after still prefers explicit_after when present.

In `@discord_activity_tracker/sync/exporter_window.py`:
- Around line 104-108: Normalize naive `before` to UTC the same way `after` and
`now` are handled: when computing `upper` in exporter_window.py, treat a naive
`before` as UTC instead of local time by checking `before.tzinfo` and using
`before.replace(tzinfo=timezone.utc)` for naive datetimes, otherwise call
`before.astimezone(timezone.utc)`; keep the existing fallback to `now`. This
ensures `upper`, `before`, `after`, and `now` are consistently UTC-aware.
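
A small sketch of the normalization described here, with hypothetical helper names (`as_utc`, `resolve_upper`). It relies on the documented behavior that `datetime.astimezone` treats a naive value as system local time, which is why naive inputs are tagged with UTC via `replace` instead.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def as_utc(value: datetime) -> datetime:
    """Return a UTC-aware datetime; naive input is interpreted as already being UTC."""
    if value.tzinfo is None:
        return value.replace(tzinfo=timezone.utc)  # attach UTC, keep the clock time
    return value.astimezone(timezone.utc)  # convert aware values to UTC


def resolve_upper(before: Optional[datetime], now: datetime) -> datetime:
    """Upper bound of the export window: `before` if given, otherwise `now`."""
    return as_utc(before) if before is not None else as_utc(now)


now = datetime(2026, 6, 11, 15, 30, tzinfo=timezone.utc)
naive_before = datetime(2026, 6, 11, 12, 0)
aware_before = datetime(2026, 6, 11, 12, 0, tzinfo=timezone(timedelta(hours=2)))

print(resolve_upper(naive_before, now))  # 12:00 UTC, independent of the host time zone
print(resolve_upper(aware_before, now))  # 10:00 UTC after conversion from +02:00
print(resolve_upper(None, now))          # falls back to now
```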

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3abf2f92-aaff-4575-b53a-9d81ca93cdf9

📥 Commits

Reviewing files that changed from the base of the PR and between 2347ffd and 94e895a.

📒 Files selected for processing (32)
  • .env.example
  • Makefile
  • SECURITY.md
  • config/settings.py
  • discord_activity_tracker/README.md
  • discord_activity_tracker/management/commands/extract_discord_tokens.py
  • discord_activity_tracker/management/commands/run_discord_activity_tracker.py
  • discord_activity_tracker/sync/chat_exporter.py
  • discord_activity_tracker/sync/exporter_window.py
  • discord_activity_tracker/sync/raw_archive.py
  • discord_activity_tracker/tests/test_chat_exporter_branch_coverage.py
  • discord_activity_tracker/tests/test_discord_internal_tokens_store.py
  • discord_activity_tracker/tests/test_discord_tokens.py
  • discord_activity_tracker/tests/test_exporter_window.py
  • discord_activity_tracker/tests/test_extract_discord_tokens_command.py
  • discord_activity_tracker/tests/test_raw_archive.py
  • discord_activity_tracker/tests/test_run_command_coverage.py
  • discord_activity_tracker/tests/test_run_discord_activity_tracker_command.py
  • discord_activity_tracker/tests/test_sync_chat_exporter.py
  • discord_activity_tracker/tests/test_task_discord_sync_coverage.py
  • discord_activity_tracker/tests/test_workspace.py
  • discord_activity_tracker/utils/__init__.py
  • discord_activity_tracker/utils/discord_internal_tokens_store.py
  • discord_activity_tracker/utils/discord_tokens.py
  • discord_activity_tracker/workspace.py
  • docker-compose.yml
  • docs/Docker.md
  • docs/Workspace.md
  • docs/operations/discord_chat_exporter.md
  • docs/service_api/discord_activity_tracker.md
  • scripts/clean-macos.sh
  • scripts/wait_discord_chrome_profile.sh

Comment thread discord_activity_tracker/sync/chat_exporter.py Outdated
Comment thread discord_activity_tracker/sync/exporter_window.py Outdated
Comment thread discord_activity_tracker/management/commands/run_discord_activity_tracker.py Outdated
jonathanMLDev
jonathanMLDev previously approved these changes Jun 12, 2026
Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Discord message collection: end-to-end collector bootstrap

2 participants