Skip to content

Embedder#1

Open
ArnabChatterjee20k wants to merge 17 commits into
masterfrom
embedder
Open

Embedder#1
ArnabChatterjee20k wants to merge 17 commits into
masterfrom
embedder

Conversation

@ArnabChatterjee20k
Copy link
Copy Markdown
Member

What does this PR do?

(Provide a description of what this PR does.)

Test Plan

(Write your test plan here. If you changed any code, please provide us with clear instructions on how you verified your changes work.)

Related PRs and Issues

(If this PR is related to any other PR or resolves any issue or related to any issue link all related PR and issues here.)

Have you read the Contributing Guidelines on issues?

(Write your answer here.)

- Introduced `EmbeddingConfig` struct for managing embedding model configurations.
- Implemented `EmbeddingClient` for handling model initialization and embedding generation.
- Added environment variable support for model selection, cache directory, and pool size.
- Created a main function to demonstrate model loading and embedding generation with sample texts.
- Integrated `fastembed` library for embedding functionalities.
@ArnabChatterjee20k
Copy link
Copy Markdown
Member Author

@copilot create a pr fixing the dockerfile

Copy link
Copy Markdown

Copilot AI commented May 22, 2026

@ArnabChatterjee20k I've opened a new pull request, #2, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI and others added 6 commits May 22, 2026 12:33
Agent-Logs-Url: https://github.com/appwrite/embedding/sessions/4f5decfe-9143-42bd-b0e6-811ed2616fcb

Co-authored-by: ArnabChatterjee20k <83803257+ArnabChatterjee20k@users.noreply.github.com>
Agent-Logs-Url: https://github.com/appwrite/embedding/sessions/4f5decfe-9143-42bd-b0e6-811ed2616fcb

Co-authored-by: ArnabChatterjee20k <83803257+ArnabChatterjee20k@users.noreply.github.com>
Agent-Logs-Url: https://github.com/appwrite/embedding/sessions/5bc8a2a9-7dfa-400d-9834-850b8f96eb8d

Co-authored-by: ArnabChatterjee20k <83803257+ArnabChatterjee20k@users.noreply.github.com>
Set default runtime model cache path and fix Docker CI compatibility
@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented May 26, 2026

Greptile Summary

This PR introduces a new Rust-based embedding microservice that exposes a single POST /embed HTTP endpoint backed by a pool of ONNX model instances (via fastembed). It includes dynamic pool-size calculation based on measured runtime memory footprint, a round-robin dispatch strategy, a Docker image with a non-root user, and a GitHub Actions CI pipeline.

  • Core logic (src/embedding.rs): Loads one or more TextEmbedding instances into a Mutex-guarded pool, runs a warmup inference to capture the real ONNX arena size, then caps the pool to what available RAM can support.
  • HTTP layer (src/main.rs): Minimal Axum server with input validation and structured JSON error responses; no health endpoint or graceful shutdown hook.
  • CI (.github/workflows/ci.yml): Runs fmt, clippy, unit tests, and a Docker build — but the push trigger names main while the repository default branch is master.

Confidence Score: 3/5

The service needs a few fixes before merging: CI will not validate master-branch merges, and .env is tracked by git without a gitignore guard.

The CI branch mismatch means merges to the primary branch will never trigger the test or docker-build jobs, leaving master unvalidated by automation. Committing .env without a gitignore entry creates an ongoing risk of accidental secret exposure. These two issues together with performance concerns in the pool dispatch and sub-batch sizing path indicate the service needs fixes before it is ready to merge.

.github/workflows/ci.yml (branch mismatch), .env (missing gitignore entry), src/embedding.rs (pool dispatch and per-request sysinfo call)

Security Review

  • .env is committed to the repository and absent from .gitignore. Current values are non-sensitive defaults, but the pattern means any secret added in the future will land in git history.

Important Files Changed

Filename Overview
.github/workflows/ci.yml CI workflow with fmt, clippy, test, and docker build jobs. Push trigger targets 'main' but the repo default branch is 'master', so CI never runs on direct pushes to the primary branch.
.env Default environment config committed to git with no .gitignore entry. Current values are benign but establishes a risky pattern.
src/embedding.rs Core embedding logic with pool initialization and async batch dispatch. Round-robin selection ignores slot availability and sub_batch re-reads system memory on every request.
src/main.rs Minimal Axum server with basic input validation. Missing graceful shutdown signal handling.
src/lib.rs Thin re-export module; no issues.
tests/embed_e2e.rs E2e tests correctly marked #[ignore] to avoid downloading ONNX models in CI.
Dockerfile Multi-stage build with a non-root user and dedicated model cache directory.
Cargo.toml Standard Rust manifest with pinned dependency versions; no issues.

Reviews (1): Last reviewed commit: "Merge remote-tracking branch 'origin/emb..." | Re-trigger Greptile

Comment thread .github/workflows/ci.yml
Comment on lines +3 to +6
on:
push:
branches: [main]
pull_request:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 CI push trigger targets main, but the repository's default branch (and the base for this PR) is master. As-is, the test and docker jobs will never fire on a direct push to master, so merges to the primary branch go unvalidated by CI.

Suggested change
on:
push:
branches: [main]
pull_request:
on:
push:
branches: [master]
pull_request:

Comment thread .env
Comment on lines +1 to +3
EMBEDDING_DEFAULT_MODEL=nomic-embed-text
EMBEDDING_CACHE_DIR=./models
EMBEDDING_POOL_SIZE=2
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 security .env committed without a .gitignore entry

The .gitignore only ignores /target and .fastembed_cache/; .env is not listed. The current values are benign defaults, but committing .env to version control means any future secret added here (e.g. a Hugging Face token, a remote model endpoint key) will end up in git history. The conventional approach is to commit a .env.example with placeholder values and add .env to .gitignore.

Comment thread src/embedding.rs
Comment on lines +284 to +288
/// Round-robin acquire a model instance from the pool.
fn acquire(&self) -> Arc<Mutex<TextEmbedding>> {
let idx = next_index(&self.next, self.pool.len());
self.pool[idx].clone()
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Round-robin acquires without availability awareness

acquire() uses a simple incrementing modulo to pick a pool slot, but does not check whether that slot is currently locked. Under concurrent requests with more chunks than pool instances, multiple chunks can queue behind the same Mutex while another slot sits idle. A semaphore-based or try_lock-with-fallback approach would use all available instances evenly.

Comment thread src/embedding.rs
Comment on lines +297 to +315
fn compute_sub_batch(dimension: usize, gpu: bool) -> usize {
let mut sys = sysinfo::System::new();
sys.refresh_memory();
let available_mb = sys.available_memory() / (1024 * 1024);

if available_mb == 0 {
return 32;
}

// Per-text memory estimate for ONNX inference. Attention matrices
// dominate: heads × seq² × 4 bytes. For 768-dim BERT-like models
// (12 heads) processing ~1000-2000 token code chunks, attention alone
// is 50-200 MB per text. The estimate below is conservative so the
// sub-batch stays small enough to prevent arena over-allocation.
let mb_per_text: u64 = if dimension >= 768 { 100 } else { 40 };
let budget_mb = available_mb / 2;
let max_batch = if gpu { 256 } else { 16 };
(budget_mb / mb_per_text).clamp(4, max_batch) as usize
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 sysinfo system call on every embed request

compute_sub_batch allocates a new sysinfo::System and calls refresh_memory() on every invocation of embed() when sub_batch_override is 0. Under moderate concurrency this becomes a repeated syscall per request. Computing it once at EmbeddingClient construction and storing it as a field would avoid the overhead.

Comment thread src/main.rs
Comment on lines +64 to +85
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
tracing_subscriber::fmt()
.with_env_filter(
tracing_subscriber::EnvFilter::try_from_default_env()
.unwrap_or_else(|_| tracing_subscriber::EnvFilter::new("info")),
)
.init();

let config = EmbeddingConfig::from_env();
let client = Arc::new(EmbeddingClient::new(config)?);
let state = AppState { client };

let app = Router::new().route("/embed", post(embed)).with_state(state);

let addr = std::env::var("BIND_ADDR").unwrap_or_else(|_| "0.0.0.0:3000".to_string());
let listener = tokio::net::TcpListener::bind(&addr).await?;
tracing::info!("listening on {}", addr);
axum::serve(listener, app).await?;

Ok(())
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 No graceful shutdown handling

axum::serve is awaited directly with no signal hook. When the container receives SIGTERM, the process is killed immediately and any in-flight embedding requests are dropped. Axum supports a .with_graceful_shutdown(signal) overload that drains connections before exit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants