Embedder#1
Conversation
- Introduced `EmbeddingConfig` struct for managing embedding model configurations. - Implemented `EmbeddingClient` for handling model initialization and embedding generation. - Added environment variable support for model selection, cache directory, and pool size. - Created a main function to demonstrate model loading and embedding generation with sample texts. - Integrated `fastembed` library for embedding functionalities.
… and add integration tests
…rkflow, and README documentation
|
@copilot create a pr fixing the dockerfile |
|
@ArnabChatterjee20k I've opened a new pull request, #2, to work on those changes. Once the pull request is ready, I'll request review from you. |
Agent-Logs-Url: https://github.com/appwrite/embedding/sessions/4f5decfe-9143-42bd-b0e6-811ed2616fcb Co-authored-by: ArnabChatterjee20k <83803257+ArnabChatterjee20k@users.noreply.github.com>
Agent-Logs-Url: https://github.com/appwrite/embedding/sessions/4f5decfe-9143-42bd-b0e6-811ed2616fcb Co-authored-by: ArnabChatterjee20k <83803257+ArnabChatterjee20k@users.noreply.github.com>
Agent-Logs-Url: https://github.com/appwrite/embedding/sessions/5bc8a2a9-7dfa-400d-9834-850b8f96eb8d Co-authored-by: ArnabChatterjee20k <83803257+ArnabChatterjee20k@users.noreply.github.com>
Set default runtime model cache path and fix Docker CI compatibility
Greptile SummaryThis PR introduces a new Rust-based embedding microservice that exposes a single
Confidence Score: 3/5The service needs a few fixes before merging: CI will not validate master-branch merges, and .env is tracked by git without a gitignore guard. The CI branch mismatch means merges to the primary branch will never trigger the test or docker-build jobs, leaving master unvalidated by automation. Committing .env without a gitignore entry creates an ongoing risk of accidental secret exposure. These two issues together with performance concerns in the pool dispatch and sub-batch sizing path indicate the service needs fixes before it is ready to merge. .github/workflows/ci.yml (branch mismatch), .env (missing gitignore entry), src/embedding.rs (pool dispatch and per-request sysinfo call)
|
| Filename | Overview |
|---|---|
| .github/workflows/ci.yml | CI workflow with fmt, clippy, test, and docker build jobs. Push trigger targets 'main' but the repo default branch is 'master', so CI never runs on direct pushes to the primary branch. |
| .env | Default environment config committed to git with no .gitignore entry. Current values are benign but establishes a risky pattern. |
| src/embedding.rs | Core embedding logic with pool initialization and async batch dispatch. Round-robin selection ignores slot availability and sub_batch re-reads system memory on every request. |
| src/main.rs | Minimal Axum server with basic input validation. Missing graceful shutdown signal handling. |
| src/lib.rs | Thin re-export module; no issues. |
| tests/embed_e2e.rs | E2e tests correctly marked #[ignore] to avoid downloading ONNX models in CI. |
| Dockerfile | Multi-stage build with a non-root user and dedicated model cache directory. |
| Cargo.toml | Standard Rust manifest with pinned dependency versions; no issues. |
Reviews (1): Last reviewed commit: "Merge remote-tracking branch 'origin/emb..." | Re-trigger Greptile
| on: | ||
| push: | ||
| branches: [main] | ||
| pull_request: |
There was a problem hiding this comment.
CI push trigger targets
main, but the repository's default branch (and the base for this PR) is master. As-is, the test and docker jobs will never fire on a direct push to master, so merges to the primary branch go unvalidated by CI.
| on: | |
| push: | |
| branches: [main] | |
| pull_request: | |
| on: | |
| push: | |
| branches: [master] | |
| pull_request: |
| EMBEDDING_DEFAULT_MODEL=nomic-embed-text | ||
| EMBEDDING_CACHE_DIR=./models | ||
| EMBEDDING_POOL_SIZE=2 |
There was a problem hiding this comment.
.env committed without a .gitignore entry
The .gitignore only ignores /target and .fastembed_cache/; .env is not listed. The current values are benign defaults, but committing .env to version control means any future secret added here (e.g. a Hugging Face token, a remote model endpoint key) will end up in git history. The conventional approach is to commit a .env.example with placeholder values and add .env to .gitignore.
| /// Round-robin acquire a model instance from the pool. | ||
| fn acquire(&self) -> Arc<Mutex<TextEmbedding>> { | ||
| let idx = next_index(&self.next, self.pool.len()); | ||
| self.pool[idx].clone() | ||
| } |
There was a problem hiding this comment.
Round-robin acquires without availability awareness
acquire() uses a simple incrementing modulo to pick a pool slot, but does not check whether that slot is currently locked. Under concurrent requests with more chunks than pool instances, multiple chunks can queue behind the same Mutex while another slot sits idle. A semaphore-based or try_lock-with-fallback approach would use all available instances evenly.
| fn compute_sub_batch(dimension: usize, gpu: bool) -> usize { | ||
| let mut sys = sysinfo::System::new(); | ||
| sys.refresh_memory(); | ||
| let available_mb = sys.available_memory() / (1024 * 1024); | ||
|
|
||
| if available_mb == 0 { | ||
| return 32; | ||
| } | ||
|
|
||
| // Per-text memory estimate for ONNX inference. Attention matrices | ||
| // dominate: heads × seq² × 4 bytes. For 768-dim BERT-like models | ||
| // (12 heads) processing ~1000-2000 token code chunks, attention alone | ||
| // is 50-200 MB per text. The estimate below is conservative so the | ||
| // sub-batch stays small enough to prevent arena over-allocation. | ||
| let mb_per_text: u64 = if dimension >= 768 { 100 } else { 40 }; | ||
| let budget_mb = available_mb / 2; | ||
| let max_batch = if gpu { 256 } else { 16 }; | ||
| (budget_mb / mb_per_text).clamp(4, max_batch) as usize | ||
| } |
There was a problem hiding this comment.
sysinfo system call on every embed request
compute_sub_batch allocates a new sysinfo::System and calls refresh_memory() on every invocation of embed() when sub_batch_override is 0. Under moderate concurrency this becomes a repeated syscall per request. Computing it once at EmbeddingClient construction and storing it as a field would avoid the overhead.
| #[tokio::main] | ||
| async fn main() -> Result<(), Box<dyn std::error::Error + Send + Sync>> { | ||
| tracing_subscriber::fmt() | ||
| .with_env_filter( | ||
| tracing_subscriber::EnvFilter::try_from_default_env() | ||
| .unwrap_or_else(|_| tracing_subscriber::EnvFilter::new("info")), | ||
| ) | ||
| .init(); | ||
|
|
||
| let config = EmbeddingConfig::from_env(); | ||
| let client = Arc::new(EmbeddingClient::new(config)?); | ||
| let state = AppState { client }; | ||
|
|
||
| let app = Router::new().route("/embed", post(embed)).with_state(state); | ||
|
|
||
| let addr = std::env::var("BIND_ADDR").unwrap_or_else(|_| "0.0.0.0:3000".to_string()); | ||
| let listener = tokio::net::TcpListener::bind(&addr).await?; | ||
| tracing::info!("listening on {}", addr); | ||
| axum::serve(listener, app).await?; | ||
|
|
||
| Ok(()) | ||
| } |
There was a problem hiding this comment.
What does this PR do?
(Provide a description of what this PR does.)
Test Plan
(Write your test plan here. If you changed any code, please provide us with clear instructions on how you verified your changes work.)
Related PRs and Issues
(If this PR is related to any other PR or resolves any issue or related to any issue link all related PR and issues here.)
Have you read the Contributing Guidelines on issues?
(Write your answer here.)