Skip to content

Use ResourceId to refer to Resources#4176

Draft
timon-schelling wants to merge 5 commits into
masterfrom
resource-id-migration
Draft

Use ResourceId to refer to Resources#4176
timon-schelling wants to merge 5 commits into
masterfrom
resource-id-migration

Conversation

@timon-schelling
Copy link
Copy Markdown
Member

No description provided.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a migration from referencing resources by hash to referencing them by a unique ID (ResourceId), backed by a new ResourceRegistry to manage resource sources and garbage collection. Key feedback focuses on making the JSON migration parser more robust against pretty-printed or multi-field objects, removing a strict check that could cause deserialization failures on documents with missing embedded resources, and optimizing the resource garbage collection pipeline by passing HashSet references directly to avoid redundant conversions and allocations.

Comment on lines +1041 to +1074
fn collect_resources_referenced_by_hash(s: &str) -> Vec<(ResourceHash, Range<usize>)> {
let mut out = Vec::new();
let mut offset = 0;

while let Some(i) = s[offset..].find("\"Resource\":") {
let abs = offset + i + 11; // pos after `"Resource":`
offset = abs;

if let Some(j) = s[offset..].find('}') {
let chunk_start = offset;
let chunk = &s[chunk_start..chunk_start + j];

let quotes: Vec<_> = chunk.match_indices('"').collect();
if quotes.len() == 2 {
let q0 = chunk_start + quotes[0].0;
let q1 = chunk_start + quotes[1].0;
let hash = &s[q0 + 1..q1];

if hash.len() == 64
&& hash.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
&& let Ok(hash) = hash.parse::<ResourceHash>()
{
out.push((hash, q0..q1 + 1));
}
}

offset = chunk_start + j + 1;
} else {
break;
}
}

out
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The current implementation of collect_resources_referenced_by_hash relies on finding the next } character and then checking if there are exactly two quotes in that chunk. This is fragile and will fail to detect resource hashes in several common scenarios:

  1. If the JSON is pretty-printed, the } might be on a different line or much further down, and the chunk will contain other fields/quotes.
  2. If there are other fields in the same JSON object after the "Resource" field, the chunk will contain those fields and their quotes, making quotes.len() == 2 false.

We can make this parsing much more robust and simple by directly finding the first and second quotes after "Resource": and extracting the string between them.

	fn collect_resources_referenced_by_hash(s: &str) -> Vec<(ResourceHash, Range<usize>)> {
		let mut out = Vec::new();
		let mut offset = 0;

		while let Some(i) = s[offset..].find("\"Resource\":") {
			let abs = offset + i + 11; // pos after `"Resource":`
			offset = abs;

			if let Some(q0_rel) = s[offset..].find('"') {
				let q0 = offset + q0_rel;
				if let Some(q1_rel) = s[q0 + 1..].find('"') {
					let q1 = q0 + 1 + q1_rel;
					let hash_str = &s[q0 + 1..q1];

					if hash_str.len() == 64
						&& hash_str.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
					{
						if let Ok(hash) = hash_str.parse::<ResourceHash>() {
							out.push((hash, q0..q1 + 1));
						}
					}
					offset = q1 + 1;
				} else {
					break;
				}
			} else {
				break;
			}
		}

		out
	}

Comment on lines +1078 to +1081
// Require each referenced hash to also appear as an embedded resource map key (at least 2 occurrences).
if resources_by_hash.is_empty() || !resources_by_hash.iter().all(|(hash, _)| document_serialized_content.matches(&hash.to_string()).count() >= 2) {
return (document_serialized_content, HashMap::new());
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The .all(...) check requires every referenced hash to appear at least twice in the document (once in the node input and once in the embedded resources). If a single resource reference is dangling or missing from the embedded list, this check will cause the entire migration to be skipped.

Since TaggedValue::Resource now expects a ResourceId (bare integer) instead of a ResourceHash (string), skipping the migration means the document will fail to deserialize entirely and cannot be opened. To prevent data loss and allow the document to open successfully, we should remove this strict check and migrate whatever resources are found.

	if resources_by_hash.is_empty() {
		return (document_serialized_content, HashMap::new());
	}

Comment on lines +3493 to 3501
pub fn used_resources(&self, include_history: bool) -> Box<[ResourceId]> {
let mut resources = HashSet::new();
self.network_interface.collect_used_resources(&mut resources);
if include_history {
self.document_undo_history.iter().for_each(|interface| interface.collect_used_resources(&mut resources));
self.document_redo_history.iter().for_each(|interface| interface.collect_used_resources(&mut resources));
}
resources
resources.into_iter().collect::<Vec<_>>().into_boxed_slice()
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

used_resources collects the used resource IDs into a HashSet, but then converts it to a Vec and then a Box<[ResourceId]>. When this is passed to garbage_collect, the first thing garbage_collect does is collect it back into a HashSet.

This results in redundant allocations and conversions (HashSet -> Vec -> Box -> HashSet). Returning HashSet<ResourceId> directly from used_resources and updating garbage_collect to accept &HashSet<ResourceId> avoids these redundant operations.

Suggested change
pub fn used_resources(&self, include_history: bool) -> Box<[ResourceId]> {
let mut resources = HashSet::new();
self.network_interface.collect_used_resources(&mut resources);
if include_history {
self.document_undo_history.iter().for_each(|interface| interface.collect_used_resources(&mut resources));
self.document_redo_history.iter().for_each(|interface| interface.collect_used_resources(&mut resources));
}
resources
resources.into_iter().collect::<Vec<_>>().into_boxed_slice()
}
pub fn used_resources(&self, include_history: bool) -> HashSet<ResourceId> {
let mut resources = HashSet::new();
self.network_interface.collect_used_resources(&mut resources);
if include_history {
self.document_undo_history.iter().for_each(|interface| interface.collect_used_resources(&mut resources));
self.document_redo_history.iter().for_each(|interface| interface.collect_used_resources(&mut resources));
}
resources
}

Comment on lines +931 to +941
let used_resources = self.used_resources(false);

let embedded_resource_hashes = Vec::from_iter(used_resources.iter().filter_map(|id| match self.resources.registry.info(id) {
Some(ResourceInfo { hash: Some(hash), sources, .. }) if sources.contains(&DataSource::Embedded) => Some(hash),
_ => None,
}))
.into_boxed_slice();

let mut document = self.clone();

document.resources.registry.garbage_collect(used_resources.as_ref());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If used_resources is updated to return HashSet<ResourceId>, we should update this call to pass &used_resources instead of used_resources.as_ref() to match the updated garbage_collect signature.

Suggested change
let used_resources = self.used_resources(false);
let embedded_resource_hashes = Vec::from_iter(used_resources.iter().filter_map(|id| match self.resources.registry.info(id) {
Some(ResourceInfo { hash: Some(hash), sources, .. }) if sources.contains(&DataSource::Embedded) => Some(hash),
_ => None,
}))
.into_boxed_slice();
let mut document = self.clone();
document.resources.registry.garbage_collect(used_resources.as_ref());
let used_resources = self.used_resources(false);
let embedded_resource_hashes = Vec::from_iter(used_resources.iter().filter_map(|id| match self.resources.registry.info(id) {
Some(ResourceInfo { hash: Some(hash), sources, .. }) if sources.contains(&DataSource::Embedded) => Some(hash),
_ => None,
}))
.into_boxed_slice();
let mut document = self.clone();
document.resources.registry.garbage_collect(&used_resources);

Comment on lines +274 to +278
pub fn garbage_collect(&mut self, used: &[ResourceId]) {
let used_set: std::collections::HashSet<ResourceId> = used.iter().copied().collect();
self.sources.retain(|id, _| used_set.contains(id));
self.hashes.retain(|id, _| used_set.contains(id));
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Accepting &HashSet<ResourceId> directly in garbage_collect avoids the need to re-collect a slice into a HashSet on every call, eliminating redundant allocations and hashing operations.

	pub fn garbage_collect(&mut self, used: &std::collections::HashSet<ResourceId>) {
		self.sources.retain(|id, _| used.contains(id));
		self.hashes.retain(|id, _| used.contains(id));
	}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant