Use ResourceId to refer to Resources#4176
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a migration from referencing resources by hash to referencing them by a unique ID (ResourceId), backed by a new ResourceRegistry to manage resource sources and garbage collection. Key feedback focuses on making the JSON migration parser more robust against pretty-printed or multi-field objects, removing a strict check that could cause deserialization failures on documents with missing embedded resources, and optimizing the resource garbage collection pipeline by passing HashSet references directly to avoid redundant conversions and allocations.
| fn collect_resources_referenced_by_hash(s: &str) -> Vec<(ResourceHash, Range<usize>)> { | ||
| let mut out = Vec::new(); | ||
| let mut offset = 0; | ||
|
|
||
| while let Some(i) = s[offset..].find("\"Resource\":") { | ||
| let abs = offset + i + 11; // pos after `"Resource":` | ||
| offset = abs; | ||
|
|
||
| if let Some(j) = s[offset..].find('}') { | ||
| let chunk_start = offset; | ||
| let chunk = &s[chunk_start..chunk_start + j]; | ||
|
|
||
| let quotes: Vec<_> = chunk.match_indices('"').collect(); | ||
| if quotes.len() == 2 { | ||
| let q0 = chunk_start + quotes[0].0; | ||
| let q1 = chunk_start + quotes[1].0; | ||
| let hash = &s[q0 + 1..q1]; | ||
|
|
||
| if hash.len() == 64 | ||
| && hash.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f')) | ||
| && let Ok(hash) = hash.parse::<ResourceHash>() | ||
| { | ||
| out.push((hash, q0..q1 + 1)); | ||
| } | ||
| } | ||
|
|
||
| offset = chunk_start + j + 1; | ||
| } else { | ||
| break; | ||
| } | ||
| } | ||
|
|
||
| out | ||
| } |
There was a problem hiding this comment.
The current implementation of collect_resources_referenced_by_hash relies on finding the next } character and then checking if there are exactly two quotes in that chunk. This is fragile and will fail to detect resource hashes in several common scenarios:
- If the JSON is pretty-printed, the
}might be on a different line or much further down, and the chunk will contain other fields/quotes. - If there are other fields in the same JSON object after the
"Resource"field, the chunk will contain those fields and their quotes, makingquotes.len() == 2false.
We can make this parsing much more robust and simple by directly finding the first and second quotes after "Resource": and extracting the string between them.
fn collect_resources_referenced_by_hash(s: &str) -> Vec<(ResourceHash, Range<usize>)> {
let mut out = Vec::new();
let mut offset = 0;
while let Some(i) = s[offset..].find("\"Resource\":") {
let abs = offset + i + 11; // pos after `"Resource":`
offset = abs;
if let Some(q0_rel) = s[offset..].find('"') {
let q0 = offset + q0_rel;
if let Some(q1_rel) = s[q0 + 1..].find('"') {
let q1 = q0 + 1 + q1_rel;
let hash_str = &s[q0 + 1..q1];
if hash_str.len() == 64
&& hash_str.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
{
if let Ok(hash) = hash_str.parse::<ResourceHash>() {
out.push((hash, q0..q1 + 1));
}
}
offset = q1 + 1;
} else {
break;
}
} else {
break;
}
}
out
}| // Require each referenced hash to also appear as an embedded resource map key (at least 2 occurrences). | ||
| if resources_by_hash.is_empty() || !resources_by_hash.iter().all(|(hash, _)| document_serialized_content.matches(&hash.to_string()).count() >= 2) { | ||
| return (document_serialized_content, HashMap::new()); | ||
| } |
There was a problem hiding this comment.
The .all(...) check requires every referenced hash to appear at least twice in the document (once in the node input and once in the embedded resources). If a single resource reference is dangling or missing from the embedded list, this check will cause the entire migration to be skipped.
Since TaggedValue::Resource now expects a ResourceId (bare integer) instead of a ResourceHash (string), skipping the migration means the document will fail to deserialize entirely and cannot be opened. To prevent data loss and allow the document to open successfully, we should remove this strict check and migrate whatever resources are found.
if resources_by_hash.is_empty() {
return (document_serialized_content, HashMap::new());
}| pub fn used_resources(&self, include_history: bool) -> Box<[ResourceId]> { | ||
| let mut resources = HashSet::new(); | ||
| self.network_interface.collect_used_resources(&mut resources); | ||
| if include_history { | ||
| self.document_undo_history.iter().for_each(|interface| interface.collect_used_resources(&mut resources)); | ||
| self.document_redo_history.iter().for_each(|interface| interface.collect_used_resources(&mut resources)); | ||
| } | ||
| resources | ||
| resources.into_iter().collect::<Vec<_>>().into_boxed_slice() | ||
| } |
There was a problem hiding this comment.
used_resources collects the used resource IDs into a HashSet, but then converts it to a Vec and then a Box<[ResourceId]>. When this is passed to garbage_collect, the first thing garbage_collect does is collect it back into a HashSet.
This results in redundant allocations and conversions (HashSet -> Vec -> Box -> HashSet). Returning HashSet<ResourceId> directly from used_resources and updating garbage_collect to accept &HashSet<ResourceId> avoids these redundant operations.
| pub fn used_resources(&self, include_history: bool) -> Box<[ResourceId]> { | |
| let mut resources = HashSet::new(); | |
| self.network_interface.collect_used_resources(&mut resources); | |
| if include_history { | |
| self.document_undo_history.iter().for_each(|interface| interface.collect_used_resources(&mut resources)); | |
| self.document_redo_history.iter().for_each(|interface| interface.collect_used_resources(&mut resources)); | |
| } | |
| resources | |
| resources.into_iter().collect::<Vec<_>>().into_boxed_slice() | |
| } | |
| pub fn used_resources(&self, include_history: bool) -> HashSet<ResourceId> { | |
| let mut resources = HashSet::new(); | |
| self.network_interface.collect_used_resources(&mut resources); | |
| if include_history { | |
| self.document_undo_history.iter().for_each(|interface| interface.collect_used_resources(&mut resources)); | |
| self.document_redo_history.iter().for_each(|interface| interface.collect_used_resources(&mut resources)); | |
| } | |
| resources | |
| } |
| let used_resources = self.used_resources(false); | ||
|
|
||
| let embedded_resource_hashes = Vec::from_iter(used_resources.iter().filter_map(|id| match self.resources.registry.info(id) { | ||
| Some(ResourceInfo { hash: Some(hash), sources, .. }) if sources.contains(&DataSource::Embedded) => Some(hash), | ||
| _ => None, | ||
| })) | ||
| .into_boxed_slice(); | ||
|
|
||
| let mut document = self.clone(); | ||
|
|
||
| document.resources.registry.garbage_collect(used_resources.as_ref()); |
There was a problem hiding this comment.
If used_resources is updated to return HashSet<ResourceId>, we should update this call to pass &used_resources instead of used_resources.as_ref() to match the updated garbage_collect signature.
| let used_resources = self.used_resources(false); | |
| let embedded_resource_hashes = Vec::from_iter(used_resources.iter().filter_map(|id| match self.resources.registry.info(id) { | |
| Some(ResourceInfo { hash: Some(hash), sources, .. }) if sources.contains(&DataSource::Embedded) => Some(hash), | |
| _ => None, | |
| })) | |
| .into_boxed_slice(); | |
| let mut document = self.clone(); | |
| document.resources.registry.garbage_collect(used_resources.as_ref()); | |
| let used_resources = self.used_resources(false); | |
| let embedded_resource_hashes = Vec::from_iter(used_resources.iter().filter_map(|id| match self.resources.registry.info(id) { | |
| Some(ResourceInfo { hash: Some(hash), sources, .. }) if sources.contains(&DataSource::Embedded) => Some(hash), | |
| _ => None, | |
| })) | |
| .into_boxed_slice(); | |
| let mut document = self.clone(); | |
| document.resources.registry.garbage_collect(&used_resources); |
| pub fn garbage_collect(&mut self, used: &[ResourceId]) { | ||
| let used_set: std::collections::HashSet<ResourceId> = used.iter().copied().collect(); | ||
| self.sources.retain(|id, _| used_set.contains(id)); | ||
| self.hashes.retain(|id, _| used_set.contains(id)); | ||
| } |
There was a problem hiding this comment.
Accepting &HashSet<ResourceId> directly in garbage_collect avoids the need to re-collect a slice into a HashSet on every call, eliminating redundant allocations and hashing operations.
pub fn garbage_collect(&mut self, used: &std::collections::HashSet<ResourceId>) {
self.sources.retain(|id, _| used.contains(id));
self.hashes.retain(|id, _| used.contains(id));
}
No description provided.