epub: free the parsed document in its owning load functions, not in set_xml_root_node#721
Open
maxheise wants to merge 1 commit into
Open
Conversation
…et_xml_root_node
backend/epub/epub-document.c parses container.xml, the OPF package
document, and the per-page XHTML into the file-scope global xmldocument,
and releases it with the helper xml_free_doc(), which frees the document
and sets the global back to NULL. open_xml_document() assigns
xmldocument = xmlParseFile(...) without freeing any previous value, and a
document load opens these documents in series, so whatever a parse leaves
behind has to be released before the next open_xml_document() overwrites
the only pointer. xmldocument is not freed at document finalize, so a
parse tree left unreleased leaks; a later open_xml_document() then
overwrites the only pointer to it. The values these functions return (a
content URI, the content list, a toc file name, a stylesheet href) are
independent allocations: g_filename_to_uri results, g_malloc'd list
nodes, and xmlGetProp copies that do not point into the document tree, so
each owner can free the document before returning its result.
The ownership rule this follows: the function that opens a document owns
its lifetime and frees it on every exit; set_xml_root_node() only
validates the root element and must not free.
set_xml_root_node() previously broke that rule. On the no-root-element
branch it called a bare xmlFreeDoc(xmldocument), freeing the global as a
side effect of a function named for validation. That free is removed
here; set_xml_root_node() now frees nothing; it only sets the global
xmlroot and reports whether the root matches. Freeing inside it would not
even be safe: two callers pass a non-NULL root name and ignore the return
value (setup_document_index() with "ncx", epub_document_get_info() with
"package"), then call xml_get_pointer_to_node(), which dereferences
xmlroot->name. Freeing the tree inside the helper on a root-name mismatch
would leave those callers dereferencing freed memory. Leaving the release
to the owners also keeps the failure mode benign: a caller that omits a
free leaks rather than uses freed memory.
With the helper no longer freeing, each owner frees the document on the
exits that previously leaked it. set_xml_root_node()'s callers split two
ways:
- get_uri_to_content() and setup_document_content_list() check the
return value and bail; they now free before their bail-out returns
("container file is corrupt"; "content file is invalid"). Their other
early error returns also free ("epub file is invalid or corrupt" with
no rootfile node, "epub file is corrupt, no container" with no
full-path attribute, "could not retrieve container file" when
g_filename_to_uri fails; "epub file has no spine", "epub file has no
manifest", and the errorflag return for a missing or broken manifest
item).
- setup_document_index() ("ncx") and epub_document_get_info()
("package") ignore the return value, but each already calls
xml_free_doc() at its natural end and has no early return after
opening the document, so it keeps a valid tree through its lookups
and frees it at the end. They need no change.
The remaining owners with an early return that bypassed their
end-of-function free also free now:
- get_toc_file_name(): the early return taken when the spine has no
"toc" attribute (ncx == NULL). The in-code comment notes this is the
EPUB3 case with no NCX, where the nav file is used instead, so it is
reached for ordinary EPUB3 documents that have no NCX, not only for
malformed input (ncx == NULL can also arise from malformed package
data). It is called unconditionally from setup_document_index()
during the load.
- epub_document_get_alternate_stylesheet(): the success path, taken
when a <link class="night"> alternate stylesheet is present. The href
is read with xml_get_data_from_node(..., XML_ATTRIBUTE, ...), which
calls xmlGetProp() and returns a fresh copy, so the document is freed
first and that copy is returned afterwards.
xml_free_doc() sets xmldocument to NULL after freeing, so the added calls
are fine on every path; successful return values are unchanged (the same
strings and lists), and only the point at which the parsed tree is
released changes.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Hello,
This moves ownership of the parsed XML document in the EPUB backend to
the functions that open it, and frees it on the load-path returns that
currently leave it allocated.
Background: backend/epub/epub-document.c parses container.xml, the OPF
package document and the per-page XHTML into a single file-scope global,
xmldocument, and releases it with the helper xml_free_doc(), which frees
the document and sets the global back to NULL. open_xml_document() just
assigns xmldocument = xmlParseFile(...) without freeing any previous
value, and a load opens these documents in series, so a tree left
unreleased is lost when the next open overwrites the pointer (it is not
freed at finalize either). The values these functions return (a content
URI, the content list, a toc file name, a stylesheet href) are
independent allocations, so each opener can free the document before
returning.
set_xml_root_node() used to free the global on its no-root-element branch
with a bare xmlFreeDoc(). I have removed that: it now only validates the
root element and frees nothing. Freeing inside it is not safe anyway,
because two callers pass a non-NULL root name and ignore the return value
(setup_document_index with "ncx", epub_document_get_info with "package")
and then call xml_get_pointer_to_node(), which dereferences the root
node; a free inside the helper on a root-name mismatch would leave them
using freed memory. Leaving the release to the openers also keeps the
failure mode benign: a caller that omits a free leaks rather than uses
freed memory.
With the helper no longer freeing, the openers free on the returns that
previously leaked:
set_xml_root_node failure), plus "invalid or corrupt" (no rootfile
node), "no container" (no full-path attribute) and "could not
retrieve container file" (g_filename_to_uri failed).
plus "no spine", "no manifest" and the errorflag return.
attribute. Per the in-code comment this is the EPUB3 case with no
NCX, where the nav file is used instead, so it is reached for ordinary
EPUB3 documents, not only malformed input.
night-mode is present. The href is an xmlGetProp
copy independent of the document, so the document is freed first and
the copy returned afterwards.
setup_document_index and epub_document_get_info already call
xml_free_doc() at their natural ends and have no early return after
opening the document, so they need no change.
xml_free_doc() sets the global to NULL after freeing, so the added calls
are fine on every path; successful return values are unchanged (the same
strings and lists), and only the point at which the parsed tree is
released changes.
Best regards,
Max