Skip to content

epub: free the parsed document in its owning load functions, not in set_xml_root_node#721

Open
maxheise wants to merge 1 commit into
linuxmint:masterfrom
maxheise:epub-free-parsed-document-load-paths
Open

epub: free the parsed document in its owning load functions, not in set_xml_root_node#721
maxheise wants to merge 1 commit into
linuxmint:masterfrom
maxheise:epub-free-parsed-document-load-paths

Conversation

@maxheise

Copy link
Copy Markdown

Hello,

This moves ownership of the parsed XML document in the EPUB backend to
the functions that open it, and frees it on the load-path returns that
currently leave it allocated.

Background: backend/epub/epub-document.c parses container.xml, the OPF
package document and the per-page XHTML into a single file-scope global,
xmldocument, and releases it with the helper xml_free_doc(), which frees
the document and sets the global back to NULL. open_xml_document() just
assigns xmldocument = xmlParseFile(...) without freeing any previous
value, and a load opens these documents in series, so a tree left
unreleased is lost when the next open overwrites the pointer (it is not
freed at finalize either). The values these functions return (a content
URI, the content list, a toc file name, a stylesheet href) are
independent allocations, so each opener can free the document before
returning.

set_xml_root_node() used to free the global on its no-root-element branch
with a bare xmlFreeDoc(). I have removed that: it now only validates the
root element and frees nothing. Freeing inside it is not safe anyway,
because two callers pass a non-NULL root name and ignore the return value
(setup_document_index with "ncx", epub_document_get_info with "package")
and then call xml_get_pointer_to_node(), which dereferences the root
node; a free inside the helper on a root-name mismatch would leave them
using freed memory. Leaving the release to the openers also keeps the
failure mode benign: a caller that omits a free leaks rather than uses
freed memory.

With the helper no longer freeing, the openers free on the returns that
previously leaked:

  • get_uri_to_content(): the "container file is corrupt" return (the
    set_xml_root_node failure), plus "invalid or corrupt" (no rootfile
    node), "no container" (no full-path attribute) and "could not
    retrieve container file" (g_filename_to_uri failed).
  • setup_document_content_list(): the "content file is invalid" return,
    plus "no spine", "no manifest" and the errorflag return.
  • get_toc_file_name(): the early return when the spine has no "toc"
    attribute. Per the in-code comment this is the EPUB3 case with no
    NCX, where the nav file is used instead, so it is reached for ordinary
    EPUB3 documents, not only malformed input.
  • epub_document_get_alternate_stylesheet(): the success path, when a
    night-mode is present. The href is an xmlGetProp
    copy independent of the document, so the document is freed first and
    the copy returned afterwards.

setup_document_index and epub_document_get_info already call
xml_free_doc() at their natural ends and have no early return after
opening the document, so they need no change.

xml_free_doc() sets the global to NULL after freeing, so the added calls
are fine on every path; successful return values are unchanged (the same
strings and lists), and only the point at which the parsed tree is
released changes.

Best regards,
Max

…et_xml_root_node

backend/epub/epub-document.c parses container.xml, the OPF package
document, and the per-page XHTML into the file-scope global xmldocument,
and releases it with the helper xml_free_doc(), which frees the document
and sets the global back to NULL. open_xml_document() assigns
xmldocument = xmlParseFile(...) without freeing any previous value, and a
document load opens these documents in series, so whatever a parse leaves
behind has to be released before the next open_xml_document() overwrites
the only pointer. xmldocument is not freed at document finalize, so a
parse tree left unreleased leaks; a later open_xml_document() then
overwrites the only pointer to it. The values these functions return (a
content URI, the content list, a toc file name, a stylesheet href) are
independent allocations: g_filename_to_uri results, g_malloc'd list
nodes, and xmlGetProp copies that do not point into the document tree, so
each owner can free the document before returning its result.

The ownership rule this follows: the function that opens a document owns
its lifetime and frees it on every exit; set_xml_root_node() only
validates the root element and must not free.

set_xml_root_node() previously broke that rule. On the no-root-element
branch it called a bare xmlFreeDoc(xmldocument), freeing the global as a
side effect of a function named for validation. That free is removed
here; set_xml_root_node() now frees nothing; it only sets the global
xmlroot and reports whether the root matches. Freeing inside it would not
even be safe: two callers pass a non-NULL root name and ignore the return
value (setup_document_index() with "ncx", epub_document_get_info() with
"package"), then call xml_get_pointer_to_node(), which dereferences
xmlroot->name. Freeing the tree inside the helper on a root-name mismatch
would leave those callers dereferencing freed memory. Leaving the release
to the owners also keeps the failure mode benign: a caller that omits a
free leaks rather than uses freed memory.

With the helper no longer freeing, each owner frees the document on the
exits that previously leaked it. set_xml_root_node()'s callers split two
ways:

  - get_uri_to_content() and setup_document_content_list() check the
    return value and bail; they now free before their bail-out returns
    ("container file is corrupt"; "content file is invalid"). Their other
    early error returns also free ("epub file is invalid or corrupt" with
    no rootfile node, "epub file is corrupt, no container" with no
    full-path attribute, "could not retrieve container file" when
    g_filename_to_uri fails; "epub file has no spine", "epub file has no
    manifest", and the errorflag return for a missing or broken manifest
    item).
  - setup_document_index() ("ncx") and epub_document_get_info()
    ("package") ignore the return value, but each already calls
    xml_free_doc() at its natural end and has no early return after
    opening the document, so it keeps a valid tree through its lookups
    and frees it at the end. They need no change.

The remaining owners with an early return that bypassed their
end-of-function free also free now:

  - get_toc_file_name(): the early return taken when the spine has no
    "toc" attribute (ncx == NULL). The in-code comment notes this is the
    EPUB3 case with no NCX, where the nav file is used instead, so it is
    reached for ordinary EPUB3 documents that have no NCX, not only for
    malformed input (ncx == NULL can also arise from malformed package
    data). It is called unconditionally from setup_document_index()
    during the load.
  - epub_document_get_alternate_stylesheet(): the success path, taken
    when a <link class="night"> alternate stylesheet is present. The href
    is read with xml_get_data_from_node(..., XML_ATTRIBUTE, ...), which
    calls xmlGetProp() and returns a fresh copy, so the document is freed
    first and that copy is returned afterwards.

xml_free_doc() sets xmldocument to NULL after freeing, so the added calls
are fine on every path; successful return values are unchanged (the same
strings and lists), and only the point at which the parsed tree is
released changes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant