feat: support script-driven generators #1218

Open: gennaroprota wants to merge 10 commits into cppalliance:develop from gennaroprota:feat/script_driven_generators

@gennaroprota gennaroprota commented Jun 3, 2026

Collaborator

This adds the second flavor of addon-defined output generator, complementing the data-driven Handlebars generators from #1197: a generator backed entirely by a user Lua or JavaScript script, with no C++ and no compilation. The script owns the whole emit loop through a generate(corpus, output) entry point, so it can produce output shapes the per-page generators cannot, such as a single artifact aggregated across every symbol (a search index, for example).
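As a rough illustration of the entry point described above, a JavaScript script might look like the sketch below. Only `generate(corpus, output)` and `output.write` come from this description; the corpus shape (`corpus.symbols` with `name`, `kind`, and `url` fields), the file name, and the stand-in host objects are assumptions made for the example.

```javascript
// Hypothetical script-driven generator: builds one aggregated search index.
// Assumption: the corpus exposes an iterable `symbols` list whose entries
// may carry `name`, `kind`, and `url`. The real corpus DOM may differ.
function generate(corpus, output) {
  const entries = [];
  for (const symbol of corpus.symbols) {
    // Some symbols (for example the global namespace) have no name.
    if (!symbol.name) continue;
    entries.push({ name: symbol.name, kind: symbol.kind, url: symbol.url });
  }
  // A single artifact for the whole corpus, not one file per page.
  output.write("search-index.json", JSON.stringify(entries, null, 2));
}

// Stand-ins for what the host would pass in (hypothetical).
const fakeCorpus = {
  symbols: [
    { kind: "namespace" },
    { name: "parse", kind: "function", url: "parse.html" },
    { name: "Parser", kind: "record", url: "Parser.html" },
  ],
};
const written = new Map();
const fakeOutput = { write: (path, contents) => written.set(path, contents) };

generate(fakeCorpus, fakeOutput);
console.log(written.get("search-index.json"));
```

The stand-in `output` records writes in memory, so the sketch can be run outside MrDocs.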

The design is recorded in the #1216 design comment.

Changes

  • Source: A new src/lib/Gen/script/ module: ScriptGenerator (a Generator whose build() owns the emit loop and invokes the script's generate(corpus, output)), OutputSink (the file-writing API exposed to the script as output.write, resolved under the output directory and forbidden from escaping it), and the Lua and JavaScript runners. The manifest parser and the addon-directory walk are factored into a shared src/lib/Gen/GeneratorManifest.{hpp,cpp}, so the data-driven and script-driven discovery passes read the same mrdocs-generator.yml; src/lib/Gen/hbs/DataDrivenGenerators.{cpp,hpp} now consume it and skip directories that declare a script: entry. script::discoverScriptGenerators is wired into src/tool/GenerateAction.cpp and the test runner alongside the data-driven pass. The generator option's description in src/lib/ConfigOptions.json now mentions the script-driven flavor. A latent Lua-bridge bug is fixed in src/lib/Support/Lua.cpp: domValue_push now marshals Undefined (to nil) and SafeString (to a string) rather than aborting, so reading an absent field (for example the global namespace's empty name) no longer crashes a Lua script.
  • Tests: A unit suite drives both runners against a synthetic corpus, exercises the output writer's path safety (writes under the root; rejects absolute and escaping paths), and checks that discovery installs a ScriptGenerator. A regression test reads a symbol with no name field, covering the Undefined-to-nil marshalling that a real corpus needs.
  • Documentation: A new "Script-driven generators" section in docs/modules/ROOT/pages/generators.adoc, plus the regenerated docs/mrdocs.schema.json.
  • Breaking changes: None. The script-driven flavor is additive; data-driven discovery and output are unchanged.

Testing

  • src/test/lib/Gen/script/ is the unit-level coverage: discovery, the output writer's path-escaping guards, and both the Lua and JavaScript runners producing the expected aggregated file.
  • A script-driven generator's output is an arbitrary tree, which the per-file golden harness (one expected file per source) cannot check; the mechanism is therefore covered by the unit suite rather than by goldens.
  • The Undefined-to-nil regression test guards the Lua-bridge fix, which also benefits the feat: support Lua and JavaScript extensions #1196 corpus-transform scripts that read symbol fields.
  • No CI workflow changes are needed.

Documentation

  • A new "Script-driven generators" section in docs/modules/ROOT/pages/generators.adoc covers the generate(corpus, output) entry point, the output.write API, the script: discovery marker, and a search-index example.
  • The generator option's description in src/lib/ConfigOptions.json now mentions script-driven generators.

@github-actions

github-actions Bot commented Jun 3, 2026

✨ Highlights

  • 🧪 Existing golden tests changed (behavior likely shifted)

🧾 Changes by Scope

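| Scope | Lines Δ% | Lines Δ | Lines + | Lines - | Files Δ | Files + | Files ~ | Files ↔ | Files - |
|---|---|---|---|---|---|---|---|---|---|
| 🛠️ Source | 57% | 1606 | 1278 | 328 | 24 | 6 | 18 | - | - |
| 🥇 Golden Tests | 20% | 564 | 439 | 125 | 32 | 12 | 20 | - | - |
| 🧪 Unit Tests | 17% | 480 | 463 | 17 | 2 | 1 | 1 | - | - |
| 📄 Docs | 3% | 85 | 66 | 19 | 5 | - | 5 | - | - |
| 📚 Examples | 2% | 57 | 57 | - | 4 | 4 | - | - | - |
| 🏗️ Build | <1% | 13 | 13 | - | 1 | - | 1 | - | - |
| Total | 100% | 2805 | 2316 | 489 | 68 | 23 | 45 | - | - |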

Legend: Files + (added), Files ~ (modified), Files ↔ (renamed), Files - (removed)

🔝 Top Files

  • src/test/lib/Gen/script/ScriptGenerator.cpp (Unit Tests): 446 lines Δ (+446 / -0)
  • src/lib/Gen/GeneratorManifest.cpp (Source): 210 lines Δ (+210 / -0)
  • src/lib/Gen/hbs/DataDrivenGenerators.cpp (Source): 203 lines Δ (+31 / -172)

Generated by 🚫 dangerJS against 5b3e0e1

@codecov

codecov Bot commented Jun 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.16%. Comparing base (b28ba4b) to head (2f00b2c).
⚠️ Report is 5 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1218      +/-   ##
===========================================
+ Coverage    81.97%   83.16%   +1.18%     
===========================================
  Files           34       35       +1     
  Lines         3179     3658     +479     
  Branches       743      843     +100     
===========================================
+ Hits          2606     3042     +436     
- Misses         392      409      +17     
- Partials       181      207      +26     
Flag Coverage Δ
bootstrap 83.16% <ø> (+1.18%) ⬆️

@gennaroprota gennaroprota linked an issue Jun 3, 2026 that may be closed by this pull request
@cppalliance-bot

cppalliance-bot commented Jun 3, 2026

An automated preview of the documentation is available at https://1218.mrdocs.prtest2.cppalliance.org/index.html

If more commits are pushed to the pull request, the docs will rebuild at the same URL.

2026-06-12 16:30:20 UTC

@alandefreitas

Collaborator

Nice! Let me split this review into two points:

On the contribution

  • In the documentation example, in what directory would output.write write?
  • How can the user see the configuration values and so on?
  • How can the extension receive its own parameters?
  • The extensions model we had envisioned (and discussed a lot in the last PR) is that a single extension would contain all kinds of functionalities. The entry point is described as generate (in the same file, right?) but I don't see the word "generate" anywhere in the example script of the documentation. Why is that? Is it just my lack of skills with Lua? The transform_corpus example does say function transform_corpus.
  • As a side note, could you please put this new documentation content on a new page? I'll reorganize it later in nav.adoc. I'm working on documentation right now, and this will have its own page.

On the use of AI

I think this is a little out of hand, to be honest. 😅 And it's creating a bottleneck rather than being useful at this point.

The design, and how each of its open questions was resolved, is recorded in the #1216 design comment

I don't find this very useful at all. Let me try to explain what I find useful first:

The way to make contributions more useful and less extractive is by having a human in the loop, adding human entropy, and being able to quickly answer questions about the human decisions. The LLVM AI tool policy (https://llvm.org/docs/AIToolPolicy.html) explains this well. Even the PR summary becomes less and less useful if there's no human in the loop, because it's harder and harder to read without intentionality. Only humans know what decisions were made by humans and why. Letting AI write it doesn't work, because AI doesn't know what was on the human's mind. AI doesn't know what you see as important or not. And the reviewer won't know it if AI generates the report. So that's the only part of the description that is actually useful to the contribution. Everything else could be generated by the reviewer just asking Claude to do it the same way. The reviewer would even get a better reply, because Claude would tailor the answer to the reviewer's preferences.

I think the word loop here is also very important. When asked something like "why did you do that", humans operate on improving the design of things that are highest ROI. Or, at the very least, humans would answer with the things that have the highest ROI first. AI doesn't know what has the highest ROI to humans, so it tends to create very long answers that are completely inappropriate for the audience, doesn't use the internal vocabulary, includes metaphors, doesn't order things by internal importance, or any of that. AI answers are low ROI: return is low because they don't answer what's important first and how to phrase it to the audience, because they can't, and investment is high because you have to read this very long text that gives you no information.

And, of course, there's value in the human taking accountability over the code that was generated. Because without that, the "contribution" becomes about throwing this work on the reviewer. The human should take accountability for the entire code being reviewed and be able to answer questions with a human response immediately and with confidence. It's very hard to pin down a single metric for that because it's very easy to perceive it when it's not there. If that's not the case, and the answers also come from AI, then there's no contribution here because, again, the reviewer could just ask AI the question directly and avoid the friction.

I know we're excited about AI because it generates a lot of code for us, but trying to bypass the human contribution and answering things with AI only creates more bottlenecks. We have 8 open PRs right now because code generation is faster, but CI and reviews are the bottleneck. We can't skip review; we can't merge code that's not good in this project, since it's not a new project. Even if we trust AI, we still have the specification to check. So only humans in the loop can break this bottleneck by helping prioritize with human entropy. Otherwise, we'll just have more and more blocked PRs, and AI will not help us move things forward.

Of course, AI can still help us with text-related tasks. As I type this, Grammarly is making grammar suggestions. But the gist of what I'm saying is human. I often also ask Claude to reorganize text into sections differently, and so on. But the content and its ROI are still accountable, reviewed, and aligned with the vocabulary we use internally.

For instance, AI fluff paragraphs like:

D3 - One context, one call. Build the script context once, evaluate the script once, and call generate once. This matches corpus transforms (a single whole-corpus call), not helpers (many per-node calls), so the existing single-shot extension machinery applies almost unchanged.

just cost me half a minute to read without understanding it. Then, it takes much more time because I have to reverse engineer these distant AI catchphrases ("One context, one call"), metaphors (AI never writes things descriptively, which is very inappropriate in technical contexts), expressions that are not part of the project vocabulary ("single-shot extension"), and these unnecessarily long and complex sentence hierarchies. The whole document would take 15 minutes to read, and it's mostly fluff before getting into the PR. And content-wise, I'm sure no human would say this is high ROI content in the context of this PR. And the way AI works (predicting the next word retroactively) makes it the opposite of a design document because it's retroactively justifying the decisions. And even if the text were good, it doesn't reflect our internal vocabulary at all, which is really weird. And even if all of that were false, the lack of human entropy (to communicate what's on the human's mind) would still mean the reviewer could get a better version of it by asking Claude locally.

And the document is about 50x longer. Meaning it's all this pain times 50. We know that, in practice, most people read the first 10 words, notice it's AI, and ignore it completely. We need a human in the loop somewhere.

@gennaroprota

gennaroprota commented Jun 4, 2026

Collaborator Author

On the contribution

* In the documentation example, in what directory would output.write write?

In the configured output directory, the same one the built-in generators use. I added this to the docs.

* How can the user see the configuration values and so on?

Hmm, they couldn't. Implemented now.

* How can the extension receive its own parameters?

Likewise. generate now has config and params as trailing parameters.

* The extensions model we had envisioned (and discussed a lot in the last PR) is that a single extension would contain all kinds of functionalities.

Hmm... maybe I misunderstood you, but I think we are waiting for that (see issue #1210).

The entry point is described as generate (in the same file, right?) but I don't see the word "generate" anywhere in the example script of the documentation. Why is that? Is it just my lack of skills with Lua? The transform_corpus example does say function transform_corpus.

Yeah. It's a doc inconsistency, not your Lua :-). The example uses Lua's return-an-anonymous-function idiom (return function(corpus, output)...), which our Lua runner happens to accept, so the name never appears. It should be rewritten using function generate(corpus, output). And, yes, the function should be in the file named by script: in mrdocs-generator.yml.

Given the above, I wonder: should we keep accepting the returned-function form? Or should we require generate (like we do for JavaScript)?

* As a side note, could you please put this new documentation content on a new page? I'll reorganize it later in nav.adoc. I'm working on documentation right now, and this will have its own page.

Yes, done.

On the use of AI

Yeah. The PR summaries you generated with AI weren't particularly good either, so I've rewritten them by hand.

@gennaroprota gennaroprota force-pushed the feat/script_driven_generators branch 2 times, most recently from 658ce97 to 8d2066a Compare June 5, 2026 14:11
@alandefreitas

Collaborator

Hi!

This needs a rebase (more than the others) because af3fc83 is going to have an impact on this. The new nav.adoc will also have an impact on the docs.

Or should we require generate (like we do for JavaScript)?

I believe we should require it because one requirement of extensions is they need to be able to provide more than one extension.

We don't have to fix that issue with the "layers of extensions" models in this PR, but it's quite urgent after this because we need to be able to represent this so we can provide meaningful examples for the user that show how to put things together.

@gennaroprota gennaroprota force-pushed the feat/script_driven_generators branch 3 times, most recently from 77d7165 to e51cc45 Compare June 9, 2026 13:40
@alandefreitas

Collaborator

In terms of design, I don’t understand the model we’re at. I think we were in level 1 going to level 2 (I mean, in the way you defined the levels in your issue) but now it seems like we went to some level 0.5. It seems like the other extension system comes from addons/extensions, and these ones come from addons/generator. But that can’t be different. Wasn’t the original idea that an extension could contain a transform_corpus and a generate function? They can’t come from different places even if they weren’t related because the naming wouldn’t be appropriate.

It also seems like one of them requires a manifest and the other doesn’t but that doesn’t seem to make much sense. I think you tried to implement this system at level 2 and left the other system at level 1 if I understand correctly.

Also, it seems like everything I mentioned the user might need in an extension is being appended as an extra parameter but I’m not sure that’s the best design either.

As for the docs, it’d be nice to have the same model as the page for corpus extensions: scripts in both languages in tabs, sections explaining each thing the user can do, and examples interleaved with the concept being explained rather than singled out in a section.

`domValue_push` handled `Null`, `Boolean`, `Integer`, `String`, `Array`,
and `Object`, and aborted via `MRDOCS_UNREACHABLE` for any other kind.
Reading a field whose value is `Undefined` or `SafeString` therefore
crashed a Lua script.

This maps `Undefined` to `nil`, as `Null` already is, and a `SafeString`
the way a `String` is, which matches what happens in JavaScript.
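The kind dispatch this commit describes can be modelled roughly as follows. This is a JavaScript sketch of the mapping only, not the C++ code in src/lib/Support/Lua.cpp; the `{ kind, data }` value shape, the `toLuaValue` name, and the `LUA_NIL` marker are hypothetical.

```javascript
// Hypothetical model of the marshalling rule. LUA_NIL stands in for Lua's nil.
const LUA_NIL = Symbol("nil");

function toLuaValue(value) {
  switch (value.kind) {
    case "Null":
    case "Undefined": // previously unhandled; now nil, as Null already was
      return LUA_NIL;
    case "Boolean":
    case "Integer":
      return value.data;
    case "String":
    case "SafeString": // previously unhandled; now pushed like a String
      return String(value.data);
    case "Array":
      return value.data.map(toLuaValue);
    case "Object":
      return Object.fromEntries(
        Object.entries(value.data).map(([key, field]) => [key, toLuaValue(field)])
      );
    default:
      // The original code aborted here for every kind it did not list.
      throw new Error(`unhandled kind: ${value.kind}`);
  }
}

// A symbol whose name field is absent, such as the global namespace.
const globalNamespace = {
  kind: "Object",
  data: { name: { kind: "Undefined" }, doc: { kind: "SafeString", data: "<b>root</b>" } },
};
const marshalled = toLuaValue(globalNamespace);
console.log(marshalled.name === LUA_NIL, marshalled.doc);
```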

@gennaroprota gennaroprota force-pushed the feat/script_driven_generators branch 2 times, most recently from c4d9eca to c73aa93 Compare June 11, 2026 16:48

This adds a generator flavor backed by a user script. A directory under
<addon>/generator/<name>/ whose mrdocs-generator.yml names a script
installs a generator that hands the whole emit to a Lua or JavaScript
`generate(corpus, output)` function: the script walks the corpus and
writes files through the output object, so it can produce output shapes
a per-page generator cannot, such as a single artifact aggregated across
every symbol.

The manifest parser moves into a shared `GeneratorManifest`, so the
data-driven and script-driven discovery passes read the same file. A
manifest that names a script is skipped by the data-driven pass and
installed by the script pass.

The output object exposes a single write method, resolved under the
output directory and forbidden from escaping it. Both languages receive
it as the second argument to generate; on the Lua side it is also bound
as a global and passed from there, because the Lua bridge cannot carry a
callable as a plain value.
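The path rule for the write method (resolve under the output directory, reject absolute and escaping paths) might be sketched like this. It is a JavaScript illustration using Node's `path` module, not the C++ `OutputSink`; the helper name `resolveUnderRoot` and the error messages are hypothetical.

```javascript
const path = require("node:path");

// Hypothetical helper: resolve a script-supplied path under `root`, or throw
// if it is absolute or would end up outside the root.
function resolveUnderRoot(root, relative) {
  if (path.isAbsolute(relative)) {
    throw new Error(`absolute path rejected: ${relative}`);
  }
  const absoluteRoot = path.resolve(root);
  const target = path.resolve(absoluteRoot, relative);
  const fromRoot = path.relative(absoluteRoot, target);
  const escapes =
    fromRoot === "" ||                      // the root itself is not a file
    fromRoot === ".." ||
    fromRoot.startsWith(".." + path.sep) || // climbs above the root
    path.isAbsolute(fromRoot);              // e.g. a different drive on Windows
  if (escapes) {
    throw new Error(`path escapes the output directory: ${relative}`);
  }
  return target;
}

console.log(resolveUnderRoot("out", "search/index.json"));
for (const bad of ["/etc/passwd", "../outside.txt", "a/../../outside.txt"]) {
  try {
    resolveUnderRoot("out", bad);
  } catch (err) {
    console.log("rejected:", err.message);
  }
}
```

The check is purely lexical; it does not follow symbolic links, which a real writer may also need to consider.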

This covers discovery (a script manifest installs a `ScriptGenerator`),
the output writer (writes under the root, rejects absolute and escaping
paths), and both runners against a synthetic corpus, asserting the file
they emit. A regression test reads a symbol with no name field,
exercising the `Undefined`-to-`nil` marshaling.

This adds a self-contained search-index generator that the docs page
includes and CI runs.

The extensions/script-driven-generators.adoc example section now
includes the manifest and the generate.lua from this fixture, so the
documented example is exactly the one the test runs.

This replaces the reserved-name `transform_corpus(corpus)` entry point
with an explicit `register_transform(fn)` call. A script may register
any number of transforms; each runs once, in registration order, against
a navigable DOM view of the corpus it can read and mutate in place.

The function is captured as a `dom::Function` on both languages. Lua
anchors it in `LUA_REGISTRYINDEX` via the new `lua::makeCallable`, not a
storage global, so no new ownerless global state is introduced. A script
that registers nothing warns and is otherwise a no-op, so an empty
script is tolerated.
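The registration behavior in this commit message (any number of transforms, each run once in registration order, a warning when nothing is registered) can be modelled with the host-side sketch below. `runExtension`, the way `register_transform` is handed to the script, and the corpus shape are hypothetical; the real host is C++.

```javascript
// Hypothetical host-side model of register_transform.
function runExtension(script, corpus, warn = console.warn) {
  const transforms = [];
  const register_transform = (fn) => {
    if (typeof fn !== "function") {
      throw new TypeError("register_transform expects a function");
    }
    transforms.push(fn);
  };

  script(register_transform); // evaluate the script once

  if (transforms.length === 0) {
    warn("extension registered no transforms"); // tolerated, otherwise a no-op
    return;
  }
  for (const fn of transforms) fn(corpus); // each once, in registration order
}

// Example extension registering two transforms that mutate the corpus.
const order = [];
const corpus = { symbols: [{ name: "detail_helper" }, { name: "parse" }] };
runExtension((register_transform) => {
  register_transform((c) => {
    order.push("drop-detail");
    c.symbols = c.symbols.filter((s) => !s.name.startsWith("detail_"));
  });
  register_transform((c) => {
    order.push("tag");
    for (const s of c.symbols) s.tagged = true;
  });
}, corpus);

console.log(order, corpus.symbols);
```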

`js::Context::~Context` tore the JerryScript interpreter down on the
first `Context` destruction, breaking the reference cycle between the
`Impl` and the native holders that each keep a `shared_ptr` to it. A
`dom::Function` obtained from a JavaScript value holds only that
`shared_ptr`, so once the originating `Context` was gone the function
referred to a freed interpreter and calling it crashed.

So, track the references that must keep the interpreter alive, every
live `js::Context` plus every JavaScript function value still held as a
`dom::Function`, and tear it down only when that count reaches zero. A
function value now self-owns its interpreter the way a Lua callable
does, so it can be invoked after the `Context` that produced it is gone.

Single-context use is unchanged: the count drops to zero on that
`Context`'s destruction and cleanup runs as before. Handlebars helpers
written in JavaScript are unaffected; they hold the value directly
rather than through the function-value conversion.
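The lifetime rule described here (every live context and every held function value keeps the interpreter alive, and teardown happens only when the count reaches zero) can be modelled as below. The class and method names are hypothetical, and because JavaScript has no destructors the sketch uses explicit `destroy()` and `release()` calls where the C++ code relies on object lifetimes.

```javascript
// Hypothetical model of the shared-interpreter reference count.
class Interpreter {
  constructor() {
    this.refs = 0;
    this.alive = true;
  }
  retain() {
    this.refs += 1;
  }
  release() {
    this.refs -= 1;
    if (this.refs === 0) this.alive = false; // tear down only at zero
  }
}

class Context {
  constructor(interpreter) {
    this.interpreter = interpreter;
    interpreter.retain(); // a live context counts as one reference
  }
  // Converting a script function into a host-held value takes its own reference.
  makeFunction(fn) {
    const interpreter = this.interpreter;
    interpreter.retain();
    let released = false;
    const callable = (...args) => {
      if (!interpreter.alive) throw new Error("interpreter already torn down");
      return fn(...args);
    };
    callable.release = () => {
      if (!released) {
        released = true;
        interpreter.release();
      }
    };
    return callable;
  }
  destroy() {
    this.interpreter.release();
  }
}

const interpreter = new Interpreter();
const context = new Context(interpreter);
const double = context.makeFunction((x) => x * 2);
context.destroy();              // the Context is gone, the function still holds a reference
console.log(double(21));        // 42
double.release();
console.log(interpreter.alive); // false
```

With a single context and no held functions, the count drops to zero on `destroy()`, which matches the unchanged single-context behavior the message mentions.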

An extension script now defines an output generator with
`register_generator(id, fn)`, alongside any `register_transform` it
declares, rather than a generator directory shipping a
mrdocs-generator.yml that names a script. Both hooks receive one `ctx`
object: `ctx.corpus` and `ctx.config` for a transform, and additionally
`ctx.output` for a generator. This replaces the positional
`generate(corpus, output, config, params)` entry point and the separate
manifest-script discovery pass.

A registered generator is a `dom::Function` the corpus owns, because the
generator registry is a process-global that is never cleared. The build
populates the corpus with the registered generators while extensions
run; `GenerateAction` then resolves the requested generator from the
corpus before falling back to the registry. A single language-agnostic
runner invokes the function, so the two per-language generator runners
collapse into one path.

The manifest now carries only the data-driven generator fields, its
escape rules and the parent it extends; the `script` and `params` keys
are gone. The search-index example moves under addons/extensions and
declares its generator with `register_generator`.
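A small model of the flow this message describes: one extension declares a transform and a generator, both receive a `ctx` object, and the requested generator is looked up on the corpus before the global registry. The hook names and `ctx` fields come from the message; `build`, `runGenerator`, the `api` object passed to the script, the corpus shape, and the `example-builtin` entry are assumptions for illustration.

```javascript
// Stand-in for the process-global generator registry (hypothetical entry).
const registry = new Map([
  ["example-builtin", (ctx) => ctx.output.write("index.txt", "built-in output")],
]);

// Hypothetical build step: runs the extension, stores generators on the corpus.
function build(script, corpus, config) {
  const transforms = [];
  corpus.generators = new Map();
  script({
    register_transform: (fn) => transforms.push(fn),
    register_generator: (id, fn) => corpus.generators.set(id, fn),
  });
  for (const fn of transforms) fn({ corpus, config }); // transform ctx
}

// Hypothetical lookup: corpus-owned generators first, then the registry.
function runGenerator(id, corpus, config, output) {
  const generator = corpus.generators.get(id) ?? registry.get(id);
  if (!generator) throw new Error(`unknown generator: ${id}`);
  generator({ corpus, config, output }); // generator ctx adds `output`
}

// One extension declaring both hooks.
function extension({ register_transform, register_generator }) {
  register_transform((ctx) => {
    ctx.corpus.symbols = ctx.corpus.symbols.filter((s) => s.name);
  });
  register_generator("search-index", (ctx) => {
    const names = ctx.corpus.symbols.map((s) => s.name);
    ctx.output.write("search-index.json", JSON.stringify(names));
  });
}

const files = new Map();
const output = { write: (path, contents) => files.set(path, contents) };
const corpus = { symbols: [{ kind: "namespace" }, { name: "parse" }] };
build(extension, corpus, {});
runGenerator("search-index", corpus, {}, output);
runGenerator("example-builtin", corpus, {}, output);
console.log([...files.entries()]);
```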

The corpus-extensions page and the script-driven-generators page
documented two halves of one feature: scripts under addons/extensions
that declare transforms and generators through the `register_*` hooks.
This folds the generator material into the extensions page, which now
covers both hooks and the shared `ctx` object, and removes the separate
page.

@gennaroprota gennaroprota force-pushed the feat/script_driven_generators branch from 2f00b2c to 5b3e0e1 Compare June 12, 2026 16:21
Development

Successfully merging this pull request may close these issues.

Generator extensions: user-defined output generators

3 participants