importer: scrape blog-post bodies from laddr's HTML (JSON endpoint doesn't surface Body) #106

@themightychris

What's wrong

The cutover-blog importer landed in #84 / PR #101 and successfully imports 138 blog posts (titles, authors, summaries, publish dates), but bodies come through empty.

Root cause: laddr's `/blog?format=json` endpoint doesn't include a `Body` field at all — not in the list response, not on `?include=Body`, not on single-post fetches via `/blog/?format=json`. The body only exists in the rendered HTML at `https://codeforphilly.org/blog/` inside `<div class="article-body">`.

Verified 2026-05-30 against the live laddr site:

```
$ curl -s 'https://codeforphilly.org/blog?format=json&limit=1' | jq '.data[0] | keys'
[
"AuthorID", "Class", "ContextClass", "ContextID", "Created",
"CreatorID", "Handle", "ID", "Modified", "ModifierID",
"Published", "RevisionID", "Status", "Summary", "Title", "Visibility"
]
```

No `Body`. Same for `?include=Body` and single-post fetches.

Fix shape

`translateBlogPost` in `apps/api/scripts/import-laddr/translators.ts` needs a second pass after the JSON fetch:

  1. For each imported post, GET `https:///blog/` (the numeric-id URL, which 200s; the slug URL 404s because laddr's slug format uses underscores, mismatched with our kebab-case importer output)
  2. Parse the HTML, extract the `<div class="article-body">` content
  3. Run that HTML through an HTML-to-markdown converter (turndown or similar) so the content-typed sheet stays in its declared markdown format
  4. Assign to `record.body`
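
A minimal sketch of steps 1-3, under stated assumptions: `extractArticleBody`, `fetchBodyMarkdown`, `HtmlToMarkdown`, and `FetchText` are hypothetical names, not existing importer code, and the regex-plus-depth-counter extraction is a dependency-free stand-in for a real HTML parser. The fetch function and the markdown converter are passed in so the sketch stays testable and does not commit to a specific library.

```typescript
// Hypothetical helpers for illustration; only the `article-body` class
// and the `record.body` target come from this issue.

/** Converts an HTML fragment to markdown (e.g. a thin wrapper around turndown). */
type HtmlToMarkdown = (html: string) => string;

/** The subset of the fetch API this sketch relies on. */
type FetchText = (
  url: string,
) => Promise<{ ok: boolean; status: number; text(): Promise<string> }>;

/**
 * Returns the inner HTML of the first <div> whose class list contains
 * `article-body`, or null if there is none. Nested <div> tags are counted
 * so an inner </div> does not end the match early.
 */
function extractArticleBody(html: string): string | null {
  const open =
    /<div\b[^>]*\bclass\s*=\s*(["'])(?:[^"']*\s)?article-body(?:\s[^"']*)?\1[^>]*>/i.exec(
      html,
    );
  if (!open) return null;

  const start = open.index + open[0].length;
  const divTag = /<(\/?)div\b[^>]*>/gi;
  divTag.lastIndex = start;

  let depth = 1;
  let m: RegExpExecArray | null;
  while ((m = divTag.exec(html)) !== null) {
    depth += m[1] === "/" ? -1 : 1;
    if (depth === 0) return html.slice(start, m.index).trim();
  }
  return null; // unbalanced markup: treat as "no body found"
}

/**
 * Fetches one rendered post page and returns its body as markdown,
 * or null when the page has no article-body div.
 */
async function fetchBodyMarkdown(
  postUrl: string,
  fetchImpl: FetchText,
  toMarkdown: HtmlToMarkdown,
): Promise<string | null> {
  const res = await fetchImpl(postUrl);
  if (!res.ok) throw new Error(`GET ${postUrl} failed with ${res.status}`);
  const bodyHtml = extractArticleBody(await res.text());
  return bodyHtml === null ? null : toMarkdown(bodyHtml);
}
```

In the importer, the result would be assigned to `record.body` (step 4), with the global `fetch` passed as `fetchImpl`. If turndown is the chosen converter, `toMarkdown` could be `(html) => new TurndownService().turndown(html)`. Returning null rather than an empty string lets the caller log posts whose page lacked the expected div instead of silently importing an empty body.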

Alternative: import bodies as raw HTML rather than markdown — either store as a plain TOML string (drop the markdown content-type) or embed HTML blocks within markdown (legal in CommonMark).

Impact today

The deployed sandbox shows the blog index and detail screens with titles, authors, summaries, and post dates, so the structure is correct and visitors can see which posts exist. The body region renders the summary followed by an empty area. That is acceptable as a v1 deploy, but not as the cutover state for production.

Out of scope here

The slug mismatch between the importer's kebab-case slugs and laddr's underscore slugs is a separate redirect concern — currently the legacy URL redirects don't cover `/blog/`. Worth tracking but doesn't block this body-fetch work.
