Replace byte-chunking with section-based Algolia indexing #10408

Open
rosieyohannan wants to merge 3 commits into main from DOC-176-search-section-indexing

@rosieyohannan

Summary

  • Root cause of poor search: The previous indexing split pages at arbitrary byte boundaries (~8,500 bytes), producing ~520 large records with no awareness of document structure. Chunk boundaries fell mid-phrase (e.g. splitting "Pipeline states" across two records), and huge records diluted Algolia relevance scores so the right section ranked below unrelated pages.
  • New approach: One Algolia record per h2/h3/h4 section. The section heading is stored in a dedicated heading field with higher search priority than body content. The section anchor (#pipeline-states) is included in the record URL so clicking a result links directly to the right section on the page.
  • Expected record count: ~5,000–15,000 (up from ~520), with an average record size of ~1 KB (down from ~5.5 KB), matching what the old Algolia crawler produced.
  • Search results display: Shows Page title / Section heading so users can see both which page and which section matched.
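
The section split described above can be sketched roughly as follows. This is an illustrative, regex-based stand-in and not the `extractSections` implementation in this PR; the function name `extractSectionsSketch` and the returned field names (`heading`, `anchor`, `content`) are assumptions made for the example.

```javascript
// Hypothetical sketch, NOT the extractSections from this PR.
// Splits an HTML string at h2/h3/h4 boundaries and returns one entry per section.
function extractSectionsSketch(html) {
  // Matches <h2>, <h3>, or <h4> elements, capturing attributes and inner HTML.
  const headingRe = /<h([234])\b([^>]*)>([\s\S]*?)<\/h\1>/gi;
  // Replaces tags with spaces, then collapses whitespace.
  const stripTags = (s) => s.replace(/<[^>]*>/g, " ").replace(/\s+/g, " ").trim();

  const sections = [];
  // Content before the first heading is kept as a section with an empty heading.
  let current = { heading: "", anchor: "", start: 0 };

  const pushCurrent = (end) => {
    const content = stripTags(html.slice(current.start, end));
    if (current.heading || content) {
      sections.push({ heading: current.heading, anchor: current.anchor, content });
    }
  };

  let match;
  while ((match = headingRe.exec(html)) !== null) {
    pushCurrent(match.index); // close the section that ends at this heading
    const idMatch = /(?:^|\s)id\s*=\s*["']([^"']+)["']/i.exec(match[2]);
    current = {
      heading: stripTags(match[3]),
      anchor: idMatch ? idMatch[1] : "",
      start: match.index + match[0].length,
    };
  }
  pushCurrent(html.length); // close the final section
  return sections;
}
```

Because the split happens only at heading boundaries, a phrase such as "Pipeline states" can no longer be cut across two records the way a fixed byte offset could cut it. A real implementation would more likely use an HTML parser than a regex.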

Changes

extensions/export-content-extension.js

  • Replace collectPages + chunkText with section-based extraction using extractSections
  • extractSections splits page HTML at h2/h3/h4 boundaries, extracts heading text and id attribute (anchor) per section
  • indexToAlgolia simplified: one record per section, no chunking loop
  • searchableAttributes updated to [title, heading, content, path], so that section heading matches rank above body text
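
As a rough illustration of the record shape these changes imply, the sketch below builds one record per section and lists the searchable attributes in priority order. The helper name `buildSectionRecords`, the `page` object shape, and the choice to use the anchored URL directly as the `objectID` are assumptions for the example; the PR only states that the `objectID` is derived from the section URL including the anchor.

```javascript
// Hypothetical helper: field names follow the PR description
// (title, heading, content, path); everything else is an assumption.
function buildSectionRecords(page, sections) {
  return sections.map((section) => {
    // Append the section anchor so a search result links to the section itself.
    const relUrl = section.anchor ? `${page.relUrl}#${section.anchor}` : page.relUrl;
    return {
      // Assumption: the anchored URL is used as-is; the PR may hash or transform it.
      objectID: relUrl,
      title: page.title,
      heading: section.heading,
      content: section.content,
      path: page.relUrl,
      relUrl,
    };
  });
}

// Algolia treats attributes listed earlier in searchableAttributes as more
// important, so a match in `heading` outranks a match in `content`.
const indexSettings = {
  searchableAttributes: ["title", "heading", "content", "path"],
};
```

Note that with this simplified `objectID`, two sections on the same page that both lack an anchor would collide, so a real implementation would need a fallback for sections without an `id`.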

ui/src/js/07-search.js

  • Add heading to attributesToRetrieve
  • Result title displays as Page title / Section heading when a section heading exists
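
The display rule for result titles could look something like the sketch below. The function name `formatResultTitle` and the `hit` object shape are assumptions for illustration; the actual code in `07-search.js` may be structured differently.

```javascript
// Hypothetical sketch of the "Page title / Section heading" display rule.
// `hit` stands in for an Algolia search hit with `title` and optional `heading`.
function formatResultTitle(hit) {
  // Drop missing or empty parts so page-level records show the title alone.
  return [hit.title, hit.heading].filter(Boolean).join(" / ");
}
```

For example, a hit with title `Pipelines overview and setup` and heading `Pipeline states` would be rendered as `Pipelines overview and setup / Pipeline states`, while a hit with no heading falls back to the page title alone.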

Test plan

  • Run a local build with SKIP_INDEX_SEARCH=true and inspect .temp/site-content.json: records should have a heading field and relUrl values with # anchors
  • After merge and CI index rebuild: search "pipeline states" — should return the Pipeline states section of the pipelines overview page as a top result
  • Verify search result displays as Pipelines overview and setup / Pipeline states
  • Verify clicking the result lands on the #pipeline-states anchor
  • Check that the record count in the Algolia dashboard is substantially higher than 520
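
The first check in the test plan could be partly automated with a small summary function like the one below. It assumes that `.temp/site-content.json` parses to a flat array of records with `heading` and `relUrl` fields; that shape is an assumption and should be confirmed against the real file. The name `summarizeRecords` is hypothetical.

```javascript
// Hypothetical sketch: counts how many exported records carry a section
// heading and an anchored URL. Assumes `records` is a flat array of objects.
function summarizeRecords(records) {
  const summary = { total: records.length, withHeading: 0, withAnchor: 0 };
  for (const record of records) {
    if (typeof record.heading === "string" && record.heading.length > 0) {
      summary.withHeading += 1;
    }
    if (typeof record.relUrl === "string" && record.relUrl.includes("#")) {
      summary.withAnchor += 1;
    }
  }
  return summary;
}
```

After a local build, the records could be loaded with `JSON.parse(fs.readFileSync(".temp/site-content.json", "utf8"))` and passed to `summarizeRecords`; most records should then show up in both the `withHeading` and `withAnchor` counts.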

🤖 Generated with Claude Code

The previous approach flattened each page to text and split at arbitrary
byte boundaries, producing ~520 large records with no structural awareness.
Chunk boundaries fell mid-phrase (e.g. splitting "Pipeline states" across
records), and large records diluted relevance scores so the right page
ranked below unrelated results.

The new approach mirrors what the old Algolia crawler produced: one record per
h2/h3/h4 section, with the section heading in a dedicated `heading` field
and the section anchor included in the record URL. This gives Algolia the
structural context it needs to rank section heading matches above body text
matches, and means clicking a result links directly to the right section.

- Replace collectPages/chunkText with extractSections (splits at heading
  boundaries, extracts heading text and anchor ID per section)
- Remove chunkText entirely
- Update searchableAttributes to [title, heading, content, path]
- Update objectID to be derived from section URL including anchor
- Update search UI to retrieve heading field and display as
  "Page title / Section heading" in results

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@rosieyohannan rosieyohannan requested review from a team as code owners May 27, 2026 15:24
@linear-code

linear-code Bot commented May 27, 2026

DOC-176

Comment thread extensions/export-content-extension.js Outdated
Comment thread extensions/export-content-extension.js
rosieyohannan and others added 2 commits June 1, 2026 23:07
Co-authored-by: mitchell amihod <4623+meeech@users.noreply.github.com>
Co-authored-by: mitchell amihod <4623+meeech@users.noreply.github.com>

2 participants