Replace byte-chunking with section-based Algolia indexing#10408
Open
rosieyohannan wants to merge 3 commits into
The previous approach flattened each page to text and split at arbitrary byte boundaries, producing ~520 large records with no structural awareness. Chunk boundaries fell mid-phrase (e.g. splitting "Pipeline states" across records), and large sections diluted relevance scores so the right page ranked below unrelated results.

New approach mirrors what the old Algolia crawler produced: one record per h2/h3/h4 section, with the section heading in a dedicated `heading` field and the section anchor included in the record URL. This gives Algolia the structural context it needs to rank section heading matches above body text matches, and means clicking a result links directly to the right section.

- Replace collectPages/chunkText with extractSections (splits at heading boundaries, extracts heading text and anchor ID per section)
- Remove chunkText entirely
- Update searchableAttributes to [title, heading, content, path]
- Update objectID to be derived from section URL including anchor
- Update search UI to retrieve heading field and display as "Page title / Section heading" in results

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
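For readers without the diff open, the heading-boundary split described in the commit message can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: only the name `extractSections` comes from the commit message, while the regex-based parsing, the `stripTags` helper, the intro-section handling, and the returned object shape are all assumptions.

```javascript
// Hypothetical helper: drop tags and collapse whitespace to get plain text.
function stripTags(html) {
  return html.replace(/<[^>]*>/g, ' ').replace(/\s+/g, ' ').trim()
}

// Sketch only: split page HTML into one section per h2/h3/h4 heading.
// The real extractSections in the PR may parse HTML differently.
function extractSections(html) {
  const headingRx = /<h([234])\b([^>]*)>([\s\S]*?)<\/h\1>/gi
  const matches = [...html.matchAll(headingRx)]
  const sections = []

  // Assumption: text before the first heading becomes a section with no heading or anchor.
  const introEnd = matches.length ? matches[0].index : html.length
  const intro = stripTags(html.slice(0, introEnd))
  if (intro) sections.push({ heading: '', anchor: '', content: intro })

  matches.forEach((m, i) => {
    // A section's body runs from the end of its heading to the start of the next heading.
    const bodyStart = m.index + m[0].length
    const bodyEnd = i + 1 < matches.length ? matches[i + 1].index : html.length
    const idMatch = /\bid="([^"]*)"/.exec(m[2])
    sections.push({
      heading: stripTags(m[3]),
      anchor: idMatch ? idMatch[1] : '',
      content: stripTags(html.slice(bodyStart, bodyEnd)),
    })
  })
  return sections
}

// Made-up sample page, for illustration only.
const sample =
  '<p>Intro text.</p>' +
  '<h2 id="pipeline-states">Pipeline states</h2>' +
  '<p>A pipeline can be running or failed.</p>' +
  '<h3 id="failed">Failed</h3><p>Details.</p>'
const sections = extractSections(sample)
console.log(sections)
```

Because each section starts at a heading, a phrase such as "Pipeline states" stays whole in the `heading` value instead of being cut at a byte boundary.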
meeech
reviewed
Jun 1, 2026
meeech
reviewed
Jun 1, 2026
meeech
requested changes
Jun 1, 2026
Co-authored-by: mitchell amihod <4623+meeech@users.noreply.github.com>
Co-authored-by: mitchell amihod <4623+meeech@users.noreply.github.com>
Summary
- Indexes one record per `h2`/`h3`/`h4` section. The section heading is stored in a dedicated `heading` field with higher search priority than body content. The section anchor (for example, `#pipeline-states`) is included in the record URL so clicking a result links directly to the right section on the page.
- Search results display as `Page title / Section heading` so users can see both which page and which section matched.

Changes
`extensions/export-content-extension.js`

- Replaced `collectPages` + `chunkText` with section-based extraction using `extractSections`
- `extractSections` splits page HTML at `h2`/`h3`/`h4` boundaries and extracts the heading text and `id` attribute (anchor) per section
- `indexToAlgolia` simplified: one record per section, no chunking loop
- `searchableAttributes` updated to `[title, heading, content, path]`, so section heading matches rank above body text

`ui/src/js/07-search.js`

- Added `heading` to `attributesToRetrieve`
- Results display as `Page title / Section heading` when a section heading exists

Test plan
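- With `SKIP_INDEX_SEARCH=true` set, inspect `.temp/site-content.json`: records should have a `heading` field and `relUrl` values with `#anchors`
- A result for the "Pipeline states" section should display as `Pipelines overview and setup / Pipeline states`
- Clicking that result should land on the `#pipeline-states` anchor

🤖 Generated with Claude Code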
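As a reviewer aid, the record shape and the first test-plan check could be sketched along these lines. Everything here is an assumption rather than the PR's code: `toRecord`, `formatTitle`, `checkRecords`, the page object, and the `/example/pipelines/` path are hypothetical names and values, and the real layout of `.temp/site-content.json` is not shown in this PR description. Only the field names (`title`, `heading`, `content`, `path`, `relUrl`, `objectID`) and the `Page title / Section heading` display format come from the text above.

```javascript
// Hypothetical: build one record per section, with objectID derived from the URL plus anchor.
function toRecord(page, section) {
  const relUrl = section.anchor ? `${page.path}#${section.anchor}` : page.path
  return {
    objectID: relUrl,
    title: page.title,
    heading: section.heading,
    content: section.content,
    path: page.path,
    relUrl,
  }
}

// Hypothetical: show "Page title / Section heading" only when a heading exists.
function formatTitle(record) {
  return record.heading ? `${record.title} / ${record.heading}` : record.title
}

// Hypothetical check mirroring the test plan: heading field present,
// relUrl carries an anchor when there is a heading, objectIDs are unique.
function checkRecords(records) {
  const problems = []
  const seen = new Set()
  for (const r of records) {
    if (typeof r.heading !== 'string') problems.push(`${r.objectID}: missing heading field`)
    if (r.heading && !String(r.relUrl).includes('#')) problems.push(`${r.objectID}: no #anchor in relUrl`)
    if (seen.has(r.objectID)) problems.push(`${r.objectID}: duplicate objectID`)
    seen.add(r.objectID)
  }
  return problems
}

// Made-up page and sections, for illustration only.
const page = { title: 'Pipelines overview and setup', path: '/example/pipelines/' }
const pageSections = [
  { heading: '', anchor: '', content: 'Intro text.' },
  { heading: 'Pipeline states', anchor: 'pipeline-states', content: 'A pipeline can be running or failed.' },
]
const records = pageSections.map((s) => toRecord(page, s))
const problems = checkRecords(records)
console.log(records.map(formatTitle), problems)
```

Deriving `objectID` from the anchored URL keeps IDs stable across rebuilds as long as heading anchors do not change, but two headings with the same `id` on one page would collide, which is why the sketch also checks for duplicates.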