Replace byte-chunking with section-based Algolia indexing by rosieyohannan · Pull Request #10408 · circleci/circleci-docs

rosieyohannan · 2026-05-27T15:24:00Z

Summary

Root cause of poor search: The previous indexing split pages at arbitrary byte boundaries (~8,500 bytes), producing ~520 large records with no awareness of document structure. Chunk boundaries fell mid-phrase (e.g. splitting "Pipeline states" across two records), and huge records diluted Algolia relevance scores so the right section ranked below unrelated pages.
New approach: One Algolia record per h2/h3/h4 section. The section heading is stored in a dedicated heading field with higher search priority than body content. The section anchor (#pipeline-states) is included in the record URL so clicking a result links directly to the right section on the page.
Expected record count: ~5,000–15,000 (up from ~520), with an average record size of ~1KB (down from ~5.5KB), matching what the old Algolia crawler produced.
Search results display: Each result shows "Page title / Section heading" so users can see both which page and which section matched.
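A minimal sketch of the per-section record described above, to make the shape concrete. The field names title, heading, content, path and relUrl appear in this PR; the function name, the input shapes, and the example path are assumptions made for illustration and may not match the actual extension code.

```javascript
// Hypothetical input shapes (not taken from the PR):
//   page    = { title, path, relUrl }
//   section = { heading, anchor, content }
function buildSectionRecord(page, section) {
  // Content before the first heading has no anchor, so it links to the page itself.
  const relUrl = section.anchor ? `${page.relUrl}#${section.anchor}` : page.relUrl;
  return {
    title: page.title,
    heading: section.heading || "",
    content: section.content,
    path: page.path,
    relUrl,
  };
}

// Example values are made up for illustration.
const record = buildSectionRecord(
  { title: "Pipelines overview and setup", path: "example/pipelines", relUrl: "/example/pipelines/" },
  { heading: "Pipeline states", anchor: "pipeline-states", content: "A pipeline can be in one of several states." }
);
console.log(record.relUrl);
```

Keeping the heading in its own field, rather than prepending it to content, is what lets the index weight heading matches separately from body matches.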

Changes

extensions/export-content-extension.js

Replace collectPages + chunkText with section-based extraction using extractSections
extractSections splits page HTML at h2/h3/h4 boundaries, extracts heading text and id attribute (anchor) per section
indexToAlgolia simplified: one record per section, no chunking loop
searchableAttributes updated to [title, heading, content, path] — section heading matches rank above body text
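The heading-boundary split can be sketched roughly as below. This is a hypothetical stand-in written with regular expressions so it runs without dependencies; the real extractSections may well use an HTML parser and handle cases this sketch does not (for example, id attributes in single quotes or anchors placed on wrapper elements).

```javascript
// Hypothetical stand-in for extractSections; not the code from this PR.
function extractSectionsSketch(html) {
  // Matches <h2>, <h3> or <h4> elements, capturing level, attributes and inner HTML.
  const headingRe = /<h([234])\b([^>]*)>([\s\S]*?)<\/h\1>/gi;
  const stripTags = (s) => s.replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim();

  const sections = [];
  let current = { heading: "", anchor: "", start: 0 };
  let match;
  while ((match = headingRe.exec(html)) !== null) {
    // Close the previous section at the start of this heading.
    const content = stripTags(html.slice(current.start, match.index));
    if (current.heading || content) {
      sections.push({ heading: current.heading, anchor: current.anchor, content });
    }
    const idMatch = /\bid="([^"]*)"/i.exec(match[2]);
    current = {
      heading: stripTags(match[3]),
      anchor: idMatch ? idMatch[1] : "",
      start: headingRe.lastIndex,
    };
  }
  // Whatever follows the last heading belongs to the last section.
  const tail = stripTags(html.slice(current.start));
  if (current.heading || tail) {
    sections.push({ heading: current.heading, anchor: current.anchor, content: tail });
  }
  return sections;
}

const sections = extractSectionsSketch(
  '<p>Intro text.</p>' +
  '<h2 id="pipeline-states">Pipeline states</h2><p>A pipeline can be running or failed.</p>' +
  '<h3 id="next-steps">Next steps</h3><p>Read more.</p>'
);
console.log(sections.map((s) => s.heading));
```

Text that appears before the first heading is kept as a section with an empty heading and no anchor, so no page content is dropped from the index.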

ui/src/js/07-search.js

Add heading to attributesToRetrieve
Result title displays as Page title / Section heading when a section heading exists
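The display rule above amounts to a small formatting step. A sketch, assuming each search hit exposes title and heading fields; the function name is hypothetical and not taken from 07-search.js.

```javascript
// Hypothetical helper: builds the result title shown in the search UI.
function formatResultTitle(hit) {
  // Fall back to the page title alone when the record has no section heading.
  return hit.heading ? `${hit.title} / ${hit.heading}` : hit.title;
}

console.log(formatResultTitle({ title: "Pipelines overview and setup", heading: "Pipeline states" }));
console.log(formatResultTitle({ title: "Pipelines overview and setup" }));
```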

Test plan

Run a local build with SKIP_INDEX_SEARCH=true and inspect .temp/site-content.json: records should have a heading field and relUrl values with # anchors
After merge and CI index rebuild: search "pipeline states" — should return the Pipeline states section of the pipelines overview page as a top result
Verify search result displays as Pipelines overview and setup / Pipeline states
Verify clicking the result lands on the #pipeline-states anchor
Check Algolia dashboard record count is substantially higher than 520
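The first check in the test plan could be partly automated with a short script. A sketch only: it assumes .temp/site-content.json parses to a flat array of record objects, which the PR does not state, and the function name is hypothetical.

```javascript
// Hypothetical helper: counts how many records carry a heading and an anchored URL.
function summarizeRecords(records) {
  return {
    total: records.length,
    withHeading: records.filter((r) => typeof r.heading === "string" && r.heading !== "").length,
    withAnchor: records.filter((r) => typeof r.relUrl === "string" && r.relUrl.includes("#")).length,
  };
}

// Made-up sample data; against a real build, the records could instead be loaded with
//   JSON.parse(require("fs").readFileSync(".temp/site-content.json", "utf8"))
// assuming the file is a flat array.
const sample = [
  { title: "Pipelines overview and setup", heading: "Pipeline states", relUrl: "/example/pipelines/#pipeline-states" },
  { title: "Pipelines overview and setup", heading: "", relUrl: "/example/pipelines/" },
];
const summary = summarizeRecords(sample);
console.log(summary);
```

A withAnchor count close to total, with total well above 520, would be consistent with the section-based output this PR describes.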

🤖 Generated with Claude Code

The previous approach flattened each page to text and split at arbitrary byte boundaries, producing ~520 large records with no structural awareness. Chunk boundaries fell mid-phrase (e.g. splitting "Pipeline states" across records), and large sections diluted relevance scores so the right page ranked below unrelated results.

New approach mirrors what the old Algolia crawler produced: one record per h2/h3/h4 section, with the section heading in a dedicated `heading` field and the section anchor included in the record URL. This gives Algolia the structural context it needs to rank section heading matches above body text matches, and means clicking a result links directly to the right section.

- Replace collectPages/chunkText with extractSections (splits at heading boundaries, extracts heading text and anchor ID per section)
- Remove chunkText entirely
- Update searchableAttributes to [title, heading, content, path]
- Update objectID to be derived from section URL including anchor
- Update search UI to retrieve heading field and display as "Page title / Section heading" in results

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
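For illustration, the settings and objectID changes in this commit might look like the sketch below. The attribute list is the one given in the commit message; in Algolia, the order of searchableAttributes sets matching priority, so heading outranks content. The helper name and its exact derivation are assumptions, since the commit only says the objectID is derived from the section URL including the anchor.

```javascript
// Index settings as plain data; these would be passed to the Algolia client's settings call.
const indexSettings = {
  // Earlier entries rank higher: a heading match beats a body-text match.
  searchableAttributes: ["title", "heading", "content", "path"],
};

// Hypothetical helper: one stable ID per section, based on its URL.
function sectionObjectID(pageUrl, anchor) {
  return anchor ? `${pageUrl}#${anchor}` : pageUrl;
}

console.log(sectionObjectID("/example/pipelines/", "pipeline-states"));
```

Deriving the ID from the URL keeps it stable across rebuilds, so re-indexing would update existing records rather than create duplicates, provided anchors are unique within a page.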

linear-code · 2026-05-27T15:24:05Z

DOC-176

Co-authored-by: mitchell amihod <4623+meeech@users.noreply.github.com>

rosieyohannan requested review from a team as code owners May 27, 2026 15:24

meeech reviewed Jun 1, 2026


Comment thread extensions/export-content-extension.js Outdated

meeech reviewed Jun 1, 2026


Comment thread extensions/export-content-extension.js

meeech requested changes Jun 1, 2026


rosieyohannan and others added 2 commits June 1, 2026 23:07

Update extensions/export-content-extension.js

1beed06

Co-authored-by: mitchell amihod <4623+meeech@users.noreply.github.com>

Update extensions/export-content-extension.js

e4a154c

Co-authored-by: mitchell amihod <4623+meeech@users.noreply.github.com>

rosieyohannan wants to merge 3 commits into main from DOC-176-search-section-indexing

2 participants
