Ingestion

Ingestion is the process of pushing documentation into the knowledge base so it can be searched and retrieved. There are four methods: direct Markdown, URL fetch, OpenAPI spec, and Confluence bulk import. All methods are idempotent — re-ingesting unchanged content is a no-op.

How ingestion works

Every document goes through the same pipeline regardless of method:

  1. Content hash check — if the document exists and the content hash matches, the job is skipped
  2. Source upsert — the namespace record is created if it does not exist
  3. Document upsert — title, slug, metadata, and content hash are stored
  4. Chunk deletion — old chunks for this document are removed
  5. Hierarchical splitting — Markdown is split at heading boundaries into parent chunks, then parent chunks into child chunks
  6. Batch embedding — all child chunks are embedded via Gemini in one batch call
  7. Chunk upsert — chunks and embeddings are written to PostgreSQL with pgvector
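The skip check in step 1, which makes ingestion idempotent, can be sketched as follows. The helper names and the choice of SHA-256 are assumptions; the documented behavior is only that an unchanged content hash makes the job a no-op.

```typescript
import { createHash } from "node:crypto";

// Stable digest of a document body (SHA-256 is an assumption; any
// stable hash gives the same skip behavior).
export function contentHash(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

// Step 1: skip the job when the stored hash matches the incoming content.
export function shouldSkip(storedHash: string | undefined, content: string): boolean {
  return storedHash !== undefined && storedHash === contentHash(content);
}
```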

Methods

Markdown (direct)

The simplest method. POST raw Markdown with a title and namespace.

curl -X POST https://your-host/api/v1/ingest/md \
  -H "Authorization: Bearer cape_..." \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Uploading assets",
    "namespace": "user_docs",
    "slug": "uploading-assets",
    "content": "# Uploading assets\n\nTo upload..."
  }'

Use slug to give the document a stable identifier for deduplication. If omitted, the title is used as the slug.
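The slug fallback can be captured in a small request-body builder. This is a client-side sketch (buildMdPayload is a hypothetical helper, not part of the API); it mirrors the documented rule that an omitted slug falls back to the title.

```typescript
type MdPayload = { title: string; namespace: string; content: string; slug: string };

// Build the /ingest/md request body; slug defaults to the title when omitted.
export function buildMdPayload(opts: {
  title: string;
  namespace: string;
  content: string;
  slug?: string;
}): MdPayload {
  return { ...opts, slug: opts.slug ?? opts.title };
}
```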

URL fetch

Pass a URL — the system fetches the page, strips HTML if needed, and ingests the result as Markdown.

curl -X POST https://your-host/api/v1/ingest/url \
  -H "Authorization: Bearer cape_..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.cape.io/getting-started",
    "namespace": "user_docs"
  }'

Returns 422 if the URL cannot be fetched.

OpenAPI spec

POST an OpenAPI/Swagger spec as JSON or YAML. The system creates one document per API operation, keyed by operationId.

curl -X POST https://your-host/api/v1/ingest/openapi \
  -H "Authorization: Bearer cape_..." \
  -H "Content-Type: application/yaml" \
  --data-binary @openapi.yaml

Operations without an operationId are skipped. Re-posting the same spec only re-embeds changed operations.
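The per-operation keying can be sketched like this. The type is a simplified view of an OpenAPI document (real specs carry more fields, and path items can hold non-operation keys); it shows which operations produce documents and which are skipped.

```typescript
// Simplified OpenAPI shape: paths -> HTTP method -> operation.
type OpenApiSpec = {
  paths: Record<string, Record<string, { operationId?: string; summary?: string }>>;
};

// One document slug per operation, keyed by operationId; operations
// without an operationId produce nothing.
export function operationSlugs(spec: OpenApiSpec): string[] {
  const slugs: string[] = [];
  for (const methods of Object.values(spec.paths)) {
    for (const op of Object.values(methods)) {
      if (op.operationId) slugs.push(op.operationId);
    }
  }
  return slugs;
}
```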

Confluence space

Bulk-import an entire Confluence space. This runs asynchronously — the endpoint returns a job ID immediately.

curl -X POST https://your-host/api/v1/ingest/confluence \
  -H "Authorization: Bearer cape_..." \
  -H "Content-Type: application/json" \
  -d '{ "spaceKey": "ENG", "namespace": "confluence" }'

Required environment variables:

CONFLUENCE_BASE_URL=https://yourorg.atlassian.net
CONFLUENCE_EMAIL=ci@yourorg.com
CONFLUENCE_API_TOKEN=...

Monitor the job via GET /api/v1/ingest/jobs.

CI/CD integration

The typical pattern is to trigger ingestion from a deployment pipeline after a docs build:

# GitHub Actions example
- name: Ingest documentation
  run: |
    shopt -s globstar  # required for ** to recurse in bash
    for file in docs/**/*.md; do
      curl -s -X POST "$CAPE_DOCS_URL/api/v1/ingest/md" \
        -H "Authorization: Bearer $CAPE_API_KEY" \
        -H "Content-Type: application/json" \
        -d "$(jq -n \
          --arg title "$(head -1 "$file" | sed 's/^# //')" \
          --arg content "$(cat "$file")" \
          --arg namespace "user_docs" \
          --arg slug "$(basename "$file" .md)" \
          '{title:$title,content:$content,namespace:$namespace,slug:$slug}')"
    done

Or use the bundled script for local ingestion:

npx tsx scripts/ingest-folder.ts ./docs --limit=50

The script auto-detects namespace: files named technical.md go to tech_docs; everything else goes to user_docs.
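The routing rule the script applies can be sketched as a single function (the function name is hypothetical; the rule itself is as stated above):

```typescript
import { basename } from "node:path";

// Files named technical.md go to tech_docs; everything else to user_docs.
export function detectNamespace(filePath: string): "tech_docs" | "user_docs" {
  return basename(filePath) === "technical.md" ? "tech_docs" : "user_docs";
}
```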

Bulk folder script

npx tsx scripts/ingest-folder.ts <directory> [options]
Option       Description
--dry-run    Parse and log without writing to the database
--limit=N    Stop after N files

Progress output: + means ingested, . means skipped (unchanged).

Monitoring ingestion jobs

Long-running jobs (Confluence imports) are tracked in the database. View them in the Ingestion Jobs section of the admin panel or via the API:

curl https://your-host/api/v1/ingest/jobs \
  -H "Authorization: Bearer cape_..."

Job statuses:

Status     Meaning
pending    Queued, not yet started
running    Actively fetching and embedding
done       Completed successfully
failed     Error occurred — check the error field
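A CI step that needs the import to finish can poll until the job leaves pending/running. This is a minimal sketch: fetchJob is an injected callback (it would wrap the GET /api/v1/ingest/jobs call and pick out the job by ID), and the interval is arbitrary.

```typescript
type JobStatus = "pending" | "running" | "done" | "failed";
type Job = { id: string; status: JobStatus; error?: string };

// Poll until the job reaches a terminal state; throws on failure so a
// CI pipeline step fails with it.
export async function waitForJob(
  fetchJob: () => Promise<Job>,
  intervalMs = 2000,
): Promise<Job> {
  for (;;) {
    const job = await fetchJob();
    if (job.status === "done") return job;
    if (job.status === "failed") throw new Error(job.error ?? "ingestion job failed");
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```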