Confluence Connector - Flows

10 min read

Content Sync Flow

The content sync flow runs periodically (default: every 15 minutes) to synchronize labeled Confluence pages and their attachments to Unique.

Overview

Sequence Diagram

The connector is stateless -- it does not maintain local state between sync cycles. Change detection is performed by the Unique platform's file diff API, called once per Confluence space.

Scope Hierarchy

The connector manages a two-level scope hierarchy:

Root scope -- Must pre-exist in the Unique platform. Configured via ingestion.scopeId. The connector grants itself access at initialization and marks the scope as owned by this Confluence instance on the first sync cycle. Subsequent sync cycles verify that the ownership mark still matches the configured Confluence instance.
Space scopes -- Created automatically as children of the root scope, one per Confluence space key. Access is inherited from the root scope.

Root Scope Ownership Validation

To prevent misconfiguration where the same Confluence instance is connected to multiple Unique organizations, or the same root scope is pointed at a different Confluence instance, the connector tags the root scope with an identifier derived from the Confluence instance on the first sync cycle:

Cloud: the identifier is derived from the cloudId configured in the tenant YAML.
Data Center: the identifier is fetched from the Data Center's application manifest endpoint (/rest/applinks/1.0/manifest). No authentication is required.

On every subsequent sync cycle, the connector reads the ownership mark from the root scope and compares it against the current Confluence instance. A mismatch aborts the tenant sync cycle with a fatal error.

Discovery Phase

Pages are discovered through a CQL-based label search, then attachments are optionally extracted from the already-fetched page objects.

Important: CQL search results are filtered by the authenticated user's permissions. Pages in spaces the service account cannot access are silently excluded from results. On Data Center, the service account's space access can optionally be restricted to specific spaces. When no spaces are specified, the service account has read access to all spaces. If space restrictions are configured, pages in excluded spaces never appear in discovery results and no error is produced.

CQL Queries

The connector uses Confluence Query Language (CQL) to discover pages. The exact CQL differs by instance type:

Instance Type	Space Type Filter	CQL Template
Cloud	`space.type=global OR space.type=collaboration`	`((label="{ingestSingleLabel}") OR (label="{ingestAllLabel}")) AND ({spaceTypeFilter}) AND type != attachment`
Data Center	`space.type=global`	Same template, different space type filter

The space type filter varies by instance type. See the Configuration Guide for details on which space types are scanned per platform.

The type != attachment clause excludes attachments from top-level CQL results since they are fetched via the expand=children.attachment parameter on the page objects themselves.

Descendant Discovery

Pages labeled with ingestAllLabel trigger a descendant search:

Collect all page IDs that carry the ingestAllLabel
Batch IDs into groups of 100
For each batch, execute CQL: ancestor IN ({batch}) AND type != attachment
Deduplicate results with labeled pages by page ID

Attachment Extraction

When attachments.mode is enabled (default), attachments are extracted from the already-fetched page objects. Confluence inlines up to 25 attachments per page via the expand=children.attachment parameter during the CQL search. If a page has more than 25 attachments, additional attachments are fetched via pagination (Cloud uses the v2 attachments endpoint; Data Center uses v1 pagination via _links.next).

An attachment is accepted if: - Its mediaType (reported by the Confluence API) is in the allowedMimeTypes list (defaults cover PDF, the major Office formats, plain text, CSV, HTML, PNG, and JPEG) - Its file size does not exceed maxFileSizeMb (default: 200 MB) - The maxItemsToScan capacity has not been exhausted (pages count first, attachments use remaining capacity)

Images embedded in a page body (via the editor's drag/drop, paste, or "Insert image" actions) are stored by Confluence as regular page attachments and surface in the same expand=children.attachment results. When inlining is enabled, these images are inlined directly into the page HTML (see the Page Image Inlining section below) rather than ingested as separate artifacts, and image attachments are excluded from the standalone attachment path entirely. Orphan images (attached but not referenced by an in-body macro) are appended to the end of the page body. A macro-referenced image that cannot be inlined keeps its original macro and is not ingested elsewhere. Images inserted as external URLs (<ri:url>) are not attachments and are not ingested.

When inlining is off, image attachments go through the standalone attachment path. With attachments.imageOcr enabled (default), each such image content registration includes ingestionConfig.jpgReadMode = DOC_INTELLIGENCE_DEFAULT, which the Unique ingestion service merges over its environment defaults and the destination scope's own config (request body has highest precedence). This forces OCR-based processing on the worker so chunks are produced; without it, the worker default (NO_INGESTION) returns zero chunks and the worker raises FAILED_IMAGE. Images inlined into a page are processed via the page artifact and do not go through this OCR path.

Page Image Inlining

During page ingestion, the connector parses each page's Confluence storage XML and replaces every <ac:image> macro that points to a Confluence attachment with an <img src="data:image/...;base64,..."> element before uploading the page. Two reference shapes are resolved:

Current-page attachment:<ac:image><ri:attachment ri:filename="..."/></ac:image> is matched against the page's discovered image attachments by filename.
Other-page attachment:<ac:image><ri:attachment ri:filename="..."><ri:page ri:space-key="..." ri:content-title="..."/></ri:attachment></ac:image> triggers an on-demand lookup of the referenced page's attachments.

A macro is left untouched in the page body (the image is not ingested elsewhere) when any of these conditions hold:

The reference is an external URL (<ri:url>). Never fetched.
The filename does not match any attachment on the resolved page.
The matched attachment is not an image/* MIME type.
The image size exceeds attachments.maxFileSizeMb.
The image download stream errors.
The referenced other-page lookup returns no page.

Per-page memory is bounded by the inlined page size, the per-image attachments.maxFileSizeMb cap, and processing.concurrency. Image downloads within a single page are also concurrency-capped to keep memory usage predictable on image-heavy pages.

Content Type Ingestion Map

The connector uses label-based discovery via CQL. After fetching, it skips three content types: database, whiteboard, and embed.

Content that passes the filter has its body.storage HTML extracted and ingested. Items with empty bodies are skipped. Descendants of skipped content types (such as sub-pages under a database) are still discovered and ingested.

Confluence Cloud

Content Type	Ingested?	Body Available via API?	Notes
Page	Yes	Yes (`body.storage` / ADF)	Primary content type. Full body ingestion.
Blog Post	Yes	Yes (`body.storage` / ADF)	Treated identically to pages by the connector.
Attachment	Yes (conditional)	No (binary)	Only when `attachments.mode=enabled`. Filtered by MIME type and size. Embedded images (PNG, JPEG) are inlined into the page artifact (see Page Image Inlining) rather than ingested as separate attachments.
Whiteboard	No	No (no body via API)	Explicitly skipped. API returns no body content. Descendants are still discovered.
Database	No	No (structured data, not exposed)	Explicitly skipped. No body via API. Descendants (sub-pages) are still discovered and ingested.
Embed / Smart Link	No	No (only has `embedUrl`)	Explicitly skipped. Only contains a URL reference, no renderable body.
Folder	No (effectively)	No (organizational container)	Not explicitly skipped, but has no body -- skipped by the empty-body filter. Descendants are still discovered.
Comment (inline/footer)	No	Yes (`body.storage` / ADF)	Not discovered -- comments do not appear in label/ancestor CQL results.
Live Doc (page subtype)	Yes (as page)	Yes (`body.storage` / ADF)	Subtype of page. Passes through as a regular page.
Custom Content (app-defined)	No	Yes (`body.storage`)	Not discovered -- uses `ac🔑type` format, not matched by standard CQL.
Task (standalone)	No	Yes (`body.storage` / ADF)	Not a CQL-searchable content type. Only accessible via `/tasks` v2 endpoint.

Confluence Data Center

Content Type	Ingested?	Notes
Page	Yes	Primary content type. Full body ingestion (storage format / XHTML).
Blog Post	Yes	Treated identically to pages by the connector.
Attachment	Yes (conditional)	Only when `attachments.mode=enabled`. Uses v1 pagination (`_links.next`).
Comment (inline/footer)	No	Not discovered -- comments do not appear in label/ancestor CQL results.
Custom Content (plugin-defined)	No	Accessed via plugin-specific REST APIs, not standard `/rest/api/content`.

Whiteboard, Database, Embed, Folder, and Live Doc are Cloud-only features and do not exist in Data Center.

File Diff Mechanism

The connector uses the Unique platform's server-side file diff API (/v2/content/file-diff) to detect changes between sync cycles. The connector does not compute local content hashes -- instead, it sends each item's key, url, and updatedAt timestamp to the diff endpoint, which returns categorized results. The diff is called once per Confluence space.

Change Detection Logic

File Diff Item Attributes

Each item sent to the diff API contains:

Attribute	Description	Used For
`key`	Unique key identifying the item (derived from page and attachment IDs)	Identity and change tracking
`url`	Confluence URL of the page or attachment	Location tracking
`updatedAt`	Last modification timestamp from Confluence	Change detection

Safety Checks

After each space's file diff is computed, two guards run to prevent accidental full deletion of content. Both abort the current tenant sync cycle when triggered.

Guard	Trigger Condition	Rationale
Zero-submission guard	Discovery submitted 0 items but the diff contains deletions	Likely a bug in page fetching or an authentication failure that returned no results. Deleting all stored content would be destructive.
Full-deletion guard	The diff would delete every file stored in Unique for the space and the connector determines the deletion is not a legitimate full content replacement	Likely a key format change (e.g., toggling `useV1KeyFormat`) that causes all existing content to appear unrecognized. If the connector determines the content was genuinely replaced rather than misidentified, the sync proceeds with a warning instead of aborting.

Partial deletions (removing some but not all files) are always allowed. To intentionally remove all content for a tenant, set the tenant status to deleted (see Configuration -- Tenant Status).

Ingestion Pipeline

The ingestion pipeline follows a 3-step process for each item: register, upload, and finalize.

Page Ingestion

For each new or updated page:

Fetch content -- The connector retrieves the full page content (HTML storage representation). The page is skipped if it is not found, if fetching fails, or if the page body is empty.
Extract labels -- Confluence labels are extracted from the page, excluding the configured sync labels. Labels are sorted alphabetically for deterministic ordering.
Register -- Sends metadata to the Unique ingestion API.
Upload -- The page body (Confluence storage format HTML) is uploaded to Unique.
Finalize -- Triggers downstream processing in Unique.

If any step fails for a page, the error is logged and that page is skipped. Other pages continue processing.

Attachment Ingestion

For each new or updated attachment:

Skip zero-byte -- Attachments with zero file size are skipped.
Register -- Sends metadata to the Unique ingestion API.
Download -- Streams the attachment from Confluence.
Upload -- The stream is uploaded to Unique with the original content type and content length.
Finalize -- Triggers downstream processing in Unique.

If any step fails, the stream is destroyed and the error is logged. Other attachments continue processing.

Deletion

Deleted items identified by the file diff are processed after ingestion:

Build content keys for each deleted item
Resolve keys to content IDs
Delete content by IDs

If no content is found for the given keys, a warning is logged and no deletion occurs.

Removed Space Cleanup

The file diff runs once per Confluence space that appears in the current discovery results. Spaces that have disappeared from discovery entirely (because the space was deleted in Confluence, or all ingest labels were removed from its pages) are not diffed and would otherwise leave orphaned content and scopes in Unique.

After the per-item deletion step, the connector detects orphaned space scopes by comparing the root scope's children against the set of Confluence space keys discovered during the current sync. For each orphaned space, the connector deletes both the space's files and the space scope itself. A failure for one space does not prevent the cleanup of other orphaned spaces.

Concurrency and Progress

Ingestion concurrency is controlled by processing.concurrency (default: 1). Pages are ingested first, then attachments -- not in parallel. All items in a batch are attempted even if some fail.

Progress is logged periodically. After all items complete, a summary is logged with the total, succeeded, and failed counts.

Error Handling

Error Handling Strategy

The connector applies scenario-specific behavior to keep sync cycles stable:

Scenario	Typical Cause	Connector Behavior
Sync already in progress	Overlapping cron triggers or long-running sync	Skip the cycle entirely (re-entrancy guard)
Root scope not found	Misconfigured `scopeId` or scope deleted	Abort the entire sync cycle
Root scope belongs to a different Confluence instance	`scopeId` reused across tenants or the tenant's Confluence instance was reassigned	Abort the entire tenant sync cycle
Discovery failure	Error during page or attachment fetching from Confluence	Abort the sync cycle to avoid submitting partial results to file-diff, which could cause unnecessary deletions and re-ingestion costs
Accidental full deletion detected	Bug in page fetching or key format change	Abort the entire tenant sync cycle
Page unavailable	Page deleted between discovery and content fetch, or transient API error	Log, skip the page, continue other pages
Page with empty body	Page has no content (e.g., newly created)	Log, skip the page
Attachment with zero bytes	Empty attachment	Log, skip the attachment
Page ingestion failure	Upload error, registration error	Clean up partial content from KB, log error, skip the page
Attachment ingestion failure	Download error, upload error	Destroy stream, clean up partial content from KB, log error, skip the attachment
Content deletion failure	Unique API error	Log error, continue
Unhandled sync error	Unexpected exception	Caught at top level, logged, sync state reset

Retry Logic

Transient HTTP errors are retried automatically with backoff. Requests follow redirects (up to 10 redirections). Non-2xx responses that exhaust retries are thrown as errors with the status code and response content.

Technical Reference - Architecture overview, key concepts, ingestion pipeline
Architecture - System components and infrastructure
Configuration - Scheduler and processing settings

Standard References

Confluence Cloud REST API - Atlassian Confluence Cloud API documentation
Confluence Data Center REST API - Atlassian Confluence Data Center API documentation
Confluence Query Language (CQL) - CQL reference for content search queries