Outlook Semantic MCP - Disaster Recovery

This runbook covers recovery procedures for the three stateful components the Outlook Semantic MCP Server depends on: the local PostgreSQL database, RabbitMQ, and the Unique Knowledge Base. Each component has a distinct failure mode and recovery path.

Automatic recovery schedulers (a 2-minute full-sync retry, a live catch-up recovery scheduler that retriggers within 5 minutes on failure or after 30 minutes of inactivity, and an inbox-deletion recovery scheduler) handle transient failures. The scenarios below require explicit operator action because the automatic schedulers are insufficient for total data loss.

Out of scope: partial database corruption, Microsoft Graph API outages, and automated recovery scripts.

Deployment Mode Scope

The scenarios in this runbook differ by deployment mode. The runbook refers to `microsoft_graph_and_unique_api` as Mode A and `microsoft_graph` as Mode B:

Scenario	`microsoft_graph_and_unique_api`	`microsoft_graph`
Scenario 1: PostgreSQL Loss	Applies	Applies (simplified — see below)
Scenario 2: RabbitMQ Loss	Applies	Applies (simplified — see below)
Scenario 3: Unique Knowledge Base Loss	Applies	Infrastructure must be restored; no email data to re-ingest

Recovery Considerations

Idempotent re-ingestion

This sub-section applies to Mode A (microsoft_graph_and_unique_api) deployments only. In microsoft_graph mode there is no ingestion pipeline.

For all recovery scenarios except Knowledge Base loss (Scenario 3), the system checks whether each email already exists in the Knowledge Base before re-ingesting it. If the file key matches and the metadata is unchanged, the email is skipped entirely. The only cost of a full re-sync in these cases is:

Microsoft Graph API calls to re-fetch email metadata (pages of 100 messages).
Ingestion API lookups to check whether each file key already exists.

No duplicate content is written to the Knowledge Base. This means that even a full re-sync after database or RabbitMQ loss is a lightweight operation relative to the initial sync — the overhead is API calls, not data re-processing.

In Scenario 3 (Knowledge Base loss), the content must be fully re-ingested because the Knowledge Base no longer contains the files. This is the only scenario where re-sync carries the full cost of initial ingestion.

Recovery time factors

This runbook does not provide fixed RTO targets because recovery time varies significantly by deployment. The main factors are:

Number of connected users — each user's mailbox is re-synced independently. The service processes users concurrently but enforces a batch limit of 50 messages per cycle per user before yielding to others (hardcoded service limit).
Mailbox size — full sync fetches emails in pages of 100 from Microsoft Graph (service-configured page size), processing them sequentially. Large mailboxes (100,000+ emails) take proportionally longer.
Microsoft Graph API rate limits — Microsoft enforces a global limit of 130,000 requests per 10 seconds per app across all tenants; additional per-mailbox and per-service limits may apply (Microsoft limit, not configurable). Re-syncing many users simultaneously may approach these limits. There is no built-in staggering; operators should coordinate with users to stagger their restart_full_sync calls in batches to avoid throttling. See Microsoft Graph throttling.
Ingestion capacity — full sync uploads email batches to the Unique Knowledge Base. If the Unique Knowledge Base ingestion pipeline worker is under-provisioned or scaled down, messages queue up and full sync stalls in waiting-for-ingestion. Live catch-up ingests emails without monitoring the ingestion pipeline. Ensure the ingestion pipeline has sufficient replicas and resources during recovery.
Infrastructure provisioning — if PostgreSQL or RabbitMQ must be provisioned from scratch rather than restored from backup, lead time depends on the platform. Clients using managed database services rather than Kubernetes-native solutions (e.g. CNPG) should account for provider-specific provisioning and configuration time.
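
As a rough illustration of how these factors scale, the sketch below computes how many Graph pages a full sync must fetch for a given mailbox size (using the page size of 100 stated above) and assigns users to notification waves for staggered restart_full_sync calls. The function names and the wave size of 25 are illustrative assumptions, not part of the service.

```shell
#!/bin/sh
# Page size of 100 is the service-configured value quoted in this runbook.
PAGE_SIZE=100

# graph_pages MAILBOX_SIZE
# Prints the number of Graph pages a full sync must fetch (ceiling division).
# Illustrative helper, not part of the service.
graph_pages() {
  echo $(( ($1 + PAGE_SIZE - 1) / PAGE_SIZE ))
}

# wave_for_user INDEX WAVE_SIZE
# Prints the 1-based wave number for the user at 0-based position INDEX
# when users are notified WAVE_SIZE at a time.
wave_for_user() {
  echo $(( $1 / $2 + 1 ))
}

graph_pages 100000     # prints 1000
graph_pages 100001     # prints 1001
wave_for_user 0 25     # prints 1
wave_for_user 25 25    # prints 2
```

The page count is only a lower bound on Graph requests per mailbox; it does not account for retries or for the ingestion API lookups described under Idempotent re-ingestion.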

Backup recommendations

Component	Recommendation	Rationale	Mode
PostgreSQL	Regular backups strongly recommended. Use your platform's backup solution (managed service snapshots, `pg_dump`, or WAL archiving).	Contains OAuth tokens, webhook subscriptions, and all sync state. Without a backup, all users must re-authenticate and full sync restarts from scratch.	Both modes — strongly recommended
RabbitMQ	Backup not required.	Queues carry only transient sync trigger events. Live catch-up email ingestion happens inline (not via RabbitMQ), so RabbitMQ loss only affects trigger delivery, not per-email ingestion. The 2-minute full-sync recovery scheduler and the live catch-up recovery scheduler (5-minute retry on failure, 30-minute retry on inactivity; cron: `INGESTION_LIVE_CATCHUP_RECOVERY_CRON`, default `*/5 * * * *`) re-create any lost trigger events after reconnection.	Both modes — backup not required
Unique Knowledge Base	Managed by the Unique platform.	Backup and restore are the responsibility of the Unique platform operator.	Both modes — managed by Unique platform

Risk if no PostgreSQL backup exists: every user must re-authenticate via OAuth and a full re-sync runs for each user. Existing emails in the Knowledge Base are not lost (re-ingestion is idempotent — only API call overhead, no duplicate data), but recovery time scales linearly with user count and mailbox size. For large deployments this can be significant, compounded by the shared Microsoft Graph API rate limit.
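
For deployments that rely on `pg_dump` rather than managed snapshots, a backup can be sketched as below. This is a minimal example, assuming `pg_dump` is installed where the script runs and that DATABASE_URL holds the same connection string the service uses. The function name and the default backup directory are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: dump the service database to a timestamped file.
# BACKUP_DIR is an illustrative default; DATABASE_URL must be set by the caller.
backup_mcp_db() {
  backup_dir="${BACKUP_DIR:-/var/backups/outlook-semantic-mcp}"
  out="$backup_dir/outlook_semantic_mcp-$(date -u +%Y%m%dT%H%M%SZ).dump"
  mkdir -p "$backup_dir" || return 1
  # --format=custom writes a compressed archive that pg_restore can read.
  pg_dump --format=custom --file="$out" --dbname="$DATABASE_URL" || return 1
  # Print the path of the archive that was written.
  echo "$out"
}

# Usage (after exporting DATABASE_URL):
#   backup_mcp_db
# Restore into an empty database with:
#   pg_restore --no-owner --dbname="$DATABASE_URL" /path/to/file.dump
```

Because the dump contains OAuth tokens, the archive should be stored with the same access controls as the database itself.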

Data loss window

This sub-section applies to Mode A (microsoft_graph_and_unique_api) deployments only. In microsoft_graph mode there is no ingestion pipeline.

Emails are sourced from Microsoft Graph, which retains the authoritative copy. In all three disaster scenarios, email content is not permanently lost — it can be re-fetched and re-ingested. The data loss window refers to the delay before the system catches up:

Webhook notifications lost during an outage are recovered by the live catch-up recovery scheduler — if no new activity occurs for 30 minutes it retriggers live catch-up, which polls Microsoft Graph for any emails modified since the last known watermark.
If a webhook subscription expires during an extended outage (subscriptions renew daily), users must call reconnect_inbox to re-create it. Emails received during the gap are picked up by the subsequent full re-sync.
Worst case: emails received between the last successful live catch-up and service restoration are delayed, not lost. Full re-sync recovers all emails from Microsoft Graph that match the operator-configured Mail Filters — emails outside the configured date window or matching exclusion rules are not synced.

Personnel

Role	When needed
Kubernetes operator	All scenarios — restarts pods, updates secrets, runs migrations, enables debug mode.
Database / platform administrator	Scenario 1 — restores or provisions PostgreSQL. Scenario 2 — restores or provisions RabbitMQ.
End users	Scenario 1 — must re-authenticate via the standard OAuth flow (reconnecting their MCP client). No tool call is required; OAuth completion automatically recreates the Graph webhook subscription and triggers a full sync (Mode A) or simply re-issues MCP tokens (Mode B). Scenario 2 — must call `reconnect_inbox` only if they are not receiving new emails after recovery (Mode A only). Scenario 3 — must call `restart_full_sync` (Mode A only — in Mode B, users re-authenticate via the standard OAuth flow without calling any tool). The operator cannot call tools on behalf of users.

No Microsoft tenant administrator action is required for recovery. Orphaned webhook subscriptions in Microsoft's systems expire automatically based on the expiration time set at creation (the service configures subscriptions to renew daily, so orphaned subscriptions typically expire within about 1 day; Microsoft allows up to 7 days for message subscriptions).

Scenario 1: Local PostgreSQL Database Loss

Symptoms

Service fails to start with database connection errors in the logs.
All MCP tools return errors or empty responses.
No users appear connected: verify_inbox_connection returns not_configured for all users if the database has already been reprovisioned, or returns errors if the service cannot connect to the database.

Impact

The local database stores OAuth tokens, Microsoft Graph webhook subscriptions, and all sync state. Total loss of the database means:

All users must re-authenticate via the OAuth flow.
All Graph webhook subscriptions are orphaned in Microsoft's systems (they expire naturally based on their expiration time — typically within about 1 day given the service's daily renewal cycle; Microsoft allows up to 7 days for message subscriptions).
All sync state (cursor positions, progress counters) is lost.
Emails already ingested into the Unique Knowledge Base are not affected — they remain searchable.

Recovery Steps

Mode A (`microsoft_graph_and_unique_api`) Recovery

Restore or provision a new PostgreSQL instance and update DATABASE_URL in the Kubernetes secret if the connection string changed:

bash

kubectl create secret generic outlook-semantic-mcp-secrets \
  --namespace outlook-semantic-mcp \
  --from-literal=DATABASE_URL="postgresql://user:password@host:5432/outlook_semantic_mcp" \
  --dry-run=client -o yaml | kubectl apply -f -
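
If the secret also holds other keys (for example AMQP_URL, which is an assumption about how a given deployment groups its secrets), an alternative is to patch only the key that changed. The sketch below uses `kubectl patch` with a merge patch on `stringData`. The helper name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: update a single key in an existing secret and leave
# the other keys untouched. stringData values are plain text; the API server
# base64-encodes them into .data on write.
# Note: VALUE is embedded in the JSON patch as-is, so any double quotes or
# backslashes in it must be escaped by the caller.
patch_secret_key() {
  namespace="$1"; secret="$2"; key="$3"; value="$4"
  kubectl patch secret "$secret" \
    --namespace "$namespace" \
    --type merge \
    -p "{\"stringData\":{\"$key\":\"$value\"}}"
}

# Usage:
#   patch_secret_key outlook-semantic-mcp outlook-semantic-mcp-secrets \
#     DATABASE_URL "postgresql://user:password@host:5432/outlook_semantic_mcp"
# Confirm the stored value afterwards:
#   kubectl get secret outlook-semantic-mcp-secrets -n outlook-semantic-mcp \
#     -o jsonpath='{.data.DATABASE_URL}' | base64 -d
```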

Trigger a deploy or restart the service pods. Database migrations run automatically via the Helm hook on deploy — there is no need to run them manually via kubectl exec (the pods may not be running if the DB was lost):

bash

kubectl rollout restart deploy/outlook-semantic-mcp -n outlook-semantic-mcp

Verify migrations ran successfully by checking the pod logs:

bash

kubectl logs deploy/outlook-semantic-mcp -n outlook-semantic-mcp | grep -i migration
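
The restart and the log check can be combined so that the logs are read only after the new pods are ready. The following is a sketch, assuming the deployment and namespace names used above. The helper name and the 5-minute timeout are illustrative.

```shell
#!/bin/sh
# Hypothetical helper: block until the restarted deployment has finished
# rolling out, then print migration-related log lines.
# Arguments default to the namespace and deployment names used in this runbook.
wait_and_check_migrations() {
  namespace="${1:-outlook-semantic-mcp}"
  deploy="${2:-outlook-semantic-mcp}"
  # Returns non-zero if the rollout does not complete within the timeout.
  kubectl rollout status "deploy/$deploy" -n "$namespace" --timeout=5m || return 1
  # grep exits non-zero when no line mentions "migration".
  kubectl logs "deploy/$deploy" -n "$namespace" | grep -i migration
}

# Usage:
#   wait_and_check_migrations
```

A non-zero exit status means either the rollout timed out or no migration lines were found, and both cases warrant a closer look at the pod logs.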

Notify affected users that they must reconnect their inbox. Each user must reconnect their MCP client and complete the standard OAuth flow. The server publishes a user-authorized event on OAuth completion that automatically creates a new Graph webhook subscription and triggers a full sync — no tool call is needed (and reconnect_inbox cannot be called anyway, because the user has no MCP bearer token until OAuth completes).
After reconnection, a full sync starts automatically. Users can monitor progress with sync_progress.
Previously ingested emails remain in the Unique Knowledge Base and are unaffected. The post-recovery full sync checks each email against the Knowledge Base by file key and skips any that already exist — the only overhead is Microsoft Graph API calls and ingestion API lookups, not actual re-ingestion (see Idempotent re-ingestion).

Mode B (`microsoft_graph`) Recovery

The recovery procedure is simpler in Mode B because there are no webhook subscriptions, sync state, or Knowledge Base scopes to restore.

Impact: All users' OAuth tokens and MCP session tokens are lost. Users cannot call tools until they re-authenticate.

Recovery Steps:

Restore or provision a new PostgreSQL instance and update DATABASE_URL in the Kubernetes secret if the connection string changed.
Restart the service pods. Database migrations run automatically.
Notify affected users that they must reconnect their MCP client via the standard OAuth flow. No tool call is needed — re-authentication issues new MCP tokens.

Scenario 2: RabbitMQ Loss

Symptoms

Service logs show AMQP connection errors or failed message publish attempts.
In-progress full syncs complete their current batch but no new batches are triggered — sync_progress shows fullSyncState: "running" but scheduledForIngestion stops incrementing after the current batch finishes.
Live catch-up stops processing new webhook notifications — recently received emails are not ingested.
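
To confirm the first symptom quickly, recent service logs can be scanned for AMQP-related lines. This is a sketch only: the exact wording of the service's error messages is not documented here, so matching on the string "amqp" is an assumption, and the helper name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count log lines mentioning AMQP in a recent window.
# Matching on "amqp" (case-insensitive) is an assumption about the log format.
amqp_error_count() {
  namespace="${1:-outlook-semantic-mcp}"
  since="${2:-10m}"
  # grep -c prints the number of matching lines and exits non-zero
  # when the count is 0.
  kubectl logs deploy/outlook-semantic-mcp -n "$namespace" --since="$since" \
    | grep -ci 'amqp'
}

# Usage:
#   amqp_error_count                          # last 10 minutes, default namespace
#   amqp_error_count outlook-semantic-mcp 1h  # last hour
```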

Impact

RabbitMQ carries in-flight sync trigger events between the service and its internal workers. Total loss means:

Any full sync in progress at the time of failure is stalled. The sync state in the database is intact but the trigger event that drives the next batch is gone.
Live catch-up trigger events (incoming webhook notifications from Microsoft Graph) that were in-flight in RabbitMQ at the time of failure are lost. Any in-progress catch-up run at the time of failure continues and completes normally (it does not depend on RabbitMQ for per-email ingestion).
The local database and Unique Knowledge Base are not affected.
No re-authentication is required.

Recovery Steps

Mode A (`microsoft_graph_and_unique_api`) Recovery

Restore or provision a new RabbitMQ instance and update AMQP_URL in the Kubernetes secret if the connection string changed.
Restart the service pods to reconnect to RabbitMQ:

bash

kubectl rollout restart deploy/outlook-semantic-mcp -n outlook-semantic-mcp

Once the pods reconnect, the automatic schedulers handle recovery without user action:
The 2-minute full-sync retry scheduler retriggers any stalled full syncs.
The live catch-up recovery scheduler retriggers any failed catch-ups within 5 minutes, and retriggers catch-ups that received no notifications within 30 minutes.
If a user reports not receiving new emails after the service has recovered, they can call reconnect_inbox to re-create the webhook subscription. A full sync starts automatically after reconnection.

See also: reconnect_inbox, sync_progress

Mode B (`microsoft_graph`) Recovery

In microsoft_graph mode, RabbitMQ is a required infrastructure component but is not used for email ingestion or sync. Service impact is limited to connectivity loss until RabbitMQ is restored.

Impact: The service cannot connect to RabbitMQ. No email data is lost — there is no ingestion pipeline.

Recovery Steps:

Restore or provision a new RabbitMQ instance and update AMQP_URL in the Kubernetes secret if the connection string changed.
Restart the service pods to reconnect.
No user action is required.

Scenario 3: Unique Knowledge Base Loss

Symptoms

search_emails returns no results or errors for all users.
sync_progress shows ingestionStats.failed increasing, or ingestionStats returns { state: "error" }.
Service logs show errors contacting the Unique ingestion or scope management services.

Impact

The Unique Knowledge Base stores the actual ingested email content used for semantic search. Total loss means:

All previously ingested emails are gone — search_emails returns no results.
The local database and its sync state are not affected.
Microsoft Graph webhook subscriptions are not affected — live notifications continue to arrive.
No re-authentication is required.

Recovery Steps

Mode A (`microsoft_graph_and_unique_api`) Recovery

Restore or verify the Unique Knowledge Base is operational and reachable from the service. Confirm UNIQUE_INGESTION_SERVICE_BASE_URL and UNIQUE_SCOPE_MANAGEMENT_SERVICE_BASE_URL are correct in the Helm values.
Enable debug mode on the deployment if it is not already enabled, by setting MCP_DEBUG_MODE=enabled in mcpConfig.app.mcpDebugMode and restarting pods. This exposes debug tools including restart_full_sync, run_full_sync, pause_full_sync, and resume_full_sync. Note: Debug mode exposes these tools to all connected MCP users, not just operators — restrict MCP client access during recovery. See Configuration.
Notify affected users that they must call restart_full_sync from their own MCP session. All MCP tools are scoped to the authenticated user — there is currently no admin API or batch endpoint to trigger this on behalf of users. This resets sync state in the local database and re-fetches all emails from Microsoft Graph, re-ingesting them into the restored Knowledge Base. Unlike Scenarios 1 and 2, this is the only recovery scenario where emails must be fully re-ingested — the Knowledge Base no longer contains the files, so the cost includes Microsoft Graph API calls, ingestion API calls, and the full content transfer. Subsequent runs are idempotent — file keys prevent duplicates.
Each user can monitor their own recovery progress with sync_progress. The sync is complete when fullSyncState transitions to "ready" and state is "finished".
Live catch-up resumes automatically once ingestion is healthy — emails received during the outage will be processed through the normal webhook pipeline without additional operator action.
Once all affected users have called restart_full_sync, disable debug mode by removing or unsetting MCP_DEBUG_MODE in the Helm values and redeploying (requires a pod restart). Debug mode should not remain enabled in production.
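
Enabling debug mode through Helm could look like the sketch below. Only the `mcpConfig.app.mcpDebugMode` path and the `enabled` value come from this runbook. The release name and chart reference are placeholders for your own, and the helper name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: turn on debug mode via a Helm upgrade.
# RELEASE and CHART are placeholders for your own release name and chart
# reference; the namespace matches the one used elsewhere in this runbook.
enable_debug_mode() {
  release="$1"; chart="$2"
  # --reuse-values keeps the values of the current release and only
  # overrides the key set below.
  helm upgrade "$release" "$chart" \
    --namespace outlook-semantic-mcp \
    --reuse-values \
    --set mcpConfig.app.mcpDebugMode=enabled
}

# Usage:
#   enable_debug_mode <release-name> <chart-reference>
```

To disable debug mode afterwards, remove the setting from the Helm values and redeploy as described in the last step.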

Mode B (`microsoft_graph`) Impact

In microsoft_graph mode, the Unique Knowledge Base is required infrastructure but no email content is stored there: search always queries Microsoft Graph directly. A Knowledge Base outage does not affect search results or any user-facing tool, and no user action is required. The Knowledge Base must still be restored so that the service can connect to it again.
