Outlook Semantic MCP - Disaster Recovery
This runbook covers recovery procedures for the three stateful components the Outlook Semantic MCP Server depends on: the local PostgreSQL database, RabbitMQ, and the Unique Knowledge Base. Each component has a distinct failure mode and recovery path.
Automatic recovery schedulers (a 2-minute full-sync retry, a live catch-up recovery scheduler that retriggers within 5 minutes on failure or after 30 minutes of inactivity, and an inbox-deletion recovery scheduler) handle transient failures. The scenarios below require explicit operator action because the automatic schedulers are insufficient for total data loss.
Out of scope: partial database corruption, Microsoft Graph API outages, and automated recovery scripts.
Deployment Mode Scope
The scenarios in this runbook differ by deployment mode. Later sections refer to `MicrosoftGraphAndUniqueApi` as Mode A and `MicrosoftGraph` as Mode B:
| Scenario | MicrosoftGraphAndUniqueApi | MicrosoftGraph |
|---|---|---|
| Scenario 1: PostgreSQL Loss | Applies | Applies (simplified — see below) |
| Scenario 2: RabbitMQ Loss | Applies | Applies (simplified — see below) |
| Scenario 3: Unique Knowledge Base Loss | Applies | Infrastructure must be restored; no email data to re-ingest |
Recovery Considerations
Idempotent re-ingestion
This sub-section applies to Mode A (`MicrosoftGraphAndUniqueApi`) deployments only. In `MicrosoftGraph` mode there is no ingestion pipeline.
For all recovery scenarios except Knowledge Base loss (Scenario 3), the system checks whether each email already exists in the Knowledge Base before re-ingesting it. If the file key matches and the metadata is unchanged, the email is skipped entirely. The only cost of a full re-sync in these cases is:
- Microsoft Graph API calls to re-fetch email metadata (pages of 100 messages).
- Ingestion API lookups to check whether each file key already exists.
No duplicate content is written to the Knowledge Base. This means that even a full re-sync after database or RabbitMQ loss is a lightweight operation relative to the initial sync — the overhead is API calls, not data re-processing.
In Scenario 3 (Knowledge Base loss), the content must be fully re-ingested because the Knowledge Base no longer contains the files. This is the only scenario where re-sync carries the full cost of initial ingestion.
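The skip decision described above can be sketched as follows. This is a minimal illustration rather than the service's actual code: the `needs_ingestion` function, the `kb_index` dictionary (standing in for the ingestion API lookup), and the metadata fields are all hypothetical. The sketch also assumes that an email whose metadata changed is processed again, which the text above does not state explicitly.

```python
from typing import Mapping, Optional


def needs_ingestion(
    file_key: str,
    metadata: Mapping[str, str],
    kb_index: Mapping[str, Mapping[str, str]],
) -> bool:
    """Return True when an email must be (re-)ingested.

    kb_index is a hypothetical stand-in for the ingestion API lookup:
    it maps file keys already in the Knowledge Base to their stored metadata.
    """
    existing: Optional[Mapping[str, str]] = kb_index.get(file_key)
    if existing is None:
        # Not in the Knowledge Base, for example after Scenario 3.
        return True
    # Same file key and unchanged metadata: the email is skipped.
    return dict(existing) != dict(metadata)


kb_index = {"msg-1": {"subject": "Q3 report", "folder": "Inbox"}}
print(needs_ingestion("msg-1", {"subject": "Q3 report", "folder": "Inbox"}, kb_index))    # False
print(needs_ingestion("msg-1", {"subject": "Q3 report", "folder": "Archive"}, kb_index))  # True
print(needs_ingestion("msg-2", {"subject": "New mail", "folder": "Inbox"}, kb_index))     # True
```

With an empty `kb_index`, every call returns `True`, which mirrors why Scenario 3 carries the full ingestion cost.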
Recovery time factors
The documentation does not provide fixed RTO targets because recovery time varies significantly by deployment. The main factors are:
- Number of connected users: each user's mailbox is re-synced independently. The service processes users concurrently but enforces a batch limit of 50 messages per cycle per user before yielding to others (hardcoded service limit).
- Mailbox size: full sync fetches emails in pages of 100 from Microsoft Graph (service-configured page size), processing them sequentially. Large mailboxes (100,000+ emails) take proportionally longer.
- Microsoft Graph API rate limits: Microsoft enforces a global limit of 130,000 requests per 10 seconds per app across all tenants; additional per-mailbox and per-service limits may apply (Microsoft limit, not configurable). Re-syncing many users simultaneously may approach these limits. There is no built-in staggering; operators should coordinate with users to stagger their `restart_full_sync` calls in batches to avoid throttling. See Microsoft Graph throttling.
- Ingestion capacity: full sync uploads email batches to the Unique Knowledge Base. If the Unique Knowledge Base ingestion pipeline worker is under-provisioned or scaled down, messages queue up and full sync stalls in `waiting-for-ingestion`. Live catch-up ingests emails without monitoring the ingestion pipeline. Ensure the ingestion pipeline has sufficient replicas and resources during recovery.
- Infrastructure provisioning: if PostgreSQL or RabbitMQ must be provisioned from scratch rather than restored from backup, lead time depends on the platform. Clients using managed database services rather than Kubernetes-native solutions (e.g. CNPG) should account for provider-specific provisioning and configuration time.
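As a rough planning aid for the factors above, the following sketch estimates how many Graph list requests a mailbox needs at the stated page size of 100, and splits users into groups so their `restart_full_sync` calls can be spread out. The function names and the group size are hypothetical, and the page count is only a lower bound: it ignores retries, throttling, and any per-message requests the service may make.

```python
import math
from typing import Iterator, List, Sequence

GRAPH_PAGE_SIZE = 100  # page size stated in this runbook


def metadata_pages(mailbox_size: int, page_size: int = GRAPH_PAGE_SIZE) -> int:
    """Number of Graph list requests needed to page through a mailbox once."""
    return math.ceil(mailbox_size / page_size)


def stagger(users: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield users in fixed-size groups so their re-sync requests can be spaced out."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(users), batch_size):
        yield list(users[start:start + batch_size])


print(metadata_pages(100_000))  # 1000
print(list(stagger(["ana", "ben", "cy", "dee", "eli"], 2)))
# [['ana', 'ben'], ['cy', 'dee'], ['eli']]
```

How long to wait between groups depends on mailbox sizes and observed throttling, so the sketch leaves that choice to the operator.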
Backup recommendations
| Component | Recommendation | Rationale | Mode |
|---|---|---|---|
| PostgreSQL | Regular backups strongly recommended. Use your platform's backup solution (managed service snapshots, | Contains OAuth tokens, webhook subscriptions, and all sync state. Without a backup, all users must re-authenticate and full sync restarts from scratch. | Both modes: strongly recommended |
| RabbitMQ | Backup not required. | Queues carry only transient sync trigger events. Live catch-up email ingestion happens inline (not via RabbitMQ), so RabbitMQ loss only affects trigger delivery, not per-email ingestion. The 2-minute full-sync recovery scheduler and the live catch-up recovery scheduler (5-minute retry on failure, 30-minute retry on inactivity; cron: | Both modes: backup not required |
| Unique Knowledge Base | Managed by the Unique platform. | Backup and restore are the responsibility of the Unique platform operator. | Both modes: managed by Unique platform |
Risk if no PostgreSQL backup exists: every user must re-authenticate via OAuth and a full re-sync runs for each user. Existing emails in the Knowledge Base are not lost (re-ingestion is idempotent — only API call overhead, no duplicate data), but recovery time scales linearly with user count and mailbox size. For large deployments this can be significant, compounded by the shared Microsoft Graph API rate limit.
Data loss window
This sub-section applies to Mode A (`MicrosoftGraphAndUniqueApi`) deployments only. In `MicrosoftGraph` mode there is no ingestion pipeline.
Emails are sourced from Microsoft Graph, which retains the authoritative copy. In all three disaster scenarios, email content is not permanently lost — it can be re-fetched and re-ingested. The data loss window refers to the delay before the system catches up:
- Webhook notifications lost during an outage are recovered by the live catch-up recovery scheduler: if no new activity occurs for 30 minutes it retriggers live catch-up, which polls Microsoft Graph for any emails modified since the last known watermark.
- If a webhook subscription expires during an extended outage (subscriptions renew daily), users must call `reconnect_inbox` to re-create it. Emails received during the gap are picked up by the subsequent full re-sync.
- Worst case: emails received between the last successful live catch-up and service restoration are delayed, not lost. Full re-sync recovers all emails from Microsoft Graph that match the operator-configured Mail Filters; emails outside the configured date window or matching exclusion rules are not synced.
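The watermark idea behind live catch-up can be illustrated with a small client-side filter. This is only a sketch under assumptions: the `modified_since` helper is hypothetical, and the runbook does not say how the service actually queries Microsoft Graph (it may filter server-side). The `lastModifiedDateTime` field is the timestamp property that Microsoft Graph exposes on message resources.

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, List


def modified_since(messages: Iterable[Dict[str, str]], watermark: datetime) -> List[Dict[str, str]]:
    """Keep messages whose lastModifiedDateTime is later than the watermark."""
    selected = []
    for message in messages:
        # Graph returns ISO 8601 UTC timestamps ending in "Z".
        raw = message["lastModifiedDateTime"].replace("Z", "+00:00")
        if datetime.fromisoformat(raw) > watermark:
            selected.append(message)
    return selected


watermark = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
messages = [
    {"id": "a", "lastModifiedDateTime": "2024-05-01T11:59:00Z"},
    {"id": "b", "lastModifiedDateTime": "2024-05-01T12:30:00Z"},
]
print([m["id"] for m in modified_since(messages, watermark)])  # ['b']
```

Because the watermark survives in the local database, emails that arrived during an outage fall on the "later than" side and are picked up once catch-up runs again.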
Personnel
| Role | When needed |
|---|---|
| Kubernetes operator | All scenarios: restarts pods, updates secrets, runs migrations, enables debug mode. |
| Database / platform administrator | Scenario 1: restores or provisions PostgreSQL. Scenario 2: restores or provisions RabbitMQ. |
| End users | Scenario 1: must re-authenticate via the standard OAuth flow (reconnecting their MCP client). No tool call is required; OAuth completion automatically recreates the Graph webhook subscription and triggers a full sync (Mode A) or simply re-issues MCP tokens (Mode B). Scenario 2: must call |
No Microsoft tenant administrator action is required for recovery. Orphaned webhook subscriptions in Microsoft's systems expire automatically based on the expiration time set at creation (the service configures subscriptions to renew daily, so orphaned subscriptions typically expire within about 1 day; Microsoft allows up to 7 days for message subscriptions).
Scenario 1: Local PostgreSQL Database Loss
Symptoms
- Service fails to start with database connection errors in the logs.
- All MCP tools return errors or empty responses.
- No users appear connected: `verify_inbox_connection` returns `not_configured` for all users if the database has already been reprovisioned; otherwise it returns errors because it cannot connect to the database.
Impact
The local database stores OAuth tokens, Microsoft Graph webhook subscriptions, and all sync state. Total loss of the database means:
- All users must re-authenticate via the OAuth flow.
- All Graph webhook subscriptions are orphaned in Microsoft's systems (they expire naturally based on their expiration time, typically within about 1 day given the service's daily renewal cycle; Microsoft allows up to 7 days for message subscriptions).
- All sync state (cursor positions, progress counters) is lost.
- Emails already ingested into the Unique Knowledge Base are not affected; they remain searchable.
Recovery Steps
Mode A (MicrosoftGraphAndUniqueApi) Recovery
1. Restore or provision a new PostgreSQL instance and update `DATABASE_URL` in the Kubernetes secret if the connection string changed:

       kubectl create secret generic outlook-semantic-mcp-secrets \
         --namespace outlook-semantic-mcp \
         --from-literal=DATABASE_URL="postgresql://user:password@host:5432/outlook_semantic_mcp" \
         --dry-run=client -o yaml | kubectl apply -f -

2. Trigger a deploy or restart the service pods. Database migrations run automatically via the Helm hook on deploy, so there is no need to run them manually via `kubectl exec` (the pods may not be running if the DB was lost):

       kubectl rollout restart deploy/outlook-semantic-mcp -n outlook-semantic-mcp

3. Verify migrations ran successfully by checking the pod logs:

       kubectl logs deploy/outlook-semantic-mcp -n outlook-semantic-mcp | grep -i migration

4. Notify affected users that they must reconnect their inbox. Each user must reconnect their MCP client and complete the standard OAuth flow. The server publishes a `user-authorized` event on OAuth completion that automatically creates a new Graph webhook subscription and triggers a full sync. No tool call is needed (and `reconnect_inbox` cannot be called anyway, because the user has no MCP bearer token until OAuth completes).

5. After reconnection, a full sync starts automatically. Users can monitor progress with `sync_progress`.

6. Previously ingested emails remain in the Unique Knowledge Base and are unaffected. The post-recovery full sync checks each email against the Knowledge Base by file key and skips any that already exist. The only overhead is Microsoft Graph API calls and ingestion API lookups, not actual re-ingestion (see Idempotent re-ingestion).

See also: Authentication, Deployment, Security - Encryption
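Before writing a new `DATABASE_URL` into the secret, it can help to sanity-check its shape. The sketch below is an optional pre-check, not part of the service: the `check_database_url` helper is hypothetical, and the accepted schemes and the requirement for a user and password are assumptions based on the example connection string in step 1.

```python
from typing import List
from urllib.parse import urlsplit


def check_database_url(url: str) -> List[str]:
    """Return a list of problems found in a PostgreSQL connection URL."""
    parts = urlsplit(url)
    problems = []
    # Assumption: only these two scheme spellings are expected.
    if parts.scheme not in ("postgresql", "postgres"):
        problems.append(f"unexpected scheme: {parts.scheme!r}")
    if not parts.hostname:
        problems.append("missing host")
    if not parts.path.strip("/"):
        problems.append("missing database name")
    if parts.username is None or parts.password is None:
        problems.append("missing user or password")
    return problems


print(check_database_url("postgresql://user:password@host:5432/outlook_semantic_mcp"))  # []
print(check_database_url("postgresql://host:5432/"))
# ['missing database name', 'missing user or password']
```

An empty list only means the string is well-formed; it does not confirm that the database is reachable or that the credentials are valid.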
Mode B (MicrosoftGraph) Recovery
The recovery procedure is simpler in Mode B because there are no webhook subscriptions, sync state, or Knowledge Base scopes to restore.
Impact: All users' OAuth tokens and MCP session tokens are lost. Users cannot call tools until they re-authenticate.
Recovery Steps:
1. Restore or provision a new PostgreSQL instance and update `DATABASE_URL` in the Kubernetes secret.
2. Restart the service pods. Database migrations run automatically.
3. Notify affected users that they must reconnect their MCP client via the standard OAuth flow. No tool call is needed; re-authentication issues new MCP tokens.
Scenario 2: RabbitMQ Loss
Symptoms
- Service logs show AMQP connection errors or failed message publish attempts.
- In-progress full syncs complete their current batch but no new batches are triggered: `sync_progress` shows `fullSyncState: "running"` but `scheduledForIngestion` stops incrementing after the current batch finishes.
- Live catch-up stops processing new webhook notifications, so recently received emails are not ingested.
Impact
RabbitMQ carries in-flight sync trigger events between the service and its internal workers. Total loss means:
- Any full sync in progress at the time of failure is stalled. The sync state in the database is intact but the trigger event that drives the next batch is gone.
- Live catch-up trigger events (incoming webhook notifications from Microsoft Graph) that were in-flight in RabbitMQ at the time of failure are lost. Any in-progress catch-up run at the time of failure continues and completes normally (it does not depend on RabbitMQ for per-email ingestion).
- The local database and Unique Knowledge Base are not affected.
- No re-authentication is required.
Recovery Steps
Mode A (MicrosoftGraphAndUniqueApi) Recovery
1. Restore or provision a new RabbitMQ instance and update `AMQP_URL` in the Kubernetes secret if the connection string changed.

2. Restart the service pods to reconnect to RabbitMQ:

       kubectl rollout restart deploy/outlook-semantic-mcp -n outlook-semantic-mcp

3. Once the pods reconnect, the automatic schedulers handle recovery without user action:
   - The 2-minute full-sync retry scheduler re-triggers any stalled full syncs.
   - The live catch-up recovery scheduler retriggers any failed catch-ups within 5 minutes, and retriggers catch-ups that received no notifications within 30 minutes.

4. If a user reports not receiving new emails after the service has recovered, they can call `reconnect_inbox` to re-create the webhook subscription. A full sync starts automatically after reconnection.

See also: reconnect_inbox, sync_progress
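The retrigger rule for the live catch-up recovery scheduler can be summarized in a few lines. This is a sketch of the rule as described in this runbook, not the scheduler's implementation: the `should_retrigger` function and its parameters are hypothetical, and it assumes the "within 5 minutes" bound comes from how often the scheduler checks for failed runs.

```python
from datetime import datetime, timedelta

INACTIVITY_RETRY = timedelta(minutes=30)  # retrigger after this long without notifications


def should_retrigger(now: datetime, last_run_failed: bool, last_activity: datetime) -> bool:
    """Decide whether live catch-up should be retriggered for one user."""
    if last_run_failed:
        # Failed runs are retried on the next scheduler check (within 5 minutes).
        return True
    # Otherwise retrigger only after 30 minutes without notifications.
    return now - last_activity >= INACTIVITY_RETRY


now = datetime(2024, 1, 1, 12, 0)
print(should_retrigger(now, True, now))                            # True
print(should_retrigger(now, False, now - timedelta(minutes=10)))   # False
print(should_retrigger(now, False, now - timedelta(minutes=45)))   # True
```

The inactivity branch is why trigger events lost with RabbitMQ do not need manual replay: a mailbox that stays quiet for 30 minutes is polled again anyway.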
Mode B (MicrosoftGraph) Recovery
In MicrosoftGraph mode, RabbitMQ is a required infrastructure component but is not used for email ingestion or sync. Service impact is limited to connectivity loss until RabbitMQ is restored.
Impact: The service cannot connect to RabbitMQ. No email data is lost — there is no ingestion pipeline.
Recovery Steps:
1. Restore or provision a new RabbitMQ instance and update `AMQP_URL` in the Kubernetes secret if the connection string changed.
2. Restart the service pods to reconnect.

No user action is required.
Scenario 3: Unique Knowledge Base Loss
Symptoms
- `search_emails` returns no results or errors for all users.
- `sync_progress` shows `ingestionStats.failed` increasing, or `ingestionStats` returns `{ state: "error" }`.
- Service logs show errors contacting the Unique ingestion or scope management services.
Impact
The Unique Knowledge Base stores the actual ingested email content used for semantic search. Total loss means:
- All previously ingested emails are gone, so `search_emails` returns no results.
- The local database and its sync state are not affected.
- Microsoft Graph webhook subscriptions are not affected; live notifications continue to arrive.
- No re-authentication is required.
Recovery Steps
Mode A (MicrosoftGraphAndUniqueApi) Recovery
1. Restore or verify the Unique Knowledge Base is operational and reachable from the service. Confirm `UNIQUE_INGESTION_SERVICE_BASE_URL` and `UNIQUE_SCOPE_MANAGEMENT_SERVICE_BASE_URL` are correct in the Helm values.
2. Enable debug mode on the deployment if it is not already enabled, by setting `MCP_DEBUG_MODE=enabled` in `mcpConfig.app.mcpDebugMode` and restarting pods. This exposes debug tools including `restart_full_sync`, `run_full_sync`, `pause_full_sync`, and `resume_full_sync`. Note: Debug mode exposes these tools to all connected MCP users, not just operators, so restrict MCP client access during recovery. See Configuration.
3. Notify affected users that they must call `restart_full_sync` from their own MCP session. All MCP tools are scoped to the authenticated user; there is currently no admin API or batch endpoint to trigger this on behalf of users. This resets sync state in the local database and re-fetches all emails from Microsoft Graph, re-ingesting them into the restored Knowledge Base. Unlike Scenarios 1 and 2, emails must be fully re-ingested here: the Knowledge Base no longer contains the files, so the cost includes Microsoft Graph API calls, ingestion API calls, and the full content transfer. Subsequent runs are idempotent because file keys prevent duplicates.
4. Each user can monitor their own recovery progress with `sync_progress`. The sync is complete when `fullSyncState` transitions to `"ready"` and `state` is `"finished"`.
5. Live catch-up resumes automatically once ingestion is healthy. Emails received during the outage are processed through the normal webhook pipeline without additional operator action.
6. Once all affected users have called `restart_full_sync`, disable debug mode by removing or unsetting `MCP_DEBUG_MODE` in the Helm values and redeploying (requires a pod restart). Debug mode should not remain enabled in production.
See also: restart_full_sync, sync_progress, Configuration
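The completion condition from the steps above (`fullSyncState` is `"ready"` and `state` is `"finished"`) can be wrapped in a small polling helper. This is a sketch under assumptions: the function names are hypothetical, and `fetch` stands in for however a user's MCP client reads the two values from `sync_progress`, since the exact response layout is not described here.

```python
import time
from typing import Callable, Tuple


def is_full_sync_complete(full_sync_state: str, state: str) -> bool:
    """Completion condition stated in this runbook."""
    return full_sync_state == "ready" and state == "finished"


def wait_for_full_sync(
    fetch: Callable[[], Tuple[str, str]],
    interval_s: float = 60.0,
    max_polls: int = 120,
) -> bool:
    """Poll until the sync completes; return False if max_polls is exhausted."""
    for _ in range(max_polls):
        if is_full_sync_complete(*fetch()):
            return True
        time.sleep(interval_s)
    return False


# Fake progress source for illustration: running first, then complete.
samples = iter([("running", "running"), ("ready", "finished")])
print(wait_for_full_sync(lambda: next(samples), interval_s=0, max_polls=5))  # True
```

Bounding the number of polls keeps the helper from waiting forever if ingestion stalls; in that case the ingestion pipeline should be checked as described under Recovery time factors.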
Mode B (MicrosoftGraph) Impact
In MicrosoftGraph mode, the Unique Knowledge Base is required infrastructure but no email content is stored there — search always queries Microsoft Graph directly. A Knowledge Base outage does not affect search results or any user-facing tool. No user action is required. Restore the Knowledge Base to restore the service's ability to connect to it.