Error Handling & Recovery

Comprehensive guide to error handling, recovery mechanisms, and resilience strategies in the Zentalk protocol.

Overview

Zentalk implements a multi-layered error handling system designed to maintain communication continuity while preserving security guarantees. The system distinguishes between recoverable and unrecoverable errors, applying appropriate strategies for each category.

Design Principles

Principle	Description
Fail-secure	Errors default to secure behavior, never exposing sensitive data
Graceful degradation	Partial functionality maintained when possible
Automatic recovery	Self-healing without user intervention where safe
Transparency	Users informed of issues affecting their communication

Error Categories

Zentalk errors fall into four primary categories, each requiring distinct handling strategies.

Category Overview

Category	Examples	Severity	User Impact
Network	Timeout, DNS failure, connection reset	Low-High	Message delay
Cryptographic	Decryption failure, invalid signature	High	Message loss possible
Protocol	State mismatch, version incompatibility	Medium-High	Session reset possible
Storage	Quota exceeded, corruption	Medium-High	Data loss possible

Error Hierarchy


┌─────────────────────────────────────────────────────────────┐
│ ZentalkError (base)                                          │
├─────────────────────────────────────────────────────────────┤
│  ├── NetworkError                                            │
│  │    ├── TimeoutError                                       │
│  │    ├── ConnectionError                                    │
│  │    ├── DNSError                                           │
│  │    └── TLSError                                           │
│  │                                                           │
│  ├── CryptoError                                             │
│  │    ├── DecryptionError                                    │
│  │    ├── SignatureError                                     │
│  │    ├── MACError                                           │
│  │    └── KeyError                                           │
│  │                                                           │
│  ├── ProtocolError                                           │
│  │    ├── StateError                                         │
│  │    ├── VersionError                                       │
│  │    ├── SequenceError                                      │
│  │    └── HandshakeError                                     │
│  │                                                           │
│  └── StorageError                                            │
│       ├── QuotaError                                         │
│       ├── CorruptionError                                    │
│       ├── AccessError                                        │
│       └── TransactionError                                   │
└─────────────────────────────────────────────────────────────┘
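
The hierarchy above maps naturally onto exception subclasses. The following is a minimal Python sketch under stated assumptions, not Zentalk's actual implementation: the `code` attribute and the `category_of` helper are hypothetical names introduced for illustration. Only a few leaf classes are spelled out, because `TimeoutError`, `ConnectionError`, and `KeyError` from the diagram would shadow Python built-ins of the same name.

```python
class ZentalkError(Exception):
    """Base class for every error in the hierarchy."""

    def __init__(self, message: str, code: str = ""):
        super().__init__(message)
        self.code = code  # e.g. "NET_003"; attribute name is an assumption


# One class per category in the diagram.
class NetworkError(ZentalkError): pass
class CryptoError(ZentalkError): pass
class ProtocolError(ZentalkError): pass
class StorageError(ZentalkError): pass


# A few representative leaves; the rest follow the same pattern.
class DNSError(NetworkError): pass
class TLSError(NetworkError): pass
class DecryptionError(CryptoError): pass
class MACError(CryptoError): pass
class QuotaError(StorageError): pass


def category_of(err: ZentalkError) -> str:
    """Return the name of the top-level category an error belongs to."""
    for cls in (NetworkError, CryptoError, ProtocolError, StorageError):
        if isinstance(err, cls):
            return cls.__name__
    return "ZentalkError"
```

Catching `NetworkError` then handles every network leaf at once, while a handler for `MACError` can apply the stricter rules described later for that specific failure.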

Network Errors

Network errors are the most common error category and typically the most recoverable.

Error Types

Error	Code	Cause	Typical Duration
Connection Timeout	NET_001	Server unreachable, high latency	30s default
Connection Reset	NET_002	TCP RST received, server restart	Immediate
DNS Resolution Failed	NET_003	DNS server unreachable, invalid domain	5-30s
TLS Handshake Failed	NET_004	Certificate invalid, cipher mismatch	Immediate
Connection Refused	NET_005	Server not listening, firewall block	Immediate
Network Unreachable	NET_006	No internet connectivity	Variable
Host Unreachable	NET_007	Routing failure, host offline	Variable

Timeout Configuration

Different operations have different timeout requirements based on their expected duration and criticality.

Operation	Timeout	Rationale
TCP Connect	10 seconds	Fast networks should connect quickly
TLS Handshake	15 seconds	Includes certificate validation
Key Bundle Fetch	20 seconds	Server may need database lookup
Message Send	30 seconds	Includes relay path establishment
File Upload	60 seconds	Large payloads need more time
Circuit Creation	45 seconds	3-hop path establishment
DHT Lookup	30 seconds	Multiple node queries

Retry Strategy: Exponential Backoff

Zentalk uses exponential backoff with jitter to prevent thundering herd problems during service recovery.


Retry Algorithm:

1. Initialize:
   base_delay = 1 second
   max_delay = 30 seconds
   max_attempts = 10
   attempt = 0

2. On failure:
   if attempt >= max_attempts:
       return PERMANENT_FAILURE

   delay = min(base_delay * (2 ^ attempt), max_delay)
   jitter = random(0, delay * 0.1)
   actual_delay = delay + jitter

   wait(actual_delay)
   attempt = attempt + 1
   retry_operation()

3. On success:
   reset attempt counter
   resume normal operation

Retry Delay Table

Attempt	Base Delay	With Max Cap	Approximate Range (with jitter)
1	1s	1s	1.0 - 1.1s
2	2s	2s	2.0 - 2.2s
3	4s	4s	4.0 - 4.4s
4	8s	8s	8.0 - 8.8s
5	16s	16s	16.0 - 17.6s
6	32s	30s	30.0 - 33.0s
7-10	64s+	30s	30.0 - 33.0s
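
The delay computation can be sketched in Python as follows. The constants come from the algorithm above; the function names and the injectable `rng` parameter (which makes the jitter testable) are assumptions made for this example.

```python
import random

BASE_DELAY = 1.0    # seconds
MAX_DELAY = 30.0    # seconds
MAX_ATTEMPTS = 10


def backoff_delay(attempt: int, rng=random.random) -> float:
    """Delay in seconds before a retry, for a 0-based attempt counter."""
    delay = min(BASE_DELAY * (2 ** attempt), MAX_DELAY)
    jitter = rng() * delay * 0.1  # uniform in [0, 10% of the capped delay)
    return delay + jitter


def delay_schedule(rng=random.random) -> list[float]:
    """Delays for all retries, one per attempt up to MAX_ATTEMPTS."""
    return [backoff_delay(attempt, rng) for attempt in range(MAX_ATTEMPTS)]
```

Row 1 of the table corresponds to `attempt = 0` in the algorithm. Because jitter is added after the cap is applied, the actual wait can exceed `max_delay` by up to 10%, which is why the capped rows range up to 33 seconds.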

Offline Queue Management

When network connectivity is lost, messages are queued locally for later transmission.


Offline Queue Structure:

┌─────────────────────────────────────────────────────────────┐
│ Queue Entry                                                  │
├─────────────────────────────────────────────────────────────┤
│ message_id      │ UUID                                       │
│ recipient       │ Wallet address                             │
│ encrypted_data  │ Pre-encrypted message blob                 │
│ created_at      │ Unix timestamp (ms)                        │
│ attempts        │ Retry count                                │
│ last_attempt    │ Last retry timestamp                       │
│ priority        │ HIGH / NORMAL / LOW                        │
│ expires_at      │ TTL expiration timestamp                   │
└─────────────────────────────────────────────────────────────┘

Queue Parameter	Value	Description
Max queue size	1,000 messages	Prevents storage exhaustion
Max message age	7 days	Messages expire if not sent
Priority levels	3	HIGH (calls), NORMAL (chat), LOW (receipts)
Flush batch size	10 messages	Sent per reconnection cycle
Flush interval	5 seconds	Delay between batches
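
A minimal in-memory sketch of such a queue is shown below, using the parameters from the table. The class and method names are hypothetical. Rejecting new entries when the queue is full and silently dropping expired entries during a flush are assumptions, since the text does not specify either behavior.

```python
import heapq
from dataclasses import dataclass

MAX_QUEUE_SIZE = 1000
MAX_AGE_MS = 7 * 24 * 60 * 60 * 1000   # 7 days
FLUSH_BATCH_SIZE = 10
PRIORITY_RANK = {"HIGH": 0, "NORMAL": 1, "LOW": 2}


@dataclass
class QueueEntry:
    message_id: str
    recipient: str
    encrypted_data: bytes
    created_at: int            # Unix timestamp (ms)
    priority: str = "NORMAL"
    attempts: int = 0

    @property
    def expires_at(self) -> int:
        return self.created_at + MAX_AGE_MS


class OfflineQueue:
    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker so entries themselves are never compared

    def enqueue(self, entry: QueueEntry) -> bool:
        """Add an entry; return False if the queue is full (assumed policy)."""
        if len(self._heap) >= MAX_QUEUE_SIZE:
            return False
        key = (PRIORITY_RANK[entry.priority], entry.created_at, self._seq)
        heapq.heappush(self._heap, (*key, entry))
        self._seq += 1
        return True

    def next_batch(self, now_ms: int) -> list[QueueEntry]:
        """Pop up to FLUSH_BATCH_SIZE unexpired entries, highest priority first."""
        batch = []
        while self._heap and len(batch) < FLUSH_BATCH_SIZE:
            entry = heapq.heappop(self._heap)[-1]
            if entry.expires_at > now_ms:
                batch.append(entry)
        return batch
```

A real client would persist the queue rather than hold it in memory, and would wait the flush interval between successive `next_batch` calls.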

Reconnection Behavior


Reconnection State Machine:

┌───────────┐     network lost     ┌──────────────┐
│ CONNECTED │─────────────────────→│ DISCONNECTED │
└───────────┘                      └──────────────┘
      ↑                                   │
      │                                   │ start backoff
      │                                   ↓
      │ success                    ┌──────────────┐
      └────────────────────────────│ RECONNECTING │←─┐
                                   └──────────────┘  │
                                          │          │
                                          │ failure  │
                                          └──────────┘

State	Behavior	User Indication
CONNECTED	Normal operation	Green indicator
DISCONNECTED	Queue messages locally	Red indicator
RECONNECTING	Attempting connection	Yellow indicator

Network Error Recovery Actions

Error	Immediate Action	Retry Strategy	User Notification
Timeout	Retry with backoff	Up to 10 attempts	After 3 failures
Connection Reset	Immediate retry	Up to 5 attempts	After 2 failures
DNS Failure	Try alternate DNS	Up to 3 attempts	Immediate
TLS Failure	Check certificate	No retry (security)	Immediate
Connection Refused	Check server status	Up to 10 attempts	After 5 failures

Cryptographic Errors

Cryptographic errors indicate potential security issues and require careful handling to maintain protocol security.

Error Types

Error	Code	Cause	Security Implication
Decryption Failed	CRYPTO_001	Wrong key, corrupted ciphertext	Possible attack or desync
Invalid Signature	CRYPTO_002	Key mismatch, tampering	Possible MITM attack
MAC Verification Failed	CRYPTO_003	Message modified	Definite tampering
Key Derivation Failed	CRYPTO_004	Invalid input parameters	Implementation error
Invalid Public Key	CRYPTO_005	Malformed key, wrong curve	Protocol violation
Nonce Reuse Detected	CRYPTO_006	Same nonce used twice	Security critical
Key Expired	CRYPTO_007	SPK or session key too old	Rotation required

Decryption Failure Handling

When message decryption fails, the system must determine whether the failure is due to desynchronization or an attack.


Decryption Failure Decision Tree:

1. Decryption fails with current receiving chain key
   │
   ├─→ Check skipped message keys
   │   │
   │   ├─→ Found: Decrypt with skipped key, delete key
   │   │
   │   └─→ Not found: Continue to step 2
   │
2. Check if message contains new DH ratchet key
   │
   ├─→ Yes: Attempt DH ratchet step
   │   │
   │   ├─→ Success: Decrypt with new chain
   │   │
   │   └─→ Failure: Continue to step 3
   │
   └─→ No: Continue to step 3
   │
3. Increment failure counter for this session
   │
   ├─→ Counter < 3: Request message resend
   │
   ├─→ Counter >= 3: Trigger session reset
   │
   └─→ Counter >= 5: Also flag as potential attack
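
Step 3 of the tree can be sketched as a small mapping from the failure counter to actions. The action names are hypothetical labels. The sketch assumes that at five or more failures the attack flag is raised in addition to the session reset, since any counter of 5 or more also satisfies the reset condition.

```python
RESET_THRESHOLD = 3
ATTACK_THRESHOLD = 5


def actions_for_failure_count(counter: int) -> list[str]:
    """Map the per-session decryption failure counter to recovery actions."""
    if counter < RESET_THRESHOLD:
        return ["REQUEST_RESEND"]
    actions = ["SESSION_RESET"]
    if counter >= ATTACK_THRESHOLD:
        # Assumption: flagging happens alongside the reset, not instead of it.
        actions.append("FLAG_POTENTIAL_ATTACK")
    return actions
```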

When to Request Message Resend

Message resend is appropriate when decryption failure is likely due to network issues rather than cryptographic desynchronization.

Condition	Request Resend	Rationale
First decryption failure	Yes	Likely transient error
Message number gap detected	Yes	Messages may have been lost
Previous messages decrypted OK	Yes	Isolated failure
Multiple consecutive failures	No	Likely desync, need reset
MAC verification failed	No	Tampering detected
Same message fails twice	No	Not a transient error

When to Trigger Session Reset

Session reset destroys current cryptographic state and re-establishes the session via X3DH.

Trigger Condition	Action	Data Preserved
3+ consecutive decrypt failures	Automatic reset	Message history
Invalid ratchet state detected	Automatic reset	Message history
User requests reset	Manual reset	Message history
Peer sends reset request	Accept reset	Message history
Key compromise suspected	Force reset	Message history

MAC Verification Failure

A MAC (Message Authentication Code) failure means the message was modified after the sender authenticated it. Zentalk treats every such failure as tampering and handles it with zero tolerance.


MAC Failure Response:

1. IMMEDIATELY discard message
2. Do NOT attempt alternative decryption
3. Do NOT advance ratchet state
4. Log security event:
   {
     type: "MAC_FAILURE",
     peer: <address>,
     timestamp: <now>,
     message_header: <non-sensitive portion>
   }
5. Increment MAC failure counter
6. If counter >= 2 in 1 hour:
   - Flag session as potentially compromised
   - Notify user with security warning
   - Recommend session reset
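
Steps 5 and 6 amount to a sliding-window counter per peer. The sketch below is one way to express that in Python; the class name, the per-peer keying, and passing the current time explicitly are assumptions made so the example is easy to test.

```python
from collections import deque

WINDOW_SECONDS = 3600   # "in 1 hour"
FLAG_THRESHOLD = 2      # "counter >= 2"


class MacFailureTracker:
    def __init__(self):
        self._events = {}  # peer address -> deque of failure timestamps

    def record(self, peer: str, now: float) -> bool:
        """Record one MAC failure; return True if the session should be flagged."""
        events = self._events.setdefault(peer, deque())
        events.append(now)
        # Forget failures that fell out of the one-hour window.
        while events and now - events[0] > WINDOW_SECONDS:
            events.popleft()
        return len(events) >= FLAG_THRESHOLD
```

When `record` returns `True`, the caller would flag the session, show the security warning, and recommend a reset, as listed above.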

Signature Verification Failures

Failure Type	Possible Cause	Action
Identity key signature invalid	Key bundle tampered	Reject, warn user
SPK signature invalid	SPK corrupted or forged	Reject, fetch fresh bundle
Message signature invalid	Sender key changed	Verify key fingerprint
Timestamp signature invalid	Replay attack	Reject message

Protocol Errors

Protocol errors occur when communication violates expected state or format.

Error Types

Error	Code	Cause	Recovery Path
Invalid State Transition	PROTO_001	Message received in wrong state	Reset state machine
Version Mismatch	PROTO_002	Incompatible protocol versions	Negotiate or fail
Unknown Message Type	PROTO_003	Newer protocol, corrupted data	Ignore or request clarification
Sequence Number Invalid	PROTO_004	Replay or lost messages	Check skipped keys
Handshake Failed	PROTO_005	X3DH computation mismatch	Retry with fresh keys
Circuit ID Unknown	PROTO_006	Circuit destroyed or expired	Create new circuit
Stream ID Invalid	PROTO_007	Stream closed or never opened	Open new stream

State Machine Recovery

When the session state machine enters an invalid state, recovery depends on the current and expected states.


State Recovery Matrix:

Current          Expected         Recovery Action
─────────────────────────────────────────────────
NO_SESSION       ESTABLISHED      Initiate X3DH
PENDING          ESTABLISHED      Wait or retry X3DH
ESTABLISHED      NO_SESSION       Accept (peer reset)
CLOSED           Any active       Create new session
Invalid/Corrupt  Any              Force reset

Version Mismatch Handling

Client Version	Server Version	Behavior
v1.0	v1.0	Normal operation
v1.0	v2.0	Server downgrades if supported
v2.0	v1.0	Client downgrades if supported
v1.0	v3.0+	Connection refused, upgrade required


Version Negotiation:

1. Client sends supported_versions: [1.0, 1.1, 2.0]
2. Server selects highest common version
3. If no common version:
   Server responds:
   {
     error: "VERSION_MISMATCH",
     server_versions: [3.0, 3.1],
     client_versions: [1.0, 1.1, 2.0],
     upgrade_url: "https://zentalk.io/download"
   }
4. Client notifies user: "Update required"
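
The negotiation steps can be sketched as follows. Representing versions as strings and comparing them as tuples of integers is an assumption made here to avoid floating-point pitfalls (for example, `1.10` versus `1.9`); the function names are hypothetical.

```python
UPGRADE_URL = "https://zentalk.io/download"


def _version_key(version: str) -> tuple:
    """Turn "1.10" into (1, 10) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def negotiate(client_versions: list[str], server_versions: list[str]) -> dict:
    """Pick the highest common version, or build the mismatch response."""
    common = set(client_versions) & set(server_versions)
    if common:
        return {"version": max(common, key=_version_key)}
    return {
        "error": "VERSION_MISMATCH",
        "server_versions": list(server_versions),
        "client_versions": list(client_versions),
        "upgrade_url": UPGRADE_URL,
    }
```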

Invalid Message Type Handling

Message Type Range	Status	Action
0x01 - 0x0F	Known	Process normally
0x10 - 0xEF	Reserved	Log and ignore
0xF0 - 0xFE	Extension	Check extension support
0xFF	Error	Process error response
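
A dispatcher for these ranges could look like the sketch below. The returned action labels are hypothetical. The table does not cover `0x00`, so the sketch reports it as unspecified rather than guessing at an action.

```python
def classify_message_type(type_byte: int) -> str:
    """Map a one-byte message type to the action from the table."""
    if not 0 <= type_byte <= 0xFF:
        raise ValueError(f"message type out of range: {type_byte}")
    if 0x01 <= type_byte <= 0x0F:
        return "PROCESS_NORMALLY"
    if 0x10 <= type_byte <= 0xEF:
        return "LOG_AND_IGNORE"
    if 0xF0 <= type_byte <= 0xFE:
        return "CHECK_EXTENSION_SUPPORT"
    if type_byte == 0xFF:
        return "PROCESS_ERROR_RESPONSE"
    return "UNSPECIFIED"  # 0x00 is not listed in the table
```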

Double Ratchet Desynchronization

Desynchronization occurs when sender and receiver ratchet states diverge, preventing message decryption.

How Desync Happens

Cause	Frequency	Detection Difficulty
Network packet loss	Common	Easy
Device switch mid-conversation	Occasional	Medium
App crash during ratchet step	Rare	Medium
Storage corruption	Rare	Hard
Concurrent message sending	Occasional	Medium
Clock skew affecting ordering	Rare	Hard

Desync Scenarios


Scenario 1: Lost Message

Alice                               Bob
──────                              ───
Sends M1 (N=0) ─────────────────→  Receives M1
Sends M2 (N=1) ────────X           (lost in network)
Sends M3 (N=2) ─────────────────→  Receives M3

Bob expects N=1, receives N=2
Detection: Message number gap
Recovery: Bob stores skipped key for N=1


Scenario 2: Lost DH Ratchet

Alice                               Bob
──────                              ───
DH ratchet, sends M1 ──────X        (lost)
Sends M2 ─────────────────────────→ Receives M2

Bob has old DHr, cannot derive correct chain
Detection: Decryption fails with current state
Recovery: Attempt ratchet with received DH key


Scenario 3: State Corruption

Alice                               Bob
──────                              ───
Stores state to disk                Receives message
App crashes before flush            Decrypts successfully
App restarts with old state         Advances ratchet

Alice's state is behind Bob's state
Detection: Multiple decryption failures
Recovery: Session reset required

Detection Mechanisms

Mechanism	Detects	False Positive Rate
Message number gap	Lost messages	Very low
Consecutive decrypt failures	Chain desync	Low
DH key mismatch	Ratchet desync	Very low
Timestamp anomalies	Ordering issues	Medium
MAC failures	Corruption or attack	Very low

Desync Detection Algorithm


On message receipt:

1. Extract header: DH_pub, msg_num, prev_chain_len

2. Check for message number gap:
   if msg_num > Nr:
       gap_size = msg_num - Nr
       if gap_size > MAX_SKIP:
           return REJECT_POSSIBLE_DOS
       store_skipped_keys(Nr, msg_num - 1)

3. Check DH key:
   if DH_pub != current_DHr:
       if DH_pub in recent_DH_keys:
           // Out of order from previous chain
           use_previous_chain()
       else:
           // New DH ratchet
           perform_dh_ratchet(DH_pub)

4. Attempt decryption:
   if success:
       clear_failure_counter()
   else:
       increment_failure_counter()
       if failures >= DESYNC_THRESHOLD:
           initiate_recovery()
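
Step 2 of the algorithm, the gap check, can be isolated as a small function. The sketch below is illustrative only: the text does not give a value for `MAX_SKIP`, so the `1000` used here is an assumption, as are the function and exception names.

```python
MAX_SKIP = 1000  # assumed value; not specified in this document


class GapTooLargeError(Exception):
    """Raised when a message number gap exceeds MAX_SKIP (possible DoS)."""


def skipped_message_numbers(msg_num: int, nr: int, max_skip: int = MAX_SKIP) -> range:
    """Message numbers Nr .. msg_num - 1 whose keys must be stored.

    Returns an empty range when there is no gap.
    """
    if msg_num - nr > max_skip:
        raise GapTooLargeError(f"gap of {msg_num - nr} exceeds {max_skip}")
    return range(nr, msg_num)
```

For Scenario 1 above, Bob has `Nr = 1` and receives `msg_num = 2`, so the function yields only message number 1, the key Bob stores for the lost M2.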

Recovery Protocol


Desync Recovery Flow:

┌─────────────────┐
│ Desync Detected │
└────────┬────────┘
         │
         ▼
┌─────────────────────────┐
│ failures < 3?           │───Yes───→ Request resend
└───────────┬─────────────┘
            │ No
            ▼
┌─────────────────────────┐
│ Try alternative chains  │
│ (skipped keys, prev DH) │
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│ Alternative succeeded?  │───Yes───→ Resume normal
└───────────┬─────────────┘
            │ No
            ▼
┌─────────────────────────┐
│ Send RESET_REQUEST      │
│ to peer                 │
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│ Await RESET_ACK         │
│ (timeout: 30 seconds)   │
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│ Re-establish session    │
│ via X3DH                │
└─────────────────────────┘

Message Loss During Recovery

During session recovery, some messages may be unrecoverable.

Message State	Recovery Outcome
Successfully decrypted	Preserved in history
Queued for send	Re-encrypted with new session
In-flight during reset	Lost (resend notification sent)
Received but undecrypted	Lost (request resend from peer)


Recovery Message Handling:

1. Before reset:
   - Flush all pending decrypted messages to storage
   - Mark queued outgoing messages as NEEDS_REENCRYPT
   - Record message IDs of failed decryptions

2. After new session established:
   - Re-encrypt and resend queued messages
   - Request resend of failed incoming messages:
     {
       type: "RESEND_REQUEST",
       message_ids: [<list of lost message IDs>],
       reason: "session_reset"
     }

3. Peer responds:
   - Re-encrypts requested messages with new session
   - Marks messages as RESENT to prevent duplicates

Message Delivery Failures

Message delivery follows a state machine tracking each message from creation to confirmed delivery.

Delivery States


Message Delivery State Machine:

┌──────────┐   encrypt    ┌─────────┐   send    ┌────────┐
│ CREATING │─────────────→│ PENDING │──────────→│  SENT  │
└──────────┘              └─────────┘           └────────┘
                               │                    │
                          timeout/error         delivered
                               │                    │
                               ▼                    ▼
                          ┌─────────┐         ┌───────────┐
                          │ FAILED  │         │ DELIVERED │
                          └─────────┘         └───────────┘
                               │                    │
                          retry succeeds        read
                               │                    │
                               ▼                    ▼
                          ┌────────┐           ┌────────┐
                          │  SENT  │           │  READ  │
                          └────────┘           └────────┘

State	Description	Typical Duration
CREATING	Message being composed/encrypted	Milliseconds
PENDING	Encrypted, awaiting network	Until connected
SENT	Transmitted to relay network	Until ACK received
DELIVERED	Confirmed received by recipient	Until read
READ	Read receipt received	Final state
FAILED	Delivery failed after retries	Until manual retry
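
The state machine can be encoded as a transition table, which makes illegal transitions easy to reject. This is a sketch: the event names are taken from the diagram labels, while the `advance` function and the choice to raise on an unknown transition are assumptions.

```python
TRANSITIONS = {
    ("CREATING", "encrypt"): "PENDING",
    ("PENDING", "send"): "SENT",
    ("PENDING", "timeout"): "FAILED",
    ("PENDING", "error"): "FAILED",
    ("SENT", "delivered"): "DELIVERED",
    ("DELIVERED", "read"): "READ",
    ("FAILED", "retry_succeeds"): "SENT",
}


def advance(state: str, event: str) -> str:
    """Return the next delivery state, or raise if the transition is not allowed."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state} on {event}") from None
```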

Retry Behavior Per State

State	Automatic Retry	Max Attempts	Backoff Strategy
PENDING	Yes (when online)	Unlimited	Queue order
SENT (no ACK)	Yes	3	Exponential, max 30s
FAILED	No (user action)	User controlled	None
DELIVERED (no read)	No	N/A	N/A

Delivery Failure Causes

Cause	Detection Method	Retry Appropriate
Network timeout	No ACK within timeout	Yes
Recipient offline	Server queued response	Yes (server queues)
Recipient key changed	Key bundle mismatch	Yes (fetch new keys)
Recipient blocked sender	403 response	No
Message too large	413 response	No
Rate limited	429 response	Yes (after cooldown)
Server error	5xx response	Yes
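
For the causes that surface as HTTP status codes, the retry decision reduces to a lookup. The sketch below covers only the status codes listed in the table; treating any other status as non-retryable is a conservative assumption, and the function name is hypothetical.

```python
def should_retry(status: int) -> bool:
    """Decide whether a delivery failure with this HTTP status is worth retrying."""
    if status in (403, 413):      # blocked by recipient, message too large
        return False
    if status == 429:             # rate limited: retry after the cooldown
        return True
    if 500 <= status <= 599:      # server error
        return True
    return False                  # assumption: unknown statuses are not retried
```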

User Notification Strategy

Condition	Notification Type	Timing
First retry	None	Silent
3rd retry	Subtle indicator	Delayed badge
Max retries exhausted	Alert	Immediate
Permanent failure	Dialog	Immediate
Rate limited	Toast	Immediate


Notification Decision:

1. Message enters FAILED state
2. Determine failure type:
   - Transient (network): "Message will retry when online"
   - Permanent (blocked): "Message could not be delivered"
   - Recoverable (key change): "Recipient's keys changed. Verify and resend?"

3. Display appropriate UI:
   - Transient: Yellow warning icon on message
   - Permanent: Red error icon, tap for details
   - Recoverable: Action prompt with verify option

Permanent Failure Handling

Failure Type	User Action	System Action
Recipient blocked	Inform user	Remove from contacts (optional)
Invalid recipient	Prompt correction	Discard message
Message expired	Inform user	Archive or delete
Key verification failed	Prompt verification	Hold message

Storage Errors

Storage errors affect local data persistence and can lead to data loss if not handled correctly.

Error Types

Error	Code	Cause	Severity
Quota Exceeded	STORE_001	IndexedDB limit reached	High
Database Corruption	STORE_002	Unexpected shutdown, disk error	Critical
Transaction Failed	STORE_003	Concurrent access conflict	Medium
Access Denied	STORE_004	Browser permissions revoked	High
Version Mismatch	STORE_005	Database schema outdated	Medium
Encryption Key Lost	STORE_006	Key derivation failed	Critical

IndexedDB Quota Management

Browser storage quotas vary by platform and available disk space.

Platform	Typical Quota	Zentalk Target Usage
Chrome Desktop	60% of disk	Max 500 MB
Firefox Desktop	50% of disk	Max 500 MB
Safari Desktop	1 GB	Max 500 MB
Mobile browsers	50-100 MB	Max 50 MB

Quota Exceeded Handling


Quota Exceeded Response:

1. Identify storage consumers:
   - Message history: typically largest
   - Media cache: can be cleared
   - Session state: critical, cannot reduce
   - Logs: can be truncated

2. Execute cleanup strategy:
   Priority 1: Clear media cache
   Priority 2: Compress old messages
   Priority 3: Archive old conversations
   Priority 4: Prompt user for action

3. Cleanup thresholds:
   - At 80% quota: Clear media cache
   - At 90% quota: Archive conversations > 6 months
   - At 95% quota: Notify user, suggest export
   - At 100%: Emergency mode, queue to memory only
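
The cleanup thresholds can be expressed as a function from current usage to the list of actions due. The action labels are hypothetical, and the sketch assumes the thresholds are cumulative (crossing 90% implies the 80% action has also been triggered). Integer arithmetic is used so the comparisons are exact.

```python
THRESHOLDS = [
    (80, "CLEAR_MEDIA_CACHE"),
    (90, "ARCHIVE_CONVERSATIONS_OLDER_THAN_6_MONTHS"),
    (95, "NOTIFY_USER_SUGGEST_EXPORT"),
    (100, "EMERGENCY_MEMORY_ONLY_QUEUE"),
]


def cleanup_actions(used_bytes: int, quota_bytes: int) -> list[str]:
    """Return every cleanup action whose threshold has been reached."""
    if quota_bytes <= 0:
        raise ValueError("quota_bytes must be positive")
    return [
        action
        for percent, action in THRESHOLDS
        if used_bytes * 100 >= percent * quota_bytes
    ]
```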

Database Corruption Detection

Detection Method	Checks For	Frequency
Checksum validation	Data integrity	Every read
Schema verification	Structure validity	On app launch
Foreign key check	Referential integrity	On app launch
Index verification	Index corruption	Weekly
Transaction log	Incomplete writes	On app launch


Corruption Detection Algorithm:

1. On database open:
   - Verify schema version matches expected
   - Check critical tables exist
   - Validate index structures

2. On read operation:
   if stored_checksum != computed_checksum:
       mark_record_corrupted(record_id)
       attempt_recovery(record_id)

3. Periodic integrity check (weekly):
   for each table:
       for each record:
           validate_structure(record)
           validate_references(record)
           validate_checksum(record)
       report_corruption_rate()
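
The per-read check in step 2 can be sketched as follows. The text does not name the checksum algorithm, so SHA-256 is an assumption here, as are the dictionary-based store and the function names; recovery itself is left to the caller.

```python
import hashlib
from typing import Optional


def compute_checksum(payload: bytes) -> str:
    """Checksum of a record payload (SHA-256 is an assumed choice)."""
    return hashlib.sha256(payload).hexdigest()


def read_record(store: dict, record_id: str, corrupted: set) -> Optional[bytes]:
    """Return the payload if its checksum matches, else mark it corrupted.

    `store` maps record IDs to (payload, stored_checksum) pairs.
    """
    payload, stored_checksum = store[record_id]
    if compute_checksum(payload) != stored_checksum:
        corrupted.add(record_id)  # caller may then attempt recovery
        return None
    return payload
```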

Automatic Repair Strategies

Corruption Type	Repair Strategy	Success Rate
Single record	Restore from backup	High
Index corruption	Rebuild index	Very high
Table corruption	Restore table from backup	Medium
Schema corruption	Reset schema, migrate data	Medium
Full DB corruption	Restore from encrypted backup	Depends on backup


Repair Decision Tree:

┌─────────────────────────┐
│ Corruption Detected     │
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│ Scope Assessment        │
│ (single record/table/DB)│
└───────────┬─────────────┘
            │
   ┌────────┼────────┐
   │        │        │
   ▼        ▼        ▼
┌──────┐ ┌──────┐ ┌──────┐
│Record│ │Table │ │ Full │
└──┬───┘ └──┬───┘ └──┬───┘
   │        │        │
   ▼        ▼        ▼
Restore  Rebuild   Restore
from     from      from
cache    indices   backup

Manual Recovery Options

When automatic recovery fails, users have several manual options.

Option	Data Preserved	Complexity
Export and reimport	All exportable data	Medium
Restore from backup	Up to backup point	Low
Clear and resync	Contacts only	Low
Full reset	None (fresh start)	Very low


Manual Recovery UI Flow:

1. User accesses Settings → Data Recovery

2. Options presented:
   ┌─────────────────────────────────────────┐
   │ Recovery Options                         │
   ├─────────────────────────────────────────┤
   │ [Attempt Auto-Repair]                    │
   │   Try to fix corruption automatically    │
   │                                          │
   │ [Restore from Backup]                    │
   │   Last backup: 2 hours ago               │
   │                                          │
   │ [Export Available Data]                  │
   │   Save what can be recovered             │
   │                                          │
   │ [Clear All Data]                         │
   │   Start fresh (contacts re-sync)         │
   └─────────────────────────────────────────┘

3. After selection:
   - Confirm destructive actions
   - Show progress indicator
   - Report success/failure
   - Guide next steps

Error Codes Reference

Network Error Codes (NET_xxx)

Code	Name	Description	Resolution
NET_001	TIMEOUT	Operation exceeded time limit	Retry with backoff
NET_002	CONN_RESET	Connection forcibly closed	Reconnect
NET_003	DNS_FAILED	Domain name resolution failed	Check network, try alternate DNS
NET_004	TLS_FAILED	TLS handshake failed	Verify certificates, check time
NET_005	CONN_REFUSED	Server refused connection	Check server status
NET_006	NET_UNREACHABLE	No route to network	Check internet connection
NET_007	HOST_UNREACHABLE	Cannot reach specific host	Check host status
NET_008	CERT_EXPIRED	Server certificate expired	Contact server operator
NET_009	CERT_INVALID	Certificate validation failed	Security alert, do not proceed
NET_010	RATE_LIMITED	Too many requests	Wait and retry

Cryptographic Error Codes (CRYPTO_xxx)

Code	Name	Description	Resolution
CRYPTO_001	DECRYPT_FAILED	Decryption produced invalid data	Check keys, request resend
CRYPTO_002	SIG_INVALID	Signature verification failed	Verify sender identity
CRYPTO_003	MAC_FAILED	Authentication tag mismatch	Discard message, alert user
CRYPTO_004	KDF_FAILED	Key derivation error	Check inputs, retry
CRYPTO_005	KEY_INVALID	Public key validation failed	Request new key bundle
CRYPTO_006	NONCE_REUSE	Same nonce used twice	Critical security error
CRYPTO_007	KEY_EXPIRED	Key past validity period	Trigger key rotation
CRYPTO_008	ALGO_UNSUPPORTED	Unknown algorithm requested	Check protocol version
CRYPTO_009	RANDOM_FAILED	RNG failure	Critical system error
CRYPTO_010	KEY_MISMATCH	Public/private key mismatch	Regenerate keypair

Protocol Error Codes (PROTO_xxx)

Code	Name	Description	Resolution
PROTO_001	INVALID_STATE	Unexpected state transition	Reset state machine
PROTO_002	VERSION_MISMATCH	Incompatible protocol version	Upgrade client
PROTO_003	UNKNOWN_MSG_TYPE	Unrecognized message type	Ignore or upgrade
PROTO_004	SEQ_INVALID	Invalid sequence number	Check for replay
PROTO_005	HANDSHAKE_FAILED	X3DH computation failed	Retry with fresh keys
PROTO_006	CIRCUIT_UNKNOWN	Circuit ID not found	Create new circuit
PROTO_007	STREAM_INVALID	Stream ID not valid	Open new stream
PROTO_008	MSG_TOO_LARGE	Message exceeds size limit	Split or compress
PROTO_009	REPLAY_DETECTED	Duplicate message received	Discard message
PROTO_010	DESYNC_DETECTED	Ratchet desynchronization	Initiate recovery

Storage Error Codes (STORE_xxx)

Code	Name	Description	Resolution
STORE_001	QUOTA_EXCEEDED	Storage limit reached	Clear cache, archive old data
STORE_002	CORRUPTION	Data integrity check failed	Attempt repair or restore
STORE_003	TXN_FAILED	Database transaction failed	Retry operation
STORE_004	ACCESS_DENIED	Permission denied	Request permissions
STORE_005	SCHEMA_MISMATCH	Database schema outdated	Run migration
STORE_006	KEY_LOST	Encryption key unavailable	Restore from backup
STORE_007	NOT_FOUND	Requested record missing	Check ID, may be deleted
STORE_008	LOCKED	Database locked by another process	Wait and retry
STORE_009	FULL	Storage completely full	Emergency cleanup
STORE_010	INIT_FAILED	Database initialization failed	Clear and reinitialize

User-Facing vs Internal Errors

Error Type	User-Facing	Technical Details Shown
Network timeout	Yes	No (just “Connection problem”)
Decryption failed	Yes	No (just “Message unavailable”)
MAC failure	Yes	Partial (“Security warning”)
Storage quota	Yes	Yes (space remaining)
Protocol version	Yes	Yes (version numbers)
Internal errors	Yes	No (generic message)
Rate limiting	Yes	Yes (retry time)
Key expiration	No	N/A (auto-handled)

Error Message Templates

Code	User Message	Technical Log
NET_001	"Connection timed out. Retrying…"	"NET_001: Timeout after 30000ms to relay.zentalk.io:9001"
CRYPTO_003	"Message could not be verified. It may have been tampered with."	"CRYPTO_003: MAC verification failed for msg_id=abc123, session=def456"
STORE_001	"Storage full. Please free up space or archive old messages."	"STORE_001: QuotaExceededError, used=490MB, limit=500MB"
PROTO_002	"Please update Zentalk to continue."	"PROTO_002: Version mismatch, local=1.2.0, remote=2.0.0"

Recovery Best Practices

Error Recovery Priority

Priority	Error Category	Rationale
1 (Highest)	Security errors	Prevent data exposure
2	Cryptographic sync	Restore communication
3	Storage integrity	Prevent data loss
4	Network connectivity	Restore service
5 (Lowest)	UI/UX errors	User convenience

Logging and Diagnostics

Log Level	Error Types	Retention
ERROR	All failures	7 days
WARN	Retryable issues	3 days
INFO	Recovery actions	1 day
DEBUG	Detailed traces	Session only


Log Entry Structure:

{
  timestamp: "2024-01-15T10:30:00.000Z",
  level: "ERROR",
  code: "CRYPTO_001",
  message: "Decryption failed",
  context: {
    session_id: "abc123",
    peer: "0x1234...5678",
    message_num: 42,
    attempt: 2
  },
  stack: "<stack trace for DEBUG only>"
}

Circuit Breaker Pattern

For operations that repeatedly fail, Zentalk implements a circuit breaker to prevent resource exhaustion.

State	Behavior	Transition
CLOSED	Normal operation	Opens after 5 failures
OPEN	Fail fast, no attempts	Half-opens after 60s
HALF-OPEN	Allow single test request	Closes on success, opens on failure


Circuit Breaker Logic:

state = CLOSED
failure_count = 0
last_failure_time = null

on_operation():
    if state == OPEN:
        if now() - last_failure_time > 60 seconds:
            state = HALF_OPEN
        else:
            return FAIL_FAST

    result = attempt_operation()

    if result == SUCCESS:
        state = CLOSED
        failure_count = 0
    else:
        failure_count += 1
        last_failure_time = now()
        if failure_count >= 5:
            state = OPEN
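
The logic above translates into a small Python class. This is a sketch under stated assumptions: the class name, the `CircuitOpenError` exception used for the fail-fast path, and the injectable `clock` parameter are all introduced here for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised instead of attempting the operation while the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = "CLOSED"
        self.failure_count = 0
        self.last_failure_time = None

    def call(self, operation):
        if self.state == "OPEN":
            if self.clock() - self.last_failure_time > self.reset_timeout:
                self.state = "HALF_OPEN"   # allow a single test request
            else:
                raise CircuitOpenError("failing fast")
        try:
            result = operation()
        except Exception:
            self.failure_count += 1
            self.last_failure_time = self.clock()
            # A failed test request reopens the circuit immediately.
            if self.state == "HALF_OPEN" or self.failure_count >= self.failure_threshold:
                self.state = "OPEN"
            raise
        self.state = "CLOSED"
        self.failure_count = 0
        return result
```

Passing a fake `clock` makes the 60-second timeout testable without sleeping; production code would keep the default `time.monotonic`.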

Related Documentation

Protocol Specification - Cryptographic protocol details
Wire Protocol - Network message formats
Multi-Device Support - Device synchronization
Threat Model - Security considerations
