Error Handling & Recovery
Comprehensive guide to error handling, recovery mechanisms, and resilience strategies in the Zentalk protocol.
Overview
Zentalk implements a multi-layered error handling system designed to maintain communication continuity while preserving security guarantees. The system distinguishes between recoverable and unrecoverable errors, applying appropriate strategies for each category.
Design Principles
| Principle | Description |
|---|---|
| Fail-secure | Errors default to secure behavior, never exposing sensitive data |
| Graceful degradation | Partial functionality maintained when possible |
| Automatic recovery | Self-healing without user intervention where safe |
| Transparency | Users informed of issues affecting their communication |
Error Categories
Zentalk errors fall into four primary categories, each requiring distinct handling strategies.
Category Overview
| Category | Examples | Severity | User Impact |
|---|---|---|---|
| Network | Timeout, DNS failure, connection reset | Low-High | Message delay |
| Cryptographic | Decryption failure, invalid signature | High | Message loss possible |
| Protocol | State mismatch, version incompatibility | Medium-High | Session reset possible |
| Storage | Quota exceeded, corruption | Medium-High | Data loss possible |
Error Hierarchy
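The tree under this heading maps naturally onto an exception class hierarchy, which lets a handler catch a whole category at once. The following is a minimal Python sketch, not Zentalk's actual implementation. The `code` attribute and the `category_of` helper are hypothetical, and only one leaf per category is shown. Leaf names such as `TimeoutError`, `ConnectionError`, and `KeyError` would shadow Python built-ins, so a Python port would likely need to prefix or namespace them.

```python
from typing import Optional


class ZentalkError(Exception):
    """Base class; `code` is a hypothetical field for codes such as NET_003."""

    def __init__(self, message: str, code: Optional[str] = None) -> None:
        super().__init__(message)
        self.code = code


# One class per category in the tree.
class NetworkError(ZentalkError):
    pass


class CryptoError(ZentalkError):
    pass


class ProtocolError(ZentalkError):
    pass


class StorageError(ZentalkError):
    pass


# One representative leaf per category; the remaining leaves follow the same pattern.
class DNSError(NetworkError):
    pass


class MACError(CryptoError):
    pass


class SequenceError(ProtocolError):
    pass


class QuotaError(StorageError):
    pass


def category_of(err: ZentalkError) -> str:
    """Return the name of the category class that `err` belongs to."""
    for category in (NetworkError, CryptoError, ProtocolError, StorageError):
        if isinstance(err, category):
            return category.__name__
    return ZentalkError.__name__
```

With this shape, `except NetworkError:` handles every network leaf, while `except ZentalkError:` acts as the catch-all.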
┌─────────────────────────────────────────────────────────────┐
│ ZentalkError (base) │
├─────────────────────────────────────────────────────────────┤
│ ├── NetworkError │
│ │ ├── TimeoutError │
│ │ ├── ConnectionError │
│ │ ├── DNSError │
│ │ └── TLSError │
│ │ │
│ ├── CryptoError │
│ │ ├── DecryptionError │
│ │ ├── SignatureError │
│ │ ├── MACError │
│ │ └── KeyError │
│ │ │
│ ├── ProtocolError │
│ │ ├── StateError │
│ │ ├── VersionError │
│ │ ├── SequenceError │
│ │ └── HandshakeError │
│ │ │
│ └── StorageError │
│ ├── QuotaError │
│ ├── CorruptionError │
│ ├── AccessError │
│ └── TransactionError │
└─────────────────────────────────────────────────────────────┘
Network Errors
Network errors are the most common error category and typically the most recoverable.
Error Types
| Error | Code | Cause | Typical Duration |
|---|---|---|---|
| Connection Timeout | NET_001 | Server unreachable, high latency | 30s default |
| Connection Reset | NET_002 | TCP RST received, server restart | Immediate |
| DNS Resolution Failed | NET_003 | DNS server unreachable, invalid domain | 5-30s |
| TLS Handshake Failed | NET_004 | Certificate invalid, cipher mismatch | Immediate |
| Connection Refused | NET_005 | Server not listening, firewall block | Immediate |
| Network Unreachable | NET_006 | No internet connectivity | Variable |
| Host Unreachable | NET_007 | Routing failure, host offline | Variable |
Timeout Configuration
Different operations have different timeout requirements based on their expected duration and criticality.
| Operation | Timeout | Rationale |
|---|---|---|
| TCP Connect | 10 seconds | Fast networks should connect quickly |
| TLS Handshake | 15 seconds | Includes certificate validation |
| Key Bundle Fetch | 20 seconds | Server may need database lookup |
| Message Send | 30 seconds | Includes relay path establishment |
| File Upload | 60 seconds | Large payloads need more time |
| Circuit Creation | 45 seconds | 3-hop path establishment |
| DHT Lookup | 30 seconds | Multiple node queries |
Retry Strategy: Exponential Backoff
Zentalk uses exponential backoff with jitter to prevent thundering herd problems during service recovery.
Retry Algorithm:
1. Initialize:
     base_delay = 1 second
     max_delay = 30 seconds
     max_attempts = 10
     attempt = 0
2. On failure:
     if attempt >= max_attempts:
         return PERMANENT_FAILURE
     delay = min(base_delay * (2 ^ attempt), max_delay)
     jitter = random(0, delay * 0.1)
     actual_delay = delay + jitter
     wait(actual_delay)
     attempt = attempt + 1
     retry_operation()
3. On success:
     reset attempt counter
     resume normal operation
Retry Delay Table
| Attempt | Uncapped Delay | Capped Delay (max 30s) | Approximate Range (with jitter) |
|---|---|---|---|
| 1 | 1s | 1s | 1.0 - 1.1s |
| 2 | 2s | 2s | 2.0 - 2.2s |
| 3 | 4s | 4s | 4.0 - 4.4s |
| 4 | 8s | 8s | 8.0 - 8.8s |
| 5 | 16s | 16s | 16.0 - 17.6s |
| 6 | 32s | 30s | 30.0 - 33.0s |
| 7-10 | 64s+ | 30s | 30.0 - 33.0s |
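The rows of this table follow directly from the retry algorithm's parameters (1 s base, 30 s cap, up to 10% added jitter, 10 retries). A minimal Python sketch of that schedule is below. The function names and the injectable `sleep` and `rng` parameters are assumptions made for illustration and testing, not part of the Zentalk codebase. Attempt numbers here are 0-based, so attempt 0 corresponds to row 1 of the table.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

BASE_DELAY_S = 1.0
MAX_DELAY_S = 30.0
MAX_ATTEMPTS = 10
JITTER_FRACTION = 0.1


def backoff_delay(attempt: int, rng: Optional[random.Random] = None) -> float:
    """Delay in seconds before retry number `attempt` (0-based)."""
    source = rng if rng is not None else random
    capped = min(BASE_DELAY_S * (2 ** attempt), MAX_DELAY_S)
    # Jitter is only ever added, never subtracted, matching the algorithm.
    return capped + source.uniform(0.0, capped * JITTER_FRACTION)


def retry_with_backoff(
    operation: Callable[[], T],
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying on any exception until retries are exhausted."""
    attempt = 0
    while True:
        try:
            return operation()
        except Exception:
            if attempt >= MAX_ATTEMPTS:
                raise  # permanent failure: surface the last error
            sleep(backoff_delay(attempt))
            attempt += 1
```

Because the counter is checked before each wait, the operation runs at most 11 times: one initial try plus 10 retries.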
Offline Queue Management
When network connectivity is lost, messages are queued locally for later transmission.
Offline Queue Structure:
┌─────────────────────────────────────────────────────────────┐
│ Queue Entry │
├─────────────────────────────────────────────────────────────┤
│ message_id │ UUID │
│ recipient │ Wallet address │
│ encrypted_data │ Pre-encrypted message blob │
│ created_at │ Unix timestamp (ms) │
│ attempts │ Retry count │
│ last_attempt │ Last retry timestamp │
│ priority │ HIGH / NORMAL / LOW │
│ expires_at │ TTL expiration timestamp │
└─────────────────────────────────────────────────────────────┘
| Queue Parameter | Value | Description |
|---|---|---|
| Max queue size | 1,000 messages | Prevents storage exhaustion |
| Max message age | 7 days | Messages expire if not sent |
| Priority levels | 3 | HIGH (calls), NORMAL (chat), LOW (receipts) |
| Flush batch size | 10 messages | Sent per reconnection cycle |
| Flush interval | 5 seconds | Delay between batches |
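The queue parameters above can be illustrated with a small in-memory priority queue. This is a sketch under stated assumptions: the class and method names are hypothetical, a full queue rejects new entries (the source does not say which entry is dropped), and expired entries are discarded lazily when a batch is drawn. The 5-second pause between batches is left to the caller.

```python
import heapq
import itertools
from typing import Any, Dict, List, Tuple

MAX_QUEUE_SIZE = 1000
MAX_AGE_MS = 7 * 24 * 60 * 60 * 1000  # 7 days
FLUSH_BATCH_SIZE = 10
PRIORITY_RANK = {"HIGH": 0, "NORMAL": 1, "LOW": 2}


class OfflineQueue:
    """Priority-ordered outbox; FIFO within one priority level."""

    def __init__(self, max_size: int = MAX_QUEUE_SIZE) -> None:
        self.max_size = max_size
        self._heap: List[Tuple[int, int, Dict[str, Any]]] = []
        self._seq = itertools.count()  # tie-breaker that preserves insertion order

    def enqueue(self, message_id: str, recipient: str, encrypted_data: bytes,
                priority: str, now_ms: int) -> bool:
        """Queue a pre-encrypted message; returns False when the queue is full."""
        if len(self._heap) >= self.max_size:
            return False  # assumption: reject rather than evict
        entry = {
            "message_id": message_id,
            "recipient": recipient,
            "encrypted_data": encrypted_data,
            "created_at": now_ms,
            "attempts": 0,
            "last_attempt": None,
            "priority": priority,
            "expires_at": now_ms + MAX_AGE_MS,
        }
        heapq.heappush(self._heap, (PRIORITY_RANK[priority], next(self._seq), entry))
        return True

    def next_batch(self, now_ms: int,
                   batch_size: int = FLUSH_BATCH_SIZE) -> List[Dict[str, Any]]:
        """Pop up to `batch_size` unexpired entries, highest priority first."""
        batch: List[Dict[str, Any]] = []
        while self._heap and len(batch) < batch_size:
            _, _, entry = heapq.heappop(self._heap)
            if entry["expires_at"] <= now_ms:
                continue  # past its TTL: drop silently
            batch.append(entry)
        return batch
```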
Reconnection Behavior
Reconnection State Machine:
┌───────────┐     network lost     ┌──────────────┐
│ CONNECTED │─────────────────────→│ DISCONNECTED │
└───────────┘                      └──────────────┘
      ↑                                   │
      │                                   │ start backoff
      │                                   ↓
      │          success           ┌──────────────┐
      └────────────────────────────│ RECONNECTING │←─┐
                                   └──────────────┘  │
                                          │          │
                                          │ failure  │
                                          └──────────┘
| State | Behavior | User Indication |
|---|---|---|
| CONNECTED | Normal operation | Green indicator |
| DISCONNECTED | Queue messages locally | Red indicator |
| RECONNECTING | Attempting connection | Yellow indicator |
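The three states and their transitions can be captured in a lookup table, which makes illegal transitions explicit. This is a minimal sketch. The event names (`network_lost`, `start_backoff`, `success`, `failure`) are taken from the arrow labels in the state machine, while `next_state` and the `INDICATOR` map are hypothetical names.

```python
from enum import Enum


class ConnState(Enum):
    CONNECTED = "CONNECTED"
    DISCONNECTED = "DISCONNECTED"
    RECONNECTING = "RECONNECTING"


# (current state, event) -> next state, one entry per arrow in the diagram.
TRANSITIONS = {
    (ConnState.CONNECTED, "network_lost"): ConnState.DISCONNECTED,
    (ConnState.DISCONNECTED, "start_backoff"): ConnState.RECONNECTING,
    (ConnState.RECONNECTING, "failure"): ConnState.RECONNECTING,
    (ConnState.RECONNECTING, "success"): ConnState.CONNECTED,
}

# User-facing indicator colour per state, from the table.
INDICATOR = {
    ConnState.CONNECTED: "green",
    ConnState.DISCONNECTED: "red",
    ConnState.RECONNECTING: "yellow",
}


def next_state(state: ConnState, event: str) -> ConnState:
    """Apply `event` to `state`, rejecting transitions the diagram does not draw."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state.name} on {event!r}") from None
```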
Network Error Recovery Actions
| Error | Immediate Action | Retry Strategy | User Notification |
|---|---|---|---|
| Timeout | Retry with backoff | Up to 10 attempts | After 3 failures |
| Connection Reset | Immediate retry | Up to 5 attempts | After 2 failures |
| DNS Failure | Try alternate DNS | Up to 3 attempts | Immediate |
| TLS Failure | Check certificate | No retry (security) | Immediate |
| Connection Refused | Check server status | Up to 10 attempts | After 5 failures |
Cryptographic Errors
Cryptographic errors indicate potential security issues and require careful handling to maintain protocol security.
Error Types
| Error | Code | Cause | Security Implication |
|---|---|---|---|
| Decryption Failed | CRYPTO_001 | Wrong key, corrupted ciphertext | Possible attack or desync |
| Invalid Signature | CRYPTO_002 | Key mismatch, tampering | Possible MITM attack |
| MAC Verification Failed | CRYPTO_003 | Message modified | Definite tampering |
| Key Derivation Failed | CRYPTO_004 | Invalid input parameters | Implementation error |
| Invalid Public Key | CRYPTO_005 | Malformed key, wrong curve | Protocol violation |
| Nonce Reuse Detected | CRYPTO_006 | Same nonce used twice | Security critical |
| Key Expired | CRYPTO_007 | SPK or session key too old | Rotation required |
Decryption Failure Handling
When message decryption fails, the system must determine whether the failure is due to desynchronization or an attack.
Decryption Failure Decision Tree:
1. Decryption fails with current receiving chain key
│
├─→ Check skipped message keys
│ │
│ ├─→ Found: Decrypt with skipped key, delete key
│ │
│ └─→ Not found: Continue to step 2
│
2. Check if message contains new DH ratchet key
│
├─→ Yes: Attempt DH ratchet step
│ │
│ ├─→ Success: Decrypt with new chain
│ │
│ └─→ Failure: Continue to step 3
│
└─→ No: Continue to step 3
│
3. Increment failure counter for this session
│
├─→ Counter < 3: Request message resend
│
├─→ Counter >= 3: Trigger session reset
│
└─→ Counter >= 5: Flag as potential attack
When to Request Message Resend
Message resend is appropriate when decryption failure is likely due to network issues rather than cryptographic desynchronization.
| Condition | Request Resend | Rationale |
|---|---|---|
| First decryption failure | Yes | Likely transient error |
| Message number gap detected | Yes | Messages may have been lost |
| Previous messages decrypted OK | Yes | Isolated failure |
| Multiple consecutive failures | No | Likely desync, need reset |
| MAC verification failed | No | Tampering detected |
| Same message fails twice | No | Not a transient error |
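The conditions in this table reduce to a small predicate. The sketch below is one possible reading, not the reference logic: it treats "multiple consecutive failures" as reaching the reset threshold of 3 used in the decision tree, and the function and parameter names are hypothetical.

```python
RESET_THRESHOLD = 3  # from the decision tree: counter >= 3 triggers a session reset


def should_request_resend(
    session_failures: int,
    mac_failed: bool,
    message_failed_before: bool,
) -> bool:
    """Decide whether a failed message is worth asking the peer to resend.

    `session_failures` is the consecutive-failure counter for the session,
    already incremented for the current failure.
    """
    if mac_failed:
        return False  # tampering detected: never retry
    if message_failed_before:
        return False  # same message failing twice is not transient
    # Below the threshold the failure is treated as isolated or transient.
    return session_failures < RESET_THRESHOLD
```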
When to Trigger Session Reset
Session reset destroys current cryptographic state and re-establishes the session via X3DH.
| Trigger Condition | Action | Data Preserved |
|---|---|---|
| 3+ consecutive decrypt failures | Automatic reset | Message history |
| Invalid ratchet state detected | Automatic reset | Message history |
| User requests reset | Manual reset | Message history |
| Peer sends reset request | Accept reset | Message history |
| Key compromise suspected | Force reset | Message history |
MAC Verification Failure
MAC (Message Authentication Code) failures indicate definite message tampering and are handled with zero tolerance.
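The escalation rule that this section applies to MAC failures, two or more within one hour, needs a counter that forgets old events. A sliding window is one way to do that. The class below is a sketch with hypothetical names, and the caller is assumed to supply timestamps in seconds.

```python
from collections import deque
from typing import Deque


class MacFailureMonitor:
    """Counts MAC failures inside a sliding time window for one session."""

    def __init__(self, threshold: int = 2, window_s: float = 3600.0) -> None:
        self.threshold = threshold
        self.window_s = window_s
        self._events: Deque[float] = deque()

    def record_failure(self, now_s: float) -> bool:
        """Record a failure; return True when the session should be flagged."""
        self._events.append(now_s)
        # Drop failures that have aged out of the window.
        while self._events and now_s - self._events[0] > self.window_s:
            self._events.popleft()
        return len(self._events) >= self.threshold
```

A `True` result corresponds to flagging the session as potentially compromised and warning the user.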
MAC Failure Response:
1. IMMEDIATELY discard message
2. Do NOT attempt alternative decryption
3. Do NOT advance ratchet state
4. Log security event:
     {
       type: "MAC_FAILURE",
       peer: <address>,
       timestamp: <now>,
       message_header: <non-sensitive portion>
     }
5. Increment MAC failure counter
6. If counter >= 2 in 1 hour:
     - Flag session as potentially compromised
     - Notify user with security warning
     - Recommend session reset
Signature Verification Failures
| Failure Type | Possible Cause | Action |
|---|---|---|
| Identity key signature invalid | Key bundle tampered | Reject, warn user |
| SPK signature invalid | SPK corrupted or forged | Reject, fetch fresh bundle |
| Message signature invalid | Sender key changed | Verify key fingerprint |
| Timestamp signature invalid | Replay attack | Reject message |
Protocol Errors
Protocol errors occur when communication violates expected state or format.
Error Types
| Error | Code | Cause | Recovery Path |
|---|---|---|---|
| Invalid State Transition | PROTO_001 | Message received in wrong state | Reset state machine |
| Version Mismatch | PROTO_002 | Incompatible protocol versions | Negotiate or fail |
| Unknown Message Type | PROTO_003 | Newer protocol, corrupted data | Ignore or request clarification |
| Sequence Number Invalid | PROTO_004 | Replay or lost messages | Check skipped keys |
| Handshake Failed | PROTO_005 | X3DH computation mismatch | Retry with fresh keys |
| Circuit ID Unknown | PROTO_006 | Circuit destroyed or expired | Create new circuit |
| Stream ID Invalid | PROTO_007 | Stream closed or never opened | Open new stream |
State Machine Recovery
When the session state machine enters an invalid state, recovery depends on the current and expected states.
State Recovery Matrix:
Current          Expected      Recovery Action
─────────────────────────────────────────────────
NO_SESSION       ESTABLISHED   Initiate X3DH
PENDING          ESTABLISHED   Wait or retry X3DH
ESTABLISHED      NO_SESSION    Accept (peer reset)
CLOSED           Any active    Create new session
Invalid/Corrupt  Any           Force reset
Version Mismatch Handling
| Client Version | Server Version | Behavior |
|---|---|---|
| v1.0 | v1.0 | Normal operation |
| v1.0 | v2.0 | Server downgrades if supported |
| v2.0 | v1.0 | Client downgrades if supported |
| v1.0 | v3.0+ | Connection refused, upgrade required |
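Selecting the highest version that both sides support, and producing the mismatch response when there is none, can be sketched as follows. The function names are hypothetical and versions are assumed to be dotted numeric strings. They are compared as integer tuples so that `1.10` sorts above `1.9`. The response fields mirror the `VERSION_MISMATCH` payload documented in this section.

```python
from typing import Dict, List, Tuple


def _parse(version: str) -> Tuple[int, ...]:
    """Turn '1.10' into (1, 10) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def negotiate(client_versions: List[str], server_versions: List[str]) -> Dict[str, object]:
    """Pick the highest common version, or build the mismatch response."""
    common = set(client_versions) & set(server_versions)
    if common:
        return {"version": max(common, key=_parse)}
    return {
        "error": "VERSION_MISMATCH",
        "server_versions": list(server_versions),
        "client_versions": list(client_versions),
        "upgrade_url": "https://zentalk.io/download",
    }
```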
Version Negotiation:
1. Client sends supported_versions: [1.0, 1.1, 2.0]
2. Server selects highest common version
3. If no common version:
     Server responds:
     {
       error: "VERSION_MISMATCH",
       server_versions: [3.0, 3.1],
       client_versions: [1.0, 1.1, 2.0],
       upgrade_url: "https://zentalk.io/download"
     }
4. Client notifies user: "Update required"
Invalid Message Type Handling
| Message Type | Known | Action |
|---|---|---|
| 0x01 - 0x0F | Yes | Process normally |
| 0x10 - 0xEF | Reserved | Log and ignore |
| 0xF0 - 0xFE | Extension | Check extension support |
| 0xFF | Error | Process error response |
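The byte ranges in this table translate into a simple range dispatch. In the sketch below the function name and the returned action labels are hypothetical. Note that `0x00` is not covered by the table, so the sketch reports it as unspecified rather than guessing an action.

```python
def classify_message_type(type_byte: int) -> str:
    """Map a one-byte message type to the handling action from the table."""
    if not 0x00 <= type_byte <= 0xFF:
        raise ValueError(f"message type out of byte range: {type_byte}")
    if 0x01 <= type_byte <= 0x0F:
        return "process"            # known types
    if 0x10 <= type_byte <= 0xEF:
        return "log_and_ignore"     # reserved
    if 0xF0 <= type_byte <= 0xFE:
        return "check_extension"    # extension range
    if type_byte == 0xFF:
        return "process_error"      # error response
    return "unspecified"            # 0x00: not listed in the table
```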
Double Ratchet Desynchronization
Desynchronization occurs when sender and receiver ratchet states diverge, preventing message decryption.
How Desync Happens
| Cause | Frequency | Detection Difficulty |
|---|---|---|
| Network packet loss | Common | Easy |
| Device switch mid-conversation | Occasional | Medium |
| App crash during ratchet step | Rare | Medium |
| Storage corruption | Rare | Hard |
| Concurrent message sending | Occasional | Medium |
| Clock skew affecting ordering | Rare | Hard |
Desync Scenarios
Scenario 1: Lost Message
Alice                                Bob
──────                               ───
Sends M1 (N=0) ─────────────────→    Receives M1
Sends M2 (N=1) ────────X (lost in network)
Sends M3 (N=2) ─────────────────→    Receives M3
Bob expects N=1, receives N=2
Detection: Message number gap
Recovery: Bob stores skipped key for N=1
Scenario 2: Lost DH Ratchet
Alice                                Bob
──────                               ───
DH ratchet, sends M1 ──────X (lost)
Sends M2 ─────────────────────────→  Receives M2
Bob has old DHr, cannot derive correct chain
Detection: Decryption fails with current state
Recovery: Attempt ratchet with received DH key
Scenario 3: State Corruption
Alice                                Bob
──────                               ───
Stores state to disk                 Receives message
App crashes before flush             Decrypts successfully
App restarts with old state          Advances ratchet
Alice's state is behind Bob's state
Detection: Multiple decryption failures
Recovery: Session reset required
Detection Mechanisms
| Mechanism | Detects | False Positive Rate |
|---|---|---|
| Message number gap | Lost messages | Very low |
| Consecutive decrypt failures | Chain desync | Low |
| DH key mismatch | Ratchet desync | Very low |
| Timestamp anomalies | Ordering issues | Medium |
| MAC failures | Corruption or attack | Very low |
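The first mechanism in this table, the message number gap, is cheap to check before any decryption is attempted. The sketch below computes which message numbers need a skipped key. The function name and the `MAX_SKIP` value of 1000 are assumptions, since the source does not state the limit.

```python
MAX_SKIP = 1000  # assumed limit; the source names MAX_SKIP but gives no value


class GapTooLargeError(Exception):
    """Raised when a gap is large enough to look like a denial-of-service attempt."""


def skipped_message_numbers(msg_num: int, next_expected: int,
                            max_skip: int = MAX_SKIP) -> range:
    """Return the message numbers whose keys must be stored as skipped keys.

    `next_expected` is the receiver's counter Nr; an empty range means no gap.
    """
    if msg_num <= next_expected:
        return range(0)  # in order, or an older message handled elsewhere
    gap_size = msg_num - next_expected
    if gap_size > max_skip:
        raise GapTooLargeError(f"gap of {gap_size} exceeds limit {max_skip}")
    return range(next_expected, msg_num)
```

In the lost-message scenario, where the receiver expects N=1 and receives N=2, the function returns `range(1, 2)`, that is, a single skipped key for N=1.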
Desync Detection Algorithm
On message receipt:
1. Extract header: DH_pub, msg_num, prev_chain_len
2. Check for message number gap:
     if msg_num > Nr:
         gap_size = msg_num - Nr
         if gap_size > MAX_SKIP:
             return REJECT_POSSIBLE_DOS
         store_skipped_keys(Nr, msg_num - 1)
3. Check DH key:
     if DH_pub != current_DHr:
         if DH_pub in recent_DH_keys:
             // Out of order from previous chain
             use_previous_chain()
         else:
             // New DH ratchet
             perform_dh_ratchet(DH_pub)
4. Attempt decryption:
     if success:
         clear_failure_counter()
     else:
         increment_failure_counter()
         if failures >= DESYNC_THRESHOLD:
             initiate_recovery()
Recovery Protocol
Desync Recovery Flow:
┌─────────────────┐
│ Desync Detected │
└────────┬────────┘
│
▼
┌─────────────────────────┐
│ failures < 3? │───Yes───→ Request resend
└───────────┬─────────────┘
│ No
▼
┌─────────────────────────┐
│ Try alternative chains │
│ (skipped keys, prev DH) │
└───────────┬─────────────┘
│
▼
┌─────────────────────────┐
│ Alternative succeeded? │───Yes───→ Resume normal
└───────────┬─────────────┘
│ No
▼
┌─────────────────────────┐
│ Send RESET_REQUEST │
│ to peer │
└───────────┬─────────────┘
│
▼
┌─────────────────────────┐
│ Await RESET_ACK │
│ (timeout: 30 seconds) │
└───────────┬─────────────┘
│
▼
┌─────────────────────────┐
│ Re-establish session │
│ via X3DH │
└─────────────────────────┘
Message Loss During Recovery
During session recovery, some messages may be unrecoverable.
| Message State | Recovery Outcome |
|---|---|
| Successfully decrypted | Preserved in history |
| Queued for send | Re-encrypted with new session |
| In-flight during reset | Lost (resend notification sent) |
| Received but undecrypted | Lost (request resend from peer) |
Recovery Message Handling:
1. Before reset:
     - Flush all pending decrypted messages to storage
     - Mark queued outgoing messages as NEEDS_REENCRYPT
     - Record message IDs of failed decryptions
2. After new session established:
     - Re-encrypt and resend queued messages
     - Request resend of failed incoming messages:
       {
         type: "RESEND_REQUEST",
         message_ids: [<list of lost message IDs>],
         reason: "session_reset"
       }
3. Peer responds:
     - Re-encrypts requested messages with new session
     - Marks messages as RESENT to prevent duplicates
Message Delivery Failures
Message delivery follows a state machine tracking each message from creation to confirmed delivery.
Delivery States
Message Delivery State Machine:
┌──────────┐   encrypt    ┌─────────┐   send    ┌────────┐
│ CREATING │─────────────→│ PENDING │──────────→│  SENT  │
└──────────┘              └─────────┘           └────────┘
                               │                    │
                         timeout/error          delivered
                               │                    │
                               ▼                    ▼
                          ┌─────────┐         ┌───────────┐
                          │ FAILED  │         │ DELIVERED │
                          └─────────┘         └───────────┘
                               │                    │
                        retry succeeds            read
                               │                    │
                               ▼                    ▼
                          ┌────────┐            ┌────────┐
                          │  SENT  │            │  READ  │
                          └────────┘            └────────┘
| State | Description | Typical Duration |
|---|---|---|
| CREATING | Message being composed/encrypted | Milliseconds |
| PENDING | Encrypted, awaiting network | Until connected |
| SENT | Transmitted to relay network | Until ACK received |
| DELIVERED | Confirmed received by recipient | Until read |
| READ | Read receipt received | Final state |
| FAILED | Delivery failed after retries | Until manual retry |
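Encoding the delivery states with an explicit transition table keeps a message from jumping to a state it cannot legally reach. The sketch below is illustrative only. The edges are read from the state machine diagram in this section, with the timeout/error arrow taken to leave `PENDING`. A `SENT` to `FAILED` edge is not drawn there, so it is omitted here as well. `advance` is a hypothetical helper.

```python
from enum import Enum


class DeliveryState(Enum):
    CREATING = "CREATING"
    PENDING = "PENDING"
    SENT = "SENT"
    DELIVERED = "DELIVERED"
    READ = "READ"
    FAILED = "FAILED"


# Each state maps to the set of states it may move to next.
ALLOWED_TRANSITIONS = {
    DeliveryState.CREATING: {DeliveryState.PENDING},                     # encrypt
    DeliveryState.PENDING: {DeliveryState.SENT, DeliveryState.FAILED},   # send / timeout
    DeliveryState.SENT: {DeliveryState.DELIVERED},                       # delivered
    DeliveryState.DELIVERED: {DeliveryState.READ},                       # read
    DeliveryState.FAILED: {DeliveryState.SENT},                          # retry succeeds
    DeliveryState.READ: set(),                                           # final state
}


def advance(current: DeliveryState, target: DeliveryState) -> DeliveryState:
    """Return `target` if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```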
Retry Behavior Per State
| State | Automatic Retry | Max Attempts | Backoff Strategy |
|---|---|---|---|
| PENDING | Yes (when online) | Unlimited | Queue order |
| SENT (no ACK) | Yes | 3 | Exponential, max 30s |
| FAILED | No (user action) | User controlled | None |
| DELIVERED (no read) | No | N/A | N/A |
Delivery Failure Causes
| Cause | Detection Method | Retry Appropriate |
|---|---|---|
| Network timeout | No ACK within timeout | Yes |
| Recipient offline | Server queued response | Yes (server queues) |
| Recipient key changed | Key bundle mismatch | Yes (fetch new keys) |
| Recipient blocked sender | 403 response | No |
| Message too large | 413 response | No |
| Rate limited | 429 response | Yes (after cooldown) |
| Server error | 5xx response | Yes |
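For the causes that surface as HTTP-style status codes, the retry decision can be written as a small mapping. This sketch covers only the status codes listed in the table. The enum and function names are hypothetical, and treating any unlisted status as non-retryable is an assumption made here for safety.

```python
from enum import Enum


class RetryPolicy(Enum):
    RETRY = "retry"
    RETRY_AFTER_COOLDOWN = "retry_after_cooldown"
    NO_RETRY = "no_retry"


def retry_policy_for_status(status: int) -> RetryPolicy:
    """Map a response status code to the retry behaviour from the table."""
    if status == 429:
        return RetryPolicy.RETRY_AFTER_COOLDOWN  # rate limited
    if 500 <= status <= 599:
        return RetryPolicy.RETRY                 # server error
    if status in (403, 413):
        return RetryPolicy.NO_RETRY              # blocked sender / message too large
    # Assumption: statuses the table does not list are not retried automatically.
    return RetryPolicy.NO_RETRY
```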
User Notification Strategy
| Condition | Notification Type | Timing |
|---|---|---|
| First retry | None | Silent |
| 3rd retry | Subtle indicator | Delayed badge |
| Max retries exhausted | Alert | Immediate |
| Permanent failure | Dialog | Immediate |
| Rate limited | Toast | Immediate |
Notification Decision:
1. Message enters FAILED state
2. Determine failure type:
     - Transient (network): "Message will retry when online"
     - Permanent (blocked): "Message could not be delivered"
     - Recoverable (key change): "Recipient's keys changed. Verify and resend?"
3. Display appropriate UI:
     - Transient: Yellow warning icon on message
     - Permanent: Red error icon, tap for details
     - Recoverable: Action prompt with verify option
Permanent Failure Handling
| Failure Type | User Action | System Action |
|---|---|---|
| Recipient blocked | Inform user | Remove from contacts (optional) |
| Invalid recipient | Prompt correction | Discard message |
| Message expired | Inform user | Archive or delete |
| Key verification failed | Prompt verification | Hold message |
Storage Errors
Storage errors affect local data persistence and can lead to data loss if not handled correctly.
Error Types
| Error | Code | Cause | Severity |
|---|---|---|---|
| Quota Exceeded | STORE_001 | IndexedDB limit reached | High |
| Database Corruption | STORE_002 | Unexpected shutdown, disk error | Critical |
| Transaction Failed | STORE_003 | Concurrent access conflict | Medium |
| Access Denied | STORE_004 | Browser permissions revoked | High |
| Version Mismatch | STORE_005 | Database schema outdated | Medium |
| Encryption Key Lost | STORE_006 | Key derivation failed | Critical |
IndexedDB Quota Management
Browser storage quotas vary by platform and available disk space.
| Platform | Typical Quota | Zentalk Target Usage |
|---|---|---|
| Chrome Desktop | 60% of disk | Max 500 MB |
| Firefox Desktop | 50% of disk | Max 500 MB |
| Safari Desktop | 1 GB | Max 500 MB |
| Mobile browsers | 50-100 MB | Max 50 MB |
Quota Exceeded Handling
Quota Exceeded Response:
1. Identify storage consumers:
     - Message history: typically largest
     - Media cache: can be cleared
     - Session state: critical, cannot reduce
     - Logs: can be truncated
2. Execute cleanup strategy:
     Priority 1: Clear media cache
     Priority 2: Compress old messages
     Priority 3: Archive old conversations
     Priority 4: Prompt user for action
3. Cleanup thresholds:
     - At 80% quota: Clear media cache
     - At 90% quota: Archive conversations > 6 months
     - At 95% quota: Notify user, suggest export
     - At 100%: Emergency mode, queue to memory only
Database Corruption Detection
| Detection Method | Checks For | Frequency |
|---|---|---|
| Checksum validation | Data integrity | Every read |
| Schema verification | Structure validity | On app launch |
| Foreign key check | Referential integrity | On app launch |
| Index verification | Index corruption | Weekly |
| Transaction log | Incomplete writes | On app launch |
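Checksum validation on every read, the first method in this table, amounts to storing a digest next to each record and recomputing it when the record is loaded. The sketch below assumes SHA-256 and a dictionary-shaped record. The source specifies neither the hash function nor the record layout, and the function names are hypothetical.

```python
import hashlib
from typing import Dict, Union


class RecordCorruptedError(Exception):
    """Raised when a stored checksum does not match the payload."""


def checksum(payload: bytes) -> str:
    """Hex digest of the payload (SHA-256 is an assumption)."""
    return hashlib.sha256(payload).hexdigest()


def wrap_record(payload: bytes) -> Dict[str, Union[bytes, str]]:
    """Attach a checksum at write time."""
    return {"payload": payload, "checksum": checksum(payload)}


def read_record(record: Dict[str, Union[bytes, str]]) -> bytes:
    """Verify the checksum at read time and return the payload."""
    payload = record["payload"]
    assert isinstance(payload, bytes)
    if checksum(payload) != record["checksum"]:
        raise RecordCorruptedError("stored checksum does not match payload")
    return payload
```

A caller would catch `RecordCorruptedError`, mark the record as corrupted, and hand it to whatever repair strategy applies.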
Corruption Detection Algorithm:
1. On database open:
     - Verify schema version matches expected
     - Check critical tables exist
     - Validate index structures
2. On read operation:
     if stored_checksum != computed_checksum:
         mark_record_corrupted(record_id)
         attempt_recovery(record_id)
3. Periodic integrity check (weekly):
     for each table:
         for each record:
             validate_structure(record)
             validate_references(record)
             validate_checksum(record)
     report_corruption_rate()
Automatic Repair Strategies
| Corruption Type | Repair Strategy | Success Rate |
|---|---|---|
| Single record | Restore from backup | High |
| Index corruption | Rebuild index | Very high |
| Table corruption | Restore table from backup | Medium |
| Schema corruption | Reset schema, migrate data | Medium |
| Full DB corruption | Restore from encrypted backup | Depends on backup |
Repair Decision Tree:
┌─────────────────────────┐
│ Corruption Detected     │
└───────────┬─────────────┘
            │
            ▼
┌─────────────────────────┐
│ Scope Assessment        │
│ (single record/table/DB)│
└───────────┬─────────────┘
            │
   ┌────────┼────────┐
   │        │        │
   ▼        ▼        ▼
┌──────┐ ┌──────┐ ┌──────┐
│Record│ │Table │ │ Full │
└──┬───┘ └──┬───┘ └──┬───┘
   │        │        │
   ▼        ▼        ▼
Restore  Rebuild  Restore
 from     from     from
 cache   indices  backup
Manual Recovery Options
When automatic recovery fails, users have several manual options.
| Option | Data Preserved | Complexity |
|---|---|---|
| Export and reimport | All exportable data | Medium |
| Restore from backup | Up to backup point | Low |
| Clear and resync | Contacts only | Low |
| Full reset | None (fresh start) | Very low |
Manual Recovery UI Flow:
1. User accesses Settings → Data Recovery
2. Options presented:
┌─────────────────────────────────────────┐
│ Recovery Options │
├─────────────────────────────────────────┤
│ [Attempt Auto-Repair] │
│ Try to fix corruption automatically │
│ │
│ [Restore from Backup] │
│ Last backup: 2 hours ago │
│ │
│ [Export Available Data] │
│ Save what can be recovered │
│ │
│ [Clear All Data] │
│ Start fresh (contacts re-sync) │
└─────────────────────────────────────────┘
3. After selection:
     - Confirm destructive actions
     - Show progress indicator
     - Report success/failure
     - Guide next steps
Error Codes Reference
Network Error Codes (NET_xxx)
| Code | Name | Description | Resolution |
|---|---|---|---|
| NET_001 | TIMEOUT | Operation exceeded time limit | Retry with backoff |
| NET_002 | CONN_RESET | Connection forcibly closed | Reconnect |
| NET_003 | DNS_FAILED | Domain name resolution failed | Check network, try alternate DNS |
| NET_004 | TLS_FAILED | TLS handshake failed | Verify certificates, check time |
| NET_005 | CONN_REFUSED | Server refused connection | Check server status |
| NET_006 | NET_UNREACHABLE | No route to network | Check internet connection |
| NET_007 | HOST_UNREACHABLE | Cannot reach specific host | Check host status |
| NET_008 | CERT_EXPIRED | Server certificate expired | Contact server operator |
| NET_009 | CERT_INVALID | Certificate validation failed | Security alert, do not proceed |
| NET_010 | RATE_LIMITED | Too many requests | Wait and retry |
Cryptographic Error Codes (CRYPTO_xxx)
| Code | Name | Description | Resolution |
|---|---|---|---|
| CRYPTO_001 | DECRYPT_FAILED | Decryption produced invalid data | Check keys, request resend |
| CRYPTO_002 | SIG_INVALID | Signature verification failed | Verify sender identity |
| CRYPTO_003 | MAC_FAILED | Authentication tag mismatch | Discard message, alert user |
| CRYPTO_004 | KDF_FAILED | Key derivation error | Check inputs, retry |
| CRYPTO_005 | KEY_INVALID | Public key validation failed | Request new key bundle |
| CRYPTO_006 | NONCE_REUSE | Same nonce used twice | Critical security error |
| CRYPTO_007 | KEY_EXPIRED | Key past validity period | Trigger key rotation |
| CRYPTO_008 | ALGO_UNSUPPORTED | Unknown algorithm requested | Check protocol version |
| CRYPTO_009 | RANDOM_FAILED | RNG failure | Critical system error |
| CRYPTO_010 | KEY_MISMATCH | Public/private key mismatch | Regenerate keypair |
Protocol Error Codes (PROTO_xxx)
| Code | Name | Description | Resolution |
|---|---|---|---|
| PROTO_001 | INVALID_STATE | Unexpected state transition | Reset state machine |
| PROTO_002 | VERSION_MISMATCH | Incompatible protocol version | Upgrade client |
| PROTO_003 | UNKNOWN_MSG_TYPE | Unrecognized message type | Ignore or upgrade |
| PROTO_004 | SEQ_INVALID | Invalid sequence number | Check for replay |
| PROTO_005 | HANDSHAKE_FAILED | X3DH computation failed | Retry with fresh keys |
| PROTO_006 | CIRCUIT_UNKNOWN | Circuit ID not found | Create new circuit |
| PROTO_007 | STREAM_INVALID | Stream ID not valid | Open new stream |
| PROTO_008 | MSG_TOO_LARGE | Message exceeds size limit | Split or compress |
| PROTO_009 | REPLAY_DETECTED | Duplicate message received | Discard message |
| PROTO_010 | DESYNC_DETECTED | Ratchet desynchronization | Initiate recovery |
Storage Error Codes (STORE_xxx)
| Code | Name | Description | Resolution |
|---|---|---|---|
| STORE_001 | QUOTA_EXCEEDED | Storage limit reached | Clear cache, archive old data |
| STORE_002 | CORRUPTION | Data integrity check failed | Attempt repair or restore |
| STORE_003 | TXN_FAILED | Database transaction failed | Retry operation |
| STORE_004 | ACCESS_DENIED | Permission denied | Request permissions |
| STORE_005 | SCHEMA_MISMATCH | Database schema outdated | Run migration |
| STORE_006 | KEY_LOST | Encryption key unavailable | Restore from backup |
| STORE_007 | NOT_FOUND | Requested record missing | Check ID, may be deleted |
| STORE_008 | LOCKED | Database locked by another process | Wait and retry |
| STORE_009 | FULL | Storage completely full | Emergency cleanup |
| STORE_010 | INIT_FAILED | Database initialization failed | Clear and reinitialize |
User-Facing vs Internal Errors
| Error Type | User-Facing | Technical Details Shown |
|---|---|---|
| Network timeout | Yes | No (just “Connection problem”) |
| Decryption failed | Yes | No (just “Message unavailable”) |
| MAC failure | Yes | Partial (“Security warning”) |
| Storage quota | Yes | Yes (space remaining) |
| Protocol version | Yes | Yes (version numbers) |
| Internal errors | Yes | No (generic message) |
| Rate limiting | Yes | Yes (retry time) |
| Key expiration | No | N/A (auto-handled) |
Error Message Templates
| Code | User Message | Technical Log |
|---|---|---|
| NET_001 | "Connection timed out. Retrying…" | "NET_001: Timeout after 30000ms to relay.zentalk.io:9001" |
| CRYPTO_003 | "Message could not be verified. It may have been tampered with." | "CRYPTO_003: MAC verification failed for msg_id=abc123, session=def456" |
| STORE_001 | "Storage full. Please free up space or archive old messages." | "STORE_001: QuotaExceededError, used=490MB, limit=500MB" |
| PROTO_002 | "Please update Zentalk to continue." | "PROTO_002: Version mismatch, local=1.2.0, remote=2.0.0" |
Recovery Best Practices
Error Recovery Priority
| Priority | Error Category | Rationale |
|---|---|---|
| 1 (Highest) | Security errors | Prevent data exposure |
| 2 | Cryptographic sync | Restore communication |
| 3 | Storage integrity | Prevent data loss |
| 4 | Network connectivity | Restore service |
| 5 (Lowest) | UI/UX errors | User convenience |
Logging and Diagnostics
| Log Level | Error Types | Retention |
|---|---|---|
| ERROR | All failures | 7 days |
| WARN | Retryable issues | 3 days |
| INFO | Recovery actions | 1 day |
| DEBUG | Detailed traces | Session only |
Log Entry Structure:
{
  timestamp: "2024-01-15T10:30:00.000Z",
  level: "ERROR",
  code: "CRYPTO_001",
  message: "Decryption failed",
  context: {
    session_id: "abc123",
    peer: "0x1234...5678",
    message_num: 42,
    attempt: 2
  },
  stack: "<stack trace for DEBUG only>"
}
Circuit Breaker Pattern
For operations that repeatedly fail, Zentalk implements a circuit breaker to prevent resource exhaustion.
| State | Behavior | Transition |
|---|---|---|
| CLOSED | Normal operation | Opens after 5 failures |
| OPEN | Fail fast, no attempts | Half-opens after 60s |
| HALF-OPEN | Allow single test request | Closes on success, opens on failure |
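The three states in this table fit in a small class. The sketch below is a minimal Python illustration rather than Zentalk's implementation. The class name, the `CircuitOpenError` exception, and the injectable `clock` parameter (added so the 60-second timeout can be tested without waiting) are all assumptions.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without trying."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self.state = "CLOSED"
        self.failure_count = 0
        self._last_failure_time: Optional[float] = None

    def call(self, operation: Callable[[], T]) -> T:
        if self.state == "OPEN":
            assert self._last_failure_time is not None
            if self._clock() - self._last_failure_time > self.reset_timeout_s:
                self.state = "HALF_OPEN"  # allow a single test request
            else:
                raise CircuitOpenError("failing fast while circuit is open")
        try:
            result = operation()
        except Exception:
            self._record_failure()
            raise
        # Any success closes the circuit and clears the counter.
        self.state = "CLOSED"
        self.failure_count = 0
        return result

    def _record_failure(self) -> None:
        self.failure_count += 1
        self._last_failure_time = self._clock()
        # A failed test request reopens at once; otherwise open at the threshold.
        if self.state == "HALF_OPEN" or self.failure_count >= self.failure_threshold:
            self.state = "OPEN"
```

Wrapping each outbound operation as `breaker.call(lambda: send(message))` lets callers distinguish a rejected call (`CircuitOpenError`) from a genuine failure of the operation itself.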
Circuit Breaker Logic:
state = CLOSED
failure_count = 0
last_failure_time = null
on_operation():
    if state == OPEN:
        if now() - last_failure_time > 60 seconds:
            state = HALF_OPEN
        else:
            return FAIL_FAST
    result = attempt_operation()
    if result == SUCCESS:
        state = CLOSED
        failure_count = 0
    else:
        failure_count += 1
        last_failure_time = now()
        if failure_count >= 5:
            state = OPEN
Related Documentation
- Protocol Specification - Cryptographic protocol details
- Wire Protocol - Network message formats
- Multi-Device Support - Device synchronization
- Threat Model - Security considerations