Link Preview Privacy

Technical specification for privacy-preserving link previews in Zentalk.

Overview

Link previews enhance user experience by displaying website metadata (title, description, thumbnail) inline with messages. However, traditional implementations create significant privacy risks. Zentalk implements a relay-based architecture in which the target server sees only an exit relay, never the IP address or client-identifying headers of the user requesting the preview.

The Privacy Problem

Traditional Link Preview Risks

When a messaging application fetches a link preview directly from the target URL, it exposes sensitive information:

Risk	Description	Privacy Impact
IP Address Exposure	User’s real IP sent to target server	Location tracking, identity correlation
Referer Header Leakage	Messenger identified via Referer or User-Agent headers	Usage pattern disclosure
Timing Correlation	Request timing reveals message activity	Behavioral analysis
DNS Leakage	DNS queries reveal browsing intent	ISP surveillance, network monitoring
TLS Fingerprinting	Client characteristics exposed	Device identification

Attack Vectors in Traditional Implementations

Attack	Mechanism	Consequence
Tracking Pixel Injection	Unique URL per recipient	Identify who viewed preview
IP Harvesting	Log requests to shared links	Map user IP addresses
Timing Analysis	Correlate preview fetch with message send	Deanonymize senders
Referer Mining	Extract messenger identity from headers	Profile user’s app usage
Request Fingerprinting	Analyze TLS/HTTP characteristics	Identify device types

Correlation Attack Example


Traditional Preview Flow (INSECURE):

1. Alice sends link to Bob in Zentalk
2. Alice's client fetches preview from target.com
   → Target sees: IP=Alice, Referer=zentalk-client
3. Bob's client fetches preview from target.com
   → Target sees: IP=Bob, Referer=zentalk-client
4. Target can correlate:
   → Two Zentalk users accessed same URL
   → Timing suggests communication between them
   → IPs reveal approximate locations

Zentalk’s Solution

Relay-Based Architecture

Zentalk routes all preview requests through a distributed relay network, ensuring no direct connection between user clients and target URLs.

Component	Role	Knowledge
User Client	Requests preview via relay	Knows target URL
Entry Relay	First hop in relay chain	Knows client IP, not target URL
Exit Relay	Fetches from target URL	Knows target URL, not client IP
Target Server	Serves preview content	Sees only relay IP

Privacy Guarantees

Property	Mechanism	Guarantee
IP Anonymity	Multi-hop relay routing	Target never sees client IP
Referer Protection	Relay strips/replaces headers	No messenger identification
Timing Obfuscation	Batched requests, random delays	Correlation resistance
DNS Privacy	Relay performs DNS resolution	Client DNS queries hidden
Request Unlinkability	Per-request circuit rotation	No persistent fingerprint

Comparison with Traditional Approaches

Approach	IP Hidden	Referer Hidden	Timing Protected	Decentralized
Direct Fetch	No	No	No	N/A
Single Proxy	Yes	Yes	No	No
VPN-Based	Yes	Partial	No	No
Tor-Style (Zentalk)	Yes	Yes	Partial	Yes

Preview Generation Flow

End-to-End Process


Privacy-Preserving Preview Flow:

1. USER PASTES LINK
   → Client detects URL pattern in message input
   → Preview request initiated (if enabled)

2. BUILD RELAY CIRCUIT
   → Select 2-hop circuit from relay pool
   → Entry relay: Knows client, not destination
   → Exit relay: Knows destination, not client

3. ENCRYPT REQUEST
   → Construct preview request
   → Encrypt for exit relay (inner layer)
   → Encrypt for entry relay (outer layer)

4. ROUTE THROUGH RELAYS
   → Client → Entry Relay (onion layer 1)
   → Entry Relay → Exit Relay (onion layer 2)
   → Exit Relay → Target URL (ordinary HTTPS, no onion layers)

5. FETCH AND SANITIZE
   → Exit relay fetches target URL
   → Content sanitized (scripts removed)
   → Metadata extracted (title, description, image)

6. RETURN ENCRYPTED RESPONSE
   → Exit relay encrypts response for client
   → Routed back through entry relay
   → Client decrypts preview data

7. CACHE AND DISPLAY
   → Preview cached locally (encrypted)
   → Rendered in message compose area
   → Attached to message when sent

Circuit Selection

Parameter	Value	Rationale
Hop Count	2	Balance: privacy vs. latency
Entry Selection	From guard set	Limits how many relays ever see the client IP
Exit Selection	Random from pool	Geographic diversity
Circuit Lifetime	Single request	Maximum unlinkability
Parallel Circuits	3 pre-built	Low-latency preview generation
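
The selection parameters above can be illustrated with a minimal sketch. The names `Circuit`, `build_circuit`, and `prebuild_circuits` are hypothetical, relays are represented as plain strings, and the rule that the exit must differ from the entry is an assumption that this specification does not state explicitly.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class Circuit:
    entry: str  # relay that sees the client IP but not the target URL
    exit: str   # relay that sees the target URL but not the client IP


def build_circuit(guard_set: list[str], relay_pool: list[str]) -> Circuit:
    """Pick a 2-hop circuit: entry from the guard set, exit at random from the pool."""
    if not guard_set:
        raise ValueError("guard set is empty")
    entry = secrets.choice(guard_set)
    # Assumption: the exit must differ from the entry, otherwise a single
    # relay would know both the client IP and the target URL.
    candidates = [relay for relay in relay_pool if relay != entry]
    if not candidates:
        raise ValueError("no exit relay distinct from the entry relay")
    return Circuit(entry=entry, exit=secrets.choice(candidates))


def prebuild_circuits(guard_set: list[str], relay_pool: list[str], count: int = 3) -> list[Circuit]:
    """Pre-build `count` circuits; each one is intended for a single request."""
    return [build_circuit(guard_set, relay_pool) for _ in range(count)]
```

Using `secrets.choice` rather than `random.choice` keeps relay selection unpredictable to an observer, which seems appropriate for a privacy-sensitive choice.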

Request Timing

Phase	Typical Duration	Maximum
Circuit selection	10ms	50ms
Onion encryption	5ms	20ms
Relay routing	100-300ms	2s
Target fetch	200-500ms	5s
Content parsing	50ms	200ms
Total preview time	400-900ms	8s

Preview Data Extraction

Metadata Sources

Preview data is extracted from target pages in priority order:

Priority	Source	Fields Extracted
1	Open Graph tags	og:title, og:description, og:image
2	Twitter Card tags	twitter:title, twitter:description, twitter:image
3	HTML meta tags	title, description
4	Structured data	JSON-LD, Schema.org
5	Page content	First heading, first paragraph

Extracted Fields

Field	Source Priority	Max Length	Fallback
Title	og:title → twitter:title → title tag → h1	200 chars	Domain name
Description	og:description → meta description → first p	500 chars	None
Image URL	og:image → twitter:image → first img	N/A	None
Site Name	og:site_name → domain	100 chars	Domain
Type	og:type → inferred	50 chars	"website"
Favicon	link rel="icon" → /favicon.ico	N/A	None

Open Graph Extraction


Metadata Extraction Process:

1. PARSE HTML
   document = parse_html(response_body)

2. EXTRACT OPEN GRAPH
   og_tags = document.query_all('meta[property^="og:"]')
   FOR EACH tag IN og_tags:
       key = tag.property.replace("og:", "")
       value = tag.content
       metadata[key] = sanitize(value)

3. EXTRACT TWITTER CARDS
   twitter_tags = document.query_all('meta[name^="twitter:"]')
   FOR EACH tag IN twitter_tags:
       key = tag.name.replace("twitter:", "")
       IF key NOT IN metadata:
           metadata[key] = sanitize(tag.content)

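4. FALLBACK TO HTML
   IF "title" NOT IN metadata:
       title_tag = document.query('title')
       IF title_tag IS NOT NULL:
           metadata["title"] = sanitize(title_tag.text)
       ELSE:
           metadata["title"] = domain_of(url)   // fallback: domain name

   IF "description" NOT IN metadata:
       meta_desc = document.query('meta[name="description"]')
       IF meta_desc IS NOT NULL:
           metadata["description"] = sanitize(meta_desc.content)

5. TRUNCATE AND SANITIZE
   metadata["title"] = truncate(metadata["title"], 200)
   IF "description" IN metadata:
       metadata["description"] = truncate(metadata["description"], 500)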
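
As a rough illustration of the extraction order, the following sketch uses only Python's standard `html.parser`. The class `MetaExtractor` and the function `extract_metadata` are hypothetical names, and the sketch covers only the first three sources (Open Graph, Twitter Cards, HTML meta tags); structured data and page-content fallbacks are left out.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collects Open Graph, Twitter Card, and plain HTML metadata."""

    def __init__(self):
        super().__init__()
        self.og = {}
        self.twitter = {}
        self.html = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            content = (a.get("content") or "").strip()
            prop = a.get("property") or ""
            name = a.get("name") or ""
            if prop.startswith("og:"):
                self.og.setdefault(prop[3:], content)
            elif name.startswith("twitter:"):
                self.twitter.setdefault(name[8:], content)
            elif name == "description":
                self.html.setdefault("description", content)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.html["title"] = self.html.get("title", "") + data


def extract_metadata(html_text, domain):
    parser = MetaExtractor()
    parser.feed(html_text)
    parser.close()
    merged = {}
    # Apply the lowest-priority source first so higher priorities overwrite it.
    for source in (parser.html, parser.twitter, parser.og):
        merged.update({k: v for k, v in source.items() if v})
    title = merged.get("title", "").strip() or domain  # fallback: domain name
    description = merged.get("description", "").strip() or None
    return {
        "title": title[:200],
        "description": description[:500] if description else None,
        "image": merged.get("image"),
    }
```

The values returned here are still raw attribute text; in the flow described above they would pass through sanitization before reaching a client.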

Preview Content Limits

Size Constraints

Content	Limit	Rationale
HTML fetch	512 KB	Sufficient for metadata extraction
Image fetch	2 MB	Reasonable thumbnail source
Generated thumbnail	100 KB	Bandwidth efficiency
Total preview payload	150 KB	Message size limits
Fetch timeout	5 seconds	User experience

Thumbnail Processing

Parameter	Value
Max source dimensions	4096 x 4096 px
Output dimensions	400 x 400 px (max)
Output format	WebP (JPEG fallback)
Quality	75%
Aspect ratio	Preserved


Thumbnail Generation:

1. FETCH IMAGE
   image_data = fetch_with_limit(image_url, max=2MB)

2. VALIDATE IMAGE
   IF NOT valid_image_format(image_data):
       SKIP thumbnail generation
   IF image_width > 4096 OR image_height > 4096:
       SKIP thumbnail generation

3. RESIZE
   thumbnail = resize_image(
       image_data,
       max_width=400,
       max_height=400,
       preserve_aspect=true
   )

4. ENCODE
   output = encode_webp(thumbnail, quality=75)
   IF output.size > 100KB:
       output = encode_jpeg(thumbnail, quality=60)

5. RETURN
   IF output.size ≤ 100KB:
       RETURN output
   ELSE:
       RETURN null  // Skip oversized thumbnails
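
The resize step depends on an aspect-preserving size calculation, which can be sketched without an imaging library. The function name `fit_within` is hypothetical, and the rule that small images are never upscaled is an assumption not stated in this specification.

```python
def fit_within(width, height, max_width=400, max_height=400, max_source=4096):
    """Return the output (width, height), or None if the source must be skipped."""
    if width <= 0 or height <= 0:
        return None
    if width > max_source or height > max_source:
        return None  # source exceeds 4096 x 4096: skip thumbnail generation
    # Assumption: never upscale, so the scale factor is capped at 1.0.
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For example, a 1200 x 630 source (a common Open Graph image size) maps to 400 x 210, while a 200 x 100 source is left at its original size.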

Content Type Restrictions

Content Type	Allowed	Notes
text/html	Yes	Primary target
application/xhtml+xml	Yes	XML-based HTML
image/*	Yes	For thumbnail only
application/json	Partial	API responses with metadata
text/plain	No	No useful preview data
application/pdf	No	Cannot extract safely
video/*	No	Video is not fetched; a poster image may be used as thumbnail

Caching Strategy

Relay-Side Caching

Exit relays maintain a shared cache to reduce repeated fetches and improve performance:

Parameter	Value	Rationale
Cache duration	1 hour	Balance freshness vs. efficiency
Cache key	SHA-256(normalized_url)	No URL stored in plaintext
Max cache size	1 GB per relay	Resource constraints
Eviction policy	LRU	Prioritize popular content

Cache Privacy Properties

Property	Implementation	Guarantee
No user correlation	Cache key is URL hash only	Cannot link users to URLs
No request logging	Requests not persisted	No audit trail
Shared cache	All users benefit equally	No per-user tracking
Cache-only serving	Stale cache served if target down	Reduces timing attacks

Cache Key Generation


Cache Key Derivation:

1. NORMALIZE URL
   normalized = lowercase_scheme_and_host(url)   // path and query are case-sensitive
   normalized = remove_tracking_params(normalized)
   normalized = sort_query_params(normalized)

2. GENERATE KEY
   cache_key = SHA-256(normalized)

3. LOOKUP
   cached_preview = cache.get(cache_key)
   IF cached_preview AND NOT expired(cached_preview):
       RETURN cached_preview

Tracking Parameters Removed:
  - utm_source, utm_medium, utm_campaign
  - fbclid, gclid, msclkid
  - ref, source, via
  - Any parameter matching tracking patterns
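
A possible shape for the key derivation, using only the standard library, is sketched below. The names `normalize_url`, `cache_key`, and `TRACKING_PARAMS` are hypothetical. Two details are assumptions: "tracking patterns" is approximated by a `utm_` prefix match, and the URL fragment is dropped because it is never sent to the server. Only the scheme and host are lowercased, since paths and query values are case-sensitive.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PARAMS = {"fbclid", "gclid", "msclkid", "ref", "source", "via"}


def normalize_url(url):
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    # Lowercasing the whole netloc assumes the URL carries no userinfo component.
    netloc = parts.netloc.lower()
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS and not key.lower().startswith("utm_")
    ]
    query = urlencode(sorted(kept))  # sorted so parameter order does not matter
    return urlunsplit((scheme, netloc, parts.path or "/", query, ""))


def cache_key(url):
    return hashlib.sha256(normalize_url(url).encode("utf-8")).hexdigest()
```

With this normalization, two links that differ only in tracking parameters, parameter order, or host capitalization map to the same cache entry.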

Client-Side Caching

Parameter	Value
Cache location	Encrypted local storage
Cache duration	24 hours
Cache key	SHA-256(url + conversation_id)
Encryption	AES-256-GCM with local key

Security Measures

Content Sanitization

All preview content is sanitized before delivery to clients:

Threat	Sanitization
JavaScript injection	All scripts removed
CSS attacks	Stylesheets stripped
Event handlers	on* attributes removed
External resources	Blocked except thumbnail
Meta refresh	Removed
Base tag manipulation	Removed
Form injection	All forms removed

Sanitization Rules


Content Sanitization Process:

1. REMOVE DANGEROUS ELEMENTS
   dangerous_tags = [
       'script', 'style', 'iframe', 'frame',
       'object', 'embed', 'applet', 'form',
       'input', 'button', 'select', 'textarea'
   ]
   FOR EACH tag IN dangerous_tags:
       document.remove_all(tag)

2. REMOVE EVENT HANDLERS
   FOR EACH element IN document.all_elements():
       FOR EACH attr IN element.attributes:
           IF attr.name.starts_with('on'):
               element.remove_attribute(attr.name)

3. SANITIZE URLS
   FOR EACH attr IN ['href', 'src', 'action']:
       FOR EACH element IN document.query_all('[' + attr + ']'):
           url = element.get(attr)
           IF NOT is_safe_url(url):
               element.remove_attribute(attr)

4. EXTRACT TEXT ONLY
   // Final preview contains only:
   // - Plain text title
   // - Plain text description
   // - Validated image URL
   // No HTML markup in final preview
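
The final "text only" step can be approximated with the standard `html.parser` module. This is a sketch under stated assumptions, not the relay's actual sanitizer: `TextOnly` and `to_plain_text` are hypothetical names, and void elements such as `embed` and `input` are omitted from the drop list because they carry no text content.

```python
from html.parser import HTMLParser

# Elements whose text content must never reach the preview.
DROPPED = {"script", "style", "iframe", "object", "applet",
           "form", "select", "textarea", "button"}


class TextOnly(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # Attributes (including on* handlers) are discarded entirely.
        if tag in DROPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in DROPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks and collapse runs of whitespace into single spaces.
        return " ".join("".join(self._chunks).split())


def to_plain_text(fragment):
    parser = TextOnly()
    parser.feed(fragment)
    parser.close()
    return parser.text()
```

Because only character data is kept, markup, attributes, and script bodies cannot survive into the title or description.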

Image Security

Check	Action	Purpose
MIME validation	Verify magic bytes match declared Content-Type	Prevent type confusion
Dimension limits	Reject images > 4096px	Prevent DoS
File size limits	Reject images > 2MB	Bandwidth protection
Format whitelist	Only JPEG, PNG, GIF, WebP	Reduce attack surface
Decompression limits	Max 50MB decompressed	Prevent decompression bombs
Metadata stripping	Remove EXIF, XMP	Privacy protection
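
The MIME validation, size limit, and format whitelist checks could look roughly like the sketch below. The function names are hypothetical, the byte signatures are the well-known file headers for each format, and reading "2MB" as 2 x 1024 x 1024 bytes is an assumption.

```python
SIGNATURES = {
    "image/jpeg": (b"\xff\xd8\xff",),
    "image/png": (b"\x89PNG\r\n\x1a\n",),
    "image/gif": (b"GIF87a", b"GIF89a"),
}
MAX_IMAGE_BYTES = 2 * 1024 * 1024  # assumption: "2MB" means 2 MiB


def sniff_image_type(data):
    """Return the MIME type implied by the leading bytes, or None if not whitelisted."""
    for mime, prefixes in SIGNATURES.items():
        if data.startswith(prefixes):
            return mime
    # WebP is a RIFF container: "RIFF", a 4-byte size, then "WEBP".
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    return None


def validate_image(data, declared_type):
    """Accept only whitelisted formats whose content matches the declared type."""
    if len(data) > MAX_IMAGE_BYTES:
        return False
    declared = declared_type.split(";")[0].strip().lower()
    return sniff_image_type(data) == declared
```

Dimension limits, decompression limits, and metadata stripping require an image decoder and are outside the scope of this sketch.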

Malicious URL Detection

Check	Method	Action
Known malware domains	Blocklist lookup	Reject with warning
Phishing detection	URL pattern analysis	Reject with warning
Homograph attacks	IDN normalization check	Display punycode
IP-based URLs	Detect raw IP targets	Warn user
Local network	Block RFC1918, localhost	Prevent SSRF
Unusual ports	Block non-80/443	Reduce attack surface
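
For the homograph check, displaying punycode amounts to converting the host to its ASCII form. The sketch below uses Python's built-in `idna` codec (which implements the older IDNA 2003 rules); the function name `punycode_host` is hypothetical, and a production check would likely use a dedicated IDNA library.

```python
def punycode_host(host):
    """Return (ascii_host, is_idn) for display in warnings.

    Raises UnicodeError if the host cannot be encoded as a valid IDN.
    """
    ascii_host = host.encode("idna").decode("ascii")
    # Any label with the "xn--" prefix is an internationalized label.
    is_idn = any(label.lower().startswith("xn--") for label in ascii_host.split("."))
    return ascii_host, is_idn
```

A host such as `bücher.example` is shown as `xn--bcher-kva.example`, which makes look-alike characters visible to the user.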

SSRF Prevention

Server-Side Request Forgery prevention on relay nodes:

Control	Implementation
DNS rebinding protection	Resolve DNS once, validate the IPs, then connect to the validated IP
Private IP blocking	Reject 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16
Localhost blocking	Reject 127.0.0.0/8, ::1
Cloud metadata blocking	Reject 169.254.169.254, metadata.*
Protocol restriction	HTTPS only (an HTTP URL is accepted only if it redirects to HTTPS)
Redirect limits	Maximum 3 redirects
Redirect validation	Each redirect target re-validated


SSRF Prevention Flow:

1. PARSE URL
   parsed = parse_url(target_url)
   IF parsed.scheme NOT IN ['http', 'https']:
       REJECT("Invalid protocol")

2. RESOLVE DNS
   ip_addresses = dns_resolve(parsed.host)

3. VALIDATE IPS
   FOR EACH ip IN ip_addresses:
       IF is_private_ip(ip):
           REJECT("Private IP not allowed")
       IF is_loopback(ip):
           REJECT("Loopback not allowed")
       IF is_cloud_metadata(ip):
           REJECT("Metadata endpoint not allowed")

4. FETCH WITH VALIDATED IP
   connection = connect_to_ip(ip_addresses[0], parsed.port)
   // Use original hostname for TLS SNI and Host header

5. FOLLOW REDIRECTS (limited)
   redirect_count = 0
   WHILE response.is_redirect:
       IF redirect_count >= 3:
           REJECT("Too many redirects")
       new_url = response.headers['Location']
       VALIDATE new_url (repeat steps 1-3)
       response = fetch(new_url) using the validated IP (step 4)
       redirect_count += 1
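
The IP validation in step 3 maps naturally onto Python's standard `ipaddress` module. The sketch below is illustrative only: `is_blocked_ip` and `validate_resolved` are hypothetical names, blocking the whole link-local range 169.254.0.0/16 (rather than just 169.254.169.254) is an assumption, and so is the unwrapping of IPv4-mapped IPv6 addresses.

```python
import ipaddress

BLOCKED_NETWORKS = [
    ipaddress.ip_network(net)
    for net in (
        "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",  # RFC 1918 private ranges
        "127.0.0.0/8", "::1/128",                          # loopback
        "169.254.0.0/16",                                  # link-local, covers 169.254.169.254
    )
]


def is_blocked_ip(address):
    ip = ipaddress.ip_address(address)
    # Unwrap IPv4-mapped IPv6 (::ffff:a.b.c.d) so it cannot bypass the IPv4 rules.
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return any(ip in net for net in BLOCKED_NETWORKS)


def validate_resolved(addresses):
    """Return the address to connect to, or raise if any resolved address is blocked."""
    if not addresses:
        raise ValueError("host did not resolve")
    for addr in addresses:
        if is_blocked_ip(addr):
            raise ValueError(f"blocked address: {addr}")
    return addresses[0]
```

Rejecting the request when any resolved address is blocked, rather than skipping to the next one, matches the flow above and avoids a host that mixes public and private records. The caller then connects to the returned address directly instead of resolving the hostname a second time.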

User Controls

Preview Settings

Setting	Options	Default
Enable previews	On / Off	On
Auto-generate	Always / Ask / Never	Always
Preview in compose	Show / Hide	Show
Download images	Auto / Ask / Never	Auto
Send previews	Include / Exclude	Include

Per-Conversation Settings

Setting	Scope	Options
Disable previews	Single conversation	On / Off
Preview image quality	Single conversation	High / Low / None
Auto-expand previews	Single conversation	Yes / No

Preview Before Sending


Preview Confirmation Flow:

1. USER PASTES URL
   → Preview generated in background

2. PREVIEW DISPLAYED IN COMPOSE
   ┌────────────────────────────────┐
   │ [Preview Image]                │
   │ Title of the Page              │
   │ Description excerpt...         │
   │ example.com                    │
   │                                │
   │ [Include Preview] [Remove]     │
   └────────────────────────────────┘

3. USER CHOOSES
   → "Include Preview": Attach preview to message
   → "Remove": Send message without preview

4. RECIPIENT OPTIONS
   → Preview shown inline
   → Click to open URL (with warning)

Security Warnings

Condition	Warning Shown
HTTP (not HTTPS) URL	"This link uses an insecure connection"
Recently registered domain	"This domain was recently created"
IDN/Punycode domain	"This link contains special characters"
Mismatch: preview vs URL	"The preview may not match the destination"
Known tracker redirect	"This link goes through a tracking service"

Limitations

Technical Limitations

Limitation	Cause	Impact
Relay IP blocking	Some sites block datacenter IPs	Preview unavailable
JavaScript-rendered content	Content generated client-side	Incomplete preview
Authentication-required pages	No login credentials sent	Generic preview only
Rate-limited APIs	Target server throttling	Preview may fail
Geo-restricted content	Relay location mismatch	Different or no preview
Dynamic content	Content changes after fetch	Preview may be stale

Content That Cannot Be Previewed

Content Type	Reason	User Experience
Login-required pages	No authentication	"Preview unavailable"
Paywalled articles	Content hidden	Title/domain only
Single-page apps	JavaScript required	May show loading state
PDF documents	Cannot extract safely	File type indicator only
Private/internal URLs	SSRF protection	Blocked
Tor .onion sites	Not supported	Link shown without preview

Staleness Considerations

Scenario	Preview Behavior
Content updated after preview	Shows cached version
URL redirects changed	Original preview persists
Page removed (404)	Cached preview may still show
A/B tested pages	Preview may differ from actual


Staleness Mitigation:

1. CACHE HEADERS
   Respect Cache-Control from origin
   max-age used when present

2. FRESHNESS INDICATORS
   Show "Preview from [time]" if > 1 hour old

3. REFRESH OPTION
   User can manually refresh preview
   Bypasses cache, fetches fresh content

4. RECIPIENT FETCH
   Recipients can optionally re-fetch
   Useful for time-sensitive content
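
How `max-age` might feed into the cache lifetime is sketched below. Several points are assumptions rather than stated behavior: the function names are hypothetical, `no-store` and `no-cache` are treated as "do not cache", and an origin `max-age` may shorten but not extend the 1 hour relay-side cache duration.

```python
DEFAULT_TTL_SECONDS = 3600  # relay-side cache duration (1 hour)


def cache_ttl(cache_control):
    """Derive a cache lifetime in seconds from a Cache-Control header value."""
    if not cache_control:
        return DEFAULT_TTL_SECONDS
    directives = {}
    for part in cache_control.split(","):
        name, _, value = part.strip().partition("=")
        directives[name.strip().lower()] = value.strip().strip('"')
    if "no-store" in directives or "no-cache" in directives:
        return 0  # assumption: these directives disable caching entirely
    max_age = directives.get("max-age", "")
    if max_age.isdecimal():
        # Assumption: the origin can shorten the lifetime but not extend it.
        return min(int(max_age), DEFAULT_TTL_SECONDS)
    return DEFAULT_TTL_SECONDS


def needs_freshness_label(age_seconds):
    """True when the 'Preview from [time]' indicator should be shown."""
    return age_seconds > 3600
```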

Error Handling

Error Types and Responses

Error	Cause	User Message
TIMEOUT	Target server slow	"Preview timed out"
BLOCKED	Relay IP blocked	"Preview unavailable for this site"
NOT_FOUND	URL returns 404	"Page not found"
SSL_ERROR	Certificate issues	"Secure connection failed"
CONTENT_TOO_LARGE	Exceeds limits	"Page too large to preview"
INVALID_CONTENT	No extractable metadata	"No preview available"
SSRF_BLOCKED	Security restriction	"URL not allowed"

Graceful Degradation


Fallback Hierarchy:

1. FULL PREVIEW
   Title + Description + Image
   ↓ (if image fetch fails)

2. TEXT PREVIEW
   Title + Description only
   ↓ (if metadata extraction fails)

3. MINIMAL PREVIEW
   Domain name + favicon
   ↓ (if everything fails)

4. LINK ONLY
   Plain URL displayed
   No preview attachment
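
The hierarchy above can be expressed as a small decision function. `PreviewLevel` and `choose_level` are hypothetical names, and treating "metadata extraction succeeded" as "a title or a description is available" is an assumption.

```python
from enum import Enum


class PreviewLevel(Enum):
    FULL = "full"            # title + description + image
    TEXT = "text"            # title + description only
    MINIMAL = "minimal"      # domain name + favicon
    LINK_ONLY = "link_only"  # plain URL, no preview attachment


def choose_level(title, description, image, domain):
    """Pick the richest preview level the available data supports."""
    has_metadata = bool(title or description)
    if has_metadata and image:
        return PreviewLevel.FULL
    if has_metadata:
        return PreviewLevel.TEXT  # image fetch failed or no image was found
    if domain:
        return PreviewLevel.MINIMAL  # metadata extraction failed
    return PreviewLevel.LINK_ONLY
```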

Wire Format

Preview Request Message

Field	Size	Description
Version	1 byte	Protocol version (0x01)
Request ID	16 bytes	Random identifier
URL Length	2 bytes	Length of URL string
URL	Variable	Target URL (UTF-8)
Options	1 byte	Bit flags for request options
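
The request layout can be packed with Python's `struct` module, as in the sketch below. The table does not specify a byte order, so big-endian (network order) is an assumption here, and `encode_request` and `decode_request` are hypothetical names.

```python
import secrets
import struct

PROTOCOL_VERSION = 0x01
# Assumption: big-endian. Fields: version (1 byte), request ID (16 bytes), URL length (2 bytes).
_HEADER = struct.Struct(">B16sH")


def encode_request(url, options=0, request_id=None):
    url_bytes = url.encode("utf-8")
    if len(url_bytes) > 0xFFFF:
        raise ValueError("URL does not fit in a 2-byte length field")
    if not 0 <= options <= 0xFF:
        raise ValueError("options must fit in one byte")
    request_id = request_id or secrets.token_bytes(16)
    if len(request_id) != 16:
        raise ValueError("request ID must be 16 bytes")
    header = _HEADER.pack(PROTOCOL_VERSION, request_id, len(url_bytes))
    return header + url_bytes + bytes([options])


def decode_request(message):
    """Return (version, request_id, url, options) from an encoded request."""
    version, request_id, url_len = _HEADER.unpack_from(message, 0)
    end = _HEADER.size + url_len
    if len(message) != end + 1:
        raise ValueError("length mismatch")
    return version, request_id, message[_HEADER.size:end].decode("utf-8"), message[end]
```

Note that the URL length counts UTF-8 bytes, not characters, so the 2-byte field caps the encoded URL at 65,535 bytes.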

Preview Response Message

Field	Size	Description
Version	1 byte	Protocol version (0x01)
Request ID	16 bytes	Matches request
Status	1 byte	Success/error code
Title Length	2 bytes	Length of title
Title	Variable	Page title (UTF-8)
Description Length	2 bytes	Length of description
Description	Variable	Page description (UTF-8)
Site Name Length	1 byte	Length of site name
Site Name	Variable	Site name (UTF-8)
Image Present	1 byte	0x00 or 0x01
Image Data Length	4 bytes	If present, image size
Image Data	Variable	WebP/JPEG thumbnail
Favicon Present	1 byte	0x00 or 0x01
Favicon Data	Variable	ICO/PNG favicon

Message Attachment Format

When a preview is included with a message:


Preview Attachment Structure:

┌─────────────────────────────────────────┐
│ Attachment Type (1 byte): LINK_PREVIEW  │
├─────────────────────────────────────────┤
│ Original URL (variable)                 │
├─────────────────────────────────────────┤
│ Title (variable)                        │
├─────────────────────────────────────────┤
│ Description (variable)                  │
├─────────────────────────────────────────┤
│ Site Name (variable)                    │
├─────────────────────────────────────────┤
│ Thumbnail Key (32 bytes)                │
├─────────────────────────────────────────┤
│ Thumbnail Ref (32 bytes, hash)          │
├─────────────────────────────────────────┤
│ Fetch Timestamp (8 bytes)               │
└─────────────────────────────────────────┘

// Thumbnail stored separately in mesh
// Same encryption as media thumbnails

Performance Metrics

Typical Performance

Metric	P50	P95	P99
Preview generation	450ms	1.2s	3s
Relay latency	150ms	400ms	800ms
Cache hit rate	35%	-	-
Success rate	92%	-	-

Optimization Techniques

Technique	Benefit
Pre-built circuits	Reduces initial latency by ~200ms
Parallel metadata + image fetch	Reduces total time
Aggressive caching	Cache hits return in ~50ms
Predictive pre-fetch	Start fetch on URL detection
Circuit reuse for same domain	Reduces overhead

Onion Routing - Relay network architecture
Media Encryption - Thumbnail encryption details
Privacy Features - Overall privacy design
Threat Model - Security analysis
Architecture - System components