# Link preview (`/api/link-preview`)
Telegram-style rich card for the **first URL** found in a post's text.
Front-end renders a single clickable card showing site name, title,
description, and a thumbnail; the data is fetched from a back-end proxy
that scrapes Open Graph / oEmbed / Twitter Card metadata once and caches
it.
> **Scope**: only the first link in the post text gets a preview, matching
> Telegram's behaviour. Any additional URLs in the same post still render
> as inline autolinks but do not get their own card.
## Why a back-end proxy
Browsers cannot fetch arbitrary cross-origin pages, so OG metadata must be
fetched server-side. A single proxy endpoint keeps secrets / outbound IPs on
the server and lets us cache so each URL is only scraped once for the whole
audience.
---
## Endpoint contract
```
GET /api/link-preview?url=<encoded-absolute-url>
```
| Query | Required | Notes |
| ----- | -------- | ------------------------------------------------------------------------------------------------------------------------------- |
| `url` | yes | Absolute `http://` or `https://` URL. Must be percent-encoded (e.g. with `encodeURIComponent`) so query strings inside the target URL survive the round trip. |
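
As a minimal sketch of the encoding requirement above, a client could build the request path like this. The helper name `buildLinkPreviewRequest` is hypothetical and not part of the existing codebase.

```ts
// Hypothetical helper: builds the proxy request path for a target URL.
function buildLinkPreviewRequest(targetUrl: string): string {
  // encodeURIComponent escapes "?", "&", "=" and "#", so the target's own
  // query string is not parsed as part of the proxy request.
  return `/api/link-preview?url=${encodeURIComponent(targetUrl)}`;
}

console.log(buildLinkPreviewRequest("https://example.com/search?q=safe&page=2"));
// prints /api/link-preview?url=https%3A%2F%2Fexample.com%2Fsearch%3Fq%3Dsafe%26page%3D2
```
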
### Success — `200 OK`
```json
{
"url": "https://app.safe.global/welcome",
"canonicalUrl": "https://app.safe.global/welcome",
"siteName": "app.safe.global",
"title": "Safe{Wallet}",
"description": "Safe{Wallet} is the most trusted smart account wallet on Ethereum with over $100B secured.",
"imageUrl": "https://app.safe.global/og.png",
"imageWidth": 1200,
"imageHeight": 630,
"favicon": "https://app.safe.global/favicon.ico",
"themeColor": "#12FF80",
"fetchedAt": "2026-05-29T10:00:00Z",
"cacheTtlSeconds": 86400
}
```
- All string fields except `url` may be empty. The front-end gracefully hides
rows that are missing (e.g. no `imageUrl` → image area is omitted).
- `url` echoes the original input so the client can match the response
against the URL it asked about, even if the request was racy.
- `canonicalUrl` is the URL the client should open when the card is tapped.
Defaults to `url` if no `<link rel=canonical>` was found.
### Already cached / freshly cached — same shape
The endpoint is idempotent and the response shape is identical whether
the metadata is hot, warm, or freshly scraped.
### Errors
| Status | When | Body shape |
| ------ | --------------------------------------------------- | --------------------------------------------------------------------------- |
| `400` | Missing / invalid / non-http(s) `url` | `{ "error": "invalid_url" }` |
| `422` | URL passed validation but resolves to a private/internal address (SSRF guard) | `{ "error": "blocked_target" }` |
| `404` | Target returned 404 or fetch produced no metadata | `{ "error": "not_found" }` |
| `408` | Target took longer than the timeout to respond | `{ "error": "timeout" }` |
| `502` | Target returned 5xx | `{ "error": "upstream_error" }` |
| `429` | Rate limit on this client / IP | `{ "error": "rate_limited", "retryAfter": 60 }` |
The front-end treats every non-`200` as “no preview available” and
silently renders nothing. No toasts. URLs already render as inline
clickable text via `autolink`, so the user is never blocked.
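
The "every non-`200` means no preview" rule can be sketched as a small wrapper that resolves to `null` instead of throwing. The names `loadPreviewOrNull`, `Fetcher`, and `PreviewBody` are hypothetical, and treating network failures the same way as error statuses is an assumption, not something the contract above states.

```ts
// Minimal response shape so the sketch does not depend on DOM typings.
type PreviewResponse = { status: number; json(): Promise<unknown> };
type Fetcher = (input: string) => Promise<PreviewResponse>;

// Subset of the success body, enough for illustration.
type PreviewBody = { url: string; title: string; description: string };

// Hypothetical helper: resolves to the preview body, or null when no card
// should be rendered.
async function loadPreviewOrNull(
  targetUrl: string,
  fetcher: Fetcher,
): Promise<PreviewBody | null> {
  try {
    const res = await fetcher(
      `/api/link-preview?url=${encodeURIComponent(targetUrl)}`,
    );
    // 400, 404, 408, 422, 429, 502 and anything else all collapse to "no preview".
    if (res.status !== 200) return null;
    return (await res.json()) as PreviewBody;
  } catch {
    // Assumption: a network failure is handled like any other missing preview.
    return null;
  }
}
```

In a browser, `(input) => fetch(input)` satisfies the `Fetcher` type; injecting it keeps the helper testable without a network.
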
---
## Caching strategy
Store one row per `canonicalUrl` (or normalized `url` if `canonicalUrl` is
absent). Suggested TTLs:
- Successful preview: **24 hours** (`cacheTtlSeconds: 86400`).
- 404 / timeout / blocked: **6 hours** negative cache. Without it, every
request for a failing target triggers a fresh scrape attempt.
- Send `Cache-Control: public, max-age=86400` so CDN / browser also cache.
Cache key normalization:
- Lowercase scheme + host.
- Strip the trailing slash on the path when it's the only character.
- Strip `utm_*`, `ref`, `referrer`, `fbclid`, `gclid` query params.
- Keep the rest of the query and fragment as-is.
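
The normalization rules above could look roughly like the following sketch. `normalizeCacheKey` and `isTrackingParam` are hypothetical names. The sketch matches parameter names case-sensitively and drops any credentials embedded in the URL, both of which are simplifying assumptions.

```ts
const EXACT_TRACKING_PARAMS = new Set(["ref", "referrer", "fbclid", "gclid"]);

function isTrackingParam(key: string): boolean {
  return key.startsWith("utm_") || EXACT_TRACKING_PARAMS.has(key);
}

// Hypothetical helper: derives a cache key from a raw absolute URL.
function normalizeCacheKey(rawUrl: string): string {
  // WHATWG URL parsing already lowercases the scheme and host.
  const u = new URL(rawUrl);
  // Drop the path only when it is a bare "/".
  const path = u.pathname === "/" ? "" : u.pathname;
  // Filter the raw query string pair by pair so surviving pairs keep
  // their original order and encoding.
  const keptPairs = u.search
    .slice(1)
    .split("&")
    .filter((pair) => {
      if (pair === "") return false;
      const eq = pair.indexOf("=");
      const key = eq === -1 ? pair : pair.slice(0, eq);
      return !isTrackingParam(key);
    });
  const query = keptPairs.length > 0 ? `?${keptPairs.join("&")}` : "";
  return `${u.protocol}//${u.host}${path}${query}${u.hash}`;
}

console.log(normalizeCacheKey("HTTPS://Example.COM/?utm_source=x&id=7&fbclid=abc#top"));
// prints https://example.com?id=7#top
```
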
---
## SSRF and abuse guard (must-have)
The proxy will fetch any URL the front-end asks about, which makes it a
server-side request forgery (SSRF) vector unless the target is validated.
Before issuing the outbound request:
1. Resolve the host to all of its A/AAAA records.
2. Reject if any resolved IP is in: loopback, link-local, private
(RFC 1918), `0.0.0.0/8`, multicast, broadcast, or the internal cluster
CIDR.
3. Reject schemes other than `http` and `https`.
4. Cap response body at **5 MB**; abort on overflow.
5. Cap request total time at **5 s**; abort on timeout.
6. Cap redirect chain at **3 hops**; re-validate target IP at each hop.
7. Do not forward client cookies, auth headers, or `Referer` to the target.
8. Use a clear `User-Agent` such as `ArkLibraryLinkBot/1.0 (+https://ark-library.com/bot)`.
9. Per-client (IP or session) rate limit, e.g. 60 req / min.
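
Step 2 of the list above can be sketched for IPv4 as a pure function over already-resolved addresses. This is an illustration under stated assumptions: the function names are hypothetical, IPv6 is not covered, and the internal cluster CIDR is deployment-specific, so it is left out and would need to be appended to the table.

```ts
// Each entry is [network address, prefix length].
const BLOCKED_IPV4_RANGES: [string, number][] = [
  ["0.0.0.0", 8], // "this network"
  ["10.0.0.0", 8], // RFC 1918 private
  ["127.0.0.0", 8], // loopback
  ["169.254.0.0", 16], // link-local
  ["172.16.0.0", 12], // RFC 1918 private
  ["192.168.0.0", 16], // RFC 1918 private
  ["224.0.0.0", 4], // multicast
  ["255.255.255.255", 32], // limited broadcast
];

// Converts dotted-quad notation to an unsigned integer, or null if malformed.
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet;
  }
  return value;
}

function isBlockedIpv4(ip: string): boolean {
  const addr = ipv4ToInt(ip);
  if (addr === null) return true; // fail closed on anything unparseable
  return BLOCKED_IPV4_RANGES.some(([network, prefix]) => {
    const base = ipv4ToInt(network);
    if (base === null) return false;
    // Two addresses share a prefix when they fall in the same aligned block.
    const blockSize = 2 ** (32 - prefix);
    return Math.floor(addr / blockSize) === Math.floor(base / blockSize);
  });
}

// Reject the host if ANY resolved address is blocked, or if nothing resolved.
function isHostAllowed(resolvedIps: string[]): boolean {
  return resolvedIps.length > 0 && resolvedIps.every((ip) => !isBlockedIpv4(ip));
}
```

The same check would run again on each redirect hop, since a public host can redirect to an internal one.
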
---
## Metadata extraction precedence
For each field, pick the first present:
| Field | Sources (in order) |
| ------------- | -------------------------------------------------------------------------------------------------------- |
| `title` | `og:title` → `twitter:title` → `<title>` → empty |
| `description` | `og:description` → `twitter:description` → `<meta name="description">` → empty |
| `imageUrl` | `og:image:secure_url` → `og:image` → `twitter:image` → first prominent `<img>` (skip if &lt;200×200) → empty |
| `siteName` | `og:site_name` → `application-name` → hostname (sans `www.`) |
| `canonicalUrl`| `<link rel="canonical">` → request URL |
| `favicon` | `<link rel="icon">` → `<link rel="shortcut icon">` → `/favicon.ico` |
| `themeColor` | `<meta name="theme-color">` |
Resolve any relative URLs (`og:image`, `favicon`, `canonical`) against the
final response URL (after redirects).
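
A minimal sketch of the "first present" rule and the relative-URL resolution follows. It assumes the HTML has already been parsed into a flat `meta` map keyed by tag name; that map, and the helper names `firstPresent`, `resolveAgainst`, and `pickTitleAndImage`, are hypothetical. The `<img>` fallback for `imageUrl` is omitted.

```ts
// Returns the first non-blank candidate, or "" when none is present.
function firstPresent(...candidates: (string | undefined)[]): string {
  for (const candidate of candidates) {
    const trimmed = candidate?.trim();
    if (trimmed) return trimmed;
  }
  return "";
}

// Resolves a possibly relative URL against the final (post-redirect) URL.
function resolveAgainst(finalUrl: string, maybeRelative: string): string {
  if (maybeRelative === "") return "";
  try {
    return new URL(maybeRelative, finalUrl).href;
  } catch {
    return ""; // unparseable value is treated as missing
  }
}

function pickTitleAndImage(
  meta: Record<string, string | undefined>,
  finalUrl: string,
): { title: string; imageUrl: string } {
  return {
    title: firstPresent(meta["og:title"], meta["twitter:title"], meta["title"]),
    imageUrl: resolveAgainst(
      finalUrl,
      firstPresent(meta["og:image:secure_url"], meta["og:image"], meta["twitter:image"]),
    ),
  };
}

console.log(
  pickTitleAndImage(
    { "twitter:title": "Hello", "og:image": "/og.png" },
    "https://example.com/blog/post",
  ),
);
// title is "Hello", imageUrl is "https://example.com/og.png"
```
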
---
## Provider quirks worth handling
Handling these quirks up front saves a lot of "why doesn't this site preview?" debugging later.
- **Twitter / X**: `x.com` and `twitter.com` strip OG when not signed in. Use
the public oEmbed endpoint
`https://publish.twitter.com/oembed?url=...&omit_script=1` for
Twitter/X URLs and map: `title = author_name`, `description = html` stripped
to text, `imageUrl = thumbnail_url` if available.
- **YouTube**: prefer `https://noembed.com/embed?url=...` or
`https://www.youtube.com/oembed?url=...&format=json` (no key).
- **Reddit / Mastodon**: standard OG works fine.
- **Sites behind Cloudflare bot challenge**: surface 502 to the client.
Don't retry hot — let the negative-cache TTL absorb it.
- **AMP pages**: prefer `og:url` when present so the cached entry points to
the canonical page, not the AMP variant.
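
The oEmbed routing for Twitter/X and YouTube could be sketched as below, using the endpoints named above. The function names are hypothetical, the host list (including `youtu.be`) is an assumption, and the tag stripping is deliberately crude: it does not decode HTML entities.

```ts
// Returns the oEmbed endpoint to call instead of scraping, or null for
// hosts that should go through the normal OG scrape.
function oEmbedEndpointFor(targetUrl: string): string | null {
  const host = new URL(targetUrl).hostname.replace(/^www\./, "");
  const encoded = encodeURIComponent(targetUrl);
  if (host === "x.com" || host === "twitter.com") {
    return `https://publish.twitter.com/oembed?url=${encoded}&omit_script=1`;
  }
  if (host === "youtube.com" || host === "youtu.be") {
    return `https://www.youtube.com/oembed?url=${encoded}&format=json`;
  }
  return null;
}

// Fields of the Twitter oEmbed response used by the mapping described above.
type TwitterOEmbed = { author_name: string; html: string; thumbnail_url?: string };

function mapTwitterOEmbed(data: TwitterOEmbed): {
  title: string;
  description: string;
  imageUrl: string;
} {
  return {
    title: data.author_name,
    // Crude tag stripping: replace tags with spaces, then collapse whitespace.
    description: data.html.replace(/<[^>]*>/g, " ").replace(/\s+/g, " ").trim(),
    imageUrl: data.thumbnail_url ?? "",
  };
}
```
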
---
## Front-end integration
### Type addition (`src/types/post.ts`)
```ts
export type LinkPreview = {
url: string;
canonicalUrl: string;
siteName: string;
title: string;
description: string;
imageUrl?: string;
imageWidth?: number;
imageHeight?: number;
favicon?: string;
themeColor?: string;
};
export type Post = {
// ...existing fields
/** Preview for the first URL in `text`. At most one per post. */
linkPreview?: LinkPreview;
};
```
### Which URL gets previewed
The back-end picks the **first** URL it finds in `text` using the same
regex as the front-end's `autolink` (`/(https?:\/\/[^\s<>"]+[^\s<>".,;:!?)\]}'])/i`).
Only that URL is fetched, stored, and returned as `post.linkPreview`. Any
later URLs in the same post are ignored for preview purposes (still
clickable inline via `autolink`).
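
For illustration, first-URL selection with the regex quoted above might look like this. The helper name `firstUrlIn` is hypothetical.

```ts
// Same pattern as the front-end's `autolink`, as quoted above.
const URL_PATTERN = /(https?:\/\/[^\s<>"]+[^\s<>".,;:!?)\]}'])/i;

// Returns the first URL in the text, or null when the text has no URL.
function firstUrlIn(text: string): string | null {
  const match = URL_PATTERN.exec(text);
  return match ? match[0] : null;
}

console.log(firstUrlIn("See https://app.safe.global/welcome, then https://example.com."));
// prints https://app.safe.global/welcome
```

The final character class keeps trailing punctuation such as `,` or `.` out of the match, so the comma after `welcome` is not part of the previewed URL.
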
### Where data comes from
Two viable paths — pick one when wiring the back-end.
1. **Inline on `Post`** (preferred): the post API enriches each post with
`linkPreview`. The first URL in `text` is resolved once at write time
(or lazily on first read with a background job). The client renders
without making any extra request.
2. **Client-side lookup**: the client extracts the first URL via the
existing `autolink` regex, calls `/api/link-preview?url=...` once per
post (with in-memory dedupe across posts that share the same URL), and
renders the card when the response comes back. Slower first paint but
keeps the posts endpoint cheap.
Recommend (1) for the public feed and keep `/api/link-preview` available for
(2) only on admin previews.
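
If path (2) is used, the in-memory dedupe could be sketched as a wrapper that shares one promise per URL. `createDedupedLoader` and `Loader` are hypothetical names, and the sketch never evicts entries, which is only reasonable for a short-lived page session.

```ts
type Loader<T> = (url: string) => Promise<T | null>;

// Wraps a loader so that posts sharing the same URL reuse a single request.
function createDedupedLoader<T>(load: Loader<T>): Loader<T> {
  const pendingByUrl = new Map<string, Promise<T | null>>();
  return (url) => {
    let pending = pendingByUrl.get(url);
    if (!pending) {
      // First request for this URL: start it and remember the promise so
      // later callers await the same result, even while it is in flight.
      pending = load(url);
      pendingByUrl.set(url, pending);
    }
    return pending;
  };
}
```

Storing the promise rather than the resolved value means two bubbles mounting at the same moment still produce only one request.
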
### Rendering
- New component: `src/components/messageStream/LinkPreviewCard.tsx`
- Renders a card with a left vertical 3px accent bar (`themeColor`
fallback `bg-ark-gold`).
- Layout:
```
┌──────────────────────────────────────────────────┐
│ ▍ siteName (12px, neutral-400) │
│ ▍ Title (15px, bold, neutral-100) │
│ ▍ Description (13px, neutral-300, 3-line clamp) │
│ ▍ ┌────────────────────────────────────────────┐ │
│ ▍ │ imageUrl (lazy, aspect-video, rounded) │ │
│ ▍ └────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────┘
```
- Whole card is `<a href={canonicalUrl} target="_blank" rel="noopener noreferrer">`.
- Reuse the bubble background (`bg-[#272632]` is fine) and lift it slightly
with a `bg-white/[0.03]` overlay so the card reads as inset within the bubble.
- Mount points (text-bearing bubbles only): `TextBubble`,
`ImageWithTextBubble`, `AlbumBubble`, `VideoBubble`, `FileDocBubble`.
Render below the existing `CollapsibleText` so cards stay visible even
when long text is collapsed.
### Picking the URL to preview
If `post.linkPreview` is present, render that single card. Otherwise the
bubble renders nothing extra (URLs still autolink inline). The front-end
never picks the URL itself — that decision lives on the back-end so the
client and server agree on which URL was chosen.
### Falling back gracefully
- No `imageUrl` → omit the image area, keep the text block.
- Title shorter than 8 characters → hide the description below (treat as
a low-confidence preview).
- Title empty and description empty → render nothing.
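
The fallback rules above can be expressed as one pure function that decides which parts of the card to show. The names `cardPartsFor` and `CardParts` are hypothetical. The rules do not say what happens when the title is empty but a description exists; this sketch keeps the description in that case, which is an assumption.

```ts
type CardParts = { showTitle: boolean; showDescription: boolean; showImage: boolean };

// Returns which parts of the card to render, or null to render nothing.
function cardPartsFor(preview: {
  title: string;
  description: string;
  imageUrl?: string;
}): CardParts | null {
  const title = preview.title.trim();
  const description = preview.description.trim();
  if (title === "" && description === "") return null;
  return {
    showTitle: title !== "",
    // A non-empty title under 8 characters marks a low-confidence preview,
    // so its description is hidden. Assumption: an empty title does not
    // hide the description.
    showDescription: description !== "" && (title === "" || title.length >= 8),
    showImage: Boolean(preview.imageUrl),
  };
}
```
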
---
## Open questions for the back-end
- Where in the stack will OG extraction live? Existing post pipeline, a
worker queue, or inline on read?
- Storage: a new `link_previews` table keyed by `canonicalUrl`, with a
`post_link_previews` join table preserving original URL order, or just a
JSON column on `posts`?
- How aggressive should re-scraping be? E.g. re-scrape every 30 days for
successful previews, every 24 hours for `themeColor` updates.
- Should admin be able to override / hide a preview per post? Telegram has
a "no preview" toggle and editors often want it.
- Do we want a manual "refresh preview" button in the admin UI?