# Link preview (`/api/link-preview`)

Telegram-style rich card for the **first URL** found in a post's text.
Front-end renders a single clickable card showing site name, title,
description, and a thumbnail; the data is fetched from a back-end proxy
that scrapes Open Graph / oEmbed / Twitter Card metadata once and caches
it.

> **Scope**: only the first link in the post text gets a preview, matching
> Telegram's behaviour. Any additional URLs in the same post still render
> as inline autolinks but do not get their own card.

## Why a back-end proxy

The browser's same-origin policy stops front-end code from reading
arbitrary cross-origin pages, so OG metadata must be fetched server-side.
A single proxy endpoint keeps secrets / outbound IPs on the server and lets
us cache so each URL is only scraped once for the whole audience.

---

## Endpoint contract

```
GET /api/link-preview?url=<encoded-absolute-url>
```

| Query | Required | Notes |
| ----- | -------- | ----- |
| `url` | yes      | Absolute `http://` or `https://` URL. Must be percent-encoded (for example with `encodeURIComponent`) so query strings inside the target URL survive the round trip. |
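
As a small illustration of the encoding requirement, the sketch below builds the request path on the client. The helper name `buildPreviewRequestPath` is hypothetical; only the endpoint path and the `url` parameter come from this document.

```ts
// Hypothetical helper: builds the request path for the link-preview endpoint.
function buildPreviewRequestPath(targetUrl: string): string {
  // encodeURIComponent escapes ":", "/", "?", "&" and "=", so the target
  // URL's own query string is not mistaken for parameters of the endpoint.
  return `/api/link-preview?url=${encodeURIComponent(targetUrl)}`;
}
```

For example, `buildPreviewRequestPath("https://example.com/a?b=1&c=2")` yields `/api/link-preview?url=https%3A%2F%2Fexample.com%2Fa%3Fb%3D1%26c%3D2`.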

### Success: `200 OK`

```json
{
  "url": "https://app.safe.global/welcome",
  "canonicalUrl": "https://app.safe.global/welcome",
  "siteName": "app.safe.global",
  "title": "Safe{Wallet}",
  "description": "Safe{Wallet} is the most trusted smart account wallet on Ethereum with over $100B secured.",
  "imageUrl": "https://app.safe.global/og.png",
  "imageWidth": 1200,
  "imageHeight": 630,
  "favicon": "https://app.safe.global/favicon.ico",
  "themeColor": "#12FF80",
  "fetchedAt": "2026-05-29T10:00:00Z",
  "cacheTtlSeconds": 86400
}
```

- All string fields except `url` may be empty. The front-end gracefully hides
  rows that are missing (e.g. no `imageUrl` → image area is omitted).
- `url` echoes the original input so the client can match the response
  against the URL it asked about, even if the request was racy.
- `canonicalUrl` is the URL the client should open when the card is tapped.
  Defaults to `url` if no `<link rel=canonical>` was found.

### Already cached / freshly cached: same shape

The endpoint is idempotent and the response shape is identical whether
the metadata is hot, warm, or freshly scraped.

### Errors

| Status | When                                                                          | Body shape                                     |
| ------ | ----------------------------------------------------------------------------- | ---------------------------------------------- |
| `400`  | Missing / invalid / non-http(s) `url`                                         | `{ "error": "invalid_url" }`                   |
| `422`  | URL passed validation but resolves to a private/internal address (SSRF guard) | `{ "error": "blocked_target" }`                |
| `404`  | Target returned 404 or fetch produced no metadata                             | `{ "error": "not_found" }`                     |
| `408`  | Target took longer than the timeout to respond                                | `{ "error": "timeout" }`                       |
| `502`  | Target returned 5xx                                                           | `{ "error": "upstream_error" }`                |
| `429`  | Rate limit on this client / IP                                                | `{ "error": "rate_limited", "retryAfter": 60 }` |

The front-end treats every non-`200` as “no preview available” and
silently renders nothing. No toasts. URLs already render as inline
clickable text via `autolink`, so the user is never blocked.
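
The client-side rule can be sketched as a single pure function: anything other than a `200` whose `url` matches the request yields `null`, and the caller renders nothing. The names `LinkPreviewBody` and `previewOrNull` are hypothetical, and the type is only a subset of the success body.

```ts
// Subset of the success body, for illustration only.
type LinkPreviewBody = {
  url: string;
  canonicalUrl: string;
  title: string;
  description: string;
};

// Hypothetical helper: returns the preview, or null when no card should render.
function previewOrNull(
  requestedUrl: string,
  status: number,
  body: unknown,
): LinkPreviewBody | null {
  // Every non-200 (400, 404, 408, 422, 429, 502) means "no preview available".
  if (status !== 200) return null;
  if (typeof body !== "object" || body === null) return null;
  const candidate = body as Partial<LinkPreviewBody>;
  // `url` echoes the input, so a mismatch means the response belongs to
  // a different (racy) request and must not be rendered for this one.
  if (candidate.url !== requestedUrl) return null;
  return {
    url: requestedUrl,
    // Fall back to `url` when the server sent no canonical URL.
    canonicalUrl: candidate.canonicalUrl || requestedUrl,
    title: candidate.title ?? "",
    description: candidate.description ?? "",
  };
}
```

Keeping this free of `fetch` makes the "silently render nothing" behaviour easy to unit-test without a network.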

---

## Caching strategy

Store one row per `canonicalUrl` (or normalized `url` if `canonicalUrl` is
absent). Suggested TTLs:

- Successful preview: **24 hours** (`cacheTtlSeconds: 86400`).
- 404 / timeout / blocked: **6 hours** negative cache. Without it, every
  request for a failing URL triggers a fresh scrape, so a transient failure
  on the target site ends up hammering the proxy.
- Send `Cache-Control: public, max-age=86400` so CDN / browser also cache.
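
A minimal sketch of the TTL rule, assuming each cached row records its outcome and fetch time. The `Outcome` type and both function names are hypothetical; the two durations are the suggested values from the list above.

```ts
type Outcome = "success" | "not_found" | "timeout" | "blocked";

const SUCCESS_TTL_SECONDS = 24 * 60 * 60; // 86400, matches cacheTtlSeconds
const NEGATIVE_TTL_SECONDS = 6 * 60 * 60; // 21600, negative cache

function ttlSecondsFor(outcome: Outcome): number {
  return outcome === "success" ? SUCCESS_TTL_SECONDS : NEGATIVE_TTL_SECONDS;
}

// True while a cached row may be served without scraping the target again.
function isFresh(outcome: Outcome, fetchedAt: Date, now: Date): boolean {
  const ageSeconds = (now.getTime() - fetchedAt.getTime()) / 1000;
  return ageSeconds < ttlSecondsFor(outcome);
}
```

With these values, a row fetched seven hours ago is still fresh if it was a success but stale if it recorded a timeout.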

Cache key normalization:

- Lowercase scheme + host.
- Strip the trailing slash on the path when it's the only character.
- Strip `utm_*`, `ref`, `referrer`, `fbclid`, `gclid` query params.
- Keep the rest of the query and fragment as-is.
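
The four normalization rules could look roughly like this. It is a sketch, not the project's implementation: `normalizeCacheKey` and `isTrackingParam` are hypothetical names, and matching parameter names case-sensitively is an assumption.

```ts
const TRACKING_PARAMS = new Set(["ref", "referrer", "fbclid", "gclid"]);

function isTrackingParam(rawKey: string): boolean {
  let key = rawKey;
  try {
    key = decodeURIComponent(rawKey);
  } catch {
    // Malformed escape sequence: compare the raw key instead.
  }
  return key.startsWith("utm_") || TRACKING_PARAMS.has(key);
}

function normalizeCacheKey(input: string): string {
  // The URL parser already lowercases the scheme and the host.
  const parsed = new URL(input);
  // Drop the path only when it is the single character "/".
  const path = parsed.pathname === "/" ? "" : parsed.pathname;
  // Filter the raw query string so the surviving pairs stay byte-for-byte.
  const kept = parsed.search
    .slice(1)
    .split("&")
    .filter((pair) => pair !== "" && !isTrackingParam(pair.split("=", 1)[0] ?? ""));
  const query = kept.length > 0 ? `?${kept.join("&")}` : "";
  return `${parsed.protocol}//${parsed.host}${path}${query}${parsed.hash}`;
}
```

Filtering the raw query string, instead of re-serializing through `URLSearchParams`, is a deliberate choice here so that the remaining parameters are kept as-is.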

---

## SSRF and abuse guard (must-have)

The proxy will fetch any URL the front-end asks about, which is dangerous.
Before issuing the outbound request:

1. Resolve the host to all of its A/AAAA records.
2. Reject if any resolved IP is in: loopback, link-local, private
   (RFC1918), `0.0.0.0/8`, multicast, broadcast, or the internal cluster
   CIDR.
3. Reject schemes other than `http` and `https`.
4. Cap response body at **5 MB**; abort on overflow.
5. Cap request total time at **5 s**; abort on timeout.
6. Cap redirect chain at **3 hops**; re-validate target IP at each hop.
7. Do not forward client cookies, auth headers, or `Referer` to the target.
8. Use a clear `User-Agent` such as `ArkLibraryLinkBot/1.0 (+https://ark-library.com/bot)`.
9. Per-client (IP or session) rate limit, e.g. 60 req / min.
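
Steps 1 and 2 can be sketched as a pure check over already-resolved addresses. This is an illustration under stated assumptions: it covers IPv4 only, rejects anything it cannot parse (including IPv6) so that it fails closed, and takes the internal cluster CIDR as a parameter because its value is not given here. All names are hypothetical.

```ts
// IPv4 ranges named in step 2. IPv6 handling is out of scope for this sketch.
const BLOCKED_IPV4_CIDRS = [
  "127.0.0.0/8", // loopback
  "169.254.0.0/16", // link-local
  "10.0.0.0/8", // private (RFC1918)
  "172.16.0.0/12", // private (RFC1918)
  "192.168.0.0/16", // private (RFC1918)
  "0.0.0.0/8",
  "224.0.0.0/4", // multicast
  "255.255.255.255/32", // broadcast
];

// Parses a dotted-quad IPv4 address into an unsigned integer, or null if malformed.
function ipv4ToInt(ip: string): number | null {
  const parts = ip.split(".");
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet;
  }
  return value;
}

function inCidr(ip: number, cidr: string): boolean {
  const [base, prefix] = cidr.split("/");
  const baseInt = ipv4ToInt(base ?? "");
  const bits = Number(prefix);
  if (baseInt === null || !Number.isInteger(bits) || bits < 0 || bits > 32) return false;
  // Plain arithmetic instead of bit shifts avoids 32-bit sign problems.
  const blockSize = 2 ** (32 - bits);
  return Math.floor(ip / blockSize) === Math.floor(baseInt / blockSize);
}

// Rejects when ANY resolved address is blocked or cannot be parsed.
function shouldRejectTarget(resolvedIps: string[], clusterCidrs: string[] = []): boolean {
  if (resolvedIps.length === 0) return true;
  const cidrs = [...BLOCKED_IPV4_CIDRS, ...clusterCidrs];
  return resolvedIps.some((ip) => {
    const value = ipv4ToInt(ip);
    return value === null || cidrs.some((cidr) => inCidr(value, cidr));
  });
}
```

The same function would be called again for each redirect hop (step 6), with the addresses resolved for the new host.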

---

## Metadata extraction precedence

For each field, pick the first present:

| Field          | Sources (in order)                                                                                       |
| -------------- | -------------------------------------------------------------------------------------------------------- |
| `title`        | `og:title` → `twitter:title` → `<title>` → empty                                                         |
| `description`  | `og:description` → `twitter:description` → `<meta name="description">` → empty                           |
| `imageUrl`     | `og:image:secure_url` → `og:image` → `twitter:image` → first prominent `<img>` (skip if <200×200) → empty |
| `siteName`     | `og:site_name` → `application-name` → hostname (sans `www.`)                                             |
| `canonicalUrl` | `<link rel="canonical">` → request URL                                                                   |
| `favicon`      | `<link rel="icon">` → `<link rel="shortcut icon">` → `/favicon.ico`                                      |
| `themeColor`   | `<meta name="theme-color">`                                                                              |

Resolve any relative URLs (`og:image`, `favicon`, `canonical`) against the
final response URL (after redirects).
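
The precedence table and the relative-URL rule can be sketched as follows, assuming the HTML has already been parsed into a flat map of meta names to values. `MetaTags`, `firstPresent`, `resolveUrl` and `extractFields` are hypothetical names, and the `<img>` fallback for `imageUrl` is left out.

```ts
type MetaTags = Record<string, string | undefined>;

// Returns the first non-blank candidate, or "" when none is present.
function firstPresent(...candidates: Array<string | undefined>): string {
  for (const candidate of candidates) {
    const trimmed = candidate?.trim();
    if (trimmed) return trimmed;
  }
  return "";
}

// Resolves a possibly relative URL against the final (post-redirect) URL.
function resolveUrl(value: string, finalUrl: string): string {
  if (value === "") return "";
  try {
    return new URL(value, finalUrl).href;
  } catch {
    return "";
  }
}

function extractFields(meta: MetaTags, titleTag: string | undefined, finalUrl: string) {
  const hostname = new URL(finalUrl).hostname.replace(/^www\./, "");
  return {
    title: firstPresent(meta["og:title"], meta["twitter:title"], titleTag),
    description: firstPresent(
      meta["og:description"],
      meta["twitter:description"],
      meta["description"],
    ),
    siteName: firstPresent(meta["og:site_name"], meta["application-name"], hostname),
    imageUrl: resolveUrl(
      firstPresent(meta["og:image:secure_url"], meta["og:image"], meta["twitter:image"]),
      finalUrl,
    ),
  };
}
```

A page at `https://www.example.com/a/b` with only `og:image` set to `/og.png` would get `imageUrl` `https://www.example.com/og.png` and `siteName` `example.com`.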

---

## Provider quirks worth handling

These quirks save a lot of "why doesn't this site preview?" debugging later.

- **Twitter / X**: `x.com` and `twitter.com` strip OG tags for signed-out
  requests. Use the public oEmbed endpoint
  `https://publish.twitter.com/oembed?url=...&omit_script=1` for
  Twitter/X URLs and map: `title = author_name`, `description = html` stripped
  to text, `imageUrl = thumbnail_url` if available.
- **YouTube**: prefer `https://noembed.com/embed?url=...` or
  `https://www.youtube.com/oembed?url=...&format=json` (no key).
- **Reddit / Mastodon**: standard OG works fine.
- **Sites behind Cloudflare bot challenge**: surface 502 to the client.
  Don't retry immediately; let the negative-cache TTL absorb it.
- **AMP pages**: prefer `og:url` when present so the cached entry points to
  the canonical page, not the AMP variant.
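
For the Twitter / X case, the field mapping might be sketched like this. The `TwitterOEmbed` type lists only the three fields named above, the function names are hypothetical, and the tag stripping is deliberately naive (it does not decode HTML entities).

```ts
// Only the oEmbed fields this document maps; everything else is ignored.
type TwitterOEmbed = {
  author_name?: string;
  html?: string;
  thumbnail_url?: string;
};

function twitterOEmbedUrl(postUrl: string): string {
  return `https://publish.twitter.com/oembed?url=${encodeURIComponent(postUrl)}&omit_script=1`;
}

// Naive tag stripper: replaces tags with spaces and collapses whitespace.
function htmlToText(html: string): string {
  return html.replace(/<[^>]*>/g, " ").replace(/\s+/g, " ").trim();
}

function mapTwitterOEmbed(oembed: TwitterOEmbed) {
  return {
    title: oembed.author_name ?? "",
    description: htmlToText(oembed.html ?? ""),
    imageUrl: oembed.thumbnail_url ?? "", // empty when no thumbnail is available
  };
}
```

A production version would likely use a real HTML parser for `description`; the regular expression is only enough to show the mapping.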

---

## Front-end integration

### Type addition (`src/types/post.ts`)

```ts
export type LinkPreview = {
  url: string;
  canonicalUrl: string;
  siteName: string;
  title: string;
  description: string;
  imageUrl?: string;
  imageWidth?: number;
  imageHeight?: number;
  favicon?: string;
  themeColor?: string;
};

export type Post = {
  // ...existing fields
  /** Preview for the first URL in `text`. At most one per post. */
  linkPreview?: LinkPreview;
};
```

### Which URL gets previewed

The back-end picks the **first** URL it finds in `text` using the same
regex as the front-end's `autolink` (`/(https?:\/\/[^\s<>"]+[^\s<>".,;:!?)\]}'])/i`).
Only that URL is fetched, stored, and returned as `post.linkPreview`. Any
later URLs in the same post are ignored for preview purposes (still
clickable inline via `autolink`).
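
To make the selection rule concrete, here is a short sketch that applies the regex quoted above. The constant and function names are hypothetical; the pattern itself is copied unchanged.

```ts
// Regex copied from the paragraph above. There is no `g` flag, so exec()
// always reports the first match in the text.
const AUTOLINK_RE = /(https?:\/\/[^\s<>"]+[^\s<>".,;:!?)\]}'])/i;

// Returns the first URL in the post text, or null when there is none.
function firstUrl(text: string): string | null {
  const match = AUTOLINK_RE.exec(text);
  return match ? match[0] : null;
}
```

Because the final character class excludes trailing punctuation, `firstUrl("See https://example.com/a?x=1, then https://other.test.")` returns `https://example.com/a?x=1` without the comma.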

### Where data comes from

Two viable paths; pick one when wiring the back-end.

1. **Inline on `Post`** (preferred): the post API enriches each post with
   `linkPreview`. The first URL in `text` is resolved once at write time
   (or lazily on first read with a background job). The client renders
   without making any extra request.
2. **Client-side lookup**: the client extracts the first URL via the
   existing `autolink` regex, calls `/api/link-preview?url=...` once per
   post (with in-memory dedupe across posts that share the same URL), and
   renders the card when the response comes back. Slower first paint but
   keeps the posts endpoint cheap.

Recommendation: use (1) for the public feed and keep `/api/link-preview`
available for (2) on admin previews only.
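
The in-memory dedupe mentioned in option (2) could be as small as a map of in-flight promises. This is a sketch with hypothetical names (`Preview`, `Loader`, `createPreviewLookup`); the loader that actually calls `/api/link-preview` is passed in rather than shown.

```ts
type Preview = { url: string; title: string };
type Loader = (url: string) => Promise<Preview | null>;

// Wraps a loader so that posts sharing a URL trigger only one request.
function createPreviewLookup(load: Loader): Loader {
  const requests = new Map<string, Promise<Preview | null>>();
  return (url) => {
    const existing = requests.get(url);
    if (existing !== undefined) return existing;
    const pending = load(url);
    // Store the promise itself so concurrent callers share one request.
    requests.set(url, pending);
    return pending;
  };
}
```

Note that this version also remembers failed lookups for the lifetime of the page; whether to evict them is a design choice left open here.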

### Rendering

- New component: `src/components/messageStream/LinkPreviewCard.tsx`
- Renders a card with a left vertical 3px accent bar (`themeColor` →
  fallback `bg-ark-gold`).
- Layout:

  ```
  ┌──────────────────────────────────────────────────┐
  │ ▍ siteName (12px, neutral-400)                   │
  │ ▍ Title (15px, bold, neutral-100)                │
  │ ▍ Description (13px, neutral-300, 3-line clamp)  │
  │ ▍ ┌────────────────────────────────────────────┐ │
  │ ▍ │ imageUrl (lazy, aspect-video, rounded)     │ │
  │ ▍ └────────────────────────────────────────────┘ │
  └──────────────────────────────────────────────────┘
  ```

- Whole card is `<a href={canonicalUrl} target="_blank" rel="noopener noreferrer">`.
- Reuse the bubble background (`bg-[#272632]` is OK) and lift it slightly
  with a `bg-white/[0.03]` overlay so the card reads as inset within the
  bubble.
- Mount points (text-bearing bubbles only): `TextBubble`,
  `ImageWithTextBubble`, `AlbumBubble`, `VideoBubble`, `FileDocBubble`.
  Render below the existing `CollapsibleText` so cards stay visible even
  when long text is collapsed.

### Picking the URL to preview

If `post.linkPreview` is present, render that single card. Otherwise the
bubble renders nothing extra (URLs still autolink inline). On this inline
path the front-end never picks the URL itself; that decision lives on the
back-end so the client and server agree on which URL was chosen.

### Falling back gracefully

- No `imageUrl` → omit the image area, keep the text block.
- Title shorter than 8 characters → hide the description below (treat as
  a low-confidence preview).
- Title empty and description empty → render nothing.
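
These three rules can be expressed as one small decision function that the card component consults before rendering. The names `CardInput`, `CardPlan` and `planCard` are hypothetical, and the rules are applied literally, so an empty title with a non-empty description hides the description too; whether that is intended is not stated here.

```ts
type CardInput = { title: string; description: string; imageUrl?: string };
type CardPlan = { showDescription: boolean; showImage: boolean };

// Returns null when no card should be rendered at all.
function planCard(preview: CardInput): CardPlan | null {
  const title = preview.title.trim();
  const description = preview.description.trim();
  // Title empty and description empty: render nothing.
  if (title === "" && description === "") return null;
  return {
    // A title shorter than 8 characters marks a low-confidence preview.
    showDescription: description !== "" && title.length >= 8,
    // No imageUrl: omit the image area, keep the text block.
    showImage: Boolean(preview.imageUrl),
  };
}
```

Keeping the decision separate from the JSX makes the fallback rules testable without rendering a component.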

---

## Open questions for the back-end

- Where in the stack will OG extraction live? Existing post pipeline, a
  worker queue, or inline on read?
- Storage: a new `link_previews` table keyed by `canonicalUrl`, with a
  `post_link_previews` join table preserving original URL order, or just a
  JSON column on `posts`?
- How aggressive should re-scrape be? E.g. re-scrape every 30 days for
  successful previews, every 24 hours for `themeColor` updates.
- Should admin be able to override / hide a preview per post? Telegram has
  a "no preview" toggle and editors often want it.
- Do we want a manual "refresh preview" button in the admin UI?