Skip to content

How it works

This page explains how a Semantic Text-Code is built and why semantically similar texts — including translations — produce codes with low Hamming distance.

The problem

The standard ISCC Content-Code for text matches on lexical similarity: it compares the words that appear in a document. That works well for near-duplicates, but it cannot tell that a German translation carries the same meaning as its English original — the two share almost no words.

The Semantic Text-Code targets meaning instead of wording. It is built so that texts about the same thing land close together in code space, whatever language they are written in.

The pipeline

graph LR
    A[Text] --> B[Split into chunks]
    B --> C[Embed each chunk]
    C --> D[Mean-pool to one vector]
    D --> E[Binarize]
    E --> F[ISCC header + base32]

Each stage has a specific job.

Split

A document is first split into overlapping chunks at sensible boundaries (up to 127 tokens each, with up to 48 tokens of overlap). Overlap keeps a sentence that straddles a boundary from being lost to both chunks.

Text without regular paragraph breaks — such as text extracted from print-layout PDFs — takes a guarded code path that produces the same chunks without the super-linear cost the naive splitter would incur on those inputs.

Embed

Each chunk is run through a multilingual sentence-transformer model (paraphrase-multilingual-MiniLM-L12-v2) exported to ONNX. The model turns a chunk into a 384-dimensional embedding: a vector of numbers that encodes the chunk's meaning. The model was trained so that texts with similar meaning — across more than 60 languages — map to nearby vectors.

The per-token outputs are combined into one chunk vector by attention-mask pooling, then normalized.

Aggregate

The chunk vectors are averaged into a single document vector and normalized again. This is the one vector that represents the whole document's meaning.

Binarize

Each component of the document vector becomes one bit: positive values become 1, the rest become 0. The full vector yields a 384-bit digest.

The digest is truncated to the requested length (bits, up to 256), prefixed with a 2-byte ISCC header that marks it as a Semantic Text-Code, and encoded as base32 with the ISCC: prefix.

Why binarized vectors still match

Two texts with similar meaning produce similar embeddings — vectors pointing in nearly the same direction. Nearly-aligned vectors agree on the sign of most of their components, so their bit patterns agree on most bits, giving a low Hamming distance.

Unrelated texts produce roughly perpendicular vectors, which agree on only about half their signs — so their codes differ in about half their bits. That gap between "few bits differ" and "about half the bits differ" is what makes a similarity threshold meaningful.

Property Behavior
Similar meaning Few differing bits (low Hamming distance)
Unrelated content About 50% of bits differ
Translation Treated like similar meaning — a near-match
Longer bits Wider spread between near-matches and unrelated content

iscc_distance() measures similarity by stripping the ISCC: prefix and the 2-byte header, then counting the differing bits between the two code bodies. cosine_similarity() rescales that distance to a -100+100 score. See comparing texts.

Cross-lingual matching

Cross-lingual matching is a direct consequence of the embedding model. Because the model was trained to place a sentence and its translation near each other in vector space, the document vectors stay close, the sign patterns stay mostly equal, and the codes stay a few bits apart.

No machine translation happens at code time. The shared meaning is captured once, by the model, during embedding.

Granular features

Alongside the document code, iscc-sct can emit a simprint for each chunk. Because chunks carry their offsets, you can locate matching passages within documents and align similar passages across two documents — even when they sit at different positions. See granular features.

Relation to the ISCC standard

The Semantic Text-Code is a planned, experimental ISCC-UNIT (MainType SEMANTIC, SubType TEXT). It is a proof of concept and not part of ISO 24138:2024. The codes it produces may change between releases below v1.0.0.

The codes are still structurally compatible with the ISCC framework: a standard 2-byte ISCC header identifies the unit type and length, so a Semantic Text-Code slots in alongside the other ISCC-UNITs.

Further reading