# iscc-sct

[![Tests](https://github.com/iscc/iscc-sct/actions/workflows/tests.yml/badge.svg)](https://github.com/iscc/iscc-sct/actions/workflows/tests.yml)
[![Version](https://img.shields.io/pypi/v/iscc-sct.svg)](https://pypi.python.org/pypi/iscc-sct/)
[![Downloads](https://pepy.tech/badge/iscc-sct)](https://pepy.tech/project/iscc-sct)
[![License](https://img.shields.io/badge/license-Apache%202.0-green.svg)](https://github.com/iscc/iscc-sct/blob/main/LICENSE)
[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/iscc/iscc-sct)

**A cross-lingual, similarity-preserving binary code for text. Semantically similar texts, including translations, produce codes with low Hamming distance.**

!!! warning "Proof of concept"

    All releases below v1.0.0 may break backward compatibility and produce incompatible Semantic Text-Codes. The algorithms in `iscc-sct` are experimental and **not** part of the official [ISO 24138:2024](https://www.iso.org/standard/77899.html) standard.

## Introduction

The [ISCC](https://iscc.codes) framework already includes a Text-Code based on lexical similarity for near-duplicate matching. The Semantic Text-Code (SCT) is a planned additional ISCC-UNIT that captures a broader, more abstract similarity. It is engineered to be robust against rephrasing and, most notably, translations that lexical matching cannot detect.

`iscc-sct` turns any text into a compact binary code built from a binarized, multilingual document embedding. The same content expressed in different languages maps to **(near-)identical codes**, opening up cross-lingual content identification and similarity detection.

| Feature       | ISCC Content-Code Text   | ISCC Semantic-Code Text           |
| ------------- | ------------------------ | --------------------------------- |
| Focus         | Lexical similarity       | Semantic similarity               |
| Cross-lingual | No                       | Yes                               |
| Use case      | Near-duplicate detection | Semantic similarity, translations |

**Key features:**

- **Semantic similarity** - deep multilingual embeddings capture the meaning of the text
- **Translation matching** - near-identical codes for the same content across languages
- **Bit-length flexibility** - codes from 32 up to 256 bits for adjustable granularity
- **ISCC compatible** - codes integrate directly with existing ISCC-based systems
- **60+ languages** - one model covers a broad range of scripts and language pairs

## Quick start

=== "pip"

    ```bash
    pip install "iscc-sct[cpu]"
    ```

=== "uv"

    ```bash
    uv add "iscc-sct[cpu]"
    ```

=== "uvx (no install)"

    ```bash
    uvx "iscc-sct[cpu]" "path/to/textfile.txt"
    ```

Generate a Semantic Text-Code from Python:

```python
import iscc_sct as sct

text = "This is some sample text. It can be a longer document or even an entire book."
print(sct.create(text, bits=256).iscc)
# ISCC:CADV3GG6JH3XEVRNSVYGCLJ7AAV3BOT5J7EHEZKPFXEGRJ2CTWACGZI
```

Or from the command line:

```bash
iscc-sct "path/to/textfile.txt"
```

!!! note "Choose exactly one ONNX runtime"

    A plain `pip install iscc-sct` installs no ONNX runtime; the first code generation then fails with install instructions. Install the `cpu` extra (works everywhere) or the `gpu` extra for NVIDIA CUDA acceleration (requires CUDA 12.x and cuDNN 9.x), or run `iscc-sct doctor` to detect and install the right one. Never install both: `onnxruntime` and `onnxruntime-gpu` unpack into the same directory and silently overwrite each other.

## How it works

```text
Text -> split -> embed -> mean-pool -> binarize -> ISCC header + base32
```

1. **Split** the text into overlapping chunks at syntactically sensible boundaries.
1. **Embed** each chunk with a multilingual sentence-transformer running on ONNX.
1. **Aggregate** the chunk embeddings into one mean-pooled, normalized document vector.
1. **Binarize** the vector (non-negative components become 1-bits), truncate to the requested bit-length, prefix the ISCC header, and base32-encode it.

This process is robust to variation and translation, enabling cross-lingual matching from a short Simprint.

## Documentation

- **[Getting started](tutorials/getting-started.md)** - Tutorial

    Install, generate your first code, and match a translation across languages.

- **[How-to guides](howto/compare-texts.md)** - Task recipes

    Compare texts, work with granular features, configure options, and use the CLI.

- **[How it works](explanation/how-it-works.md)** - Understand the design

    The pipeline, cross-lingual matching, and why binarized vectors still match.

- **[API reference](reference/api.md)** - Library details

    Generated reference for `create()`, the data model, options, and similarity functions.

- **[For Coding Agents](reference/for-coding-agents.md)** - Build on iscc-sct

    A dense architecture map, constraints, and task recipes for AI coding agents.

- **[Live Demo :lucide-external-link:](https://huggingface.co/spaces/iscc/iscc-sct)** - Try it in the browser

    Generate and compare Semantic Text-Codes interactively on Hugging Face Spaces.

- **[Source Code :lucide-external-link:](https://github.com/iscc/iscc-sct)** - Read the implementation

    The full algorithm, CLI, and Gradio demo on GitHub.

- **[Full text for LLMs :lucide-external-link:](llms-full.txt)** - Machine-readable docs

    Every page concatenated into a single file for language models.

??? note "Supported languages (60+)"

    Arabic, Armenian, Bengali, Bosnian, Bulgarian, Burmese, Catalan, Chinese (China), Chinese (Taiwan), Croatian, Czech, Danish, Dutch, English, Estonian, Farsi, Finnish, French, French (Canada), Galician, German, Greek, Gujarati, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Korean, Kurdish, Latvian, Lithuanian, Macedonian, Malay, Malayalam, Marathi, Mongolian, Norwegian Bokmål, Persian, Polish, Portuguese, Portuguese (Brazil), Romanian, Russian, Serbian, Sinhala, Slovak, Slovenian, Spanish, Swedish, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Vietnamese.

[Source code on GitHub :lucide-external-link:](https://github.com/iscc/iscc-sct){ .md-button }

---

# Getting started

Install `iscc-sct`, generate your first Semantic Text-Code, and watch two languages produce near-identical codes.

## Prerequisites

- Python 3.10 or later
- [`uv`](https://docs.astral.sh/uv/) (recommended) or `pip`

## Install

`iscc-sct` needs an ONNX runtime, selected through an install extra. The `cpu` extra works everywhere:

=== "uv"

    ```bash
    uv add "iscc-sct[cpu]"
    ```

=== "pip"

    ```bash
    pip install "iscc-sct[cpu]"
    ```

!!! warning "Choose exactly one runtime"

    Install the `cpu` extra, or the `gpu` extra for NVIDIA CUDA acceleration, but never both. The `onnxruntime` and `onnxruntime-gpu` packages unpack into the same directory and overwrite each other. A plain `pip install iscc-sct` installs no runtime, and the first code generation then fails with install instructions.

    If you are unsure which extra fits your machine, run `iscc-sct doctor` (see the [command-line guide](../howto/command-line.md)).

### Verify the installation

```python
import iscc_sct

print(iscc_sct.__version__)
```

## Generate your first code

Pass any text to `create()`. The first call downloads the embedding model (about 450 MB) to your user data directory; later calls reuse it.

```python
import iscc_sct as sct

text = "This is some sample text. It can be a longer document or even an entire book."
print(sct.create(text, bits=256).iscc)
# ISCC:CADV3GG6JH3XEVRNSVYGCLJ7AAV3BOT5J7EHEZKPFXEGRJ2CTWACGZI
```

The `bits` argument sets the code length. Longer codes carry more detail; 64 bits is the library default, and 256 bits is the most precise. See [comparing texts](../howto/compare-texts.md) for how length affects matching.

## Match a translation

The point of a Semantic Text-Code is that meaning survives translation. Generate codes for the same passage in English and German, then measure their distance:

```python
import iscc_sct as sct

english = (
    "An ISCC applies to a specific digital asset and is a data-descriptor deterministically "
    "constructed from multiple hash digests using the algorithms and rules in this document. "
    "This document does not provide information on registration of ISCCs."
)
german = (
    "Ein ISCC bezieht sich auf ein bestimmtes digitales Gut und ist ein Daten-Deskriptor, der "
    "deterministisch aus mehreren Hash-Digests unter Verwendung der Algorithmen und Regeln in "
    "diesem Dokument erstellt wird. Dieses Dokument enthält keine Informationen über die "
    "Registrierung von ISCCs."
)

a = sct.create(english)
b = sct.create(german)
print(sct.iscc_distance(a.iscc, b.iscc))
# 3
```

A distance of `3` bits out of 64 means the translation is a near-match. For contrast, an unrelated sentence lands far away:

```python
import iscc_sct as sct

unrelated = (
    "The recipe calls for two cups of flour, a pinch of salt, and three ripe bananas mashed "
    "until smooth before folding the mixture into the buttered baking tin."
)
print(sct.iscc_distance(a.iscc, sct.create(unrelated).iscc))
# 33
```

Low distance means similar meaning; high distance means unrelated content. Translation matching is the behavior that lexical (word-based) codes cannot provide.

## Look inside the text

Set `granular=True` to get per-chunk features alongside the document code.
Each chunk carries its own offset, size, similarity-preserving fingerprint (simprint), and text:

```python
import iscc_sct as sct

text = "This is some sample text. It can be a longer document or even an entire book."
meta = sct.create(text, bits=256, granular=True)
feature = meta.features[0].simprints[0]
print(feature.offset, feature.size, feature.simprint)
# 0 77 XZjeSfdyVi0
```

Granular features let you match individual passages across documents, even when the surrounding text differs. The [granular features guide](../howto/granular-features.md) covers chunk-level matching and content reconstruction.

## Next steps

- **[Compare texts](../howto/compare-texts.md)** - Measure semantic and cross-lingual similarity.
- **[Granular features](../howto/granular-features.md)** - Work with per-chunk simprints and offsets.
- **[Configuration](../howto/configuration.md)** - Tune bit-length, chunking, and output options.
- **[How it works](../explanation/how-it-works.md)** - Understand why the codes match across languages.

---

# Compare texts

This guide shows how to measure how similar two texts are in meaning, including across languages, by comparing their Semantic Text-Codes.

Similarity is computed from the codes, not the original text. You can store or share the codes and still compare content without keeping the source documents.

## Compare two whole documents

Generate a code for each text, then measure the Hamming distance between them with `iscc_distance()`. It returns the number of differing bits; lower means more similar:

```python
import iscc_sct as sct

english = (
    "An ISCC applies to a specific digital asset and is a data-descriptor deterministically "
    "constructed from multiple hash digests using the algorithms and rules in this document. "
    "This document does not provide information on registration of ISCCs."
)
german = (
    "Ein ISCC bezieht sich auf ein bestimmtes digitales Gut und ist ein Daten-Deskriptor, der "
    "deterministisch aus mehreren Hash-Digests unter Verwendung der Algorithmen und Regeln in "
    "diesem Dokument erstellt wird. Dieses Dokument enthält keine Informationen über die "
    "Registrierung von ISCCs."
)

a = sct.create(english)
b = sct.create(german)
print(sct.iscc_distance(a.iscc, b.iscc))
# 3
```

`iscc_distance()` strips the `ISCC:` prefix and the 2-byte header, then compares the code bodies. The two codes must be the same bit-length, or it raises `ValueError`.

## Read the distance

Distance is measured in bits, so the scale depends on the code length. Compare a translation against an unrelated text at the same length to see the spread:

```python
import iscc_sct as sct

unrelated = (
    "The recipe calls for two cups of flour, a pinch of salt, and three ripe bananas mashed "
    "until smooth before folding the mixture into the buttered baking tin."
)
u = sct.create(unrelated)

print(sct.iscc_distance(a.iscc, b.iscc))  # translation
# 3
print(sct.iscc_distance(a.iscc, u.iscc))  # unrelated
# 33
```

A translation sits a few bits apart; unrelated content sits near half the bit-length away (random codes differ in about 50% of their bits).

## Choose a bit-length

Longer codes spread similar and dissimilar texts further apart, which makes a threshold easier to pick. The same three texts at 256 bits:

```python
import iscc_sct as sct

a256 = sct.create(english, bits=256)
b256 = sct.create(german, bits=256)
u256 = sct.create(unrelated, bits=256)

print(sct.iscc_distance(a256.iscc, b256.iscc))  # translation
# 23
print(sct.iscc_distance(a256.iscc, u256.iscc))  # unrelated
# 123
```

| Goal                                       | Suggested length |
| ------------------------------------------ | ---------------- |
| Compact storage, coarse matching           | 64 bits          |
| Balanced precision                         | 128 bits         |
| Finest discrimination between near-matches | 256 bits         |

Both codes in a comparison must use the same `bits` value.
See [configuration](configuration.md) for how to set the default.

## Score similarity from raw digests

`iscc_distance()` works on code strings. To compare raw digests directly (for example, the output of `soft_hash_text_semantic()`), use `cosine_similarity()`, which scales the distance to a `-100` to `+100` score:

```python
import iscc_sct as sct

a = sct.soft_hash_text_semantic("An ISCC applies to a specific digital asset.")
b = sct.soft_hash_text_semantic("Ein ISCC bezieht sich auf ein bestimmtes digitales Gut.")
print(sct.cosine_similarity(a, b))
# 71
```

`hamming_distance()` returns the raw bit distance between two equal-length digests if you want the unscaled value. Both functions raise `ValueError` when the inputs differ in length.

## Match passages instead of whole documents

To find which parts of two documents are similar, rather than scoring them as a whole, use granular features and `granular_similarity()`. See [granular features](granular-features.md).

## Related pages

- **[Granular features](granular-features.md)** - Chunk-level matching with simprints.
- **[How it works](../explanation/how-it-works.md)** - Why distance reflects meaning.
- **[API reference](../reference/api.md)** - Signatures for every similarity function.

---

# Granular features

This guide shows how to work with granular features: the per-chunk fingerprints that let you match individual passages, locate them in the source text, and reconstruct content.

A document code summarizes a whole text in one code. Granular features instead describe each chunk the text was split into, so you can compare documents passage by passage.

## Generate granular features

Pass `granular=True` to `create()`. Each chunk becomes a `Feature` with a simprint (its similarity-preserving fingerprint), a character `offset`, a `size`, and the chunk `content`:

```python
import iscc_sct as sct

text = "This is some sample text. It can be a longer document or even an entire book."
meta = sct.create(text, bits=256, granular=True)
feature = meta.features[0].simprints[0]
print(feature.simprint, feature.offset, feature.size)
# XZjeSfdyVi0 0 77
print(feature.content)
# This is some sample text. It can be a longer document or even an entire book.
```

Short texts produce a single chunk. Longer texts produce many; see [matching passages](#match-passages-across-documents) below.

## Choose a feature format

Granular features come in two interchangeable shapes:

| Format        | Shape                                                        | Produced by                |
| ------------- | ------------------------------------------------------------ | -------------------------- |
| Object-Format | a list of self-contained `Feature` objects                   | `create(granular=True)`    |
| Index-Format  | parallel arrays: `simprints`, `offsets`, `sizes`, `contents` | `gen_text_code_semantic()` |

`create()` returns Object-Format, which is easy to read and iterate. Convert between the two with `to_index_format()` and `to_object_format()`:

```python
import iscc_sct as sct

meta = sct.create(text, bits=256, granular=True)  # Object-Format
index = meta.to_index_format()
print(index.features[0].simprints)
# ['XZjeSfdyVi0']
print(index.features[0].offsets)
# [0]
```

Index-Format stores related attributes in parallel arrays, which is compact for storage and bulk indexing. The low-level `gen_text_code_semantic()` returns a plain dict already in Index-Format.

## Control the simprint length

Granular simprints have their own bit-length, set by `bits_granular` (default **64**). It is independent of the document `bits`, so `bits=256, granular=True` still yields 64-bit simprints:

```python
import iscc_sct as sct

meta = sct.create(text, bits=256, granular=True, bits_granular=256)
digest = sct.decode_base64(meta.features[0].simprints[0].simprint)
print(len(digest) * 8)
# 256
```

Shorter simprints save space; longer simprints discriminate more finely between near-matching passages.
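
To make the bit-length arithmetic concrete, here is a minimal standard-library sketch that decodes an unpadded base64url simprint and counts differing bits. The helper names `simprint_bytes` and `bit_distance` are illustrative stand-ins, not part of `iscc-sct` (the library ships its own `decode_base64()` and `hamming_distance()`), and the second digest is a made-up value used only for comparison.

```python
import base64


def simprint_bytes(simprint: str) -> bytes:
    """Decode an unpadded base64url simprint string into raw bytes."""
    padding = "=" * (-len(simprint) % 4)  # restore the stripped '=' padding
    return base64.urlsafe_b64decode(simprint + padding)


def bit_distance(a: bytes, b: bytes) -> int:
    """Count differing bits between two equal-length digests."""
    if len(a) != len(b):
        raise ValueError("digests must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


first = simprint_bytes("XZjeSfdyVi0")  # simprint from the first example on this page
print(len(first) * 8)
# 64

# Hypothetical neighbor: the same digest with its lowest bit flipped.
second = first[:-1] + bytes([first[-1] ^ 0b00000001])
print(bit_distance(first, second))
# 1
```

An 11-character simprint therefore carries 64 bits, so a 256-bit simprint needs roughly four times the storage per chunk.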

## Reconstruct the original text

When features include both `offset` and `content`, `get_content()` stitches the chunks back into the original text, removing the overlap between adjacent chunks:

```python
import iscc_sct as sct

en = (
    "The International Standard Content Code identifies digital content of any media type. "
    "It is generated algorithmically from the content itself, much like a cryptographic hash. "
    "Unlike a cryptographic hash, the ISCC preserves similarity between related items. "
    "Two near-duplicate files therefore receive two codes that are close to each other. "
    "This property supports deduplication, similarity clustering, and content provenance. "
    "The Semantic Text-Code extends these ideas to the meaning of text across languages. "
    "A translation of a document keeps a code that stays close to the original code. "
    "That makes cross-lingual search and matching practical without machine translation."
)

meta = sct.create(en, granular=True)
print(len(meta.features[0].simprints))  # number of chunks
# 2
print(meta.get_content() == en)
# True
```

`get_overlaps()` returns the overlapping text between consecutive chunks if you need to inspect the chunk boundaries directly.

## Use byte offsets

Offsets and sizes are character positions by default. Enable `byte_offsets` to report UTF-8 byte positions instead, which is useful for fetching a chunk by random access from a remote file:

```python
import iscc_sct as sct

meta = sct.create(en, granular=True, byte_offsets=True)
print(meta.features[0].byte_offsets)
# True
print(meta.features[0].simprints[1].offset)  # byte offset of the second chunk
```

## Match passages across documents

`granular_similarity()` compares the simprints of two `Metadata` objects and returns the matching passages above a similarity threshold.
Each result is a `(Feature, score, Feature)` tuple, where the score is the `cosine_similarity` between the two simprints:

```python
import iscc_sct as sct

de = (
    "Der International Standard Content Code identifiziert digitale Inhalte jedes Medientyps. "
    "Er wird algorithmisch aus dem Inhalt selbst erzeugt, ähnlich einem kryptografischen Hash. "
    "Anders als ein kryptografischer Hash bewahrt der ISCC die Ähnlichkeit verwandter Objekte. "
    "Zwei nahezu identische Dateien erhalten daher zwei Codes, die nahe beieinander liegen. "
    "Diese Eigenschaft unterstützt Deduplizierung, Ähnlichkeits-Clustering und Herkunftsnachweis. "
    "Der Semantic Text-Code überträgt diese Ideen auf die Bedeutung von Text über Sprachen hinweg. "
    "Eine Übersetzung eines Dokuments behält einen Code, der nahe am Originalcode bleibt. "
    "Das macht sprachübergreifende Suche und Zuordnung ohne maschinelle Übersetzung praktikabel."
)

a = sct.create(en, granular=True)
b = sct.create(de, granular=True)

for feat_a, score, feat_b in sct.granular_similarity(a, b, threshold=80):
    print(score, feat_a.offset, feat_b.offset)
# 81 0 0
# 90 340 356
```

Each English passage is matched to its closest German counterpart, even though the offsets differ because the languages produce text of different lengths. Raise `threshold` to keep only the strongest matches; lower it to surface weaker ones. Only the single best match per passage in the first document is returned.

## Related pages

- **[Compare texts](compare-texts.md)** - Whole-document similarity.
- **[Configuration](configuration.md)** - Defaults for `bits_granular`, chunking, and offsets.
- **[How it works](../explanation/how-it-works.md)** - How chunks and overlaps are produced.

---

# Configuration

This guide shows how to control code length, chunking, and output detail through options: per call, through environment variables, or globally.

Every option has the same name everywhere: as a keyword argument, as an `ISCC_SCT_*` environment variable, and as a field on the `SctOptions` model.

## Options reference

| Option          | Env variable             | Default | Notes                                                   |
| --------------- | ------------------------ | ------- | ------------------------------------------------------- |
| `bits`          | `ISCC_SCT_BITS`          | `64`    | Document code length. 32–256, multiple of 32.           |
| `bits_granular` | `ISCC_SCT_BITS_GRANULAR` | `64`    | Granular simprint length. 32–256, multiple of 32.       |
| `characters`    | `ISCC_SCT_CHARACTERS`    | `True`  | Include the document character count.                   |
| `embedding`     | `ISCC_SCT_EMBEDDING`     | `False` | Include the global document embedding vector.           |
| `precision`     | `ISCC_SCT_PRECISION`     | `8`     | Max fractional digits when storing the embedding.       |
| `simprints`     | `ISCC_SCT_SIMPRINTS`     | `False` | Include granular per-chunk simprints.                   |
| `offsets`       | `ISCC_SCT_OFFSETS`       | `False` | Include per-chunk offsets.                              |
| `byte_offsets`  | `ISCC_SCT_BYTE_OFFSETS`  | `False` | Report UTF-8 byte offsets instead of character offsets. |
| `sizes`         | `ISCC_SCT_SIZES`         | `False` | Include per-chunk sizes.                                |
| `contents`      | `ISCC_SCT_CONTENTS`      | `False` | Include the per-chunk text.                             |
| `max_tokens`    | `ISCC_SCT_MAX_TOKENS`    | `127`   | Max tokens per chunk. Cannot exceed 127.                |
| `overlap`       | `ISCC_SCT_OVERLAP`       | `48`    | Max tokens shared between adjacent chunks.              |
| `trim`          | `ISCC_SCT_TRIM`          | `False` | Trim whitespace from chunks.                            |

The `granular=True` shortcut on `create()` is equivalent to setting `simprints`, `offsets`, `sizes`, and `contents` to `True` at once.

## Override per call

Pass options as keyword arguments to `create()`. They apply to that call only:

```python
import iscc_sct as sct

text = "This is some sample text. It can be a longer document or even an entire book."
meta = sct.create(text, bits=128, simprints=True, contents=True)
print(meta.iscc)
# ISCC:CABV3GG6JH3XEVRNSVYGCLJ7AAV3A
```

## Set defaults with environment variables

Set any option through its `ISCC_SCT_*` variable. This changes the default for every call in the process:

```bash
export ISCC_SCT_BITS=128
export ISCC_SCT_MAX_TOKENS=100
```

You can also place these in a `.env` file in the working directory; `iscc-sct` loads it automatically on import:

```ini
# .env
ISCC_SCT_BITS=128
ISCC_SCT_OVERLAP=24
```

## Override the global settings

`sct_opts` is the global settings instance. Use `override()` to get a validated copy with some fields changed, without mutating the global:

```python
import iscc_sct as sct

opts = sct.sct_opts.override({"bits": 128})
print(opts.bits)
# 128
print(sct.sct_opts.bits)
# 64 (the global is unchanged)
```

Prefer `override()` or per-call keyword arguments over assigning to `sct_opts` fields directly. Both keep the global default predictable for other code in the same process.

## Validation

Options are validated whenever they are set. An out-of-range or wrong-shaped value raises a pydantic `ValidationError`:

```python
import iscc_sct as sct

sct.sct_opts.override({"bits": 100})  # not a multiple of 32 -> ValidationError
sct.sct_opts.override({"max_tokens": 200})  # exceeds 127 -> ValidationError
```

!!! warning "Library and CLI defaults differ"

    `create()` and `SctOptions` default `bits` to **64**. The `iscc-sct` command-line tool defaults `--bits` to **256**. The same text therefore yields a different code length depending on which entry point you use. Set `bits` explicitly when you need a specific length.

## Related pages

- **[Compare texts](compare-texts.md)** - How `bits` affects matching.
- **[Granular features](granular-features.md)** - `bits_granular`, offsets, and chunk contents.
- **[Command line](command-line.md)** - Setting options from the CLI.
- **[API reference](../reference/api.md)** - The `SctOptions` model in full.

---

# Command line

This guide covers the `iscc-sct` command-line tool: generating codes from files, processing many files at once, checking the ONNX runtime, and launching the demo.

The CLI is installed with the package. Run `iscc-sct --help` to see every option:

```text
usage: iscc-sct [-h] [-b BITS] [-g] [-d] [-y] [path]

Generate Semantic Text-Codes for text files.

positional arguments:
  path             Path to text files (glob patterns), 'doctor' to check the ONNX runtime, or 'gui' for the demo.

options:
  -h, --help       show this help message and exit
  -b, --bits BITS  Bit-Length of Code (default 256)
  -g, --granular   Activate granular processing.
  -d, --debug      Show debugging messages.
  -y, --yes        Auto-confirm the 'doctor' runtime install.
```

## Generate a code from a file

Pass a path to print its Semantic Text-Code:

```bash
iscc-sct article.txt
# ISCC:CADV3GG6JH3XEVRNSVYGCLJ7AAV3BOT5J7EHEZKPFXEGRJ2CTWACGZI
```

The CLI defaults to **256-bit** codes. Set a different length with `--bits`:

```bash
iscc-sct --bits 64 article.txt
```

!!! note "The CLI default differs from the library"

    `iscc-sct` defaults to 256 bits, but `create()` in Python defaults to 64. Pass `--bits` (or the `bits` argument) explicitly when the length matters. See [configuration](configuration.md).

## Process multiple files

The `path` argument accepts a glob pattern. Quote it so that `iscc-sct`, rather than the shell, expands the pattern:

```bash
iscc-sct "texts/*.txt"
```

Each matching file prints its code. Files that cannot be decoded as UTF-8 are decoded with a detected character set, and empty files are skipped.

## Granular output

Add `--granular` to print the full metadata (document code plus per-chunk simprints, offsets, sizes, and contents) as JSON:

```bash
iscc-sct --granular article.txt
```

See [granular features](granular-features.md) for what the fields mean.
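
The offsets and sizes in granular output locate each chunk in the source text. As a rough illustration in plain Python, a character offset can slice the text directly, or be converted to a UTF-8 byte offset for random access into a file. The helper names below are hypothetical and not part of `iscc-sct` (which ships its own `char_to_byte_offsets` helper), and the sample text and offsets are made up for this sketch.

```python
def chunk_at(text: str, offset: int, size: int) -> str:
    """Return the chunk located by a character offset and size."""
    return text[offset : offset + size]


def to_byte_offset(text: str, char_offset: int) -> int:
    """Convert a character offset into the matching UTF-8 byte offset."""
    return len(text[:char_offset].encode("utf-8"))


text = "Ähnlichkeit bleibt erhalten."
print(chunk_at(text, 12, 6))
# bleibt

# 'Ä' takes two bytes in UTF-8, so the byte offset is one larger.
byte_offset = to_byte_offset(text, 12)
print(byte_offset)
# 13
print(text.encode("utf-8")[byte_offset : byte_offset + 6].decode("utf-8"))
# bleibt
```

Character and byte offsets agree only for pure-ASCII text, which is why the two modes are kept separate.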

## Run without installing

With [`uv`](https://docs.astral.sh/uv/), run the CLI in one line without a permanent install:

```bash
uvx "iscc-sct[cpu]" article.txt
```

## Check the ONNX runtime

`iscc-sct` needs exactly one ONNX runtime (`cpu` or `gpu` extra). The `doctor` command inspects your environment, names the problem, and recommends the right extra:

```bash
iscc-sct doctor
```

```text
iscc-sct ONNX runtime check
ONNX runtime: not installed
CUDA provider: no
NVIDIA GPU: no
Status: no ONNX runtime installed.
Recommended: pip install "iscc-sct[cpu]"
```

When a fix is available, `doctor` offers to run it. Add `--yes` to install without the prompt:

```bash
iscc-sct doctor --yes
```

`doctor` also detects the case where `onnxruntime-gpu` is installed but a plain `onnxruntime` package has shadowed it (issue #23), and reinstalls only the GPU build.

## Launch the demo

The `gui` command launches the interactive Gradio demo in your browser. It needs the `demo` extra:

```bash
pip install "iscc-sct[cpu,demo]"
iscc-sct gui
```

## Related pages

- **[Getting started](../tutorials/getting-started.md)** - Install and first code.
- **[Configuration](configuration.md)** - Options and environment variables.
- **[How it works](../explanation/how-it-works.md)** - What happens behind a code.

---

# How it works

This page explains how a Semantic Text-Code is built and why semantically similar texts, including translations, produce codes with low Hamming distance.

## The problem

The standard ISCC Content-Code for text matches on lexical similarity: it compares the words that appear in a document. That works well for near-duplicates, but it cannot tell that a German translation carries the same meaning as its English original, because the two share almost no words.

The Semantic Text-Code targets meaning instead of wording. It is built so that texts about the same thing land close together in code space, whatever language they are written in.
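
To see why word-level comparison fails on translations, consider this toy measure of word overlap. It is only an illustration: the Jaccard index over lowercase word sets is an assumption made for this sketch, not the algorithm behind the ISCC Content-Code, and `word_overlap` is a hypothetical helper.

```python
import re


def word_overlap(a: str, b: str) -> float:
    """Jaccard index of the lowercase word sets of two texts."""
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    return len(words_a & words_b) / len(words_a | words_b)


english = "An ISCC applies to a specific digital asset."
german = "Ein ISCC bezieht sich auf ein bestimmtes digitales Gut."

# Only "iscc" is shared, so the lexical overlap is close to zero.
print(round(word_overlap(english, german), 2))
# 0.07
```

Any measure built on shared words scores a translation about as low as unrelated text, which is the gap the Semantic Text-Code is meant to close.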

## The pipeline

```mermaid
graph LR
    A[Text] --> B[Split into chunks]
    B --> C[Embed each chunk]
    C --> D[Mean-pool to one vector]
    D --> E[Binarize]
    E --> F[ISCC header + base32]
```

Each stage has a specific job.

### Split

A document is first split into overlapping chunks at sensible boundaries (up to 127 tokens each, with up to 48 tokens of overlap). Overlap keeps a sentence that straddles a boundary from being lost to both chunks.

Text without regular paragraph breaks (such as text extracted from print-layout PDFs) takes a guarded code path that produces the same chunks without the super-linear cost the naive splitter would incur on those inputs.

### Embed

Each chunk is run through a multilingual sentence-transformer model ([paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2)) exported to ONNX. The model turns a chunk into a 384-dimensional embedding: a vector of numbers that encodes the chunk's meaning.

The model was trained so that texts with similar meaning, across more than 60 languages, map to nearby vectors. The per-token outputs are combined into one chunk vector by attention-mask pooling, then normalized.

### Aggregate

The chunk vectors are averaged into a single document vector and normalized again. This is the one vector that represents the whole document's meaning.

### Binarize

Each component of the document vector becomes one bit: non-negative values become `1`, negative values become `0`. The full vector yields a 384-bit digest. The digest is truncated to the requested length (`bits`, up to 256), prefixed with a 2-byte ISCC header that marks it as a Semantic Text-Code, and encoded as base32 with the `ISCC:` prefix.

## Why binarized vectors still match

Two texts with similar meaning produce similar embeddings: vectors pointing in nearly the same direction.
Nearly-aligned vectors agree on the sign of most of their components, so their bit patterns agree on most bits, giving a low Hamming distance.

Unrelated texts produce roughly perpendicular vectors, which agree on only about half their signs, so their codes differ in about half their bits. That gap between "few bits differ" and "about half the bits differ" is what makes a similarity threshold meaningful.

| Property          | Behavior                                                |
| ----------------- | ------------------------------------------------------- |
| Similar meaning   | Few differing bits (low Hamming distance)               |
| Unrelated content | About 50% of bits differ                                |
| Translation       | Treated like similar meaning: a near-match              |
| Longer `bits`     | Wider spread between near-matches and unrelated content |

`iscc_distance()` measures similarity by stripping the `ISCC:` prefix and the 2-byte header, then counting the differing bits between the two code bodies. `cosine_similarity()` rescales that distance to a `-100`–`+100` score. See [comparing texts](../howto/compare-texts.md).

## Cross-lingual matching

Cross-lingual matching is a direct consequence of the embedding model. Because the model was trained to place a sentence and its translation near each other in vector space, the document vectors stay close, the sign patterns stay mostly equal, and the codes stay a few bits apart.

No machine translation happens at code time. The shared meaning is captured once, by the model, during embedding.

## Granular features

Alongside the document code, `iscc-sct` can emit a simprint for each chunk. Because chunks carry their offsets, you can locate matching passages within documents and align similar passages across two documents, even when they sit at different positions. See [granular features](../howto/granular-features.md).

## Relation to the ISCC standard

The Semantic Text-Code is a planned, experimental ISCC-UNIT (MainType SEMANTIC, SubType TEXT).
It is a proof of concept and **not** part of [ISO 24138:2024](https://www.iso.org/standard/77899.html). The codes it produces may change between releases below v1.0.0.

The codes are still structurally compatible with the ISCC framework: a standard 2-byte ISCC header identifies the unit type and length, so a Semantic Text-Code slots in alongside the other ISCC-UNITs.

## Further reading

- **[Getting started](../tutorials/getting-started.md)** - Try the pipeline hands-on.
- **[Compare texts](../howto/compare-texts.md)** - Use the distance in practice.
- **[For Coding Agents](../reference/for-coding-agents.md)** - The exact algorithm, constants, and invariants.

---

# API reference

The API reference is generated from source docstrings and rendered with full signatures at https://sct.iscc.codes/reference/api/.

For a plain-text API map of every public symbol - its source module and purpose - see the For Coding Agents page (in this file and at https://sct.iscc.codes/reference/for-coding-agents/).

---

# For Coding Agents

A compressed reference for AI agents working on **iscc-sct** (ISCC Semantic Text-Code). Read this before editing the codebase or integrating the library. Terminology matches the source exactly.

`iscc-sct` turns text into a cross-lingual, similarity-preserving binary ISCC-UNIT (MainType SEMANTIC, SubType TEXT). It is an experimental proof of concept, **not** part of ISO 24138:2024.

## Architecture map

### File layout

| Path | Contains |
| --- | --- |
| `iscc_sct/main.py` | `create()` - high-level API; returns a `Metadata` object in Object-Format. |
| `iscc_sct/code_semantic_text.py` | Core algorithm. `gen_text_code_semantic()` (returns a plain dict, Index-Format), splitting, embedding, pooling, binarization. Module-level `@cache` singletons: `tokenizer()`, `splitter()`, `splitter_guarded()`, `model()`. |
| `iscc_sct/models.py` | Pydantic schema: `Feature`, `FeatureSet`, `Metadata` + format converters. |
| `iscc_sct/options.py` | `SctOptions` (pydantic-settings), `sct_opts` singleton, `.override()`. |
| `iscc_sct/utils.py` | Codecs (base32/base64url), distances, model download + blake3 integrity, `char_to_byte_offsets`, `MODEL_PATH`, `timer`. |
| `iscc_sct/cli.py` | `iscc-sct` console entry point (`main()`): glob, charset detection, `gui` subcommand. |
| `iscc_sct/demo.py`, `iscc_sct/app.py` | Gradio demo (Hugging Face Space). Omitted from coverage. |
| `iscc_sct/dev.py` | Dev-only poe task helpers. Omitted from coverage. |
| `iscc_sct/tokenizer.json` | Vendored tokenizer. Byte-exact; excluded from whitespace/EOL hooks. |
| `tests/chunking_vectors.json` | Frozen chunk-boundary test vectors. Never hand-edit (see Change playbook). |

### Pipeline

```text
text -> split_text() -> embed_chunks() -> mean_pooling() -> binarize() -> ISCC header + base32
```

`gen_text_code_semantic(text, **options)` orchestrates this:

1. Reject empty text (`ValueError`), then `sct_opts.override(options)`.
1. `split_text()` -> `[(offset, chunk), ...]` at semantic boundaries (max 127 tokens, 48 overlap).
1. `embed_chunks()` tokenizes (batch 100), runs the ONNX model, `attention_pooling()` per chunk.
1. `mean_pooling()` averages chunk vectors into one L2-normalized document vector.
1. `binarize()` (`vec >= 0` -> bits), truncate to `bits // 8`, prefix the 2-byte ISCC header, `encode_base32()`, prepend `"ISCC:"`.
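
The final pooling, binarization, and encoding steps can be sketched in plain Python. This is an illustrative sketch only, not the library's implementation: the helper names (`mean_pool`, `binarize_sketch`, `encode_code`) are hypothetical, the MSB-first bit packing is an assumption, and the length nibble `bits // 32 - 1` is an assumed stand-in for the values held in `BIT_LEN_MAP`.

```python
import base64
import math


def mean_pool(chunk_vectors):
    # type: (list[list[float]]) -> list[float]
    """Average chunk vectors and L2-normalize the result."""
    dims = len(chunk_vectors[0])
    mean = [sum(vec[i] for vec in chunk_vectors) / len(chunk_vectors) for i in range(dims)]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0
    return [x / norm for x in mean]


def binarize_sketch(vec):
    # type: (list[float]) -> bytes
    """Map each component to one bit (>= 0 -> 1), packed MSB-first (assumed bit order)."""
    bits = "".join("1" if x >= 0 else "0" for x in vec)
    bits = bits.ljust(-(-len(bits) // 8) * 8, "0")  # pad to a whole number of bytes
    return int(bits, 2).to_bytes(len(bits) // 8, "big")


def encode_code(digest, bits=64):
    # type: (bytes, int) -> str
    """Truncate the digest, prefix a 2-byte header, and base32-encode without padding."""
    if bits % 32 or not 32 <= bits <= 256:
        raise ValueError("bits must be a multiple of 32 between 32 and 256")
    # MainType SEMANTIC (0001), SubType TEXT (0000), Version (0000), length nibble.
    # Assumption: the nibble is bits // 32 - 1; the real values come from BIT_LEN_MAP.
    header = bytes([0b0001_0000, bits // 32 - 1])
    body = digest[: bits // 8]
    return "ISCC:" + base64.b32encode(header + body).decode("ascii").rstrip("=")


# Two toy 64-dimensional "embeddings" standing in for real chunk vectors.
chunks = [[1.0, -2.0] * 32, [3.0, -1.0] * 32]
code = encode_code(binarize_sketch(mean_pool(chunks)), bits=64)
print(code)
```

Because only the sign of each component survives binarization, small changes in the pooled vector flip few bits, which is the property the distance metrics rely on.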

### Import flow

```text
iscc_sct/__init__.py  (star-imports, defines __version__)
  -> options -> utils -> code_semantic_text -> models -> main

main               -> models, code_semantic_text, options
code_semantic_text -> onnxruntime (lazy import), semantic_text_splitter, tokenizers, numpy,
                      iscc_sct as sct (uses sct.* at call time, not import time)
utils              -> models
cli                -> main, charset_normalizer
```

### Public API

`iscc_sct/__init__.py` re-exports every module's `__all__`. The public surface (`import iscc_sct as sct`):

| Symbol | Source | Purpose |
| --- | --- | --- |
| `create(text, granular=False, **options)` | main | Primary API. Returns `Metadata` (Object-Format). |
| `gen_text_code_semantic(text, **options)` | code_semantic_text | Low-level. Returns `dict` (Index-Format). |
| `code_text_semantic(fp, **options)` | code_semantic_text | Same, reading a UTF-8 file path. |
| `soft_hash_text_semantic(text)` | code_semantic_text | Raw 384-bit digest (`bytes`), no header. |
| `embed_chunks(chunks, batch_size=100)` | code_semantic_text | Chunk list -> embedding array. |
| `Metadata`, `FeatureSet`, `Feature` | models | Result schema + converters. |
| `SctOptions`, `sct_opts` | options | Settings model + global instance. |
| `iscc_distance`, `hamming_distance`, `cosine_similarity`, `granular_similarity` | utils | Similarity metrics. |
| `encode_base32`, `decode_base32`, `encode_base64`, `decode_base64` | utils | Codecs. |
| `char_to_byte_offsets`, `get_model`, `MODEL_PATH`, `timer` | utils | Helpers. |
| `__version__` | `__init__` | `"0.2.0"`. |

## Decision dispatch

### Which entry point?

| Goal | Use |
| --- | --- |
| One ISCC code from a string, ready-to-use object | `create(text)` -> `Metadata` (Object-Format) |
| Granular per-chunk features (simprints/offsets/sizes/contents) | `create(text, granular=True)` |
| Compact parallel-array result for storage/indexing | `gen_text_code_semantic(text, ...)` (Index-Format dict) |
| Process a text file from disk | `code_text_semantic(path)` |
| Just the raw 384-bit vector digest | `soft_hash_text_semantic(text)` -> `bytes` |
| Command line / batch over files | `iscc-sct ` |

### Which similarity metric?

| Inputs | Use | Returns |
| --- | --- | --- |
| Two full ISCC code strings | `iscc_distance(iscc1, iscc2)` | Hamming distance in bits (`int`) |
| Two raw digests of equal length | `hamming_distance(a, b)` | bit distance |
| Two raw digests, normalized score | `cosine_similarity(a, b)` | `int` in `[-100, 100]` |
| Two `Metadata` objects with simprints | `granular_similarity(a, b, threshold=80)` | `[(Feature, similarity, Feature), ...]` |

### Which feature format?

| Format | Produced by | Shape | Convert with |
| --- | --- | --- | --- |
| Index-Format | `gen_text_code_semantic()` | parallel arrays `simprints`/`offsets`/`sizes`/`contents` | `Metadata.to_index_format()` |
| Object-Format | `create()` | list of self-contained `Feature` objects | `Metadata.to_object_format()` |

## Constraints and invariants

- **No base ONNX runtime.** The base package declares no `onnxruntime`. Install exactly one of the mutually exclusive `cpu` / `gpu` extras.
`onnxruntime` and `onnxruntime-gpu` unpack into the same directory and clobber each other (issue #23); `tool.uv.conflicts` enforces exclusivity for uv.
- **ONNX runtime is imported lazily.** `import iscc_sct`, the `iscc-sct` CLI, and `iscc-sct doctor` load without any runtime installed. `load_onnxruntime()` (called from `model()` on first code generation) raises a guarded `ImportError` with install instructions - keep that guard intact.
- **Codes are base32; simprints are base64url.** Never cross the codecs.
- **ISCC header is fixed:** MainType `SEMANTIC` (`0001`), SubType `TEXT` (`0000`), Version `0000`, plus a length nibble from `BIT_LEN_MAP`. 2 bytes, big-endian. `iscc_distance` strips the 5-char `ISCC:` prefix, base32-decodes, drops the 2-byte header, then compares bodies of equal length.
- **`bits` constraints:** `32 <= bits <= 256`, `multiple_of=32`. Same for `bits_granular`.
- **Default bit-length differs by entry point:** `SctOptions.bits` / `create()` default to **64**; the **CLI** `--bits` defaults to **256**.
- **`SctOptions` validates on assignment.** Mutate only via `.override(dict)`, which deep-copies and sets fields individually so validators run. Options flow as `**opts.model_dump()` through all layers. Env vars use the `ISCC_SCT_` prefix; `.env` is loaded.
- **Chunk boundaries are frozen** by `tests/chunking_vectors.json` and identical on both the normal (`splitter`) and guarded (`splitter_guarded`) paths. `needs_split_guard()` routes texts whose positions sit more than `SPLIT_GUARD_GAP` (8192) chars from the next paragraph separator to the guarded sizer (PDF-extracted text, issue #24).
- **Singletons:** `tokenizer()`, `splitter(**opts)`, `splitter_guarded(**opts)`, `model()` are `@cache`d. The splitter cache keys on the option kwargs, so all option values must be hashable.
- **Coverage must stay at 100%** (`--cov-fail-under=100`). GPU branches and model download use `# pragma: no cover`. `dev.py`, `demo.py`, and `tests/` are omitted from coverage.
- **Style:** PEP 484 **type comments** (first line after `def`), PEP 585 generics, PEP 604 unions. Short pure functions, max 3 args, no nested functions, sphinx-style docstrings. Ruff line length 100, LF endings, rule F401 disabled.

## Side effects catalog

| Function | Effect |
| --- | --- |
| `model()` (first call) | If the model file is absent/corrupt, `get_model()` **downloads ~450 MB** to the platformdirs user-data dir and verifies a blake3 checksum. Creates a cached ONNX `InferenceSession`. |
| `get_model()` | Network download + disk write to `MODEL_PATH`; blake3 integrity check (`# pragma: no cover`). |
| `tokenizer()`, `splitter()`, `splitter_guarded()` | Populate process-global `@cache` (loaded once). |
| `model()` | Calls `rt.preload_dlls()` when CUDA is available; logs a warning via `warn_gpu_shadowed()` if `onnxruntime-gpu` is installed but CUDA is missing. |
| `import iscc_sct.options` | `load_dotenv()` reads `.env`; instantiates `sct_opts`. |
| `import iscc_sct.utils` | `os.makedirs(user_data_dir)` (idempotent). |
| `create()`, `gen_text_code_semantic()` | Pure given a loaded model: no disk writes, no mutation of inputs. |
| CLI `main()` | Reads files, prints to stdout, removes the loguru logger unless `--debug`. |

## Task recipes

### Generate a code (library)

```python
import iscc_sct as sct

meta = sct.create("This is some sample text.", bits=256)
print(meta.iscc)  # "ISCC:CAD..."
print(meta.characters)  # input length in characters
```

### Granular per-chunk features

```python
import iscc_sct as sct

meta = sct.create(long_text, bits=256, granular=True)
for feature in meta.features[0].simprints:  # Object-Format
    print(feature.offset, feature.size, feature.simprint, feature.content[:40])
```

### Compare two texts (cross-lingual)

```python
import iscc_sct as sct

a = sct.create("An ISCC applies to a specific digital asset...")
b = sct.create("Ein ISCC bezieht sich auf ein bestimmtes digitales Gut...")
print(sct.iscc_distance(a.iscc, b.iscc))  # low bit distance => similar
```

### Match granular simprints

```python
import iscc_sct as sct

a = sct.create(doc_a, granular=True)
b = sct.create(doc_b, granular=True)
for feat_a, similarity, feat_b in sct.granular_similarity(a, b, threshold=80):
    print(similarity, feat_a.offset, feat_b.offset)
```

### Configure via options

```python
import iscc_sct as sct

# Per-call override (validated copy of the global settings):
meta = sct.create(text, bits=128, contents=True)

# Or globally via environment: ISCC_SCT_BITS=128, ISCC_SCT_MAX_TOKENS=127, ...
```

## Change playbook

| If you change... | Also update... |
| --- | --- |
| The embedding model file | `MODEL_FILENAME`, `MODEL_URL`, `MODEL_CHECKSUM` in `utils.py`; verify codes are bit-stable or bump version + mark BREAKING in CHANGELOG. |
| Chunking (`max_tokens`, `overlap`, `split_text`, `needs_split_guard`, token sizing) | Regenerate `tests/chunking_vectors.json` via `uv run python tests/test_chunking_vectors.py` (only for an intentional algorithm change); multi-chunk codes shift. |
| A public function name/signature | The module's `__all__` (it is star-exported through `__init__.py`); README/docs examples. |
| Add a processing option | Add a field to `SctOptions` (env var name, constraints); it flows via `model_dump()`. Wire into `create()` if it is a granular toggle; add to the README config table. |
| `Metadata` / `FeatureSet` / `Feature` schema | Keep `to_index_format()`, `to_object_format()`, `get_content()`, `get_overlaps()` consistent. |
| ISCC header constants (`MAINTYPE`/`SUBTYPE`/`SCT_VERSION`/`BIT_LEN_MAP`) | Codes change - this is BREAKING; bump version and document. |
| onnxruntime import / provider logic | Keep the `cpu`/`gpu` extras, `tool.uv.conflicts`, the `ImportError` guard, and `warn_gpu_shadowed()` in sync. |
| Any code path | Add tests to keep coverage at 100%; use `# pragma: no cover` only for GPU/download branches. Run `uv run poe all`. |
| Dependencies | Regenerate `requirements.txt` via `uv run poe export-requirements` (never hand-edit it). |

## Common mistakes

**NEVER** add `onnxruntime` as a base dependency to "make install easier."

```toml
# WRONG - re-breaks issue #23: [gpu] then installs both wheels and CPU clobbers GPU
dependencies = ["onnxruntime"]
```

**ALWAYS** keep it behind the mutually exclusive `cpu` / `gpu` extras.

---

**NEVER** use inline type annotations.

```python
def binarize(vec: NDArray) -> bytes: ...  # WRONG for this codebase
```

**ALWAYS** use PEP 484 type comments as the first line after `def`.

```python
def binarize(vec):
    # type: (NDArray) -> bytes
    ...
```

---

**NEVER** expect Object-Format from the low-level function.

```python
meta = sct.gen_text_code_semantic(text)  # returns a dict in INDEX-Format
meta.features[0].simprints  # WRONG - it is a dict, not Metadata
```

**ALWAYS** use `create()` for an Object-Format `Metadata`, or wrap the dict: `Metadata(**data)`.

---

**NEVER** mutate `SctOptions` fields directly or assume CLI/library defaults match.

```python
sct.sct_opts.bits = 256  # avoid - mutates the global; use .override()
sct.create(text)  # library default bits=64, NOT the CLI's 256
```

**ALWAYS** pass overrides per call (`create(text, bits=256)`) or via `sct_opts.override({...})`.

---

**NEVER** hand-edit `tests/chunking_vectors.json`, `requirements.txt`, `iscc_sct/tokenizer.json`, or the model file. Each is generated or byte-exact.

**ALWAYS** regenerate via the documented command and run `uv run poe all` before reporting done.

---

**NEVER** mix codecs: ISCC codes are RFC4648 base32 (no padding); granular simprints are base64url (no padding). Decode with the matching `decode_base32` / `decode_base64`.
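
To make the codec rule concrete, here is a minimal stdlib-only sketch of decoding both unpadded encodings, together with the distance computation that `iscc_distance` is described as performing (strip `ISCC:`, base32-decode, drop the 2-byte header, count differing bits). The function names `decode_b32_nopad`, `decode_b64url_nopad`, and `code_distance` are hypothetical stand-ins, not the library's API, and the two codes in the usage lines are made-up values chosen for illustration.

```python
import base64


def decode_b32_nopad(text):
    # type: (str) -> bytes
    """Decode RFC 4648 base32 after restoring the stripped '=' padding."""
    return base64.b32decode(text + "=" * (-len(text) % 8))


def decode_b64url_nopad(text):
    # type: (str) -> bytes
    """Decode base64url after restoring the stripped '=' padding."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def code_distance(iscc_a, iscc_b):
    # type: (str, str) -> int
    """Hamming distance in bits between the bodies of two ISCC code strings."""
    body_a = decode_b32_nopad(iscc_a[5:])[2:]  # strip "ISCC:", drop the 2-byte header
    body_b = decode_b32_nopad(iscc_b[5:])[2:]
    if len(body_a) != len(body_b):
        raise ValueError("codes must have the same bit length")
    return sum(bin(x ^ y).count("1") for x, y in zip(body_a, body_b))


# Made-up 64-bit codes that differ only in the final bit of the body.
distance = code_distance("ISCC:CAA2VKVKVKVKVKVK", "ISCC:CAA2VKVKVKVKVKVL")
print(distance)
```

Feeding a base64url simprint to the base32 decoder (or the reverse) either raises an error or silently yields unrelated bytes, which is why the two codecs should stay strictly separated.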