# iscc-sct

[![Tests](https://github.com/iscc/iscc-sct/actions/workflows/tests.yml/badge.svg)](https://github.com/iscc/iscc-sct/actions/workflows/tests.yml)
[![Version](https://img.shields.io/pypi/v/iscc-sct.svg)](https://pypi.python.org/pypi/iscc-sct/)
[![Downloads](https://pepy.tech/badge/iscc-sct)](https://pepy.tech/project/iscc-sct)
[![License](https://img.shields.io/badge/license-Apache%202.0-green.svg)](https://github.com/iscc/iscc-sct/blob/main/LICENSE)
[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/iscc/iscc-sct)

**A cross-lingual, similarity-preserving binary code for text. Semantically similar texts, including
translations, produce codes with low Hamming distance.**

!!! warning "Proof of concept"

    All releases below v1.0.0 may break backward compatibility and produce incompatible Semantic
    Text-Codes. The algorithms in `iscc-sct` are experimental and **not** part of the official
    [ISO 24138:2024](https://www.iso.org/standard/77899.html) standard.

## Introduction

The [ISCC](https://iscc.codes) framework already includes a Text-Code based on lexical similarity
for near-duplicate matching. The Semantic Text-Code (SCT) is a planned additional ISCC-UNIT that
captures a broader, more abstract similarity. It is engineered to be robust against rephrasing and,
most notably, translations that lexical matching cannot detect.

`iscc-sct` turns any text into a compact binary code built from a binarized, multilingual document
embedding. The same content expressed in different languages maps to **(near-)identical codes**,
opening up cross-lingual content identification and similarity detection.

| Feature       | ISCC Content-Code Text   | ISCC Semantic-Code Text           |
| ------------- | ------------------------ | --------------------------------- |
| Focus         | Lexical similarity       | Semantic similarity               |
| Cross-lingual | No                       | Yes                               |
| Use case      | Near-duplicate detection | Semantic similarity, translations |

**Key features:**

- **Semantic similarity** - deep multilingual embeddings capture the meaning of the text
- **Translation matching** - near-identical codes for the same content across languages
- **Bit-length flexibility** - codes from 32 up to 256 bits for adjustable granularity
- **ISCC compatible** - codes integrate directly with existing ISCC-based systems
- **60+ languages** - one model covers a broad range of scripts and language pairs

## Quick start

=== "pip"

    ```bash
    pip install "iscc-sct[cpu]"
    ```

=== "uv"

    ```bash
    uv add "iscc-sct[cpu]"
    ```

=== "uvx (no install)"

    ```bash
    uvx "iscc-sct[cpu]" "path/to/textfile.txt"
    ```

Generate a Semantic Text-Code from Python:

```python
import iscc_sct as sct

text = "This is some sample text. It can be a longer document or even an entire book."
print(sct.create(text, bits=256).iscc)
# ISCC:CADV3GG6JH3XEVRNSVYGCLJ7AAV3BOT5J7EHEZKPFXEGRJ2CTWACGZI
```

Or from the command line:

```bash
iscc-sct "path/to/textfile.txt"
```

!!! note "Choose exactly one ONNX runtime"

    A plain `pip install iscc-sct` installs no ONNX runtime; the first code generation then fails with
    install instructions. Install the `cpu` extra (works everywhere) or the `gpu` extra for NVIDIA CUDA
    acceleration (requires CUDA 12.x and cuDNN 9.x), or run `iscc-sct doctor` to detect and install the
    right one. Never install both: `onnxruntime` and `onnxruntime-gpu` unpack into the same directory
    and silently overwrite each other.

## How it works

```text
Text -> split -> embed -> mean-pool -> binarize -> ISCC header + base32
```

1. **Split** the text into overlapping chunks at syntactically sensible boundaries.
1. **Embed** each chunk with a multilingual sentence-transformer running on ONNX.
1. **Aggregate** the chunk embeddings into one mean-pooled, normalized document vector.
1. **Binarize** the vector (positive components become 1-bits), truncate to the requested
    bit-length, prefix the ISCC header, and base32-encode it.

Because the embedding captures meaning rather than wording, the resulting code tolerates rephrasing
and translation, so a short binary code is enough for cross-lingual matching.

## Documentation

<div class="grid cards" markdown>

- **[Getting started](tutorials/getting-started.md)** - Tutorial

    Install, generate your first code, and match a translation across languages.

- **[How-to guides](howto/compare-texts.md)** - Task recipes

    Compare texts, work with granular features, configure options, and use the CLI.

- **[How it works](explanation/how-it-works.md)** - Understand the design

    The pipeline, cross-lingual matching, and why binarized vectors still match.

- **[API reference](reference/api.md)** - Library details

    Generated reference for `create()`, the data model, options, and similarity functions.

- **[For Coding Agents](reference/for-coding-agents.md)** - Build on iscc-sct

    A dense architecture map, constraints, and task recipes for AI coding agents.

- **[Live Demo :lucide-external-link:](https://huggingface.co/spaces/iscc/iscc-sct)** - Try it in
    the browser

    Generate and compare Semantic Text-Codes interactively on Hugging Face Spaces.

- **[Source Code :lucide-external-link:](https://github.com/iscc/iscc-sct)** - Read the
    implementation

    The full algorithm, CLI, and Gradio demo on GitHub.

- **[Full text for LLMs :lucide-external-link:](llms-full.txt)** - Machine-readable docs

    Every page concatenated into a single file for language models.

</div>

??? note "Supported languages (60+)"

    Arabic, Armenian, Bengali, Bosnian, Bulgarian, Burmese, Catalan, Chinese (China), Chinese (Taiwan),
    Croatian, Czech, Danish, Dutch, English, Estonian, Farsi, Finnish, French, French (Canada),
    Galician, German, Greek, Gujarati, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian,
    Japanese, Kannada, Korean, Kurdish, Latvian, Lithuanian, Macedonian, Malay, Malayalam, Marathi,
    Mongolian, Norwegian Bokmål, Persian, Polish, Portuguese, Portuguese (Brazil), Romanian, Russian,
    Serbian, Sinhala, Slovak, Slovenian, Spanish, Swedish, Tamil, Telugu, Thai, Turkish, Ukrainian,
    Urdu, Vietnamese.

[Source code on GitHub :lucide-external-link:](https://github.com/iscc/iscc-sct){ .md-button }

---

# Getting started

Install `iscc-sct`, generate your first Semantic Text-Code, and watch two languages produce
near-identical codes.

## Prerequisites

- Python 3.10 or later
- [`uv`](https://docs.astral.sh/uv/) (recommended) or `pip`

## Install

`iscc-sct` needs an ONNX runtime, selected through an install extra. The `cpu` extra works
everywhere:

=== "uv"

    ```bash
    uv add "iscc-sct[cpu]"
    ```

=== "pip"

    ```bash
    pip install "iscc-sct[cpu]"
    ```

!!! warning "Choose exactly one runtime"

    Install the `cpu` extra, or the `gpu` extra for NVIDIA CUDA acceleration — never both. The
    `onnxruntime` and `onnxruntime-gpu` packages unpack into the same directory and overwrite each
    other. A plain `pip install iscc-sct` installs no runtime, and the first code generation then fails
    with install instructions. If you are unsure which extra fits your machine, run `iscc-sct doctor`
    (see the [command-line guide](../howto/command-line.md)).

### Verify the installation

```python
import iscc_sct

print(iscc_sct.__version__)
```

## Generate your first code

Pass any text to `create()`. The first call downloads the embedding model (about 450 MB) to your
user data directory; later calls reuse it.

```python
import iscc_sct as sct

text = "This is some sample text. It can be a longer document or even an entire book."
print(sct.create(text, bits=256).iscc)
# ISCC:CADV3GG6JH3XEVRNSVYGCLJ7AAV3BOT5J7EHEZKPFXEGRJ2CTWACGZI
```

The `bits` argument sets the code length. Longer codes carry more detail; 64 bits is the library
default, and 256 bits is the most precise. See [comparing texts](../howto/compare-texts.md) for how
length affects matching.

## Match a translation

The point of a Semantic Text-Code is that meaning survives translation. Generate codes for the same
passage in English and German, then measure their distance:

```python
import iscc_sct as sct

english = (
    "An ISCC applies to a specific digital asset and is a data-descriptor deterministically "
    "constructed from multiple hash digests using the algorithms and rules in this document. "
    "This document does not provide information on registration of ISCCs."
)
german = (
    "Ein ISCC bezieht sich auf ein bestimmtes digitales Gut und ist ein Daten-Deskriptor, der "
    "deterministisch aus mehreren Hash-Digests unter Verwendung der Algorithmen und Regeln in "
    "diesem Dokument erstellt wird. Dieses Dokument enthält keine Informationen über die "
    "Registrierung von ISCCs."
)

a = sct.create(english)
b = sct.create(german)

print(sct.iscc_distance(a.iscc, b.iscc))
# 3
```

A distance of `3` bits out of 64 means the translation is a near-match. For contrast, an unrelated
sentence lands far away:

```python
import iscc_sct as sct

unrelated = (
    "The recipe calls for two cups of flour, a pinch of salt, and three ripe bananas mashed "
    "until smooth before folding the mixture into the buttered baking tin."
)

print(sct.iscc_distance(a.iscc, sct.create(unrelated).iscc))
# 33
```

Low distance means similar meaning; high distance means unrelated content. Translation matching is
the behavior that lexical (word-based) codes cannot provide.

## Look inside the text

Set `granular=True` to get per-chunk features alongside the document code. Each chunk carries its
own offset, size, similarity-preserving fingerprint (simprint), and text:

```python
import iscc_sct as sct

text = "This is some sample text. It can be a longer document or even an entire book."
meta = sct.create(text, bits=256, granular=True)

feature = meta.features[0].simprints[0]
print(feature.offset, feature.size, feature.simprint)
# 0 77 XZjeSfdyVi0
```

Granular features let you match individual passages across documents, even when the surrounding text
differs. The [granular features guide](../howto/granular-features.md) covers chunk-level matching
and content reconstruction.

## Next steps

- **[Compare texts](../howto/compare-texts.md)** — Measure semantic and cross-lingual similarity.
- **[Granular features](../howto/granular-features.md)** — Work with per-chunk simprints and
    offsets.
- **[Configuration](../howto/configuration.md)** — Tune bit-length, chunking, and output options.
- **[How it works](../explanation/how-it-works.md)** — Understand why the codes match across
    languages.

---

# Compare texts

This guide shows how to measure how similar two texts are in meaning — including across languages —
by comparing their Semantic Text-Codes.

Similarity is computed from the codes, not the original text. You can store or share the codes and
still compare content without keeping the source documents.

## Compare two whole documents

Generate a code for each text, then measure the Hamming distance between them with
`iscc_distance()`. It returns the number of differing bits — lower means more similar:

```python
import iscc_sct as sct

english = (
    "An ISCC applies to a specific digital asset and is a data-descriptor deterministically "
    "constructed from multiple hash digests using the algorithms and rules in this document. "
    "This document does not provide information on registration of ISCCs."
)
german = (
    "Ein ISCC bezieht sich auf ein bestimmtes digitales Gut und ist ein Daten-Deskriptor, der "
    "deterministisch aus mehreren Hash-Digests unter Verwendung der Algorithmen und Regeln in "
    "diesem Dokument erstellt wird. Dieses Dokument enthält keine Informationen über die "
    "Registrierung von ISCCs."
)

a = sct.create(english)
b = sct.create(german)

print(sct.iscc_distance(a.iscc, b.iscc))
# 3
```

`iscc_distance()` strips the `ISCC:` prefix and the 2-byte header, then compares the code bodies.
The two codes must be the same bit-length, or it raises `ValueError`.
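
To make the comparison concrete, here is a minimal stdlib sketch of the same steps. It is an
illustration, not the library's implementation: `code_body` and `sketch_iscc_distance` are
hypothetical names, and the two sample codes are taken from other examples in these docs.

```python
import base64


def code_body(iscc: str) -> bytes:
    """Drop the 'ISCC:' prefix, base32-decode, and skip the 2-byte header."""
    b32 = iscc[5:] if iscc.startswith("ISCC:") else iscc
    padded = b32 + "=" * (-len(b32) % 8)  # b32decode needs a multiple of 8 characters
    return base64.b32decode(padded)[2:]


def sketch_iscc_distance(a: str, b: str) -> int:
    """Count differing bits between two code bodies of equal length."""
    body_a, body_b = code_body(a), code_body(b)
    if len(body_a) != len(body_b):
        raise ValueError("codes must have the same bit-length")
    return sum(bin(x ^ y).count("1") for x, y in zip(body_a, body_b))


code_256 = "ISCC:CADV3GG6JH3XEVRNSVYGCLJ7AAV3BOT5J7EHEZKPFXEGRJ2CTWACGZI"
code_128 = "ISCC:CABV3GG6JH3XEVRNSVYGCLJ7AAV3A"

print(len(code_body(code_256)) * 8)  # 256
print(len(code_body(code_128)) * 8)  # 128
print(sketch_iscc_distance(code_256, code_256))  # 0
```

Passing `code_256` and `code_128` together raises `ValueError` in this sketch, mirroring the
same-length rule described above.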

## Read the distance

Distance is measured in bits, so the scale depends on the code length. Compare a translation against
an unrelated text at the same length to see the spread:

```python
import iscc_sct as sct

unrelated = (
    "The recipe calls for two cups of flour, a pinch of salt, and three ripe bananas mashed "
    "until smooth before folding the mixture into the buttered baking tin."
)
u = sct.create(unrelated)

print(sct.iscc_distance(a.iscc, b.iscc))  # translation
# 3
print(sct.iscc_distance(a.iscc, u.iscc))  # unrelated
# 33
```

A translation sits a few bits apart; unrelated content sits near half the bit-length away (random
codes differ in about 50% of their bits).

## Choose a bit-length

Longer codes widen the gap, counted in bits, between similar and dissimilar texts, which makes a
threshold easier to pick. The same three texts at 256 bits:

```python
import iscc_sct as sct

a256 = sct.create(english, bits=256)
b256 = sct.create(german, bits=256)
u256 = sct.create(unrelated, bits=256)

print(sct.iscc_distance(a256.iscc, b256.iscc))  # translation
# 23
print(sct.iscc_distance(a256.iscc, u256.iscc))  # unrelated
# 123
```

| Goal                                       | Suggested length |
| ------------------------------------------ | ---------------- |
| Compact storage, coarse matching           | 64 bits          |
| Balanced precision                         | 128 bits         |
| Finest discrimination between near-matches | 256 bits         |

Both codes in a comparison must use the same `bits` value. See [configuration](configuration.md) for
how to set the default.
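
Because the raw distance scales with the code length, it can help to express it as a fraction of
the bit-length before comparing results across lengths. The sketch below only reuses the four
distances printed on this page; `observed` and `gaps` are hypothetical names.

```python
# Distances printed above, keyed by (bit-length, kind of pair).
observed = {
    (64, "translation"): 3,
    (64, "unrelated"): 33,
    (256, "translation"): 23,
    (256, "unrelated"): 123,
}

for (bits, kind), distance in observed.items():
    # Show the raw distance next to the share of bits that differ.
    print(f"{bits:>3} bits  {kind:<11}  {distance:>3}  {distance / bits:.1%}")

# Absolute gap between the unrelated and the translated pair at each length.
gaps = {
    bits: observed[(bits, "unrelated")] - observed[(bits, "translation")]
    for bits in (64, 256)
}
print(gaps)  # {64: 30, 256: 100}
```

The unrelated pair stays close to 50% at both lengths, while the absolute room for a threshold grows
from 30 to 100 bits.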

## Score similarity from raw digests

`iscc_distance()` works on code strings. To compare raw digests directly — for example the output of
`soft_hash_text_semantic()` — use `cosine_similarity()`, which scales the distance to a `-100` to
`+100` score:

```python
import iscc_sct as sct

a = sct.soft_hash_text_semantic("An ISCC applies to a specific digital asset.")
b = sct.soft_hash_text_semantic("Ein ISCC bezieht sich auf ein bestimmtes digitales Gut.")

print(sct.cosine_similarity(a, b))
# 71
```

`hamming_distance()` returns the raw bit distance between two equal-length digests if you want the
unscaled value. Both functions raise `ValueError` when the inputs differ in length.
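
The relationship between the two functions can be sketched in a few lines. This is an assumption
about the scaling, not the library's code: it maps the Hamming distance linearly onto `-100` to
`+100` and truncates to an integer. `sketch_hamming` and `sketch_similarity` are hypothetical names.

```python
def sketch_hamming(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length digests."""
    if len(a) != len(b):
        raise ValueError("digests must have equal length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def sketch_similarity(a: bytes, b: bytes) -> int:
    """Linear rescaling: 0 differing bits -> 100, half -> 0, all -> -100."""
    total_bits = len(a) * 8
    return int((1 - 2 * sketch_hamming(a, b) / total_bits) * 100)  # truncation is an assumption


zeros = bytes(8)                          # 64 zero bits
ones = bytes([0xFF] * 8)                  # all 64 bits differ from zeros
half = bytes([0x0F] * 8)                  # 32 of 64 bits differ from zeros
nine = bytes([0xFF, 0x01] + [0x00] * 6)   # 9 of 64 bits differ from zeros

print(sketch_similarity(zeros, zeros))  # 100
print(sketch_similarity(zeros, ones))   # -100
print(sketch_similarity(zeros, half))   # 0
print(sketch_similarity(zeros, nine))   # 71
```

Under this assumption, a 64-bit pair that differs in 9 bits scores 71.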

## Match passages instead of whole documents

To find which parts of two documents are similar — rather than scoring them as a whole — use
granular features and `granular_similarity()`. See [granular features](granular-features.md).

## Related pages

- **[Granular features](granular-features.md)** — Chunk-level matching with simprints.
- **[How it works](../explanation/how-it-works.md)** — Why distance reflects meaning.
- **[API reference](../reference/api.md)** — Signatures for every similarity function.

---

# Granular features

This guide shows how to work with granular features: the per-chunk fingerprints that let you match
individual passages, locate them in the source text, and reconstruct content.

A document code summarizes a whole text in one code. Granular features instead describe each chunk
the text was split into, so you can compare documents passage by passage.

## Generate granular features

Pass `granular=True` to `create()`. Each chunk becomes a `Feature` with a simprint (its
similarity-preserving fingerprint), a character `offset`, a `size`, and the chunk `content`:

```python
import iscc_sct as sct

text = "This is some sample text. It can be a longer document or even an entire book."
meta = sct.create(text, bits=256, granular=True)

feature = meta.features[0].simprints[0]
print(feature.simprint, feature.offset, feature.size)
# XZjeSfdyVi0 0 77
print(feature.content)
# This is some sample text. It can be a longer document or even an entire book.
```

Short texts produce a single chunk. Longer texts produce many — see
[matching passages](#match-passages-across-documents) below.

## Choose a feature format

Granular features come in two interchangeable shapes:

| Format        | Shape                                                        | Produced by                |
| ------------- | ------------------------------------------------------------ | -------------------------- |
| Object-Format | a list of self-contained `Feature` objects                   | `create(granular=True)`    |
| Index-Format  | parallel arrays: `simprints`, `offsets`, `sizes`, `contents` | `gen_text_code_semantic()` |

`create()` returns Object-Format, which is easy to read and iterate. Convert between the two with
`to_index_format()` and `to_object_format()`:

```python
import iscc_sct as sct

meta = sct.create(text, bits=256, granular=True)  # Object-Format

index = meta.to_index_format()
print(index.features[0].simprints)  # ['XZjeSfdyVi0']
print(index.features[0].offsets)  # [0]
```

Index-Format stores related attributes in parallel arrays, which is compact for storage and bulk
indexing. The low-level `gen_text_code_semantic()` returns a plain dict already in Index-Format.
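
The difference between the two shapes is easiest to see on plain data. The sketch below uses
ordinary dicts and lists rather than the library's model classes; `to_index` and `to_objects` are
hypothetical helpers that only illustrate the idea behind the conversion.

```python
def to_index(features: list) -> dict:
    """Object-Format (one dict per chunk) -> Index-Format (parallel arrays)."""
    return {
        "simprints": [f["simprint"] for f in features],
        "offsets": [f["offset"] for f in features],
        "sizes": [f["size"] for f in features],
        "contents": [f["content"] for f in features],
    }


def to_objects(index: dict) -> list:
    """Index-Format (parallel arrays) -> Object-Format (one dict per chunk)."""
    keys = ("simprint", "offset", "size", "content")
    columns = (index["simprints"], index["offsets"], index["sizes"], index["contents"])
    return [dict(zip(keys, row)) for row in zip(*columns)]


features = [
    {
        "simprint": "XZjeSfdyVi0",
        "offset": 0,
        "size": 77,
        "content": "This is some sample text. It can be a longer document or even an entire book.",
    }
]

index = to_index(features)
print(index["simprints"], index["offsets"])  # ['XZjeSfdyVi0'] [0]
print(to_objects(index) == features)  # True
```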

## Control the simprint length

Granular simprints have their own bit-length, set by `bits_granular` (default **64**). It is
independent of the document `bits`, so `bits=256, granular=True` still yields 64-bit simprints. Pass
`bits_granular` explicitly to change it:

```python
import iscc_sct as sct

meta = sct.create(text, bits=256, granular=True, bits_granular=256)
digest = sct.decode_base64(meta.features[0].simprints[0].simprint)
print(len(digest) * 8)
# 256
```

Shorter simprints save space; longer simprints discriminate more finely between near-matching passages.
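
You can check a simprint's length without the library. The sketch below assumes simprints are
unpadded, URL-safe base64 strings, which fits the sample shown on this page; `simprint_bytes` is a
hypothetical helper.

```python
import base64


def simprint_bytes(simprint: str) -> bytes:
    """Decode an unpadded base64 simprint into its raw digest."""
    padded = simprint + "=" * (-len(simprint) % 4)  # restore the stripped padding
    return base64.urlsafe_b64decode(padded)


digest = simprint_bytes("XZjeSfdyVi0")
print(len(digest) * 8)  # 64
print(digest.hex())  # 5d98de49f772562d
```

An 11-character simprint therefore corresponds to the default 64-bit length.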

## Reconstruct the original text

When features include both `offset` and `content`, `get_content()` stitches the chunks back into the
original text, removing the overlap between adjacent chunks:

```python
import iscc_sct as sct

en = (
    "The International Standard Content Code identifies digital content of any media type. "
    "It is generated algorithmically from the content itself, much like a cryptographic hash. "
    "Unlike a cryptographic hash, the ISCC preserves similarity between related items. "
    "Two near-duplicate files therefore receive two codes that are close to each other. "
    "This property supports deduplication, similarity clustering, and content provenance. "
    "The Semantic Text-Code extends these ideas to the meaning of text across languages. "
    "A translation of a document keeps a code that stays close to the original code. "
    "That makes cross-lingual search and matching practical without machine translation."
)
meta = sct.create(en, granular=True)

print(len(meta.features[0].simprints))  # number of chunks
# 2
print(meta.get_content() == en)
# True
```

`get_overlaps()` returns the overlapping text between consecutive chunks if you need to inspect the
chunk boundaries directly.
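
The stitching idea can be shown with character offsets alone. This is a simplified sketch, not the
library's `get_content()`: `stitch` is a hypothetical function that takes `(offset, content)` pairs
and skips the part of each chunk that earlier chunks already covered.

```python
def stitch(chunks: list) -> str:
    """Rebuild a text from (char_offset, content) pairs with overlapping chunks."""
    text = ""
    for offset, content in sorted(chunks):
        already_covered = len(text) - offset  # characters this chunk shares with the text so far
        if already_covered < 0:
            raise ValueError("gap between chunks; the text cannot be fully reconstructed")
        text += content[already_covered:]
    return text


source = "Alpha beta gamma. Delta epsilon zeta. Eta theta iota."
chunks = [(0, source[:37]), (18, source[18:])]  # the two chunks overlap by 19 characters

print(stitch(chunks) == source)  # True
```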

## Use byte offsets

Offsets and sizes are character positions by default. Enable `byte_offsets` to report UTF-8 byte
positions instead — useful for fetching a chunk by random access from a remote file:

```python
import iscc_sct as sct

meta = sct.create(en, granular=True, byte_offsets=True)
print(meta.features[0].byte_offsets)  # True
print(meta.features[0].simprints[1].offset)  # byte offset of the second chunk
```
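
Character and byte positions diverge as soon as the text contains non-ASCII characters. The
following stdlib sketch shows the relationship; `char_to_byte_offset` is a hypothetical helper, not
part of `iscc-sct`.

```python
def char_to_byte_offset(text: str, char_offset: int) -> int:
    """UTF-8 byte position of the character at char_offset."""
    return len(text[:char_offset].encode("utf-8"))


text = "Ähnlichkeit über Sprachen"
char_offset = text.index("über")

print(char_offset)  # 12
print(char_to_byte_offset(text, char_offset))  # 13

# Slicing the encoded bytes at the byte offset lands on the same word.
print(text.encode("utf-8")[13:].decode("utf-8"))  # über Sprachen
```

The leading `Ä` takes two bytes in UTF-8, so every later byte offset is shifted by one.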

## Match passages across documents

`granular_similarity()` compares the simprints of two `Metadata` objects and returns the matching
passages above a similarity threshold. Each result is a `(Feature, score, Feature)` tuple, where the
score is the `cosine_similarity` between the two simprints:

```python
import iscc_sct as sct

de = (
    "Der International Standard Content Code identifiziert digitale Inhalte jedes Medientyps. "
    "Er wird algorithmisch aus dem Inhalt selbst erzeugt, ähnlich einem kryptografischen Hash. "
    "Anders als ein kryptografischer Hash bewahrt der ISCC die Ähnlichkeit verwandter Objekte. "
    "Zwei nahezu identische Dateien erhalten daher zwei Codes, die nahe beieinander liegen. "
    "Diese Eigenschaft unterstützt Deduplizierung, Ähnlichkeits-Clustering und Herkunftsnachweis. "
    "Der Semantic Text-Code überträgt diese Ideen auf die Bedeutung von Text über Sprachen hinweg. "
    "Eine Übersetzung eines Dokuments behält einen Code, der nahe am Originalcode bleibt. "
    "Das macht sprachübergreifende Suche und Zuordnung ohne maschinelle Übersetzung praktikabel."
)

a = sct.create(en, granular=True)
b = sct.create(de, granular=True)

for feat_a, score, feat_b in sct.granular_similarity(a, b, threshold=80):
    print(score, feat_a.offset, feat_b.offset)
# 81 0 0
# 90 340 356
```

Each English passage is matched to its closest German counterpart, even though the offsets differ
because the languages produce text of different lengths. Raise `threshold` to keep only the
strongest matches; lower it to surface weaker ones. Only the single best match per passage in the
first document is returned.
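
The selection rule (best match per passage, filtered by a threshold) can be sketched on toy data.
The code below is an illustration under assumptions: it uses 8-bit stand-ins for simprints, a linear
`-100` to `+100` score, and an inclusive threshold. `score` and `best_matches` are hypothetical names.

```python
def score(a: bytes, b: bytes) -> int:
    """Linear similarity score from the Hamming distance (assumed scaling)."""
    distance = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
    return int((1 - 2 * distance / (len(a) * 8)) * 100)


def best_matches(first: list, second: list, threshold: int) -> list:
    """For each simprint in `first`, keep its best match in `second` if it reaches the threshold."""
    results = []
    for i, sim_a in enumerate(first):
        scored = [(score(sim_a, sim_b), j) for j, sim_b in enumerate(second)]
        best_score, j = max(scored)
        if best_score >= threshold:
            results.append((i, best_score, j))
    return results


doc_a = [bytes([0b11110000]), bytes([0b00001111])]
doc_b = [bytes([0b00001110]), bytes([0b11110000]), bytes([0b10101010])]

print(best_matches(doc_a, doc_b, threshold=80))  # [(0, 100, 1)]
print(best_matches(doc_a, doc_b, threshold=70))  # [(0, 100, 1), (1, 75, 0)]
```

Lowering the threshold from 80 to 70 surfaces the weaker second match, as described above.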

## Related pages

- **[Compare texts](compare-texts.md)** — Whole-document similarity.
- **[Configuration](configuration.md)** — Defaults for `bits_granular`, chunking, and offsets.
- **[How it works](../explanation/how-it-works.md)** — How chunks and overlaps are produced.

---

# Configuration

This guide shows how to control code length, chunking, and output detail through options — per call,
through environment variables, or globally.

Every option has the same name everywhere: as a keyword argument, as an `ISCC_SCT_*` environment
variable, and as a field on the `SctOptions` model.

## Options reference

| Option          | Env variable             | Default | Notes                                                   |
| --------------- | ------------------------ | ------- | ------------------------------------------------------- |
| `bits`          | `ISCC_SCT_BITS`          | `64`    | Document code length. 32–256, multiple of 32.           |
| `bits_granular` | `ISCC_SCT_BITS_GRANULAR` | `64`    | Granular simprint length. 32–256, multiple of 32.       |
| `characters`    | `ISCC_SCT_CHARACTERS`    | `True`  | Include the document character count.                   |
| `embedding`     | `ISCC_SCT_EMBEDDING`     | `False` | Include the global document embedding vector.           |
| `precision`     | `ISCC_SCT_PRECISION`     | `8`     | Max fractional digits when storing the embedding.       |
| `simprints`     | `ISCC_SCT_SIMPRINTS`     | `False` | Include granular per-chunk simprints.                   |
| `offsets`       | `ISCC_SCT_OFFSETS`       | `False` | Include per-chunk offsets.                              |
| `byte_offsets`  | `ISCC_SCT_BYTE_OFFSETS`  | `False` | Report UTF-8 byte offsets instead of character offsets. |
| `sizes`         | `ISCC_SCT_SIZES`         | `False` | Include per-chunk sizes.                                |
| `contents`      | `ISCC_SCT_CONTENTS`      | `False` | Include the per-chunk text.                             |
| `max_tokens`    | `ISCC_SCT_MAX_TOKENS`    | `127`   | Max tokens per chunk. Cannot exceed 127.                |
| `overlap`       | `ISCC_SCT_OVERLAP`       | `48`    | Max tokens shared between adjacent chunks.              |
| `trim`          | `ISCC_SCT_TRIM`          | `False` | Trim whitespace from chunks.                            |

The `granular=True` shortcut on `create()` is equivalent to setting `simprints`, `offsets`, `sizes`,
and `contents` to `True` at once.

## Override per call

Pass options as keyword arguments to `create()`. They apply to that call only:

```python
import iscc_sct as sct

text = "This is some sample text. It can be a longer document or even an entire book."
meta = sct.create(text, bits=128, simprints=True, contents=True)
print(meta.iscc)
# ISCC:CABV3GG6JH3XEVRNSVYGCLJ7AAV3A
```

## Set defaults with environment variables

Set any option through its `ISCC_SCT_*` variable. This changes the default for every call in the
process:

```bash
export ISCC_SCT_BITS=128
export ISCC_SCT_MAX_TOKENS=100
```

You can also place these in a `.env` file in the working directory — `iscc-sct` loads it
automatically on import:

```ini
# .env
ISCC_SCT_BITS=128
ISCC_SCT_OVERLAP=24
```

## Override the global settings

`sct_opts` is the global settings instance. Use `override()` to get a validated copy with some
fields changed, without mutating the global:

```python
import iscc_sct as sct

opts = sct.sct_opts.override({"bits": 128})
print(opts.bits)  # 128
print(sct.sct_opts.bits)  # 64 — the global is unchanged
```

Prefer `override()` or per-call keyword arguments over assigning to `sct_opts` fields directly. Both
keep the global default predictable for other code in the same process.

## Validation

Options are validated whenever they are set. An out-of-range or wrong-shaped value raises a pydantic
`ValidationError`:

```python
import iscc_sct as sct

sct.sct_opts.override({"bits": 100})  # not a multiple of 32 -> ValidationError
sct.sct_opts.override({"max_tokens": 200})  # exceeds 127 -> ValidationError
```
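
The two rules from the options table can be written out as plain checks. This is only a sketch of
the constraints, not the pydantic validators used by `SctOptions`; the function names are
hypothetical, and requiring a positive `max_tokens` is an assumption.

```python
def is_valid_bits(bits: int) -> bool:
    """True for multiples of 32 in the range 32..256."""
    return 32 <= bits <= 256 and bits % 32 == 0


def is_valid_max_tokens(max_tokens: int) -> bool:
    """True when the chunk size is positive (assumed) and does not exceed 127."""
    return 0 < max_tokens <= 127


print([bits for bits in range(1, 300) if is_valid_bits(bits)])
# [32, 64, 96, 128, 160, 192, 224, 256]
print(is_valid_bits(100), is_valid_max_tokens(200))  # False False
```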

!!! warning "Library and CLI defaults differ"

    `create()` and `SctOptions` default `bits` to **64**. The `iscc-sct` command-line tool defaults
    `--bits` to **256**. The same text therefore yields a different code length depending on which entry
    point you use. Set `bits` explicitly when you need a specific length.

## Related pages

- **[Compare texts](compare-texts.md)** — How `bits` affects matching.
- **[Granular features](granular-features.md)** — `bits_granular`, offsets, and chunk contents.
- **[Command line](command-line.md)** — Setting options from the CLI.
- **[API reference](../reference/api.md)** — The `SctOptions` model in full.

---

# Command line

This guide covers the `iscc-sct` command-line tool: generating codes from files, processing many
files at once, checking the ONNX runtime, and launching the demo.

The CLI is installed with the package. Run `iscc-sct --help` to see every option:

```text
usage: iscc-sct [-h] [-b BITS] [-g] [-d] [-y] [path]

Generate Semantic Text-Codes for text files.

positional arguments:
  path             Path to text files (glob patterns), 'doctor' to check the
                   ONNX runtime, or 'gui' for the demo.

options:
  -h, --help       show this help message and exit
  -b, --bits BITS  Bit-Length of Code (default 256)
  -g, --granular   Activate granular processing.
  -d, --debug      Show debugging messages.
  -y, --yes        Auto-confirm the 'doctor' runtime install.
```

## Generate a code from a file

Pass a path to print its Semantic Text-Code:

```bash
iscc-sct article.txt
# ISCC:CADV3GG6JH3XEVRNSVYGCLJ7AAV3BOT5J7EHEZKPFXEGRJ2CTWACGZI
```

The CLI defaults to **256-bit** codes. Set a different length with `--bits`:

```bash
iscc-sct --bits 64 article.txt
```

!!! note "The CLI default differs from the library"

    `iscc-sct` defaults to 256 bits, but `create()` in Python defaults to 64. Pass `--bits` (or the
    `bits` argument) explicitly when the length matters. See [configuration](configuration.md).

## Process multiple files

The `path` argument accepts a glob pattern. Quote the pattern so that the shell passes it through
unchanged and `iscc-sct` performs the expansion:

```bash
iscc-sct "texts/*.txt"
```

Each matching file prints its code. Files that cannot be decoded as UTF-8 are decoded with a
detected character set, and empty files are skipped.

## Granular output

Add `--granular` to print the full metadata — document code plus per-chunk simprints, offsets,
sizes, and contents — as JSON:

```bash
iscc-sct --granular article.txt
```

See [granular features](granular-features.md) for what the fields mean.

## Run without installing

With [`uv`](https://docs.astral.sh/uv/), run the CLI in one line without a permanent install:

```bash
uvx "iscc-sct[cpu]" article.txt
```

## Check the ONNX runtime

`iscc-sct` needs exactly one ONNX runtime (`cpu` or `gpu` extra). The `doctor` command inspects your
environment, names the problem, and recommends the right extra:

```bash
iscc-sct doctor
```

```text
iscc-sct ONNX runtime check

  ONNX runtime:   not installed
  CUDA provider:  no
  NVIDIA GPU:     no

Status: no ONNX runtime installed.
Recommended: pip install "iscc-sct[cpu]"
```

When a fix is available, `doctor` offers to run it. Add `--yes` to install without the prompt:

```bash
iscc-sct doctor --yes
```

`doctor` also detects the case where `onnxruntime-gpu` is installed but a plain `onnxruntime`
package has shadowed it (issue #23), and reinstalls only the GPU build.

## Launch the demo

The `gui` command launches the interactive Gradio demo in your browser. It needs the `demo` extra:

```bash
pip install "iscc-sct[cpu,demo]"
iscc-sct gui
```

## Related pages

- **[Getting started](../tutorials/getting-started.md)** — Install and first code.
- **[Configuration](configuration.md)** — Options and environment variables.
- **[How it works](../explanation/how-it-works.md)** — What happens behind a code.

---

# How it works

This page explains how a Semantic Text-Code is built and why semantically similar texts — including
translations — produce codes with low Hamming distance.

## The problem

The standard ISCC Content-Code for text matches on lexical similarity: it compares the words that
appear in a document. That works well for near-duplicates, but it cannot tell that a German
translation carries the same meaning as its English original — the two share almost no words.

The Semantic Text-Code targets meaning instead of wording. It is built so that texts about the same
thing land close together in code space, whatever language they are written in.

## The pipeline

```mermaid
graph LR
    A[Text] --> B[Split into chunks]
    B --> C[Embed each chunk]
    C --> D[Mean-pool to one vector]
    D --> E[Binarize]
    E --> F[ISCC header + base32]
```

Each stage has a specific job.

### Split

A document is first split into overlapping chunks at sensible boundaries (up to 127 tokens each,
with up to 48 tokens of overlap). Thanks to the overlap, a sentence that straddles a chunk boundary
is usually still intact in one of the two chunks instead of being cut apart in both.

Text without regular paragraph breaks — such as text extracted from print-layout PDFs — takes a
guarded code path that produces the same chunks without the super-linear cost the naive splitter
would incur on those inputs.
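
The effect of the two limits can be sketched with fixed-size windows. This is a deliberate
simplification: the real splitter picks boundaries at sensible positions in the text and treats both
numbers as upper bounds, whereas the hypothetical `windows` function below always uses the maximum.

```python
def windows(tokens: list, max_tokens: int = 127, overlap: int = 48) -> list:
    """Cut a token list into fixed-size windows that share `overlap` tokens."""
    step = max_tokens - overlap  # how far each window advances
    result = []
    start = 0
    while True:
        result.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break
        start += step
    return result


tokens = [f"t{i}" for i in range(300)]
chunks = windows(tokens)

print([len(chunk) for chunk in chunks])  # [127, 127, 127, 63]
print(chunks[0][-48:] == chunks[1][:48])  # True
```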

### Embed

Each chunk is run through a multilingual sentence-transformer model
([paraphrase-multilingual-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2))
exported to ONNX. The model turns a chunk into a 384-dimensional embedding: a vector of numbers that
encodes the chunk's meaning. The model was trained so that texts with similar meaning — across more
than 60 languages — map to nearby vectors.

The per-token outputs are combined into one chunk vector by attention-mask pooling, then normalized.
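
This page does not spell out the pooling formula. A common reading of attention-mask pooling is a
mean over the non-padding tokens followed by L2 normalization, and the pure-Python sketch below
assumes exactly that. `masked_mean_pool` is a hypothetical name.

```python
import math


def masked_mean_pool(token_vectors: list, attention_mask: list) -> list:
    """Average the vectors whose mask is 1, then scale the result to unit length."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dims = len(token_vectors[0])
    mean = [sum(vec[d] for vec in kept) / len(kept) for d in range(dims)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean]


token_vectors = [[3.0, 0.0], [0.0, 4.0], [9.0, 9.0]]  # the last row stands for padding
attention_mask = [1, 1, 0]

pooled = masked_mean_pool(token_vectors, attention_mask)
print(pooled)  # [0.6, 0.8]
```

The padding row is ignored, so it cannot pull the chunk vector toward an arbitrary direction.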

### Aggregate

The chunk vectors are averaged into a single document vector and normalized again. This is the one
vector that represents the whole document's meaning.

### Binarize

Each component of the document vector becomes one bit: positive values become `1`, the rest become
`0`. The full vector yields a 384-bit digest.

The digest is truncated to the requested length (`bits`, up to 256), prefixed with a 2-byte ISCC
header that marks it as a Semantic Text-Code, and encoded as base32 with the `ISCC:` prefix.
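
The aggregate and binarize stages can be sketched together on toy vectors. Several details here are
assumptions rather than documented behavior: the most-significant-bit-first packing, and the header
bytes `0x10` and `bits // 32 - 1`, which were read off the sample codes in these docs. The function
names are hypothetical.

```python
import base64
import math


def mean_pool(chunk_vectors: list) -> list:
    """Average the chunk vectors and scale the result to unit length."""
    dims = len(chunk_vectors[0])
    mean = [sum(vec[d] for vec in chunk_vectors) / len(chunk_vectors) for d in range(dims)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean]


def binarize(vector: list) -> bytes:
    """Positive components become 1-bits, everything else 0-bits (MSB first, assumed)."""
    bit_string = "".join("1" if x > 0 else "0" for x in vector)
    return int(bit_string, 2).to_bytes(len(bit_string) // 8, "big")


def encode(digest: bytes, bits: int) -> str:
    """Truncate, prefix the assumed 2-byte header, and base32-encode without padding."""
    header = bytes([0x10, bits // 32 - 1])  # assumption derived from the sample codes
    body = digest[: bits // 8]
    return "ISCC:" + base64.b32encode(header + body).decode("ascii").rstrip("=")


chunk_vectors = [[0.5, -0.25] * 16, [0.25, -0.75] * 16]  # two toy 32-dimensional chunk embeddings
digest = binarize(mean_pool(chunk_vectors))

print(digest.hex())  # aaaaaaaa
print(encode(digest, 32))  # ISCC:CAAKVKVKVI
```

With the real 384-dimensional model output, `binarize` would return 48 bytes, and `encode` would keep
the first `bits // 8` of them.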

## Why binarized vectors still match

Two texts with similar meaning produce similar embeddings — vectors pointing in nearly the same
direction. Nearly-aligned vectors agree on the sign of most of their components, so their bit
patterns agree on most bits, giving a low Hamming distance.

Unrelated texts produce roughly perpendicular vectors, which agree on only about half their signs —
so their codes differ in about half their bits. That gap between "few bits differ" and "about half
the bits differ" is what makes a similarity threshold meaningful.
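
A small simulation makes the gap visible. It uses synthetic Gaussian vectors instead of real
embeddings, so the exact counts are not representative of the model; it only illustrates that a
slightly perturbed vector keeps most of its signs while an independent one keeps about half.

```python
import random

rng = random.Random(42)
DIMS = 384  # same dimensionality as the embedding model


def sign_disagreements(u: list, v: list) -> int:
    """Number of positions where the two vectors would binarize to different bits."""
    return sum((a > 0) != (b > 0) for a, b in zip(u, v))


base = [rng.gauss(0, 1) for _ in range(DIMS)]
similar = [x + 0.1 * rng.gauss(0, 1) for x in base]  # small perturbation of the same vector
unrelated = [rng.gauss(0, 1) for _ in range(DIMS)]   # independent vector

near = sign_disagreements(base, similar)
far = sign_disagreements(base, unrelated)
print(near, far)
```

Only components that were already close to zero flip their sign under the perturbation, which is why
`near` stays small compared with `far`.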

| Property          | Behavior                                                |
| ----------------- | ------------------------------------------------------- |
| Similar meaning   | Few differing bits (low Hamming distance)               |
| Unrelated content | About 50% of bits differ                                |
| Translation       | Treated like similar meaning — a near-match             |
| Longer `bits`     | Wider spread between near-matches and unrelated content |

`iscc_distance()` measures similarity by stripping the `ISCC:` prefix and the 2-byte header, then
counting the differing bits between the two code bodies. `cosine_similarity()` takes two raw digests
and rescales their bit distance to an integer score from `-100` to `+100`. See
[comparing texts](../howto/compare-texts.md).
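
The comparison described above can be sketched with the standard library alone. `code_body`,
`bit_distance`, `rescale`, and `toy_code` are hypothetical names, and the linear rescaling formula
is an assumption about how a bit distance maps onto the `-100` to `+100` range; the library's exact
rounding may differ.

```python
import base64


def code_body(iscc):
    """Decode an ISCC string and return the digest bytes after the 2-byte header."""
    if not iscc.startswith("ISCC:"):
        raise ValueError("expected an 'ISCC:' prefix")
    text = iscc[5:]
    padded = text + "=" * (-len(text) % 8)  # restore the padding base32 decoding needs
    return base64.b32decode(padded)[2:]


def bit_distance(iscc_a, iscc_b):
    """Count differing bits between two code bodies of equal length."""
    a, b = code_body(iscc_a), code_body(iscc_b)
    if len(a) != len(b):
        raise ValueError("codes must have the same bit length")
    return bin(int.from_bytes(a, "big") ^ int.from_bytes(b, "big")).count("1")


def rescale(distance, total_bits):
    """ASSUMPTION: linear map from bit distance to a score in [-100, 100]."""
    return int(100 * (1 - 2 * distance / total_bits))


def toy_code(body):
    """Build a hand-made 64-bit code (header bytes 0x10 0x01) for demonstration only."""
    return "ISCC:" + base64.b32encode(b"\x10\x01" + body).decode("ascii").rstrip("=")


a = toy_code(bytes(8))
b = toy_code(b"\xff" + bytes(7))  # differs from `a` in exactly 8 bits
print(bit_distance(a, b), rescale(bit_distance(a, b), 64))  # 8 75
```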

## Cross-lingual matching

Cross-lingual matching is a direct consequence of the embedding model. Because the model was trained
to place a sentence and its translation near each other in vector space, the document vectors stay
close, the sign patterns stay mostly equal, and the codes stay a few bits apart.

No machine translation happens at code time. The shared meaning is captured once, by the model,
during embedding.

## Granular features

Alongside the document code, `iscc-sct` can emit a simprint for each chunk. Because chunks carry
their offsets, you can locate matching passages within documents and align similar passages across
two documents — even when they sit at different positions. See
[granular features](../howto/granular-features.md).

## Relation to the ISCC standard

The Semantic Text-Code is a planned, experimental ISCC-UNIT (MainType SEMANTIC, SubType TEXT). It is
a proof of concept and **not** part of [ISO 24138:2024](https://www.iso.org/standard/77899.html).
The codes it produces may change between releases below v1.0.0.

The codes are still structurally compatible with the ISCC framework: a standard 2-byte ISCC header
identifies the unit type and length, so a Semantic Text-Code slots in alongside the other
ISCC-UNITs.

## Further reading

- **[Getting started](../tutorials/getting-started.md)** — Try the pipeline hands-on.
- **[Compare texts](../howto/compare-texts.md)** — Use the distance in practice.
- **[For Coding Agents](../reference/for-coding-agents.md)** — The exact algorithm, constants, and
    invariants.

---

# API reference

The API reference is generated from source docstrings and rendered with full signatures at
https://sct.iscc.codes/reference/api/.

For a plain-text API map of every public symbol - its source module and purpose - see the
For Coding Agents page (in this file and at https://sct.iscc.codes/reference/for-coding-agents/).

---

# For Coding Agents

A compressed reference for AI agents working on **iscc-sct** (ISCC Semantic Text-Code). Read this
before editing the codebase or integrating the library. Terminology matches the source exactly.

`iscc-sct` turns text into a cross-lingual, similarity-preserving binary ISCC-UNIT (MainType
SEMANTIC, SubType TEXT). It is an experimental proof of concept, **not** part of ISO 24138:2024.

## Architecture map

### File layout

| Path                                  | Contains                                                                                                                                                                                                                      |
| ------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `iscc_sct/main.py`                    | `create()` - high-level API; returns a `Metadata` object in Object-Format.                                                                                                                                                    |
| `iscc_sct/code_semantic_text.py`      | Core algorithm. `gen_text_code_semantic()` (returns a plain dict, Index-Format), splitting, embedding, pooling, binarization. Module-level `@cache` singletons: `tokenizer()`, `splitter()`, `splitter_guarded()`, `model()`. |
| `iscc_sct/models.py`                  | Pydantic schema: `Feature`, `FeatureSet`, `Metadata` + format converters.                                                                                                                                                     |
| `iscc_sct/options.py`                 | `SctOptions` (pydantic-settings), `sct_opts` singleton, `.override()`.                                                                                                                                                        |
| `iscc_sct/utils.py`                   | Codecs (base32/base64url), distances, model download + blake3 integrity, `char_to_byte_offsets`, `MODEL_PATH`, `timer`.                                                                                                       |
| `iscc_sct/cli.py`                     | `iscc-sct` console entry point (`main()`): glob, charset detection, `gui` subcommand.                                                                                                                                         |
| `iscc_sct/demo.py`, `iscc_sct/app.py` | Gradio demo (Hugging Face Space). Omitted from coverage.                                                                                                                                                                      |
| `iscc_sct/dev.py`                     | Dev-only poe task helpers. Omitted from coverage.                                                                                                                                                                             |
| `iscc_sct/tokenizer.json`             | Vendored tokenizer. Byte-exact; excluded from whitespace/EOL hooks.                                                                                                                                                           |
| `tests/chunking_vectors.json`         | Frozen chunk-boundary test vectors. Never hand-edit (see Change playbook).                                                                                                                                                    |

### Pipeline

```text
text -> split_text() -> embed_chunks() -> mean_pooling() -> binarize() -> ISCC header + base32
```

`gen_text_code_semantic(text, **options)` orchestrates this:

1. Reject empty text (`ValueError`), then `sct_opts.override(options)`.
1. `split_text()` -> `[(offset, chunk), ...]` at semantic boundaries (max 127 tokens, 48 overlap).
1. `embed_chunks()` tokenizes (batch 100), runs the ONNX model, `attention_pooling()` per chunk.
1. `mean_pooling()` averages chunk vectors into one L2-normalized document vector.
1. `binarize()` (`vec >= 0` -> bits), truncate to `bits // 8`, prefix the 2-byte ISCC header,
    `encode_base32()`, prepend `"ISCC:"`.

### Import flow

```text
iscc_sct/__init__.py  (star-imports, defines __version__)
  -> options -> utils -> code_semantic_text -> models -> main
main            -> models, code_semantic_text, options
code_semantic_text -> onnxruntime (lazy import), semantic_text_splitter, tokenizers, numpy,
                       iscc_sct as sct  (uses sct.* at call time, not import time)
utils           -> models
cli             -> main, charset_normalizer
```

### Public API

`iscc_sct/__init__.py` re-exports every module's `__all__`. The public surface
(`import iscc_sct as sct`):

| Symbol                                                                          | Source             | Purpose                                          |
| ------------------------------------------------------------------------------- | ------------------ | ------------------------------------------------ |
| `create(text, granular=False, **options)`                                       | main               | Primary API. Returns `Metadata` (Object-Format). |
| `gen_text_code_semantic(text, **options)`                                       | code_semantic_text | Low-level. Returns `dict` (Index-Format).        |
| `code_text_semantic(fp, **options)`                                             | code_semantic_text | Same, reading a UTF-8 file path.                 |
| `soft_hash_text_semantic(text)`                                                 | code_semantic_text | Raw 384-bit digest (`bytes`), no header.         |
| `embed_chunks(chunks, batch_size=100)`                                          | code_semantic_text | Chunk list -> embedding array.                   |
| `Metadata`, `FeatureSet`, `Feature`                                             | models             | Result schema + converters.                      |
| `SctOptions`, `sct_opts`                                                        | options            | Settings model + global instance.                |
| `iscc_distance`, `hamming_distance`, `cosine_similarity`, `granular_similarity` | utils              | Similarity metrics.                              |
| `encode_base32`, `decode_base32`, `encode_base64`, `decode_base64`              | utils              | Codecs.                                          |
| `char_to_byte_offsets`, `get_model`, `MODEL_PATH`, `timer`                      | utils              | Helpers.                                         |
| `__version__`                                                                   | `__init__`         | `"0.2.0"`.                                       |

## Decision dispatch

### Which entry point?

| Goal                                                           | Use                                                     |
| -------------------------------------------------------------- | ------------------------------------------------------- |
| One ISCC code from a string, ready-to-use object               | `create(text)` -> `Metadata` (Object-Format)            |
| Granular per-chunk features (simprints/offsets/sizes/contents) | `create(text, granular=True)`                           |
| Compact parallel-array result for storage/indexing             | `gen_text_code_semantic(text, ...)` (Index-Format dict) |
| Process a text file from disk                                  | `code_text_semantic(path)`                              |
| Just the raw 384-bit vector digest                             | `soft_hash_text_semantic(text)` -> `bytes`              |
| Command line / batch over files                                | `iscc-sct <glob>`                                       |

### Which similarity metric?

| Inputs                                | Use                                       | Returns                                 |
| ------------------------------------- | ----------------------------------------- | --------------------------------------- |
| Two full ISCC code strings            | `iscc_distance(iscc1, iscc2)`             | Hamming distance in bits (`int`)        |
| Two raw digests of equal length       | `hamming_distance(a, b)`                  | Hamming distance in bits (`int`)        |
| Two raw digests, normalized score     | `cosine_similarity(a, b)`                 | `int` in `[-100, 100]`                  |
| Two `Metadata` objects with simprints | `granular_similarity(a, b, threshold=80)` | `[(Feature, similarity, Feature), ...]` |

### Which feature format?

| Format        | Produced by                | Shape                                                    | Convert to this format with   |
| ------------- | -------------------------- | -------------------------------------------------------- | ----------------------------- |
| Index-Format  | `gen_text_code_semantic()` | parallel arrays `simprints`/`offsets`/`sizes`/`contents` | `Metadata.to_index_format()`  |
| Object-Format | `create()`                 | list of self-contained `Feature` objects                 | `Metadata.to_object_format()` |
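
To make the difference between the two shapes concrete, here is a sketch on plain dicts. It is not
the library's converter code: the real methods operate on the Pydantic models, the helper names are
hypothetical, and the simprint strings are placeholders. The field names follow the table above and
the attributes used in the granular recipe.

```python
PLURAL_TO_SINGULAR = {
    "simprints": "simprint",
    "offsets": "offset",
    "sizes": "size",
    "contents": "content",
}
SINGULAR_TO_PLURAL = {single: plural for plural, single in PLURAL_TO_SINGULAR.items()}


def index_to_objects(feature_set):
    """Turn parallel arrays into a list of self-contained per-chunk dicts."""
    present = [name for name in PLURAL_TO_SINGULAR if feature_set.get(name) is not None]
    rows = zip(*(feature_set[name] for name in present))
    return [{PLURAL_TO_SINGULAR[n]: v for n, v in zip(present, row)} for row in rows]


def objects_to_index(features):
    """Inverse: collect each field of the per-chunk dicts into a parallel array."""
    if not features:
        return {}
    return {SINGULAR_TO_PLURAL[f]: [item[f] for item in features] for f in features[0]}


index_format = {"simprints": ["AAAA", "BBBB"], "offsets": [0, 120], "sizes": [150, 90]}
object_format = index_to_objects(index_format)
print(object_format)
```

The parallel-array shape is compact for storage, while the per-chunk shape keeps each simprint
together with its offset and size.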

## Constraints and invariants

- **No base ONNX runtime.** The base package declares no `onnxruntime`. Install exactly one of the
    mutually exclusive `cpu` / `gpu` extras. `onnxruntime` and `onnxruntime-gpu` unpack into the
    same directory and clobber each other (issue #23); `tool.uv.conflicts` enforces exclusivity for
    uv.
- **ONNX runtime is imported lazily.** `import iscc_sct`, the `iscc-sct` CLI, and `iscc-sct doctor`
    load without any runtime installed. `load_onnxruntime()` (called from `model()` on first code
    generation) raises a guarded `ImportError` with install instructions - keep that guard intact.
- **Codes are base32; simprints are base64url.** Never cross the codecs.
- **ISCC header is fixed:** MainType `SEMANTIC` (`0001`), SubType `TEXT` (`0000`), Version `0000`,
    plus a length nibble from `BIT_LEN_MAP`. 2 bytes, big-endian. `iscc_distance` strips the 5-char
    `ISCC:` prefix, base32-decodes, drops the 2-byte header, then compares bodies of equal length.
- **`bits` constraints:** `32 <= bits <= 256`, `multiple_of=32`. Same for `bits_granular`.
- **Default bit-length differs by entry point:** `SctOptions.bits` / `create()` default to **64**;
    the **CLI** `--bits` defaults to **256**.
- **`SctOptions` validates on assignment.** Mutate only via `.override(dict)`, which deep-copies and
    sets fields individually so validators run. Options flow as `**opts.model_dump()` through all
    layers. Env vars use the `ISCC_SCT_` prefix; `.env` is loaded.
- **Chunk boundaries are frozen** by `tests/chunking_vectors.json` and identical on both the normal
    (`splitter`) and guarded (`splitter_guarded`) paths. `needs_split_guard()` routes texts whose
    positions sit more than `SPLIT_GUARD_GAP` (8192) chars from the next paragraph separator to the
    guarded sizer (PDF-extracted text, issue #24).
- **Singletons:** `tokenizer()`, `splitter(**opts)`, `splitter_guarded(**opts)`, `model()` are
    `@cache`d. The splitter cache keys on the option kwargs, so all option values must be hashable.
- **Coverage must stay at 100%** (`--cov-fail-under=100`). GPU branches and model download use
    `# pragma: no cover`. `dev.py`, `demo.py`, and `tests/` are omitted from coverage.
- **Style:** PEP 484 **type comments** (first line after `def`), PEP 585 generics, PEP 604 unions.
    Short pure functions, max 3 args, no nested functions, sphinx-style docstrings. Ruff line length
    100, LF endings, rule F401 disabled.
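
The lazy-import invariant above follows a common pattern, sketched here with `importlib`. This is
not the body of `load_onnxruntime()`: the function name `load_runtime` and the wording of the error
message are assumptions. Only the `cpu` / `gpu` extras are taken from this document.

```python
import importlib


def load_runtime(module_name="onnxruntime"):
    """Import the runtime on first use; raise a helpful ImportError if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is not installed. Install exactly one extra, "
            "for example: pip install 'iscc-sct[cpu]' or pip install 'iscc-sct[gpu]'"
        ) from exc
```

Because the import happens inside the function, importing the package itself never fails when no
runtime is present; the error only surfaces at the first call that needs the model.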

## Side effects catalog

| Function                                          | Effect                                                                                                                                                                               |
| ------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `model()` (first call)                            | If the model file is absent/corrupt, `get_model()` **downloads ~450 MB** to the platformdirs user-data dir and verifies a blake3 checksum. Creates a cached ONNX `InferenceSession`. |
| `get_model()`                                     | Network download + disk write to `MODEL_PATH`; blake3 integrity check (`# pragma: no cover`).                                                                                        |
| `tokenizer()`, `splitter()`, `splitter_guarded()` | Populate process-global `@cache` (loaded once).                                                                                                                                      |
| `model()`                                         | Calls `rt.preload_dlls()` when CUDA is available; logs a warning via `warn_gpu_shadowed()` if `onnxruntime-gpu` is installed but CUDA is missing.                                    |
| `import iscc_sct.options`                         | `load_dotenv()` reads `.env`; instantiates `sct_opts`.                                                                                                                               |
| `import iscc_sct.utils`                           | `os.makedirs(user_data_dir)` (idempotent).                                                                                                                                           |
| `create()`, `gen_text_code_semantic()`            | Pure given a loaded model: no disk writes, no mutation of inputs.                                                                                                                    |
| CLI `main()`                                      | Reads files, prints to stdout, removes the loguru logger unless `--debug`.                                                                                                           |
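
The process-global caching listed in the table, and the hashability requirement it implies for
option values, can be demonstrated with `functools.cache`. `make_splitter` is a hypothetical
stand-in for an expensive factory, not the library's `splitter()`.

```python
from functools import cache

load_count = 0


@cache
def make_splitter(max_tokens=127, overlap=48):
    """Stand-in for an expensive factory; runs once per distinct argument set."""
    global load_count
    load_count += 1
    return ("splitter", max_tokens, overlap)


first = make_splitter(max_tokens=127, overlap=48)
second = make_splitter(max_tokens=127, overlap=48)
print(first is second, load_count)  # True 1

try:
    make_splitter(max_tokens=[127])  # a list is unhashable, so it cannot be a cache key
except TypeError as exc:
    print("unhashable option:", exc)
```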

## Task recipes

### Generate a code (library)

```python
import iscc_sct as sct

meta = sct.create("This is some sample text.", bits=256)
print(meta.iscc)  # "ISCC:CAD..."
print(meta.characters)  # input length in characters
```

### Granular per-chunk features

```python
import iscc_sct as sct

# `long_text` is any input string, ideally long enough to span several chunks.
meta = sct.create(long_text, bits=256, granular=True)
for feature in meta.features[0].simprints:  # Object-Format
    print(feature.offset, feature.size, feature.simprint, feature.content[:40])
```

### Compare two texts (cross-lingual)

```python
import iscc_sct as sct

a = sct.create("An ISCC applies to a specific digital asset...")
b = sct.create("Ein ISCC bezieht sich auf ein bestimmtes digitales Gut...")
print(sct.iscc_distance(a.iscc, b.iscc))  # low bit distance => similar
```

### Match granular simprints

```python
import iscc_sct as sct

# `doc_a` and `doc_b` are the two input texts (plain strings).
a = sct.create(doc_a, granular=True)
b = sct.create(doc_b, granular=True)
for feat_a, similarity, feat_b in sct.granular_similarity(a, b, threshold=80):
    print(similarity, feat_a.offset, feat_b.offset)
```

### Configure via options

```python
import iscc_sct as sct

# Per-call override (validated copy of the global settings); `text` is any input string:
meta = sct.create(text, bits=128, contents=True)

# Or globally via environment: ISCC_SCT_BITS=128, ISCC_SCT_MAX_TOKENS=127, ...
```
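
To illustrate why a per-call override leaves the global settings untouched, here is a stand-in
built on `dataclasses`. `ToyOptions` is hypothetical and much simpler than the pydantic-settings
based `SctOptions`; it only mirrors the documented `bits` constraints and the copy-then-validate
behavior of `.override()`.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyOptions:
    """Hypothetical stand-in for a validated settings object (not SctOptions itself)."""

    bits: int = 64

    def __post_init__(self):
        if self.bits % 32 or not 32 <= self.bits <= 256:
            raise ValueError("bits must be a multiple of 32 between 32 and 256")

    def override(self, update=None):
        """Return a validated copy with `update` applied; `self` is left unchanged."""
        return replace(self, **(update or {}))


global_opts = ToyOptions()
per_call = global_opts.override({"bits": 128})
print(global_opts.bits, per_call.bits)  # 64 128
```

An invalid value such as `{"bits": 100}` raises `ValueError` when the copy is created, so a bad
override never reaches the pipeline.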

## Change playbook

| If you change...                                                                    | Also update...                                                                                                                                                         |
| ----------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| The embedding model file                                                            | `MODEL_FILENAME`, `MODEL_URL`, `MODEL_CHECKSUM` in `utils.py`; verify codes are bit-stable or bump version + mark BREAKING in CHANGELOG.                               |
| Chunking (`max_tokens`, `overlap`, `split_text`, `needs_split_guard`, token sizing) | Regenerate `tests/chunking_vectors.json` via `uv run python tests/test_chunking_vectors.py` (only for an intentional algorithm change); multi-chunk codes shift.       |
| A public function name/signature                                                    | The module's `__all__` (it is star-exported through `__init__.py`); README/docs examples.                                                                              |
| Add a processing option                                                             | Add a field to `SctOptions` (env var name, constraints); it flows via `model_dump()`. Wire into `create()` if it is a granular toggle; add to the README config table. |
| `Metadata` / `FeatureSet` / `Feature` schema                                        | Keep `to_index_format()`, `to_object_format()`, `get_content()`, `get_overlaps()` consistent.                                                                          |
| ISCC header constants (`MAINTYPE`/`SUBTYPE`/`SCT_VERSION`/`BIT_LEN_MAP`)            | Codes change - this is BREAKING; bump version and document.                                                                                                            |
| onnxruntime import / provider logic                                                 | Keep the `cpu`/`gpu` extras, `tool.uv.conflicts`, the `ImportError` guard, and `warn_gpu_shadowed()` in sync.                                                          |
| Any code path                                                                       | Add tests to keep coverage at 100%; use `# pragma: no cover` only for GPU/download branches. Run `uv run poe all`.                                                     |
| Dependencies                                                                        | Regenerate `requirements.txt` via `uv run poe export-requirements` (never hand-edit it).                                                                               |

## Common mistakes

**NEVER** add `onnxruntime` as a base dependency to "make install easier."

```toml
# WRONG - re-breaks issue #23: [gpu] then installs both wheels and CPU clobbers GPU
dependencies = ["onnxruntime"]
```

**ALWAYS** keep it behind the mutually exclusive `cpu` / `gpu` extras.

---

**NEVER** use inline type annotations.

```python
def binarize(vec: NDArray) -> bytes: ...  # WRONG for this codebase
```

**ALWAYS** use PEP 484 type comments as the first line after `def`.

```python
def binarize(vec):
    # type: (NDArray) -> bytes
    ...
```

---

**NEVER** expect Object-Format from the low-level function.

```python
meta = sct.gen_text_code_semantic(text)  # returns a dict in INDEX-Format
meta.features[0].simprints  # WRONG - it is a dict, not Metadata
```

**ALWAYS** use `create()` for an Object-Format `Metadata`, or wrap the dict: `Metadata(**data)`.

---

**NEVER** mutate `SctOptions` fields directly or assume CLI/library defaults match.

```python
sct.sct_opts.bits = 256  # avoid - mutates the global; use .override()
sct.create(text)  # library default bits=64, NOT the CLI's 256
```

**ALWAYS** pass overrides per call (`create(text, bits=256)`) or via `sct_opts.override({...})`.

---

**NEVER** hand-edit `tests/chunking_vectors.json`, `requirements.txt`, `iscc_sct/tokenizer.json`, or
the model file. Each is generated or byte-exact.

**ALWAYS** regenerate via the documented command and run `uv run poe all` before reporting done.

---

**NEVER** mix codecs: ISCC codes are RFC 4648 base32 (no padding); granular simprints are base64url
(no padding).

**ALWAYS** decode with the matching `decode_base32` / `decode_base64`.
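
A standard-library sketch of the two unpadded encodings, showing that the same bytes have two
different text forms and that each needs its own decoder. The helper names are hypothetical and are
not the library's `encode_*` / `decode_*` functions; the digest is an arbitrary example value.

```python
import base64


def b64url_encode(data):
    """base64url without trailing '=' padding."""
    return base64.urlsafe_b64encode(data).decode("ascii").rstrip("=")


def b64url_decode(text):
    """Restore the stripped padding, then decode base64url."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def b32_encode(data):
    """RFC 4648 base32 without trailing '=' padding."""
    return base64.b32encode(data).decode("ascii").rstrip("=")


def b32_decode(text):
    """Restore the stripped padding, then decode base32."""
    return base64.b32decode(text + "=" * (-len(text) % 8))


digest = bytes.fromhex("a5ff00c3a5ff00c3")
simprint_text = b64url_encode(digest)  # the form used for granular simprints
code_text = b32_encode(digest)  # the form used for ISCC code bodies
print(simprint_text, code_text)
```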