Getting started
Install iscc-sct, generate your first Semantic Text-Code, and watch two languages produce
near-identical codes.
Prerequisites
- Python 3.10 or later
- uv (recommended) or pip
Install
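iscc-sct needs an ONNX runtime, selected through an install extra. The cpu extra works
everywhere; with standard extras syntax that is pip install "iscc-sct[cpu]" (or
uv pip install "iscc-sct[cpu]" if you use uv).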
Choose exactly one runtime
Install the cpu extra, or the gpu extra for NVIDIA CUDA acceleration — never both. The
onnxruntime and onnxruntime-gpu packages unpack into the same directory and overwrite each
other. A plain pip install iscc-sct installs no runtime, and the first code generation then fails
with install instructions. If you are unsure which extra fits your machine, run iscc-sct doctor
(see the command-line guide).
Verify the installation
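One way to confirm that both the package and an ONNX runtime can be found is a short check with the Python standard library. This is a sketch, not the project's own verification command: the helper name check_install is hypothetical, and the module names iscc_sct and onnxruntime are taken from the import and package names used on this page.

```python
from importlib import metadata, util


def check_install() -> dict:
    """Report whether iscc-sct and an ONNX runtime can be found (hypothetical helper)."""
    report = {}
    # find_spec only locates modules; nothing is imported and no model is downloaded.
    report["iscc_sct_importable"] = util.find_spec("iscc_sct") is not None
    report["onnxruntime_importable"] = util.find_spec("onnxruntime") is not None
    try:
        report["iscc_sct_version"] = metadata.version("iscc-sct")
    except metadata.PackageNotFoundError:
        report["iscc_sct_version"] = None
    return report


if __name__ == "__main__":
    for key, value in check_install().items():
        print(f"{key}: {value}")
```

If either flag comes back False, iscc-sct doctor (mentioned above) is the next place to look.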
Generate your first code
Pass any text to create(). The first call downloads the embedding model (about 450 MB) to your
user data directory; later calls reuse it.
```python
import iscc_sct as sct

text = "This is some sample text. It can be a longer document or even an entire book."
print(sct.create(text, bits=256).iscc)
# ISCC:CADV3GG6JH3XEVRNSVYGCLJ7AAV3BOT5J7EHEZKPFXEGRJ2CTWACGZI
```
The bits argument sets the code length. Longer codes carry more detail; 64 bits is the library
default, and 256 bits is the most precise. See comparing texts for how
length affects matching.
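To see where the length shows up in the code string itself, the sketch below decodes the 256-bit example above with the standard library. It assumes the usual ISCC layout, which this page does not spell out: base32 text after the ISCC: prefix, a 2-byte header, then the body, with the low four bits of the second header byte encoding the length in 32-bit steps. The helper name body_bits is hypothetical.

```python
import base64


def body_bits(iscc: str) -> tuple[int, int]:
    """Return (bits counted from the body, bits declared in the header).

    Assumes a 2-byte ISCC header followed by the body (not confirmed by this page).
    """
    text = iscc.removeprefix("ISCC:")
    # Base32 input must be a multiple of 8 characters; restore stripped padding.
    text += "=" * (-len(text) % 8)
    raw = base64.b32decode(text)
    header, body = raw[:2], raw[2:]
    declared = ((header[1] & 0x0F) + 1) * 32  # assumed length encoding
    return len(body) * 8, declared


print(body_bits("ISCC:CADV3GG6JH3XEVRNSVYGCLJ7AAV3BOT5J7EHEZKPFXEGRJ2CTWACGZI"))
# (256, 256)
```

Under those assumptions, both numbers agree with the bits=256 argument used to create the code.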
Match a translation
The point of a Semantic Text-Code is that meaning survives translation. Generate codes for the same passage in English and German, then measure their distance:
```python
import iscc_sct as sct

english = (
    "An ISCC applies to a specific digital asset and is a data-descriptor deterministically "
    "constructed from multiple hash digests using the algorithms and rules in this document. "
    "This document does not provide information on registration of ISCCs."
)
german = (
    "Ein ISCC bezieht sich auf ein bestimmtes digitales Gut und ist ein Daten-Deskriptor, der "
    "deterministisch aus mehreren Hash-Digests unter Verwendung der Algorithmen und Regeln in "
    "diesem Dokument erstellt wird. Dieses Dokument enthält keine Informationen über die "
    "Registrierung von ISCCs."
)

a = sct.create(english)
b = sct.create(german)
print(sct.iscc_distance(a.iscc, b.iscc))
# 3
```
A distance of 3 bits out of 64 means the translation is a near-match. For contrast, an unrelated
sentence lands far away:
```python
import iscc_sct as sct

# Continues from the previous example: `a` is the code for the English passage.
unrelated = (
    "The recipe calls for two cups of flour, a pinch of salt, and three ripe bananas mashed "
    "until smooth before folding the mixture into the buttered baking tin."
)
print(sct.iscc_distance(a.iscc, sct.create(unrelated).iscc))
# 33
```
Low distance means similar meaning; high distance means unrelated content. Translation matching is the behavior that lexical (word-based) codes cannot provide.
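To make the numbers concrete: assuming iscc_distance counts differing bits (a Hamming distance, which is how the "3 bits out of 64" reading above treats it), the same measure can be sketched on raw bytes with the standard library. The helper names hamming_distance and bit_similarity are hypothetical and not part of iscc-sct.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length byte strings differ."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    diff = int.from_bytes(a, "big") ^ int.from_bytes(b, "big")
    return bin(diff).count("1")


def bit_similarity(distance: int, bits: int = 64) -> float:
    """Fraction of matching bits for a given distance and code length."""
    return 1 - distance / bits


print(hamming_distance(b"\x00\xff", b"\x07\xff"))  # 3
print(round(bit_similarity(3), 3))   # 0.953
print(round(bit_similarity(33), 3))  # 0.484
```

Two independent random 64-bit values differ in about half their bits on average, so a distance of 33 sits where unrelated content would be expected, while 3 is far below it.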
Look inside the text
Set granular=True to get per-chunk features alongside the document code. Each chunk carries its
own offset, size, similarity-preserving fingerprint (simprint), and text:
```python
import iscc_sct as sct

text = "This is some sample text. It can be a longer document or even an entire book."
meta = sct.create(text, bits=256, granular=True)
feature = meta.features[0].simprints[0]
print(feature.offset, feature.size, feature.simprint)
# 0 77 XZjeSfdyVi0
```
Granular features let you match individual passages across documents, even when the surrounding text differs. The granular features guide covers chunk-level matching and content reconstruction.
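As a small illustration of how offset and size relate to the source text, the sketch below slices the sample passage with the values printed above (offset 0, size 77). It assumes both values count characters, which this page does not state; the helper name chunk_text is hypothetical.

```python
def chunk_text(text: str, offset: int, size: int) -> str:
    """Return the slice of text covered by a chunk (assumes character offsets)."""
    return text[offset : offset + size]


text = "This is some sample text. It can be a longer document or even an entire book."
chunk = chunk_text(text, offset=0, size=77)
print(len(text), chunk == text)
# 77 True
```

Here the single chunk spans the whole 77-character passage; longer documents would yield several chunks, each with its own offset and size.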
Next steps
- Compare texts — Measure semantic and cross-lingual similarity.
- Granular features — Work with per-chunk simprints and offsets.
- Configuration — Tune bit-length, chunking, and output options.
- How it works — Understand why the codes match across languages.