For Coding Agents
A compressed reference for AI agents working on iscc-sct (ISCC Semantic Text-Code). Read this before editing the codebase or integrating the library. Terminology matches the source exactly.
iscc-sct turns text into a cross-lingual, similarity-preserving binary ISCC-UNIT (MainType
SEMANTIC, SubType TEXT). It is an experimental proof of concept, not part of ISO 24138:2024.
Architecture map
File layout
| Path | Contains |
|---|---|
| iscc_sct/main.py | create() - high-level API; returns a Metadata object in Object-Format. |
| iscc_sct/code_semantic_text.py | Core algorithm. gen_text_code_semantic() (returns a plain dict, Index-Format), splitting, embedding, pooling, binarization. Module-level @cache singletons: tokenizer(), splitter(), splitter_guarded(), model(). |
| iscc_sct/models.py | Pydantic schema: Feature, FeatureSet, Metadata + format converters. |
| iscc_sct/options.py | SctOptions (pydantic-settings), sct_opts singleton, .override(). |
| iscc_sct/utils.py | Codecs (base32/base64url), distances, model download + blake3 integrity, char_to_byte_offsets, MODEL_PATH, timer. |
| iscc_sct/cli.py | iscc-sct console entry point (main()): glob, charset detection, gui subcommand. |
| iscc_sct/demo.py, iscc_sct/app.py | Gradio demo (Hugging Face Space). Omitted from coverage. |
| iscc_sct/dev.py | Dev-only poe task helpers. Omitted from coverage. |
| iscc_sct/tokenizer.json | Vendored tokenizer. Byte-exact; excluded from whitespace/EOL hooks. |
| tests/chunking_vectors.json | Frozen chunk-boundary test vectors. Never hand-edit (see Change playbook). |
Pipeline
gen_text_code_semantic(text, **options) orchestrates these steps:
- Reject empty text (ValueError), then sct_opts.override(options).
- split_text() -> [(offset, chunk), ...] at semantic boundaries (max 127 tokens, 48 overlap).
- embed_chunks() tokenizes (batch 100), runs the ONNX model, attention_pooling() per chunk.
- mean_pooling() averages chunk vectors into one L2-normalized document vector.
- binarize() (vec >= 0 -> bits), truncate to bits // 8 bytes, prefix the 2-byte ISCC header, encode_base32(), prepend "ISCC:".
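The final pipeline step (binarize, truncate, header, base32) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's code: the `*_sketch` names are hypothetical, packing the first vector value into the most significant bit is an assumption, and the length nibble `bits // 32 - 1` is an assumed stand-in for BIT_LEN_MAP (it agrees with the "CAD" prefix that a 256-bit code shows in the task recipes).

```python
import base64


def binarize_sketch(vec):
    # type: (list[float]) -> bytes
    """Map each value to one bit (value >= 0 -> 1), first value as the top bit."""
    bitstring = "".join("1" if value >= 0 else "0" for value in vec)
    return int(bitstring, 2).to_bytes(len(bitstring) // 8, "big")


def header_sketch(bits):
    # type: (int) -> bytes
    """Pack MainType 0001, SubType 0000, Version 0000 and an ASSUMED length nibble."""
    length_nibble = format(bits // 32 - 1, "04b")  # assumption, stands in for BIT_LEN_MAP
    return int("0001" + "0000" + "0000" + length_nibble, 2).to_bytes(2, "big")


def encode_code_sketch(vec, bits=64):
    # type: (list[float], int) -> str
    """Binarize, truncate to bits // 8 bytes, prefix the header, base32-encode."""
    body = binarize_sketch(vec)[: bits // 8]
    payload = header_sketch(bits) + body
    return "ISCC:" + base64.b32encode(payload).decode("ascii").rstrip("=")


# Toy 384-dimensional vector with alternating signs (not a real embedding).
toy_vector = [0.1 if i % 2 == 0 else -0.1 for i in range(384)]
code_64 = encode_code_sketch(toy_vector, bits=64)
code_256 = encode_code_sketch(toy_vector, bits=256)
print(code_64)
print(code_256)
```

Because truncation happens after binarization, a shorter code in this sketch is a byte prefix of the longer one for the same input.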
Import flow
iscc_sct/__init__.py (star-imports, defines __version__)
-> options -> utils -> code_semantic_text -> models -> main
main -> models, code_semantic_text, options
code_semantic_text -> onnxruntime (lazy import), semantic_text_splitter, tokenizers, numpy,
iscc_sct as sct (uses sct.* at call time, not import time)
utils -> models
cli -> main, charset_normalizer
Public API
iscc_sct/__init__.py re-exports every module's __all__. The public surface
(import iscc_sct as sct):
| Symbol | Source | Purpose |
|---|---|---|
| create(text, granular=False, **options) | main | Primary API. Returns Metadata (Object-Format). |
| gen_text_code_semantic(text, **options) | code_semantic_text | Low-level. Returns dict (Index-Format). |
| code_text_semantic(fp, **options) | code_semantic_text | Same, reading a UTF-8 file path. |
| soft_hash_text_semantic(text) | code_semantic_text | Raw 384-bit digest (bytes), no header. |
| embed_chunks(chunks, batch_size=100) | code_semantic_text | Chunk list -> embedding array. |
| Metadata, FeatureSet, Feature | models | Result schema + converters. |
| SctOptions, sct_opts | options | Settings model + global instance. |
| iscc_distance, hamming_distance, cosine_similarity, granular_similarity | utils | Similarity metrics. |
| encode_base32, decode_base32, encode_base64, decode_base64 | utils | Codecs. |
| char_to_byte_offsets, get_model, MODEL_PATH, timer | utils | Helpers. |
| __version__ | __init__ | "0.2.0". |
Decision dispatch
Which entry point?
| Goal | Use |
|---|---|
| One ISCC code from a string, ready-to-use object | create(text) -> Metadata (Object-Format) |
| Granular per-chunk features (simprints/offsets/sizes/contents) | create(text, granular=True) |
| Compact parallel-array result for storage/indexing | gen_text_code_semantic(text, ...) (Index-Format dict) |
| Process a text file from disk | code_text_semantic(path) |
| Just the raw 384-bit vector digest | soft_hash_text_semantic(text) -> bytes |
| Command line / batch over files | iscc-sct <glob> |
Which similarity metric?
| Inputs | Use | Returns |
|---|---|---|
| Two full ISCC code strings | iscc_distance(iscc1, iscc2) | Hamming distance in bits (int) |
| Two raw digests of equal length | hamming_distance(a, b) | Hamming distance in bits (int) |
| Two raw digests, normalized score | cosine_similarity(a, b) | int in [-100, 100] |
| Two Metadata objects with simprints | granular_similarity(a, b, threshold=80) | [(Feature, similarity, Feature), ...] |
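The code-to-code comparison can be sketched with the standard library, following the steps this reference gives for iscc_distance: strip the "ISCC:" prefix, base32-decode, drop the 2-byte header, then count differing bits. The `*_sketch` names are hypothetical and the two codes are hand-built toy values, not output of the library.

```python
import base64


def decode_base32_sketch(code):
    # type: (str) -> bytes
    """Decode unpadded RFC 4648 base32 by restoring the '=' padding first."""
    padding = "=" * (-len(code) % 8)
    return base64.b32decode(code + padding)


def hamming_distance_sketch(a, b):
    # type: (bytes, bytes) -> int
    """Count differing bits between two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("digests must have equal length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def iscc_distance_sketch(iscc1, iscc2):
    # type: (str, str) -> int
    """Strip 'ISCC:', base32-decode, drop the 2-byte header, compare the bodies."""
    body1 = decode_base32_sketch(iscc1[5:])[2:]
    body2 = decode_base32_sketch(iscc2[5:])[2:]
    return hamming_distance_sketch(body1, body2)


def toy_code(body):
    # type: (bytes) -> str
    """Build a toy 64-bit code from an assumed 2-byte header and an 8-byte body."""
    payload = b"\x10\x01" + body  # header bytes are an assumption for this sketch
    return "ISCC:" + base64.b32encode(payload).decode("ascii").rstrip("=")


code_a = toy_code(bytes(8))
code_b = toy_code(bytes(7) + b"\x03")  # differs from code_a in the two lowest bits
print(iscc_distance_sketch(code_a, code_b))
```

The equal-length check matters: codes generated with different bits values have bodies of different sizes and cannot be compared bit for bit.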
Which feature format?
| Format | Produced by | Shape | Convert with |
|---|---|---|---|
| Index-Format | gen_text_code_semantic() | parallel arrays simprints/offsets/sizes/contents | Metadata.to_index_format() |
| Object-Format | create() | list of self-contained Feature objects | Metadata.to_object_format() |
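The difference between the two shapes can be shown with plain dicts. This sketch only illustrates the data layout; it is not how Metadata.to_index_format() or Metadata.to_object_format() are implemented, the `*_sketch` helpers are hypothetical, and the simprint strings are made-up placeholders.

```python
# Plural Index-Format keys mapped to singular Object-Format field names.
FIELD_NAMES = {
    "simprints": "simprint",
    "offsets": "offset",
    "sizes": "size",
    "contents": "content",
}


def index_to_object_sketch(feature_set):
    # type: (dict) -> list[dict]
    """Turn parallel arrays into a list of self-contained feature dicts."""
    present = [key for key in FIELD_NAMES if key in feature_set]
    rows = zip(*(feature_set[key] for key in present))
    return [{FIELD_NAMES[key]: value for key, value in zip(present, row)} for row in rows]


def object_to_index_sketch(features):
    # type: (list[dict]) -> dict
    """Turn a list of feature dicts back into parallel arrays."""
    index = {}  # type: dict
    for plural, singular in FIELD_NAMES.items():
        if features and singular in features[0]:
            index[plural] = [feature[singular] for feature in features]
    return index


# Made-up placeholder values, not real simprints.
index_format = {
    "simprints": ["AAAA", "BBBB"],
    "offsets": [0, 120],
    "sizes": [150, 90],
}
object_format = index_to_object_sketch(index_format)
print(object_format)
print(object_to_index_sketch(object_format) == index_format)
```

Parallel arrays are compact for storage and indexing; the per-feature layout is easier to iterate when each chunk is handled on its own.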
Constraints and invariants
- No base ONNX runtime. The base package declares no onnxruntime. Install exactly one of the mutually exclusive cpu / gpu extras. onnxruntime and onnxruntime-gpu unpack into the same directory and clobber each other (issue #23); tool.uv.conflicts enforces exclusivity for uv.
- ONNX runtime is imported lazily. import iscc_sct, the iscc-sct CLI, and iscc-sct doctor load without any runtime installed. load_onnxruntime() (called from model() on first code generation) raises a guarded ImportError with install instructions - keep that guard intact.
- Codes are base32; simprints are base64url. Never cross the codecs.
- ISCC header is fixed: MainType SEMANTIC (0001), SubType TEXT (0000), Version 0000, plus a length nibble from BIT_LEN_MAP. 2 bytes, big-endian.
- iscc_distance strips the 5-char ISCC: prefix, base32-decodes, drops the 2-byte header, then compares bodies of equal length.
- bits constraints: 32 <= bits <= 256, multiple_of=32. Same for bits_granular.
- Default bit-length differs by entry point: SctOptions.bits / create() default to 64; the CLI --bits defaults to 256.
- SctOptions validates on assignment. Mutate only via .override(dict), which deep-copies and sets fields individually so validators run. Options flow as **opts.model_dump() through all layers. Env vars use the ISCC_SCT_ prefix; .env is loaded.
- Chunk boundaries are frozen by tests/chunking_vectors.json and identical on both the normal (splitter) and guarded (splitter_guarded) paths. needs_split_guard() routes texts whose positions sit more than SPLIT_GUARD_GAP (8192) chars from the next paragraph separator to the guarded sizer (PDF-extracted text, issue #24).
- Singletons: tokenizer(), splitter(**opts), splitter_guarded(**opts), model() are @cached. The splitter cache keys on the option kwargs, so all option values must be hashable.
- Coverage must stay at 100% (--cov-fail-under=100). GPU branches and model download use # pragma: no cover. dev.py, demo.py, and tests/ are omitted from coverage.
- Style: PEP 484 type comments (first line after def), PEP 585 generics, PEP 604 unions. Short pure functions, max 3 args, no nested functions, sphinx-style docstrings. Ruff line length 100, LF endings, rule F401 disabled.
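The "validate on assignment, mutate only through a deep-copying override" rule can be illustrated without pydantic. OptionsSketch and check_bits are hypothetical stand-ins written for this example; the real SctOptions is a pydantic-settings model and may differ in detail.

```python
import copy


def check_bits(value):
    # type: (object) -> None
    """Enforce 32 <= bits <= 256 and multiple_of=32."""
    if not isinstance(value, int) or not 32 <= value <= 256 or value % 32:
        raise ValueError("bits must be a multiple of 32 between 32 and 256")


class OptionsSketch:
    """Hypothetical stand-in for a settings model that validates on assignment."""

    def __init__(self, bits=64):
        # type: (int) -> None
        self.bits = bits

    def __setattr__(self, name, value):
        # type: (str, object) -> None
        if name == "bits":
            check_bits(value)
        super().__setattr__(name, value)

    def override(self, update=None):
        # type: (dict | None) -> OptionsSketch
        """Return a validated deep copy; the original instance stays untouched."""
        opts = copy.deepcopy(self)
        for field, value in (update or {}).items():
            setattr(opts, field, value)  # one field at a time so validation runs
        return opts


global_opts = OptionsSketch()
call_opts = global_opts.override({"bits": 256})
print(global_opts.bits, call_opts.bits)
```

Setting fields one by one on the copy means an invalid value raises before it can reach the global instance.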
Side effects catalog
| Function | Effect |
|---|---|
| model() (first call) | If the model file is absent/corrupt, get_model() downloads ~450 MB to the platformdirs user-data dir and verifies a blake3 checksum. Creates a cached ONNX InferenceSession. |
| get_model() | Network download + disk write to MODEL_PATH; blake3 integrity check (# pragma: no cover). |
| tokenizer(), splitter(), splitter_guarded() | Populate process-global @cache (loaded once). |
| model() | Calls rt.preload_dlls() when CUDA is available; logs a warning via warn_gpu_shadowed() if onnxruntime-gpu is installed but CUDA is missing. |
| import iscc_sct.options | load_dotenv() reads .env; instantiates sct_opts. |
| import iscc_sct.utils | os.makedirs(user_data_dir) (idempotent). |
| create(), gen_text_code_semantic() | Pure given a loaded model: no disk writes, no mutation of inputs. |
| CLI main() | Reads files, prints to stdout, removes the loguru logger unless --debug. |
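The load-once behaviour of the @cache singletons, and the reason option values must be hashable, can be reproduced with functools.cache on a toy function. splitter_sketch and LOAD_COUNT are hypothetical; the real splitter() builds a text splitter rather than a dict.

```python
from functools import cache

LOAD_COUNT = {"splitter": 0}


@cache
def splitter_sketch(**options):
    # type: (**object) -> dict
    """Pretend to build an expensive splitter; count how often the body runs."""
    LOAD_COUNT["splitter"] += 1
    return dict(options)


first = splitter_sketch(max_tokens=127, overlap=48)
second = splitter_sketch(max_tokens=127, overlap=48)
print(first is second, LOAD_COUNT["splitter"])

try:
    splitter_sketch(max_tokens=[127])  # a list cannot be part of the cache key
except TypeError as error:
    print("unhashable option value:", error)
```

The cache key is built from the keyword arguments, so a different option value yields a separate cached instance, and an unhashable value fails before the function body runs.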
Task recipes
Generate a code (library)
import iscc_sct as sct
meta = sct.create("This is some sample text.", bits=256)
print(meta.iscc) # "ISCC:CAD..."
print(meta.characters) # input length in characters
Granular per-chunk features
import iscc_sct as sct
meta = sct.create(long_text, bits=256, granular=True)
for feature in meta.features[0].simprints:  # Object-Format
    print(feature.offset, feature.size, feature.simprint, feature.content[:40])
Compare two texts (cross-lingual)
import iscc_sct as sct
a = sct.create("An ISCC applies to a specific digital asset...")
b = sct.create("Ein ISCC bezieht sich auf ein bestimmtes digitales Gut...")
print(sct.iscc_distance(a.iscc, b.iscc)) # low bit distance => similar
Match granular simprints
import iscc_sct as sct
a = sct.create(doc_a, granular=True)
b = sct.create(doc_b, granular=True)
for feat_a, similarity, feat_b in sct.granular_similarity(a, b, threshold=80):
    print(similarity, feat_a.offset, feat_b.offset)
Configure via options
import iscc_sct as sct
# Per-call override (validated copy of the global settings):
meta = sct.create(text, bits=128, contents=True)
# Or globally via environment: ISCC_SCT_BITS=128, ISCC_SCT_MAX_TOKENS=127, ...
Change playbook
| If you change... | Also update... |
|---|---|
| The embedding model file | MODEL_FILENAME, MODEL_URL, MODEL_CHECKSUM in utils.py; verify codes are bit-stable or bump version + mark BREAKING in CHANGELOG. |
| Chunking (max_tokens, overlap, split_text, needs_split_guard, token sizing) | Regenerate tests/chunking_vectors.json via uv run python tests/test_chunking_vectors.py (only for an intentional algorithm change); multi-chunk codes shift. |
| A public function name/signature | The module's __all__ (it is star-exported through __init__.py); README/docs examples. |
| Add a processing option | Add a field to SctOptions (env var name, constraints); it flows via model_dump(). Wire into create() if it is a granular toggle; add to the README config table. |
| Metadata / FeatureSet / Feature schema | Keep to_index_format(), to_object_format(), get_content(), get_overlaps() consistent. |
| ISCC header constants (MAINTYPE/SUBTYPE/SCT_VERSION/BIT_LEN_MAP) | Codes change - this is BREAKING; bump version and document. |
| onnxruntime import / provider logic | Keep the cpu/gpu extras, tool.uv.conflicts, the ImportError guard, and warn_gpu_shadowed() in sync. |
| Any code path | Add tests to keep coverage at 100%; use # pragma: no cover only for GPU/download branches. Run uv run poe all. |
| Dependencies | Regenerate requirements.txt via uv run poe export-requirements (never hand-edit it). |
Common mistakes
NEVER add onnxruntime as a base dependency to "make install easier."
# WRONG - re-breaks issue #23: [gpu] then installs both wheels and CPU clobbers GPU
dependencies = ["onnxruntime"]
ALWAYS keep it behind the mutually exclusive cpu / gpu extras.
NEVER use inline type annotations.
ALWAYS use PEP 484 type comments as the first line after def.
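A minimal example of the expected function style, assuming a made-up helper: clamp_sizes is hypothetical and not part of iscc-sct. It shows the type comment as the first line after def, PEP 585 generics and a PEP 604 union inside that comment, and a sphinx-style docstring.

```python
def clamp_sizes(sizes, limit=None):
    # type: (list[int], int | None) -> list[int]
    """
    Cap each size at an optional limit.

    :param sizes: Chunk sizes in characters.
    :param limit: Upper bound per size, or None for no cap.
    :return: New list with the capped sizes.
    """
    if limit is None:
        return list(sizes)
    return [min(size, limit) for size in sizes]


print(clamp_sizes([10, 200], limit=100))
print(clamp_sizes.__annotations__)  # stays empty: type comments add no runtime annotations
```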
NEVER expect Object-Format from the low-level function.
meta = sct.gen_text_code_semantic(text) # returns a dict in INDEX-Format
meta.features[0].simprints # WRONG - it is a dict, not Metadata
ALWAYS use create() for an Object-Format Metadata, or wrap the dict: Metadata(**data).
NEVER mutate SctOptions fields directly or assume CLI/library defaults match.
sct.sct_opts.bits = 256 # avoid - mutates the global; use .override()
sct.create(text) # library default bits=64, NOT the CLI's 256
ALWAYS pass overrides per call (create(text, bits=256)) or via sct_opts.override({...}).
NEVER hand-edit tests/chunking_vectors.json, requirements.txt, iscc_sct/tokenizer.json, or
the model file. Each is generated or byte-exact.
ALWAYS regenerate via the documented command and run uv run poe all before reporting done.
NEVER mix codecs: ISCC codes are RFC4648 base32 (no padding); granular simprints are base64url
(no padding). Decode with the matching decode_base32 / decode_base64.
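The two unpadded encodings can be compared side by side with the standard library. The helper names below are hypothetical and are not the library's encode_base32 / decode_base64 implementations; the sketch only shows why a string must be decoded with the codec that produced it.

```python
import base64


def b32_nopad_encode(data):
    # type: (bytes) -> str
    """RFC 4648 base32 without '=' padding (the alphabet used for ISCC codes)."""
    return base64.b32encode(data).decode("ascii").rstrip("=")


def b32_nopad_decode(code):
    # type: (str) -> bytes
    """Restore padding to a multiple of 8 characters, then decode."""
    return base64.b32decode(code + "=" * (-len(code) % 8))


def b64url_nopad_encode(data):
    # type: (bytes) -> str
    """URL-safe base64 without '=' padding (the alphabet used for simprints)."""
    return base64.urlsafe_b64encode(data).decode("ascii").rstrip("=")


def b64url_nopad_decode(code):
    # type: (str) -> bytes
    """Restore padding to a multiple of 4 characters, then decode."""
    return base64.urlsafe_b64decode(code + "=" * (-len(code) % 4))


raw = b"\xfb\xff\xfe"
as_base32 = b32_nopad_encode(raw)
as_base64url = b64url_nopad_encode(raw)
print(as_base32, as_base64url)

try:
    b32_nopad_decode(as_base64url)  # wrong codec: '-' and '_' are not base32 digits
except ValueError as error:
    print("codec mismatch:", error)
```

A mismatch does not always raise: some short strings are valid in both alphabets and decode to different bytes, so the codec has to follow from the field, not from trial decoding.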