
iscc-sct


A cross-lingual, similarity-preserving binary code for text. Semantically similar texts, including translations, produce codes with a low Hamming distance.

Proof of concept

All releases below v1.0.0 may break backward compatibility and produce incompatible Semantic Text-Codes. The algorithms in iscc-sct are experimental and not part of the official ISO 24138:2024 standard.

Introduction

The ISCC framework already includes a Text-Code based on lexical similarity for near-duplicate matching. The Semantic Text-Code (SCT) is a planned additional ISCC-UNIT that captures a broader, more abstract similarity. It is engineered to be robust against rephrasing and, most notably, translations that lexical matching cannot detect.

iscc-sct turns any text into a compact binary code built from a binarized, multilingual document embedding. The same content expressed in different languages maps to (near-)identical codes, opening up cross-lingual content identification and similarity detection.
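Codes are compared by Hamming distance, the number of bit positions in which they differ. The following is a minimal, library-independent sketch of that comparison; the two 32-bit code bodies and the names `english` and `german` are made up for illustration and were not produced by iscc-sct.

```python
def hamming_distance(code_a: bytes, code_b: bytes) -> int:
    """Count the bit positions at which two equal-length binary codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    # XOR leaves a 1-bit wherever the codes disagree; count those bits.
    diff = int.from_bytes(code_a, "big") ^ int.from_bytes(code_b, "big")
    return bin(diff).count("1")


# Hypothetical 32-bit code bodies that differ in three bit positions.
english = bytes([0b10110010, 0b01101100, 0b11110000, 0b00001111])
german = bytes([0b10110011, 0b01101000, 0b11110000, 0b10001111])

distance = hamming_distance(english, german)
similarity = 1 - distance / (len(english) * 8)
print(distance, round(similarity, 3))  # prints: 3 0.906
```

A distance of 0 means identical codes. Two unrelated random codes of n bits differ in about n/2 positions on average, so distances far below that suggest related content.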

Feature        | ISCC Content-Code Text    | ISCC Semantic-Code Text
Focus          | Lexical similarity        | Semantic similarity
Cross-lingual  | No                        | Yes
Use case       | Near-duplicate detection  | Semantic similarity, translations

Key features:

  • Semantic similarity - deep multilingual embeddings capture the meaning of the text
  • Translation matching - near-identical codes for the same content across languages
  • Bit-length flexibility - codes from 32 up to 256 bits for adjustable granularity
  • ISCC compatible - codes integrate directly with existing ISCC-based systems
  • 60+ languages - one model covers a broad range of scripts and language pairs

Quick start

pip install "iscc-sct[cpu]"                  # install with pip
uv add "iscc-sct[cpu]"                       # or add it to a uv project
uvx "iscc-sct[cpu]" "path/to/textfile.txt"   # or run it once without installing

Generate a Semantic Text-Code from Python:

import iscc_sct as sct

text = "This is some sample text. It can be a longer document or even an entire book."
print(sct.create(text, bits=256).iscc)
# ISCC:CADV3GG6JH3XEVRNSVYGCLJ7AAV3BOT5J7EHEZKPFXEGRJ2CTWACGZI

Or from the command line:

iscc-sct "path/to/textfile.txt"
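To see what such a code contains, the sample ISCC printed by the Python example above can be decoded with the standard library alone. This is a sketch, not part of the iscc-sct API: `decode_iscc` is a hypothetical helper, and the split into a two-byte header and a 32-byte body assumes that the 34 decoded bytes are the header followed by the 256-bit body requested with bits=256.

```python
import base64

iscc = "ISCC:CADV3GG6JH3XEVRNSVYGCLJ7AAV3BOT5J7EHEZKPFXEGRJ2CTWACGZI"


def decode_iscc(code: str) -> bytes:
    """Strip the 'ISCC:' prefix and base32-decode the remainder to raw bytes."""
    payload = code.removeprefix("ISCC:")  # requires Python 3.9+
    # The string carries no base32 padding; restore it to a multiple of 8 chars.
    padded = payload + "=" * (-len(payload) % 8)
    return base64.b32decode(padded)


raw = decode_iscc(iscc)
# Assumption: everything before the last 32 bytes (256 bits) is the header.
header, body = raw[:2], raw[2:]
print(header.hex(), len(body) * 8)  # prints: 1007 256
```

Only the body carries the similarity-preserving bits, so that is the part to compare when measuring the distance between two codes of the same type and length.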

Choose exactly one ONNX runtime

A plain pip install iscc-sct installs no ONNX runtime, so the first code generation fails with an error that includes install instructions. Install the cpu extra (works everywhere) or the gpu extra for NVIDIA CUDA acceleration (requires CUDA 12.x and cuDNN 9.x), or run iscc-sct doctor to detect and install the right one. Never install both: onnxruntime and onnxruntime-gpu unpack into the same directory and silently overwrite each other.
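A quick way to check an environment for this conflict is to ask the packaging metadata which runtime distributions are present. The sketch below uses only the standard library. It is not how iscc-sct doctor works internally (that is not described here), and the function names `installed_runtimes` and `diagnose` are illustrative.

```python
from importlib import metadata

RUNTIMES = ("onnxruntime", "onnxruntime-gpu")


def installed_runtimes() -> list[str]:
    """Return the ONNX runtime distributions found in the current environment."""
    found = []
    for name in RUNTIMES:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            continue  # this distribution is not installed
        found.append(name)
    return found


def diagnose(found: list[str]) -> str:
    """Turn the list of installed runtimes into a one-line verdict."""
    if not found:
        return "no ONNX runtime installed: install the cpu or gpu extra"
    if len(found) > 1:
        return "conflict: both runtimes installed, keep exactly one"
    return f"ok: using {found[0]}"


print(diagnose(installed_runtimes()))
```

If the check reports a conflict, uninstalling both packages and reinstalling the single extra you need is likely the safest repair, since the two distributions overwrite each other's files.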

How it works

Text -> split -> embed -> mean-pool -> binarize -> ISCC header + base32
  1. Split the text into overlapping chunks at syntactically sensible boundaries.
  2. Embed each chunk with a multilingual sentence-transformer running on ONNX.
  3. Aggregate the chunk embeddings into one mean-pooled, normalized document vector.
  4. Binarize the vector (positive components become 1-bits), truncate to the requested bit-length, prefix the ISCC header, and base32-encode it.
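The four steps can be sketched in plain Python. This is a toy illustration under loud assumptions, not the iscc-sct implementation: `split_overlapping` cuts fixed-size character windows instead of finding syntactic boundaries, `fake_embed` returns pseudo-random vectors instead of running a model, and the header bytes 0x10 0x07 are simply the two leading bytes that the 256-bit sample code in Quick start decodes to.

```python
import base64
import math
import random


def split_overlapping(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Step 1 (simplified): fixed-size character windows that overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def fake_embed(chunk: str, dim: int = 256) -> list[float]:
    """Step 2 stand-in: a deterministic pseudo-random vector, NOT a real model."""
    rng = random.Random(chunk)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def mean_pool(vectors: list[list[float]]) -> list[float]:
    """Step 3: average the chunk vectors, then L2-normalize the result."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0
    return [x / norm for x in mean]


def binarize(vector: list[float], bits: int) -> bytes:
    """Step 4a: positive components become 1-bits; keep the first `bits` of them.

    `bits` must be a multiple of 8 and no larger than the vector length.
    """
    bitstring = "".join("1" if x > 0 else "0" for x in vector[:bits])
    return int(bitstring, 2).to_bytes(bits // 8, "big")


def encode(header: bytes, body: bytes) -> str:
    """Step 4b: prefix the header and base32-encode without padding."""
    return "ISCC:" + base64.b32encode(header + body).decode("ascii").rstrip("=")


text = "This is some sample text. It can be a longer document or even an entire book."
chunks = split_overlapping(text)
doc_vector = mean_pool([fake_embed(chunk) for chunk in chunks])

body_256 = binarize(doc_vector, 256)
body_64 = binarize(doc_vector, 64)

# Assumed header for a 256-bit code, taken from the Quick start sample.
print(encode(bytes.fromhex("1007"), body_256))
```

Because truncation keeps the leading bits, `body_64` equals the first 8 bytes of `body_256` in this sketch. The printed code is meaningless as an identifier, since the vectors come from a random generator rather than a language model.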

Because the code is derived from the meaning of the text rather than its exact wording, it tolerates rephrasing and translation, so cross-lingual matching works from a single short Simprint.
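Why sign-based binarization preserves similarity can be shown with synthetic vectors. The sketch below assumes that a rephrased or translated text lands close to the original in embedding space, which it models as the original vector plus small Gaussian noise; no real embeddings are involved, and the dimension, noise level, and seed are arbitrary choices.

```python
import random


def binarize(vector):
    """Map each positive component to a 1-bit and everything else to a 0-bit."""
    return [1 if x > 0 else 0 for x in vector]


def hamming(bits_a, bits_b):
    """Count the positions at which two bit lists differ."""
    return sum(a != b for a, b in zip(bits_a, bits_b))


rng = random.Random(42)
dim = 256

original = [rng.gauss(0.0, 1.0) for _ in range(dim)]
# Stand-in for a rephrased or translated text: the same vector, slightly perturbed.
rephrased = [x + rng.gauss(0.0, 0.1) for x in original]
# Stand-in for an unrelated text: an independent random vector.
unrelated = [rng.gauss(0.0, 1.0) for _ in range(dim)]

near = hamming(binarize(original), binarize(rephrased))
far = hamming(binarize(original), binarize(unrelated))
print(near, far)
```

Small perturbations flip only the components that were already close to zero, so `near` stays a small fraction of the 256 bits, while `far` lands around half of them.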

Documentation

  • Getting started - Tutorial

    Install, generate your first code, and match a translation across languages.

  • How-to guides - Task recipes

    Compare texts, work with granular features, configure options, and use the CLI.

  • How it works - Understand the design

    The pipeline, cross-lingual matching, and why binarized vectors still match.

  • API reference - Library details

    Generated reference for create(), the data model, options, and similarity functions.

  • For Coding Agents - Build on iscc-sct

    A dense architecture map, constraints, and task recipes for AI coding agents.

  • Live Demo - Try it in the browser

    Generate and compare Semantic Text-Codes interactively on Hugging Face Spaces.

  • Source Code - Read the implementation

    The full algorithm, CLI, and Gradio demo on GitHub.

  • Full text for LLMs - Machine-readable docs

    Every page concatenated into a single file for language models.

Supported languages (60+)

Arabic, Armenian, Bengali, Bosnian, Bulgarian, Burmese, Catalan, Chinese (China), Chinese (Taiwan), Croatian, Czech, Danish, Dutch, English, Estonian, Farsi, Finnish, French, French (Canada), Galician, German, Greek, Gujarati, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Korean, Kurdish, Latvian, Lithuanian, Macedonian, Malay, Malayalam, Marathi, Mongolian, Norwegian Bokmål, Persian, Polish, Portuguese, Portuguese (Brazil), Romanian, Russian, Serbian, Sinhala, Slovak, Slovenian, Spanish, Swedish, Tamil, Telugu, Thai, Turkish, Ukrainian, Urdu, Vietnamese.
