API reference

This page documents the public API re-exported from the iscc_sct package. Import it with import iscc_sct as sct. The reference below is generated from the source docstrings.

High-level API

create is the primary entry point. It returns a Metadata object containing the code and, optionally, granular features.

create

create(text, granular=False, **options)

High-level API for creating a Semantic Text-Code.

Parameters:

  • text (required): Text used for creating Semantic Text-Code.
  • granular (default: False): Activate options for granular processing.
  • options (default: {}): Override individual options for creating Semantic Text-Code.

Returns:

  • Semantic Text-Code Metadata object in Object-Format.

Core functions

Lower-level building blocks of the processing pipeline.

gen_text_code_semantic

gen_text_code_semantic(text, **options)

Create an ISCC Semantic-Code Text from plaintext.

Parameters:

  • text (str, required): Plain text for ISCC processing.
  • options (default: {}): Custom processing options for overriding global options. Recognized keys:
      • bits (int): Length of generated Semantic Text-Code in bits (default 64).
      • characters (bool): Return document character count (default True).
      • embedding (bool): Return global document embedding (default False).
      • precision (int): Max fractional digits for embeddings (default 8).
      • simprints (bool): Return granular document features (default False).
      • offsets (bool): Return character offsets for granular features (default False).
      • sizes (bool): Include sizes of granular features in chars (default False).
      • contents (bool): Return text chunks (default False).
      • max_tokens (int): Max tokens per chunk (default 127).
      • overlap (int): Max tokens allowed to overlap between chunks (default 48).
      • trim (bool): Trim whitespace from chunks (default False).

Returns:

  • Dict with ISCC processing results (using Index-Format for granular features).

code_text_semantic

code_text_semantic(fp, **options)

Generate ISCC Semantic-Code Text from a text file.

NOTE: Enable granular features with simprints=True. Their length is set by bits_granular (default 64) and is independent of the document bits.

Parameters:

  • fp (required): File path of a plaintext file to process.
  • options (default: {}): Custom processing options for overriding global options. Recognized keys:
      • bits (int): Length of generated Semantic Text-Code in bits (default 64).
      • characters (bool): Return document character count (default True).
      • embedding (bool): Return global document embedding (default False).
      • precision (int): Max fractional digits for embeddings (default 8).
      • simprints (bool): Return granular document features (default False).
      • offsets (bool): Return character offsets for granular features (default False).
      • sizes (bool): Include sizes of granular features in chars (default False).
      • contents (bool): Return text chunks (default False).
      • max_tokens (int): Max tokens per chunk (default 127).
      • overlap (int): Max tokens allowed to overlap between chunks (default 48).
      • trim (bool): Trim whitespace from chunks (default False).

Returns:

  • Dict with ISCC processing results.

soft_hash_text_semantic

soft_hash_text_semantic(text)

Create a similarity-preserving hash for text as the full binarized document embedding (384 bits).

embed_chunks

embed_chunks(chunks, batch_size=100)

Embed text chunks and return vector embeddings.

Parameters:

  • chunks (required): Text chunks to embed.
  • batch_size (default: 100): Number of chunks to process in each batch.

Returns:

  • An array of embeddings for each chunk.
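To illustrate what batch_size controls, here is a minimal sketch of fixed-size batching. The helper name iter_batches is hypothetical and not part of iscc_sct, and the embedding step itself is left out because it depends on the model.

```python
from typing import Iterator, List, Sequence


def iter_batches(chunks: Sequence[str], batch_size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of `chunks` holding at most `batch_size` items.

    Hypothetical helper for illustration only; not part of iscc_sct.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(chunks), batch_size):
        yield list(chunks[start:start + batch_size])


if __name__ == "__main__":
    chunks = [f"chunk {i}" for i in range(250)]
    # 250 chunks with the default batch size give batches of 100, 100 and 50.
    print([len(batch) for batch in iter_batches(chunks)])
```

A smaller batch_size would lower peak memory use at the cost of more model invocations, which is the usual trade-off for this kind of parameter.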

Data model

The result schema. Metadata holds the code and feature sets; FeatureSet and Feature carry the granular data in either Index-Format or Object-Format.

Metadata

Bases: PrettyBaseModel

to_index_format

to_index_format() -> Metadata

Convert the Metadata object to use the Index-Format for features. Returns a new Metadata object.

get_content

get_content() -> Optional[str]

Reconstruct and return the original input text if all necessary data is available. This method removes overlaps in adjacent text chunks.

Returns:

  • Optional[str]: The reconstructed original text, or None if the necessary data is not available.

get_overlaps

get_overlaps() -> List[str]

Return the overlapping text between each pair of consecutive chunks. Consecutive chunks that do not overlap yield an empty string.

Returns:

  • List[str]: List of overlapping text or empty strings.
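The overlap handling described for get_content and get_overlaps can be sketched on plain (offset, text) pairs. This is an illustration under stated assumptions, not the Metadata implementation: the function names are hypothetical, offsets are taken to be character offsets into the original text, chunks are sorted by offset, and chunk contents are untrimmed.

```python
from typing import List, Optional, Tuple

# Each chunk is (character offset in the original text, chunk content).
Chunk = Tuple[int, str]


def overlaps_sketch(chunks: List[Chunk]) -> List[str]:
    """Return the text shared by each pair of consecutive chunks ("" if none)."""
    result = []
    for (off_a, text_a), (off_b, _) in zip(chunks, chunks[1:]):
        end_a = off_a + len(text_a)
        # If the next chunk starts before this one ends, the tail overlaps.
        result.append(text_a[off_b - off_a:] if off_b < end_a else "")
    return result


def content_sketch(chunks: List[Chunk]) -> Optional[str]:
    """Rebuild the text by appending only the part of each chunk not yet covered."""
    if not chunks:
        return None
    parts = []
    covered = chunks[0][0]  # end position of the text assembled so far
    for offset, text in chunks:
        if offset > covered:
            return None  # gap between chunks: the text cannot be reconstructed
        parts.append(text[covered - offset:])
        covered = max(covered, offset + len(text))
    return "".join(parts)


if __name__ == "__main__":
    chunks = [(0, "Hello wor"), (6, "world!")]
    print(overlaps_sketch(chunks))  # ['wor']
    print(content_sketch(chunks))   # Hello world!
```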

to_object_format

to_object_format() -> Metadata

Convert the Metadata object to use the Object-Format for features. Returns a new Metadata object.

FeatureSet

Bases: PrettyBaseModel

Feature

Bases: PrettyBaseModel

Options

The settings model. Configure it per call, with ISCC_SCT_* environment variables, or via override().

SctOptions

Bases: BaseSettings

override

override(update=None)

Returns an updated and validated deep copy of the current settings instance.

Similarity and distance

Functions for comparing codes, digests, and granular features.

iscc_distance

iscc_distance(iscc1, iscc2)

Calculate the Hamming distance between two ISCC Semantic Text Codes.

Parameters:

  • iscc1 (required): The first ISCC Semantic Text Code.
  • iscc2 (required): The second ISCC Semantic Text Code.

Returns:

  • The Hamming distance between the two ISCC codes.

Raises:

  • ValueError: If the input ISCCs are not valid or of different lengths.
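As a rough sketch of how such a comparison might work, the example below decodes two codes and counts the differing bits in their bodies. It rests on assumptions that this reference does not confirm: codes carry an "ISCC:" prefix followed by unpadded base32, and the decoded bytes start with a 2-byte header. The function name is hypothetical.

```python
import base64


def iscc_distance_sketch(iscc1: str, iscc2: str) -> int:
    """Hamming distance between the bodies of two ISCC codes.

    Assumptions (not confirmed by the reference): an "ISCC:" prefix,
    unpadded base32 encoding, and a 2-byte header before the body.
    """
    def decode(code: str) -> bytes:
        if not code.startswith("ISCC:"):
            raise ValueError("Expected a code starting with 'ISCC:'")
        raw = code[5:]
        padded = raw + "=" * (-len(raw) % 8)  # restore stripped base32 padding
        # b32decode raises binascii.Error (a ValueError subclass) on bad input.
        return base64.b32decode(padded, casefold=True)

    a, b = decode(iscc1), decode(iscc2)
    if len(a) != len(b) or a[:2] != b[:2]:
        raise ValueError("Codes must share the same header and length")
    # XOR the bodies as integers and count the set bits.
    diff = int.from_bytes(a[2:], "big") ^ int.from_bytes(b[2:], "big")
    return bin(diff).count("1")
```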

hamming_distance

hamming_distance(a, b)

Calculate the bitwise Hamming distance between two bytes objects.

Parameters:

  • a (required): The first bytes object.
  • b (required): The second bytes object.

Returns:

  • The Hamming distance between two bytes objects.

Raises:

  • ValueError: If a and b are not the same length.
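A bitwise Hamming distance can be computed with the standard library alone. The following is a minimal sketch of the technique, not necessarily how iscc_sct implements it; the _sketch suffix marks the name as illustrative.

```python
def hamming_distance_sketch(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length byte strings differ."""
    if len(a) != len(b):
        raise ValueError("Inputs must have the same length")
    # XOR leaves a 1 bit wherever the inputs differ; count those bits.
    diff = int.from_bytes(a, "big") ^ int.from_bytes(b, "big")
    return bin(diff).count("1")


if __name__ == "__main__":
    print(hamming_distance_sketch(b"\x00", b"\xff"))          # 8
    print(hamming_distance_sketch(b"\x0f\x00", b"\x00\x00"))  # 4
```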

cosine_similarity

cosine_similarity(a, b)

Calculate the approximate cosine similarity based on Hamming distance for two bytes inputs.

Parameters:

  • a (required): The first bytes object.
  • b (required): The second bytes object.

Returns:

  • The approximate cosine similarity between the two inputs, scaled from -100 to +100.

Raises:

  • ValueError: If a and b are not the same length.
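One common way to approximate cosine similarity from binarized vectors is a linear mapping of the Hamming distance: identical inputs score +100, inputs differing in half their bits score 0, and bitwise complements score -100. The sketch below assumes that mapping and rounds to the nearest integer; the exact formula and rounding used by iscc_sct are not stated in this reference.

```python
def cosine_similarity_sketch(a: bytes, b: bytes) -> int:
    """Approximate cosine similarity (-100..+100) from the Hamming distance.

    Assumed mapping: similarity = 1 - 2 * distance / total_bits.
    """
    if len(a) != len(b):
        raise ValueError("Inputs must have the same length")
    total_bits = len(a) * 8
    if total_bits == 0:
        raise ValueError("Inputs must not be empty")
    distance = bin(int.from_bytes(a, "big") ^ int.from_bytes(b, "big")).count("1")
    return round((1 - 2 * distance / total_bits) * 100)


if __name__ == "__main__":
    print(cosine_similarity_sketch(b"\x00", b"\x00"))  # 100
    print(cosine_similarity_sketch(b"\x00", b"\x01"))  # 75
    print(cosine_similarity_sketch(b"\x00", b"\xff"))  # -100
```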

granular_similarity

granular_similarity(metadata_a, metadata_b, threshold=80)

Compare simprints from two Metadata objects and return matching pairs above a similarity threshold. Only the most similar pair for each simprint_a is included.

Parameters:

  • metadata_a (required): The first Metadata object.
  • metadata_b (required): The second Metadata object.
  • threshold (default: 80): The similarity threshold (0-100) above which simprints are considered a match.

Returns:

  • A list of tuples containing matching simprints and their similarity.
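The best-match selection described above can be sketched on plain lists of simprint digests. This is an illustration only: it works on bytes rather than Metadata objects, the names are hypothetical, the similarity uses an assumed linear mapping of the Hamming distance, and a score equal to the threshold is treated as a match (the reference says "above", so the real boundary behaviour may differ).

```python
from typing import List, Tuple


def _similarity(a: bytes, b: bytes) -> int:
    """Hamming-based similarity on a -100..+100 scale (assumed linear mapping)."""
    distance = bin(int.from_bytes(a, "big") ^ int.from_bytes(b, "big")).count("1")
    return round((1 - 2 * distance / (len(a) * 8)) * 100)


def best_matches(
    simprints_a: List[bytes], simprints_b: List[bytes], threshold: int = 80
) -> List[Tuple[bytes, bytes, int]]:
    """For each simprint in A, keep its most similar partner in B if it passes."""
    matches = []
    for sim_a in simprints_a:
        best = None
        for sim_b in simprints_b:
            if not sim_a or len(sim_a) != len(sim_b):
                continue  # only equal-length, non-empty digests are comparable
            score = _similarity(sim_a, sim_b)
            if best is None or score > best[1]:
                best = (sim_b, score)
        if best is not None and best[1] >= threshold:
            matches.append((sim_a, best[0], best[1]))
    return matches
```

Because only the best partner per simprint in A is kept, the result holds at most one tuple per entry of simprints_a, and swapping the arguments can produce a different list.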

Codecs and helpers

Encoders, decoders, and utilities.

encode_base32

encode_base32(data)

Standard RFC4648 base32 encoding without padding.

Parameters:

  • data (bytes, required): Data for base32 encoding.

Returns:

  • Base32 encoded str.

decode_base32

decode_base32(code)

Standard RFC4648 base32 decoding without padding and with casefolding.
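Unpadded base32 with case-insensitive decoding can be reproduced with the standard library base64 module. The functions below are a minimal sketch of that technique; the _sketch suffix marks them as illustrations rather than the iscc_sct implementations.

```python
import base64


def encode_base32_sketch(data: bytes) -> str:
    """RFC 4648 base32 with the trailing '=' padding removed."""
    return base64.b32encode(data).decode("ascii").rstrip("=")


def decode_base32_sketch(code: str) -> bytes:
    """Decode unpadded base32, accepting lowercase input."""
    # Re-add the padding that the encoder stripped (length must be a multiple of 8).
    padded = code + "=" * (-len(code) % 8)
    return base64.b32decode(padded, casefold=True)


if __name__ == "__main__":
    code = encode_base32_sketch(b"hi")
    print(code)                                # NBUQ
    print(decode_base32_sketch(code.lower()))  # b'hi'
```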

encode_base64

encode_base64(data)

Standard RFC4648 base64url encoding without padding.

decode_base64

decode_base64(code)

Standard RFC4648 base64url decoding without padding.
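The same padding-free approach applies to the URL-safe base64 alphabet. This is a minimal standard-library sketch; the _sketch names are illustrative and not the iscc_sct implementations.

```python
import base64


def encode_base64_sketch(data: bytes) -> str:
    """RFC 4648 base64url ('-' and '_' alphabet) with '=' padding removed."""
    return base64.urlsafe_b64encode(data).decode("ascii").rstrip("=")


def decode_base64_sketch(code: str) -> bytes:
    """Decode unpadded base64url."""
    # Restore the padding so the length is a multiple of 4 again.
    padded = code + "=" * (-len(code) % 4)
    return base64.urlsafe_b64decode(padded)


if __name__ == "__main__":
    code = encode_base64_sketch(b"\xfb\xff")
    print(code)                        # -_8
    print(decode_base64_sketch(code))  # b'\xfb\xff'
```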

char_to_byte_offsets

char_to_byte_offsets(text, char_positions)

Efficiently convert character positions to byte positions in a single pass.

Parameters:

  • text (required): The input text.
  • char_positions (required): List of character positions to convert.

Returns:

  • List of corresponding byte positions.
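A single-pass conversion can be sketched by walking the text once and tracking the running byte count. The example assumes UTF-8 as the byte encoding, which this reference does not state, and the function name is hypothetical.

```python
from typing import Dict, List


def char_to_byte_offsets_sketch(text: str, char_positions: List[int]) -> List[int]:
    """Map character positions to byte positions (UTF-8 assumed) in one pass."""
    wanted = set(char_positions)
    mapping: Dict[int, int] = {}
    byte_pos = 0
    for char_pos, ch in enumerate(text):
        if char_pos in wanted:
            mapping[char_pos] = byte_pos
        byte_pos += len(ch.encode("utf-8"))
    # A position equal to len(text) points just past the last character.
    if len(text) in wanted:
        mapping[len(text)] = byte_pos
    # Positions outside 0..len(text) are not handled and raise KeyError here.
    return [mapping[pos] for pos in char_positions]


if __name__ == "__main__":
    # 'a' is 1 byte, 'é' is 2 bytes, '€' is 3 bytes, 'b' is 1 byte in UTF-8.
    print(char_to_byte_offsets_sketch("aé€b", [0, 1, 2, 3, 4]))  # [0, 1, 3, 6, 7]
```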

get_model

get_model()

Return the local model file if it exists; otherwise download it first.