API reference

This page documents the public API re-exported from the `iscc_sct` package. Import it with `import iscc_sct as sct`. The reference below is generated from the source docstrings.

High-level API

The primary entry point. Returns a Metadata object with the code and, optionally, granular features.

create

create(text, granular=False, **options)

Create Semantic Text-Code

High-Level API for creating Semantic Text-Code.

Parameters:

Name	Description	Default
`text`	Text used for creating Semantic Text-Code.	required
`granular`	Activate options for granular processing.	`False`
`options`	Override individual options for creating Semantic Text-Code.	`{}`

Returns:

Type	Description
	Semantic Text-Code `Metadata` object in Object-Format
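
As a usage illustration, the call below passes `bits` through `**options`, since `bits` is one of the recognized option keys listed for the core functions. This is a minimal sketch, assuming the package is installed and its model is available locally or can be downloaded; the wrapper name `create_semantic_code` is hypothetical.

```python
def create_semantic_code(text, bits=64, granular=False):
    """Hypothetical wrapper around the documented create() entry point."""
    # Imported inside the function so this module still loads where
    # iscc_sct is not installed.
    import iscc_sct as sct

    # `granular` and `bits` are forwarded exactly as documented above.
    return sct.create(text, granular=granular, bits=bits)


if __name__ == "__main__":
    meta = create_semantic_code("This is some sample text.", bits=256)
    print(meta)
```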

Core functions

Lower-level building blocks of the processing pipeline.

gen_text_code_semantic

gen_text_code_semantic(text, **options)

Create an ISCC Semantic-Code Text from plaintext.

Parameters:

Name	Type	Description	Default
`text`	`str`	Plain text for ISCC processing	required
`options`		Custom processing options for overriding global options. Recognized keys: `bits` (int): Length of generated Semantic Text-Code in bits (default 64). `characters` (bool): Return document character count (default True). `embedding` (bool): Return global document embedding (default False). `precision` (int): Max fractional digits for embeddings (default 8). `simprints` (bool): Return granular document features (default False). `offsets` (bool): Return character offsets for granular features (default False). `sizes` (bool): Include sizes of granular features in chars (default False). `contents` (bool): Return text chunks (default False). `max_tokens` (int): Max tokens per chunk (default 127). `overlap` (int): Max tokens allowed to overlap between chunks (default 48). `trim` (bool): Trim whitespace from chunks (default False).	`{}`

Returns:

Type	Description
	Dict with ISCC processing results (using Index-Format for granular features)

code_text_semantic

code_text_semantic(fp, **options)

Generate ISCC Semantic-Code Text from a text file.

NOTE: Enable granular features with `simprints=True`. Their length is set by `bits_granular` (default 64) and is independent of the document `bits`.

Parameters:

Name	Type	Description	Default
`fp`		File path of a plaintext file to process	required
`options`		Custom processing options for overriding global options. Recognized keys: `bits` (int): Length of generated Semantic Text-Code in bits (default 64). `characters` (bool): Return document character count (default True). `embedding` (bool): Return global document embedding (default False). `precision` (int): Max fractional digits for embeddings (default 8). `simprints` (bool): Return granular document features (default False). `offsets` (bool): Return character offsets for granular features (default False). `sizes` (bool): Include sizes of granular features in chars (default False). `contents` (bool): Return text chunks (default False). `max_tokens` (int): Max tokens per chunk (default 127). `overlap` (int): Max tokens allowed to overlap between chunks (default 48). `trim` (bool): Trim whitespace from chunks (default False).	`{}`

Returns:

Type	Description
	Dict with ISCC processing results

soft_hash_text_semantic

soft_hash_text_semantic(text)

Create a similarity-preserving hash for text as the full binarized document embedding (384 bits).
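
To make "binarized embedding" concrete, here is a minimal sketch that assumes binarization means one sign bit per embedding dimension, which the docstring does not spell out. The function name `binarize_sketch` is hypothetical and this is not the package's implementation.

```python
def binarize_sketch(embedding):
    """Pack one sign bit per dimension into bytes (assumed scheme)."""
    if len(embedding) % 8 != 0:
        raise ValueError("number of dimensions must be a multiple of 8")
    # Assumption: non-negative values map to 1, negative values to 0.
    bits = [1 if value >= 0 else 0 for value in embedding]
    packed = bytearray()
    for start in range(0, len(bits), 8):
        byte = 0
        for bit in bits[start:start + 8]:
            byte = (byte << 1) | bit  # most significant bit first
        packed.append(byte)
    return bytes(packed)


# A 384-dimensional vector yields 384 bits, i.e. 48 bytes.
digest = binarize_sketch([0.5, -0.5] * 192)
```

Under this scheme, two embeddings that point in similar directions share most sign bits, which is what makes the hash similarity-preserving.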

embed_chunks

embed_chunks(chunks, batch_size=100)

Embed text chunks and return vector embeddings.

Parameters:

Name	Type	Description	Default
`chunks`		Text chunks to embed.	required
`batch_size`		Number of chunks to process in each batch.	`100`

Returns:

Type	Description
	An array of embeddings for each chunk.

Data model

The result schema. Metadata holds the code and feature sets; FeatureSet and Feature carry the granular data in either Index-Format or Object-Format.

Metadata

Bases: PrettyBaseModel

to_index_format

to_index_format() -> Metadata

Convert the Metadata object to use the Index-Format for features. Returns a new Metadata object.

get_content

get_content() -> Optional[str]

Reconstruct and return the original input text if all necessary data is available. This method removes overlaps in adjacent text chunks.

Returns:

Type	Description
`Optional[str]`	The reconstructed original text, or None if the necessary data is not available.
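
The idea of rebuilding text from overlapping chunks can be sketched as follows. This is an illustration under stated assumptions, not the method's actual code: chunks are modelled as hypothetical `(offset, content)` tuples sorted by character offset, and a gap between chunks is treated as "necessary data not available".

```python
def reconstruct_sketch(chunks):
    """Join (offset, content) chunks, dropping text already covered."""
    parts = []
    end = 0  # character position up to which text has been emitted
    for offset, content in chunks:
        if offset > end:
            return None  # assumption: a gap means the text cannot be rebuilt
        # Skip the leading characters that overlap the previous chunks.
        parts.append(content[end - offset:])
        end = max(end, offset + len(content))
    return "".join(parts)


chunks = [(0, "The quick"), (4, "quick brown"), (10, "brown fox")]
text = reconstruct_sketch(chunks)
```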

get_overlaps

get_overlaps() -> List[str]

Return the overlapping text between each pair of consecutive chunks. The entry for a pair that does not overlap is an empty string.

Returns:

Type	Description
`List[str]`	List of overlapping text or empty strings.
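
One way to picture the result: given chunk offsets and contents, the overlap of a pair is the tail of the first chunk that starts where the second chunk begins. The sketch below uses hypothetical `(offset, content)` tuples and assumes each chunk ends no earlier than its predecessor; it illustrates the described output, not the method's implementation.

```python
def overlaps_sketch(chunks):
    """Return one overlap string per pair of consecutive chunks."""
    result = []
    for (offset_a, text_a), (offset_b, _text_b) in zip(chunks, chunks[1:]):
        end_a = offset_a + len(text_a)
        if offset_b < end_a:
            # The shared text is the tail of chunk A from where B starts.
            result.append(text_a[offset_b - offset_a:])
        else:
            result.append("")  # no overlap for this pair
    return result


chunks = [(0, "The quick"), (4, "quick brown"), (10, "brown fox"), (19, "!")]
overlaps = overlaps_sketch(chunks)
```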

to_object_format

to_object_format() -> Metadata

Convert the Metadata object to use the Object-Format for features. Returns a new Metadata object.

FeatureSet

Bases: PrettyBaseModel

Feature

Bases: PrettyBaseModel

Options

The settings model. Configure it per call, with `ISCC_SCT_*` environment variables, or via `override()`.

SctOptions

Bases: BaseSettings

override

override(update=None)

Returns an updated and validated deep copy of the current settings instance.

Similarity and distance

Functions for comparing codes, digests, and granular features.

iscc_distance

iscc_distance(iscc1, iscc2)

Calculate the Hamming distance between two ISCC Semantic Text Codes.

Parameters:

Name	Type	Description	Default
`iscc1`		The first ISCC Semantic Text Code.	required
`iscc2`		The second ISCC Semantic Text Code.	required

Returns:

Type	Description
	The Hamming distance between the two ISCC codes.

Raises:

Type	Description
`ValueError`	If the input ISCCs are not valid or of different lengths.
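
The comparison can be sketched with the standard library. This sketch assumes that a code is an optional `ISCC:` prefix followed by unpadded base32, and that the decoded bytes start with a 2-byte header that is skipped before comparing. Both points are assumptions rather than statements from this page, and all names below are hypothetical.

```python
import base64


def _decode_body(iscc, header_len):
    """Strip the prefix, decode base32, and drop the assumed header."""
    code = iscc[5:] if iscc.startswith("ISCC:") else iscc
    padded = code + "=" * (-len(code) % 8)
    # b32decode raises binascii.Error (a ValueError subclass) on bad input.
    raw = base64.b32decode(padded, casefold=True)
    return raw[header_len:]


def iscc_distance_sketch(iscc1, iscc2, header_len=2):
    """Hamming distance between the bodies of two codes."""
    body1 = _decode_body(iscc1, header_len)
    body2 = _decode_body(iscc2, header_len)
    if len(body1) != len(body2):
        raise ValueError("codes must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(body1, body2))


def make_demo_code(body, header=b"\x00\x00"):
    """Build a made-up code for the demo; the header is a placeholder."""
    encoded = base64.b32encode(header + body).decode("ascii")
    return "ISCC:" + encoded.rstrip("=")


code_a = make_demo_code(bytes(8))
code_b = make_demo_code(bytes(7) + b"\x03")
distance = iscc_distance_sketch(code_a, code_b)
```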

hamming_distance

hamming_distance(a, b)

Calculate the bitwise Hamming distance between two bytes objects.

Parameters:

Name	Type	Description	Default
`a`		The first bytes object.	required
`b`		The second bytes object.	required

Returns:

Type	Description
	The Hamming distance between two bytes objects.

Raises:

Type	Description
`ValueError`	If `a` and `b` are not the same length.
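
The behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the package's implementation; the name `hamming_distance_sketch` is hypothetical.

```python
def hamming_distance_sketch(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length inputs differ."""
    if len(a) != len(b):
        raise ValueError("a and b must be the same length")
    # XOR leaves a 1 bit wherever the inputs differ; count those per byte.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


distance = hamming_distance_sketch(b"\x0f\x00", b"\x00\x00")
```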

cosine_similarity

cosine_similarity(a, b)

Calculate the approximate cosine similarity based on Hamming distance for two bytes inputs.

Parameters:

Name	Type	Description	Default
`a`		The first bytes object.	required
`b`		The second bytes object.	required

Returns:

Type	Description
	The approximate cosine similarity between the two inputs, scaled from -100 to +100.

Raises:

Type	Description
`ValueError`	If `a` and `b` are not the same length.
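
One plausible way to derive such a score is a linear mapping from the Hamming distance: identical inputs score +100, inputs differing in half their bits score 0, and fully inverted inputs score -100. The sketch below assumes that mapping and rounds to the nearest integer; the exact formula and rounding used by the package are not stated on this page.

```python
def cosine_similarity_sketch(a: bytes, b: bytes) -> int:
    """Approximate cosine similarity in [-100, 100] (assumed formula)."""
    if len(a) != len(b):
        raise ValueError("a and b must be the same length")
    # Assumes non-empty inputs; empty inputs would divide by zero below.
    distance = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
    total_bits = len(a) * 8
    similarity = 1 - 2 * distance / total_bits
    return round(similarity * 100)


score = cosine_similarity_sketch(b"\xff", b"\x0f")
```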

granular_similarity

granular_similarity(metadata_a, metadata_b, threshold=80)

Compare simprints from two Metadata objects and return matching pairs above a similarity threshold. Only the most similar pair for each simprint_a is included.

Parameters:

Name	Description	Default
`metadata_a`	The first Metadata object.	required
`metadata_b`	The second Metadata object.	required
`threshold`	The similarity threshold (0-100) above which simprints are considered a match.	`80`

Returns:

Type	Description
	A list of tuples containing matching simprints and their similarity.
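
The matching rule (keep only the best partner for each simprint from the first object, and only if it clears the threshold) can be sketched on plain lists of equal-length byte strings. The tuple layout `(simprint_a, similarity, simprint_b)`, the inclusive threshold comparison, and the similarity formula are assumptions made for this illustration; the function name is hypothetical.

```python
def best_matches_sketch(simprints_a, simprints_b, threshold=80):
    """Return the best match per item of simprints_a at or above threshold."""

    def similarity(a, b):
        # Assumed linear mapping from Hamming distance to [-100, 100].
        distance = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
        return round((1 - 2 * distance / (len(a) * 8)) * 100)

    matches = []
    for a in simprints_a:
        best = None
        for b in simprints_b:
            score = similarity(a, b)
            if score >= threshold and (best is None or score > best[1]):
                best = (a, score, b)
        if best is not None:
            matches.append(best)
    return matches


side_a = [b"\xff" * 8, b"\x00" * 8]
side_b = [b"\xff" * 7 + b"\xfe", b"\x0f" * 8]
matches = best_matches_sketch(side_a, side_b)
```

Because the loop is driven by the first list, the result is not symmetric: swapping the arguments can produce a different set of pairs.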

Codecs and helpers

Encoders, decoders, and utilities.

encode_base32

encode_base32(data)

Standard RFC4648 base32 encoding without padding.

Parameters:

Name	Type	Description	Default
`data`	`bytes`	Data for base32 encoding	required

Returns:

Type	Description
	Base32 encoded str

decode_base32

decode_base32(code)

Standard RFC4648 base32 decoding without padding and with casefolding.
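
Unpadded base32 with case-insensitive decoding can be reproduced with Python's `base64` module. This is a sketch of the described behaviour using hypothetical function names, not the package's source.

```python
import base64


def encode_base32_sketch(data: bytes) -> str:
    """RFC 4648 base32 with the trailing '=' padding removed."""
    return base64.b32encode(data).decode("ascii").rstrip("=")


def decode_base32_sketch(code: str) -> bytes:
    """Restore the padding, then decode while accepting lowercase input."""
    padding = "=" * (-len(code) % 8)
    return base64.b32decode(code + padding, casefold=True)


encoded = encode_base32_sketch(b"hi")
decoded = decode_base32_sketch(encoded.lower())
```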

encode_base64

encode_base64(data)

Standard RFC4648 base64url encoding without padding.

decode_base64

decode_base64(code)

Standard RFC4648 base64url decoding without padding.
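
The same pattern applies to base64url: strip the padding on encode and restore it on decode. A minimal standard-library sketch with hypothetical function names:

```python
import base64


def encode_base64_sketch(data: bytes) -> str:
    """RFC 4648 base64url ('-' and '_' alphabet) without '=' padding."""
    return base64.urlsafe_b64encode(data).decode("ascii").rstrip("=")


def decode_base64_sketch(code: str) -> bytes:
    """Re-add padding to a multiple of four characters, then decode."""
    padding = "=" * (-len(code) % 4)
    return base64.urlsafe_b64decode(code + padding)


encoded = encode_base64_sketch(b"\xfb\xff")
decoded = decode_base64_sketch(encoded)
```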

char_to_byte_offsets

char_to_byte_offsets(text, char_positions)

Efficiently convert character positions to byte positions in a single pass.

Parameters:

Name	Type	Description	Default
`text`		The input text	required
`char_positions`		List of character positions to convert	required

Returns:

Type	Description
	List of corresponding byte positions
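
A single-pass conversion can be sketched by walking the text once and keeping a running byte count. The sketch assumes UTF-8 encoding and maps positions at or beyond the end of the text to the total byte length; both choices, and the function name, are assumptions for illustration.

```python
def char_to_byte_offsets_sketch(text, char_positions):
    """Map character positions to UTF-8 byte positions in one pass."""
    targets = sorted(set(char_positions))
    found = {}
    byte_pos = 0
    idx = 0  # index of the next target position to resolve
    for char_pos, char in enumerate(text):
        if idx < len(targets) and targets[idx] == char_pos:
            found[char_pos] = byte_pos
            idx += 1
        byte_pos += len(char.encode("utf-8"))
    # Remaining targets lie at or past the end of the text.
    for pos in targets[idx:]:
        found[pos] = byte_pos
    # Return results in the order the caller asked for them.
    return [found[pos] for pos in char_positions]


# "a" is 1 byte, "é" is 2 bytes, "€" is 3 bytes, "b" is 1 byte in UTF-8.
offsets = char_to_byte_offsets_sketch("aé€b", [0, 1, 2, 3, 4])
```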

get_model

get_model()

Return the local model file if it exists; otherwise download it first.