API reference¶
This page documents the public API re-exported from the `iscc_sct` package. Import it with
`import iscc_sct as sct`. The reference below is generated from the source docstrings.
High-level API¶
The primary entry point. Returns a Metadata object with the code and, optionally, granular
features.
create ¶
Create Semantic Text-Code
High-Level API for creating Semantic Text-Code.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | | Text used for creating Semantic Text-Code. | required |
| granular | | Activate options for granular processing (Default: False). | False |
| options | | Override individual options for creating Semantic Text-Code. | {} |
Returns:
| Type | Description |
|---|---|
| | Semantic Text-Code |
Core functions¶
Lower-level building blocks of the processing pipeline.
gen_text_code_semantic ¶
Create an ISCC Semantic-Code Text from plaintext.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | Plain text for ISCC processing | required |
| options | | Custom processing options for overriding global options. Recognized keys: | {} |
Returns:
| Type | Description |
|---|---|
| | Dict with ISCC processing results (using Index-Format for granular features) |
code_text_semantic ¶
Generate ISCC Semantic-Code Text from a text file.
NOTE:
Enable granular features with simprints=True. Their length is set by bits_granular
(default 64) and is independent of the document bits.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| fp | | File path of a plaintext file to process | required |
| options | | Custom processing options for overriding global options. Recognized keys: | {} |
Returns:
| Type | Description |
|---|---|
| | Dict with ISCC processing results |
soft_hash_text_semantic ¶
Create a similarity-preserving hash for text as the full binarized document embedding (384 bits).
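To make "binarized embedding" concrete, here is a minimal sketch of one way to turn a float vector into a bit string. It assumes sign-based thresholding at zero and most-significant-bit-first packing. `binarize` is a hypothetical helper for illustration, not part of iscc_sct, and the library's actual pooling and thresholding may differ.

```python
from typing import Sequence


def binarize(vector: Sequence[float]) -> bytes:
    """Pack one bit per dimension (1 if value >= 0, else 0) into bytes, MSB first."""
    if not vector:
        raise ValueError("vector must not be empty")
    bits = "".join("1" if value >= 0 else "0" for value in vector)
    # Pad on the right so the bit string fills whole bytes.
    padded = bits.ljust((len(bits) + 7) // 8 * 8, "0")
    return int(padded, 2).to_bytes(len(padded) // 8, "big")


# A 384-dimensional vector yields 384 bits, i.e. 48 bytes.
digest = binarize([0.1] * 384)
```

Under these assumptions, two vectors that point in similar directions agree on most signs, so their digests differ in only a few bits.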
embed_chunks ¶
Embed text chunks and return vector embeddings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| chunks | | Text chunks to embed. | required |
| batch_size | | Number of chunks to process in each batch. | 100 |
Returns:
| Type | Description |
|---|---|
| | An array of embeddings for each chunk. |
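The role of `batch_size` can be illustrated with a generic batching loop. This is a sketch only: `iter_batches`, `embed_in_batches`, and `fake_embed` are hypothetical names, and `fake_embed` stands in for a real embedding model so that the example runs without one.

```python
from typing import Callable, Iterator, List, Sequence


def iter_batches(chunks: Sequence[str], batch_size: int = 100) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size chunks."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(chunks), batch_size):
        yield chunks[start:start + batch_size]


def embed_in_batches(
    chunks: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 100,
) -> List[List[float]]:
    """Embed chunks batch by batch and return one embedding per chunk, in order."""
    embeddings: List[List[float]] = []
    for batch in iter_batches(chunks, batch_size):
        embeddings.extend(embed_batch(batch))
    return embeddings


def fake_embed(batch: Sequence[str]) -> List[List[float]]:
    # Stand-in for a model: one 2-dimensional "embedding" per chunk.
    return [[float(len(text)), 1.0] for text in batch]


vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
```

A smaller batch size lowers peak memory use at the cost of more model invocations; the output order is unaffected.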
Data model¶
The result schema. Metadata holds the code and feature sets; FeatureSet and Feature carry the
granular data in either Index-Format or Object-Format.
Metadata ¶
Bases: PrettyBaseModel
to_index_format ¶
Convert the Metadata object to use the Index-Format for features. Returns a new Metadata object.
get_content ¶
Reconstruct and return the original input text if all necessary data is available. This method removes overlaps in adjacent text chunks.
Returns:
| Type | Description |
|---|---|
| Optional[str] | The reconstructed original text, or None if the necessary data is not available. |
get_overlaps ¶
Returns a list of overlapping text between consecutive chunks. For non-overlapping consecutive chunks, returns an empty string.
Returns:
| Type | Description |
|---|---|
| List[str] | List of overlapping text or empty strings. |
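The overlap handling described for get_content and get_overlaps can be sketched on plain lists. This is an illustration, not the library's code: it assumes each chunk comes with a character offset into the original text and that chunks are sorted by offset. The function names `overlaps` and `reconstruct` are hypothetical.

```python
from typing import List, Optional


def overlaps(chunks: List[str], offsets: List[int]) -> List[str]:
    """Return the text shared by each pair of consecutive chunks ("" if none)."""
    result = []
    for i in range(len(chunks) - 1):
        end_current = offsets[i] + len(chunks[i])
        start_next = offsets[i + 1]
        if start_next < end_current:
            # The tail of the current chunk is repeated at the start of the next.
            result.append(chunks[i][start_next - offsets[i]:])
        else:
            result.append("")
    return result


def reconstruct(chunks: List[str], offsets: List[int]) -> Optional[str]:
    """Join chunks while skipping text already emitted; None if there is a gap."""
    parts = []
    covered = 0  # number of characters of the original text emitted so far
    for chunk, offset in zip(chunks, offsets):
        if offset > covered:
            return None  # text between chunks is missing
        parts.append(chunk[covered - offset:])
        covered = max(covered, offset + len(chunk))
    return "".join(parts)


chunks = ["The quick ", "quick brown ", "fox"]
offsets = [0, 4, 16]
shared = overlaps(chunks, offsets)
text = reconstruct(chunks, offsets)
```

Here `shared` has one entry per pair of adjacent chunks, so its length is one less than the number of chunks.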
to_object_format ¶
Convert the Metadata object to use the Object-Format for features. Returns a new Metadata object.
FeatureSet ¶
Bases: PrettyBaseModel
Feature ¶
Bases: PrettyBaseModel
Options¶
The settings model. Configure it per call, with ISCC_SCT_* environment variables, or via
override().
SctOptions ¶
Bases: BaseSettings
override ¶
Returns an updated and validated deep copy of the current settings instance.
Similarity and distance¶
Functions for comparing codes, digests, and granular features.
iscc_distance ¶
Calculate the Hamming distance between two ISCC Semantic Text Codes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| iscc1 | | The first ISCC Semantic Text Code. | required |
| iscc2 | | The second ISCC Semantic Text Code. | required |
Returns:
| Type | Description |
|---|---|
| | The Hamming distance between the two ISCC codes. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the input ISCCs are not valid or of different lengths. |
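As a rough illustration of comparing two codes, the sketch below decodes each code from base32, drops a fixed-size header, and counts differing bits in the remaining body. The 2-byte header size, the optional `ISCC:` prefix handling, and the name `iscc_distance_sketch` are assumptions made for this example; the library's own validation is likely stricter.

```python
import base64


def _decode(iscc: str) -> bytes:
    """Decode an unpadded base32 code, ignoring an optional 'ISCC:' prefix."""
    code = iscc.split(":")[-1]
    padding = "=" * (-len(code) % 8)
    # Invalid input raises binascii.Error, which is a subclass of ValueError.
    return base64.b32decode(code + padding, casefold=True)


def iscc_distance_sketch(iscc1: str, iscc2: str, header_size: int = 2) -> int:
    """Hamming distance between the bodies of two codes (header bytes skipped)."""
    body1 = _decode(iscc1)[header_size:]
    body2 = _decode(iscc2)[header_size:]
    if len(body1) != len(body2):
        raise ValueError("ISCC codes have different lengths")
    return sum(bin(x ^ y).count("1") for x, y in zip(body1, body2))


# Synthetic codes for illustration: placeholder header plus a 64-bit body.
header = b"\x00\x00"
code_a = "ISCC:" + base64.b32encode(header + bytes(8)).decode("ascii").rstrip("=")
code_b = "ISCC:" + base64.b32encode(header + bytes(7) + b"\x03").decode("ascii").rstrip("=")
distance = iscc_distance_sketch(code_a, code_b)
```

The two synthetic bodies differ only in the last byte (`0x00` versus `0x03`), so exactly two bits differ.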
hamming_distance ¶
Calculate the bitwise Hamming distance between two bytes objects.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| a | | The first bytes object. | required |
| b | | The second bytes object. | required |
Returns:
| Type | Description |
|---|---|
| | The Hamming distance between two bytes objects. |
Raises:
| Type | Description |
|---|---|
| ValueError | If a and b are not the same length. |
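For reference, a bitwise Hamming distance over bytes can be written in a few lines of plain Python. This is a minimal sketch of the technique with the same contract as documented above (equal lengths required), not the library's implementation; the name `hamming_distance_sketch` is hypothetical.

```python
def hamming_distance_sketch(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length bytes objects differ."""
    if len(a) != len(b):
        raise ValueError("a and b must be the same length")
    # XOR marks differing bits; counting the 1s gives the distance per byte.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


all_bits_differ = hamming_distance_sketch(b"\x00", b"\xff")
some_bits_differ = hamming_distance_sketch(b"\x0f\x00", b"\x00\x01")
```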
cosine_similarity ¶
Calculate the approximate cosine similarity based on Hamming distance for two bytes inputs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| a | | The first bytes object. | required |
| b | | The second bytes object. | required |
Returns:
| Type | Description |
|---|---|
| | The approximate cosine similarity between the two inputs, scaled from -100 to +100. |
Raises:
| Type | Description |
|---|---|
| ValueError | If a and b are not the same length. |
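One simple way to derive a similarity in the range -100 to +100 from a Hamming distance is the linear mapping 100 * (1 - 2d/n), where d is the distance and n the total number of bits. The sketch below uses that mapping as an assumption; the exact formula and rounding used by the library are not stated on this page, and `approx_cosine` is a hypothetical name.

```python
def approx_cosine(a: bytes, b: bytes) -> int:
    """Map the Hamming distance of two bit strings linearly onto -100..+100."""
    if len(a) != len(b):
        raise ValueError("a and b must be the same length")
    if not a:
        raise ValueError("inputs must not be empty")
    distance = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
    total_bits = len(a) * 8
    # 0 differing bits -> +100, half of the bits -> 0, all bits -> -100.
    return round(100 * (1 - 2 * distance / total_bits))


identical = approx_cosine(b"\xaa\x55", b"\xaa\x55")
opposite = approx_cosine(b"\x00", b"\xff")
one_bit_off = approx_cosine(b"\x01", b"\x00")
```

With this mapping, unrelated bit strings, which differ in about half of their bits, score near 0 rather than near -100.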
granular_similarity ¶
Compare simprints from two Metadata objects and return matching pairs above a similarity threshold. Only the most similar pair for each simprint_a is included.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| metadata_a | | The first Metadata object. | required |
| metadata_b | | The second Metadata object. | required |
| threshold | | The similarity threshold (0-100) above which simprints are considered a match. | 80 |
Returns:
| Type | Description |
|---|---|
| | A list of tuples containing matching simprints and their similarity. |
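The matching rule described above (keep only the best match for each simprint from the first input, and only if it clears the threshold) can be sketched on raw byte digests. This example does not use Metadata objects. The names `best_matches` and `_similarity`, the linear similarity formula, the tuple layout, and the strict "greater than" comparison at the threshold are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple


def _similarity(a: bytes, b: bytes) -> int:
    """Linear similarity in -100..+100 derived from the Hamming distance."""
    distance = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
    return round(100 * (1 - 2 * distance / (len(a) * 8)))


def best_matches(
    simprints_a: Sequence[bytes],
    simprints_b: Sequence[bytes],
    threshold: int = 80,
) -> List[Tuple[bytes, int, bytes]]:
    """For each simprint in A, keep its most similar partner in B above threshold."""
    matches = []
    for a in simprints_a:
        best: Optional[Tuple[bytes, int, bytes]] = None
        for b in simprints_b:
            score = _similarity(a, b)
            # Assumption: the threshold itself does not count as a match.
            if score > threshold and (best is None or score > best[1]):
                best = (a, score, b)
        if best is not None:
            matches.append(best)
    return matches


side_a = [b"\xff", b"\x00"]
side_b = [b"\xfe", b"\x0f"]
found = best_matches(side_a, side_b, threshold=70)
```

In this toy input only the first simprint of `side_a` has a partner scoring above 70, so the result has a single entry.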
Codecs and helpers¶
Encoders, decoders, and utilities.
encode_base32 ¶
Standard RFC4648 base32 encoding without padding.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | bytes | Data for base32 encoding | required |
Returns:
| Type | Description |
|---|---|
| | Base32 encoded str |
decode_base32 ¶
Standard RFC4648 base32 decoding without padding and with casefolding.
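Unpadded RFC 4648 base32 can be reproduced with the standard library alone, which is handy for checking values by hand. The sketch below is an equivalent built on Python's `base64` module, not the library's code; the `_sketch` suffixes mark the function names as hypothetical.

```python
import base64


def encode_base32_sketch(data: bytes) -> str:
    """RFC 4648 base32 with the trailing '=' padding removed."""
    return base64.b32encode(data).decode("ascii").rstrip("=")


def decode_base32_sketch(code: str) -> bytes:
    """Restore the padding, then decode; lowercase input is accepted."""
    padding = "=" * (-len(code) % 8)
    return base64.b32decode(code + padding, casefold=True)


encoded = encode_base32_sketch(b"hello")
decoded = decode_base32_sketch(encoded.lower())
```

Padding has to be restored before decoding because `base64.b32decode` rejects input whose length is not a multiple of eight characters.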
char_to_byte_offsets ¶
Efficiently convert character positions to byte positions in a single pass.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | | The input text | required |
| char_positions | | List of character positions to convert | required |
Returns:
| Type | Description |
|---|---|
| | List of corresponding byte positions |
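The single-pass idea can be sketched as follows: walk the text once, keep a running byte count, and record it whenever the current character index is one of the requested positions. The sketch assumes UTF-8 encoding and positions between 0 and `len(text)` inclusive; `char_to_byte_offsets_sketch` is a hypothetical name and the library's implementation may differ.

```python
from typing import Dict, List


def char_to_byte_offsets_sketch(text: str, char_positions: List[int]) -> List[int]:
    """Convert character indices to UTF-8 byte offsets with one pass over text."""
    wanted = set(char_positions)
    found: Dict[int, int] = {}
    byte_pos = 0
    for char_pos, char in enumerate(text):
        if char_pos in wanted:
            found[char_pos] = byte_pos
        byte_pos += len(char.encode("utf-8"))
    # A position equal to len(text) maps to the total byte length.
    if len(text) in wanted:
        found[len(text)] = byte_pos
    return [found[pos] for pos in char_positions]


# "a" is 1 byte, "é" is 2 bytes, "€" is 3 bytes, "b" is 1 byte in UTF-8.
byte_offsets = char_to_byte_offsets_sketch("aé€b", [0, 1, 2, 3, 4])
```

Character and byte offsets agree for pure ASCII text and drift apart as soon as multi-byte characters appear.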