Models

Data models extending sssom-schema for secondary-to-primary mappings.

Mapping Sets

class Sec2PriMappingSet(*args, _if_missing: Callable[[JsonObj, str], Tuple[bool, Any]] = None, **kwargs)

Bases: MappingSet

A MappingSet for Sec2Pri, with helpers for cardinality and export.

_primary_ids

Private store for the full authoritative primary ID set. Kept private so sssom serialisers never include it in any output. Access it only through to_pri_ids().

Type:

set[str]

Initialise the mapping set and the private primary-IDs store.

to_sssom(output_path: Path | str | None = None) -> sssom_document.MappingSetDocument

Return an SSSOM MappingSetDocument, optionally writing to TSV.

Parameters:

output_path – If given, the document is also serialised to an SSSOM TSV file at this path.

Returns:

sssom.sssom_document.MappingSetDocument for the mapping set.

to_rdf(output_path: Path | str | None = None, serialisation: str = 'turtle') -> rdflib.Graph

Return an RDFLib graph, optionally writing it to a file.

When output_path is given (or auto-generated via the save dispatcher), the graph is also serialised to disk. Either way the rdflib.Graph is returned so callers can query or manipulate it directly.

Parameters:
  • output_path – Destination path. If given, the graph is also serialised to this path; the save dispatcher auto-generates a path when none is supplied. To get only the in-memory graph without touching the filesystem, call to_rdf() with no arguments.

  • serialisation – RDFLib serialisation format (default: "turtle").

Returns:

rdflib.Graph containing all mappings as RDF triples.

to_json(output_path: Path | str | None = None) -> dict[str, Any]

Return the mapping set as a JSON-compatible dict, optionally writing to file.

Parameters:

output_path – If given, the JSON is also written to this path.

Returns:

dict representation of the mapping set in SSSOM JSON format.

to_owl(output_path: Path | str | None = None, serialisation: str = 'turtle') -> rdflib.Graph

Return an OWL rdflib.Graph, optionally writing to file.

Parameters:
  • output_path – If given, the graph is also serialised to this path.

  • serialisation – RDFLib serialisation format (default: "turtle").

Returns:

rdflib.Graph containing OWL axioms for the mapping set.

save(fmt: str, output_path: Path | str | None = None, **kwargs: object) -> Path

Write to any supported format by name.

Shared formats: "sssom", "rdf", "json", "owl". Subclasses override this to add type-specific formats.

Parameters:
  • fmt – Format key (see above).

  • output_path – Destination path. Auto-generated if None.

  • **kwargs – Forwarded to the format-specific writer.

Returns:

Path to the written file.

Raises:

ValueError – For unknown format keys.
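The dispatch-by-name behaviour described above can be sketched as follows. This is a minimal stand-in, not the library's implementation: BaseSet, IdSet, the _writers() table, the placeholder file contents and the auto-generated filename scheme are all assumptions made for illustration.

```python
from __future__ import annotations

import tempfile
from pathlib import Path
from typing import Callable


class BaseSet:
    """Hypothetical stand-in for Sec2PriMappingSet (not the real class)."""

    def _writers(self) -> dict[str, Callable[[Path], object]]:
        # Shared formats; each writer here only writes a placeholder line.
        return {
            fmt: (lambda path, fmt=fmt: path.write_text(f"{fmt} output\n"))
            for fmt in ("sssom", "rdf", "json", "owl")
        }

    def save(self, fmt: str, output_path: Path | str | None = None) -> Path:
        writers = self._writers()
        if fmt not in writers:
            raise ValueError(f"Unknown format {fmt!r}; expected one of {sorted(writers)}")
        # Assumed naming scheme for the auto-generated path.
        if output_path is None:
            path = Path(tempfile.gettempdir()) / f"mappings.{fmt}"
        else:
            path = Path(output_path)
        writers[fmt](path)
        return path


class IdSet(BaseSet):
    """Hypothetical subclass that adds a type-specific format."""

    def _writers(self) -> dict[str, Callable[[Path], object]]:
        writers = super()._writers()
        writers["sec2pri"] = lambda path: path.write_text("subject_id\tobject_id\n")
        return writers
```

Keeping the format table in one overridable method means a subclass only has to register its extra keys; the unknown-key check and path handling stay in the base class.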

class IdMappingSet(*args, _if_missing: Callable[[JsonObj, str], Tuple[bool, Any]] = None, **kwargs)

Bases: Sec2PriMappingSet

Mapping set for ID-based (secondary to primary identifier) mappings.

Initialise the mapping set and the private primary-IDs store.

compute_cardinalities() -> None

Compute cardinalities using subject_id and object_id fields.
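One way such a cardinality computation could work is sketched below. It is not the library's code: the function operates on plain (subject_id, object_id) tuples rather than SSSOM Mapping objects, the identifiers are made up, and the reading of the "1:n" / "n:1" labels (left side is "n" when several subjects share an object, right side is "n" when one subject has several objects) is an assumption.

```python
from collections import defaultdict


def compute_cardinalities(pairs):
    """Label each (subject, object) pair with an SSSOM-style cardinality string."""
    objects_of = defaultdict(set)   # subject -> distinct objects it maps to
    subjects_of = defaultdict(set)  # object -> distinct subjects mapping to it
    for subject, obj in pairs:
        objects_of[subject].add(obj)
        subjects_of[obj].add(subject)

    labels = []
    for subject, obj in pairs:
        left = "n" if len(subjects_of[obj]) > 1 else "1"
        right = "n" if len(objects_of[subject]) > 1 else "1"
        labels.append(f"{left}:{right}")
    return labels


# Made-up identifiers, for illustration only.
pairs = [
    ("EX:1", "EX:10"),  # one secondary, one primary
    ("EX:2", "EX:20"),  # EX:2 points at two primaries (a split)
    ("EX:2", "EX:21"),
    ("EX:3", "EX:30"),  # two secondaries share one primary (a merge)
    ("EX:4", "EX:30"),
]
print(compute_cardinalities(pairs))
```

A label-based variant would apply the same counting to subject_label and object_label instead of the ID fields.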

to_sec2pri(output_path: Path | str | None = None) -> pd.DataFrame

Return a DataFrame of secondary to primary ID mappings.

Columns: subject_id (secondary), object_id (primary), predicate_id, mapping_cardinality.

Parameters:

output_path – If given, the DataFrame is also written as a TSV file.

Returns:

pandas.DataFrame with one row per mapping.

to_pri_ids(output_path: Path | str | None = None) -> list[str]

Return a sorted list of unique primary IDs, optionally writing to TXT.

When _primary_ids is populated (e.g. from the HGNC complete set) that set is used. Otherwise primary IDs are derived from the unique object_id values in the mappings.

Parameters:

output_path – If given, the IDs are also written one-per-line to a text file.

Returns:

Sorted list of unique primary ID strings.
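The fallback rule above (prefer the authoritative set, otherwise derive IDs from the mappings) can be illustrated with a small sketch. The function name, the dict-based mappings and the example identifiers are assumptions; the real method works on the mapping set itself.

```python
def primary_ids(mappings, authoritative_ids=None):
    """Return a sorted list of unique primary IDs.

    `mappings` is a list of dicts with an "object_id" key (a stand-in for
    SSSOM Mapping objects); `authoritative_ids` plays the role of _primary_ids.
    """
    if authoritative_ids:
        # A populated authoritative set wins over anything seen in the mappings.
        return sorted(authoritative_ids)
    # Otherwise fall back to the unique, non-empty object_id values.
    return sorted({m["object_id"] for m in mappings if m.get("object_id")})


# Made-up identifiers, for illustration only.
mappings = [
    {"subject_id": "EX:9", "object_id": "EX:2"},
    {"subject_id": "EX:8", "object_id": "EX:1"},
    {"subject_id": "EX:7", "object_id": "EX:2"},
]
derived = primary_ids(mappings)
full = primary_ids(mappings, authoritative_ids={"EX:1", "EX:2", "EX:3"})
```

Here the derived list misses EX:3 because no secondary ID maps to it, which is why a complete primary set is preferred when one is available.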

save(fmt: str, output_path: Path | str | None = None, **kwargs: object) -> Path

Write to any supported format by name.

Formats: "sssom", "rdf", "json", "owl", "sec2pri", "pri_ids".

Parameters:
  • fmt – Format key (see above).

  • output_path – Destination path. Auto-generated if None.

  • **kwargs – Forwarded to the format-specific writer.

Returns:

Path to the written file.

Raises:

ValueError – For unknown format keys.

class LabelMappingSet(*args, _if_missing: Callable[[JsonObj, str], Tuple[bool, Any]] = None, **kwargs)

Bases: Sec2PriMappingSet

Mapping set for label-based (previous/alias symbol to current symbol) mappings.

Initialise the mapping set and the private primary-IDs store.

compute_cardinalities() -> None

Compute cardinalities using subject_label and object_label.

to_symbol_sec2pri(output_path: Path | str | None = None) -> pd.DataFrame

Return a DataFrame of previous/alias symbol to current symbol mappings.

Columns: subject_id, subject_label (secondary/previous symbol), object_id, object_label (primary/current symbol), predicate_id, mapping_cardinality.

Parameters:

output_path – If given, the DataFrame is also written as a TSV file.

Returns:

pandas.DataFrame with one row per symbol mapping.

to_pri_symbols(output_path: Path | str | None = None) -> list[str]

Return a sorted list of unique current/primary symbols, optionally writing to TXT.

Derived from the unique object_label values in the mappings.

Parameters:

output_path – If given, the symbols are also written one-per-line to a text file.

Returns:

Sorted list of unique primary symbol strings.

to_name2synonym(output_path: Path | str | None = None) -> pd.DataFrame

Return a name to synonym DataFrame, optionally writing to TSV.

Columns: subject_id, subject_label (primary name), object_label (synonym/previous name).

Parameters:

output_path – If given, the DataFrame is also written as a TSV file.

Returns:

pandas.DataFrame with label mapping rows.

save(fmt: str, output_path: Path | str | None = None, **kwargs: object) -> Path

Write to any supported format by name.

Formats: "sssom", "rdf", "json", "owl", "symbol_sec2pri" ("symbol2prev" is a deprecated alias), "pri_symbols", "name2synonym".

Parameters:
  • fmt – Format key (see above).

  • output_path – Destination path. Auto-generated if None.

  • **kwargs – Forwarded to the format-specific writer.

Returns:

Path to the written file.

Raises:

ValueError – For unknown format keys.
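A deprecated alias such as "symbol2prev" is commonly handled by translating it to the current key before dispatch. The sketch below shows that pattern under stated assumptions: resolve_format and both module-level tables are hypothetical names, and whether the library emits a DeprecationWarning for the alias is not confirmed by this page.

```python
import warnings

# Format keys listed for LabelMappingSet.save(); the alias table is an assumption.
_FORMATS = {"sssom", "rdf", "json", "owl", "symbol_sec2pri", "pri_symbols", "name2synonym"}
_DEPRECATED_ALIASES = {"symbol2prev": "symbol_sec2pri"}


def resolve_format(fmt: str) -> str:
    """Map a possibly deprecated format key to its current name."""
    if fmt in _DEPRECATED_ALIASES:
        current = _DEPRECATED_ALIASES[fmt]
        warnings.warn(
            f"Format {fmt!r} is deprecated; use {current!r} instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        fmt = current
    if fmt not in _FORMATS:
        raise ValueError(f"Unknown format {fmt!r}; expected one of {sorted(_FORMATS)}")
    return fmt
```

Resolving the alias first means the rest of the dispatcher only ever sees current keys.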

Configuration

class DatasourceConfig(name: str, prefix: str, curie_base_url: str, default_output_filename: str = '', available_outputs: list[str] = <factory>, download_urls: dict[str, Any] = <factory>, primary_file_key: str = '', id_pattern: str = '', archive_url: str = '', input_file_types: list[str] = <factory>, source: str = '', homepage: str = '', data_license: str = '', sparql_endpoint: str = '', queries: dict[str, str] = <factory>, new_format_version: int | None = None, mappingset_metadata: dict[str, Any] = <factory>, mapping_metadata: dict[str, Any] = <factory>)

Configuration for a biological database datasource loaded from YAML.
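The `<factory>` defaults in the signature indicate dataclass fields created with a default factory, so each instance gets its own list or dict. The sketch below reproduces a small subset of the fields to show the pattern; MiniConfig, the raw dict (standing in for parsed YAML) and all of its values are made up for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class MiniConfig:
    """Hypothetical subset of the DatasourceConfig fields."""

    name: str
    prefix: str
    curie_base_url: str
    default_output_filename: str = ""
    available_outputs: list[str] = field(default_factory=list)
    download_urls: dict[str, Any] = field(default_factory=dict)


# Stand-in for the mapping a YAML loader would return; values are invented.
raw = {
    "name": "example",
    "prefix": "EX",
    "curie_base_url": "https://example.org/id/",
    "available_outputs": ["sssom", "sec2pri"],
}
config = MiniConfig(**raw)
```

Only name, prefix and curie_base_url are required; every other field falls back to an empty string, list or dict.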

Constants

Pre-loaded datasource configurations.

Constants for supported datasources.

Base Parser

class BaseParser(version: str | None = None, show_progress: bool = True, config_name: str | None = None)

Abstract base class for all datasource parsers.

Each parser is responsible for reading files from a specific datasource and extracting secondary-to-primary identifier Mapping Sets.

Initialize the parser.

Parameters:
  • version – Version/release identifier for the datasource.

  • show_progress – Whether to show progress bars during parsing.

  • config_name – Name of config file to load (defaults to class name).

property config: DatasourceConfig | None

Get the loaded configuration.

get_download_url(key: str) -> str | None

Get a download URL from config by key.

get_curie_map() -> dict[str, str]

Get the CURIE map from config.

get_mappingset_metadata() -> dict[str, Any]

Get mapping set metadata from config.

get_mapping_metadata() -> dict[str, Any]

Get mapping metadata from config.

load_metadata(yaml_path: str) -> dict[str, Any]

Load metadata from a YAML config file.

apply_metadata_to_mappingset(mappingset: MappingSet, metadata: dict[str, Any]) -> None

Apply metadata to a MappingSet and its Mappings.

abstractmethod parse(input_path: Path | str | None) -> MappingSet

Parse the input file(s) and return a MappingSet.

Parameters:

input_path – Path to the input file or directory.

Returns:

A MappingSet containing all extracted mappings.
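Because parse() is abstract, a concrete parser must override it before it can be instantiated. The sketch below shows that contract with stand-ins: ParserBase and TsvParser are hypothetical, the two-column TSV layout is invented, and the method returns plain (secondary, primary) tuples where the real one returns a MappingSet.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from pathlib import Path


class ParserBase(ABC):
    """Hypothetical stand-in for BaseParser."""

    def __init__(self, version: str | None = None):
        self.version = version

    @abstractmethod
    def parse(self, input_path: Path | str | None) -> list[tuple[str, str]]:
        """Return (secondary, primary) pairs extracted from the input."""


class TsvParser(ParserBase):
    """Hypothetical parser for a two-column TSV: secondary ID, primary ID."""

    def parse(self, input_path: Path | str | None) -> list[tuple[str, str]]:
        if input_path is None:
            raise ValueError("input_path is required in this sketch")
        pairs = []
        for line in Path(input_path).read_text().splitlines():
            if not line.strip():
                continue  # skip blank lines
            secondary, primary = line.split("\t")[:2]
            pairs.append((secondary, primary))
        return pairs
```

Instantiating ParserBase directly raises TypeError, which is how the abstract method enforces that each datasource supplies its own parse().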

static normalize_withdrawn_id(subject_id: str | None) -> str

Normalize a primary ID, converting empty or null values to the withdrawn placeholder.

Parameters:

subject_id – The raw primary identifier from the source file.

Returns:

The normalized primary ID, or WITHDRAWN_ENTRY for empty values.

static is_withdrawn(identifier: str) -> bool

Return True if the identifier represents a withdrawn entry.

static is_withdrawn_primary(subject_id: str) -> bool

Check if a primary ID represents a withdrawn/deleted entry.

Parameters:

subject_id – The primary identifier to check.

Returns:

True if the primary ID indicates a withdrawn entry.
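The withdrawn-entry helpers follow a sentinel pattern: missing primary IDs are replaced with a marker that later checks can recognise. The sketch below assumes a placeholder value for WITHDRAWN_ENTRY (the real constant's value is not shown on this page) and assumes that whitespace-only input counts as empty.

```python
from __future__ import annotations

# Assumed placeholder; the library's actual WITHDRAWN_ENTRY value may differ.
WITHDRAWN_ENTRY = "WITHDRAWN"


def normalize_withdrawn_id(subject_id: str | None) -> str:
    """Return the ID unchanged, or the sentinel for empty/null input."""
    if subject_id is None or not subject_id.strip():
        return WITHDRAWN_ENTRY
    return subject_id.strip()


def is_withdrawn_primary(subject_id: str) -> bool:
    """Check whether a normalized primary ID is the withdrawn sentinel."""
    return subject_id == WITHDRAWN_ENTRY
```

Normalizing first keeps the check a simple equality test, so callers never have to distinguish None, empty strings and whitespace themselves.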

create_mapping_set(mappings: list[Mapping], mapping_type: str = 'id') -> Sec2PriMappingSet

Create an IdMappingSet or LabelMappingSet with config metadata.

Common factory method for creating mapping sets with all SSSOM metadata populated from the YAML config. It also computes cardinalities for mappings.

Parameters:
  • mappings – List of SSSOM Mapping objects.

  • mapping_type – "id" for IdMappingSet (cardinality by ID), "label" for LabelMappingSet (cardinality by label).

Returns:

MappingSet with computed cardinalities.
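The factory behaviour (pick the class from mapping_type, attach metadata, compute cardinalities) can be sketched with stand-in classes. Nothing here is the library's code: the stub classes, the metadata argument and the ValueError for an unrecognised mapping_type are assumptions.

```python
class _StubSet:
    """Hypothetical stand-in for Sec2PriMappingSet."""

    def __init__(self, mappings, **metadata):
        self.mappings = list(mappings)
        self.metadata = metadata
        self.cardinalities_computed = False

    def compute_cardinalities(self):
        # The real classes would label each mapping; the stub only records the call.
        self.cardinalities_computed = True


class IdSet(_StubSet):
    """Stand-in for IdMappingSet."""


class LabelSet(_StubSet):
    """Stand-in for LabelMappingSet."""


def create_mapping_set(mappings, mapping_type="id", metadata=None):
    """Build the mapping set matching `mapping_type` and compute cardinalities."""
    classes = {"id": IdSet, "label": LabelSet}
    if mapping_type not in classes:
        # Assumed behaviour; the documented method does not state what happens here.
        raise ValueError(f"Unknown mapping_type {mapping_type!r}")
    mapping_set = classes[mapping_type](mappings, **(metadata or {}))
    mapping_set.compute_cardinalities()
    return mapping_set
```

Centralising construction this way means every parser returns a set with the same metadata and with cardinalities already filled in.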