API

Main functions for pysec2pri.

This module provides functions for parsing secondary-to-primary mapping files from biological databases, and for generating and using standardized mapping sets.

class ContextSpec(kind: Literal['label', 'id', 'xref'], column: str, xref_mapping: XrefMapping | None = None, predicates: set[str] | None = None)

One source of per-row evidence used to disambiguate a flagged-ambiguous cell.

Parameters:

kind – "label" (alias/synonym string, matched via an alias index), "id" (a related identifier string, also matched via an alias index), or "xref" (a cross-reference token, matched via an XrefMapping crosswalk table).
column – Name of the DataFrame column carrying this evidence.
xref_mapping – Required when kind == "xref".
predicates – Accepted equivalence predicates for kind == "xref". None means no restriction (any predicate is accepted).
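The rule that xref_mapping is required when kind == "xref" can be pictured with a small stand-in dataclass. This is only a sketch: ContextSpecSketch, the plain dict used in place of an XrefMapping, the accepts() helper, and the choice of ValueError are assumptions for illustration, not pysec2pri's implementation.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Literal


@dataclass
class ContextSpecSketch:
    """Hypothetical stand-in mirroring the documented ContextSpec fields."""

    kind: Literal["label", "id", "xref"]
    column: str
    xref_mapping: dict[str, str] | None = None  # stand-in for XrefMapping
    predicates: set[str] | None = None  # None means any predicate is accepted

    def __post_init__(self) -> None:
        if self.kind not in ("label", "id", "xref"):
            raise ValueError(f"unknown kind: {self.kind!r}")
        # Assumption: a missing crosswalk for kind == "xref" is rejected early.
        if self.kind == "xref" and self.xref_mapping is None:
            raise ValueError('xref_mapping is required when kind == "xref"')

    def accepts(self, predicate: str) -> bool:
        """Return True when the predicate passes the optional restriction."""
        return self.predicates is None or predicate in self.predicates


label_spec = ContextSpecSketch(kind="label", column="alias")
xref_spec = ContextSpecSketch(
    kind="xref",
    column="xref",
    xref_mapping={"XREF:1": "EX:1"},
    predicates={"skos:exactMatch"},
)
```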

combine_mapping_sets(id_mappings: BaseMappingSet | None, synonym_mappings: BaseMappingSet | None) → BaseMappingSet

Combine two mapping sets into one.

Parameters:

id_mappings – First mapping set (e.g. ID mappings).
synonym_mappings – Second mapping set (e.g. synonym mappings).

Returns:

Combined mapping set.

Raises:

ValueError – If both mapping sets are None.
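The None handling can be sketched with plain lists of tuples. The function name combine_sketch and the tuple layout are hypothetical, and the behaviour when only one side is None (the other side is returned as-is) is an assumption; only the ValueError for two None arguments is documented above.

```python
from __future__ import annotations

# (subject_id, predicate_id, object_id); a hypothetical stand-in for a mapping
Mapping = tuple[str, str, str]


def combine_sketch(
    id_mappings: list[Mapping] | None,
    synonym_mappings: list[Mapping] | None,
) -> list[Mapping]:
    """Concatenate two optional mapping lists, rejecting two missing inputs."""
    if id_mappings is None and synonym_mappings is None:
        raise ValueError("at least one mapping set is required")
    # Assumption: a missing side contributes nothing; input order is kept.
    return list(id_mappings or []) + list(synonym_mappings or [])


ids = [("EX:old1", "IAO:0100001", "EX:1")]
synonyms = [("alpha", "oboInOwl:hasExactSynonym", "EX:1")]
combined = combine_sketch(ids, synonyms)
```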

find_ambiguous(mapping_set: BaseMappingSet) → AmbiguousMappingSet

Find identifiers that are ambiguous in mapping_set.

An identifier is ambiguous when it appears both as a subject_id (i.e. a secondary/previous term) and as a current primary identifier. Such entries cannot be automatically resolved without risk of corrupting references that are already current.

This is a convenience wrapper around find_ambiguous().

Parameters:

mapping_set – A BaseMappingSet (e.g. the result of generate_ids("hgnc")).

Returns:

An AmbiguousMappingSet whose mappings list contains one entry for each conflicting subject, with a comment explaining the conflict.
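The ambiguity rule described for find_ambiguous() amounts to a set intersection between subject identifiers and current primary identifiers. The sketch below uses (subject_id, object_id) tuples and an explicit primary_ids collection; both are hypothetical simplifications, as is the function name.

```python
def find_ambiguous_sketch(mappings, primary_ids):
    """Return subject ids that are also current primary ids, sorted."""
    subjects = {subject for subject, _object in mappings}
    return sorted(subjects & set(primary_ids))


# EX:1 was once retired in favour of EX:2, but is also listed as current.
mappings = [("EX:1", "EX:2"), ("EX:3", "EX:1")]
primary_ids = {"EX:1", "EX:2"}
ambiguous = find_ambiguous_sketch(mappings, primary_ids)  # -> ["EX:1"]
```

Rewriting every EX:1 to EX:2 here would corrupt references that already point at the current EX:1, which is why such entries are flagged rather than resolved.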

generate_ids(source: str, *, version: str | None = None, show_progress: bool = True, inputs: dict[str, Path | str] | None = None, consolidate: bool = False, cache_dir: Path | None = None, force: bool = False, **options: Any) → BaseMappingSet

Return source’s secondary-to-primary ID mappings.

Parameters:

source – Datasource name; see sources() for what is available.
version – Release to build. The latest is used when None.
show_progress – Whether to show progress bars.
inputs – Local input files keyed as in the config’s download_urls (e.g. {"withdrawn": "withdrawn.txt"}). Anything omitted is downloaded.
consolidate – Recover mappings the current release’s files no longer state, by walking the source’s historical releases, and stamp every mapping with the release it first appeared in. Slow and network-heavy; resumable via cache_dir. Only for sources whose releases carry such history (see supports_consolidate()).
cache_dir – Where to keep the resumable cross-release index. Defaults to $PYSEC2PRI_CACHE_DIR or ~/.cache/pysec2pri.
force – Re-walk every release, ignoring any resume state.
**options – Source-specific options, e.g. species for NCBI/VGNC/Ensembl, subset for ChEBI, entity_type for Wikidata. An option a source does not accept is ignored.

Returns:

An IdMappingSet of secondary -> primary identifier mappings.
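The idea behind consolidate (walking historical releases and stamping each mapping with the release it first appeared in) can be sketched in a few lines. This is a simplified illustration under stated assumptions: releases are already available in memory in ascending order, mappings are (secondary, primary) tuples, and the function name is hypothetical. The downloading, caching, and resume logic of the real option is not modelled.

```python
def consolidate_sketch(releases):
    """Map each mapping to the first release that stated it.

    releases: iterable of (version, mappings) pairs in ascending release order.
    """
    first_seen = {}
    for version, mappings in releases:
        for mapping in mappings:
            # setdefault keeps the earliest version and ignores later repeats.
            first_seen.setdefault(mapping, version)
    return first_seen


releases = [
    ("r1", {("EX:old1", "EX:1")}),
    ("r2", {("EX:old1", "EX:1"), ("EX:old2", "EX:2")}),
    ("r3", {("EX:old2", "EX:2")}),  # the current release no longer states EX:old1
]
history = consolidate_sketch(releases)
# -> {("EX:old1", "EX:1"): "r1", ("EX:old2", "EX:2"): "r2"}
```

The mapping for EX:old1 is recovered even though the latest release dropped it.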

generate_labels(source: str, *, version: str | None = None, show_progress: bool = True, inputs: dict[str, Path | str] | None = None, consolidate: bool = False, cache_dir: Path | None = None, force: bool = False, **options: Any) → BaseMappingSet

Return source’s previous/alias-to-current label mappings.

Parameters:

source – Datasource name; see sources("labels") for what is available.
version – Release to build. The latest is used when None.
show_progress – Whether to show progress bars.
inputs – Local input files keyed as in the config’s download_urls. Anything omitted is downloaded.
consolidate – Recover label changes the current release’s files no longer state, by walking historical releases, and stamp each with the release it first appeared in. See generate_ids().
cache_dir – Where to keep the resumable cross-release index.
force – Re-walk every release, ignoring any resume state.
**options – Source-specific options; see generate_ids().

Returns:

A LabelMappingSet of secondary -> primary label mappings.

generate_primary_ids(source: str, *, version: str | None = None, show_progress: bool = True, inputs: dict[str, Path | str] | None = None, **options: Any) → BaseMappingSet

Return a mapping set carrying only source’s full current-ID list.

The mappings list is empty; the set exists to drive to_pri_ids(). Use this to get the authoritative ID list without parsing the withdrawn file.

Parameters:

source – Datasource name; see sources("primary_ids").
version – Release to build. The latest is used when None.
show_progress – Whether to show progress bars.
inputs – Local input files keyed as in the config’s download_urls.
**options – Source-specific options; see generate_ids().

Returns:

A mapping set with no mappings and _primary_ids populated.

generate_primary_labels(source: str, *, version: str | None = None, show_progress: bool = True, inputs: dict[str, Path | str] | None = None, **options: Any) → BaseMappingSet

Return a mapping set carrying only source’s full current-label list.

Parameters:

source – Datasource name; see sources("primary_labels").
version – Release to build. The latest is used when None.
show_progress – Whether to show progress bars.
inputs – Local input files keyed as in the config’s download_urls.
**options – Source-specific options; see generate_ids().

Returns:

A mapping set with no mappings and _primary_labels populated.

list_versions(datasource: str) → Any

List all available archive versions for a datasource.

For datasources that publish versioned archives (ChEBI, HGNC, UniProt), this queries the remote archive index and returns all available version strings sorted in ascending order.

NCBI and HMDB do not maintain versioned archives; calling this function for those datasources raises ValueError.

Parameters:

datasource – Datasource name, one of "chebi", "hgnc", or "uniprot".

Returns:

Sorted list of version strings. The format depends on the datasource:

chebi: integer release numbers, e.g. ["200", ..., "245"]
hgnc: ISO dates, e.g. ["2023-01-01", ..., "2026-04-07"]
uniprot: release IDs, e.g. ["2024_01", "2024_02", ...]

Raises:

ValueError – If datasource is unknown or has no versioned archive.
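Because the three version formats differ, ascending order means something slightly different for each. The sketch below shows one way such strings could be ordered. The sort keys are an assumption about how "ascending" might be defined, not a description of what list_versions() does internally, and version_sort_key is a hypothetical helper.

```python
def version_sort_key(datasource: str, version: str) -> tuple:
    """Return a numeric sort key for a version string (hypothetical helper)."""
    if datasource == "chebi":
        return (int(version),)  # "99" must sort before "100"
    if datasource == "hgnc":
        return tuple(int(part) for part in version.split("-"))  # YYYY-MM-DD
    if datasource == "uniprot":
        return tuple(int(part) for part in version.split("_"))  # YYYY_NN
    raise ValueError(f"no versioned archive for datasource {datasource!r}")


chebi = sorted(["100", "99", "245"], key=lambda v: version_sort_key("chebi", v))
uniprot = sorted(["2024_02", "2023_05", "2024_01"],
                 key=lambda v: version_sort_key("uniprot", v))
latest_chebi = chebi[-1]  # last element of an ascending list is the newest
```

A plain string sort would place "100" before "99", which is why the ChEBI key converts to int.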

load_label_mapping(path: Path | str) → LabelMappingSet

Load a label mapping set from an SSSOM TSV file.

Produces the same LabelMappingSet a fresh parse would, ready to pass to resolve_labels().

Parameters:

path – Path to the SSSOM TSV file to load.

load_mapping(path: Path | str) → IdMappingSet

Load an ID mapping set from an SSSOM TSV file.

Produces the same IdMappingSet a fresh parse would, ready to pass to resolve_ids().

Parameters:

path – Path to the SSSOM TSV file to load.

load_xref_mapping(path: Path | str, *, subject_col: str = 'subject_id', object_col: str = 'object_id', object_label_col: str = 'object_label', predicate_col: str = 'predicate_id', sep: str | None = None) → XrefMapping

Load a crosswalk table as an XrefMapping.

Reads either a real SSSOM TSV (a #-prefixed metadata header is skipped automatically) or a plain subject/object table.

Parameters:

path – Path to the crosswalk file.
subject_col – Column with the cross-reference token.
object_col – Column with the target id.
object_label_col – Column with the target label (optional).
predicate_col – Column with the equivalence predicate (optional).
sep – Field delimiter.

Returns:

An XrefMapping with one XrefRecord per non-empty subject row.
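The reading behaviour described for load_xref_mapping() (skip the #-prefixed metadata header, keep one record per non-empty subject row, treat label and predicate columns as optional) can be approximated with the standard csv module. The function read_crosswalk and its dict records are hypothetical stand-ins for XrefMapping and XrefRecord, and the sample identifiers are made up.

```python
import csv


def read_crosswalk(text, subject_col="subject_id", object_col="object_id",
                   object_label_col="object_label",
                   predicate_col="predicate_id", sep="\t"):
    """Parse crosswalk text into a list of record dicts (illustrative only)."""
    # Drop "#"-prefixed SSSOM metadata lines before parsing the table.
    lines = [line for line in text.splitlines() if not line.startswith("#")]
    records = []
    for row in csv.DictReader(lines, delimiter=sep):
        subject = (row.get(subject_col) or "").strip()
        if not subject:
            continue  # rows without a cross-reference token yield no record
        records.append({
            "subject": subject,
            "object_id": row[object_col],
            # Optional columns: absent or empty values become None.
            "object_label": row.get(object_label_col) or None,
            "predicate": row.get(predicate_col) or None,
        })
    return records


sample = (
    "#mapping_set_id: https://example.org/crosswalk\n"
    "subject_id\tobject_id\tpredicate_id\n"
    "XREF:1\tEX:1\tskos:exactMatch\n"
    "\tEX:2\tskos:exactMatch\n"
)
records = read_crosswalk(sample)
```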

resolve_ids(input_path, mapping_set, ...)

Resolve secondary IDs to primary IDs.

Direct lookup: when input_path is a plain identifier string or a list of identifier strings (i.e. not a path to an existing file), the function returns the resolved primary ID(s). The at, output_path, suffix, and sep arguments are ignored in this mode:

resolve_ids("HMDB00001", hmdb_ms)  # -> "HMDB:HMDB0000001"
resolve_ids(["HMDB00001", "HMDB00002"], hmdb_ms)  # -> ["...", "..."]

DataFrame mode: when input_path points to an existing TSV/CSV file, at is required. The file is read with pandas.read_csv and for each column named in at a new column <col><suffix> is appended containing the resolved primary IDs. Identifiers not present in mapping_set are kept unchanged.

Parameters:

input_path – An identifier string, a list of identifier strings, or the path to a TSV/CSV file.
mapping_set – A BaseMappingSet (e.g. the result of generate_ids("hgnc")).
at – Column name(s) to resolve. Required in DataFrame mode; ignored in direct-lookup mode.
output_path – If given, the resulting DataFrame is written to this path (DataFrame mode only).
suffix – Suffix appended to each resolved column name (default "_primary").
sep – Delimiter for reading the file. Inferred from the extension when None ("\t" for .tsv, "," otherwise).
xref – DataFrame mode only. Column with a per-row cross-reference token, passed through to update_ids().
xref_mapping – The XrefMapping crosswalk table to resolve xref tokens against. Required when xref is given.
report_path – When given, every disambiguation attempt (from synonyms and/or xref) is logged to this TSV.

Returns:

A resolved identifier string, a list of resolved strings (direct-lookup mode), or a pandas.DataFrame with one additional column per entry in at (DataFrame mode).
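The dispatch between direct lookup and file mode, together with the delimiter inference, can be sketched as follows. To stay dependency-free the sketch uses a plain dict as the lookup and returns a list of row dicts instead of a pandas.DataFrame; resolve_sketch and infer_sep are hypothetical names, and disambiguation via xref is left out.

```python
import csv
from pathlib import Path


def infer_sep(path, sep=None):
    """Use the explicit delimiter, else tab for .tsv and comma otherwise."""
    if sep is not None:
        return sep
    return "\t" if Path(path).suffix == ".tsv" else ","


def resolve_sketch(value, lookup, at=None, suffix="_primary", sep=None):
    """Resolve a string, a list of strings, or columns of an existing file."""
    if isinstance(value, list):
        return [lookup.get(item, item) for item in value]
    if not Path(value).is_file():
        # Direct lookup: unknown identifiers are returned unchanged.
        return lookup.get(value, value)
    if at is None:
        raise ValueError("at is required when input_path is a file")
    columns = [at] if isinstance(at, str) else list(at)
    with open(value, newline="") as handle:
        rows = list(csv.DictReader(handle, delimiter=infer_sep(value, sep)))
    for row in rows:
        for column in columns:
            # Append <col><suffix>; the original column is left untouched.
            row[column + suffix] = lookup.get(row[column], row[column])
    return rows


lookup = {"EX:old1": "EX:1"}
single = resolve_sketch("EX:old1", lookup)
several = resolve_sketch(["EX:old1", "EX:9"], lookup)
```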

resolve_labels(input_path, mapping_set, ...)

Resolve previous/alias labels to current labels.

Direct lookup: when input_path is a plain label string or a list of label strings (i.e. not a path to an existing file), the function returns the resolved current label(s). The at, output_path, suffix, and sep arguments are ignored in this mode:

resolve_labels("Ibuprofen", chebi_ms)  # -> "ibuprofen"
resolve_labels(["Ibuprofen", "Glucose"], chebi_ms)  # -> ["...", "..."]

DataFrame mode: when input_path points to an existing TSV/CSV file, at is required. For each column named in at a new column <col><suffix> is appended containing the resolved current labels. Labels not present in mapping_set are kept unchanged.

Parameters:

input_path – A label string, a list of label strings, or the path to a TSV/CSV file.
mapping_set – A LabelMappingSet (e.g. the result of generate_labels("hgnc")).
at – Column name(s) to resolve. Required in DataFrame mode; ignored in direct-lookup mode.
output_path – If given, the resulting DataFrame is written to this path (DataFrame mode only).
suffix – Suffix appended to each resolved column name (default "_current").
sep – Delimiter for reading the file. Inferred from the extension when None ("\t" for .tsv, "," otherwise).
xref – DataFrame mode only. Column with a per-row cross-reference token, passed through to update_labels().
xref_mapping – The XrefMapping crosswalk table to resolve xref tokens against. Required when xref is given.
report_path – When given, every disambiguation attempt (from synonyms and/or xref) is logged to this TSV.

Returns:

A resolved label string, a list of resolved strings (direct-lookup mode), or a pandas.DataFrame with one additional column per entry in at (DataFrame mode).

save(mapping_set: BaseMappingSet, output_format: str, output: Path | str | None = None, *, base_name: str) → Path

Write mapping_set and return the path that was written.

Delegates to save() for single formats and write_all_formats() for "all".

Parameters:

mapping_set – The mapping set to write.
output_format – One of sssom, sec2pri, pri_ids, name2synonym, label_sec2pri, pri_labels, rdf, json, owl, or all.
output – Explicit output path or directory. When None, a default name derived from base_name is used.
base_name – Stem used to derive file names, e.g. "hgnc_2026-04-07".

Returns:

The directory (for "all") or file path that was written.

sources(kind: str | None = None) → list[str]

Return the datasources the config files declare.

Parameters:

kind – Restrict to sources declaring this mapping-set kind, e.g. "labels". None returns every source.

Returns:

Sorted datasource names accepted by generate_ids() and friends.

supports_consolidate(source: str, kind: str = 'ids') → bool

Whether source can recover extra history for kind (see --consolidate).

write_all_formats(mapping_set: BaseMappingSet, output_dir: Path, base_name: str, include_name2synonym: bool = True) → None

Write mapping set in all output formats to a directory.

Parameters:

mapping_set – The mapping set to write.
output_dir – Directory to write files to.
base_name – Base name for output files (e.g., "chebi_3star_245").
include_name2synonym – Whether to include name2synonym format.

write_diff_output(result: MappingDiff, output_path: Path) → None

Write diff results to a TSV file.

Parameters:

result – MappingDiff object with added/removed/changed mappings.
output_path – Path to write the TSV file.
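What "added/removed/changed" could mean for two versions of a secondary-to-primary table is sketched below. MappingDiffSketch and diff_sketch are hypothetical stand-ins built on plain dicts; the actual MappingDiff fields and how pysec2pri computes them are not documented here, so treat this purely as an illustration of the three categories.

```python
from dataclasses import dataclass, field


@dataclass
class MappingDiffSketch:
    """Hypothetical container for the three diff categories."""

    added: dict = field(default_factory=dict)    # secondary -> new primary
    removed: dict = field(default_factory=dict)  # secondary -> old primary
    changed: dict = field(default_factory=dict)  # secondary -> (old, new)


def diff_sketch(old, new):
    """Classify secondary ids by comparing two secondary -> primary dicts."""
    result = MappingDiffSketch()
    for secondary, primary in new.items():
        if secondary not in old:
            result.added[secondary] = primary
        elif old[secondary] != primary:
            result.changed[secondary] = (old[secondary], primary)
    for secondary, primary in old.items():
        if secondary not in new:
            result.removed[secondary] = primary
    return result


old = {"EX:old1": "EX:1", "EX:old2": "EX:2", "EX:old3": "EX:3"}
new = {"EX:old1": "EX:1", "EX:old2": "EX:20", "EX:old4": "EX:4"}
diff = diff_sketch(old, new)
```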

write_json(mapping_set: BaseMappingSet, output_path: Path | str) → Path

Write a mapping set to an SSSOM JSON file.

Parameters:

mapping_set – The mapping set to write.
output_path – Destination file path (e.g. mappings.json).

Returns:

Path to the written file.

write_label_sec2pri(mapping_set: BaseMappingSet, output_path: Path | str) → Path

Write the full previous/alias-label to current-label table to a TSV file.

Every mapping row is written (both deprecation and synonym predicates). Columns: secondary_id, secondary_label, primary_id, primary_label, predicate_id, mapping_cardinality.

Parameters:

mapping_set – The mapping set to write.
output_path – Destination file path (e.g. label_sec2pri.tsv).

Returns:

Path to the written file.

write_name2synonym(mapping_set: BaseMappingSet, output_path: Path | str) → Path

Write name to synonym mappings to a TSV file.

Only oboInOwl:hasExactSynonym rows are written; deprecation rows (IAO:0100001) are excluded because they belong in the label2prev output. Columns: primary_id, name, synonym.

Parameters:

mapping_set – The mapping set to write.
output_path – Destination file path (e.g. name2synonym.tsv).

Returns:

Path to the written file.
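The predicate filter described for write_name2synonym() can be illustrated with the standard csv module. The sketch renders to a string instead of a file, and the input row dicts and the function name are assumptions; only the predicate names and the three output columns come from the description above.

```python
import csv
import io

EXACT_SYNONYM = "oboInOwl:hasExactSynonym"


def name2synonym_tsv(rows):
    """Render exact-synonym rows as TSV text with the documented columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["primary_id", "name", "synonym"])
    for row in rows:
        if row["predicate_id"] != EXACT_SYNONYM:
            continue  # deprecation rows (IAO:0100001) are left out
        writer.writerow([row["primary_id"], row["name"], row["synonym"]])
    return buffer.getvalue()


rows = [
    {"primary_id": "EX:1", "name": "alpha", "synonym": "A",
     "predicate_id": "oboInOwl:hasExactSynonym"},
    {"primary_id": "EX:1", "name": "alpha", "synonym": "alfa",
     "predicate_id": "IAO:0100001"},
]
text = name2synonym_tsv(rows)
```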

write_output(mapping_set: BaseMappingSet, output_format: str, output_path: Path | str) → Path

Write a mapping set in any registered output format.

Parameters:

mapping_set – The mapping set to write.
output_format – Format name (must be a key in WRITERS).
output_path – Path to write to.

Returns:

Path to the written file.

Raises:

ValueError – If output_format is not recognized.

write_owl(mapping_set: BaseMappingSet, output_path: Path | str, serialisation: str = 'turtle') → Path

Write a mapping set to an OWL/RDF file (default: Turtle).

Parameters:

mapping_set – The mapping set to write.
output_path – Destination file path (e.g. mappings_owl.ttl).
serialisation – RDFLib serialisation format.

Returns:

Path to the written file.

write_rdf(mapping_set: BaseMappingSet, output_path: Path | str, serialisation: str = 'turtle') → Path

Write a mapping set to an RDF file.

Parameters:

mapping_set – The mapping set to write.
output_path – Destination file path (e.g. mappings.ttl).
serialisation – RDFLib serialisation format.

Returns:

Path to the written file.

write_sec2pri(mapping_set: BaseMappingSet, output_path: Path | str) → Path

Write secondary to primary ID mappings to a TSV file.

Columns: primary_id (object_id), secondary_id (subject_id), predicate_id, mapping_cardinality.

Parameters:

mapping_set – The mapping set to write.
output_path – Destination file path (e.g. sec2pri.tsv).

Returns:

Path to the written file.

write_sssom(mapping_set: BaseMappingSet, output_path: Path | str) → Path

Write a mapping set to an SSSOM TSV file.

Parameters:

mapping_set – The mapping set to write.
output_path – Destination .sssom.tsv file path.

Returns:

Path to the written file.