API

Functions for parsing biological databases and generating SSSOM-compliant mappings.

All parsing functions return Sec2PriMappingSet objects for integration with the SSSOM ecosystem.

Main functions for pysec2pri.

This module provides functions for parsing secondary-to-primary mapping files from biological databases, and for generating and using standardized mapping sets.

combine_mapping_sets(id_mappings: Sec2PriMappingSet | None, synonym_mappings: Sec2PriMappingSet | None) Sec2PriMappingSet

Combine two mapping sets into one.

Parameters:
  • id_mappings – First mapping set (e.g. ID mappings).

  • synonym_mappings – Second mapping set (e.g. synonym mappings).

Returns:

Combined mapping set.

Raises:

ValueError – If both mapping sets are None.
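
The None handling described above can be sketched with plain lists of (subject, object) pairs standing in for mapping sets. The helper name combine and the EX: identifiers are hypothetical, and returning the non-None input unchanged when only one set is given is an assumption; the source only states that ValueError is raised when both are None.

```python
from __future__ import annotations


def combine(first: list[tuple[str, str]] | None,
            second: list[tuple[str, str]] | None) -> list[tuple[str, str]]:
    """Concatenate two optional lists of (subject, object) pairs."""
    if first is None and second is None:
        raise ValueError("At least one mapping set must be provided")
    # Assumption: a missing set is treated as empty.
    return list(first or []) + list(second or [])


id_pairs = [("EX:old1", "EX:new1")]            # hypothetical ID mappings
synonym_pairs = [("old name", "new name")]     # hypothetical synonym mappings
combined = combine(id_pairs, synonym_pairs)
only_ids = combine(id_pairs, None)
```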

find_ambiguous(mapping_set: Sec2PriMappingSet) AmbiguousMappingSet

Find identifiers that are ambiguous in mapping_set.

An identifier is ambiguous when it appears both as a subject_id (i.e. a secondary/previous term) and as a current primary identifier. Such entries cannot be automatically resolved without risk of corrupting references that are already current.

This is a convenience wrapper around find_ambiguous().

Parameters:

mapping_set – A Sec2PriMappingSet (e.g. the result of generate_hgnc()).

Returns:

An AmbiguousMappingSet whose mappings list contains one entry for each conflicting subject, with a comment explaining the conflict.
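
The ambiguity rule can be illustrated without the library: an identifier is flagged when it occurs as a subject (secondary) and is also in the set of current primaries. This is a minimal sketch with hypothetical data and a hypothetical helper name (ambiguous_ids); it does not reproduce the AmbiguousMappingSet structure or its comments.

```python
def ambiguous_ids(pairs, primary_ids):
    """Return secondary IDs that are also listed as current primary IDs."""
    secondaries = {subject for subject, _ in pairs}
    return sorted(secondaries & set(primary_ids))


# Hypothetical (secondary, primary) pairs and current primary IDs.
pairs = [("X:1", "X:2"), ("X:3", "X:4")]
primary_ids = {"X:2", "X:3", "X:4"}

# X:3 is both a secondary (replaced by X:4) and a current primary.
conflicts = ambiguous_ids(pairs, primary_ids)
```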

generate_chebi(input_path: Path | str | None = None, version: str | None = None, show_progress: bool = True, subset: str = '3star', mapping_sets: str = 'ids') Sec2PriMappingSet

Return ChEBI mappings (IDs, synonyms, or both).

Downloads the latest release automatically when input_path is omitted. Pass an SDF file (releases < 245) or a directory of TSV flat files (releases >= 245) to use a local copy.

Parameters:
  • input_path – Local SDF file or TSV directory. Auto-downloaded if None.

  • version – Release number (e.g. "245").

  • show_progress – Whether to show progress bars.

  • subset – "3star" (default) or "complete".

  • mapping_sets – "ids" (default), "synonyms", or "all".
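
The release threshold and the allowed argument values can be captured in a small pre-flight check. This is a sketch only: the helper names are hypothetical, and whether generate_chebi() itself raises ValueError for an unrecognized subset or mapping_sets value is an assumption not stated above.

```python
SUBSETS = ("3star", "complete")
MAPPING_SETS = ("ids", "synonyms", "all")
TSV_FROM_RELEASE = 245  # releases >= 245 ship TSV flat files, earlier ones SDF


def chebi_input_kind(release: str) -> str:
    """Say which kind of local input a given ChEBI release number expects."""
    return "tsv_directory" if int(release) >= TSV_FROM_RELEASE else "sdf_file"


def check_choice(value: str, allowed: tuple, name: str) -> str:
    """Reject values outside the documented options (assumed behaviour)."""
    if value not in allowed:
        raise ValueError(f"{name} must be one of {allowed}, got {value!r}")
    return value


kind_old = chebi_input_kind("244")
kind_new = chebi_input_kind("245")
subset = check_choice("3star", SUBSETS, "subset")
```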

generate_chebi_primary_ids(input_path: Path | str | None = None, version: str | None = None, show_progress: bool = True, subset: str = '3star') Sec2PriMappingSet

Return a mapping set containing the full list of current ChEBI primary IDs.

Reads compounds.tsv to extract every current ChEBI compound ID. The returned mapping set has an empty mappings list; _primary_ids is populated with every current CHEBI:<n> CURIE.

Parameters:
  • input_path – Local compounds.tsv file or directory containing it. Auto-downloaded if None.

  • version – Release number (e.g. "245").

  • show_progress – Whether to show progress bars.

  • subset – "3star" (default) or "complete".

generate_chebi_primary_symbols(input_path: Path | str | None = None, version: str | None = None, show_progress: bool = True, subset: str = '3star') Sec2PriMappingSet

Return a mapping set containing the full list of current ChEBI compound names.

Reads compounds.tsv to extract every current compound’s canonical name. The returned mapping set has an empty mappings list; _primary_symbols is populated.

Parameters:
  • input_path – Local compounds.tsv file or directory containing it. Auto-downloaded if None.

  • version – Release number (e.g. "245").

  • show_progress – Whether to show progress bars.

  • subset – "3star" (default) or "complete".

generate_chebi_synonyms(input_path: Path | str | None = None, version: str | None = None, show_progress: bool = True, subset: str = '3star') Sec2PriMappingSet

Return ChEBI synonym (name) mappings.

generate_hgnc(input_path: Path | str | None = None, complete_set_path: Path | str | None = None, version: str | None = None, show_progress: bool = True) Sec2PriMappingSet

Return HGNC secondary to primary ID mappings.

Downloads the withdrawn and complete set files automatically when input_path / complete_set_path are omitted. The complete set is used to populate the full list of current primary IDs so that to_pri_ids() returns the authoritative list (~45 k IDs) rather than just the ~5 k primaries that happen to have a secondary.

Parameters:
  • input_path – Local HGNC withdrawn TSV. Auto-downloaded if None.

  • complete_set_path – Local HGNC complete set TSV. Auto-downloaded if None.

  • version – Version string for metadata.

  • show_progress – Whether to show progress bars.
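
Why the complete set matters can be shown with toy data: primaries derived only from the mapping objects miss every current ID that never had a secondary. The identifiers below are invented for illustration and the variable names are hypothetical; this is not how the library stores its data.

```python
# Hypothetical (withdrawn ID, current ID) pairs from a withdrawn file.
pairs = [("EX:100", "EX:1"), ("EX:101", "EX:1"), ("EX:102", "EX:2")]

# Hypothetical list of all current IDs from a complete-set file.
complete_set = ["EX:1", "EX:2", "EX:3", "EX:4"]

# Primaries visible from the mappings alone.
primaries_from_mappings = sorted({obj for _, obj in pairs})

# Current IDs that would be lost without reading the complete set.
missing = sorted(set(complete_set) - set(primaries_from_mappings))
```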

generate_hgnc_primary_ids(input_path: Path | str | None = None, version: str | None = None, show_progress: bool = True) Sec2PriMappingSet

Return a mapping set containing the full list of current HGNC primary IDs.

Only the HGNC complete set file is downloaded/read. The returned mapping set has an empty mappings list; its _primary_ids store is populated with every current HGNC ID so that to_pri_ids() produces the authoritative complete list, not just the subset of primaries that happen to have an associated secondary.

Parameters:
  • input_path – Local HGNC complete set TSV. Auto-downloaded if None.

  • version – Version string for metadata.

  • show_progress – Whether to show progress bars.

generate_hgnc_symbols(input_path: Path | str | None = None, version: str | None = None, show_progress: bool = True, statuses: list[str] | None = None) Sec2PriMappingSet

Return HGNC symbol to previous-symbol mappings.

Downloads the complete set file automatically when input_path is omitted.

Parameters:
  • input_path – Local HGNC complete set TSV. Auto-downloaded if None.

  • version – Version string for metadata.

  • show_progress – Whether to show progress bars.

  • statuses – Entry statuses to include (e.g. ["Approved"]).

generate_hmdb(input_path: Path | str | None = None, version: str | None = None, show_progress: bool = True) Sec2PriMappingSet

Return HMDB metabolite secondary to primary accession mappings.

Downloads hmdb_metabolites.xml automatically when input_path is omitted.

Parameters:
  • input_path – Local hmdb_metabolites.xml (or .zip/.gz). Auto-downloaded if None.

  • version – Version string for metadata.

  • show_progress – Whether to show progress bars.

generate_hmdb_primary_ids(metabolites_path: Path | str | None = None, proteins_path: Path | str | None = None, version: str | None = None, show_progress: bool = True) Sec2PriMappingSet

Return a mapping set containing the full list of current HMDB primary IDs.

Reads one or both of hmdb_metabolites.xml and hmdb_proteins.xml and collects all primary accession numbers. The returned mapping set has an empty mappings list; _primary_ids is populated with every current HMDB:<acc> CURIE.

Parameters:
  • metabolites_path – Local metabolites XML file. Auto-downloaded if both paths are None.

  • proteins_path – Local proteins XML file (optional).

  • version – Version string for metadata.

  • show_progress – Whether to show progress bars.

generate_hmdb_proteins(input_path: Path | str | None = None, version: str | None = None, show_progress: bool = True) Sec2PriMappingSet

Return HMDB protein secondary to primary accession mappings.

Downloads hmdb_proteins.xml automatically when input_path is omitted.

Parameters:
  • input_path – Local hmdb_proteins.xml (or .zip/.gz). Auto-downloaded if None.

  • version – Version string for metadata.

  • show_progress – Whether to show progress bars.

generate_ncbi(input_path: Path | str | None = None, gene_info_path: Path | str | None = None, tax_id: str = '9606', version: str | None = None, show_progress: bool = True) Sec2PriMappingSet

Return NCBI Gene secondary to primary ID mappings.

Downloads the gene_history file automatically when input_path is omitted. When gene_info_path is supplied (or auto-downloaded), the full list of current primary IDs is read from gene_info and stored in _primary_ids, so that to_pri_ids() returns the authoritative complete set rather than only the subset of primaries that happen to appear in gene_history.

Parameters:
  • input_path – Local gene_history file. Auto-downloaded if None.

  • gene_info_path – Local gene_info file used to populate the full primary ID list. Auto-downloaded together with input_path when both are None.

  • tax_id – NCBI taxonomy ID to filter (default: "9606" for human).

  • version – Version string for metadata.

  • show_progress – Whether to show progress bars.

generate_ncbi_primary_ids(input_path: Path | str | None = None, tax_id: str = '9606', version: str | None = None, show_progress: bool = True) Sec2PriMappingSet

Return a mapping set containing the full list of current NCBI Gene primary IDs.

Reads gene_info to extract every current Gene ID for the given taxonomy. The returned mapping set has an empty mappings list; _primary_ids is populated with every current NCBIGene:<id> CURIE.

Parameters:
  • input_path – Local gene_info file. Auto-downloaded if None.

  • tax_id – Taxonomy ID to filter by (default: "9606" for human).

  • version – Version string for metadata.

  • show_progress – Whether to show progress bars.

generate_ncbi_primary_symbols(input_path: Path | str | None = None, tax_id: str = '9606', version: str | None = None, show_progress: bool = True) Sec2PriMappingSet

Return a mapping set containing the full list of current NCBI Gene symbols.

Reads gene_info to extract every current gene symbol for the given taxonomy. The returned mapping set has an empty mappings list; _primary_symbols is populated.

Parameters:
  • input_path – Local gene_info file. Auto-downloaded if None.

  • tax_id – Taxonomy ID to filter by (default: "9606" for human).

  • version – Version string for metadata.

  • show_progress – Whether to show progress bars.

generate_ncbi_symbols(input_path: Path | str | None = None, tax_id: str = '9606', version: str | None = None, show_progress: bool = True) Sec2PriMappingSet

Return NCBI Gene symbol to previous-symbol mappings.

Downloads the gene_info file automatically when input_path is omitted.

Parameters:
  • input_path – Local gene_info file. Auto-downloaded if None.

  • tax_id – NCBI taxonomy ID to filter (default: "9606" for human).

  • version – Version string for metadata.

  • show_progress – Whether to show progress bars.

generate_uniprot(input_path: Path | str | None = None, delac_file: Path | str | None = None, version: str | None = None, show_progress: bool = True) Sec2PriMappingSet

Return UniProt secondary to primary accession mappings.

Downloads sec_ac.txt and delac_sp.txt automatically when input_path is omitted.

Parameters:
  • input_path – Local sec_ac.txt. Auto-downloaded if None.

  • delac_file – Local delac_sp.txt.

  • version – Version string for metadata.

  • show_progress – Whether to show progress bars.

generate_uniprot_primary_ids(acindex_path: Path | str | None = None, version: str | None = None, show_progress: bool = True) Sec2PriMappingSet

Return a mapping set containing the full list of current UniProt primary ACs.

Parses acindex.txt to extract every accession number that currently appears in UniProtKB/Swiss-Prot. The returned mapping set has an empty mappings list; _primary_ids is populated with every current UniProtKB:<AC> CURIE.

For versioned (legacy) releases the file is available at:

https://ftp.uniprot.org/pub/databases/uniprot/previous_releases/release-{version}/knowledgebase/docs/acindex.txt.gz

Parameters:
  • acindex_path – Local acindex.txt (plain or .gz). Auto-downloaded from the current release when None.

  • version – Version string for metadata.

  • show_progress – Whether to show progress bars.
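
For a legacy release, the archive location follows the URL template quoted above. A minimal sketch of filling it in; the helper name legacy_acindex_url is hypothetical, and "2024_01" is used only as a sample release ID.

```python
BASE = "https://ftp.uniprot.org/pub/databases/uniprot/previous_releases/"


def legacy_acindex_url(version: str) -> str:
    """Fill the documented URL template for a versioned UniProt release."""
    return f"{BASE}release-{version}/knowledgebase/docs/acindex.txt.gz"


url = legacy_acindex_url("2024_01")
```

The downloaded file could then be passed as acindex_path, since that parameter accepts a .gz file.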

generate_wikidata(input_path: Path | str | None = None, entity_type: str | None = None, version: str | None = None, endpoint: str | None = None, show_progress: bool = True, test_subset: bool = False) Sec2PriMappingSet

Return Wikidata redirect mappings via SPARQL (or a pre-downloaded TSV).

Queries the QLever Wikidata endpoint when input_path is omitted. If entity_type is None, all entity types (metabolites, genes, proteins) are queried and combined.

Parameters:
  • input_path – Pre-downloaded TSV file. Queries SPARQL if None.

  • entity_type – "metabolites", "chemicals", "genes", or "proteins". Queries all types when None.

  • version – Version string for metadata (defaults to today’s date).

  • endpoint – Custom SPARQL endpoint URL.

  • show_progress – Whether to show progress bars.

  • test_subset – Use test queries limited to 10 results.

generate_wikidata_symbols(input_path: Path | str | None = None, entity_type: str | None = None, version: str | None = None, endpoint: str | None = None, show_progress: bool = True, test_subset: bool = False) LabelMappingSet

Return Wikidata label mappings (previous label to current label).

Queries the QLever Wikidata endpoint when input_path is omitted. If entity_type is None, all entity types are queried and their label mappings combined.

Parameters:
  • input_path – Pre-downloaded TSV file. Queries SPARQL if None.

  • entity_type – "metabolites", "chemicals", "genes", or "proteins". Queries all types when None.

  • version – Version string for metadata.

  • endpoint – Custom SPARQL endpoint URL.

  • show_progress – Whether to show progress bars.

  • test_subset – Use test queries limited to 10 results.

Returns:

LabelMappingSet with label mappings.

list_versions(datasource: str) Any

List all available archive versions for a datasource.

For datasources that publish versioned archives (ChEBI, HGNC, UniProt), this queries the remote archive index and returns all available version strings sorted in ascending order.

NCBI and HMDB do not maintain versioned archives; calling this function for those datasources raises ValueError.

Parameters:

datasource – Datasource name, one of "chebi", "hgnc", or "uniprot".

Returns:

Sorted list of version strings. The format depends on the datasource:

  • chebi: integer release numbers, e.g. ["200", ..., "245"]

  • hgnc: ISO dates, e.g. ["2023-01-01", ..., "2026-04-07"]

  • uniprot: release IDs, e.g. ["2024_01", "2024_02", ...]

Raises:

ValueError – If datasource is unknown or has no versioned archive.
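
Two consequences of this contract are easy to sketch on the caller's side: unsupported datasources are rejected with ValueError, and because the list is ascending, the newest version is the last element. The helper names and the sample version list are hypothetical; no remote index is queried here.

```python
VERSIONED = ("chebi", "hgnc", "uniprot")


def require_versioned(datasource: str) -> str:
    """Mirror the documented error for datasources without an archive."""
    if datasource not in VERSIONED:
        raise ValueError(f"No versioned archive for datasource {datasource!r}")
    return datasource


def latest(versions: list) -> str:
    """With an ascending list, the newest version is the last entry."""
    return versions[-1]


# Made-up sample in the documented HGNC format (ISO dates).
hgnc_versions = ["2023-01-01", "2024-06-04", "2026-04-07"]
newest = latest(hgnc_versions)
```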

load_label_mapping(path: Path | str) LabelMappingSet

Load a label/symbol mapping set from a pysec2pri TSV file.

Accepts the symbol2prev TSV format (columns subject_id, subject_label, object_label, mapping_cardinality) and the full SSSOM TSV format (comment-prefixed metadata lines are skipped).

Parameters:

path – Path to the TSV file to load.

Returns:

A LabelMappingSet populated from the file, ready to pass to resolve_symbols().

load_mapping(path: Path | str) IdMappingSet

Load an ID mapping set from a pysec2pri TSV file.

Accepts the sec2pri TSV format (columns subject_id, object_id, predicate_id, mapping_cardinality) and the full SSSOM TSV format (comment-prefixed metadata lines are skipped automatically).

Parameters:

path – Path to the TSV file to load.

Returns:

An IdMappingSet populated from the file, ready to pass to resolve_ids().
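
The file layout described above can be read with the standard library alone. The sketch below assumes metadata lines start with "#", as in SSSOM TSV files, and builds a plain dict rather than an IdMappingSet; the sample rows, the EX: identifiers, and the helper name read_pairs are all hypothetical.

```python
import csv
import io

# Hypothetical file content: one metadata comment line, a header, one row.
SAMPLE = (
    "# mapping_set_id: https://example.org/demo\n"
    "subject_id\tobject_id\tpredicate_id\tmapping_cardinality\n"
    "EX:old1\tEX:new1\tEX:replaced_by\t1:1\n"
)


def read_pairs(handle):
    """Map subject_id to object_id, skipping comment-prefixed lines."""
    rows = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t")
    return {row["subject_id"]: row["object_id"] for row in reader}


lookup = read_pairs(io.StringIO(SAMPLE))
```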

resolve_ids(input_path: Path | str | list[str], mapping_set: Sec2PriMappingSet, at: str | list[str] | None = None, *, output_path: Path | str | None = None, suffix: str = '_primary', sep: str | None = None, synonyms: str | None = None, label_mapping_set: Sec2PriMappingSet | None = None) pd.DataFrame | str | list[str]

Resolve secondary IDs to primary IDs.

Direct lookup: when input_path is a plain identifier string or a list of identifier strings (i.e. not a path to an existing file), the function returns the resolved primary ID(s). at, output_path, suffix, and sep are ignored in this mode:

resolve_ids("HMDB00001", hmdb_ms)  # -> "HMDB:HMDB0000001"
resolve_ids(["HMDB00001", "HMDB00002"], hmdb_ms)  # -> ["...", "..."]

DataFrame mode: when input_path points to an existing TSV/CSV file, at is required. The file is read with pandas.read_csv and for each column named in at a new column <col><suffix> is appended containing the resolved primary IDs. Identifiers not present in mapping_set are kept unchanged.

Parameters:
  • input_path – An identifier string, a list of identifier strings, or the path to a TSV/CSV file.

  • mapping_set – A Sec2PriMappingSet (e.g. the result of generate_hgnc()).

  • at – Column name(s) to resolve. Required in DataFrame mode; ignored in direct-lookup mode.

  • output_path – If given, the resulting DataFrame is written to this path (DataFrame mode only).

  • suffix – Suffix appended to each resolved column name (default "_primary").

  • sep – Delimiter for reading the file. Inferred from the extension when None ("\t" for .tsv, "," otherwise).

Returns:

A resolved identifier string, a list of resolved strings (direct-lookup mode), or a pandas.DataFrame with one additional column per entry in at (DataFrame mode).
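
The column-wise behaviour of DataFrame mode can be imitated with the csv module as a stand-in for pandas: each row gets a new <col><suffix> field, and identifiers missing from the lookup are copied through unchanged. The helper name resolve_column, the table, and the EX: identifiers are hypothetical.

```python
import csv
import io

lookup = {"EX:old1": "EX:new1"}  # hypothetical secondary -> primary lookup


def resolve_column(text, column, lookup, suffix="_primary", sep="\t"):
    """Append <column><suffix> holding the resolved value for every row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=sep)
    rows = []
    for row in reader:
        value = row[column]
        # Unknown identifiers are kept as they are.
        row[column + suffix] = lookup.get(value, value)
        rows.append(row)
    return rows


table = "gene\tscore\nEX:old1\t0.5\nEX:other\t0.9\n"
resolved = resolve_column(table, "gene", lookup)
```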

resolve_symbols(input_path: Path | str | list[str], mapping_set: Sec2PriMappingSet, at: str | list[str] | None = None, *, output_path: Path | str | None = None, suffix: str = '_current', sep: str | None = None, synonyms: str | None = None) pd.DataFrame | str | list[str]

Resolve previous/alias symbols to current symbols.

Direct lookup: when input_path is a plain symbol string or a list of symbol strings (i.e. not a path to an existing file), the function returns the resolved current symbol(s). at, output_path, suffix, and sep are ignored in this mode:

resolve_symbols("Ibuprofen", chebi_ms)  # -> "ibuprofen"
resolve_symbols(["Ibuprofen", "Glucose"], chebi_ms)  # -> ["...", "..."]

DataFrame mode: when input_path points to an existing TSV/CSV file, at is required. For each column named in at a new column <col><suffix> is appended containing the resolved current symbols. Symbols not present in mapping_set are kept unchanged.

Parameters:
  • input_path – A symbol string, a list of symbol strings, or the path to a TSV/CSV file.

  • mapping_set – A LabelMappingSet (e.g. the result of generate_hgnc_symbols()).

  • at – Column name(s) to resolve. Required in DataFrame mode; ignored in direct-lookup mode.

  • output_path – If given, the resulting DataFrame is written to this path (DataFrame mode only).

  • suffix – Suffix appended to each resolved column name (default "_current").

  • sep – Delimiter for reading the file. Inferred from the extension when None ("\t" for .tsv, "," otherwise).

Returns:

A resolved symbol string, a list of resolved strings (direct-lookup mode), or a pandas.DataFrame with one additional column per entry in at (DataFrame mode).
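
Two small rules from this description lend themselves to a sketch: the shape of the direct-lookup result follows the shape of the input (string in, string out; list in, list out), and the delimiter is inferred from the file extension. The helper names are hypothetical, and passing unknown symbols through unchanged in direct-lookup mode is an assumption carried over from the DataFrame-mode description.

```python
from pathlib import Path


def resolve(value, lookup):
    """Return a string for a string input and a list for a list input."""
    if isinstance(value, str):
        return lookup.get(value, value)
    return [lookup.get(item, item) for item in value]


def infer_sep(path):
    """Tab for .tsv files, comma for everything else."""
    return "\t" if Path(path).suffix == ".tsv" else ","


symbols = {"Ibuprofen": "ibuprofen"}  # hypothetical previous -> current lookup
single = resolve("Ibuprofen", symbols)
several = resolve(["Ibuprofen", "unknown"], symbols)
```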

save(mapping_set: Sec2PriMappingSet, output_format: str, output: Path | str | None = None, *, base_name: str) Path

Write mapping_set and return the path that was written.

Delegates to save() for single formats and write_all_formats() for "all".

Parameters:
  • mapping_set – The mapping set to write.

  • output_format – One of sssom, sec2pri, pri_ids, name2synonym, symbol_sec2pri, pri_symbols, rdf, json, owl, or all.

  • output – Explicit output path or directory. When None, a default name derived from base_name is used.

  • base_name – Stem used to derive file names, e.g. "hgnc_2026-04-07".

Returns:

The directory (for "all") or file path that was written.

write_all_formats(mapping_set: Sec2PriMappingSet, output_dir: Path, base_name: str, include_name2synonym: bool = True) None

Write mapping set in all output formats to a directory.

Parameters:
  • mapping_set – The mapping set to write.

  • output_dir – Directory to write files to.

  • base_name – Base name for output files (e.g. "chebi_3star_245").

  • include_name2synonym – Whether to include name2synonym format.

write_diff_output(result: MappingDiff, output_path: Path) None

Write diff results to a TSV file.

Parameters:
  • result – MappingDiff object with added/removed/changed mappings.

  • output_path – Path to write the TSV file.

write_json(mapping_set: Sec2PriMappingSet, output_path: Path | str) Path

Write a mapping set to an SSSOM JSON file.

Parameters:
  • mapping_set – The mapping set to write.

  • output_path – Destination file path (e.g. mappings.json).

Returns:

Path to the written file.

write_name2synonym(mapping_set: Sec2PriMappingSet, output_path: Path | str) Path

Write name to synonym mappings to a TSV file.

Only rows where at least one of subject_label or object_label is set are written. Columns: subject_id, subject_label, object_label.

Parameters:
  • mapping_set – The mapping set to write.

  • output_path – Destination file path (e.g. name2synonym.tsv).

Returns:

Path to the written file.
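
The row filter and column order described for this format can be sketched with the csv module. Rows are plain dicts here rather than mapping objects, the data is invented, and the helper name labelled is hypothetical; output goes to an in-memory buffer instead of a file.

```python
import csv
import io

COLUMNS = ["subject_id", "subject_label", "object_label"]

# Hypothetical rows; the second has neither label set and is dropped.
rows = [
    {"subject_id": "EX:1", "subject_label": "name A", "object_label": "synonym A"},
    {"subject_id": "EX:2", "subject_label": "", "object_label": None},
    {"subject_id": "EX:3", "subject_label": "name C", "object_label": ""},
]


def labelled(rows):
    """Keep rows where at least one of the two labels is set."""
    return [r for r in rows if r.get("subject_label") or r.get("object_label")]


buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS, delimiter="\t",
                        lineterminator="\n")
writer.writeheader()
writer.writerows(labelled(rows))
output = buffer.getvalue()
```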

write_output(mapping_set: Sec2PriMappingSet, output_format: str, output_path: Path | str) Path

Write a mapping set in any registered output format.

Parameters:
  • mapping_set – The mapping set to write.

  • output_format – Format name (must be a key in WRITERS).

  • output_path – Path to write to.

Returns:

Path to the written file.

Raises:

ValueError – If output_format is not recognized.
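
Dispatching on a format name through a registry might look like the sketch below. The WRITERS dict here is a local stand-in with a single toy writer, not the library's registry, and write_any and _write_plain are hypothetical names; only the pattern (look up the writer, raise ValueError for an unknown key, return the written path) is taken from the description above.

```python
import tempfile
from pathlib import Path


def _write_plain(pairs, path):
    """Toy writer: one tab-separated (subject, object) pair per line."""
    text = "\n".join(f"{s}\t{o}" for s, o in pairs) + "\n"
    path.write_text(text, encoding="utf-8")
    return path


# Local stand-in registry mapping format names to writer functions.
WRITERS = {"sec2pri": _write_plain}


def write_any(pairs, output_format, output_path):
    """Look up the writer for output_format and return the written path."""
    try:
        writer = WRITERS[output_format]
    except KeyError:
        raise ValueError(f"Unknown output format: {output_format!r}") from None
    return writer(pairs, Path(output_path))


with tempfile.TemporaryDirectory() as tmp:
    written = write_any([("EX:old1", "EX:new1")], "sec2pri",
                        Path(tmp) / "demo.tsv")
    content = written.read_text(encoding="utf-8")
```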

write_owl(mapping_set: Sec2PriMappingSet, output_path: Path | str, serialisation: str = 'turtle') Path

Write a mapping set to an OWL/RDF file (default: Turtle).

Parameters:
  • mapping_set – The mapping set to write.

  • output_path – Destination file path (e.g. mappings_owl.ttl).

  • serialisation – RDFLib serialisation format.

Returns:

Path to the written file.

write_rdf(mapping_set: Sec2PriMappingSet, output_path: Path | str, serialisation: str = 'turtle') Path

Write a mapping set to an RDF file.

Parameters:
  • mapping_set – The mapping set to write.

  • output_path – Destination file path (e.g. mappings.ttl).

  • serialisation – RDFLib serialisation format.

Returns:

Path to the written file.

write_sec2pri(mapping_set: Sec2PriMappingSet, output_path: Path | str) Path

Write secondary to primary ID mappings to a TSV file.

Columns: subject_id, object_id, predicate_id, mapping_cardinality.

Parameters:
  • mapping_set – The mapping set to write.

  • output_path – Destination file path (e.g. sec2pri.tsv).

Returns:

Path to the written file.

write_sssom(mapping_set: Sec2PriMappingSet, output_path: Path | str) Path

Write a mapping set to an SSSOM TSV file.

Parameters:
  • mapping_set – The mapping set to write.

  • output_path – Destination .sssom.tsv file path.

Returns:

Path to the written file.

write_symbol_sec2pri(mapping_set: Sec2PriMappingSet, output_path: Path | str) Path

Write symbol to previous symbol mappings to a TSV file.

Only rows where at least one of subject_label or object_label is set are written. Columns: subject_id, subject_label, object_label, mapping_cardinality.

Parameters:
  • mapping_set – The mapping set to write.

  • output_path – Destination file path (e.g. symbol2prev.tsv).

Returns:

Path to the written file.