API

Functions for parsing biological databases and generating SSSOM-compliant mappings.

All parsing functions return Sec2PriMappingSet objects for integration with the SSSOM ecosystem.

Main functions for pysec2pri.

This module provides functions for parsing secondary-to-primary mapping files from biological databases, and for generating and using the resulting standardized mapping sets.

combine_mapping_sets(id_mappings: Sec2PriMappingSet | None, synonym_mappings: Sec2PriMappingSet | None) → Sec2PriMappingSet

Combine two mapping sets into one.

Parameters:
  • id_mappings – First mapping set (e.g. ID mappings).

  • synonym_mappings – Second mapping set (e.g. synonym mappings).

Returns:

Combined mapping set.

Raises:

ValueError – If both mapping sets are None.
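The None handling described above can be sketched with a stand-in class. This is only an illustration: MappingSetSketch and combine_sketch are hypothetical names, and treating a single None argument as an empty set is an assumption, since the source documents only the case where both are None.

```python
from dataclasses import dataclass, field


@dataclass
class MappingSetSketch:
    """Hypothetical stand-in for Sec2PriMappingSet: a list of (secondary, primary) pairs."""

    mappings: list[tuple[str, str]] = field(default_factory=list)


def combine_sketch(id_mappings, synonym_mappings):
    """Merge two optional mapping sets, mirroring the documented ValueError."""
    if id_mappings is None and synonym_mappings is None:
        raise ValueError("At least one mapping set must be provided")
    # Assumption: a single None argument is treated as an empty set.
    combined = MappingSetSketch()
    for mapping_set in (id_mappings, synonym_mappings):
        if mapping_set is not None:
            combined.mappings.extend(mapping_set.mappings)
    return combined


# Made-up identifiers and labels for illustration only.
ids = MappingSetSketch([("EX:old", "EX:new")])
names = MappingSetSketch([("old name", "new name")])
both = combine_sketch(ids, names)
only_ids = combine_sketch(ids, None)
```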

generate_chebi(input_path: Path | str | None = None, version: str | None = None, show_progress: bool = True, subset: str = '3star', mapping_sets: str = 'ids') → Sec2PriMappingSet

Return ChEBI mappings (IDs, synonyms, or both).

Downloads the latest release automatically when input_path is omitted. Pass an SDF file (releases < 245) or a directory of TSV flat files (releases >= 245) to use a local copy.

Parameters:
  • input_path – Local SDF file or TSV directory. Auto-downloaded if None.

  • version – Release number (e.g. "245").

  • show_progress – Whether to show progress bars.

  • subset – "3star" (default) or "complete".

  • mapping_sets – "ids" (default), "synonyms", or "all".

generate_chebi_synonyms(input_path: Path | str | None = None, version: str | None = None, show_progress: bool = True, subset: str = '3star') → Sec2PriMappingSet

Return ChEBI synonym (name) mappings.

generate_hgnc(input_path: Path | str | None = None, complete_set_path: Path | str | None = None, version: str | None = None, show_progress: bool = True) → Sec2PriMappingSet

Return HGNC secondary-to-primary ID mappings.

Downloads the withdrawn and complete set files automatically when input_path / complete_set_path are omitted. The complete set is used to populate the full list of current primary IDs so that to_pri_ids() returns the authoritative list (~45 k IDs) rather than just the ~5 k primaries that happen to have a secondary.

Parameters:
  • input_path – Local HGNC withdrawn TSV. Auto-downloaded if None.

  • complete_set_path – Local HGNC complete set TSV. Auto-downloaded if None.

  • version – Version string for metadata.

  • show_progress – Whether to show progress bars.
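The difference between the two primary-ID lists described above can be illustrated with a small sketch. The function name primary_ids_sketch and all identifiers are hypothetical; how the library stores and returns primary IDs internally is not shown in this reference, so this only models the idea.

```python
def primary_ids_sketch(mappings, complete_set_ids=None):
    """Return primary IDs from the complete set if given, else from the mappings."""
    if complete_set_ids is not None:
        return sorted(set(complete_set_ids))
    # Without the complete set, only primaries that have a secondary are visible.
    return sorted({primary for _secondary, primary in mappings})


# Made-up identifiers for illustration only.
mappings = [("EX:100", "EX:1"), ("EX:101", "EX:1"), ("EX:102", "EX:2")]
complete_set = ["EX:1", "EX:2", "EX:3", "EX:4"]

from_mappings = primary_ids_sketch(mappings)
authoritative = primary_ids_sketch(mappings, complete_set)
```

Here from_mappings misses the primaries that never had a secondary ID, while authoritative lists every current ID.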

generate_hgnc_primary_ids(input_path: Path | str | None = None, version: str | None = None, show_progress: bool = True) → Sec2PriMappingSet

Return a mapping set containing the full list of current HGNC primary IDs.

Only the HGNC complete set file is downloaded/read. The returned mapping set has an empty mappings list; its _primary_ids store is populated with every current HGNC ID so that to_pri_ids() produces the authoritative complete list, not just the subset of primaries that happen to have an associated secondary.

Parameters:
  • input_path – Local HGNC complete set TSV. Auto-downloaded if None.

  • version – Version string for metadata.

  • show_progress – Whether to show progress bars.

generate_hgnc_symbols(input_path: Path | str | None = None, version: str | None = None, show_progress: bool = True, statuses: list[str] | None = None) → Sec2PriMappingSet

Return HGNC symbol to previous-symbol mappings.

Downloads the complete set file automatically when input_path is omitted.

Parameters:
  • input_path – Local HGNC complete set TSV. Auto-downloaded if None.

  • version – Version string for metadata.

  • show_progress – Whether to show progress bars.

  • statuses – Entry statuses to include (e.g. ["Approved"]).

generate_hmdb(input_path: Path | str | None = None, version: str | None = None, show_progress: bool = True) → Sec2PriMappingSet

Return HMDB metabolite secondary-to-primary accession mappings.

Downloads hmdb_metabolites.xml automatically when input_path is omitted.

Parameters:
  • input_path – Local hmdb_metabolites.xml (or .zip/.gz). Auto-downloaded if None.

  • version – Version string for metadata.

  • show_progress – Whether to show progress bars.

generate_hmdb_proteins(input_path: Path | str | None = None, version: str | None = None, show_progress: bool = True) → Sec2PriMappingSet

Return HMDB protein secondary-to-primary accession mappings.

Downloads hmdb_proteins.xml automatically when input_path is omitted.

Parameters:
  • input_path – Local hmdb_proteins.xml (or .zip/.gz). Auto-downloaded if None.

  • version – Version string for metadata.

  • show_progress – Whether to show progress bars.

generate_ncbi(input_path: Path | str | None = None, tax_id: str = '9606', version: str | None = None, show_progress: bool = True) → Sec2PriMappingSet

Return NCBI Gene secondary-to-primary ID mappings.

Downloads the gene history file automatically when input_path is omitted.

Parameters:
  • input_path – Local gene_history file. Auto-downloaded if None.

  • tax_id – NCBI taxonomy ID to filter (default: "9606" for human).

  • version – Version string for metadata.

  • show_progress – Whether to show progress bars.
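Filtering by tax_id can be pictured as follows. This is a rough sketch, not the library's parser: it assumes the tab-separated gene_history layout with the columns #tax_id, GeneID, Discontinued_GeneID, Discontinued_Symbol, and Discontinue_Date, and it assumes that rows with "-" in the GeneID column (no replacement) are skipped. The name history_pairs and the sample rows are made up.

```python
import csv
import io


def history_pairs(handle, tax_id="9606"):
    """Yield (discontinued_id, current_id) pairs for one taxon."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and the '#tax_id ...' header
        row_tax, gene_id, discontinued_id = row[0], row[1], row[2]
        # Assumption: "-" in the GeneID column means no replacement exists.
        if row_tax == tax_id and gene_id != "-":
            yield discontinued_id, gene_id


# Made-up rows in the assumed column layout.
sample = io.StringIO(
    "#tax_id\tGeneID\tDiscontinued_GeneID\tDiscontinued_Symbol\tDiscontinue_Date\n"
    "9606\t111\t222\tOLD1\t20200101\n"
    "9606\t-\t333\tOLD2\t20200102\n"
    "10090\t444\t555\tOld3\t20200103\n"
)
pairs = list(history_pairs(sample))
```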

generate_ncbi_symbols(input_path: Path | str | None = None, tax_id: str = '9606', version: str | None = None, show_progress: bool = True) → Sec2PriMappingSet

Return NCBI Gene symbol to previous-symbol mappings.

Downloads the gene_info file automatically when input_path is omitted.

Parameters:
  • input_path – Local gene_info file. Auto-downloaded if None.

  • tax_id – NCBI taxonomy ID to filter (default: "9606" for human).

  • version – Version string for metadata.

  • show_progress – Whether to show progress bars.

generate_uniprot(input_path: Path | str | None = None, delac_file: Path | str | None = None, version: str | None = None, show_progress: bool = True) → Sec2PriMappingSet

Return UniProt secondary-to-primary accession mappings.

Downloads sec_ac.txt and delac_sp.txt automatically when input_path is omitted.

Parameters:
  • input_path – Local sec_ac.txt. Auto-downloaded if None.

  • delac_file – Local delac_sp.txt.

  • version – Version string for metadata.

  • show_progress – Whether to show progress bars.

generate_wikidata(input_path: Path | str | None = None, entity_type: str | None = None, version: str | None = None, endpoint: str | None = None, show_progress: bool = True, test_subset: bool = False) → Sec2PriMappingSet

Return Wikidata redirect mappings via SPARQL (or a pre-downloaded TSV).

Queries the QLever Wikidata endpoint when input_path is omitted. If entity_type is None, all entity types (metabolites, genes, proteins) are queried and combined.

Parameters:
  • input_path – Pre-downloaded TSV file. Queries SPARQL if None.

  • entity_type – "metabolites", "chemicals", "genes", or "proteins". Queries all types when None.

  • version – Version string for metadata (defaults to today’s date).

  • endpoint – Custom SPARQL endpoint URL.

  • show_progress – Whether to show progress bars.

  • test_subset – Use test queries limited to 10 results.

generate_wikidata_symbols(input_path: Path | str | None = None, entity_type: str | None = None, version: str | None = None, endpoint: str | None = None, show_progress: bool = True, test_subset: bool = False) → LabelMappingSet

Return Wikidata label mappings (previous label to current label).

Queries the QLever Wikidata endpoint when input_path is omitted. If entity_type is None, all entity types are queried and their label mappings combined.

Parameters:
  • input_path – Pre-downloaded TSV file. Queries SPARQL if None.

  • entity_type – "metabolites", "chemicals", "genes", or "proteins". Queries all types when None.

  • version – Version string for metadata.

  • endpoint – Custom SPARQL endpoint URL.

  • show_progress – Whether to show progress bars.

  • test_subset – Use test queries limited to 10 results.

Returns:

LabelMappingSet with label mappings.

load_label_mapping(path: Path | str) → LabelMappingSet

Load a label/symbol mapping set from a pysec2pri TSV file.

Accepts the symbol2prev TSV format (columns subject_id, subject_label, object_label, mapping_cardinality) and the full SSSOM TSV format (comment-prefixed metadata lines are skipped).

Parameters:

path – Path to the TSV file to load.

Returns:

A LabelMappingSet populated from the file, ready to pass to resolve_symbols().

load_mapping(path: Path | str) → IdMappingSet

Load an ID mapping set from a pysec2pri TSV file.

Accepts the sec2pri TSV format (columns subject_id, object_id, predicate_id, mapping_cardinality) and the full SSSOM TSV format (comment-prefixed metadata lines are skipped automatically).

Parameters:

path – Path to the TSV file to load.

Returns:

An IdMappingSet populated from the file, ready to pass to resolve_ids().
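How a loader might accept both layouts can be sketched with the standard library. This is not the library's implementation: read_mapping_rows is a hypothetical helper, it assumes metadata lines start with "#" as in SSSOM TSV files, and the sample row values are illustrative.

```python
import csv
import io

SEC2PRI_COLUMNS = ["subject_id", "object_id", "predicate_id", "mapping_cardinality"]


def read_mapping_rows(handle):
    """Yield one dict per data row, skipping '#'-prefixed metadata lines."""
    data_lines = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t")
    for row in reader:
        # Keep only the sec2pri columns; missing columns become empty strings.
        yield {col: row.get(col, "") for col in SEC2PRI_COLUMNS}


# Illustrative content: one metadata line, a header, and one mapping row.
sample = io.StringIO(
    "# mapping_set_id: https://example.org/demo\n"
    "subject_id\tobject_id\tpredicate_id\tmapping_cardinality\n"
    "HMDB:HMDB00001\tHMDB:HMDB0000001\tIAO:0100001\t1:1\n"
)
rows = list(read_mapping_rows(sample))
```

Because the metadata lines are dropped before the header is read, the same function handles a plain sec2pri TSV and a full SSSOM TSV.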

resolve_ids(input_path: Path | str | list[str], mapping_set: Sec2PriMappingSet, at: str | list[str] | None = None, *, output_path: Path | str | None = None, suffix: str = '_primary', sep: str | None = None) → pd.DataFrame | str | list[str]

Resolve secondary IDs to primary IDs.

Direct lookup: when input_path is a plain identifier string or a list of identifier strings (i.e. not a path to an existing file), the function returns the resolved primary ID(s). at, output_path, suffix, and sep are ignored in this mode:

resolve_ids("HMDB00001", hmdb_ms)  # → "HMDB:HMDB0000001"
resolve_ids(["HMDB00001", "HMDB00002"], hmdb_ms)  # → ["...", "..."]

DataFrame mode: when input_path points to an existing TSV/CSV file, at is required. The file is read with pandas.read_csv and for each column named in at a new column <col><suffix> is appended containing the resolved primary IDs. Identifiers not present in mapping_set are kept unchanged.

Parameters:
  • input_path – An identifier string, a list of identifier strings, or the path to a TSV/CSV file.

  • mapping_set – A Sec2PriMappingSet (e.g. the result of generate_hgnc()).

  • at – Column name(s) to resolve. Required in DataFrame mode; ignored in direct-lookup mode.

  • output_path – If given, the resulting DataFrame is written to this path (DataFrame mode only).

  • suffix – Suffix appended to each resolved column name (default "_primary").

  • sep – Delimiter for reading the file. Inferred from the extension when None ("\t" for .tsv, "," otherwise).

Returns:

A resolved identifier string, a list of resolved strings (direct-lookup mode), or a pandas.DataFrame with one additional column per entry in at (DataFrame mode).
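The two modes and the delimiter inference can be sketched without pandas. This is a simplified model under stated assumptions: resolve_sketch is a hypothetical name, a plain dict stands in for the mapping set, file mode returns a list of row dicts rather than a DataFrame, and both the ValueError for a missing at and the unchanged return of unknown IDs in direct lookup are assumptions.

```python
import csv
import tempfile
from pathlib import Path


def resolve_sketch(value, lookup, at=None, suffix="_primary", sep=None):
    """Resolve IDs directly, or per column when `value` is an existing file."""
    if isinstance(value, list):
        return [lookup.get(v, v) for v in value]
    path = Path(value)
    if not path.is_file():
        # Direct lookup; assumption: unknown identifiers come back unchanged.
        return lookup.get(str(value), str(value))
    if at is None:
        raise ValueError("'at' is required when resolving a file")
    columns = [at] if isinstance(at, str) else list(at)
    if sep is None:
        # Delimiter inferred from the extension, as described for `sep`.
        sep = "\t" if path.suffix == ".tsv" else ","
    with path.open(newline="") as handle:
        rows = list(csv.DictReader(handle, delimiter=sep))
    for row in rows:
        for col in columns:
            row[col + suffix] = lookup.get(row[col], row[col])
    return rows


lookup = {"HMDB00001": "HMDB:HMDB0000001"}
single = resolve_sketch("HMDB00001", lookup)
many = resolve_sketch(["HMDB00001", "X1"], lookup)

with tempfile.TemporaryDirectory() as tmp:
    table = Path(tmp) / "input.tsv"
    table.write_text("metabolite\tvalue\nHMDB00001\t3.2\nX1\t1.0\n")
    resolved_rows = resolve_sketch(table, lookup, at="metabolite")
```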

resolve_symbols(input_path: Path | str | list[str], mapping_set: Sec2PriMappingSet, at: str | list[str] | None = None, *, output_path: Path | str | None = None, suffix: str = '_current', sep: str | None = None) → pd.DataFrame | str | list[str]

Resolve previous/alias symbols to current symbols.

Direct lookup: when input_path is a plain symbol string or a list of symbol strings (i.e. not a path to an existing file), the function returns the resolved current symbol(s). at, output_path, suffix, and sep are ignored in this mode:

resolve_symbols("Ibuprofen", chebi_ms)  # → "ibuprofen"
resolve_symbols(["Ibuprofen", "Glucose"], chebi_ms)  # → ["...", "..."]

DataFrame mode: when input_path points to an existing TSV/CSV file, at is required. For each column named in at a new column <col><suffix> is appended containing the resolved current symbols. Symbols not present in mapping_set are kept unchanged.

Parameters:
  • input_path – A symbol string, a list of symbol strings, or the path to a TSV/CSV file.

  • mapping_set – A LabelMappingSet (e.g. the result of generate_hgnc_symbols()).

  • at – Column name(s) to resolve. Required in DataFrame mode; ignored in direct-lookup mode.

  • output_path – If given, the resulting DataFrame is written to this path (DataFrame mode only).

  • suffix – Suffix appended to each resolved column name (default "_current").

  • sep – Delimiter for reading the file. Inferred from the extension when None ("\t" for .tsv, "," otherwise).

Returns:

A resolved symbol string, a list of resolved strings (direct-lookup mode), or a pandas.DataFrame with one additional column per entry in at (DataFrame mode).

save(mapping_set: Sec2PriMappingSet, output_format: str, output: Path | str | None = None, *, base_name: str) → Path

Write mapping_set and return the path that was written.

Delegates to save() for single formats and write_all_formats() for "all".

Parameters:
  • mapping_set – The mapping set to write.

  • output_format – One of sssom, sec2pri, pri_ids, name2synonym, symbol_sec2pri, pri_symbols, rdf, json, owl, or all.

  • output – Explicit output path or directory. When None, a default name derived from base_name is used.

  • base_name – Stem used to derive file names, e.g. "hgnc_2026-04-07".

Returns:

The directory (for "all") or file path that was written.

write_all_formats(mapping_set: Sec2PriMappingSet, output_dir: Path, base_name: str, include_name2synonym: bool = True) → None

Write mapping set in all output formats to a directory.

Parameters:
  • mapping_set – The mapping set to write.

  • output_dir – Directory to write files to.

  • base_name – Base name for output files (e.g. "chebi_3star_245").

  • include_name2synonym – Whether to include name2synonym format.

write_diff_output(result: MappingDiff, output_path: Path) → None

Write diff results to a TSV file.

Parameters:
  • result – MappingDiff object with added/removed/changed mappings.

  • output_path – Path to write the TSV file.

write_json(mapping_set: Sec2PriMappingSet, output_path: Path | str) → Path

Write a mapping set to an SSSOM JSON file.

Parameters:
  • mapping_set – The mapping set to write.

  • output_path – Destination file path (e.g. mappings.json).

Returns:

Path to the written file.

write_name2synonym(mapping_set: Sec2PriMappingSet, output_path: Path | str) → Path

Write name to synonym mappings to a TSV file.

Only rows where at least one of subject_label or object_label is set are written. Columns: subject_id, subject_label, object_label.

Parameters:
  • mapping_set – The mapping set to write.

  • output_path – Destination file path (e.g. name2synonym.tsv).

Returns:

Path to the written file.
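The row filter described for write_name2synonym (keep a row when at least one label is set) can be sketched like this. The helper names, the dict-based rows, and the sample values are hypothetical; only the column names come from the description above.

```python
import csv
import io

NAME2SYNONYM_COLUMNS = ["subject_id", "subject_label", "object_label"]


def name2synonym_rows(mappings):
    """Keep only mappings that carry at least one label."""
    return [
        {col: m.get(col) or "" for col in NAME2SYNONYM_COLUMNS}
        for m in mappings
        if m.get("subject_label") or m.get("object_label")
    ]


def write_rows(rows, handle):
    """Write the filtered rows as a tab-separated table with a header."""
    writer = csv.DictWriter(
        handle, fieldnames=NAME2SYNONYM_COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


# Made-up mappings for illustration only.
mappings = [
    {"subject_id": "EX:1", "subject_label": "label a", "object_label": "label b"},
    {"subject_id": "EX:2", "object_id": "EX:9"},  # no labels: skipped
    {"subject_id": "EX:3", "object_label": "label c"},  # one label: kept
]
buffer = io.StringIO()
write_rows(name2synonym_rows(mappings), buffer)
output = buffer.getvalue()
```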

write_output(mapping_set: Sec2PriMappingSet, output_format: str, output_path: Path | str) → Path

Write a mapping set in any registered output format.

Parameters:
  • mapping_set – The mapping set to write.

  • output_format – Format name (must be a key in WRITERS).

  • output_path – Path to write to.

Returns:

Path to the written file.

Raises:

ValueError – If output_format is not recognized.
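A format registry of the kind write_output relies on might look like the sketch below. Everything here is illustrative: WRITERS_SKETCH, its keys, and the two writer functions are hypothetical, and the real WRITERS mapping may hold different keys and callables.

```python
import json
import tempfile
from pathlib import Path


def write_tsv_sketch(rows, output_path):
    """Write (secondary, primary) pairs as a two-column TSV."""
    path = Path(output_path)
    path.write_text("".join(f"{sec}\t{pri}\n" for sec, pri in rows))
    return path


def write_json_sketch(rows, output_path):
    """Write the same pairs as a JSON list of objects."""
    path = Path(output_path)
    path.write_text(json.dumps([{"secondary": s, "primary": p} for s, p in rows]))
    return path


# Hypothetical registry mapping format names to writer callables.
WRITERS_SKETCH = {"tsv": write_tsv_sketch, "json": write_json_sketch}


def write_output_sketch(rows, output_format, output_path):
    """Look up the writer by name and fail loudly on unknown formats."""
    try:
        writer = WRITERS_SKETCH[output_format]
    except KeyError:
        known = ", ".join(sorted(WRITERS_SKETCH))
        raise ValueError(
            f"Unknown format {output_format!r}; expected one of: {known}"
        ) from None
    return writer(rows, output_path)


pairs = [("HMDB00001", "HMDB0000001")]
with tempfile.TemporaryDirectory() as tmp:
    written = write_output_sketch(pairs, "json", Path(tmp) / "demo.json")
    loaded = json.loads(written.read_text())
```

Keeping the writers in a dict means a new format only needs one new entry, and the error message can list the accepted names.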

write_owl(mapping_set: Sec2PriMappingSet, output_path: Path | str, serialisation: str = 'turtle') → Path

Write a mapping set to an OWL/RDF file (default: Turtle).

Parameters:
  • mapping_set – The mapping set to write.

  • output_path – Destination file path (e.g. mappings_owl.ttl).

  • serialisation – RDFLib serialisation format.

Returns:

Path to the written file.

write_rdf(mapping_set: Sec2PriMappingSet, output_path: Path | str, serialisation: str = 'turtle') → Path

Write a mapping set to an RDF file.

Parameters:
  • mapping_set – The mapping set to write.

  • output_path – Destination file path (e.g. mappings.ttl).

  • serialisation – RDFLib serialisation format.

Returns:

Path to the written file.

write_sec2pri(mapping_set: Sec2PriMappingSet, output_path: Path | str) → Path

Write secondary-to-primary ID mappings to a TSV file.

Columns: subject_id, object_id, predicate_id, mapping_cardinality.

Parameters:
  • mapping_set – The mapping set to write.

  • output_path – Destination file path (e.g. sec2pri.tsv).

Returns:

Path to the written file.

write_sssom(mapping_set: Sec2PriMappingSet, output_path: Path | str) → Path

Write a mapping set to an SSSOM TSV file.

Parameters:
  • mapping_set – The mapping set to write.

  • output_path – Destination .sssom.tsv file path.

Returns:

Path to the written file.

write_symbol_sec2pri(mapping_set: Sec2PriMappingSet, output_path: Path | str) → Path

Write symbol to previous symbol mappings to a TSV file.

Only rows where at least one of subject_label or object_label is set are written. Columns: subject_id, subject_label, object_label, mapping_cardinality.

Parameters:
  • mapping_set – The mapping set to write.

  • output_path – Destination file path (e.g. symbol2prev.tsv).

Returns:

Path to the written file.