API
Functions for parsing biological databases and generating SSSOM-compliant mappings.
All parsing functions return Sec2PriMappingSet objects for integration with the SSSOM ecosystem.
Main functions for pysec2pri.
This module provides functions for parsing secondary-to-primary mapping files from biological databases, and for generating and using the resulting standardized mapping sets.
- combine_mapping_sets(id_mappings: Sec2PriMappingSet | None, synonym_mappings: Sec2PriMappingSet | None) -> Sec2PriMappingSet
Combine two mapping sets into one.
- Parameters:
id_mappings – First mapping set (e.g. ID mappings).
synonym_mappings – Second mapping set (e.g. synonym mappings).
- Returns:
Combined mapping set.
- Raises:
ValueError – If both mapping sets are None.
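The None handling described above can be illustrated with a minimal sketch that uses plain lists in place of Sec2PriMappingSet objects. The helper name combine_lists and the behaviour when only one argument is None (the other side is returned on its own) are assumptions made for illustration, not confirmed details of pysec2pri.

```python
from typing import Optional


def combine_lists(
    id_mappings: Optional[list],
    synonym_mappings: Optional[list],
) -> list:
    """Hypothetical stand-in: merge two lists of mappings into one."""
    if id_mappings is None and synonym_mappings is None:
        # Mirrors the documented error case: there is nothing to combine.
        raise ValueError("At least one mapping set must be provided.")
    # Assumption: a missing side simply contributes no mappings.
    return list(id_mappings or []) + list(synonym_mappings or [])


combined = combine_lists([("EX:1", "EX:10")], [("old name", "new name")])
only_ids = combine_lists([("EX:1", "EX:10")], None)
```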
- find_ambiguous(mapping_set: Sec2PriMappingSet) -> AmbiguousMappingSet
Find identifiers that are ambiguous in mapping_set.
An identifier is ambiguous when it appears both as a subject_id (i.e. a secondary/previous term) and as a current primary identifier. Such entries cannot be automatically resolved without risk of corrupting references that are already current.
This is a convenience wrapper around find_ambiguous().
- Parameters:
mapping_set – A Sec2PriMappingSet (e.g. the result of generate_hgnc()).
- Returns:
An AmbiguousMappingSet whose mappings list contains one entry for each conflicting subject, with a comment explaining the conflict.
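The ambiguity rule can be sketched without the library: collect every secondary identifier that also occurs in the set of current primary identifiers. The function name find_ambiguous_ids, the tuple-based input, and the EX: identifiers are hypothetical; the real function operates on a Sec2PriMappingSet and returns an AmbiguousMappingSet.

```python
def find_ambiguous_ids(mappings, primary_ids):
    """Return secondary IDs that are also current primary IDs.

    mappings: iterable of (subject_id, object_id) pairs (secondary, primary).
    primary_ids: iterable of identifiers known to be current.
    """
    current = set(primary_ids)
    conflicts = {}
    for subject_id, object_id in mappings:
        if subject_id in current:
            # Rewriting this ID could corrupt references that are already current.
            conflicts.setdefault(subject_id, []).append(object_id)
    return conflicts


mappings = [
    ("EX:1", "EX:10"),
    ("EX:2", "EX:20"),
    ("EX:20", "EX:30"),  # EX:20 is both a secondary and a current primary
]
primary_ids = ["EX:10", "EX:20", "EX:30"]
ambiguous = find_ambiguous_ids(mappings, primary_ids)
```

In this toy data only EX:20 is flagged, because it is listed as a secondary for EX:30 while also being a current primary.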
- generate_chebi(input_path: Path | str | None = None, version: str | None = None, show_progress: bool = True, subset: str = '3star', mapping_sets: str = 'ids') -> Sec2PriMappingSet
Return ChEBI mappings (IDs, synonyms, or both).
Downloads the latest release automatically when input_path is omitted. Pass an SDF file (releases < 245) or a directory of TSV flat files (releases >= 245) to use a local copy.
- Parameters:
input_path – Local SDF file or TSV directory. Auto-downloaded if None.
version – Release number (e.g. "245").
show_progress – Whether to show progress bars.
subset – "3star" (default) or "complete".
mapping_sets – "ids" (default), "synonyms", or "all".
- generate_chebi_primary_ids(input_path: Path | str | None = None, version: str | None = None, show_progress: bool = True, subset: str = '3star') -> Sec2PriMappingSet
Return a mapping set containing the full list of current ChEBI primary IDs.
Reads compounds.tsv to extract every current ChEBI compound ID. The returned mapping set has an empty mappings list; _primary_ids is populated with every current CHEBI:<n> CURIE.
- Parameters:
input_path – Local compounds.tsv file or directory containing it. Auto-downloaded if None.
version – Release number (e.g. "245").
show_progress – Whether to show progress bars.
subset – "3star" (default) or "complete".
- generate_chebi_primary_symbols(input_path: Path | str | None = None, version: str | None = None, show_progress: bool = True, subset: str = '3star') -> Sec2PriMappingSet
Return a mapping set containing the full list of current ChEBI compound names.
Reads compounds.tsv to extract every current compound's canonical name. The returned mapping set has an empty mappings list; _primary_symbols is populated.
- Parameters:
input_path – Local compounds.tsv file or directory containing it. Auto-downloaded if None.
version – Release number (e.g. "245").
show_progress – Whether to show progress bars.
subset – "3star" (default) or "complete".
- generate_chebi_synonyms(input_path: Path | str | None = None, version: str | None = None, show_progress: bool = True, subset: str = '3star') -> Sec2PriMappingSet
Return ChEBI synonym (name) mappings.
- generate_hgnc(input_path: Path | str | None = None, complete_set_path: Path | str | None = None, version: str | None = None, show_progress: bool = True) -> Sec2PriMappingSet
Return HGNC secondary-to-primary ID mappings.
Downloads the withdrawn and complete set files automatically when input_path/complete_set_path are omitted. The complete set is used to populate the full list of current primary IDs so that to_pri_ids() returns the authoritative list (~45 k IDs) rather than just the ~5 k primaries that happen to have a secondary.
- Parameters:
input_path – Local HGNC withdrawn TSV. Auto-downloaded if None.
complete_set_path – Local HGNC complete set TSV. Auto-downloaded if None.
version – Version string for metadata.
show_progress – Whether to show progress bars.
- generate_hgnc_primary_ids(input_path: Path | str | None = None, version: str | None = None, show_progress: bool = True) -> Sec2PriMappingSet
Return a mapping set containing the full list of current HGNC primary IDs.
Only the HGNC complete set file is downloaded/read. The returned mapping set has an empty mappings list; its _primary_ids store is populated with every current HGNC ID so that to_pri_ids() produces the authoritative complete list, not just the subset of primaries that happen to have an associated secondary.
- Parameters:
input_path – Local HGNC complete set TSV. Auto-downloaded if None.
version – Version string for metadata.
show_progress – Whether to show progress bars.
- generate_hgnc_symbols(input_path: Path | str | None = None, version: str | None = None, show_progress: bool = True, statuses: list[str] | None = None) -> Sec2PriMappingSet
Return HGNC symbol to previous-symbol mappings.
Downloads the complete set file automatically when input_path is omitted.
- Parameters:
input_path – Local HGNC complete set TSV. Auto-downloaded if None.
version – Version string for metadata.
show_progress – Whether to show progress bars.
statuses – Entry statuses to include (e.g. ["Approved"]).
- generate_hmdb(input_path: Path | str | None = None, version: str | None = None, show_progress: bool = True) -> Sec2PriMappingSet
Return HMDB metabolite secondary-to-primary accession mappings.
Downloads hmdb_metabolites.xml automatically when input_path is omitted.
- Parameters:
input_path – Local hmdb_metabolites.xml (or .zip/.gz). Auto-downloaded if None.
version – Version string for metadata.
show_progress – Whether to show progress bars.
- generate_hmdb_primary_ids(metabolites_path: Path | str | None = None, proteins_path: Path | str | None = None, version: str | None = None, show_progress: bool = True) -> Sec2PriMappingSet
Return a mapping set containing the full list of current HMDB primary IDs.
Reads one or both of hmdb_metabolites.xml and hmdb_proteins.xml and collects all primary accession numbers. The returned mapping set has an empty mappings list; _primary_ids is populated with every current HMDB:<acc> CURIE.
- Parameters:
metabolites_path – Local metabolites XML file. Auto-downloaded if both paths are None.
proteins_path – Local proteins XML file (optional).
version – Version string for metadata.
show_progress – Whether to show progress bars.
- generate_hmdb_proteins(input_path: Path | str | None = None, version: str | None = None, show_progress: bool = True) -> Sec2PriMappingSet
Return HMDB protein secondary-to-primary accession mappings.
Downloads hmdb_proteins.xml automatically when input_path is omitted.
- Parameters:
input_path – Local hmdb_proteins.xml (or .zip/.gz). Auto-downloaded if None.
version – Version string for metadata.
show_progress – Whether to show progress bars.
- generate_ncbi(input_path: Path | str | None = None, gene_info_path: Path | str | None = None, tax_id: str = '9606', version: str | None = None, show_progress: bool = True) -> Sec2PriMappingSet
Return NCBI Gene secondary-to-primary ID mappings.
Downloads the gene_history file automatically when input_path is omitted. When gene_info_path is supplied (or auto-downloaded), the full list of current primary IDs is read from gene_info and stored in _primary_ids, so that to_pri_ids() returns the authoritative complete set rather than only the subset of primaries that happen to appear in gene_history.
- Parameters:
input_path – Local gene_history file. Auto-downloaded if None.
gene_info_path – Local gene_info file used to populate the full primary ID list. Auto-downloaded together with input_path when both are None.
tax_id – NCBI taxonomy ID to filter (default: "9606" for human).
version – Version string for metadata.
show_progress – Whether to show progress bars.
- generate_ncbi_primary_ids(input_path: Path | str | None = None, tax_id: str = '9606', version: str | None = None, show_progress: bool = True) -> Sec2PriMappingSet
Return a mapping set containing the full list of current NCBI Gene primary IDs.
Reads gene_info to extract every current Gene ID for the given taxonomy. The returned mapping set has an empty mappings list; _primary_ids is populated with every current NCBIGene:<id> CURIE.
- Parameters:
input_path – Local gene_info file. Auto-downloaded if None.
tax_id – Taxonomy ID to filter by (default: "9606" for human).
version – Version string for metadata.
show_progress – Whether to show progress bars.
- generate_ncbi_primary_symbols(input_path: Path | str | None = None, tax_id: str = '9606', version: str | None = None, show_progress: bool = True) -> Sec2PriMappingSet
Return a mapping set containing the full list of current NCBI Gene symbols.
Reads gene_info to extract every current gene symbol for the given taxonomy. The returned mapping set has an empty mappings list; _primary_symbols is populated.
- Parameters:
input_path – Local gene_info file. Auto-downloaded if None.
tax_id – Taxonomy ID to filter by (default: "9606" for human).
version – Version string for metadata.
show_progress – Whether to show progress bars.
- generate_ncbi_symbols(input_path: Path | str | None = None, tax_id: str = '9606', version: str | None = None, show_progress: bool = True) -> Sec2PriMappingSet
Return NCBI Gene symbol to previous-symbol mappings.
Downloads the gene_info file automatically when input_path is omitted.
- Parameters:
input_path – Local gene_info file. Auto-downloaded if None.
tax_id – NCBI taxonomy ID to filter (default: "9606" for human).
version – Version string for metadata.
show_progress – Whether to show progress bars.
- generate_uniprot(input_path: Path | str | None = None, delac_file: Path | str | None = None, version: str | None = None, show_progress: bool = True) -> Sec2PriMappingSet
Return UniProt secondary-to-primary accession mappings.
Downloads sec_ac.txt and delac_sp.txt automatically when input_path is omitted.
- Parameters:
input_path – Local sec_ac.txt. Auto-downloaded if None.
delac_file – Local delac_sp.txt.
version – Version string for metadata.
show_progress – Whether to show progress bars.
- generate_uniprot_primary_ids(acindex_path: Path | str | None = None, version: str | None = None, show_progress: bool = True) -> Sec2PriMappingSet
Return a mapping set containing the full list of current UniProt primary ACs.
Parses acindex.txt to extract every accession number that currently appears in UniProtKB/Swiss-Prot. The returned mapping set has an empty mappings list; _primary_ids is populated with every current UniProtKB:<AC> CURIE.
For versioned (legacy) releases the file is available at:
https://ftp.uniprot.org/pub/databases/uniprot/previous_releases/release-{version}/knowledgebase/docs/acindex.txt.gz
- Parameters:
acindex_path – Local acindex.txt (plain or .gz). Auto-downloaded from the current release when None.
version – Version string for metadata.
show_progress – Whether to show progress bars.
- generate_wikidata(input_path: Path | str | None = None, entity_type: str | None = None, version: str | None = None, endpoint: str | None = None, show_progress: bool = True, test_subset: bool = False) -> Sec2PriMappingSet
Return Wikidata redirect mappings via SPARQL (or a pre-downloaded TSV).
Queries the QLever Wikidata endpoint when input_path is omitted. If entity_type is None, all entity types (metabolites, genes, proteins) are queried and combined.
- Parameters:
input_path – Pre-downloaded TSV file. Queries SPARQL if None.
entity_type – "metabolites", "chemicals", "genes", or "proteins". Queries all types when None.
version – Version string for metadata (defaults to today's date).
endpoint – Custom SPARQL endpoint URL.
show_progress – Whether to show progress bars.
test_subset – Use test queries limited to 10 results.
- generate_wikidata_symbols(input_path: Path | str | None = None, entity_type: str | None = None, version: str | None = None, endpoint: str | None = None, show_progress: bool = True, test_subset: bool = False) -> LabelMappingSet
Return Wikidata label mappings (previous label to current label).
Queries the QLever Wikidata endpoint when input_path is omitted. If entity_type is None, all entity types are queried and their label mappings combined.
- Parameters:
input_path – Pre-downloaded TSV file. Queries SPARQL if None.
entity_type – "metabolites", "chemicals", "genes", or "proteins". Queries all types when None.
version – Version string for metadata.
endpoint – Custom SPARQL endpoint URL.
show_progress – Whether to show progress bars.
test_subset – Use test queries limited to 10 results.
- Returns:
LabelMappingSet with label mappings.
- list_versions(datasource: str) -> Any
List all available archive versions for a datasource.
For datasources that publish versioned archives (ChEBI, HGNC, UniProt), this queries the remote archive index and returns all available version strings sorted in ascending order.
NCBI and HMDB do not maintain versioned archives; calling this function for those datasources raises ValueError.
- Parameters:
datasource – Datasource name, one of "chebi", "hgnc", or "uniprot".
- Returns:
Sorted list of version strings. The format depends on the datasource:
chebi: integer release numbers, e.g. ["200", ..., "245"]
hgnc: ISO dates, e.g. ["2023-01-01", ..., "2026-04-07"]
uniprot: release IDs, e.g. ["2024_01", "2024_02", ...]
- Raises:
ValueError – If datasource is unknown or has no versioned archive.
- load_label_mapping(path: Path | str) -> LabelMappingSet
Load a label/symbol mapping set from a pysec2pri TSV file.
Accepts the symbol2prev TSV format (columns subject_id, subject_label, object_label, mapping_cardinality) and the full SSSOM TSV format (comment-prefixed metadata lines are skipped).
- Parameters:
path – Path to the TSV file to load.
- Returns:
A LabelMappingSet populated from the file, ready to pass to resolve_symbols().
- load_mapping(path: Path | str) -> IdMappingSet
Load an ID mapping set from a pysec2pri TSV file.
Accepts the sec2pri TSV format (columns subject_id, object_id, predicate_id, mapping_cardinality) and the full SSSOM TSV format (comment-prefixed metadata lines are skipped automatically).
- Parameters:
path – Path to the TSV file to load.
- Returns:
An IdMappingSet populated from the file, ready to pass to resolve_ids().
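To make the file layout concrete, the following is a minimal sketch of reading such a TSV with only the standard library, skipping the "#"-prefixed metadata lines used by SSSOM TSV files. It is not the library's loader: the helper name read_sec2pri_rows, the EX: identifiers, and the EX:replacedBy predicate are placeholders invented for illustration.

```python
import csv
import io


def read_sec2pri_rows(handle):
    """Yield one dict per data row, skipping '#'-prefixed metadata lines."""
    data_lines = (line for line in handle if not line.startswith("#"))
    yield from csv.DictReader(data_lines, delimiter="\t")


# Hypothetical file content: two metadata lines, a header, and two rows.
example = io.StringIO(
    "#mapping_set_id: https://example.org/demo\n"
    "#license: https://example.org/license\n"
    "subject_id\tobject_id\tpredicate_id\tmapping_cardinality\n"
    "EX:1\tEX:10\tEX:replacedBy\t1:1\n"
    "EX:2\tEX:10\tEX:replacedBy\tn:1\n"
)
rows = list(read_sec2pri_rows(example))

# A plain secondary -> primary lookup table built from the rows.
sec2pri = {row["subject_id"]: row["object_id"] for row in rows}
```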
- resolve_ids(input_path: Path | str | list[str], mapping_set: Sec2PriMappingSet, at: str | list[str] | None = None, *, output_path: Path | str | None = None, suffix: str = '_primary', sep: str | None = None, synonyms: str | None = None, label_mapping_set: Sec2PriMappingSet | None = None) -> pd.DataFrame | str | list[str]
Resolve secondary IDs to primary IDs.
Direct lookup: when input_path is a plain identifier string or a list of identifier strings (i.e. not a path to an existing file), the function returns the resolved primary ID(s). at, output_path, suffix, and sep are ignored in this mode:
    resolve_ids("HMDB00001", hmdb_ms)  # -> "HMDB:HMDB0000001"
    resolve_ids(["HMDB00001", "HMDB00002"], hmdb_ms)  # -> ["...", "..."]
DataFrame mode: when input_path points to an existing TSV/CSV file, at is required. The file is read with pandas.read_csv and, for each column named in at, a new column <col><suffix> is appended containing the resolved primary IDs. Identifiers not present in mapping_set are kept unchanged.
- Parameters:
input_path – An identifier string, a list of identifier strings, or the path to a TSV/CSV file.
mapping_set – A Sec2PriMappingSet (e.g. the result of generate_hgnc()).
at – Column name(s) to resolve. Required in DataFrame mode; ignored in direct-lookup mode.
output_path – If given, the resulting DataFrame is written to this path (DataFrame mode only).
suffix – Suffix appended to each resolved column name (default "_primary").
sep – Delimiter for reading the file. Inferred from the extension when None ("\t" for .tsv, "," otherwise).
- Returns:
A resolved identifier string, a list of resolved strings (direct-lookup mode), or a pandas.DataFrame with one additional column per entry in at (DataFrame mode).
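The shape of direct-lookup mode (string in, string out; list in, list out) can be sketched with a plain dict standing in for the mapping set. The helper name resolve_direct is hypothetical, and returning unmapped identifiers unchanged is an assumption carried over from the documented DataFrame-mode behaviour.

```python
def resolve_direct(identifiers, sec2pri):
    """Resolve one identifier or a list of identifiers via a dict lookup."""
    if isinstance(identifiers, str):
        # Single identifier in, single identifier out.
        return sec2pri.get(identifiers, identifiers)
    # List in, list out, preserving the input order.
    return [sec2pri.get(item, item) for item in identifiers]


sec2pri = {"EX:1": "EX:10", "EX:2": "EX:20"}
single = resolve_direct("EX:1", sec2pri)
many = resolve_direct(["EX:2", "EX:99"], sec2pri)  # EX:99 has no mapping
```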
- resolve_symbols(input_path: Path | str | list[str], mapping_set: Sec2PriMappingSet, at: str | list[str] | None = None, *, output_path: Path | str | None = None, suffix: str = '_current', sep: str | None = None, synonyms: str | None = None) -> pd.DataFrame | str | list[str]
Resolve previous/alias symbols to current symbols.
Direct lookup: when input_path is a plain symbol string or a list of symbol strings (i.e. not a path to an existing file), the function returns the resolved current symbol(s). at, output_path, suffix, and sep are ignored in this mode:
    resolve_symbols("Ibuprofen", chebi_ms)  # -> "ibuprofen"
    resolve_symbols(["Ibuprofen", "Glucose"], chebi_ms)  # -> ["...", "..."]
DataFrame mode: when input_path points to an existing TSV/CSV file, at is required. For each column named in at, a new column <col><suffix> is appended containing the resolved current symbols. Symbols not present in mapping_set are kept unchanged.
- Parameters:
input_path – A symbol string, a list of symbol strings, or the path to a TSV/CSV file.
mapping_set – A LabelMappingSet (e.g. the result of generate_hgnc_symbols()).
at – Column name(s) to resolve. Required in DataFrame mode; ignored in direct-lookup mode.
output_path – If given, the resulting DataFrame is written to this path (DataFrame mode only).
suffix – Suffix appended to each resolved column name (default "_current").
sep – Delimiter for reading the file. Inferred from the extension when None ("\t" for .tsv, "," otherwise).
- Returns:
A resolved symbol string, a list of resolved strings (direct-lookup mode), or a pandas.DataFrame with one additional column per entry in at (DataFrame mode).
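The DataFrame-mode semantics (delimiter inferred from the extension, one extra <col><suffix> column per resolved column, unmapped values kept) can be sketched as follows. To stay dependency-free the sketch uses the standard csv module rather than pandas, and the helper names infer_sep and resolve_columns, the sample table, and the symbols are all hypothetical.

```python
import csv
import io
from pathlib import Path


def infer_sep(path):
    """Tab for .tsv files, comma otherwise, as described for sep=None."""
    return "\t" if Path(path).suffix == ".tsv" else ","


def resolve_columns(handle, sep, mapping, at, suffix="_current"):
    """Return rows with one extra '<col><suffix>' column per name in at."""
    resolved = []
    for row in csv.DictReader(handle, delimiter=sep):
        for col in at:
            # Values without a mapping are carried over unchanged.
            row[col + suffix] = mapping.get(row[col], row[col])
        resolved.append(row)
    return resolved


# Hypothetical input table and symbol mapping.
table = io.StringIO("gene\tscore\nOLD1\t0.5\nKEEP\t0.9\n")
rows = resolve_columns(table, infer_sep("genes.tsv"), {"OLD1": "NEW1"}, ["gene"])
```

The original column is left untouched, so the resolved value can be compared with the input side by side.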
- save(mapping_set: Sec2PriMappingSet, output_format: str, output: Path | str | None = None, *, base_name: str) -> Path
Write mapping_set and return the path that was written.
Delegates to save() for single formats and write_all_formats() for "all".
- Parameters:
mapping_set – The mapping set to write.
output_format – One of sssom, sec2pri, pri_ids, name2synonym, symbol_sec2pri, pri_symbols, rdf, json, owl, or all.
output – Explicit output path or directory. When None, a default name derived from base_name is used.
base_name – Stem used to derive file names, e.g. "hgnc_2026-04-07".
- Returns:
The directory (for "all") or file path that was written.
- write_all_formats(mapping_set: Sec2PriMappingSet, output_dir: Path, base_name: str, include_name2synonym: bool = True) -> None
Write mapping set in all output formats to a directory.
- Parameters:
mapping_set – The mapping set to write.
output_dir – Directory to write files to.
base_name – Base name for output files (e.g. "chebi_3star_245").
include_name2synonym – Whether to include name2synonym format.
- write_diff_output(result: MappingDiff, output_path: Path) -> None
Write diff results to a TSV file.
- Parameters:
result – MappingDiff object with added/removed/changed mappings.
output_path – Path to write the TSV file.
- write_json(mapping_set: Sec2PriMappingSet, output_path: Path | str) -> Path
Write a mapping set to an SSSOM JSON file.
- Parameters:
mapping_set – The mapping set to write.
output_path – Destination file path (e.g. mappings.json).
- Returns:
Path to the written file.
- write_name2synonym(mapping_set: Sec2PriMappingSet, output_path: Path | str) -> Path
Write name-to-synonym mappings to a TSV file.
Only rows where at least one of subject_label or object_label is set are written. Columns: subject_id, subject_label, object_label.
- Parameters:
mapping_set – The mapping set to write.
output_path – Destination file path (e.g. name2synonym.tsv).
- Returns:
Path to the written file.
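The row filter described above (write a row only when at least one label is set) can be sketched with the standard csv module. This is an illustration under stated assumptions, not the library's writer: the helper name write_label_rows, the dict-based mappings, and the sample values are hypothetical.

```python
import csv
import io

COLUMNS = ["subject_id", "subject_label", "object_label"]


def write_label_rows(mappings, handle):
    """Write rows that have at least one label; return the number written."""
    writer = csv.DictWriter(
        handle, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    written = 0
    for mapping in mappings:
        if mapping.get("subject_label") or mapping.get("object_label"):
            # Missing values are written as empty cells.
            writer.writerow({col: mapping.get(col) or "" for col in COLUMNS})
            written += 1
    return written


mappings = [
    {"subject_id": "EX:1", "subject_label": "name A", "object_label": "synonym A"},
    {"subject_id": "EX:2", "subject_label": None, "object_label": None},
]
buffer = io.StringIO()
count = write_label_rows(mappings, buffer)
```

Here the second mapping carries no labels and is therefore skipped, leaving a header plus one data row in the buffer.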
- write_output(mapping_set: Sec2PriMappingSet, output_format: str, output_path: Path | str) -> Path
Write a mapping set in any registered output format.
- Parameters:
mapping_set – The mapping set to write.
output_format – Format name (must be a key in WRITERS).
output_path – Path to write to.
- Returns:
Path to the written file.
- Raises:
ValueError – If output_format is not recognized.
- write_owl(mapping_set: Sec2PriMappingSet, output_path: Path | str, serialisation: str = 'turtle') -> Path
Write a mapping set to an OWL/RDF file (default: Turtle).
- Parameters:
mapping_set – The mapping set to write.
output_path – Destination file path (e.g. mappings_owl.ttl).
serialisation – RDFLib serialisation format.
- Returns:
Path to the written file.
- write_rdf(mapping_set: Sec2PriMappingSet, output_path: Path | str, serialisation: str = 'turtle') -> Path
Write a mapping set to an RDF file.
- Parameters:
mapping_set – The mapping set to write.
output_path – Destination file path (e.g. mappings.ttl).
serialisation – RDFLib serialisation format.
- Returns:
Path to the written file.
- write_sec2pri(mapping_set: Sec2PriMappingSet, output_path: Path | str) -> Path
Write secondary-to-primary ID mappings to a TSV file.
Columns: subject_id, object_id, predicate_id, mapping_cardinality.
- Parameters:
mapping_set – The mapping set to write.
output_path – Destination file path (e.g. sec2pri.tsv).
- Returns:
Path to the written file.
- write_sssom(mapping_set: Sec2PriMappingSet, output_path: Path | str) -> Path
Write a mapping set to an SSSOM TSV file.
- Parameters:
mapping_set – The mapping set to write.
output_path – Destination .sssom.tsv file path.
- Returns:
Path to the written file.
- write_symbol_sec2pri(mapping_set: Sec2PriMappingSet, output_path: Path | str) -> Path
Write symbol to previous-symbol mappings to a TSV file.
Only rows where at least one of subject_label or object_label is set are written. Columns: subject_id, subject_label, object_label, mapping_cardinality.
- Parameters:
mapping_set – The mapping set to write.
output_path – Destination file path (e.g. symbol2prev.tsv).
- Returns:
Path to the written file.