API
Functions for parsing biological databases and generating SSSOM-compliant mappings.
All parsing functions return Sec2PriMappingSet objects for integration with the SSSOM ecosystem.
This module holds the main functions of pysec2pri: parsing secondary-to-primary mapping files from biological databases, and generating and using the resulting standardized mapping sets.
- combine_mapping_sets(id_mappings: Sec2PriMappingSet | None, synonym_mappings: Sec2PriMappingSet | None) → Sec2PriMappingSet
Combine two mapping sets into one.
- Parameters:
id_mappings – First mapping set (e.g. ID mappings).
synonym_mappings – Second mapping set (e.g. synonym mappings).
- Returns:
Combined mapping set.
- Raises:
ValueError – If both mapping sets are None.
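The None handling described for combine_mapping_sets can be pictured with a small standalone sketch. It does not use pysec2pri: `ToyMappingSet` and `combine` are hypothetical stand-ins, and treating a single None input as "contributes nothing" is an assumption, since the source only states that ValueError is raised when both inputs are None.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ToyMappingSet:
    """Hypothetical stand-in for Sec2PriMappingSet: a list of (secondary, primary) pairs."""
    mappings: List[Tuple[str, str]] = field(default_factory=list)


def combine(id_mappings: Optional[ToyMappingSet],
            synonym_mappings: Optional[ToyMappingSet]) -> ToyMappingSet:
    """Merge two toy mapping sets, mirroring the documented ValueError when both are None."""
    if id_mappings is None and synonym_mappings is None:
        raise ValueError("At least one mapping set must be provided")
    merged = []
    for mapping_set in (id_mappings, synonym_mappings):
        # Assumption: a missing side simply contributes no mappings.
        if mapping_set is not None:
            merged.extend(mapping_set.mappings)
    return ToyMappingSet(mappings=merged)
```

With the real library, the two inputs would presumably be the results of calls such as generate_chebi() and generate_chebi_synonyms().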
- generate_chebi(input_path: Path | str | None = None, version: str | None = None, show_progress: bool = True, subset: str = '3star', mapping_sets: str = 'ids') → Sec2PriMappingSet
Return ChEBI mappings (IDs, synonyms, or both).
Downloads the latest release automatically when input_path is omitted. Pass an SDF file (releases < 245) or a directory of TSV flat files (releases >= 245) to use a local copy.
- Parameters:
input_path – Local SDF file or TSV directory. Auto-downloaded if None.
version – Release number (e.g. "245").
show_progress – Whether to show progress bars.
subset – "3star" (default) or "complete".
mapping_sets – "ids" (default), "synonyms", or "all".
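The three input cases for generate_chebi (nothing given, a single SDF file, a directory of TSV flat files) can be told apart with ordinary path checks. The sketch below is a standalone illustration; `classify_chebi_input` is a hypothetical helper, not part of pysec2pri, and the file name used is made up.

```python
import tempfile
from pathlib import Path


def classify_chebi_input(input_path):
    """Return 'download', 'tsv_dir', or 'sdf' for a generate_chebi-style input_path."""
    if input_path is None:
        return "download"   # nothing local: the latest release would be fetched
    path = Path(input_path)
    if path.is_dir():
        return "tsv_dir"    # directory of TSV flat files (releases >= 245)
    if path.is_file():
        return "sdf"        # single SDF file (releases < 245)
    raise FileNotFoundError(path)


# Exercise the helper against a temporary directory and an empty placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    sdf = Path(tmp) / "example.sdf"
    sdf.write_text("")
    kinds = [
        classify_chebi_input(None),
        classify_chebi_input(tmp),
        classify_chebi_input(sdf),
    ]
```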
- generate_chebi_synonyms(input_path: Path | str | None = None, version: str | None = None, show_progress: bool = True, subset: str = '3star') → Sec2PriMappingSet
Return ChEBI synonym (name) mappings.
- generate_hgnc(input_path: Path | str | None = None, complete_set_path: Path | str | None = None, version: str | None = None, show_progress: bool = True) → Sec2PriMappingSet
Return HGNC secondary to primary ID mappings.
Downloads the withdrawn and complete set files automatically when input_path / complete_set_path are omitted. The complete set is used to populate the full list of current primary IDs so that to_pri_ids() returns the authoritative list (~45 k IDs) rather than just the ~5 k primaries that happen to have a secondary.
- Parameters:
input_path – Local HGNC withdrawn TSV. Auto-downloaded if None.
complete_set_path – Local HGNC complete set TSV. Auto-downloaded if None.
version – Version string for metadata.
show_progress – Whether to show progress bars.
- generate_hgnc_primary_ids(input_path: Path | str | None = None, version: str | None = None, show_progress: bool = True) → Sec2PriMappingSet
Return a mapping set containing the full list of current HGNC primary IDs.
Only the HGNC complete set file is downloaded/read. The returned mapping set has an empty mappings list; its _primary_ids store is populated with every current HGNC ID so that to_pri_ids() produces the authoritative complete list, not just the subset of primaries that happen to have an associated secondary.
- Parameters:
input_path – Local HGNC complete set TSV. Auto-downloaded if None.
version – Version string for metadata.
show_progress – Whether to show progress bars.
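Why the complete set matters for the primary ID list can be shown with plain sets. The sketch below uses made-up HGNC-style identifiers and does not call pysec2pri; it only contrasts the primaries recoverable from the mappings with those listed in a complete set.

```python
# Toy data; the identifiers are illustrative only.
mappings = [
    ("HGNC:100", "HGNC:200"),  # (secondary, primary)
    ("HGNC:101", "HGNC:200"),
    ("HGNC:102", "HGNC:300"),
]
complete_set = ["HGNC:200", "HGNC:300", "HGNC:400", "HGNC:500"]

# Primaries recoverable from the mappings alone: only those that have a secondary.
primaries_from_mappings = sorted({primary for _, primary in mappings})

# Primaries when the complete set is loaded: every current ID.
primaries_from_complete_set = sorted(set(complete_set))

# IDs that would be missing without the complete set.
missing = sorted(set(primaries_from_complete_set) - set(primaries_from_mappings))
```

Here `missing` holds the primaries that never had a secondary, which is the gap the complete set file closes.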
- generate_hgnc_symbols(input_path: Path | str | None = None, version: str | None = None, show_progress: bool = True, statuses: list[str] | None = None) → Sec2PriMappingSet
Return HGNC symbol to previous-symbol mappings.
Downloads the complete set file automatically when input_path is omitted.
- Parameters:
input_path – Local HGNC complete set TSV. Auto-downloaded if None.
version – Version string for metadata.
show_progress – Whether to show progress bars.
statuses – Entry statuses to include (e.g. ["Approved"]).
- generate_hmdb(input_path: Path | str | None = None, version: str | None = None, show_progress: bool = True) → Sec2PriMappingSet
Return HMDB metabolite secondary to primary accession mappings.
Downloads hmdb_metabolites.xml automatically when input_path is omitted.
- Parameters:
input_path – Local hmdb_metabolites.xml (or .zip/.gz). Auto-downloaded if None.
version – Version string for metadata.
show_progress – Whether to show progress bars.
- generate_hmdb_proteins(input_path: Path | str | None = None, version: str | None = None, show_progress: bool = True) → Sec2PriMappingSet
Return HMDB protein secondary to primary accession mappings.
Downloads hmdb_proteins.xml automatically when input_path is omitted.
- Parameters:
input_path – Local hmdb_proteins.xml (or .zip/.gz). Auto-downloaded if None.
version – Version string for metadata.
show_progress – Whether to show progress bars.
- generate_ncbi(input_path: Path | str | None = None, tax_id: str = '9606', version: str | None = None, show_progress: bool = True) → Sec2PriMappingSet
Return NCBI Gene secondary to primary ID mappings.
Downloads the gene history file automatically when input_path is omitted.
- Parameters:
input_path – Local gene_history file. Auto-downloaded if None.
tax_id – NCBI taxonomy ID to filter (default: "9606" for human).
version – Version string for metadata.
show_progress – Whether to show progress bars.
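The tax_id filter can be illustrated on a tiny inline sample. The sketch below assumes the tab-separated gene_history column layout (#tax_id, GeneID, Discontinued_GeneID, Discontinued_Symbol, Discontinue_Date) and made-up values; `secondary_to_primary` is a hypothetical helper, and skipping rows whose GeneID is "-" is a choice made for this sketch, not necessarily what pysec2pri does.

```python
import csv
import io

# A tiny inline sample in the assumed gene_history layout (values are made up).
SAMPLE = (
    "#tax_id\tGeneID\tDiscontinued_GeneID\tDiscontinued_Symbol\tDiscontinue_Date\n"
    "9606\t1001\t9001\tOLD1\t20200101\n"
    "9606\t-\t9002\tOLD2\t20200102\n"
    "10090\t2001\t9003\tOld3\t20200103\n"
)


def secondary_to_primary(handle, tax_id="9606"):
    """Return {discontinued GeneID: current GeneID} for one taxonomy ID."""
    reader = csv.DictReader(handle, delimiter="\t")
    result = {}
    for row in reader:
        if row["#tax_id"] != tax_id:
            continue  # different organism
        if row["GeneID"] == "-":
            continue  # discontinued without a replacement; skipped in this sketch
        result[row["Discontinued_GeneID"]] = row["GeneID"]
    return result


human = secondary_to_primary(io.StringIO(SAMPLE))
mouse = secondary_to_primary(io.StringIO(SAMPLE), tax_id="10090")
```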
- generate_ncbi_symbols(input_path: Path | str | None = None, tax_id: str = '9606', version: str | None = None, show_progress: bool = True) → Sec2PriMappingSet
Return NCBI Gene symbol to previous-symbol mappings.
Downloads the gene_info file automatically when input_path is omitted.
- Parameters:
input_path – Local gene_info file. Auto-downloaded if None.
tax_id – NCBI taxonomy ID to filter (default: "9606" for human).
version – Version string for metadata.
show_progress – Whether to show progress bars.
- generate_uniprot(input_path: Path | str | None = None, delac_file: Path | str | None = None, version: str | None = None, show_progress: bool = True) → Sec2PriMappingSet
Return UniProt secondary to primary accession mappings.
Downloads sec_ac.txt and delac_sp.txt automatically when input_path is omitted.
- Parameters:
input_path – Local sec_ac.txt. Auto-downloaded if None.
delac_file – Local delac_sp.txt.
version – Version string for metadata.
show_progress – Whether to show progress bars.
- generate_wikidata(input_path: Path | str | None = None, entity_type: str | None = None, version: str | None = None, endpoint: str | None = None, show_progress: bool = True, test_subset: bool = False) → Sec2PriMappingSet
Return Wikidata redirect mappings via SPARQL (or a pre-downloaded TSV).
Queries the QLever Wikidata endpoint when input_path is omitted. If entity_type is None, all entity types (metabolites, genes, proteins) are queried and combined.
- Parameters:
input_path – Pre-downloaded TSV file. Queries SPARQL if None.
entity_type – "metabolites", "chemicals", "genes", or "proteins". Queries all types when None.
version – Version string for metadata (defaults to today’s date).
endpoint – Custom SPARQL endpoint URL.
show_progress – Whether to show progress bars.
test_subset – Use test queries limited to 10 results.
- generate_wikidata_symbols(input_path: Path | str | None = None, entity_type: str | None = None, version: str | None = None, endpoint: str | None = None, show_progress: bool = True, test_subset: bool = False) → LabelMappingSet
Return Wikidata label mappings (previous label to current label).
Queries the QLever Wikidata endpoint when input_path is omitted. If entity_type is None, all entity types are queried and their label mappings combined.
- Parameters:
input_path – Pre-downloaded TSV file. Queries SPARQL if None.
entity_type – "metabolites", "chemicals", "genes", or "proteins". Queries all types when None.
version – Version string for metadata.
endpoint – Custom SPARQL endpoint URL.
show_progress – Whether to show progress bars.
test_subset – Use test queries limited to 10 results.
- Returns:
LabelMappingSet with label mappings.
- load_label_mapping(path: Path | str) → LabelMappingSet
Load a label/symbol mapping set from a pysec2pri TSV file.
Accepts the symbol2prev TSV format (columns subject_id, subject_label, object_label, mapping_cardinality) and the full SSSOM TSV format (comment-prefixed metadata lines are skipped).
- Parameters:
path – Path to the TSV file to load.
- Returns:
A LabelMappingSet populated from the file, ready to pass to resolve_symbols().
- load_mapping(path: Path | str) → IdMappingSet
Load an ID mapping set from a pysec2pri TSV file.
Accepts the sec2pri TSV format (columns subject_id, object_id, predicate_id, mapping_cardinality) and the full SSSOM TSV format (comment-prefixed metadata lines are skipped automatically).
- Parameters:
path – Path to the TSV file to load.
- Returns:
An IdMappingSet populated from the file, ready to pass to resolve_ids().
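Skipping comment-prefixed metadata lines before parsing the table is a common pattern for SSSOM-style TSV files. The sketch below shows one way to do it with the standard library only; `read_mapping_rows` is a hypothetical helper, the sample rows are made up, and the predicate value is merely illustrative. It is not the loader used by pysec2pri.

```python
import csv
import io

# Inline sample: "#"-prefixed metadata lines followed by the four sec2pri columns.
SAMPLE = (
    "# mapping_set_id: https://example.org/toy\n"
    "# license: https://example.org/license\n"
    "subject_id\tobject_id\tpredicate_id\tmapping_cardinality\n"
    "EX:1\tEX:2\tIAO:0100001\t1:1\n"
    "EX:3\tEX:4\tIAO:0100001\t1:1\n"
)


def read_mapping_rows(handle):
    """Parse a tab-separated mapping table, ignoring lines that start with '#'."""
    data_lines = (line for line in handle if not line.startswith("#"))
    return list(csv.DictReader(data_lines, delimiter="\t"))


rows = read_mapping_rows(io.StringIO(SAMPLE))
```

Because the metadata lines are filtered out before the reader sees the header, the same function handles both a bare four-column file and one with a metadata preamble.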
- resolve_ids(input_path: Path | str | list[str], mapping_set: Sec2PriMappingSet, at: str | list[str] | None = None, *, output_path: Path | str | None = None, suffix: str = '_primary', sep: str | None = None) → pd.DataFrame | str | list[str]
Resolve secondary IDs to primary IDs.
Direct lookup: when input_path is a plain identifier string or a list of identifier strings (i.e. not a path to an existing file), the function returns the resolved primary ID(s). at, output_path, suffix, and sep are ignored in this mode:
resolve_ids("HMDB00001", hmdb_ms)  # → "HMDB:HMDB0000001"
resolve_ids(["HMDB00001", "HMDB00002"], hmdb_ms)  # → ["...", "..."]
DataFrame mode: when input_path points to an existing TSV/CSV file, at is required. The file is read with pandas.read_csv and for each column named in at a new column <col><suffix> is appended containing the resolved primary IDs. Identifiers not present in mapping_set are kept unchanged.
- Parameters:
input_path – An identifier string, a list of identifier strings, or the path to a TSV/CSV file.
mapping_set – A Sec2PriMappingSet (e.g. the result of generate_hgnc()).
at – Column name(s) to resolve. Required in DataFrame mode; ignored in direct-lookup mode.
output_path – If given, the resulting DataFrame is written to this path (DataFrame mode only).
suffix – Suffix appended to each resolved column name (default "_primary").
sep – Delimiter for reading the file. Inferred from the extension when None ("\t" for .tsv, "," otherwise).
- Returns:
A resolved identifier string, a list of resolved strings (direct-lookup mode), or a pandas.DataFrame with one additional column per entry in at (DataFrame mode).
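The two modes of resolve_ids can be mimicked with a dictionary and the csv module. The sketch below is a simplified stand-in, not the library's implementation: `resolve_value` and `resolve_file` are hypothetical names, the identifiers are made up, rows are returned as dictionaries rather than a pandas DataFrame, and no prefix normalization (as in the "HMDB00001" example) is attempted. Returning unmapped identifiers unchanged in direct-lookup mode is an assumption carried over from the documented DataFrame behaviour.

```python
import csv
import tempfile
from pathlib import Path

# Toy secondary -> primary lookup; the identifiers are made up.
SEC2PRI = {"OLD:1": "NEW:1", "OLD:2": "NEW:2"}


def resolve_value(identifier, lookup):
    """Direct lookup: unmapped identifiers are returned unchanged (assumption)."""
    return lookup.get(identifier, identifier)


def resolve_file(path, lookup, at, suffix="_primary", sep=None):
    """File mode: append one '<col><suffix>' column per column named in `at`."""
    path = Path(path)
    if sep is None:
        # Infer the delimiter from the extension, as the documentation describes.
        sep = "\t" if path.suffix == ".tsv" else ","
    columns = [at] if isinstance(at, str) else list(at)
    with path.open(newline="") as handle:
        rows = list(csv.DictReader(handle, delimiter=sep))
    for row in rows:
        for col in columns:
            row[col + suffix] = resolve_value(row[col], lookup)
    return rows


# Exercise both modes on a temporary TSV file.
with tempfile.TemporaryDirectory() as tmp:
    table = Path(tmp) / "genes.tsv"
    table.write_text("gene\tscore\nOLD:1\t0.5\nKEEP:9\t0.7\n")
    resolved = resolve_file(table, SEC2PRI, at="gene")

single = resolve_value("OLD:2", SEC2PRI)
```

The original column is left in place and the resolved values go into a new column, so the input table can always be compared against the result.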
- resolve_symbols(input_path: Path | str | list[str], mapping_set: Sec2PriMappingSet, at: str | list[str] | None = None, *, output_path: Path | str | None = None, suffix: str = '_current', sep: str | None = None) → pd.DataFrame | str | list[str]
Resolve previous/alias symbols to current symbols.
Direct lookup: when input_path is a plain symbol string or a list of symbol strings (i.e. not a path to an existing file), the function returns the resolved current symbol(s). at, output_path, suffix, and sep are ignored in this mode:
resolve_symbols("Ibuprofen", chebi_ms)  # → "ibuprofen"
resolve_symbols(["Ibuprofen", "Glucose"], chebi_ms)  # → ["...", "..."]
DataFrame mode: when input_path points to an existing TSV/CSV file, at is required. For each column named in at a new column <col><suffix> is appended containing the resolved current symbols. Symbols not present in mapping_set are kept unchanged.
- Parameters:
input_path – A symbol string, a list of symbol strings, or the path to a TSV/CSV file.
mapping_set – A LabelMappingSet (e.g. the result of generate_hgnc_symbols()).
at – Column name(s) to resolve. Required in DataFrame mode; ignored in direct-lookup mode.
output_path – If given, the resulting DataFrame is written to this path (DataFrame mode only).
suffix – Suffix appended to each resolved column name (default "_current").
sep – Delimiter for reading the file. Inferred from the extension when None ("\t" for .tsv, "," otherwise).
- Returns:
A resolved symbol string, a list of resolved strings (direct-lookup mode), or a pandas.DataFrame with one additional column per entry in at (DataFrame mode).
- save(mapping_set: Sec2PriMappingSet, output_format: str, output: Path | str | None = None, *, base_name: str) → Path
Write mapping_set and return the path that was written.
Delegates to save() for single formats and write_all_formats() for "all".
- Parameters:
mapping_set – The mapping set to write.
output_format – One of sssom, sec2pri, pri_ids, name2synonym, symbol_sec2pri, pri_symbols, rdf, json, owl, or all.
output – Explicit output path or directory. When None, a default name derived from base_name is used.
base_name – Stem used to derive file names, e.g. "hgnc_2026-04-07".
- Returns:
The directory (for "all") or file path that was written.
- write_all_formats(mapping_set: Sec2PriMappingSet, output_dir: Path, base_name: str, include_name2synonym: bool = True) → None
Write mapping set in all output formats to a directory.
- Parameters:
mapping_set – The mapping set to write.
output_dir – Directory to write files to.
base_name – Base name for output files (e.g., “chebi_3star_245”).
include_name2synonym – Whether to include name2synonym format.
- write_diff_output(result: MappingDiff, output_path: Path) → None
Write diff results to a TSV file.
- Parameters:
result – MappingDiff object with added/removed/changed mappings.
output_path – Path to write the TSV file.
- write_json(mapping_set: Sec2PriMappingSet, output_path: Path | str) → Path
Write a mapping set to an SSSOM JSON file.
- Parameters:
mapping_set – The mapping set to write.
output_path – Destination file path (e.g. mappings.json).
- Returns:
Path to the written file.
- write_name2synonym(mapping_set: Sec2PriMappingSet, output_path: Path | str) → Path
Write name to synonym mappings to a TSV file.
Only rows where at least one of subject_label or object_label is set are written. Columns: subject_id, subject_label, object_label.
- Parameters:
mapping_set – The mapping set to write.
output_path – Destination file path (e.g. name2synonym.tsv).
- Returns:
Path to the written file.
- write_output(mapping_set: Sec2PriMappingSet, output_format: str, output_path: Path | str) → Path
Write a mapping set in any registered output format.
- Parameters:
mapping_set – The mapping set to write.
output_format – Format name (must be a key in WRITERS).
output_path – Path to write to.
- Returns:
Path to the written file.
- Raises:
ValueError – If output_format is not recognized.
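Dispatching on a format name through a registry dictionary, with ValueError for unknown names, can be sketched as follows. Everything here is a hypothetical stand-in: `TOY_WRITERS`, the two toy writers, and `write_output_sketch` are illustrative names, and the real contents of WRITERS are not shown in the source.

```python
import tempfile
from pathlib import Path


def write_pairs_tsv(pairs, output_path):
    """Toy writer: one 'secondary<TAB>primary' line per pair."""
    output_path = Path(output_path)
    output_path.write_text("".join(f"{sec}\t{pri}\n" for sec, pri in pairs))
    return output_path


def write_primaries(pairs, output_path):
    """Toy writer: the distinct primary IDs, one per line."""
    output_path = Path(output_path)
    primaries = sorted({pri for _, pri in pairs})
    output_path.write_text("".join(f"{pri}\n" for pri in primaries))
    return output_path


# Hypothetical registry mapping format names to writer callables.
TOY_WRITERS = {"pairs": write_pairs_tsv, "primaries": write_primaries}


def write_output_sketch(pairs, output_format, output_path):
    """Look up the writer for output_format and call it; unknown names raise ValueError."""
    try:
        writer = TOY_WRITERS[output_format]
    except KeyError:
        raise ValueError(f"Unknown output format: {output_format!r}") from None
    return writer(pairs, output_path)


# Write one toy mapping in the 'primaries' format and read it back.
with tempfile.TemporaryDirectory() as tmp:
    written = write_output_sketch([("OLD:1", "NEW:1")], "primaries", Path(tmp) / "ids.txt")
    content = written.read_text()
```

A registry keeps the dispatcher unchanged when a new format is added: only a new dictionary entry is needed.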
- write_owl(mapping_set: Sec2PriMappingSet, output_path: Path | str, serialisation: str = 'turtle') → Path
Write a mapping set to an OWL/RDF file (default: Turtle).
- Parameters:
mapping_set – The mapping set to write.
output_path – Destination file path (e.g. mappings_owl.ttl).
serialisation – RDFLib serialisation format.
- Returns:
Path to the written file.
- write_rdf(mapping_set: Sec2PriMappingSet, output_path: Path | str, serialisation: str = 'turtle') → Path
Write a mapping set to an RDF file.
- Parameters:
mapping_set – The mapping set to write.
output_path – Destination file path (e.g. mappings.ttl).
serialisation – RDFLib serialisation format.
- Returns:
Path to the written file.
- write_sec2pri(mapping_set: Sec2PriMappingSet, output_path: Path | str) → Path
Write secondary to primary ID mappings to a TSV file.
Columns: subject_id, object_id, predicate_id, mapping_cardinality.
- Parameters:
mapping_set – The mapping set to write.
output_path – Destination file path (e.g. sec2pri.tsv).
- Returns:
Path to the written file.
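The four-column layout written by write_sec2pri can be reproduced with csv.DictWriter. The sketch below writes to an in-memory buffer rather than a file; the row values, including the predicate and cardinality strings, are illustrative only and are not taken from real pysec2pri output.

```python
import csv
import io

# Column order as listed in the documentation above.
COLUMNS = ["subject_id", "object_id", "predicate_id", "mapping_cardinality"]

# Made-up rows for illustration.
rows = [
    {"subject_id": "EX:1", "object_id": "EX:2",
     "predicate_id": "IAO:0100001", "mapping_cardinality": "1:1"},
    {"subject_id": "EX:3", "object_id": "EX:4",
     "predicate_id": "IAO:0100001", "mapping_cardinality": "1:1"},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n")
writer.writeheader()      # header row with the four column names
writer.writerows(rows)    # one tab-separated line per mapping
tsv_text = buffer.getvalue()
```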
- write_sssom(mapping_set: Sec2PriMappingSet, output_path: Path | str) → Path
Write a mapping set to an SSSOM TSV file.
- Parameters:
mapping_set – The mapping set to write.
output_path – Destination .sssom.tsv file path.
- Returns:
Path to the written file.
- write_symbol_sec2pri(mapping_set: Sec2PriMappingSet, output_path: Path | str) → Path
Write symbol to previous symbol mappings to a TSV file.
Only rows where at least one of subject_label or object_label is set are written. Columns: subject_id, subject_label, object_label, mapping_cardinality.
- Parameters:
mapping_set – The mapping set to write.
output_path – Destination file path (e.g. symbol2prev.tsv).
- Returns:
Path to the written file.