Parsers
Database-specific parsers that produce SSSOM-compliant MappingSets.
All parsers inherit from BaseParser
and return Sec2PriMappingSet objects.
To add a new database, provide a config YAML, a parser class, and a DataSourceConfig entry in constants.py.
Optionally, add automated download support in src/pysec2pri/download.py.
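| Database | Input | Methods |
|---|---|---|
| ChEBI | TSV directory (≥ release 245) or SDF file (< 245) | parse(), parse_synonyms() |
| HMDB | hmdb_metabolites.xml and/or hmdb_proteins.xml (or .zip/.gz) | parse(), parse_proteins() |
| HGNC | Withdrawn TSV and complete set TSV | parse(), parse_primary_ids(), parse_symbols(), parse_all() |
| NCBI Gene | gene_history and gene_info files | parse(), parse_symbols(), parse_all() |
| UniProt | sec_ac.txt and delac_sp.txt | parse() |
| Wikidata | SPARQL endpoint (live) or pre-fetched JSON | parse(), parse_all(), parse_from_file(), parse_symbols() |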
Module Reference
- class ChEBIParser(version: str | None = None, show_progress: bool = True, subset: str = '3star')[source]
Parser for ChEBI data files.
Supports both TSV flat files (>= release 245) and (legacy) SDF files. Extracts secondary-to-primary ChEBI identifier mappings and name-to-synonym relationships.
Returns an IdMappingSet for ID mappings (cardinality computed on IDs) and can optionally include synonym mappings via LabelMappingSet.
Initialize the ChEBI parser.
- Parameters:
version – Version/release identifier for the datasource.
show_progress – Whether to show progress bars during parsing.
subset – “3star” or “complete” - which compound subset to use. For TSV format, filters by stars in compounds.tsv. For SDF format, determines which file to download.
- parse(input_path: Path | str | None = None, *, secondary_ids_path: Path | str | None = None, compounds_path: Path | str | None = None) Sec2PriMappingSet[source]
Parse ChEBI data into an IdMappingSet.
Accepts three calling conventions:
1. input_path is a directory: expects secondary_ids.tsv (and optionally compounds.tsv) inside it (TSV format >= 245).
2. input_path is an SDF file: legacy format (< 245).
3. Keyword args secondary_ids_path / compounds_path: explicit TSV paths (kept for backwards compatibility).
- Parameters:
input_path – Path to an SDF file, or a directory of TSV files.
secondary_ids_path – Explicit path to secondary_ids.tsv (TSV format).
compounds_path – Explicit path to compounds.tsv for 3-star filtering.
- Returns:
IdMappingSet with computed cardinalities.
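The three calling conventions can be pictured as a small dispatch step that runs before any parsing. The sketch below is not the library's code: `resolve_chebi_inputs` is a hypothetical helper, and letting explicit keyword paths take precedence over `input_path` is an assumption made for illustration.

```python
from pathlib import Path


def resolve_chebi_inputs(input_path=None, *, secondary_ids_path=None, compounds_path=None):
    """Return ("tsv", secondary_ids, compounds_or_None) or ("sdf", sdf_file, None).

    Hypothetical helper: it only classifies the arguments, it parses nothing.
    """
    # Explicit keyword paths (assumed here to win over input_path).
    if secondary_ids_path is not None:
        compounds = Path(compounds_path) if compounds_path is not None else None
        return "tsv", Path(secondary_ids_path), compounds

    if input_path is None:
        raise ValueError("Provide input_path or secondary_ids_path")

    path = Path(input_path)
    # A directory is treated as the TSV layout (release >= 245).
    if path.is_dir():
        compounds = path / "compounds.tsv"
        return "tsv", path / "secondary_ids.tsv", compounds if compounds.exists() else None

    # Anything else is treated as a legacy SDF file (release < 245).
    return "sdf", path, None
```

Keeping this decision in one place means the TSV and SDF code paths never need to inspect the arguments again.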
- parse_synonyms(input_path: Path | str | None = None, *, names_path: Path | str | None = None, compounds_path: Path | str | None = None) Sec2PriMappingSet[source]
Parse ChEBI data into a LabelMappingSet for synonyms.
Accepts three calling conventions:
1. input_path is a directory: expects names.tsv (and optionally compounds.tsv) inside it (TSV format >= 245).
2. input_path is an SDF file: legacy format (< 245).
3. Keyword args names_path / compounds_path: explicit TSV paths (kept for backwards compatibility).
- Parameters:
input_path – Path to an SDF file, or a directory of TSV files.
names_path – Explicit path to names.tsv (TSV format).
compounds_path – Explicit path to compounds.tsv for 3-star filtering.
- Returns:
LabelMappingSet with computed cardinalities based on labels.
- class HMDBParser(version: str | None = None, show_progress: bool = True, config_name: str | None = None)[source]
Parser for HMDB XML files (metabolites and proteins).
Extracts secondary-to-primary HMDB accession mappings from hmdb_metabolites.xml and/or hmdb_proteins.xml.
Initialize the parser.
- Parameters:
version – Version/release identifier for the datasource.
show_progress – Whether to show progress bars during parsing.
config_name – Name of config file to load (defaults to class name).
- parse(input_path: Path | str | None) Sec2PriMappingSet[source]
Parse HMDB metabolites XML file.
- Parameters:
input_path – Path to hmdb_metabolites.xml (or .zip/.gz).
- Returns:
IdMappingSet for metabolite accessions.
- parse_proteins(input_path: Path | str) Sec2PriMappingSet[source]
Parse HMDB proteins XML file.
Primary accessions have the form HMDBP00001. Secondary accessions may be bare numbers (legacy format, e.g. 5229) or full HMDBP accessions; both are normalised to HMDBP:HMDBP<zero-padded> using the same prefix logic as the metabolites parser.
- Parameters:
input_path – Path to hmdb_proteins.xml (or .zip/.gz).
- Returns:
IdMappingSet for protein accessions.
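The normalisation of protein accessions described above could look roughly like the following. This is a minimal sketch, not the parser's implementation: `normalise_hmdbp` is a hypothetical name, and the padding width of five digits is an assumption inferred from the `HMDBP00001` example.

```python
import re


def normalise_hmdbp(accession: str, width: int = 5) -> str:
    """Turn a bare number or an HMDBP accession into an HMDBP:HMDBP<zero-padded> CURIE.

    The width of 5 is an assumption based on the form HMDBP00001.
    """
    match = re.fullmatch(r"(?:HMDBP)?(\d+)", accession.strip())
    if match is None:
        raise ValueError(f"Not an HMDB protein accession: {accession!r}")
    # Re-pad the numeric part so legacy and full forms end up identical.
    local_id = f"HMDBP{int(match.group(1)):0{width}d}"
    return f"HMDBP:{local_id}"


print(normalise_hmdbp("5229"))        # legacy bare number
print(normalise_hmdbp("HMDBP00001"))  # already a full accession
```

Both input styles pass through the same function, so the bare and prefixed forms of one accession resolve to a single identifier.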
- class HGNCParser(version: str | None = None, show_progress: bool = True)[source]
Parser for HGNC TSV files using Polars for memory efficiency.
Extracts secondary-to-primary HGNC identifier mappings and symbol mappings from HGNC withdrawn and complete set files.
Returns an IdMappingSet for ID-to-ID mappings (withdrawn/merged IDs) and a LabelMappingSet for symbol mappings (alias/previous symbols).
Initialize the HGNC parser.
- Parameters:
version – Version/release identifier for the datasource.
show_progress – Whether to show progress bars during parsing.
- parse(input_path: Path | str | None, complete_set_path: Path | str | None = None) Sec2PriMappingSet[source]
Parse HGNC withdrawn TSV file into an IdMappingSet.
- Parameters:
input_path – Path to the withdrawn HGNC TSV file.
complete_set_path – Optional path to the HGNC complete set TSV. When supplied, all_primary_ids on the returned mapping set is populated with every current HGNC ID, not just those that appear as object_id in a withdrawn-to-primary mapping.
- Returns:
IdMappingSet with computed cardinalities based on IDs.
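"Computed cardinalities based on IDs" can be illustrated with a small stand-alone function. The sketch below assumes the SSSOM-style reading in which the left side counts subjects (secondary IDs) sharing an object and the right side counts objects (primary IDs) reached from a subject; `compute_cardinalities` is a hypothetical name, and the library may compute this differently. The HGNC IDs are toy values.

```python
from collections import defaultdict


def compute_cardinalities(pairs):
    """Map each (subject_id, object_id) pair to "1:1", "1:n", "n:1" or "n:n"."""
    pairs = list(pairs)  # the pairs are iterated twice
    objects_of = defaultdict(set)
    subjects_of = defaultdict(set)
    for subject_id, object_id in pairs:
        objects_of[subject_id].add(object_id)
        subjects_of[object_id].add(subject_id)

    result = {}
    for subject_id, object_id in pairs:
        # Left side: several secondaries merged into this primary?
        left = "n" if len(subjects_of[object_id]) > 1 else "1"
        # Right side: this secondary split into several primaries?
        right = "n" if len(objects_of[subject_id]) > 1 else "1"
        result[(subject_id, object_id)] = f"{left}:{right}"
    return result


toy_pairs = [
    ("HGNC:1", "HGNC:10"),  # merged with HGNC:2 into HGNC:10
    ("HGNC:2", "HGNC:10"),
    ("HGNC:3", "HGNC:30"),  # split into HGNC:30 and HGNC:31
    ("HGNC:3", "HGNC:31"),
    ("HGNC:4", "HGNC:40"),  # simple replacement
]
for pair, cardinality in compute_cardinalities(toy_pairs).items():
    print(pair, cardinality)
```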
- parse_primary_ids(complete_set_path: Path | str | None) Sec2PriMappingSet[source]
Return a mapping set whose only content is the full primary ID list.
Reads the HGNC complete set to extract every current HGNC ID and stores it in _primary_ids. The mappings list is intentionally left empty; this mapping set exists only to drive to_pri_ids().
- Parameters:
complete_set_path – Path to the HGNC complete set TSV file.
- Returns:
IdMappingSet with no mappings and _primary_ids populated with all current HGNC IDs.
- parse_symbols(complete_set_path: Path | str | None, statuses: list[str] | None = None) Sec2PriMappingSet[source]
Parse HGNC complete set for symbol (label) mappings.
- Parameters:
complete_set_path – Path to the complete HGNC set TSV file.
statuses – Entry statuses to include (e.g. ["Approved"]). If None (default), all entries are included.
- Returns:
LabelMappingSet with computed cardinalities based on labels.
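To make the effect of `statuses` concrete, here is a rough stand-alone sketch of extracting alias and previous symbols from complete-set rows. It uses the standard `csv` module rather than Polars, and it assumes the columns `symbol`, `status`, `alias_symbol`, and `prev_symbol` with `|`-separated values; `symbol_pairs` is a hypothetical name and the rows are toy data.

```python
import csv
import io


def symbol_pairs(tsv_text, statuses=None):
    """Yield (secondary_symbol, primary_symbol, source_column) tuples."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        # statuses=None keeps every entry, mirroring the documented default.
        if statuses is not None and row["status"] not in statuses:
            continue
        for column in ("alias_symbol", "prev_symbol"):
            # Multi-valued cells are assumed to be pipe-separated.
            for symbol in filter(None, (row.get(column) or "").split("|")):
                yield symbol, row["symbol"], column


TOY_TSV = (
    "hgnc_id\tsymbol\tstatus\talias_symbol\tprev_symbol\n"
    "HGNC:1\tGENE1\tApproved\tG1|GN1\tOLD1\n"
    "HGNC:2\tGENE2\tEntry Withdrawn\t\tOLD2\n"
)

for pair in symbol_pairs(TOY_TSV, statuses=["Approved"]):
    print(pair)
```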
- parse_all(withdrawn_path: Path | str | None, complete_set_path: Path | str | None) tuple[Sec2PriMappingSet, Sec2PriMappingSet][source]
Parse both withdrawn and complete set files.
- Parameters:
withdrawn_path – Path to the withdrawn HGNC TSV file.
complete_set_path – Path to the complete HGNC set TSV file.
- Returns:
Tuple of (IdMappingSet, LabelMappingSet).
- class NCBIParser(version: str | None = None, show_progress: bool = True, config_name: str | None = None)[source]
Parser for NCBI Gene TSV files using Polars.
Extracts secondary-to-primary NCBI Gene identifier mappings including gene symbols from gene_history and gene_info files.
Returns an IdMappingSet for ID-to-ID mappings (discontinued Gene IDs) and a LabelMappingSet for symbol mappings (gene synonyms).
Initialize the parser.
- Parameters:
version – Version/release identifier for the datasource.
show_progress – Whether to show progress bars during parsing.
config_name – Name of config file to load (defaults to class name).
- parse(input_path: Path | str | None = None, tax_id: str = '9606') Sec2PriMappingSet[source]
Parse NCBI gene_history file into an IdMappingSet.
- Parameters:
input_path – Path to gene_history file (can be .gz compressed).
tax_id – Taxonomy ID to filter by (default: “9606” for human).
- Returns:
IdMappingSet with computed cardinalities based on IDs.
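The `tax_id` filter can be sketched without Polars as follows. This is an illustration only: it assumes the gene_history column order `tax_id`, `GeneID`, `Discontinued_GeneID`, and it skips rows whose `GeneID` is `-` (discontinued without a replacement), which may not match how the parser treats them. `discontinued_gene_pairs` is a hypothetical name and the rows are toy data.

```python
import csv


def discontinued_gene_pairs(lines, tax_id="9606"):
    """Return (discontinued_gene_id, current_gene_id) pairs for one taxon."""
    pairs = []
    for row in csv.reader(lines, delimiter="\t"):
        # Skip blank lines, short rows and the '#tax_id ...' header line.
        if len(row) < 3 or row[0].startswith("#"):
            continue
        row_tax_id, gene_id, discontinued_id = row[0], row[1], row[2]
        if row_tax_id != tax_id:
            continue
        # Assumption: '-' means no replacement, so no mapping is emitted.
        if gene_id == "-":
            continue
        pairs.append((discontinued_id, gene_id))
    return pairs


toy_lines = [
    "#tax_id\tGeneID\tDiscontinued_GeneID\tDiscontinued_Symbol\tDiscontinue_Date",
    "9606\t100\t200\tOLDA\t20200101",
    "9606\t-\t300\tOLDB\t20200101",
    "10090\t400\t500\tOLDC\t20200101",
]
print(discontinued_gene_pairs(toy_lines))
```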
- parse_symbols(gene_info_path: Path | str | None, tax_id: str = '9606') Sec2PriMappingSet[source]
Parse NCBI gene_info file for symbol (label) mappings.
- Parameters:
gene_info_path – Path to gene_info file.
tax_id – Taxonomy ID to filter by (default: “9606” for human).
- Returns:
LabelMappingSet with computed cardinalities based on labels.
- parse_all(gene_history_path: Path | str | None, gene_info_path: Path | str | None, tax_id: str = '9606') tuple[Sec2PriMappingSet, Sec2PriMappingSet][source]
Parse both gene_history and gene_info files.
- Parameters:
gene_history_path – Path to gene_history file.
gene_info_path – Path to gene_info file.
tax_id – Taxonomy ID to filter by.
- Returns:
Tuple of (IdMappingSet, LabelMappingSet).
- class UniProtParser(version: str | None = None, show_progress: bool = True, config_name: str | None = None)[source]
Parser for UniProt files using Polars.
Extracts secondary-to-primary UniProt accession mappings from sec_ac.txt (secondary accessions) and delac_sp.txt (deleted accessions).
Returns IdMappingSet for all mappings (UniProt only has ID mappings).
Initialize the parser.
- Parameters:
version – Version/release identifier for the datasource.
show_progress – Whether to show progress bars during parsing.
config_name – Name of config file to load (defaults to class name).
- parse(input_path: Path | str | None = None, delac_path: Path | str | None = None) Sec2PriMappingSet[source]
Parse UniProt mapping files into an IdMappingSet.
- Parameters:
input_path – Path to sec_ac.txt (secondary accessions file).
delac_path – Path to delac_sp.txt (deleted accessions file).
- Returns:
IdMappingSet with computed cardinalities based on IDs.
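As a rough illustration of reading sec_ac.txt, the sketch below keeps only lines that consist of exactly two tokens that both look like UniProt accessions, which sidesteps the free-text header. It is plain Python rather than Polars and is not the parser's code; `secondary_to_primary` is a hypothetical name, and the accessions are made-up values that merely match the accession pattern.

```python
import re

# UniProt accession pattern (6- or 10-character accessions).
ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def secondary_to_primary(lines):
    """Return (secondary_accession, primary_accession) pairs from sec_ac-style lines."""
    pairs = []
    for line in lines:
        tokens = line.split()
        # Header and separator lines fail this check and are ignored.
        if len(tokens) == 2 and all(ACCESSION.fullmatch(token) for token in tokens):
            pairs.append((tokens[0], tokens[1]))
    return pairs


toy_lines = [
    "Secondary AC    Primary AC",
    "__________   __________",
    "A0A000AAA0  P00001",
    "Q00002      P00001",
]
print(secondary_to_primary(toy_lines))
```

Filtering by pattern instead of by a fixed number of header lines keeps the sketch independent of the exact header length, which is not specified here.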
- class WikidataParser(version: str | None = None, show_progress: bool = True, entity_type: str = 'metabolites', endpoint: str | None = None, test_subset: bool = False)[source]
Parser for Wikidata redirect mappings via SPARQL.
Queries the QLever Wikidata endpoint to find redirect mappings for chemicals, genes, and proteins.
Returns IdMappingSet for all mappings.
Initialize the Wikidata parser.
- Parameters:
version – Version/date string for the mappings.
show_progress – Whether to show progress.
entity_type – Type of entities to query.
endpoint – Optional custom SPARQL endpoint.
test_subset – Whether to use test queries (LIMIT 10).
- parse(input_path: Path | str | None = None) Sec2PriMappingSet[source]
Query Wikidata and return a MappingSet.
- Parameters:
input_path – Ignored for Wikidata (queries endpoint directly).
- Returns:
IdMappingSet containing Wikidata redirect mappings.
- parse_all() Sec2PriMappingSet[source]
Query all entity types from config and return combined MappingSet.
Runs all SPARQL queries defined in the config file’s ‘queries’ section (e.g., chemical_redirects, gene_redirects, protein_redirects) and combines the results into a single MappingSet.
- Returns:
IdMappingSet containing all Wikidata redirect mappings.
- parse_from_file(input_path: Path | str) Sec2PriMappingSet[source]
Parse Wikidata redirects from a pre-downloaded TSV file.
- Parameters:
input_path – Path to TSV file with SPARQL results.
- Returns:
IdMappingSet with computed cardinalities.
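Reading a pre-downloaded TSV of redirect results might look roughly like this. The file layout is an assumption: a header row followed by two columns holding the old and new entity IRIs, optionally wrapped in angle brackets. `to_qid` and `read_redirects` are hypothetical helpers, not part of the parser, and the Q-numbers are toy values.

```python
import csv
import io

WD_ENTITY = "http://www.wikidata.org/entity/"


def to_qid(term):
    """Strip optional <...> brackets and the entity namespace from an IRI."""
    iri = term.strip().strip("<>")
    if not iri.startswith(WD_ENTITY):
        raise ValueError(f"Not a Wikidata entity IRI: {term!r}")
    return iri[len(WD_ENTITY):]


def read_redirects(tsv_text):
    """Return (old_qid, new_qid) pairs from two-column TSV text with a header."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    next(reader, None)  # skip the header row (assumed to be present)
    return [(to_qid(row[0]), to_qid(row[1])) for row in reader if len(row) >= 2]


TOY_TSV = (
    "?old\t?new\n"
    "<http://www.wikidata.org/entity/Q1>\t<http://www.wikidata.org/entity/Q2>\n"
)
print(read_redirects(TOY_TSV))
```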
- parse_symbols(input_path: Path | str | None = None) LabelMappingSet[source]
Return a LabelMappingSet of previous-label to current-label mappings.
Queries the SPARQL endpoint (or reads input_path) exactly like parse(), but wraps the result in a LabelMappingSet so label-specific exports (symbol_sec2pri, pri_symbols) work.
- Parameters:
input_path – Pre-downloaded TSV file. Queries SPARQL if None.
- Returns:
LabelMappingSet with label-based mappings.
Adding a New Parser
1. Create config YAML (config/mydb.yaml) with mappingset, mapping, and download_urls sections; see existing configs for reference.

2. Create parser class (src/pysec2pri/parsers/mydb.py):

```python
from pysec2pri.parsers.base import BaseParser


class MyDBParser(BaseParser):
    datasource_name = "mydb"

    def parse(self, input_path):
        raw = self._load(input_path)
        mappings = self._build_id_mappings(raw)
        return self._create_mapping_set(mappings, mapping_type="id")
```

3. Register in constants (src/pysec2pri/constants.py):

```python
MYDB = get_datasource_config("mydb")
ALL_DATASOURCES = [..., MYDB]
```

4. Expose in API and CLI: add a parse_mydb() function in src/pysec2pri/api.py and a command in src/pysec2pri/cli.py.