Parsers

Database-specific parsers that produce SSSOM-compliant MappingSets.

All parsers inherit from BaseParser and return Sec2PriMappingSet objects.

To add a new database, provide a config YAML, a parser class, and a DataSourceConfig entry in constants.py. Where possible, also add automated download support in src/pysec2pri/download.py.

Database	Input	Methods
ChEBI	TSV directory (≥ release 245) or SDF file (< 245)	`parse()`, `parse_synonyms()`
HMDB	`hmdb_metabolites.xml` or `hmdb_proteins.xml`	`parse()`
HGNC	`hgnc_complete_set.txt`, `withdrawn.txt`	`parse()`, `parse_labels()`, `parse_all()`
NCBI Gene	`gene_history`, `gene_info`	`parse()`, `parse_labels()`, `parse_all()`
UniProt	`sec_ac.txt`, `delac_sp.txt`	`parse()`
Wikidata	SPARQL endpoint (live) or pre-fetched JSON	`parse()`, `parse_all()`, `parse_from_file()`

Module Reference

class ChEBIParser(version: str | None = None, show_progress: bool = True, subset: str = '3star')[source]

Parser for ChEBI data files.

Supports both TSV flat files (>= release 245) and (legacy) SDF files. Extracts secondary-to-primary ChEBI identifier mappings and name-to-synonym relationships.

Returns an IdMappingSet for ID mappings (cardinality computed on IDs) and can optionally include synonym mappings via LabelMappingSet.

Initialize the ChEBI parser.

Parameters:

version – Version/release identifier for the datasource.
show_progress – Whether to show progress bars during parsing.
subset – Compound subset to use: “3star” or “complete”. For TSV format, filters by stars in compounds.tsv. For SDF format, determines which file to download.

property source_url: str: Get the default download URL from config.

parse(input_path: Path | str | None = None, *, secondary_ids_path: Path | str | None = None, compounds_path: Path | str | None = None) → Sec2PriMappingSet[source]

Parse ChEBI data into an IdMappingSet.

Accepts three calling conventions:

input_path is a directory: expects secondary_ids.tsv (and optionally compounds.tsv) inside it (TSV format >= 245).
input_path is an SDF file: legacy format (< 245).
Keyword args secondary_ids_path / compounds_path: explicit TSV paths.

Parameters:

input_path – Path to an SDF file, or a directory of TSV files.
secondary_ids_path – Explicit path to secondary_ids.tsv (TSV format).
compounds_path – Explicit path to compounds.tsv for 3-star filtering.

Returns:

IdMappingSet with computed cardinalities.
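The three calling conventions amount to a small dispatch on the arguments. The following is a minimal sketch of that idea, not the package's code: resolve_chebi_inputs is a hypothetical helper, and the choice to let explicit keyword paths take precedence over input_path is an assumption.

```python
from pathlib import Path
from typing import Optional, Union

PathLike = Union[Path, str]


def resolve_chebi_inputs(
    input_path: Optional[PathLike] = None,
    *,
    secondary_ids_path: Optional[PathLike] = None,
    compounds_path: Optional[PathLike] = None,
) -> dict:
    """Hypothetical helper: work out which input format a call refers to."""
    # Explicit TSV paths: assumed here to win over input_path.
    if secondary_ids_path is not None:
        return {
            "format": "tsv",
            "secondary_ids": Path(secondary_ids_path),
            "compounds": Path(compounds_path) if compounds_path else None,
        }
    if input_path is None:
        raise ValueError("Provide input_path or secondary_ids_path")
    path = Path(input_path)
    # A directory is expected to hold the TSV flat files.
    if path.is_dir():
        compounds = path / "compounds.tsv"
        return {
            "format": "tsv",
            "secondary_ids": path / "secondary_ids.tsv",
            # compounds.tsv is optional, so only report it when present.
            "compounds": compounds if compounds.exists() else None,
        }
    # Anything else is treated as a legacy SDF file.
    return {"format": "sdf", "sdf": path}
```

Returning a plain dict keeps the sketch short; a real parser would more likely branch straight into its TSV or SDF reading code.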

parse_synonyms(input_path: Path | str | None = None, *, names_path: Path | str | None = None, compounds_path: Path | str | None = None) → Sec2PriMappingSet[source]

Parse ChEBI data into a LabelMappingSet for synonyms.

Accepts three calling conventions:

input_path is a directory: expects names.tsv (and optionally compounds.tsv) inside it (TSV format >= 245).
input_path is an SDF file: legacy format (< 245).
Keyword args names_path / compounds_path: explicit TSV paths (kept for backwards compatibility).

Parameters:

input_path – Path to an SDF file, or a directory of TSV files.
names_path – Explicit path to names.tsv (TSV format).
compounds_path – Explicit path to compounds.tsv for 3-star filtering.

Returns:

LabelMappingSet with computed cardinalities based on labels.

parse_primary_ids(input_path: Path | str | None = None, *, compounds_path: Path | str | None = None) → Sec2PriMappingSet[source]

Return a mapping set containing the full list of current ChEBI primary IDs.

Reads compounds.tsv (TSV releases >= 245) to extract every current ChEBI compound ID. The returned mapping set has an empty mappings list; its _primary_ids store is populated with every current ChEBI ID (CHEBI: prefixed) so that to_pri_ids() produces the authoritative complete list.

Parameters:

input_path – Path to a directory containing compounds.tsv, or directly to compounds.tsv itself.
compounds_path – Explicit path to compounds.tsv (overrides input_path).

Returns:

IdMappingSet with no mappings and _primary_ids populated with all current ChEBI IDs.

parse_primary_labels(input_path: Path | str | None = None, *, compounds_path: Path | str | None = None) → Sec2PriMappingSet[source]

Return a mapping set containing the full list of current ChEBI compound names.

Reads compounds.tsv to extract every current compound’s canonical name. The returned mapping set has an empty mappings list; its _primary_labels store is populated.

Parameters:

input_path – Path to a directory containing compounds.tsv, or directly to compounds.tsv itself.
compounds_path – Explicit path to compounds.tsv.

Returns:

LabelMappingSet with no mappings and _primary_labels populated.

class HMDBParser(version: str | None = None, show_progress: bool = True, config_name: str | None = None)[source]

Shared XML-parser for HMDB metabolite and protein files.

Use HMDBMetaboliteParser or HMDBProteinParser.

Initialize the parser.

Parameters:

version – Version/release identifier for the datasource.
show_progress – Whether to show progress bars during parsing.
config_name – Name of config file to load (defaults to class name).

parse_primary_ids(metabolites_path: Path | str | None = None, proteins_path: Path | str | None = None) → Sec2PriMappingSet[source]

Return a mapping set containing the full list of current HMDB primary IDs.

Reads one or both of hmdb_metabolites.xml and hmdb_proteins.xml and collects all primary accession numbers. The returned mapping set has an empty mappings list; _primary_ids is populated with every current HMDB:<acc> CURIE.

Parameters:

metabolites_path – Path to hmdb_metabolites.xml (or zip/gz).
proteins_path – Path to hmdb_proteins.xml (or zip/gz).

Returns:

IdMappingSet with _primary_ids populated. At least one of the two path arguments must be supplied.
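Collecting accessions from a large HMDB XML file is usually done by streaming rather than loading the whole tree. The sketch below illustrates that approach with the standard library only. It assumes each record is a metabolite (or protein) element with a direct accession child and a secondary_accessions child; iter_hmdb_accessions is a hypothetical name and not part of the package.

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator, List, Tuple


def _local(tag: str) -> str:
    """Strip an XML namespace such as '{http://www.hmdb.ca}' from a tag."""
    return tag.rsplit("}", 1)[-1]


def iter_hmdb_accessions(
    source: BinaryIO, record_tag: str = "metabolite"
) -> Iterator[Tuple[str, List[str]]]:
    """Yield (primary_accession, secondary_accessions) per record."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if _local(elem.tag) != record_tag:
            continue
        primary = None
        secondary: List[str] = []
        # Only direct children are inspected, so nested accessions
        # elsewhere in the record are not mistaken for the primary one.
        for child in elem:
            name = _local(child.tag)
            if name == "accession":
                primary = (child.text or "").strip()
            elif name == "secondary_accessions":
                secondary = [a.text.strip() for a in child if a.text]
        if primary:
            yield primary, secondary
        # Free the finished record to keep memory use flat.
        elem.clear()
```

For a protein file the same function would be called with record_tag="protein", again under the stated assumption about the element layout.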

class HGNCParser(version: str | None = None, show_progress: bool = True)[source]

Parser for HGNC TSV files using Polars for memory efficiency.

Extracts secondary-to-primary HGNC identifier mappings and symbol mappings from HGNC withdrawn and complete set files.

Returns:

IdMappingSet for ID-to-ID mappings (withdrawn/merged IDs).
LabelMappingSet for symbol mappings (alias/previous symbols).

Initialize the HGNC parser.

Parameters:

version – Version/release identifier for the datasource.
show_progress – Whether to show progress bars during parsing.

property withdrawn_source_url: str: Get the withdrawn file download URL from config.

property complete_set_source_url: str: Get the complete set download URL from config.

parse(input_path: Path | str | None, complete_set_path: Path | str | None = None) → Sec2PriMappingSet[source]

Parse HGNC withdrawn TSV file into an IdMappingSet.

Parameters:

input_path – Path to the withdrawn HGNC TSV file.
complete_set_path – Optional path to the HGNC complete set TSV. When supplied, all_primary_ids|labels on the returned mapping set is populated with every current HGNC ID, not just those that appear as object_id in a withdrawn-to-primary mapping.

Returns:

IdMappingSet with computed cardinalities based on IDs.
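"Computed cardinalities" can be read in the SSSOM sense, where each mapping is labelled 1:1, 1:n, n:1, or n:n depending on how many partners its subject and object have. The sketch below shows one way to derive those labels from (secondary, primary) ID pairs. It is an illustration under that reading, with compute_cardinalities as a hypothetical function; the package's own computation may differ.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def compute_cardinalities(
    pairs: Iterable[Tuple[str, str]]
) -> List[Tuple[str, str, str]]:
    """Label each (subject, object) pair with an SSSOM-style cardinality."""
    # Drop exact duplicates while keeping the original order.
    unique = list(dict.fromkeys(pairs))
    objects_per_subject = Counter(subject for subject, _ in unique)
    subjects_per_object = Counter(obj for _, obj in unique)
    labelled = []
    for subject, obj in unique:
        # Left side: "n" when several subjects share this object (a merge).
        left = "1" if subjects_per_object[obj] == 1 else "n"
        # Right side: "n" when this subject maps to several objects (a split).
        right = "1" if objects_per_subject[subject] == 1 else "n"
        labelled.append((subject, obj, f"{left}:{right}"))
    return labelled
```

Two withdrawn IDs merged into one current ID would both be labelled n:1, while a withdrawn ID split into two current IDs would yield two 1:n rows.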

parse_primary_labels(complete_set_path: Path | str | None) → Sec2PriMappingSet[source]

Return a mapping set whose only content is the full primary Symbol list.

Reads the HGNC complete set to extract every current HGNC Symbol and stores it in _primary_labels. The mappings list is intentionally left empty; this mapping set exists only to drive to_pri_labels().

Parameters:

complete_set_path – Path to the HGNC complete set TSV file.

Returns:

LabelMappingSet with no mappings and _primary_labels populated with all current HGNC labels.

parse_primary_ids(complete_set_path: Path | str | None) → Sec2PriMappingSet[source]

Return a mapping set whose only content is the full primary ID list.

Reads the HGNC complete set to extract every current HGNC ID and stores it in _primary_ids. The mappings list is intentionally left empty; this mapping set exists only to drive to_pri_ids().

Parameters:

complete_set_path – Path to the HGNC complete set TSV file.

Returns:

IdMappingSet with no mappings and _primary_ids populated with all current HGNC IDs.

parse_labels(complete_set_path: Path | str | None, statuses: list[str] | None = None) → LabelMappingSet[source]

Parse HGNC complete set for symbol (label) mappings.

Parameters:

complete_set_path – Path to the complete HGNC set TSV file.
statuses – Entry statuses to include (e.g. ["Approved"]). If None (default), all entries are included.

Returns:

LabelMappingSet with computed cardinalities based on labels.
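The effect of the statuses filter can be illustrated with a small standard-library sketch. It assumes the complete set is a tab-separated file with hgnc_id, symbol, status, alias_symbol, and prev_symbol columns, and that multi-valued cells are pipe-delimited; symbol_mappings is a hypothetical function, and the real parser uses Polars rather than csv.

```python
import csv
import io
from typing import List, Optional, Tuple


def symbol_mappings(
    tsv_text: str, statuses: Optional[List[str]] = None
) -> List[Tuple[str, str, str]]:
    """Return (old_or_alias_symbol, current_symbol, hgnc_id) triples."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    triples = []
    for row in reader:
        # statuses=None means no filtering, mirroring the documented default.
        if statuses is not None and row["status"] not in statuses:
            continue
        for column in ("alias_symbol", "prev_symbol"):
            cell = row.get(column) or ""
            # Multi-valued cells are assumed to be pipe-delimited.
            for old in filter(None, cell.split("|")):
                triples.append((old, row["symbol"], row["hgnc_id"]))
    return triples
```

Passing statuses=["Approved"] would keep only symbols attached to approved entries, while the default keeps every row.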

parse_all(withdrawn_path: Path | str | None, complete_set_path: Path | str | None) → tuple[Sec2PriMappingSet, Sec2PriMappingSet][source]

Parse both withdrawn and complete set files.

Parameters:

withdrawn_path – Path to the withdrawn HGNC TSV file.
complete_set_path – Path to the complete HGNC set TSV file.

Returns:

Tuple of (IdMappingSet, LabelMappingSet).

class NCBIParser(version: str | None = None, show_progress: bool = True, config_name: str | None = None)[source]

Parser for NCBI Gene TSV files using Polars.

Extracts secondary-to-primary NCBI Gene identifier mappings including gene symbols from gene_history and gene_info files.

Returns:

IdMappingSet for ID-to-ID mappings (discontinued Gene IDs).
LabelMappingSet for symbol mappings (gene synonyms).

Initialize the parser.

Parameters:

version – Version/release identifier for the datasource.
show_progress – Whether to show progress bars during parsing.
config_name – Name of config file to load (defaults to class name).

property history_source_url: str: Get the gene_history download URL from config.

property info_source_url: str: Get the gene_info download URL from config.

parse(input_path: Path | str | None = None, tax_id: str = '9606', gene_info_path: Path | str | None = None) → Sec2PriMappingSet[source]

Parse NCBI gene_history file into an IdMappingSet.

Parameters:

input_path – Path to gene_history file (can be .gz compressed).
tax_id – Taxonomy ID to filter by (default: “9606” for human).
gene_info_path – Optional path to the gene_info file. When supplied, _primary_ids on the returned mapping set is populated with every current NCBIGene:<id> CURIE for the given taxonomy, not just those that appear as object_id in a discontinued-to-primary mapping.

Returns:

IdMappingSet with computed cardinalities based on IDs.
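Filtering gene_history by taxonomy can be sketched with the standard library. The sketch assumes the file's tab-separated layout starts with the columns tax_id, GeneID, and Discontinued_GeneID, and that a GeneID of "-" marks a discontinued record without a replacement. Skipping those rows is a choice made for this illustration and may not match the parser's behaviour; iter_discontinued is a hypothetical name.

```python
import csv
import gzip
from pathlib import Path
from typing import Iterator, Tuple, Union


def iter_discontinued(
    path: Union[Path, str], tax_id: str = "9606"
) -> Iterator[Tuple[str, str]]:
    """Yield (discontinued CURIE, current CURIE) pairs for one taxonomy."""
    path = Path(path)
    # Transparently handle .gz compressed input.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Skip blank lines and the '#tax_id ...' header line.
            if not row or row[0].startswith("#"):
                continue
            row_tax, gene_id, old_id = row[0], row[1], row[2]
            # Keep only the requested taxonomy and rows with a replacement.
            if row_tax != tax_id or gene_id == "-":
                continue
            yield f"NCBIGene:{old_id}", f"NCBIGene:{gene_id}"
```

Because the function is a generator, rows are processed one at a time, which matters for a file that covers every taxonomy.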

parse_labels(gene_info_path: Path | str | None, tax_id: str = '9606') → Sec2PriMappingSet[source]

Parse NCBI gene_info file for symbol (label) mappings.

Parameters:

gene_info_path – Path to gene_info file.
tax_id – Taxonomy ID to filter by (default: “9606” for human).

Returns:

LabelMappingSet with computed cardinalities based on labels.

parse_primary_ids(gene_info_path: Path | str | None, tax_id: str = '9606') → Sec2PriMappingSet[source]

Return a mapping set containing the full list of current NCBI Gene primary IDs.

Reads gene_info to extract every current Gene ID for the given taxonomy. The returned mapping set has an empty mappings list; _primary_ids is populated with every current NCBIGene:<id> CURIE.

Parameters:

gene_info_path – Path to the gene_info file (can be .gz compressed).
tax_id – Taxonomy ID to filter by (default: "9606" for human).

Returns:

IdMappingSet with _primary_ids populated.

parse_primary_labels(gene_info_path: Path | str | None, tax_id: str = '9606') → Sec2PriMappingSet[source]

Return a mapping set containing the full list of current NCBI Gene labels.

Reads gene_info to extract every current gene label for the given taxonomy. The returned mapping set has an empty mappings list; _primary_labels is populated.

Parameters:

gene_info_path – Path to the gene_info file (can be .gz compressed).
tax_id – Taxonomy ID to filter by (default: "9606" for human).

Returns:

LabelMappingSet with _primary_labels populated.

parse_all(gene_history_path: Path | str | None, gene_info_path: Path | str | None, tax_id: str = '9606') → tuple[Sec2PriMappingSet, Sec2PriMappingSet][source]

Parse both gene_history and gene_info files.

Parameters:

gene_history_path – Path to gene_history file.
gene_info_path – Path to gene_info file.
tax_id – Taxonomy ID to filter by.

Returns:

Tuple of (IdMappingSet, LabelMappingSet).

class UniProtParser(version: str | None = None, show_progress: bool = True, config_name: str | None = None)[source]

Parser for UniProt files using Polars.

Extracts secondary-to-primary UniProt accession mappings from sec_ac.txt (secondary accessions) and delac_sp.txt (deleted accessions).

Returns IdMappingSet for all mappings (UniProt only has ID mappings).

Initialize the parser.

Parameters:

version – Version/release identifier for the datasource.
show_progress – Whether to show progress bars during parsing.
config_name – Name of config file to load (defaults to class name).

property sec_ac_url: str: Get the sec_ac.txt download URL from config.

property delac_url: str: Get the delac_sp.txt download URL from config.

parse(input_path: Path | str | None = None, delac_path: Path | str | None = None) → Sec2PriMappingSet[source]

Parse UniProt mapping files into an IdMappingSet.

Parameters:

input_path – Path to sec_ac.txt (secondary accessions file).
delac_path – Path to delac_sp.txt (deleted accessions file).

Returns:

IdMappingSet with computed cardinalities based on IDs.

parse_primary_ids(acindex_path: Path | str | None = None) → Sec2PriMappingSet[source]

Return a mapping set containing the full list of current UniProt primary ACs.

Parses acindex.txt (or a gzip-compressed variant) to extract every accession number that currently appears in UniProtKB/Swiss-Prot. The file lists one AC per row (after the __________ separator line); only the first whitespace-delimited token of each data line is taken.

For versioned (legacy) releases the file can be found at:

https://ftp.uniprot.org/pub/databases/uniprot/previous_releases/release-{version}/knowledgebase/docs/acindex.txt.gz

Parameters:

acindex_path – Local path to acindex.txt (plain or .gz). Auto-downloaded from the current release when None.

Returns:

IdMappingSet with no mappings and _primary_ids populated with all current UniProtKB:<AC> CURIEs.
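The acindex.txt reading rule described above (skip everything up to the underscore separator, then take the first token of each data line) can be sketched as follows. read_acindex is a hypothetical function. The accession-pattern check is an extra guard added for this illustration so that blank lines and any trailing footer text are ignored; it is not something the description above specifies.

```python
import gzip
import re
from pathlib import Path
from typing import List, Union

# UniProt's published accession pattern, used here only as a sanity filter.
_AC_PATTERN = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def read_acindex(path: Union[Path, str]) -> List[str]:
    """Return UniProtKB CURIEs for every accession listed in acindex.txt."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    curies: List[str] = []
    in_data = False
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            stripped = line.strip()
            if not in_data:
                # The data section starts after the line of underscores.
                in_data = bool(stripped) and set(stripped) <= {"_", " "}
                continue
            if not stripped:
                continue
            # Only the first whitespace-delimited token is the accession.
            token = stripped.split()[0]
            if _AC_PATTERN.fullmatch(token):
                curies.append(f"UniProtKB:{token}")
    return curies
```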

class WikidataParser(version: str | None = None, show_progress: bool = True, entity_type: str = 'metabolites', endpoint: str | None = None, test_subset: bool = False)[source]

Parser for Wikidata redirect mappings via SPARQL.

Queries the QLever Wikidata endpoint to find redirect mappings for chemicals, genes, and proteins.

Returns IdMappingSet for all mappings.

Initialize the Wikidata parser.

Parameters:

version – Version/date string for the mappings.
show_progress – Whether to show progress.
entity_type – Type of entities to query.
endpoint – Optional custom SPARQL endpoint.
test_subset – Whether to use test queries (LIMIT 10).

parse(input_path: Path | str | None = None) → Sec2PriMappingSet[source]

Query Wikidata and return a MappingSet.

Parameters:

input_path – Ignored for Wikidata (queries endpoint directly).

Returns:

IdMappingSet containing Wikidata redirect mappings.

parse_all() → Sec2PriMappingSet[source]

Query all entity types from config and return combined MappingSet.

Runs all SPARQL queries defined in the config file’s ‘queries’ section (e.g., chemical_redirects, gene_redirects, protein_redirects) and combines the results into a single MappingSet.

Returns:

IdMappingSet containing all Wikidata redirect mappings.

parse_from_file(input_path: Path | str) → Sec2PriMappingSet[source]

Parse Wikidata redirects from a pre-downloaded TSV file.

Parameters:

input_path – Path to TSV file with SPARQL results.

Returns:

IdMappingSet with computed cardinalities.

parse_labels(input_path: Path | str | None = None) → LabelMappingSet[source]

Return a LabelMappingSet of previous-label to current-label mappings.

Queries the SPARQL endpoint (or reads input_path) exactly like parse(), but wraps the result in a LabelMappingSet so label-specific exports (label_sec2pri, pri_labels) work.

Parameters:

input_path – Pre-downloaded TSV file. Queries SPARQL if None.

Returns:

LabelMappingSet with label-based mappings.

Adding a New Parser

Create a config YAML (config/mydb.yaml) with mappingset, mapping, and download_urls sections; see the existing configs for reference.

Create a parser class (src/pysec2pri/parsers/mydb.py):

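from pathlib import Path

from pysec2pri.parsers.base import BaseParser


class MyDBParser(BaseParser):
    # Identifier for this datasource.
    datasource_name = "mydb"

    def parse(self, input_path: Path | str | None = None):
        # Load the raw source file.
        raw = self._load(input_path)
        # Turn the raw records into secondary-to-primary ID mappings.
        mappings = self._build_id_mappings(raw)
        # Wrap the mappings in an ID-based mapping set.
        return self._create_mapping_set(mappings, mapping_type="id")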

Register the datasource in constants (src/pysec2pri/constants.py):

MYDB = get_datasource_config("mydb")
ALL_DATASOURCES = [..., MYDB]

Expose it in the API and CLI: add a parse_mydb() function in src/pysec2pri/api.py and a command in src/pysec2pri/cli.py.