Parsers

Database-specific parsers that produce SSSOM-compliant MappingSets.

All parsers inherit from BaseParser and return BaseMappingSet objects.

To add a new database, supply a config YAML file, a parser class, and a DataSourceConfig entry in constants.py. Where the source allows it, also add an automated download in src/pysec2pri/download.py.

Database	Input	Methods
ChEBI	TSV directory (≥ release 245) or SDF file (< 245)	`parse()`, `parse_synonyms()`
Ensembl	`stable_id_event`, `mapping_session`, `gene`, `xref`, `external_synonym`	`parse()`, `parse_labels()`, `parse_all()`
HMDB	`hmdb_metabolites.xml` or `hmdb_proteins.xml`	`parse()`
HGNC	`hgnc_complete_set.txt`, `withdrawn.txt`	`parse()`, `parse_labels()`, `parse_all()`
NCBI Gene	`gene_history`, `gene_info`	`parse()`, `parse_labels()`, `parse_all()`
UniProt	`sec_ac.txt`, `delac_sp.txt`	`parse()`
VGNC	`all_vgnc_gene_set_All.tsv`, `all_vgnc_withdrawn.tsv`	`parse()`, `parse_labels()`, `parse_all()`
Wikidata	SPARQL endpoint (live) or pre-fetched JSON	`parse()`, `parse_all()`, `parse_from_file()`

Module Reference

class ChEBIParser(version: str | None = None, show_progress: bool = True, subset: str | None = None)

Parser for ChEBI data files.

Supports both TSV flat files (>= release 245) and (legacy) SDF files. Extracts secondary-to-primary ChEBI identifier mappings and name-to-synonym relationships.

Returns an IdMappingSet for ID mappings (cardinality computed on IDs) and can optionally include synonym mappings via LabelMappingSet.

Initialize the ChEBI parser.

Parameters:

version – Version/release identifier for the datasource.
show_progress – Whether to show progress bars during parsing.
subset – Which compound subset to use, e.g. "3star" or "complete" (see chebi.yaml’s subset block). For TSV format, filters by stars in compounds.tsv; for SDF format, determines which file to download. Defaults to the config’s subset.default.

parse(input_path: Path | str | None = None, *, secondary_ids_path: Path | str | None = None, compounds_path: Path | str | None = None) → BaseMappingSet

Parse ChEBI data into an IdMappingSet.

Accepts three calling conventions:

input_path is a directory: expects secondary_ids.tsv (and optionally compounds.tsv) inside it (TSV format >= 245).
input_path is an SDF file: legacy format (< 245).
Keyword args secondary_ids_path / compounds_path: explicit TSV paths.

Parameters:

input_path – Path to an SDF file, or a directory of TSV files.
secondary_ids_path – Explicit path to secondary_ids.tsv (TSV format).
compounds_path – Explicit path to compounds.tsv for 3-star filtering.

Returns:

IdMappingSet with computed cardinalities.
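The three calling conventions above can be sketched as a small dispatch function. This is only an illustration of the decision logic, not the parser's actual code: `resolve_chebi_inputs` is a hypothetical helper, and the precedence of explicit keyword paths over `input_path` is an assumption.

```python
from pathlib import Path
from typing import Optional, Union

PathLike = Union[Path, str]


def resolve_chebi_inputs(
    input_path: Optional[PathLike] = None,
    *,
    secondary_ids_path: Optional[PathLike] = None,
    compounds_path: Optional[PathLike] = None,
) -> dict:
    """Classify the arguments into one of the three calling conventions."""
    # Explicit TSV paths: assumed here to take precedence over input_path.
    if secondary_ids_path is not None:
        return {
            "format": "tsv",
            "secondary_ids": Path(secondary_ids_path),
            "compounds": Path(compounds_path) if compounds_path else None,
        }
    if input_path is None:
        raise ValueError("Provide input_path or secondary_ids_path")
    path = Path(input_path)
    # Directory: TSV release layout (>= 245); compounds.tsv is optional.
    if path.is_dir():
        compounds = Path(compounds_path) if compounds_path else path / "compounds.tsv"
        return {
            "format": "tsv",
            "secondary_ids": path / "secondary_ids.tsv",
            "compounds": compounds if compounds.exists() else None,
        }
    # Anything else is treated as a legacy SDF file (< 245).
    return {"format": "sdf", "sdf": path}
```

Deciding the format from the shape of the path keeps one entry point usable for both release layouts.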

parse_synonyms(input_path: Path | str | None = None, *, names_path: Path | str | None = None, compounds_path: Path | str | None = None) → BaseMappingSet

Parse ChEBI data into a LabelMappingSet for synonyms.

Accepts three calling conventions:

input_path is a directory: expects names.tsv (and optionally compounds.tsv) inside it (TSV format >= 245).
input_path is an SDF file: legacy format (< 245).
Keyword args names_path / compounds_path: explicit TSV paths (kept for backwards compatibility).

Parameters:

input_path – Path to an SDF file, or a directory of TSV files.
names_path – Explicit path to names.tsv (TSV format).
compounds_path – Explicit path to compounds.tsv for 3-star filtering.

Returns:

LabelMappingSet with computed cardinalities based on labels.
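As a rough illustration of what "computed cardinalities" can mean, the sketch below labels each (subject, object) pair by counting how many partners each side has. `compute_cardinalities` is a hypothetical helper; the `subject:object` reading of the labels follows SSSOM's mapping_cardinality convention as understood here, and the library's own computation may differ. The keys can be IDs or labels.

```python
from collections import defaultdict


def compute_cardinalities(pairs):
    """Return {(subject, object): cardinality} for each mapping pair."""
    objects_per_subject = defaultdict(set)
    subjects_per_object = defaultdict(set)
    for subject, obj in pairs:
        objects_per_subject[subject].add(obj)
        subjects_per_object[obj].add(subject)

    result = {}
    for subject, obj in pairs:
        # Left side: how many subjects point at this object.
        left = "1" if len(subjects_per_object[obj]) == 1 else "n"
        # Right side: how many objects this subject points at.
        right = "1" if len(objects_per_subject[subject]) == 1 else "n"
        result[(subject, obj)] = f"{left}:{right}"
    return result
```

Under this reading, two secondary IDs merged into one primary ID both come out as `n:1`, and a secondary ID split across two primaries comes out as `1:n`.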

parse_primary_ids(input_path: Path | str | None = None, *, compounds_path: Path | str | None = None) → BaseMappingSet

Return a mapping set containing the full list of current ChEBI primary IDs.

Reads compounds.tsv (TSV releases >= 245) to extract every current ChEBI compound ID. The returned mapping set has an empty mappings list; its _primary_ids store is populated with every current ChEBI ID (CHEBI: prefixed) so that to_pri_ids() produces the complete list.

Parameters:

input_path – Path to a directory containing compounds.tsv, or directly to compounds.tsv itself.
compounds_path – Explicit path to compounds.tsv (overrides input_path).

Returns:

IdMappingSet with no mappings and _primary_ids populated with all current ChEBI IDs.
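A minimal sketch of collecting CHEBI:-prefixed primary IDs from a compounds table, with optional star filtering. The column names `chebi_accession` and `stars`, the `read_primary_ids` helper, and the sample rows are all assumptions made for illustration; check the header of the actual compounds.tsv before relying on them.

```python
import csv
import io

# Made-up sample rows; the column names are assumptions.
SAMPLE = (
    "chebi_accession\tname\tstars\n"
    "CHEBI:15377\twater\t3\n"
    "16236\tethanol\t3\n"
    "CHEBI:99999\tdraft entry\t2\n"
)


def read_primary_ids(handle, min_stars=None,
                     id_column="chebi_accession", stars_column="stars"):
    """Collect every compound ID as a CHEBI: CURIE, optionally filtered by stars."""
    ids = set()
    for row in csv.DictReader(handle, delimiter="\t"):
        if min_stars is not None and int(row[stars_column]) < min_stars:
            continue
        accession = row[id_column]
        # Normalise bare numbers to the CHEBI: prefix.
        if not accession.startswith("CHEBI:"):
            accession = f"CHEBI:{accession}"
        ids.add(accession)
    return ids


all_ids = read_primary_ids(io.StringIO(SAMPLE))
three_star_ids = read_primary_ids(io.StringIO(SAMPLE), min_stars=3)
```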

parse_primary_labels(input_path: Path | str | None = None, *, compounds_path: Path | str | None = None) → BaseMappingSet

Return a mapping set containing the full list of current ChEBI compound names.

Reads compounds.tsv to extract every current compound’s canonical name. The returned mapping set has an empty mappings list; its _primary_labels store is populated.

Parameters:

input_path – Path to a directory containing compounds.tsv, or directly to compounds.tsv itself.
compounds_path – Explicit path to compounds.tsv.

Returns:

LabelMappingSet with no mappings and _primary_labels populated.

class EnsemblParser(version: str | None = None, show_progress: bool = True, species: str | int = 9606)

Parser for Ensembl core flat-file dumps using Polars.

Each release’s stable_id_event table is cumulative, so a single parse describes the whole state of Ensembl gene IDs at that release (no chain-walking across releases).

Returns:

IdMappingSet for ID-to-ID mappings (retired Ensembl gene IDs).
LabelMappingSet for symbol mappings (external gene synonyms).

Initialize the Ensembl parser.

Parameters:

version – Ensembl release number (e.g. "115").
show_progress – Whether to show progress bars during parsing.
species – Canonical NCBI taxon ID of the species being parsed. Unlike NCBI, this never filters rows (the input files are already species-specific downloads) – it only labels the output (see _product_slug()).

parse(stable_id_event_path: Path | str | None = None, mapping_session_path: Path | str | None = None, gene_path: Path | str | None = None) → BaseMappingSet

Parse stable_id_event (+ mapping_session) into an IdMappingSet.

Parameters:

stable_id_event_path – Path to stable_id_event.txt (can be .gz compressed).
mapping_session_path – Optional path to mapping_session.txt, used to resolve each row’s mapping_date. When omitted, rows fall back to the set-level release date.
gene_path – Optional path to gene.txt. When supplied, _primary_ids is populated with every current gene ID.

Returns:

IdMappingSet with computed cardinalities based on IDs.
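To illustrate how retired gene IDs can be read from a stable_id_event dump, here is a sketch using only the standard library (the parser itself uses Polars). The column order follows the Ensembl core schema as understood here; the headerless tab-separated layout, the `\N` NULL marker, the sample rows, and the `retired_gene_mappings` helper are assumptions. Rows with no successor are skipped in this sketch, which a real implementation may handle differently.

```python
import csv
import io

# Assumed column order of the headerless stable_id_event dump.
COLUMNS = ["old_stable_id", "old_version", "new_stable_id", "new_version",
           "mapping_session_id", "type", "score"]
NULL = "\\N"  # MySQL dump marker for NULL

# Made-up sample rows for illustration.
SAMPLE = (
    "ENSG0001\t1\tENSG0002\t1\t5\tgene\t0.9\n"
    "ENSG0003\t1\tENSG0003\t2\t5\tgene\t1\n"
    "\\N\t0\tENSG0004\t1\t5\tgene\t0\n"
    "ENST0001\t1\tENST0002\t1\t5\ttranscript\t0.9\n"
)


def retired_gene_mappings(handle):
    """Yield (old_id, new_id, mapping_session_id) for genes whose ID changed."""
    seen = set()
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < len(COLUMNS):
            continue
        record = dict(zip(COLUMNS, row))
        if record["type"] != "gene":
            continue
        old, new = record["old_stable_id"], record["new_stable_id"]
        # Skip creations, deletions without successor, and version-only bumps.
        if NULL in (old, new) or old == new:
            continue
        if (old, new) not in seen:
            seen.add((old, new))
            yield old, new, record["mapping_session_id"]


mappings = list(retired_gene_mappings(io.StringIO(SAMPLE)))
```

The mapping_session_id kept on each row is what would be joined against mapping_session to resolve a per-row date.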

parse_labels(gene_path: Path | str | None = None, xref_path: Path | str | None = None, external_synonym_path: Path | str | None = None) → BaseMappingSet

Parse external gene synonyms into a LabelMappingSet.

A release’s files state a gene’s current symbol and its synonyms, but not the symbols it previously had, so a single release yields alias mappings only. Renames are recovered by consolidating across releases.

Parameters:

gene_path – Path to gene.txt.
xref_path – Path to xref.txt.
external_synonym_path – Optional path to external_synonym.txt. When omitted, the returned set carries only the full primary label set (no synonym mappings).

Returns:

LabelMappingSet with computed cardinalities based on labels.

parse_primary_ids(gene_path: Path | str | None = None) → BaseMappingSet

Return a mapping set containing the full list of current Ensembl gene IDs.

Parameters:

gene_path – Path to gene.txt (can be .gz compressed).

Returns:

IdMappingSet with _primary_ids populated.

parse_primary_labels(gene_path: Path | str | None = None, xref_path: Path | str | None = None) → BaseMappingSet

Return a mapping set containing the full list of current Ensembl gene labels.

Parameters:

gene_path – Path to gene.txt.
xref_path – Path to xref.txt.

Returns:

LabelMappingSet with _primary_labels populated.

parse_all(stable_id_event_path: Path | str | None, mapping_session_path: Path | str | None, gene_path: Path | str | None, xref_path: Path | str | None, external_synonym_path: Path | str | None) → tuple[BaseMappingSet, BaseMappingSet]

Parse the full set of Ensembl core flat files.

Parameters:

stable_id_event_path – Path to stable_id_event.txt.
mapping_session_path – Path to mapping_session.txt.
gene_path – Path to gene.txt.
xref_path – Path to xref.txt.
external_synonym_path – Path to external_synonym.txt.

Returns:

Tuple of (IdMappingSet, LabelMappingSet).

current_label_snapshot(gene_path: Path | str, xref_path: Path | str) → dict[str, str]

Return {bare stable_id -> current display label} for every current gene.

Unlike _extract_primary_labels() (keyed by label, for ambiguity detection and to_pri_labels()), this is keyed by gene – the natural shape for diffing one release’s label snapshot against the next (see pysec2pri.consolidate.build_label_history()).

Parameters:

gene_path – Path to gene.txt.
xref_path – Path to xref.txt.

Returns:

dict[stable_id, label] (bare stable IDs, no ENSEMBL: prefix).

parse_label_history(transitions: Iterable[tuple[str, str, str, str | None]]) → BaseMappingSet

Build a LabelMappingSet from precomputed previous->current label transitions.

Parameters:

transitions – Iterable of (stable_id, prev_label, curr_label, mapping_date) tuples, one per gene whose display label changed between two releases (see pysec2pri.consolidate.build_label_history()).

Returns:

LabelMappingSet with IAO:0100001 (“term replaced by”) mappings.
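The transitions consumed here can be thought of as the diff of two per-gene label snapshots, such as those returned by current_label_snapshot() for consecutive releases. The sketch below shows one way to derive them; `diff_label_snapshots` is a hypothetical helper and is not claimed to match what pysec2pri.consolidate.build_label_history() does.

```python
def diff_label_snapshots(previous, current, mapping_date=None):
    """Return (stable_id, prev_label, curr_label, mapping_date) for renamed genes.

    Both arguments are {stable_id: label} dicts. Genes present in only
    one snapshot are ignored, since they were not renamed.
    """
    transitions = []
    for stable_id in sorted(previous.keys() & current.keys()):
        prev_label = previous[stable_id]
        curr_label = current[stable_id]
        if prev_label != curr_label:
            transitions.append((stable_id, prev_label, curr_label, mapping_date))
    return transitions
```

Keying the snapshots by gene rather than by label is what makes this a simple dictionary comparison.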

class HMDBParser(version: str | None = None, show_progress: bool = True, config_name: str | None = None)

Shared XML parser for HMDB metabolite and protein files.

Use HMDBMetaboliteParser or HMDBProteinParser.

Initialize the parser.

Parameters:

version – Version/release identifier for the datasource.
show_progress – Whether to show progress bars during parsing.
config_name – Name of config file to load (defaults to class name).

parse_primary_ids(metabolites_path: Path | str | None = None, proteins_path: Path | str | None = None) → BaseMappingSet

Return a mapping set containing the full list of current HMDB primary IDs.

Reads one or both of hmdb_metabolites.xml and hmdb_proteins.xml and collects all primary accession numbers. The returned mapping set has an empty mappings list; _primary_ids is populated with every current HMDB:<acc> CURIE.

Parameters:

metabolites_path – Path to hmdb_metabolites.xml (or zip/gz).
proteins_path – Path to hmdb_proteins.xml (or zip/gz).

Returns:

IdMappingSet with _primary_ids populated. At least one of the two path arguments must be supplied.

class HGNCParser(version: str | None = None, show_progress: bool = True)

Parser for HGNC TSV files using Polars for memory efficiency.

Extracts secondary-to-primary HGNC identifier mappings and symbol mappings from HGNC withdrawn and complete set files.

Returns:

IdMappingSet for ID-to-ID mappings (withdrawn/merged IDs).
LabelMappingSet for symbol mappings (alias/previous symbols).

Initialize the HGNC parser.

Parameters:

version – Version/release identifier for the datasource.
show_progress – Whether to show progress bars during parsing.

id_column: ClassVar[str] = 'hgnc_id': Column holding the primary identifier (e.g. "hgnc_id").

withdrawn_label_column: ClassVar[str] = 'symbol': Column holding a withdrawn entry symbol.

merged_patterns: ClassVar[list[str]] = ['merged_into_report(i.e. hgnc_id/symbol/status)', 'merged_into_report(i.e hgnc_id/symbol/status)', 'merged_into_report(s) (i.e hgnc_id|symbol|status)']: Naming variants of the merged-info column across file versions.
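Because the merged-info column is named differently across file versions, a parser has to find out which variant a given file uses. A minimal sketch of that lookup follows; `find_merged_column` is a hypothetical helper, and the case-insensitive comparison is an assumption rather than documented behaviour.

```python
MERGED_PATTERNS = [
    "merged_into_report(i.e. hgnc_id/symbol/status)",
    "merged_into_report(i.e hgnc_id/symbol/status)",
    "merged_into_report(s) (i.e hgnc_id|symbol|status)",
]


def find_merged_column(header, patterns=MERGED_PATTERNS):
    """Return the header name matching the first known variant, or None."""
    # Map normalised names back to the spelling used in the file.
    normalized = {name.strip().lower(): name for name in header}
    for pattern in patterns:
        match = normalized.get(pattern.lower())
        if match is not None:
            return match
    return None
```

Returning the name as spelled in the file lets the caller select the column without renaming it first.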

parse(input_path: Path | str | None, complete_set_path: Path | str | None = None) → BaseMappingSet

Parse HGNC withdrawn TSV file into an IdMappingSet.

Parameters:

input_path – Path to the withdrawn HGNC TSV file.
complete_set_path – Optional path to the HGNC complete set TSV. When supplied, all_primary_ids|labels on the returned mapping set is populated with every current HGNC ID, not just those that appear as object_id in a withdrawn-to-primary mapping.

Returns:

IdMappingSet with computed cardinalities based on IDs.

parse_primary_labels(complete_set_path: Path | str | None) → BaseMappingSet

Return a mapping set whose only content is the full primary Symbol list.

Reads the HGNC complete set to extract every current HGNC Symbol and stores it in _primary_labels. The mappings list is intentionally left empty; this mapping set exists only to drive to_pri_labels().

Parameters:

complete_set_path – Path to the HGNC complete set TSV file.

Returns:

LabelMappingSet with no mappings and _primary_labels populated with all current HGNC labels.

parse_primary_ids(complete_set_path: Path | str | None) → BaseMappingSet

Return a mapping set whose only content is the full primary ID list.

Reads the HGNC complete set to extract every current HGNC ID and stores it in _primary_ids. The mappings list is intentionally left empty; this mapping set exists only to drive to_pri_ids().

Parameters:

complete_set_path – Path to the HGNC complete set TSV file.

Returns:

IdMappingSet with no mappings and _primary_ids populated with all current HGNC IDs.

parse_labels(complete_set_path: Path | str | None, statuses: list[str] | None = None) → LabelMappingSet

Parse HGNC complete set for symbol (label) mappings.

Parameters:

complete_set_path – Path to the complete HGNC set TSV file.
statuses – Entry statuses to include (e.g. ["Approved"]). If None (default), all entries are included.

Returns:

LabelMappingSet with computed cardinalities based on labels.
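The idea of deriving alias and previous-symbol mappings with an optional status filter can be sketched as below. It uses the standard library rather than Polars, and assumes the complete set has `symbol`, `status`, `alias_symbol`, and `prev_symbol` columns with `|`-separated values; `label_mappings` and the sample rows are hypothetical.

```python
import csv
import io

# Made-up sample rows; the column names are assumptions about the complete set.
SAMPLE = (
    "hgnc_id\tsymbol\tstatus\talias_symbol\tprev_symbol\n"
    "HGNC:5\tA1BG\tApproved\t\t\n"
    'HGNC:37133\tA1BG-AS1\tApproved\tFLJ23569\t"NCRNA00181|A1BGAS"\n'
    "HGNC:100\tOLDGENE\tEntry Withdrawn\tFOO\t\n"
)


def label_mappings(handle, statuses=None):
    """Yield (secondary_label, current_symbol, source_column) tuples."""
    for row in csv.DictReader(handle, delimiter="\t"):
        # statuses=None keeps every entry, mirroring the documented default.
        if statuses is not None and row["status"] not in statuses:
            continue
        for column in ("alias_symbol", "prev_symbol"):
            for part in (row.get(column) or "").split("|"):
                label = part.strip()
                if label:
                    yield label, row["symbol"], column


approved_only = list(label_mappings(io.StringIO(SAMPLE), statuses=["Approved"]))
everything = list(label_mappings(io.StringIO(SAMPLE)))
```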

parse_all(withdrawn_path: Path | str | None, complete_set_path: Path | str | None) → tuple[BaseMappingSet, BaseMappingSet]

Parse both withdrawn and complete set files.

Parameters:

withdrawn_path – Path to the withdrawn HGNC TSV file.
complete_set_path – Path to the complete HGNC set TSV file.

Returns:

Tuple of (IdMappingSet, LabelMappingSet).

class NCBIParser(version: str | None = None, show_progress: bool = True, config_name: str | None = None)

Parser for NCBI Gene TSV files using Polars.

Extracts secondary-to-primary NCBI Gene identifier mappings including gene symbols from gene_history and gene_info files.

Returns:

IdMappingSet for ID-to-ID mappings (discontinued Gene IDs).
LabelMappingSet for symbol mappings (gene synonyms).

Initialize the parser.

Parameters:

version – Version/release identifier for the datasource.
show_progress – Whether to show progress bars during parsing.
config_name – Name of config file to load (defaults to class name).

parse(input_path: Path | str | None = None, species: str = '9606', gene_info_path: Path | str | None = None) → BaseMappingSet

Parse NCBI gene_history file into an IdMappingSet.

Parameters:

input_path – Path to gene_history file (can be .gz compressed).
species – NCBI taxon ID to filter by, or "all" to skip filtering entirely (default: "9606" for human).
gene_info_path – Optional path to the gene_info file. When supplied, _primary_ids on the returned mapping set is populated with every current NCBIGene:<id> CURIE for the given taxonomy, not just those that appear as object_id in a discontinued-to-primary mapping.

Returns:

IdMappingSet with computed cardinalities based on IDs.
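A sketch of the species filter and of turning gene_history rows into discontinued-to-current pairs, using the standard library instead of Polars. It assumes the usual gene_history layout (tax_id, GeneID, Discontinued_GeneID, ... with `-` meaning no replacement); `discontinued_to_current` and the sample rows are hypothetical, and dropping rows without a replacement is a simplification of this sketch.

```python
import csv
import io

# Made-up sample rows in the assumed gene_history layout.
SAMPLE = (
    "#tax_id\tGeneID\tDiscontinued_GeneID\tDiscontinued_Symbol\tDiscontinue_Date\n"
    "9606\t1\t100\tOLD1\t20200101\n"
    "9606\t-\t200\tOLD2\t20200101\n"
    "10090\t3\t300\tOLD3\t20200101\n"
)


def discontinued_to_current(handle, species="9606"):
    """Yield (discontinued CURIE, current CURIE) pairs for one taxon or for all."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and the header
        tax_id, gene_id, old_id = row[0], row[1], row[2]
        if species != "all" and tax_id != species:
            continue
        if gene_id == "-":
            continue  # discontinued with no replacement
        yield f"NCBIGene:{old_id}", f"NCBIGene:{gene_id}"


human = list(discontinued_to_current(io.StringIO(SAMPLE)))
every_taxon = list(discontinued_to_current(io.StringIO(SAMPLE), species="all"))
```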

parse_labels(gene_info_path: Path | str | None, species: str = '9606') → BaseMappingSet

Parse NCBI gene_info file for symbol (label) mappings.

Parameters:

gene_info_path – Path to gene_info file.
species – NCBI taxon ID to filter by, or "all" to skip filtering entirely (default: "9606" for human).

Returns:

LabelMappingSet with computed cardinalities based on labels.

parse_primary_ids(gene_info_path: Path | str | None, species: str = '9606') → BaseMappingSet

Return a mapping set containing the full list of current NCBI Gene primary IDs.

Reads gene_info to extract every current Gene ID for the given taxonomy. The returned mapping set has an empty mappings list; _primary_ids is populated with every current NCBIGene:<id> CURIE.

Parameters:

gene_info_path – Path to the gene_info file (can be .gz compressed).
species – NCBI taxon ID to filter by, or "all" to skip filtering entirely (default: "9606" for human).

Returns:

IdMappingSet with _primary_ids populated.

parse_primary_labels(gene_info_path: Path | str | None, species: str = '9606') → BaseMappingSet

Return a mapping set containing the full list of current NCBI Gene labels.

Reads gene_info to extract every current gene label for the given taxonomy. The returned mapping set has an empty mappings list; _primary_labels is populated.

Parameters:

gene_info_path – Path to the gene_info file (can be .gz compressed).
species – NCBI taxon ID to filter by, or "all" to skip filtering entirely (default: "9606" for human).

Returns:

LabelMappingSet with _primary_labels populated.

parse_all(gene_history_path: Path | str | None, gene_info_path: Path | str | None, species: str = '9606') → tuple[BaseMappingSet, BaseMappingSet]

Parse both gene_history and gene_info files.

Parameters:

gene_history_path – Path to gene_history file.
gene_info_path – Path to gene_info file.
species – NCBI taxon ID to filter by, or "all" to process every organism in the file (see ALL_SPECIES).

Returns:

Tuple of (IdMappingSet, LabelMappingSet).

class UniProtParser(version: str | None = None, show_progress: bool = True, config_name: str | None = None)

Parser for UniProt files using Polars.

Extracts secondary-to-primary UniProt accession mappings from sec_ac.txt (secondary accessions) and delac_sp.txt (deleted accessions).

Returns IdMappingSet for all mappings (UniProt only has ID mappings).

Initialize the parser.

Parameters:

version – Version/release identifier for the datasource.
show_progress – Whether to show progress bars during parsing.
config_name – Name of config file to load (defaults to class name).

property sec_ac_url: str: Get the sec_ac.txt download URL from config.

property delac_url: str: Get the delac_sp.txt download URL from config.

parse(input_path: Path | str | None = None, delac_path: Path | str | None = None) → BaseMappingSet

Parse UniProt mapping files into an IdMappingSet.

Parameters:

input_path – Path to sec_ac.txt (secondary accessions file).
delac_path – Path to delac_sp.txt (deleted accessions file).

Returns:

IdMappingSet with computed cardinalities based on IDs.

parse_primary_ids(acindex_path: Path | str | None = None) → BaseMappingSet

Return a mapping set containing the full list of current UniProt primary ACs.

Parses acindex.txt (or a gzip-compressed variant) to extract every accession number that currently appears in UniProtKB/Swiss-Prot. The file lists one AC per row (after the __________ separator line); only the first whitespace-delimited token of each data line is taken.

For versioned (legacy) releases the file can be found at:

https://ftp.uniprot.org/pub/databases/uniprot/previous_releases/release-{version}/knowledgebase/docs/acindex.txt.gz

Parameters:

acindex_path – Local path to acindex.txt (plain or .gz). Auto-downloaded from the current release when None.

Returns:

IdMappingSet with no mappings and _primary_ids populated with all current UniProtKB:<AC> CURIEs.
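The acindex.txt reading rule described for this method (skip everything up to the underscore separator line, then take the first whitespace-delimited token of each data line) can be sketched as follows. `read_acindex` is a hypothetical helper; the accession-pattern check is an extra safeguard added in this sketch so that continuation and footer lines are ignored, and is not stated by the documentation.

```python
import re

# UniProt accession format, used here only as a guard against non-data lines.
AC_PATTERN = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def read_acindex(lines):
    """Return UniProtKB: CURIEs for every accession listed after the separator."""
    ids = []
    in_data = False
    for line in lines:
        stripped = line.strip()
        if not in_data:
            # The separator line consists only of underscores and spaces.
            if stripped and set(stripped) <= {"_", " "}:
                in_data = True
            continue
        if not stripped:
            continue
        token = stripped.split()[0]
        if AC_PATTERN.fullmatch(token):
            ids.append(f"UniProtKB:{token}")
    return ids
```

Since the function only needs an iterable of lines, the same code works on a plain file handle or on one opened in text mode with `gzip.open`.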

class VGNCParser(version: str | None = None, show_progress: bool = True)

Parser for VGNC TSV files using Polars for memory efficiency.

Extracts secondary-to-primary VGNC identifier mappings and symbol mappings from the VGNC withdrawn and gene-set files.

Produces two mapping-set kinds:

IdMappingSet for ID-to-ID mappings (withdrawn/merged IDs).
LabelMappingSet for symbol mappings (alias/previous symbols), scoped to one species.

Initialize the VGNC parser.

Parameters:

version – Version/release identifier for the datasource.
show_progress – Whether to show progress bars during parsing.

id_column: ClassVar[str] = 'vgnc_id': Column holding the primary identifier (e.g. "vgnc_id").

withdrawn_label_column: ClassVar[str] = 'withdrawn_symbol': Column holding a withdrawn entry symbol.

Parse the VGNC withdrawn TSV file into an IdMappingSet.

Always parses the full, unfiltered withdrawn file first (see module docstring); when species is given (and isn’t ALL_SPECIES), the result is then subset by resolving each mapping’s primary VGNC ID against the gene-set file’s taxon_id column – this requires complete_set_path.

Parameters:

input_path – Path to the VGNC withdrawn TSV file.
complete_set_path – Path to the VGNC gene-set file. Required when species is given (used to resolve taxon IDs). When supplied, _primary_ids on the returned mapping set is populated with every current VGNC ID for species (or across all species, when species is None/"all").
species – NCBI taxon ID to subset the output to, or ALL_SPECIES. None (default) returns the full, unfiltered set across every species.

Returns:

IdMappingSet with computed cardinalities based on IDs.

Raises:

ValueError – If species is given without complete_set_path (there is no other way to resolve taxon IDs).
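The parse-then-subset behaviour, including the ValueError case, can be sketched like this. `subset_by_species` is a hypothetical helper working on plain (secondary, primary) pairs and a {primary ID: taxon ID} lookup built from the gene-set file; treating ALL_SPECIES as the string "all" is an assumption.

```python
ALL_SPECIES = "all"  # assumed value of the library's constant


def subset_by_species(mappings, taxon_by_primary_id=None, species=None):
    """Keep only mappings whose primary ID belongs to the requested taxon."""
    # No species, or all species: return the full, unfiltered set.
    if species is None or species == ALL_SPECIES:
        return list(mappings)
    # Without the gene-set lookup there is no way to resolve taxon IDs.
    if taxon_by_primary_id is None:
        raise ValueError("species filtering requires the gene-set file")
    return [
        (secondary, primary)
        for secondary, primary in mappings
        if taxon_by_primary_id.get(primary) == species
    ]
```

Filtering after the full parse means the withdrawn file is always read the same way, whichever species is requested.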

parse_primary_ids(complete_set_path: Path | str | None, species: str | None = None) → BaseMappingSet

Return a mapping set whose only content is the full primary ID list.

Reads the VGNC gene-set file to extract every current VGNC ID, optionally subset to species.

Parameters:

complete_set_path – Path to the VGNC gene-set TSV file.
species – NCBI taxon ID to subset the result to, or ALL_SPECIES. None (default) returns the full, unfiltered set across every species.

Returns:

IdMappingSet with no mappings and _primary_ids populated.

parse_labels(complete_set_path: Path | str | None, species: str, statuses: list[str] | None = None) → LabelMappingSet

Parse the VGNC gene-set file for symbol (label) mappings, scoped to one species.

Parameters:

complete_set_path – Path to the VGNC gene-set TSV file.
species – NCBI taxon ID to filter by, or ALL_SPECIES to process every species together (see module docstring for why that changes ambiguity detection). Required at this layer – callers needing config’s species.default fallback (see config/vgnc.yaml) should resolve it themselves, as pysec2pri.api’s generate_vgnc_labels does.
statuses – Entry statuses to include (e.g. ["Approved"]). If None (default), all entries are included.

Returns:

LabelMappingSet with computed cardinalities based on labels.

parse_primary_labels(complete_set_path: Path | str | None, species: str) → BaseMappingSet

Return a mapping set whose only content is the full primary Symbol list.

Reads the VGNC gene-set file to extract every current approved symbol for species, storing it in _primary_labels. The mappings list is intentionally left empty; this mapping set exists only to drive to_pri_labels().

Parameters:

complete_set_path – Path to the VGNC gene-set TSV file.
species – NCBI taxon ID to filter by, or ALL_SPECIES.

Returns:

LabelMappingSet with no mappings and _primary_labels populated.

parse_all(withdrawn_path: Path | str | None, complete_set_path: Path | str | None, species: str) → tuple[BaseMappingSet, BaseMappingSet]

Parse both the withdrawn and gene-set files.

Parameters:

withdrawn_path – Path to the VGNC withdrawn TSV file.
complete_set_path – Path to the VGNC gene-set TSV file.
species – NCBI taxon ID to filter the label mappings by.

Returns:

Tuple of (IdMappingSet, LabelMappingSet).

class WikidataParser(version: str | None = None, show_progress: bool = True, entity_type: str | None = None, endpoint: str | None = None, test_subset: bool = False)

Parser for Wikidata redirect mappings via SPARQL.

Queries the QLever Wikidata endpoint to find redirect mappings for chemicals, genes, and proteins.

Returns IdMappingSet for all mappings.

Initialize the Wikidata parser.

Parameters:

version – Version/date string for the mappings.
show_progress – Whether to show progress.
entity_type – Entity type to query; one of entity_types(). Defaults to the first one the config declares.
endpoint – Optional custom SPARQL endpoint.
test_subset – Whether to use test queries (LIMIT 10).

classmethod entity_types() → list[str]

Return the entity types declared by wikidata.yaml’s queries block.

Each key names both a --entity-type choice and its redirect query, so the config is the only place they are listed.

parse(input_path: Path | str | None = None) → BaseMappingSet

Query Wikidata and return a MappingSet.

Parameters:

input_path – Ignored for Wikidata (queries endpoint directly).

Returns:

IdMappingSet containing Wikidata redirect mappings.

parse_all() → BaseMappingSet

Query all entity types from config and return combined MappingSet.

Runs all SPARQL queries defined in the config file’s ‘queries’ section (e.g., chemical_redirects, gene_redirects, protein_redirects) and combines the results into a single MappingSet.

Returns:

IdMappingSet containing all Wikidata redirect mappings.

parse_from_file(input_path: Path | str) → BaseMappingSet

Parse Wikidata redirects from a pre-downloaded TSV file.

Parameters:

input_path – Path to TSV file with SPARQL results.

Returns:

IdMappingSet with computed cardinalities.

parse_labels(input_path: Path | str | None = None) → LabelMappingSet

Return a LabelMappingSet of previous-label to current-label mappings.

Queries the SPARQL endpoint (or reads input_path) exactly like parse(), but wraps the result in a LabelMappingSet so label-specific exports (label_sec2pri, pri_labels) work.

Parameters:

input_path – Pre-downloaded TSV file. Queries SPARQL if None.

Returns:

LabelMappingSet with label-based mappings.

Adding a New Parser

See Adding a source.