Parsers

Database-specific parsers that produce SSSOM-compliant MappingSets.

All parsers inherit from BaseParser and return Sec2PriMappingSet objects.

To add a new database, provide a config YAML, a parser class, and a DataSourceConfig entry in constants.py. Where possible, also add automated download support in src/pysec2pri/download.py.

Database     Input                                                Methods
-----------  ---------------------------------------------------  ---------------------------------------
ChEBI        TSV directory (≥ release 245) or SDF file (< 245)    parse(), parse_synonyms()
HMDB         hmdb_metabolites.xml or hmdb_proteins.xml            parse(), parse_proteins()
HGNC         hgnc_complete_set.txt, withdrawn.txt                 parse(), parse_symbols(), parse_all()
NCBI Gene    gene_history, gene_info                              parse(), parse_symbols(), parse_all()
UniProt      sec_ac.txt, delac_sp.txt                             parse()
Wikidata     SPARQL endpoint (live) or pre-fetched JSON           parse(), parse_all(), parse_from_file()

Module Reference

class ChEBIParser(version: str | None = None, show_progress: bool = True, subset: str = '3star')

Parser for ChEBI data files.

Supports both TSV flat files (>= release 245) and (legacy) SDF files. Extracts secondary-to-primary ChEBI identifier mappings and name-to-synonym relationships.

parse() returns an IdMappingSet for ID mappings (cardinality computed on IDs). Synonym mappings are available separately as a LabelMappingSet via parse_synonyms().

Initialize the ChEBI parser.

Parameters:
  • version – Version/release identifier for the datasource.

  • show_progress – Whether to show progress bars during parsing.

  • subset – Compound subset to use, either “3star” or “complete”. For TSV format, filters by stars in compounds.tsv. For SDF format, determines which file to download.

property source_url: str

Get the default download URL from config.

parse(input_path: Path | str | None = None, *, secondary_ids_path: Path | str | None = None, compounds_path: Path | str | None = None) → Sec2PriMappingSet

Parse ChEBI data into an IdMappingSet.

Accepts three calling conventions:

  • input_path is a directory: expects secondary_ids.tsv (and optionally compounds.tsv) inside it (TSV format >= 245).

  • input_path is an SDF file: legacy format (< 245).

  • Keyword args secondary_ids_path / compounds_path: explicit TSV paths (kept for backwards compatibility).

Parameters:
  • input_path – Path to an SDF file, or a directory of TSV files.

  • secondary_ids_path – Explicit path to secondary_ids.tsv (TSV format).

  • compounds_path – Explicit path to compounds.tsv for 3-star filtering.

Returns:

IdMappingSet with computed cardinalities.
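
The three calling conventions above can be read as a small dispatch on the arguments. The following is a minimal sketch of that idea, not the library's implementation: resolve_chebi_inputs is a hypothetical helper, and the rule that any non-directory path is treated as an SDF file is an assumption made for illustration.

```python
from __future__ import annotations

from pathlib import Path


def resolve_chebi_inputs(
    input_path: Path | str | None = None,
    *,
    secondary_ids_path: Path | str | None = None,
    compounds_path: Path | str | None = None,
) -> dict[str, Path | str | None]:
    """Hypothetical helper: decide which input format a call refers to."""
    # Explicit keyword paths take priority (backwards-compatible TSV call).
    if secondary_ids_path is not None:
        return {
            "format": "tsv",
            "secondary_ids": Path(secondary_ids_path),
            "compounds": Path(compounds_path) if compounds_path else None,
        }
    if input_path is None:
        raise ValueError("Provide input_path or secondary_ids_path")

    path = Path(input_path)
    # A directory is expected to hold the TSV flat files (release >= 245).
    if path.is_dir():
        compounds = path / "compounds.tsv"
        return {
            "format": "tsv",
            "secondary_ids": path / "secondary_ids.tsv",
            # compounds.tsv is optional, so only report it when present.
            "compounds": compounds if compounds.exists() else None,
        }
    # Assumption: anything else is treated as a legacy SDF file (< 245).
    return {"format": "sdf", "sdf": path}
```

Checking the keyword arguments first keeps older call sites working even when a directory is also passed; whether the real parser uses the same precedence is not stated in this reference.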

parse_synonyms(input_path: Path | str | None = None, *, names_path: Path | str | None = None, compounds_path: Path | str | None = None) → Sec2PriMappingSet

Parse ChEBI data into a LabelMappingSet for synonyms.

Accepts three calling conventions:

  • input_path is a directory: expects names.tsv (and optionally compounds.tsv) inside it (TSV format >= 245).

  • input_path is an SDF file: legacy format (< 245).

  • Keyword args names_path / compounds_path: explicit TSV paths (kept for backwards compatibility).

Parameters:
  • input_path – Path to an SDF file, or a directory of TSV files.

  • names_path – Explicit path to names.tsv (TSV format).

  • compounds_path – Explicit path to compounds.tsv for 3-star filtering.

Returns:

LabelMappingSet with computed cardinalities based on labels.
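
This reference does not spell out how cardinalities are computed. As a rough illustration, the sketch below derives SSSOM-style cardinality labels (1:1, 1:n, n:1, n:n) from plain (subject, object) pairs, which may be IDs or labels. The function name compute_cardinalities and the pair representation are assumptions for this example only.

```python
from collections import defaultdict


def compute_cardinalities(pairs):
    """Label each (subject, object) pair as 1:1, 1:n, n:1 or n:n."""
    objects_per_subject = defaultdict(set)
    subjects_per_object = defaultdict(set)
    for subject, obj in pairs:
        objects_per_subject[subject].add(obj)
        subjects_per_object[obj].add(subject)

    result = {}
    for subject, obj in pairs:
        # Left side: how many subjects point at this object.
        left = "1" if len(subjects_per_object[obj]) == 1 else "n"
        # Right side: how many objects this subject points at.
        right = "1" if len(objects_per_subject[subject]) == 1 else "n"
        result[(subject, obj)] = f"{left}:{right}"
    return result


# Invented identifiers, used only to exercise the function.
example = [("A", "X"), ("B", "X"), ("C", "Y"), ("C", "Z"), ("D", "W")]
cardinalities = compute_cardinalities(example)
```

In this example, ("A", "X") is labelled n:1 because two subjects share the object X, ("C", "Y") is labelled 1:n because C has two objects, and ("D", "W") is 1:1.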

class HMDBParser(version: str | None = None, show_progress: bool = True, config_name: str | None = None)

Parser for HMDB XML files (metabolites and proteins).

Extracts secondary-to-primary HMDB accession mappings from hmdb_metabolites.xml and/or hmdb_proteins.xml.

Initialize the parser.

Parameters:
  • version – Version/release identifier for the datasource.

  • show_progress – Whether to show progress bars during parsing.

  • config_name – Name of config file to load (defaults to class name).

property source_url: str

Get the metabolites download URL from config.

parse(input_path: Path | str | None) → Sec2PriMappingSet

Parse HMDB metabolites XML file.

Parameters:

input_path – Path to hmdb_metabolites.xml (or .zip/.gz).

Returns:

IdMappingSet for metabolite accessions.

parse_proteins(input_path: Path | str) → Sec2PriMappingSet

Parse HMDB proteins XML file.

Primary accessions have the form HMDBP00001. Secondary accessions may be bare numbers (legacy format, e.g. 5229) or full HMDBP accessions; both are normalised to HMDBP:HMDBP<zero-padded> using the same prefix logic as the metabolites parser.

Parameters:

input_path – Path to hmdb_proteins.xml (or .zip/.gz).

Returns:

IdMappingSet for protein accessions.
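
The normalisation of secondary protein accessions described above could look roughly like the following. This is a sketch under assumptions: normalise_hmdbp is a hypothetical helper, and the padding width of 5 digits is inferred from the HMDBP00001 example rather than taken from the parser's code.

```python
import re


def normalise_hmdbp(accession: str, width: int = 5) -> str:
    """Hypothetical helper: turn '5229' or 'HMDBP05229' into a CURIE."""
    match = re.fullmatch(r"(?:HMDBP)?(\d+)", accession.strip())
    if match is None:
        raise ValueError(f"Not an HMDB protein accession: {accession!r}")
    number = int(match.group(1))
    # Assumption: the numeric part is zero-padded to `width` digits.
    return f"HMDBP:HMDBP{number:0{width}d}"


legacy = normalise_hmdbp("5229")
full = normalise_hmdbp("HMDBP00001")
```

Both the bare legacy number and the full accession end up in the same HMDBP:HMDBP<zero-padded> form, so they can be compared directly.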

class HGNCParser(version: str | None = None, show_progress: bool = True)

Parser for HGNC TSV files using Polars for memory efficiency.

Extracts secondary-to-primary HGNC identifier mappings and symbol mappings from HGNC withdrawn and complete set files.

Returns:
  • IdMappingSet for ID-to-ID mappings (withdrawn/merged IDs)

  • LabelMappingSet for symbol mappings (alias/previous symbols)

Initialize the HGNC parser.

Parameters:
  • version – Version/release identifier for the datasource.

  • show_progress – Whether to show progress bars during parsing.

property withdrawn_source_url: str

Get the withdrawn file download URL from config.

property complete_set_source_url: str

Get the complete set download URL from config.

parse(input_path: Path | str | None, complete_set_path: Path | str | None = None) → Sec2PriMappingSet

Parse HGNC withdrawn TSV file into an IdMappingSet.

Parameters:
  • input_path – Path to the withdrawn HGNC TSV file.

  • complete_set_path – Optional path to the HGNC complete set TSV. When supplied, all_primary_ids on the returned mapping set is populated with every current HGNC ID, not just those that appear as object_id in a withdrawn-to-primary mapping.

Returns:

IdMappingSet with computed cardinalities based on IDs.

parse_primary_ids(complete_set_path: Path | str | None) → Sec2PriMappingSet

Return a mapping set whose only content is the full primary ID list.

Reads the HGNC complete set to extract every current HGNC ID and stores it in _primary_ids. The mappings list is intentionally left empty; this mapping set exists only to drive to_pri_ids().

Parameters:

complete_set_path – Path to the HGNC complete set TSV file.

Returns:

IdMappingSet with no mappings and _primary_ids populated with all current HGNC IDs.

parse_symbols(complete_set_path: Path | str | None, statuses: list[str] | None = None) → Sec2PriMappingSet

Parse HGNC complete set for symbol (label) mappings.

Parameters:
  • complete_set_path – Path to the complete HGNC set TSV file.

  • statuses – Entry statuses to include (e.g. ["Approved"]). If None (default), all entries are included.

Returns:

LabelMappingSet with computed cardinalities based on labels.
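
To make the statuses filter concrete, here is a small sketch that reads a tab-separated table with status, alias_symbol and prev_symbol columns and yields (old symbol, current symbol) pairs. It uses the standard csv module instead of Polars to stay dependency-free. The function symbol_pairs, the pipe-separated multi-value handling, and the sample rows are assumptions for illustration; the rows are invented, not real HGNC records.

```python
import csv
import io

# Invented rows in the assumed column layout of the complete set file.
SAMPLE_TSV = (
    "hgnc_id\tsymbol\tstatus\talias_symbol\tprev_symbol\n"
    "HGNC:1\tGENE1\tApproved\tALIAS1|ALIAS2\tOLDNAME1\n"
    "HGNC:2\tGENE2\tEntry Withdrawn\t\tOLDNAME2\n"
)


def symbol_pairs(tsv_text, statuses=None):
    """Return (old_symbol, current_symbol) pairs, optionally filtered."""
    pairs = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        # statuses=None means no filtering, mirroring the documented default.
        if statuses is not None and row["status"] not in statuses:
            continue
        for column in ("alias_symbol", "prev_symbol"):
            # Assumption: multiple values are separated by "|".
            for old in filter(None, row[column].split("|")):
                pairs.append((old, row["symbol"]))
    return pairs


approved_only = symbol_pairs(SAMPLE_TSV, statuses=["Approved"])
everything = symbol_pairs(SAMPLE_TSV)
```

With statuses=["Approved"] only the GENE1 row contributes pairs; with the default of None the withdrawn row is included as well.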

parse_all(withdrawn_path: Path | str | None, complete_set_path: Path | str | None) → tuple[Sec2PriMappingSet, Sec2PriMappingSet]

Parse both withdrawn and complete set files.

Parameters:
  • withdrawn_path – Path to the withdrawn HGNC TSV file.

  • complete_set_path – Path to the complete HGNC set TSV file.

Returns:

Tuple of (IdMappingSet, LabelMappingSet).

class NCBIParser(version: str | None = None, show_progress: bool = True, config_name: str | None = None)

Parser for NCBI Gene TSV files using Polars.

Extracts secondary-to-primary NCBI Gene identifier mappings from gene_history and gene symbol mappings from gene_info.

Returns:
  • IdMappingSet for ID-to-ID mappings (discontinued Gene IDs)

  • LabelMappingSet for symbol mappings (gene synonyms)

Initialize the parser.

Parameters:
  • version – Version/release identifier for the datasource.

  • show_progress – Whether to show progress bars during parsing.

  • config_name – Name of config file to load (defaults to class name).

property history_source_url: str

Get the gene_history download URL from config.

property info_source_url: str

Get the gene_info download URL from config.

parse(input_path: Path | str | None = None, tax_id: str = '9606') → Sec2PriMappingSet

Parse NCBI gene_history file into an IdMappingSet.

Parameters:
  • input_path – Path to gene_history file (can be .gz compressed).

  • tax_id – Taxonomy ID to filter by (default: “9606” for human).

Returns:

IdMappingSet with computed cardinalities based on IDs.
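
As an illustration of the tax_id filter, the sketch below reads gene_history-style rows (columns #tax_id, GeneID, Discontinued_GeneID, Discontinued_Symbol, Discontinue_Date) and keeps only one taxon. It uses the standard csv module rather than Polars, the sample rows are invented, and mapping a "-" GeneID to None is an assumption about how entries without a replacement might be represented.

```python
import csv
import io

# Invented rows in the gene_history column layout.
SAMPLE_HISTORY = (
    "#tax_id\tGeneID\tDiscontinued_GeneID\tDiscontinued_Symbol\tDiscontinue_Date\n"
    "9606\t1001\t2001\tFAKE1\t20200101\n"
    "9606\t-\t2002\tFAKE2\t20200102\n"
    "10090\t1003\t2003\tFake3\t20200103\n"
)


def discontinued_to_current(text, tax_id="9606"):
    """Return (discontinued_id, current_id_or_None) pairs for one taxon."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    next(reader)  # skip the header line
    pairs = []
    for row_tax_id, gene_id, discontinued_id, *_ in reader:
        if row_tax_id != tax_id:
            continue
        # "-" marks a discontinued ID that has no replacement.
        pairs.append((discontinued_id, None if gene_id == "-" else gene_id))
    return pairs


human = discontinued_to_current(SAMPLE_HISTORY)
mouse = discontinued_to_current(SAMPLE_HISTORY, tax_id="10090")
```

Comparing tax_id as a string matches the documented parameter type (str) and avoids converting every row.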

parse_symbols(gene_info_path: Path | str | None, tax_id: str = '9606') → Sec2PriMappingSet

Parse NCBI gene_info file for symbol (label) mappings.

Parameters:
  • gene_info_path – Path to gene_info file.

  • tax_id – Taxonomy ID to filter by (default: “9606” for human).

Returns:

LabelMappingSet with computed cardinalities based on labels.

parse_all(gene_history_path: Path | str | None, gene_info_path: Path | str | None, tax_id: str = '9606') → tuple[Sec2PriMappingSet, Sec2PriMappingSet]

Parse both gene_history and gene_info files.

Parameters:
  • gene_history_path – Path to gene_history file.

  • gene_info_path – Path to gene_info file.

  • tax_id – Taxonomy ID to filter by (default: “9606” for human).

Returns:

Tuple of (IdMappingSet, LabelMappingSet).

class UniProtParser(version: str | None = None, show_progress: bool = True, config_name: str | None = None)

Parser for UniProt files using Polars.

Extracts secondary-to-primary UniProt accession mappings from sec_ac.txt (secondary accessions) and delac_sp.txt (deleted accessions).

Returns IdMappingSet for all mappings (UniProt only has ID mappings).

Initialize the parser.

Parameters:
  • version – Version/release identifier for the datasource.

  • show_progress – Whether to show progress bars during parsing.

  • config_name – Name of config file to load (defaults to class name).

property sec_ac_url: str

Get the sec_ac.txt download URL from config.

property delac_url: str

Get the delac_sp.txt download URL from config.

parse(input_path: Path | str | None = None, delac_path: Path | str | None = None) → Sec2PriMappingSet

Parse UniProt mapping files into an IdMappingSet.

Parameters:
  • input_path – Path to sec_ac.txt (secondary accessions file).

  • delac_path – Path to delac_sp.txt (deleted accessions file).

Returns:

IdMappingSet with computed cardinalities based on IDs.
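
For orientation, the sketch below shows one way to pull secondary-to-primary pairs out of sec_ac.txt-style text: it keeps only lines that consist of exactly two tokens shaped like UniProt accessions, which skips the free-text header. This is an assumption-laden illustration rather than the parser's actual logic (which uses Polars); parse_sec_ac is a hypothetical name and the sample pairs are invented.

```python
import re

# Accession pattern as published in the UniProt accession number help page.
ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)

# Invented content that imitates a header followed by two-column rows.
SAMPLE_SEC_AC = """Header text that the sketch skips
Secondary AC  Primary AC
____________  __________
Q00001        P00001
A0A000AAA1    P00002
"""


def parse_sec_ac(text):
    """Return (secondary, primary) accession pairs found in the text."""
    pairs = []
    for line in text.splitlines():
        tokens = line.split()
        # Header and separator lines fail the two-accession test.
        if len(tokens) == 2 and all(ACCESSION.fullmatch(t) for t in tokens):
            pairs.append((tokens[0], tokens[1]))
    return pairs


sec_pairs = parse_sec_ac(SAMPLE_SEC_AC)
```

Filtering by accession shape avoids hard-coding the number of header lines, at the cost of silently dropping any row that does not match the pattern.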

class WikidataParser(version: str | None = None, show_progress: bool = True, entity_type: str = 'metabolites', endpoint: str | None = None, test_subset: bool = False)

Parser for Wikidata redirect mappings via SPARQL.

Queries the QLever Wikidata endpoint to find redirect mappings for chemicals, genes, and proteins.

Returns IdMappingSet for all mappings.

Initialize the Wikidata parser.

Parameters:
  • version – Version/date string for the mappings.

  • show_progress – Whether to show progress.

  • entity_type – Type of entities to query.

  • endpoint – Optional custom SPARQL endpoint.

  • test_subset – Whether to use test queries (LIMIT 10).

parse(input_path: Path | str | None = None) → Sec2PriMappingSet

Query Wikidata and return a MappingSet.

Parameters:

input_path – Ignored for Wikidata (queries endpoint directly).

Returns:

IdMappingSet containing Wikidata redirect mappings.

parse_all() → Sec2PriMappingSet

Query all entity types from config and return combined MappingSet.

Runs all SPARQL queries defined in the config file’s ‘queries’ section (e.g., chemical_redirects, gene_redirects, protein_redirects) and combines the results into a single MappingSet.

Returns:

IdMappingSet containing all Wikidata redirect mappings.

parse_from_file(input_path: Path | str) → Sec2PriMappingSet

Parse Wikidata redirects from a pre-downloaded TSV file.

Parameters:

input_path – Path to TSV file with SPARQL results.

Returns:

IdMappingSet with computed cardinalities.

parse_symbols(input_path: Path | str | None = None) → LabelMappingSet

Return a LabelMappingSet of previous-label to current-label mappings.

Queries the SPARQL endpoint (or reads input_path) exactly like parse(), but wraps the result in a LabelMappingSet so label-specific exports (symbol_sec2pri, pri_symbols) work.

Parameters:

input_path – Pre-downloaded TSV file. Queries SPARQL if None.

Returns:

LabelMappingSet with label-based mappings.

Adding a New Parser

  1. Create config YAML (config/mydb.yaml) with mappingset, mapping, and download_urls sections; see existing configs for reference.

  2. Create parser class (src/pysec2pri/parsers/mydb.py):

    from pysec2pri.parsers.base import BaseParser
    
    class MyDBParser(BaseParser):
        datasource_name = "mydb"
    
        def parse(self, input_path):
            raw = self._load(input_path)
            mappings = self._build_id_mappings(raw)
            return self._create_mapping_set(mappings, mapping_type="id")
    
  3. Register in constants (src/pysec2pri/constants.py):

    MYDB = get_datasource_config("mydb")
    ALL_DATASOURCES = [..., MYDB]
    
  4. Expose in API and CLI: add a parse_mydb() function in src/pysec2pri/api.py and a command in src/pysec2pri/cli.py.