Update IDs

Resolve secondary identifiers to primary identifiers using a MappingSet.

Typical usage

Single string (one identifier, or several separated by commas, semicolons, pipes, or whitespace):

from pysec2pri import generate_hgnc
from pysec2pri.update_ids import update_ids

ms = generate_hgnc()
update_ids("HGNC:1234|HGNC:5678", ms)
# {'HGNC:1234': 'HGNC:9999', 'HGNC:5678': 'HGNC:5678'}

List of strings:

update_ids(["HGNC:1234", "HGNC:5678"], ms)

Pandas DataFrame, annotate one or more columns:

import pandas as pd

df = pd.DataFrame({"gene_id": ["HGNC:1234", "HGNC:5678"]})
update_ids(df, ms, at="gene_id")
# returns df with an extra column "gene_id_primary"

# Multiple columns at once (df must also contain an "alt_id" column):
update_ids(df, ms, at=["gene_id", "alt_id"])
# returns df with "gene_id_primary" and "alt_id_primary" columns added

Notes

  • Identifiers that are not found in the mapping set are returned/kept as-is.

  • Identifiers separated by common delimiters (|, ,, ;, whitespace) inside a single string are each looked up individually.

  • The mapping look-up is done once against the full set of unique IDs to avoid repeated scans of large mapping sets.
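The notes above can be illustrated with a minimal, standard-library sketch. It is not the library's implementation: `split_ids`, `resolve_unique`, and `toy_lookup` are hypothetical names, and the regular expression is only one plausible way to handle the listed delimiters.

```python
import re

# Delimiters named in the notes above: pipe, comma, semicolon, whitespace.
_DELIMS = re.compile(r"[|,;\s]+")


def split_ids(value: str) -> list[str]:
    """Split one string into individual identifiers, dropping empty pieces."""
    return [part for part in _DELIMS.split(value) if part]


def resolve_unique(values: list[str], lookup: dict[str, str]) -> dict[str, str]:
    """Resolve each distinct identifier once; unknown IDs map to themselves."""
    # dict.fromkeys de-duplicates while preserving first-seen order.
    unique = dict.fromkeys(i for v in values for i in split_ids(v))
    return {i: lookup.get(i, i) for i in unique}


# Hypothetical stand-in for a real mapping set.
toy_lookup = {"HGNC:1234": "HGNC:9999"}
result = resolve_unique(["HGNC:1234|HGNC:5678", "HGNC:1234, HGNC:0001"], toy_lookup)
```

Here `HGNC:1234` appears twice in the input but is resolved only once, and the two identifiers without a mapping come back unchanged.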

build_lookup(mapping_set: Sec2PriMappingSet) → dict[str, str]

Return a {secondary_id: primary_id} dictionary.

Useful when you want to apply the look-up yourself or cache it for repeated calls.

Parameters:

mapping_set – A Sec2PriMappingSet (e.g. the object returned by generate_hgnc()).

Returns:

Dictionary mapping every secondary ID to its current primary ID.
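Applying such a dictionary yourself could look like the sketch below. The `lookup` literal is a hypothetical stand-in for the return value of build_lookup(), and `to_primary` is an illustrative helper, not part of the library.

```python
# Hypothetical stand-in for the dictionary that build_lookup() returns.
lookup = {"HGNC:1234": "HGNC:9999", "HGNC:4321": "HGNC:9999"}


def to_primary(identifier: str, table: dict[str, str]) -> str:
    """Return the primary ID, or the input unchanged when no mapping exists."""
    return table.get(identifier, identifier)


ids = ["HGNC:1234", "HGNC:5678", "HGNC:4321"]
primaries = [to_primary(i, lookup) for i in ids]
```

Because the dictionary is built once and then reused, each further resolution is a single dictionary access. The same dictionary can be handed to update_ids() through its lookup parameter.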

build_symbol_lookup(mapping_set: Sec2PriMappingSet) → dict[str, str]

Return a {secondary_label: primary_label} dictionary.

Useful when you want to apply the look-up yourself or cache it for repeated calls.

Parameters:

mapping_set – A LabelMappingSet (e.g. the result of generate_hgnc_symbols()).

Returns:

Dictionary mapping every previous/alias symbol to its current symbol.

update_ids(ids: str, mapping_set: Sec2PriMappingSet, *, at: None = None, suffix: str = '_primary', lookup: dict[str, str] | None = None) → dict[str, str]
update_ids(ids: list[str], mapping_set: Sec2PriMappingSet, *, at: None = None, suffix: str = '_primary', lookup: dict[str, str] | None = None) → dict[str, str]
update_ids(ids: pd.DataFrame, mapping_set: Sec2PriMappingSet, *, at: str | list[str], suffix: str = '_primary', lookup: dict[str, str] | None = None) → pd.DataFrame

Resolve secondary identifiers to primary identifiers.

Parameters:
  • ids

    One of:

    • str: a single identifier, or multiple identifiers joined by |, ,, ;, or whitespace.

    • list[str]: a list of identifier strings (each may itself contain multiple IDs separated by the delimiters above).

    • pandas.DataFrame: a DataFrame; you must also supply at.

  • mapping_set – The Sec2PriMappingSet to look up against (e.g. the result of generate_hgnc()).

  • at – DataFrame mode only. Column name or list of column names that contain identifiers. For each column col a new column named col + suffix is added to the returned DataFrame.

  • suffix – Suffix appended to column names in DataFrame mode (default "_primary").

  • lookup – Pre-built {secondary_id: primary_id} dictionary. Pass the result of build_lookup() to avoid rebuilding on repeated calls.

Returns:

  • dict[str, str] – When ids is a str or list[str]: a dictionary mapping each unique input identifier to its resolved primary ID. Identifiers not found in the mapping set are returned unchanged.

  • pandas.DataFrame – When ids is a DataFrame: a copy of the DataFrame with one new <col><suffix> column per entry in at.

Examples

Setup:

ms = generate_hgnc()

Single string:

update_ids("HGNC:1234", ms)
# {'HGNC:1234': 'HGNC:9999'}

Pipe-separated string:

update_ids("HGNC:1234|HGNC:5678", ms)
# {'HGNC:1234': 'HGNC:9999', 'HGNC:5678': 'HGNC:5678'}

List:

update_ids(["HGNC:1234", "HGNC:5678", "HGNC:1234"], ms)
# {'HGNC:1234': 'HGNC:9999', 'HGNC:5678': 'HGNC:5678'}

DataFrame:

import pandas as pd

df = pd.DataFrame({"gene": ["HGNC:1234", "HGNC:5678"]})
update_ids(df, ms, at="gene")
#        gene  gene_primary
# 0  HGNC:1234  HGNC:9999
# 1  HGNC:5678  HGNC:5678

update_symbols(symbols: IdsInput, mapping_set: Sec2PriMappingSet, *, at: str | list[str] | None = None, suffix: str = '_current', lookup: dict[str, str] | None = None) → dict[str, str] | pd.DataFrame

Resolve previous/alias gene symbols to current symbols.

Behaves identically to update_ids() but resolves via the subject_label to object_label mapping rather than IDs.

Parameters:
  • symbols

    One of:

    • str: a single symbol, or multiple symbols joined by |, ,, ;, or whitespace.

    • list[str]: a list of symbol strings.

    • pandas.DataFrame: a DataFrame; you must also supply at.

  • mapping_set – A LabelMappingSet (e.g. the result of generate_hgnc_symbols()).

  • at – DataFrame mode only. Column name or list of column names that contain symbols. For each column col a new column named col + suffix is added to the returned DataFrame.

  • suffix – Suffix appended to column names in DataFrame mode (default "_current").

  • lookup – Pre-built {previous_symbol: current_symbol} dictionary. Pass the result of build_symbol_lookup() to avoid rebuilding on repeated calls.

Returns:

  • dict[str, str] – When symbols is a str or list[str]: a dictionary mapping each unique input symbol to its resolved current symbol. Symbols not found in the mapping set are returned unchanged.

  • pandas.DataFrame – When symbols is a DataFrame: a copy of the DataFrame with one new <col><suffix> column per entry in at.
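The difference between identifier-based and label-based resolution can be pictured with a small sketch. The rows below are invented for illustration, and the field names `subject_id` and `object_id` are assumptions chosen to parallel the subject_label and object_label wording above; a real mapping set may be structured differently.

```python
# Hypothetical mapping rows; all values are made up for illustration.
rows = [
    {"subject_id": "HGNC:1234", "subject_label": "OLD1",
     "object_id": "HGNC:9999", "object_label": "NEW1"},
    {"subject_id": "HGNC:4321", "subject_label": "ALIAS2",
     "object_id": "HGNC:8888", "object_label": "NEW2"},
]

# ID-based table: the kind of dictionary update_ids() resolves against.
id_lookup = {row["subject_id"]: row["object_id"] for row in rows}

# Label-based table: the kind of dictionary update_symbols() resolves against.
symbol_lookup = {row["subject_label"]: row["object_label"] for row in rows}

symbols = ["OLD1", "NEW2", "UNKNOWN"]
# Symbols without a mapping (including already-current ones) stay as they are.
current = {s: symbol_lookup.get(s, s) for s in symbols}
```

Both tables come from the same rows; only the key and value fields differ, which is why the two functions can share the same input handling.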