Update IDs

Resolve secondary identifiers to primary identifiers using a MappingSet.

Typical usage

Single string (may contain several identifiers separated by commas, semicolons, pipes, or whitespace):

from pysec2pri import generate_hgnc
from pysec2pri.update_ids import update_ids

ms = generate_hgnc()
update_ids("HGNC:1234|HGNC:5678", ms)
# {'HGNC:1234': 'HGNC:9999', 'HGNC:5678': 'HGNC:5678'}

List of strings:

update_ids(["HGNC:1234", "HGNC:5678"], ms)

Pandas DataFrame, annotate one or more columns:

import pandas as pd

df = pd.DataFrame({"gene_id": ["HGNC:1234", "HGNC:5678"], "alt_id": ["HGNC:5678", "HGNC:1234"]})
update_ids(df, ms, at="gene_id")
# returns df with an extra column "gene_id_primary"

# Multiple columns at once:
update_ids(df, ms, at=["gene_id", "alt_id"])
# returns df with "gene_id_primary" and "alt_id_primary" columns added

Notes

  • Identifiers that are not found in the mapping set are returned/kept as-is.

  • Identifiers separated by common delimiters (|, ,, ;, whitespace) inside a single string are each looked up individually.

  • The mapping look-up is done once against the full set of unique IDs to avoid repeated scans of large mapping sets.

  • Ambiguous identifiers, those that appear both as a secondary ID in the mapping set and as a current primary ID, are left blank in the resolved output. A warning is emitted listing every ambiguous token so the user can resolve them manually.
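The behaviour described in the notes above can be illustrated with a minimal, library-independent sketch. It is not the package's implementation: the helper name resolve_tokens, the delimiter pattern, and the sample identifiers are all assumptions made for illustration, and a plain dict and set stand in for the lookup and the ambiguous set.

```python
import re
import warnings

# Assumed delimiter pattern: pipe, comma, semicolon, or any whitespace.
_DELIMS = re.compile(r"[|,;\s]+")


def resolve_tokens(text, lookup, ambiguous):
    """Resolve each token of a delimited string against a plain dict (hypothetical helper)."""
    tokens = [tok for tok in _DELIMS.split(text) if tok]
    resolved = {}
    flagged = []
    for tok in tokens:
        if tok in ambiguous:
            # Ambiguous tokens are left blank rather than guessed.
            resolved[tok] = ""
            flagged.append(tok)
        else:
            # Unknown tokens are kept as-is.
            resolved[tok] = lookup.get(tok, tok)
    if flagged:
        warnings.warn("Ambiguous identifiers left blank: " + ", ".join(sorted(set(flagged))))
    return resolved


# Made-up sample data for illustration only.
lookup = {"HGNC:1234": "HGNC:9999"}
result = resolve_tokens("HGNC:1234|HGNC:5678; HGNC:42", lookup, ambiguous={"HGNC:42"})
print(result)
```

Here the first token is replaced, the second is unknown and kept unchanged, and the third is ambiguous and mapped to an empty string with a warning.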

build_alias_index(mapping_set: Sec2PriMappingSet) -> dict[str, list[str]]

Return {object_id: [subject_labels linked via non-IAO predicates]}.

Builds an index of all non-deprecation alias mappings in a LabelMappingSet. Only entries whose predicate_id is not IAO:0100001 are included; deprecation (IAO:0100001 / “term replaced by”) mappings are deliberately excluded because they express history, not active aliasing.

This index is used by resolve_ambiguous_with_hints() to confirm whether a user-supplied alias belongs to the secondary mapping’s target (confirming secondary usage) or to the entity’s own primary entry (confirming primary usage).

Parameters:

mapping_set – A LabelMappingSet (e.g. the result of generate_hgnc_symbols()).

Returns:

Dict mapping each object_id to the list of subject_label values that point to it via a non-IAO predicate.
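As a rough sketch of the filtering described above, the index can be pictured as a group-by over mapping rows that skips the deprecation predicate. The row layout (plain dicts), the helper name alias_index_from_rows, the synonym predicate, and the sample values are assumptions for illustration; the real function operates on a LabelMappingSet.

```python
from collections import defaultdict

DEPRECATION_PREDICATE = "IAO:0100001"  # "term replaced by"


def alias_index_from_rows(rows):
    """Group subject labels by object_id, skipping deprecation mappings (hypothetical helper)."""
    index = defaultdict(list)
    for row in rows:
        if row["predicate_id"] == DEPRECATION_PREDICATE:
            continue  # history, not active aliasing
        index[row["object_id"]].append(row["subject_label"])
    return dict(index)


# Made-up rows; the non-IAO predicate shown here is an assumed example.
rows = [
    {"subject_label": "OLD1", "predicate_id": "IAO:0100001", "object_id": "HGNC:1"},
    {"subject_label": "ALIAS1", "predicate_id": "oboInOwl:hasExactSynonym", "object_id": "HGNC:1"},
    {"subject_label": "ALIAS2", "predicate_id": "oboInOwl:hasExactSynonym", "object_id": "HGNC:2"},
]
index = alias_index_from_rows(rows)
print(index)
```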

build_ambiguous_set(mapping_set: Sec2PriMappingSet) -> set[str]

Return the set of ambiguous subject IDs in mapping_set.

An identifier is ambiguous when it appears both as a subject_id (i.e. a secondary/previous term) and as a current primary identifier, either in the explicitly stored _primary_ids set or among the object_id values of the mappings.

When such overlap exists, a naïve replacement could silently corrupt references that already use the current entity, so the resolver intentionally leaves those cells blank.

Parameters:

mapping_set – Any Sec2PriMappingSet.

Returns:

Set of ID strings that are both secondary and primary. Empty set when no ambiguity is detected.
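The definition above amounts to a set intersection. The following sketch assumes the mappings are available as (subject_id, object_id) pairs and that the stored primary IDs are a plain set; the helper name ambiguous_ids and the sample identifiers are hypothetical.

```python
def ambiguous_ids(pairs, primary_ids=frozenset()):
    """Return IDs that occur as a subject and also as a primary (hypothetical helper)."""
    subjects = {subject for subject, _ in pairs}
    # Primary IDs come from the stored set and from the mapping targets.
    primaries = set(primary_ids) | {obj for _, obj in pairs}
    return subjects & primaries


# Made-up pairs for illustration only.
pairs = [("HGNC:10", "HGNC:20"), ("HGNC:20", "HGNC:30"), ("HGNC:40", "HGNC:50")]
ambiguous = ambiguous_ids(pairs, primary_ids={"HGNC:40"})
print(sorted(ambiguous))
```

In this sample, HGNC:20 is ambiguous because it is both a subject and a mapping target, and HGNC:40 because it is a subject that also sits in the stored primary set.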

build_ambiguous_symbols_set(mapping_set: Sec2PriMappingSet) -> set[str]

Return the set of ambiguous subject labels in mapping_set.

Analogous to build_ambiguous_set() but operates on subject_label / object_label (symbol) mappings.

Parameters:

mapping_set – Any LabelMappingSet.

Returns:

Set of label strings that are both secondary and primary. Empty set when no ambiguity is detected.

build_lookup(mapping_set: Sec2PriMappingSet) -> dict[str, str]

Return a {secondary_id: primary_id} dictionary.

Useful when you want to apply the look-up yourself or cache it for repeated calls.

Parameters:

mapping_set – A Sec2PriMappingSet (e.g. the object returned by generate_hgnc()).

Returns:

Dictionary mapping every secondary ID to its current primary ID.

build_primary_token_to_id(mapping_set: Sec2PriMappingSet) -> dict[str, str]

Return {primary_label: primary_id} from a label mapping set.

Collects every (object_label, object_id) pair seen in the mappings and, where available, the _primary_symbols store. Useful for translating a primary symbol string into its CURIE so that build_alias_index() (keyed by object_id) can be looked up.

Parameters:

mapping_set – A LabelMappingSet.

Returns:

Dict {primary_symbol: primary_id}.

build_symbol_lookup(mapping_set: Sec2PriMappingSet) -> dict[str, str]

Return a {secondary_label: primary_label} dictionary.

Useful when you want to apply the look-up yourself or cache it for repeated calls.

Parameters:

mapping_set – A LabelMappingSet (e.g. the result of generate_hgnc_symbols()).

Returns:

Dictionary mapping every previous/alias symbol to its current symbol.

resolve_ambiguous_with_hints(ambiguous_token: str, user_aliases: list[str], lkp: dict[str, str], alias_index: dict[str, list[str]], token_to_id: dict[str, str] | None = None) -> tuple[str, str | None]

Attempt to resolve an ambiguous symbol or ID using user-provided alias hints.

An ambiguous token appears both as a current primary entry and as a secondary (subject) in a mapping that points to a different primary.

Two resolution cases are checked:

  1. Secondary usage: at least one of the user-supplied aliases matches the target token itself (its primary label or primary ID), or appears among the non-IAO aliases of the mapping’s target (lkp[ambiguous_token]). This confirms the token is being used as a secondary alias of the target; the function returns (target_token, target_id).

  2. Primary usage: at least one of the user-supplied aliases appears among the non-IAO aliases of the token’s own primary entry. This confirms the token is being used as a standalone primary; the function returns (ambiguous_token, own_id).

If neither case applies, the ambiguity cannot be resolved and ("", None) is returned.

Parameters:
  • ambiguous_token – The symbol or ID that is both primary and secondary.

  • user_aliases – Alias strings provided by the user (e.g. from a known_alias column) to help determine which entity is actually meant.

  • lkp – {secondary_token: resolved_token} lookup (from build_symbol_lookup() or build_lookup()).

  • alias_index – {primary_id: [non-IAO alias strings]} built via build_alias_index().

  • token_to_id – {primary_token: primary_id}. When None, the token is treated as its own ID, which is appropriate for ID mapping sets where the token already is a CURIE (e.g. "HGNC:53564").

Returns:

  • Secondary case -> (target_token, target_id)

  • Primary case -> (ambiguous_token, own_id)

  • Unresolvable -> ("", None)

Return type:

A (resolved_token, resolved_id) tuple
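The two cases can be sketched in plain Python as follows. This is an illustrative reading of the description above, not the library's code: the function name resolve_with_hints, the fallback of treating a token as its own ID when it is missing from token_to_id, and all sample symbols and IDs are assumptions.

```python
def resolve_with_hints(token, user_aliases, lkp, alias_index, token_to_id=None):
    """Sketch of hint-based resolution; returns (resolved_token, resolved_id)."""

    def to_id(tok):
        # Without a token_to_id map, the token is treated as its own ID.
        if token_to_id is None:
            return tok
        return token_to_id.get(tok, tok)  # assumed fallback for unknown tokens

    hints = set(user_aliases)

    # Case 1: secondary usage, the hints point at the mapping's target.
    target = lkp.get(token)
    if target is not None:
        target_id = to_id(target)
        target_names = {target, target_id} | set(alias_index.get(target_id, []))
        if hints & target_names:
            return target, target_id

    # Case 2: primary usage, the hints point at the token's own entry.
    own_id = to_id(token)
    if hints & set(alias_index.get(own_id, [])):
        return token, own_id

    # Neither case applies: the ambiguity stays unresolved.
    return "", None


# Made-up data: "ABC" is a current symbol and also a previous symbol of "XYZ".
lkp = {"ABC": "XYZ"}
alias_index = {"HGNC:1": ["ABC-ALT"], "HGNC:2": ["XYZ-ALT"]}
token_to_id = {"ABC": "HGNC:1", "XYZ": "HGNC:2"}

secondary = resolve_with_hints("ABC", ["XYZ-ALT"], lkp, alias_index, token_to_id)
primary = resolve_with_hints("ABC", ["ABC-ALT"], lkp, alias_index, token_to_id)
unresolved = resolve_with_hints("ABC", ["unrelated"], lkp, alias_index, token_to_id)
print(secondary, primary, unresolved)
```

The secondary case is checked first in this sketch, so a hint that matches both entries resolves to the target; whether the library applies the same precedence is not stated in the description.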

update_ids(ids: str, mapping_set: Sec2PriMappingSet, *, at: None = None, suffix: str = '_primary', lookup: dict[str, str] | None = None, ambiguous: set[str] | None = None) -> dict[str, str]
update_ids(ids: list[str], mapping_set: Sec2PriMappingSet, *, at: None = None, suffix: str = '_primary', lookup: dict[str, str] | None = None, ambiguous: set[str] | None = None) -> dict[str, str]
update_ids(ids: pd.DataFrame, mapping_set: Sec2PriMappingSet, *, at: str | list[str], suffix: str = '_primary', lookup: dict[str, str] | None = None, ambiguous: set[str] | None = None, synonyms: str | list[str] | None = None, label_mapping_set: Sec2PriMappingSet | None = None) -> pd.DataFrame

Resolve secondary identifiers to primary identifiers.

Parameters:
  • ids

    One of:

    • str: a single identifier, or multiple identifiers joined by |, ,, ;, or whitespace.

    • list[str]: a list of identifier strings (each may itself contain multiple IDs separated by the delimiters above).

    • pandas.DataFrame: a DataFrame; you must also supply at.

  • mapping_set – The Sec2PriMappingSet to look up against (e.g. the result of generate_hgnc()).

  • at – DataFrame mode only. Column name or list of column names that contain identifiers. For each column col, a new column named col + suffix is added to the returned DataFrame.

  • suffix – Suffix appended to column names in DataFrame mode (default "_primary").

  • lookup – Pre-built {secondary_id: primary_id} dictionary. Pass the result of build_lookup() to avoid rebuilding on repeated calls.

  • ambiguous – Pre-built set of ambiguous IDs (see build_ambiguous_set()). When None, it is computed automatically from mapping_set. Pass an explicit set (including an empty one) to skip the computation.

  • synonyms – DataFrame mode only. Name of a column in the DataFrame that contains user-supplied alias strings (delimited by |, ,, ;, or whitespace) to help resolve ambiguous identifiers. When provided, resolve_ambiguous_with_hints() is called for every ambiguous cell using that row’s alias list.

  • label_mapping_set – A LabelMappingSet used to build the alias index when synonyms is provided. When None and synonyms is set, hint-based resolution is skipped (ambiguous IDs remain blank) and a warning is emitted.

Returns:

  • dict[str, str] – When ids is a str or list[str]: a dictionary mapping each unique input identifier to its resolved primary ID. Identifiers not found in the mapping set are returned unchanged. Ambiguous identifiers are mapped to an empty string and a warning is emitted.

  • pandas.DataFrame – When ids is a DataFrame: a copy of the DataFrame with one new <col><suffix> column per entry in at. Ambiguous cells are set to ""; a warning is emitted after all columns are processed.

update_symbols(symbols: str, mapping_set: Sec2PriMappingSet, *, at: None = None, suffix: str = '_current', lookup: dict[str, str] | None = None, ambiguous: set[str] | None = None, synonyms: str | list[str] | None = None) -> dict[str, str]
update_symbols(symbols: list[str], mapping_set: Sec2PriMappingSet, *, at: None = None, suffix: str = '_current', lookup: dict[str, str] | None = None, ambiguous: set[str] | None = None, synonyms: str | list[str] | None = None) -> dict[str, str]
update_symbols(symbols: pd.DataFrame, mapping_set: Sec2PriMappingSet, *, at: str | list[str], suffix: str = '_current', lookup: dict[str, str] | None = None, ambiguous: set[str] | None = None, synonyms: str | list[str] | None = None) -> pd.DataFrame

Resolve previous/alias gene symbols to current symbols.

Same as update_ids() but resolves via the subject_label to object_label mapping rather than IDs.

Parameters:
  • symbols

    One of:

    • str: a single symbol, or multiple symbols joined by |, ,, ;, or whitespace.

    • list[str]: a list of symbol strings.

    • pandas.DataFrame: a DataFrame; you must also supply at.

  • mapping_set – A LabelMappingSet (e.g. the result of generate_hgnc_symbols()).

  • at – DataFrame mode only. Column name or list of column names that contain symbols. For each column col, a new column named col + suffix is added to the returned DataFrame.

  • suffix – Suffix appended to column names in DataFrame mode (default "_current").

  • lookup – Pre-built {previous_symbol: current_symbol} dictionary. Pass the result of build_symbol_lookup() to avoid rebuilding on repeated calls.

  • ambiguous – Pre-built set of ambiguous labels (see build_ambiguous_symbols_set()). When None, it is computed automatically from mapping_set.

  • synonyms – DataFrame mode only. Name of a column in the DataFrame that contains user-supplied alias strings (delimited by |, ,, ;, or whitespace) to help resolve ambiguous symbols. When provided, resolve_ambiguous_with_hints() is called for every ambiguous cell using that row’s alias list. The alias index is built from mapping_set itself (non-IAO entries).

Returns:

  • dict[str, str] – When symbols is a str or list[str]: a dictionary mapping each unique input symbol to its resolved current symbol. Symbols not found in the mapping set are returned unchanged. Ambiguous symbols are mapped to an empty string and a warning is emitted.

  • pandas.DataFrame – When symbols is a DataFrame: a copy of the DataFrame with one new <col><suffix> column per entry in at. Ambiguous cells are set to ""; a warning is emitted after all columns are processed.