Update IDs

Resolve secondary identifiers to primary identifiers using a MappingSet.

Typical usage

Single string (one identifier, or several separated by commas, semicolons, pipes, or whitespace):

from pysec2pri import generate_hgnc
from pysec2pri.update_ids import update_ids

ms = generate_hgnc()
update_ids("HGNC:1234|HGNC:5678", ms)
# {'HGNC:1234': 'HGNC:9999', 'HGNC:5678': 'HGNC:5678'}

List of strings:

update_ids(["HGNC:1234", "HGNC:5678"], ms)

Pandas DataFrame, annotate one or more columns:

import pandas as pd

df = pd.DataFrame({"gene_id": ["HGNC:1234", "HGNC:5678"]})
update_ids(df, ms, at="gene_id")
# returns df with an extra column "gene_id_primary"

# Multiple columns at once (every column named in `at` must exist in df):
df["alt_id"] = ["HGNC:5678", "HGNC:1234"]
update_ids(df, ms, at=["gene_id", "alt_id"])
# returns df with "gene_id_primary" and "alt_id_primary" columns added

Notes

Identifiers that are not found in the mapping set are returned/kept as-is.
Identifiers separated by common delimiters (|, ,, ;, whitespace) inside a single string are each looked up individually.
The mapping look-up is done once against the full set of unique IDs to avoid repeated scans of large mapping sets.
Ambiguous identifiers (those that appear both as a secondary ID in the mapping set and as a current primary ID) are left blank in the resolved output. A warning listing every ambiguous token is emitted so that the user can resolve them manually.
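The delimiter splitting and the single look-up over unique tokens described in these notes can be sketched in plain Python. This is an illustration only, not pysec2pri's implementation: the helper names split_ids and resolve_once, the regular expression, and the toy lookup table are all assumptions.

```python
import re
import warnings

# Assumed delimiter pattern: pipe, comma, semicolon, or whitespace.
_DELIMS = re.compile(r"[|,;\s]+")


def split_ids(value: str) -> list[str]:
    """Split one string into individual identifier tokens (hypothetical helper)."""
    return [tok for tok in _DELIMS.split(value) if tok]


def resolve_once(values, lookup, ambiguous=frozenset()):
    """Resolve each unique token exactly once (hypothetical helper)."""
    unique = {tok for value in values for tok in split_ids(value)}
    blanked = sorted(unique & set(ambiguous))
    if blanked:
        warnings.warn(f"Ambiguous identifiers left blank: {blanked}")
    # Unknown tokens fall through unchanged; ambiguous tokens become "".
    return {
        tok: "" if tok in ambiguous else lookup.get(tok, tok)
        for tok in unique
    }


# Toy lookup standing in for a real mapping set.
toy_lookup = {"HGNC:1234": "HGNC:9999"}
resolved = resolve_once(["HGNC:1234|HGNC:5678", "HGNC:5678; HGNC:1234"], toy_lookup)
```

Collecting the tokens into a set before the look-up means each distinct identifier is resolved once, however often it repeats in the input.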

build_alias_index(mapping_set: Sec2PriMappingSet) → dict[str, list[str]]

Return {object_id: [subject_labels linked via non-IAO predicates]}.

Builds an index of all non-deprecation alias mappings in a LabelMappingSet. Only entries whose predicate_id is not IAO:0100001 are included; deprecation (IAO:0100001 / “term replaced by”) mappings are deliberately excluded because they express history, not active aliasing.

This index is used by resolve_ambiguous_with_hints() to confirm whether a user-supplied alias belongs to the secondary mapping’s target (confirming secondary usage) or to the entity’s own primary entry (confirming primary usage).

Parameters: mapping_set – A LabelMappingSet (e.g. the result of generate_hgnc_labels()).
Returns: Dict mapping each object_id to the list of subject_label values that point to it via a non-IAO predicate.
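The filtering rule can be sketched over plain dictionaries. This is a minimal sketch under stated assumptions, not the library's code: the function name alias_index_sketch, the record layout, the EX: identifiers, and the synonym predicate used for the alias rows are all hypothetical. Only the IAO:0100001 predicate comes from the description above.

```python
from collections import defaultdict

DEPRECATION_PREDICATE = "IAO:0100001"  # "term replaced by"


def alias_index_sketch(records):
    """Group subject labels by object_id, skipping deprecation mappings."""
    index = defaultdict(list)
    for rec in records:
        if rec["predicate_id"] == DEPRECATION_PREDICATE:
            continue  # history, not active aliasing
        index[rec["object_id"]].append(rec["subject_label"])
    return dict(index)


# Toy records; the alias predicate is an assumed placeholder.
records = [
    {"subject_label": "ALIAS1", "object_id": "EX:1", "predicate_id": "oboInOwl:hasExactSynonym"},
    {"subject_label": "OLD1", "object_id": "EX:1", "predicate_id": "IAO:0100001"},
    {"subject_label": "ALIAS2", "object_id": "EX:2", "predicate_id": "oboInOwl:hasExactSynonym"},
]
alias_index = alias_index_sketch(records)
```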

build_ambiguous_labels_set(mapping_set: Sec2PriMappingSet) → set[str]

Return the set of ambiguous subject labels in mapping_set.

Analogous to build_ambiguous_set() but operates on subject_label / object_label (label) mappings.

Parameters: mapping_set – Any LabelMappingSet.
Returns: Set of label strings that are both secondary and primary. Empty set when no ambiguity is detected.

build_ambiguous_set(mapping_set: Sec2PriMappingSet) → set[str]

Return the set of ambiguous subject IDs in mapping_set.

An identifier is ambiguous when it appears both as a subject_id (i.e. a secondary/previous term) and as a current primary identifier, either in the explicitly stored _primary_ids set or among the object_id values of the mappings.

When such overlap exists a naïve replacement could silently corrupt references that already use the current entity, so the resolver intentionally leaves those cells blank.

Parameters: mapping_set – Any Sec2PriMappingSet.
Returns: Set of ID strings that are both secondary and primary. Empty set when no ambiguity is detected.
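The overlap test amounts to a set intersection. The sketch below assumes mappings are available as (subject_id, object_id) pairs; the function name ambiguous_ids_sketch and the EX: identifiers are hypothetical, and the real function may treat edge cases differently.

```python
def ambiguous_ids_sketch(pairs, primary_ids=frozenset()):
    """Return IDs that occur both as a subject and as a primary."""
    subjects = {subject for subject, _ in pairs}
    # Primaries are the explicitly stored ones plus every mapping target.
    primaries = set(primary_ids) | {obj for _, obj in pairs}
    return subjects & primaries


# EX:2 is the target of one mapping and the subject of another.
pairs = [("EX:1", "EX:2"), ("EX:2", "EX:3")]
ambiguous = ambiguous_ids_sketch(pairs)
```

In this toy data, rewriting every EX:2 to EX:3 would also alter references that correctly point at the current EX:2 entity, which is why such cells are left blank.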

build_label_lookup(mapping_set: Sec2PriMappingSet) → dict[str, str]

Return a {secondary_label: primary_label} dictionary.

Useful when you want to apply the look-up yourself or cache it for repeated calls.

Parameters: mapping_set – A LabelMappingSet (e.g. the result of generate_hgnc_labels()).
Returns: Dictionary mapping every previous/alias label to its current label.

build_lookup(mapping_set: Sec2PriMappingSet) → dict[str, str]

Return a {secondary_id: primary_id} dictionary.

Useful when you want to apply the look-up yourself or cache it for repeated calls.

Parameters: mapping_set – A Sec2PriMappingSet (e.g. the object returned by generate_hgnc()).
Returns: Dictionary mapping every secondary ID to its current primary ID.
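Applying such a dictionary yourself could look like the sketch below. The toy dictionary stands in for the value returned by build_lookup(), and apply_lookup is a hypothetical helper, not part of pysec2pri.

```python
# Toy stand-in for the {secondary_id: primary_id} dictionary.
lookup = {"HGNC:1234": "HGNC:9999"}


def apply_lookup(ids, lookup):
    """Replace each ID found in the lookup; keep unknown IDs as-is."""
    return [lookup.get(identifier, identifier) for identifier in ids]


updated = apply_lookup(["HGNC:1234", "HGNC:5678", "HGNC:1234"], lookup)
```

Unlike update_ids(), this bare look-up does no ambiguity check, so it would silently rewrite identifiers that are also current primaries.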

build_primary_token_to_id(mapping_set: Sec2PriMappingSet) → dict[str, str]

Return {primary_label: primary_id} from a label mapping set.

Collects every (object_label, object_id) pair seen in the mappings and, where available, the _primary_labels store. Useful for translating a primary label string into its CURIE so that build_alias_index() (keyed by object_id) can be looked up.

Parameters: mapping_set – A LabelMappingSet.
Returns: Dict {primary_label: primary_id}.

resolve_ambiguous_with_hints(ambiguous_token: str, user_aliases: list[str], lkp: dict[str, str], alias_index: dict[str, list[str]], token_to_id: dict[str, str] | None = None) → tuple[str, str | None]

Attempt to resolve an ambiguous label or ID using user-provided alias hints.

An ambiguous token appears both as a current primary entry and as a secondary (subject) in a mapping that points to a different primary.

Two resolution cases are checked:

Secondary usage: at least one of the user-supplied aliases matches the target token itself (its primary label or primary ID), or appears among the non-IAO aliases of the mapping’s target (lkp[ambiguous_token]). This confirms that the token is being used as a secondary alias of the target, so (target_token, target_id) is returned.
Primary usage: at least one of the user-supplied aliases appears among the non-IAO aliases of the token’s own primary entry. This confirms that the token is being used as a standalone primary, so (ambiguous_token, own_id) is returned.

If neither case applies the ambiguity cannot be resolved and ("", None) is returned.

Parameters:

ambiguous_token – The label or ID that is both primary and secondary.
user_aliases – Alias strings provided by the user (e.g. from a known_alias column) to help determine which entity is actually meant.
lkp – {secondary_token: resolved_token} lookup (from build_label_lookup() or build_lookup()).
alias_index – {primary_id: [non-IAO alias strings]} built via build_alias_index().
token_to_id – {primary_token: primary_id}. When None the token is treated as its own ID, which is appropriate for ID mapping sets where the token already is a CURIE (e.g. "HGNC:53564").

Returns:

Secondary case -> (target_token, target_id)
Primary case -> (ambiguous_token, own_id)
Unresolvable -> ("", None)

Return type:

A (resolved_token, resolved_id) tuple
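The two cases can be sketched as follows. This is an illustrative reimplementation based only on the description above, not the library's source: the name resolve_with_hints_sketch, the order in which the cases are tried (secondary first), the handling of missing entries, and the toy EX: data are all assumptions.

```python
def resolve_with_hints_sketch(ambiguous_token, user_aliases, lkp, alias_index, token_to_id=None):
    def to_id(token):
        # Without a token_to_id table, the token is treated as its own ID.
        return token if token_to_id is None else token_to_id.get(token)

    hints = set(user_aliases)
    target_token = lkp.get(ambiguous_token)
    target_id = to_id(target_token) if target_token is not None else None
    own_id = to_id(ambiguous_token)

    # Secondary usage: a hint names the target or one of the target's aliases.
    if target_token is not None:
        target_names = {target_token, target_id, *alias_index.get(target_id, [])}
        if hints & target_names:
            return target_token, target_id

    # Primary usage: a hint is one of the token's own aliases.
    if hints & set(alias_index.get(own_id, [])):
        return ambiguous_token, own_id

    return "", None


# Toy data: "ABC" is the current label of EX:1 and a previous label of "XYZ" (EX:2).
lkp = {"ABC": "XYZ"}
alias_index = {"EX:1": ["ABC-alpha"], "EX:2": ["XYZ-beta"]}
token_to_id = {"ABC": "EX:1", "XYZ": "EX:2"}

as_secondary = resolve_with_hints_sketch("ABC", ["XYZ-beta"], lkp, alias_index, token_to_id)
as_primary = resolve_with_hints_sketch("ABC", ["ABC-alpha"], lkp, alias_index, token_to_id)
unresolved = resolve_with_hints_sketch("ABC", ["unrelated"], lkp, alias_index, token_to_id)
```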

update_ids(ids: str, mapping_set: Sec2PriMappingSet, *, at: None = None, suffix: str = '_primary', lookup: dict[str, str] | None = None, ambiguous: set[str] | None = None) → dict[str, str]

update_ids(ids: list[str], mapping_set: Sec2PriMappingSet, *, at: None = None, suffix: str = '_primary', lookup: dict[str, str] | None = None, ambiguous: set[str] | None = None) → dict[str, str]

update_ids(ids: pd.DataFrame, mapping_set: Sec2PriMappingSet, *, at: str | list[str], suffix: str = '_primary', lookup: dict[str, str] | None = None, ambiguous: set[str] | None = None, synonyms: str | list[str] | None = None, label_mapping_set: Sec2PriMappingSet | None = None) → pd.DataFrame

Resolve secondary identifiers to primary identifiers.

Parameters:

ids –
One of:
- str: a single identifier, or multiple identifiers joined by |, ,, ;, or whitespace.
- list[str]: a list of identifier strings (each may itself contain multiple IDs separated by the delimiters above).
- pandas.DataFrame: a DataFrame; you must also supply at.
mapping_set – The Sec2PriMappingSet to look up against (e.g. the result of generate_hgnc()).
at – DataFrame mode only. Column name or list of column names that contain identifiers. For each column col a new column named col + suffix is added to the returned DataFrame.
suffix – Suffix appended to column names in DataFrame mode (default "_primary").
lookup – Pre-built {secondary_id: primary_id} dictionary. Pass the result of build_lookup() to avoid rebuilding on repeated calls.
ambiguous – Pre-built set of ambiguous IDs (see build_ambiguous_set()). When None, it is computed automatically from mapping_set. Pass an explicit set (including an empty one) to skip the computation.
synonyms – DataFrame mode only. Name of a column in the DataFrame that contains user-supplied alias strings (delimited by |, ,, ;, or whitespace) to help resolve ambiguous identifiers. When provided, resolve_ambiguous_with_hints() is called for every ambiguous cell using that row’s alias list.
label_mapping_set – A LabelMappingSet used to build the alias index when synonyms is provided. When None and synonyms is set, hint-based resolution is skipped (ambiguous IDs remain blank) and a warning is emitted.

Returns:

dict[str, str] – When ids is a str or list[str]: a dictionary mapping each unique input identifier to its resolved primary ID. Identifiers not found in the mapping set are returned unchanged. Ambiguous identifiers are mapped to an empty string and a warning is emitted.
pandas.DataFrame – When ids is a DataFrame: a copy of the DataFrame with one new <col><suffix> column per entry in at. Ambiguous cells are set to ""; a warning is emitted after all columns are processed.

update_labels(labels: str, mapping_set: Sec2PriMappingSet, *, at: None = None, suffix: str = '_current', lookup: dict[str, str] | None = None, ambiguous: set[str] | None = None, synonyms: str | list[str] | None = None) → dict[str, str]

update_labels(labels: list[str], mapping_set: Sec2PriMappingSet, *, at: None = None, suffix: str = '_current', lookup: dict[str, str] | None = None, ambiguous: set[str] | None = None, synonyms: str | list[str] | None = None) → dict[str, str]

update_labels(labels: pd.DataFrame, mapping_set: Sec2PriMappingSet, *, at: str | list[str], suffix: str = '_current', lookup: dict[str, str] | None = None, ambiguous: set[str] | None = None, synonyms: str | list[str] | None = None) → pd.DataFrame

Resolve previous/alias gene labels to current labels.

Works like update_ids(), but resolves through the subject_label → object_label mappings rather than through IDs.

Parameters:

labels –
One of:
- str: a single label, or multiple labels joined by |, ,, ;, or whitespace.
- list[str]: a list of label strings.
- pandas.DataFrame: a DataFrame; you must also supply at.
mapping_set – A LabelMappingSet (e.g. the result of generate_hgnc_labels()).
at – DataFrame mode only. Column name or list of column names that contain labels. For each column col a new column named col + suffix is added to the returned DataFrame.
suffix – Suffix appended to column names in DataFrame mode (default "_current").
lookup – Pre-built {previous_label: current_label} dictionary. Pass the result of build_label_lookup() to avoid rebuilding on repeated calls.
ambiguous – Pre-built set of ambiguous labels (see build_ambiguous_labels_set()). When None, it is computed automatically from mapping_set.
synonyms – DataFrame mode only. Name of a column in the DataFrame that contains user-supplied alias strings (delimited by |, ,, ;, or whitespace) to help resolve ambiguous labels. When provided, resolve_ambiguous_with_hints() is called for every ambiguous cell using that row’s alias list. The alias index is built from mapping_set itself (non-IAO entries).

Returns:

dict[str, str] – When labels is a str or list[str]: a dictionary mapping each unique input label to its resolved current label. Labels not found in the mapping set are returned unchanged. Ambiguous labels are mapped to an empty string and a warning is emitted.
pandas.DataFrame – When labels is a DataFrame: a copy of the DataFrame with one new <col><suffix> column per entry in at. Ambiguous cells are set to ""; a warning is emitted after all columns are processed.