Update IDs
Resolve secondary identifiers to primary identifiers using a MappingSet.
Typical usage
Single string (possibly holding several IDs separated by commas, semicolons, pipes, or whitespace):
from pysec2pri import generate_hgnc
from pysec2pri.update_ids import update_ids
ms = generate_hgnc()
update_ids("HGNC:1234|HGNC:5678", ms)
# {'HGNC:1234': 'HGNC:9999', 'HGNC:5678': 'HGNC:5678'}
List of strings:
update_ids(["HGNC:1234", "HGNC:5678"], ms)
Pandas DataFrame, annotate one or more columns:
import pandas as pd
df = pd.DataFrame({"gene_id": ["HGNC:1234", "HGNC:5678"]})
update_ids(df, ms, at="gene_id")
# returns df with an extra column "gene_id_primary"
# Multiple columns at once (df must also contain an "alt_id" column):
update_ids(df, ms, at=["gene_id", "alt_id"])
# returns df with "gene_id_primary" and "alt_id_primary" columns added
Notes
Identifiers that are not found in the mapping set are returned/kept as-is.
Identifiers separated by common delimiters ("|", ",", ";", or whitespace) inside a single string are each looked up individually.
The mapping look-up is done once against the full set of unique IDs to avoid repeated scans of large mapping sets.
Ambiguous identifiers, those that appear both as a secondary ID in the mapping set and as a current primary ID, are left blank in the resolved output. A warning is emitted listing every ambiguous token so the user can resolve them manually.
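The splitting and single-pass look-up described in the notes above can be sketched as follows. This is a minimal illustration, not the library's implementation: the helper names split_tokens and resolve_all, the exact delimiter regex, and the example IDs are all assumptions made for the sketch.

```python
import re

# Assumed delimiter set: pipe, comma, semicolon, or any run of whitespace.
_DELIMS = re.compile(r"[|,;\s]+")


def split_tokens(value: str) -> list[str]:
    """Split a delimited string into non-empty identifier tokens."""
    return [tok for tok in _DELIMS.split(value) if tok]


def resolve_all(values: list[str], lookup: dict[str, str]) -> dict[str, str]:
    """Resolve every unique token once; unknown tokens map to themselves."""
    unique = {tok for value in values for tok in split_tokens(value)}
    return {tok: lookup.get(tok, tok) for tok in unique}


# Hypothetical lookup; the IDs are illustrative, not real HGNC mappings.
lookup = {"HGNC:1234": "HGNC:9999"}
resolved = resolve_all(["HGNC:1234|HGNC:5678", "HGNC:1234; HGNC:42"], lookup)
```

Collecting the tokens into a set before the look-up means that an ID repeated across many rows is resolved only once.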
- build_alias_index(mapping_set: Sec2PriMappingSet) → dict[str, list[str]]
Return {object_id: [subject_labels linked via non-IAO predicates]}.
Builds an index of all non-deprecation alias mappings in a LabelMappingSet. Only entries whose predicate_id is not IAO:0100001 are included; deprecation (IAO:0100001 / “term replaced by”) mappings are deliberately excluded because they express history, not active aliasing.
This index is used by resolve_ambiguous_with_hints() to confirm whether a user-supplied alias belongs to the secondary mapping’s target (confirming secondary usage) or to the entity’s own primary entry (confirming primary usage).
- Parameters:
mapping_set – A LabelMappingSet (e.g. the result of generate_hgnc_symbols()).
- Returns:
Dict mapping each object_id to the list of subject_label values that point to it via a non-IAO predicate.
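As a rough illustration of the filtering rule, the sketch below builds the same kind of index from plain dictionaries. It assumes each mapping can be viewed as a row with subject_label, object_id, and predicate_id keys; the function name alias_index_sketch, the row layout, and the example predicate and IDs are hypothetical stand-ins for the real mapping-set object.

```python
from collections import defaultdict

DEPRECATION_PREDICATE = "IAO:0100001"  # "term replaced by"


def alias_index_sketch(rows: list[dict[str, str]]) -> dict[str, list[str]]:
    """Group subject labels by object_id, skipping deprecation mappings."""
    index: dict[str, list[str]] = defaultdict(list)
    for row in rows:
        if row["predicate_id"] == DEPRECATION_PREDICATE:
            continue  # history, not an active alias
        index[row["object_id"]].append(row["subject_label"])
    return dict(index)


# Illustrative rows only; labels, IDs, and the alias predicate are made up.
rows = [
    {"subject_label": "OLD1", "object_id": "EX:1", "predicate_id": "IAO:0100001"},
    {"subject_label": "ALIAS1", "object_id": "EX:1", "predicate_id": "EX:alias"},
    {"subject_label": "ALIAS2", "object_id": "EX:1", "predicate_id": "EX:alias"},
    {"subject_label": "ALIAS3", "object_id": "EX:2", "predicate_id": "EX:alias"},
]
index = alias_index_sketch(rows)
```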
- build_ambiguous_set(mapping_set: Sec2PriMappingSet) → set[str]
Return the set of ambiguous subject IDs in mapping_set.
An identifier is ambiguous when it appears both as a subject_id (i.e. a secondary/previous term) and as a current primary identifier, either in the explicitly stored _primary_ids set or among the object_id values of the mappings.
When such overlap exists, a naïve replacement could silently corrupt references that already use the current entity, so the resolver intentionally leaves those cells blank.
- Parameters:
mapping_set – Any Sec2PriMappingSet.
- Returns:
Set of ID strings that are both secondary and primary. Empty set when no ambiguity is detected.
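The ambiguity test amounts to a set intersection. The sketch below shows it on plain (subject_id, object_id) pairs; the function name ambiguous_ids_sketch, the pair representation, and the example IDs are assumptions for illustration and do not reflect how the mapping set is stored internally.

```python
from collections.abc import Iterable


def ambiguous_ids_sketch(
    pairs: Iterable[tuple[str, str]],
    primary_ids: Iterable[str] = (),
) -> set[str]:
    """Return IDs that occur both as a secondary and as a primary."""
    pairs = list(pairs)
    secondary = {subject for subject, _ in pairs}
    # Primaries come from an explicit store plus every mapping target.
    primary = set(primary_ids) | {obj for _, obj in pairs}
    return secondary & primary


# Illustrative data: EX:2 is a target of EX:1 and also replaced by EX:3;
# EX:4 is listed as primary but also appears as a secondary.
pairs = [("EX:1", "EX:2"), ("EX:2", "EX:3"), ("EX:4", "EX:5")]
ambiguous = ambiguous_ids_sketch(pairs, primary_ids={"EX:4"})
```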
- build_ambiguous_symbols_set(mapping_set: Sec2PriMappingSet) → set[str]
Return the set of ambiguous subject labels in mapping_set.
Analogous to build_ambiguous_set() but operates on subject_label / object_label (symbol) mappings.
- Parameters:
mapping_set – Any LabelMappingSet.
- Returns:
Set of label strings that are both secondary and primary. Empty set when no ambiguity is detected.
- build_lookup(mapping_set: Sec2PriMappingSet) → dict[str, str]
Return a {secondary_id: primary_id} dictionary.
Useful when you want to apply the look-up yourself or cache it for repeated calls.
- Parameters:
mapping_set – A Sec2PriMappingSet (e.g. the object returned by generate_hgnc()).
- Returns:
Dictionary mapping every secondary ID to its current primary ID.
- build_primary_token_to_id(mapping_set: Sec2PriMappingSet) → dict[str, str]
Return {primary_label: primary_id} from a label mapping set.
Collects every (object_label, object_id) pair seen in the mappings and, where available, the _primary_symbols store. Useful for translating a primary symbol string into its CURIE so that build_alias_index() (keyed by object_id) can be looked up.
- Parameters:
mapping_set – A LabelMappingSet.
- Returns:
Dict {primary_symbol: primary_id}.
- build_symbol_lookup(mapping_set: Sec2PriMappingSet) → dict[str, str]
Return a {secondary_label: primary_label} dictionary.
Useful when you want to apply the look-up yourself or cache it for repeated calls.
- Parameters:
mapping_set – A LabelMappingSet (e.g. the result of generate_hgnc_symbols()).
- Returns:
Dictionary mapping every previous/alias symbol to its current symbol.
- resolve_ambiguous_with_hints(ambiguous_token: str, user_aliases: list[str], lkp: dict[str, str], alias_index: dict[str, list[str]], token_to_id: dict[str, str] | None = None) → tuple[str, str | None]
Attempt to resolve an ambiguous symbol or ID using user-provided alias hints.
An ambiguous token appears both as a current primary entry and as a secondary (subject) in a mapping that points to a different primary.
Two resolution cases are checked:
Secondary usage: at least one of the user-supplied aliases matches the target token itself (its primary label or primary ID), or appears among the non-IAO aliases of the mapping’s target (lkp[ambiguous_token]). This confirms the token is being used as a secondary alias of the target; returns (target_token, target_id).
Primary usage: at least one of the user-supplied aliases appears among the non-IAO aliases of the token’s own primary entry. This confirms the token is being used as a standalone primary; returns (ambiguous_token, own_id).
If neither case applies, the ambiguity cannot be resolved and ("", None) is returned.
- Parameters:
ambiguous_token – The symbol or ID that is both primary and secondary.
user_aliases – Alias strings provided by the user (e.g. from a known_alias column) to help determine which entity is actually meant.
lkp – {secondary_token: resolved_token} lookup (from build_symbol_lookup() or build_lookup()).
alias_index – {primary_id: [non-IAO alias strings]} built via build_alias_index().
token_to_id – {primary_token: primary_id}. When None, the token is treated as its own ID, which is appropriate for ID mapping sets where the token already is a CURIE (e.g. "HGNC:53564").
- Returns:
Secondary case -> (target_token, target_id)
Primary case -> (ambiguous_token, own_id)
Unresolvable -> ("", None)
- Return type:
A (resolved_token, resolved_id) tuple
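The two-case decision described for resolve_ambiguous_with_hints() can be sketched in plain Python as below. This is an assumption-laden reading of the description, not the library's code: the function name resolve_with_hints_sketch is hypothetical, the secondary case is checked first (following the order in which the cases are listed), and all symbols and IDs in the example are invented.

```python
from typing import Optional


def resolve_with_hints_sketch(
    token: str,
    user_aliases: list[str],
    lkp: dict[str, str],
    alias_index: dict[str, list[str]],
    token_to_id: Optional[dict[str, str]] = None,
) -> tuple[str, Optional[str]]:
    """Pick the secondary or primary reading of an ambiguous token."""

    def to_id(tok: str) -> Optional[str]:
        # Without a token_to_id map, the token is assumed to be its own ID.
        return tok if token_to_id is None else token_to_id.get(tok)

    hints = set(user_aliases)
    target = lkp[token]  # assumes an ambiguous token is always a key of lkp
    target_id = to_id(target)

    # Secondary usage: a hint names the target or one of the target's aliases.
    if hints & ({target, target_id} | set(alias_index.get(target_id, []))):
        return target, target_id

    # Primary usage: a hint is an alias of the token's own primary entry.
    own_id = to_id(token)
    if hints & set(alias_index.get(own_id, [])):
        return token, own_id

    return "", None


# Invented example: "ABC" is a current symbol and also a previous symbol of "XYZ".
lkp = {"ABC": "XYZ"}
token_to_id = {"ABC": "EX:1", "XYZ": "EX:2"}
alias_index = {"EX:1": ["ABC-alpha"], "EX:2": ["XYZ-old"]}

as_secondary = resolve_with_hints_sketch("ABC", ["XYZ-old"], lkp, alias_index, token_to_id)
as_primary = resolve_with_hints_sketch("ABC", ["ABC-alpha"], lkp, alias_index, token_to_id)
unresolved = resolve_with_hints_sketch("ABC", ["unrelated"], lkp, alias_index, token_to_id)
```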
- update_ids(ids: str, mapping_set: Sec2PriMappingSet, *, at: None = None, suffix: str = '_primary', lookup: dict[str, str] | None = None, ambiguous: set[str] | None = None) → dict[str, str]
- update_ids(ids: list[str], mapping_set: Sec2PriMappingSet, *, at: None = None, suffix: str = '_primary', lookup: dict[str, str] | None = None, ambiguous: set[str] | None = None) → dict[str, str]
- update_ids(ids: pd.DataFrame, mapping_set: Sec2PriMappingSet, *, at: str | list[str], suffix: str = '_primary', lookup: dict[str, str] | None = None, ambiguous: set[str] | None = None, synonyms: str | list[str] | None = None, label_mapping_set: Sec2PriMappingSet | None = None) → pd.DataFrame
Resolve secondary identifiers to primary identifiers.
- Parameters:
ids – One of:
str: a single identifier, or multiple identifiers joined by "|", ",", ";", or whitespace.
list[str]: a list of identifier strings (each may itself contain multiple IDs separated by the delimiters above).
pandas.DataFrame: a DataFrame; you must also supply at.
mapping_set – The Sec2PriMappingSet to look up against (e.g. the result of generate_hgnc()).
at – DataFrame mode only. Column name or list of column names that contain identifiers. For each column col a new column named col + suffix is added to the returned DataFrame.
suffix – Suffix appended to column names in DataFrame mode (default "_primary").
lookup – Pre-built {secondary_id: primary_id} dictionary. Pass the result of build_lookup() to avoid rebuilding on repeated calls.
ambiguous – Pre-built set of ambiguous IDs (see build_ambiguous_set()). When None, it is computed automatically from mapping_set. Pass an explicit set (including an empty one) to skip the computation.
synonyms – DataFrame mode only. Name of a column in the DataFrame that contains user-supplied alias strings (delimited by "|", ",", ";", or whitespace) to help resolve ambiguous identifiers. When provided, resolve_ambiguous_with_hints() is called for every ambiguous cell using that row’s alias list.
label_mapping_set – A LabelMappingSet used to build the alias index when synonyms is provided. When None and synonyms is set, hint-based resolution is skipped (ambiguous IDs remain blank) and a warning is emitted.
- Returns:
dict[str, str] – When ids is a str or list[str]: a dictionary mapping each unique input identifier to its resolved primary ID. Identifiers not found in the mapping set are returned unchanged. Ambiguous identifiers are mapped to an empty string and a warning is emitted.
pandas.DataFrame – When ids is a DataFrame: a copy of the DataFrame with one new <col><suffix> column per entry in at. Ambiguous cells are set to ""; a warning is emitted after all columns are processed.
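To make the string/list return contract concrete, here is a small stand-alone sketch that mimics it with a pre-built look-up and ambiguous set. It is not the library function: the name update_ids_sketch, the warning text, and the example IDs are assumptions, and the real update_ids() takes a mapping set rather than bare dictionaries.

```python
import re
import warnings
from typing import Union


def update_ids_sketch(
    ids: Union[str, list[str]],
    lookup: dict[str, str],
    ambiguous: set[str],
) -> dict[str, str]:
    """Map each unique token to its primary ID, blanking ambiguous ones."""
    values = [ids] if isinstance(ids, str) else list(ids)
    tokens = {t for v in values for t in re.split(r"[|,;\s]+", v) if t}
    result = {
        tok: "" if tok in ambiguous else lookup.get(tok, tok)
        for tok in tokens
    }
    flagged = sorted(tokens & ambiguous)
    if flagged:
        # One warning listing every ambiguous token that was left blank.
        warnings.warn(
            "Ambiguous identifiers left blank: " + ", ".join(flagged),
            stacklevel=2,
        )
    return result


# Invented IDs: EX:2 is both a secondary (of EX:3) and a current primary.
lookup = {"EX:1": "EX:9", "EX:2": "EX:3"}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    out = update_ids_sketch("EX:1|EX:2 EX:3", lookup, ambiguous={"EX:2"})
```

Passing the look-up and ambiguous set in explicitly mirrors the lookup and ambiguous parameters above, which exist so that neither has to be rebuilt on repeated calls.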
- update_symbols(symbols: str, mapping_set: Sec2PriMappingSet, *, at: None = None, suffix: str = '_current', lookup: dict[str, str] | None = None, ambiguous: set[str] | None = None, synonyms: str | list[str] | None = None) → dict[str, str]
- update_symbols(symbols: list[str], mapping_set: Sec2PriMappingSet, *, at: None = None, suffix: str = '_current', lookup: dict[str, str] | None = None, ambiguous: set[str] | None = None, synonyms: str | list[str] | None = None) → dict[str, str]
- update_symbols(symbols: pd.DataFrame, mapping_set: Sec2PriMappingSet, *, at: str | list[str], suffix: str = '_current', lookup: dict[str, str] | None = None, ambiguous: set[str] | None = None, synonyms: str | list[str] | None = None) → pd.DataFrame
Resolve previous/alias gene symbols to current symbols.
Same as update_ids() but resolves via the subject_label to object_label mapping rather than IDs.
- Parameters:
symbols – One of:
str: a single symbol, or multiple symbols joined by "|", ",", ";", or whitespace.
list[str]: a list of symbol strings.
pandas.DataFrame: a DataFrame; you must also supply at.
mapping_set – A LabelMappingSet (e.g. the result of generate_hgnc_symbols()).
at – DataFrame mode only. Column name or list of column names that contain symbols. For each column col a new column named col + suffix is added to the returned DataFrame.
suffix – Suffix appended to column names in DataFrame mode (default "_current").
lookup – Pre-built {previous_symbol: current_symbol} dictionary. Pass the result of build_symbol_lookup() to avoid rebuilding on repeated calls.
ambiguous – Pre-built set of ambiguous labels (see build_ambiguous_symbols_set()). When None, it is computed automatically from mapping_set.
synonyms – DataFrame mode only. Name of a column in the DataFrame that contains user-supplied alias strings (delimited by "|", ",", ";", or whitespace) to help resolve ambiguous symbols. When provided, resolve_ambiguous_with_hints() is called for every ambiguous cell using that row’s alias list. The alias index is built from mapping_set itself (non-IAO entries).
- Returns:
dict[str, str] – When symbols is a str or list[str]: a dictionary mapping each unique input symbol to its resolved current symbol. Symbols not found in the mapping set are returned unchanged. Ambiguous symbols are mapped to an empty string and a warning is emitted.
pandas.DataFrame – When symbols is a DataFrame: a copy of the DataFrame with one new <col><suffix> column per entry in at. Ambiguous cells are set to ""; a warning is emitted after all columns are processed.