Update IDs
Resolve secondary identifiers to primary identifiers using a MappingSet.
Typical usage
Single string (possibly separated by commas/semicolons/pipes/whitespace):
from pysec2pri import generate_hgnc
from pysec2pri.update_ids import update_ids
ms = generate_hgnc()
update_ids("HGNC:1234|HGNC:5678", ms)
# {'HGNC:1234': 'HGNC:9999', 'HGNC:5678': 'HGNC:5678'}
List of strings:
update_ids(["HGNC:1234", "HGNC:5678"], ms)
Pandas DataFrame, annotate one or more columns:
import pandas as pd
df = pd.DataFrame({"gene_id": ["HGNC:1234", "HGNC:5678"]})
update_ids(df, ms, at="gene_id")
# returns df with an extra column "gene_id_primary"
# Multiple columns at once (df must also contain an "alt_id" column):
update_ids(df, ms, at=["gene_id", "alt_id"])
# returns df with "gene_id_primary" and "alt_id_primary" columns added
Notes
Identifiers that are not found in the mapping set are returned/kept as-is.
Identifiers separated by common delimiters ("|", ",", ";", or whitespace) inside a single string are each looked up individually.
The mapping look-up is done once against the full set of unique IDs to avoid repeated scans of large mapping sets.
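The splitting and de-duplication described in these notes can be sketched in plain Python. This is a minimal illustration and not the library's implementation: split_ids, resolve, and the LOOKUP dictionary are hypothetical names, and the regular expression is an assumption about how the listed delimiters might be matched.

```python
import re

# Hypothetical stand-in for a {secondary_id: primary_id} dictionary.
LOOKUP = {"HGNC:1234": "HGNC:9999"}

# One or more of: pipe, comma, semicolon, or whitespace (assumed delimiter set).
_DELIMITERS = re.compile(r"[|,;\s]+")


def split_ids(text: str) -> list[str]:
    """Split a delimited string into individual, non-empty identifiers."""
    return [token for token in _DELIMITERS.split(text) if token]


def resolve(inputs: list[str], lookup: dict[str, str]) -> dict[str, str]:
    """Resolve every unique identifier once; unknown IDs map to themselves."""
    unique: dict[str, None] = {}  # a dict keeps first-seen order and drops repeats
    for text in inputs:
        for identifier in split_ids(text):
            unique[identifier] = None
    return {identifier: lookup.get(identifier, identifier) for identifier in unique}


result = resolve(["HGNC:1234|HGNC:5678", "HGNC:1234; HGNC:0001"], LOOKUP)
print(result)
```

Because the unique identifiers are collected first, each one is looked up only once no matter how often it occurs in the input.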
- build_lookup(mapping_set: Sec2PriMappingSet) -> dict[str, str]
Return a {secondary_id: primary_id} dictionary.
Useful when you want to apply the look-up yourself or cache it for repeated calls.
- Parameters:
mapping_set – A Sec2PriMappingSet (e.g. the object returned by generate_hgnc()).
- Returns:
Dictionary mapping every secondary ID to its current primary ID.
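Applying the look-up yourself might look like the sketch below. The lookup dictionary here is a hand-written stand-in for whatever build_lookup() returns, and the records list is made-up illustrative data; only the dict.get fallback pattern is the point.

```python
# Hypothetical stand-in for the dictionary returned by build_lookup().
lookup = {"HGNC:1234": "HGNC:9999"}

# Illustrative records keyed by a possibly outdated identifier.
records = [("HGNC:1234", 0.82), ("HGNC:5678", 0.11)]

# dict.get(key, key) falls back to the input, so unknown IDs are kept as-is.
updated = [(lookup.get(gene_id, gene_id), score) for gene_id, score in records]
print(updated)
```

Since the dictionary is an ordinary dict, it can be built once and reused for any number of such passes.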
- build_symbol_lookup(mapping_set: Sec2PriMappingSet) -> dict[str, str]
Return a {secondary_label: primary_label} dictionary.
Useful when you want to apply the look-up yourself or cache it for repeated calls.
- Parameters:
mapping_set – A LabelMappingSet (e.g. the result of generate_hgnc_symbols()).
- Returns:
Dictionary mapping every previous/alias symbol to its current symbol.
- update_ids(ids: str, mapping_set: Sec2PriMappingSet, *, at: None = None, suffix: str = '_primary', lookup: dict[str, str] | None = None) -> dict[str, str]
- update_ids(ids: list[str], mapping_set: Sec2PriMappingSet, *, at: None = None, suffix: str = '_primary', lookup: dict[str, str] | None = None) -> dict[str, str]
- update_ids(ids: pd.DataFrame, mapping_set: Sec2PriMappingSet, *, at: str | list[str], suffix: str = '_primary', lookup: dict[str, str] | None = None) -> pd.DataFrame
Resolve secondary identifiers to primary identifiers.
- Parameters:
ids – One of:
str: a single identifier, or multiple identifiers joined by "|", ",", ";", or whitespace.
list[str]: a list of identifier strings (each may itself contain multiple IDs separated by the delimiters above).
pandas.DataFrame: a DataFrame; you must also supply at.
mapping_set – The Sec2PriMappingSet to look up against (e.g. the result of generate_hgnc()).
at – DataFrame mode only. Column name or list of column names that contain identifiers. For each column col, a new column named col + suffix is added to the returned DataFrame.
suffix – Suffix appended to column names in DataFrame mode (default "_primary").
lookup – Pre-built {secondary_id: primary_id} dictionary. Pass the result of build_lookup() to avoid rebuilding on repeated calls.
- Returns:
dict[str, str] – When ids is a str or list[str]: a dictionary mapping each unique input identifier to its resolved primary ID. Identifiers not found in the mapping set are returned unchanged.
pandas.DataFrame – When ids is a DataFrame: a copy of the DataFrame with one new <col><suffix> column per entry in at.
Examples
Setup:
ms = generate_hgnc()
Single string:
update_ids("HGNC:1234", ms) # {'HGNC:1234': 'HGNC:9999'}
Pipe-separated string:
update_ids("HGNC:1234|HGNC:5678", ms) # {'HGNC:1234': 'HGNC:9999', 'HGNC:5678': 'HGNC:5678'}
List:
update_ids(["HGNC:1234", "HGNC:5678", "HGNC:1234"], ms) # {'HGNC:1234': 'HGNC:9999', 'HGNC:5678': 'HGNC:5678'}
DataFrame:
import pandas as pd
df = pd.DataFrame({"gene": ["HGNC:1234", "HGNC:5678"]})
update_ids(df, ms, at="gene")
#         gene gene_primary
# 0  HGNC:1234    HGNC:9999
# 1  HGNC:5678    HGNC:5678
- update_symbols(symbols: IdsInput, mapping_set: Sec2PriMappingSet, *, at: str | list[str] | None = None, suffix: str = '_current', lookup: dict[str, str] | None = None) -> dict[str, str] | pd.DataFrame
Resolve previous/alias gene symbols to current symbols.
Behaves identically to update_ids() but resolves via the subject_label to object_label mapping rather than IDs.
- Parameters:
symbols – One of:
str: a single symbol, or multiple symbols joined by "|", ",", ";", or whitespace.
list[str]: a list of symbol strings.
pandas.DataFrame: a DataFrame; you must also supply at.
mapping_set – A LabelMappingSet (e.g. the result of generate_hgnc_symbols()).
at – DataFrame mode only. Column name or list of column names that contain symbols. For each column col, a new column named col + suffix is added to the returned DataFrame.
suffix – Suffix appended to column names in DataFrame mode (default "_current").
lookup – Pre-built {previous_symbol: current_symbol} dictionary. Pass the result of build_symbol_lookup() to avoid rebuilding on repeated calls.
- Returns:
dict[str, str] – When symbols is a str or list[str]: a dictionary mapping each unique input symbol to its resolved current symbol. Symbols not found in the mapping set are returned unchanged.
pandas.DataFrame – When symbols is a DataFrame: a copy of the DataFrame with one new <col><suffix> column per entry in at.