Aggregate

A comprehensive set of tools for aggregating and reshaping tabular proteomics data.

The aggregation module contains submodules that offer functionalities to transform data from lower levels of abstraction (e.g. ions, peptides) to higher levels (e.g. peptides, proteins, PTMs) through various summarization and condensation techniques. It also includes methods for reshaping tables from "long" to "wide" format, a common prerequisite for aggregation. The MaxLFQ algorithm is integrated for specific quantitative summarizations, enabling users to build customized, higher-level data tables.

Modules:

Name	Description
`condense`	Low-level functions for aggregating numerical and string data.
`pivot`	Functionalities for reshaping tabular quantitative proteomics data.
`summarize`	High-level functions for aggregating quantitative proteomics data.

condense

Low-level functions for aggregating numerical and string data.

This module defines fundamental "condenser" functions that operate directly on NumPy arrays. These functions are designed to be applied to groups of data, performing operations such as summing values, finding maximum/minimum, counting or joining unique elements, and calculating abundance profiles. It includes the core implementations for MaxLFQ summation.

Functions:

Name	Description
`join_str`	Returns a joined string of sorted values from the array.
`join_str_per_column`	Returns for each column a joined string of sorted values.
`join_unique_str`	Returns a joined string of unique sorted values from the array.
`join_unique_str_per_column`	Returns for each column a joined string of unique sorted values.
`sum`	Returns sum of values from one or multiple columns.
`sum_per_column`	Returns for each column the sum of values.
`maximum`	Returns the highest finite value from one or multiple columns.
`maximum_per_column`	Returns for each column the highest finite value.
`minimum`	Returns the lowest finite value from one or multiple columns.
`minimum_per_column`	Returns for each column the lowest finite value.
`count_unique`	Returns the number of unique values from one or multiple columns.
`count_unique_per_column`	Returns for each column the number of unique values.
`profile_by_median_ratio_regression`	Calculates abundance profiles by lstsq regression of pair-wise median ratios.
`sum_by_median_ratio_regression`	Calculates summed abundance by lstsq regression of pair-wise median ratios.

join_str

join_str(array: ndarray, sep: str = ';') -> str

Returns a joined string of sorted values from the array.

Note that empty strings or np.nan are not included in the joined string.

Source code in msreport\aggregate\condense.py

def join_str(array: np.ndarray, sep: str = ";") -> str:
    """Returns a joined string of sorted values from the array.

    Note that empty strings or np.nan are not included in the joined string.
    """
    elements = []
    for value in array.flatten():
        if value != "" and not (isinstance(value, float) and np.isnan(value)):
            elements.append(str(value))
    return sep.join(sorted(elements))

join_str_per_column

join_str_per_column(
    array: ndarray, sep: str = ";"
) -> ndarray

Returns for each column a joined string of sorted values.

Note that empty strings or np.nan are not included in the joined string.

Source code in msreport\aggregate\condense.py

def join_str_per_column(array: np.ndarray, sep: str = ";") -> np.ndarray:
    """Returns for each column a joined string of sorted values.

    Note that empty strings or np.nan are not included in the joined string.
    """
    return np.array([join_str(i, sep=sep) for i in array.transpose()])

join_unique_str

join_unique_str(array: ndarray, sep: str = ';') -> str

Returns a joined string of unique sorted values from the array.

Source code in msreport\aggregate\condense.py

def join_unique_str(array: np.ndarray, sep: str = ";") -> str:
    """Returns a joined string of unique sorted values from the array."""
    elements = []
    for value in array.flatten():
        if value != "" and not (isinstance(value, float) and np.isnan(value)):
            elements.append(str(value))
    return sep.join(sorted(set(elements)))

join_unique_str_per_column

join_unique_str_per_column(
    array: ndarray, sep: str = ";"
) -> ndarray

Returns for each column a joined string of unique sorted values.

Source code in msreport\aggregate\condense.py

def join_unique_str_per_column(array: np.ndarray, sep: str = ";") -> np.ndarray:
    """Returns for each column a joined string of unique sorted values."""
    return np.array([join_unique_str(i, sep=sep) for i in array.transpose()])

sum

sum(array: ndarray) -> float

Returns sum of values from one or multiple columns.

Note that if no finite values are present in the array np.nan is returned.

Source code in msreport\aggregate\condense.py

def sum(array: np.ndarray) -> float:
    """Returns sum of values from one or multiple columns.

    Note that if no finite values are present in the array np.nan is returned.
    """
    array = array.flatten()
    if np.isfinite(array).any():
        return np.nansum(array)
    else:
        return np.nan
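
The guard on finite values matters because `np.nansum` alone returns 0.0 for an all-NaN input, which would be indistinguishable from a true zero abundance. The sketch below uses a local copy of the function above, named `condense_sum` here only to avoid shadowing the built-in `sum`; the input arrays are illustrative.

```python
import numpy as np


def condense_sum(array: np.ndarray) -> float:
    # Local copy of the `sum` condenser shown above.
    array = array.flatten()
    if np.isfinite(array).any():
        return np.nansum(array)
    return np.nan


all_missing = np.array([np.nan, np.nan])
partly_missing = np.array([[1.0, np.nan], [2.0, 4.0]])

plain_nansum = np.nansum(all_missing)     # 0.0, missing values look like zero
guarded_sum = condense_sum(all_missing)   # nan, missing stays missing
mixed_sum = condense_sum(partly_missing)  # 7.0, NaN entries are skipped
```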

sum_per_column

sum_per_column(array: ndarray) -> ndarray

Returns for each column the sum of values.

Note that if no finite values are present in a column np.nan is returned.

Source code in msreport\aggregate\condense.py

def sum_per_column(array: np.ndarray) -> np.ndarray:
    """Returns for each column the sum of values.

    Note that if no finite values are present in a column np.nan is returned.
    """
    return np.array([sum(i) for i in array.transpose()])

maximum

maximum(array: ndarray) -> float

Returns the highest finite value from one or multiple columns.

Source code in msreport\aggregate\condense.py

def maximum(array: np.ndarray) -> float:
    """Returns the highest finite value from one or multiple columns."""
    array = array.flatten()
    if np.isfinite(array).any():
        return np.nanmax(array)
    else:
        return np.nan

maximum_per_column

maximum_per_column(array: ndarray) -> ndarray

Returns for each column the highest finite value.

Source code in msreport\aggregate\condense.py

def maximum_per_column(array: np.ndarray) -> np.ndarray:
    """Returns for each column the highest finite value."""
    return np.array([maximum(i) for i in array.transpose()])

minimum

minimum(array: ndarray) -> float

Returns the lowest finite value from one or multiple columns.

Source code in msreport\aggregate\condense.py

def minimum(array: np.ndarray) -> float:
    """Returns the lowest finite value from one or multiple columns."""
    array = array.flatten()
    if np.isfinite(array).any():
        return np.nanmin(array)
    else:
        return np.nan

minimum_per_column

minimum_per_column(array: ndarray) -> ndarray

Returns for each column the lowest finite value.

Source code in msreport\aggregate\condense.py

def minimum_per_column(array: np.ndarray) -> np.ndarray:
    """Returns for each column the lowest finite value."""
    return np.array([minimum(i) for i in array.transpose()])
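
As a rough illustration of the per-column variants, the sketch below applies the same "finite value or np.nan" rule to each column of a small array. The helper names (`_finite_or_nan`, `_max_per_column`, `_min_per_column`) and the input values are hypothetical; the logic mirrors the listings above.

```python
import numpy as np


def _finite_or_nan(array: np.ndarray, reducer) -> float:
    # Apply the NaN-aware reducer only if at least one finite value exists.
    array = array.flatten()
    return reducer(array) if np.isfinite(array).any() else np.nan


def _max_per_column(array: np.ndarray) -> np.ndarray:
    return np.array([_finite_or_nan(col, np.nanmax) for col in array.transpose()])


def _min_per_column(array: np.ndarray) -> np.ndarray:
    return np.array([_finite_or_nan(col, np.nanmin) for col in array.transpose()])


intensities = np.array(
    [
        [1.0, np.nan, 5.0],
        [3.0, np.nan, np.nan],
    ]
)
highest = _max_per_column(intensities)  # [3.0, nan, 5.0]
lowest = _min_per_column(intensities)   # [1.0, nan, 5.0]
```

The second column contains no finite value, so both results hold np.nan at that position instead of triggering NumPy's all-NaN slice warning.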

count_unique

count_unique(array: ndarray) -> int

Returns the number of unique values from one or multiple columns.

Note that empty strings or np.nan are not counted as unique values.

Source code in msreport\aggregate\condense.py

def count_unique(array: np.ndarray) -> int:
    """Returns the number of unique values from one or multiple columns.

    Note that empty strings or np.nan are not counted as unique values.
    """
    unique_elements = {
        x for x in array.flatten() if not (isinstance(x, float) and np.isnan(x))
    }
    unique_elements.discard("")

    return len(unique_elements)

count_unique_per_column

count_unique_per_column(array: ndarray) -> ndarray

Returns for each column the number of unique values.

Note that empty strings or np.nan are not counted as unique values.

Source code in msreport\aggregate\condense.py

def count_unique_per_column(array: np.ndarray) -> np.ndarray:
    """Returns for each column the number of unique values.

    Note that empty strings or np.nan are not counted as unique values.
    """
    if array.size > 0:
        return np.array([count_unique(i) for i in array.transpose()])
    else:
        return np.full(array.shape[0], 0)

profile_by_median_ratio_regression

profile_by_median_ratio_regression(
    array: ndarray,
) -> ndarray

Calculates abundance profiles by lstsq regression of pair-wise median ratios.

The function performs a least squares regression of pair-wise median ratios to calculate estimated abundance profiles.

Parameters:

Name	Type	Description	Default
`array`	`ndarray`	A two-dimensional array containing abundance values, with the first dimension corresponding to rows and the second dimension to columns. Abundance values must not be log transformed.	required

Returns:

Type	Description
`ndarray`	An array containing estimated abundance profiles, with length equal to the number of columns in the input array.

Source code in msreport\aggregate\condense.py

def profile_by_median_ratio_regression(array: np.ndarray) -> np.ndarray:
    """Calculates abundance profiles by lstsq regression of pair-wise median ratios.

    The function performs a least squares regression of pair-wise median ratios to
    calculate estimated abundance profiles.

    Args:
        array: A two-dimensional array containing abundance values, with the first
            dimension corresponding to rows and the second dimension to columns.
            Abundance values must not be log transformed.

    Returns:
        An array containing estimated abundance profiles, with length equal to the
        number of columns in the input array.
    """
    ratio_matrix = MAXLFQ.calculate_pairwise_median_log_ratio_matrix(
        array, log_transformed=False
    )
    coef_matrix, ratio_array, initial_rows = MAXLFQ.prepare_coefficient_matrix(
        ratio_matrix
    )
    log_profile = MAXLFQ.log_profiles_by_lstsq(coef_matrix, ratio_array)
    profile = np.power(2, log_profile)
    return profile

sum_by_median_ratio_regression

sum_by_median_ratio_regression(array: ndarray) -> ndarray

Calculates summed abundance by lstsq regression of pair-wise median ratios.

The function performs a least squares regression of pair-wise median ratios to calculate estimated abundance profiles. These profiles are then scaled based on the input array such that the columns with finite profile values are used and the sum of the scaled profiles matches the sum of the input array.

Parameters:

Name	Type	Description	Default
`array`	`ndarray`	A two-dimensional array containing abundance values, with the first dimension corresponding to rows and the second dimension to columns. Abundance values must not be log transformed.	required

Returns:

Type	Description
`ndarray`	An array containing summed abundance estimates, with length equal to the number of columns in the input array.

Source code in msreport\aggregate\condense.py

def sum_by_median_ratio_regression(array: np.ndarray) -> np.ndarray:
    """Calculates summed abundance by lstsq regression of pair-wise median ratios.

    The function performs a least squares regression of pair-wise median ratios to
    calculate estimated abundance profiles. These profiles are then scaled based on the
    input array such that the columns with finite profile values are used and the sum of
    the scaled profiles matches the sum of the input array.

    Args:
        array: A two-dimensional array containing abundance values, with the first
            dimension corresponding to rows and the second dimension to columns.
            Abundance values must not be log transformed.

    Returns:
        An array containing summed abundance estimates, with length equal to the number
        of columns in the input array.
    """
    profile = profile_by_median_ratio_regression(array)
    scaled_profile = profile
    if np.isfinite(profile).any():
        profile_mask = np.isfinite(profile)
        scaled_profile[profile_mask] = profile[profile_mask] * (
            np.nansum(array[:, profile_mask]) / np.nansum(profile[profile_mask])
        )

    return scaled_profile
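
The scaling step can be followed with a small worked example. The sketch below starts from a made-up profile and input array (both are assumptions for illustration) and applies the same arithmetic as the listing above: only columns with a finite profile value contribute to either sum.

```python
import numpy as np

# Hypothetical profile with one column that could not be estimated.
profile = np.array([0.5, 1.0, np.nan])
array = np.array(
    [
        [1.0, 2.0, 5.0],
        [3.0, 6.0, np.nan],
    ]
)

mask = np.isfinite(profile)
# Sum of the input restricted to columns with a finite profile: 1 + 2 + 3 + 6 = 12.
# Sum of the finite profile values: 0.5 + 1.0 = 1.5, so the factor is 8.
scaling_factor = np.nansum(array[:, mask]) / np.nansum(profile[mask])

scaled = profile.copy()  # copy so the unscaled profile stays available
scaled[mask] = profile[mask] * scaling_factor  # [4.0, 8.0, nan]
```

Note that the value 5.0 in the third column is ignored, because that column has no finite profile value.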

pivot

Functionalities for reshaping tabular quantitative proteomics data.

This module offers methods to transform data from a "long" format into a "wide" format, which is a common and often necessary step before aggregation or analysis. It supports pivoting data based on specified index and grouping columns, and can handle both quantitative values and annotation columns.

Functions:

Name	Description
`pivot_table`	Generates a pivoted table in wide format.
`pivot_column`	Returns a reshaped dataframe, generated by pivoting the table on one column.
`join_unique`	Returns a new dataframe with unique values from a column and grouped by 'index'.

pivot_table

pivot_table(
    long_table: DataFrame,
    index: str,
    group_by: str,
    annotation_columns: Iterable[str],
    pivoting_columns: Iterable[str],
) -> DataFrame

Generates a pivoted table in wide format.

Parameters:

Name	Type	Description	Default
`long_table`	`DataFrame`	Dataframe in long format that is used to generate a table in wide format.	required
`index`	`str`	One or multiple column names that are used to group the table for pivoting.	required
`group_by`	`str`	Column that is used to split the table on its unique entries.	required
`annotation_columns`	`Iterable[str]`	Each column generates a new column in the pivoted table. Entries from each annotation column are aggregated for each group created by the column(s) specified by 'index' and unique values are joined together with ";" as separator.	required
`pivoting_columns`	`Iterable[str]`	Columns that are combined with unique entries from 'group_by' to generate new columns in the pivoted table.	required

Returns:

Type	Description
`DataFrame`	A reshaped, pivot table with length equal to unique values from the 'index' column.

Example

>>> table = pd.DataFrame(
...     {
...         "ID": ["A", "B", "C", "B", "C", "D"],
...         "Sample": ["S1", "S1", "S1", "S2", "S2", "S2"],
...         "Annotation": ["A", "B", "C", "B", "C", "D"],
...         "Quant": [1.0, 1.0, 1.0, 2.0, 2.0, 2.0],
...     }
... )
>>> pivot_table(table, "ID", "Sample", ["Annotation"], ["Quant"])
  ID  Annotation  Quant S1  Quant S2
0  A           A       1.0       NaN
1  B           B       1.0       2.0
2  C           C       1.0       2.0
3  D           D       NaN       2.0

Source code in msreport\aggregate\pivot.py

def pivot_table(
    long_table: pd.DataFrame,
    index: str,
    group_by: str,
    annotation_columns: Iterable[str],
    pivoting_columns: Iterable[str],
) -> pd.DataFrame:
    """Generates a pivoted table in wide format.

    Args:
        long_table: Dataframe in long format that is used to generate a table in wide
            format.
        index: One or multiple column names that are used to group the table for
            pivoting.
        group_by: Column that is used to split the table on its unique entries.
        annotation_columns: Each column generates a new column in the pivoted table.
            Entries from each annotation column are aggregated for each group created by
            the column(s) specified by 'index' and unique values are joined together
            with ";" as separator.
        pivoting_columns: Columns that are combined with unique entries from 'group_by'
            to generate new columns in the pivoted table.

    Returns:
        A reshaped, pivot table with length equal to unique values from the 'index'
        column.

    Example:
        >>> table = pd.DataFrame(
        ...     {
        ...         "ID": ["A", "B", "C", "B", "C", "D"],
        ...         "Sample": ["S1", "S1", "S1", "S2", "S2", "S2"],
        ...         "Annotation": ["A", "B", "C", "B", "C", "D"],
        ...         "Quant": [1.0, 1.0, 1.0, 2.0, 2.0, 2.0],
        ...     }
        ... )
        >>> pivot_table(table, "ID", "Sample", ["Annotation"], ["Quant"])
          ID  Annotation  Quant S1  Quant S2
        0  A           A       1.0       NaN
        1  B           B       1.0       2.0
        2  C           C       1.0       2.0
        3  D           D       NaN       2.0
    """
    sub_tables = []
    for column in annotation_columns:
        sub_tables.append(join_unique(long_table, index, column))
    for column in pivoting_columns:
        sub_tables.append(pivot_column(long_table, index, group_by, column))

    wide_table = msreport.helper.join_tables(sub_tables, reset_index=True)
    return wide_table
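
The long-to-wide logic can be approximated with pandas alone. The sketch below is an assumption-laden stand-in, not the library's code: `msreport.helper.join_tables` is not shown on this page, so a plain index join is used in its place, and the annotation column is condensed with an inline lambda instead of `join_unique`. The table is the one from the docstring example.

```python
import pandas as pd

table = pd.DataFrame(
    {
        "ID": ["A", "B", "C", "B", "C", "D"],
        "Sample": ["S1", "S1", "S1", "S2", "S2", "S2"],
        "Annotation": ["A", "B", "C", "B", "C", "D"],
        "Quant": [1.0, 1.0, 1.0, 2.0, 2.0, 2.0],
    }
)

# Annotation columns: join the unique values per ID with ";".
annotation = table.groupby("ID")["Annotation"].agg(
    lambda x: ";".join(sorted(set(x)))
)

# Pivoting columns: one new column per unique entry in "Sample".
quant = table.pivot(index="ID", columns="Sample", values="Quant")
quant.columns = [f"Quant {sample}" for sample in quant.columns]

# Stand-in for join_tables(..., reset_index=True): join on the shared index.
wide = annotation.to_frame().join(quant).reset_index()
print(wide)
```

IDs that are missing in a sample, such as "A" in "S2", end up as NaN in the corresponding quantitative column.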

pivot_column

pivot_column(
    table: DataFrame,
    index: str | Iterable[str],
    group_by: str,
    values: str,
) -> DataFrame

Returns a reshaped dataframe, generated by pivoting the table on one column.

Uses unique values from the specified 'index' to form the index axis of the new dataframe. Unique values from the 'group_by' column are used to split the data and generate new columns that are filled with values from the 'values' column. The column names are composed of the 'values' column and the unique entries from 'group_by'.

Parameters:

Name	Type	Description	Default
`table`	`DataFrame`	Dataframe that is used to generate the pivoted table.	required
`index`	`str \| Iterable[str]`	One or multiple column names that are used as the new index.	required
`group_by`	`str`	Column that is used to split the table, each unique entry from this column generates a new column in the pivoted table.	required
`values`	`str`	Column whose values are used to populate the pivoted table.	required

Returns:

Type	Description
`DataFrame`	The pivoted dataframe.

Example

>>> table = pd.DataFrame(
...     {
...         "ID": ["A", "A", "B", "B"],
...         "Sample": ["S1", "S2", "S1", "S2"],
...         "Entries": [1.0, 2.0, 1.0, 2.0],
...     }
... )
>>> pivot_column(table, "ID", "Sample", "Entries")
    Entries S1  Entries S2
ID
A          1.0         2.0
B          1.0         2.0

Source code in msreport\aggregate\pivot.py

def pivot_column(
    table: pd.DataFrame, index: str | Iterable[str], group_by: str, values: str
) -> pd.DataFrame:
    """Returns a reshaped dataframe, generated by pivoting the table on one column.

    Uses unique values from the specified 'index' to form the index axis of the new
    dataframe. Unique values from the 'group_by' column are used to split the data and
    generate new columns that are filled with values from the 'values' column. The
    column names are composed of the 'values' column and the unique entries from
    'group_by'.

    Args:
        table: Dataframe that is used to generate the pivoted table.
        index: One or multiple column names that are used as the new index.
        group_by: Column that is used to split the table, each unique entry from this
            column generates a new column in the pivoted table.
        values: Column whose values are used to populate the pivoted table.

    Returns:
        The pivoted dataframe.

    Example:
        >>> table = pd.DataFrame(
        ...     {
        ...         "ID": ["A", "A", "B", "B"],
        ...         "Sample": ["S1", "S2", "S1", "S2"],
        ...         "Entries": [1.0, 2.0, 1.0, 2.0],
        ...     }
        ... )
        >>> pivot_column(table, "ID", "Sample", "Entries")
            Entries S1  Entries S2
        ID
        A          1.0         2.0
        B          1.0         2.0
    """
    pivot = table.pivot(index=index, columns=group_by, values=values)
    pivot.columns = [f"{values} {sample_column}" for sample_column in pivot.columns]
    return pivot
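
Because the function relies on `DataFrame.pivot`, each combination of 'index' and 'group_by' must occur at most once; pandas raises a `ValueError` otherwise. The sketch below demonstrates this with a local copy of the function (named `_pivot_column` here) and a made-up table. Summing the duplicates beforehand is only one possible way to resolve them and is not something `pivot_column` does itself.

```python
import pandas as pd


def _pivot_column(table, index, group_by, values):
    # Local copy of the function shown above.
    pivot = table.pivot(index=index, columns=group_by, values=values)
    pivot.columns = [f"{values} {name}" for name in pivot.columns]
    return pivot


# "A" occurs twice in sample "S1", which DataFrame.pivot cannot reshape.
table = pd.DataFrame(
    {
        "ID": ["A", "A", "B", "B"],
        "Sample": ["S1", "S1", "S1", "S2"],
        "Entries": [1.0, 2.0, 1.0, 2.0],
    }
)

try:
    _pivot_column(table, "ID", "Sample", "Entries")
    duplicate_error = None
except ValueError as error:
    duplicate_error = error

# One possible fix: aggregate duplicates before pivoting.
deduplicated = table.groupby(["ID", "Sample"], as_index=False)["Entries"].sum()
wide = _pivot_column(deduplicated, "ID", "Sample", "Entries")
```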

join_unique

join_unique(
    table: DataFrame,
    index: str | Iterable[str],
    values: str,
) -> DataFrame

Returns a new dataframe with unique values from a column and grouped by 'index'.

Parameters:

Name	Type	Description	Default
`table`	`DataFrame`	Input dataframe from which to generate the new dataframe.	required
`index`	`str \| Iterable[str]`	One or multiple column names to group the table by.	required
`values`	`str`	Column which is used to extract unique values.	required

Returns:

Type	Description
`DataFrame`	A dataframe with a single column named after the 'values' column, where the unique values of that column are joined together with ";" for each group created by the column(s) specified by 'index'.

Example

>>> table = pd.DataFrame(
...     {
...         "ID": ["A", "A", "B", "B"],
...         "Annotation": ["A1", "A1", "B1", "B1"],
...     }
... )
>>> join_unique(table, "ID", "Annotation")
    Annotation
ID
A           A1
B           B1

Source code in msreport\aggregate\pivot.py

def join_unique(
    table: pd.DataFrame, index: str | Iterable[str], values: str
) -> pd.DataFrame:
    """Returns a new dataframe with unique values from a column and grouped by 'index'.

    Args:
        table: Input dataframe from which to generate the new dataframe.
        index: One or multiple column names to group the table by.
        values: Column which is used to extract unique values.

    Returns:
        A dataframe with a single column named after the 'values' column, where the
            unique values of that column are joined together with ";" for each group
            created by the column(s) specified by 'index'.

    Example:
        >>> table = pd.DataFrame(
        ...     {
        ...         "ID": ["A", "A", "B", "B"],
        ...         "Annotation": ["A1", "A1", "B1", "B1"],
        ...     }
        ... )
        >>> join_unique(table, "ID", "Annotation")
            Annotation
        ID
        A           A1
        B           B1
    """
    series = table.groupby(index)[values].agg(
        lambda x: CONDENSE.join_unique_str(x.to_numpy())
    )
    new_df = pd.DataFrame(series)
    new_df.columns = [values]
    return new_df

summarize

High-level functions for aggregating quantitative proteomics data.

This module offers functions to summarize data from a lower level of abstraction (e.g. ions, peptides) to a higher level (e.g., peptides, proteins, PTMs). It operates directly on pandas DataFrames, allowing users to specify a grouping column and the columns to be summarized. These functions often leverage low-level condenser operations defined in msreport.aggregate.condense. It includes specific functions for MaxLFQ summation, as well as general counting, joining, and summing of columns.

Functions:

Name	Description
`count_unique`	Aggregates column(s) by counting unique values for each unique group.
`join_unique`	Aggregates column(s) by concatenating unique values for each unique group.
`sum_columns`	Aggregates column(s) by summing up values for each unique group.
`sum_columns_maxlfq`	Aggregates column(s) by applying the MaxLFQ summation approach to each unique group.
`aggregate_unique_groups`	Aggregates column(s) by applying a condenser function to unique groups.

count_unique

count_unique(
    table: DataFrame,
    group_by: str,
    input_column: str | Iterable[str],
    output_column: str = "Unique counts",
    is_sorted: bool = False,
) -> DataFrame

Aggregates column(s) by counting unique values for each unique group.

Note that empty strings and np.nan do not contribute to the unique value count.

Parameters:

Name	Type	Description	Default
`table`	`DataFrame`	The input DataFrame used for aggregating on unique groups.	required
`group_by`	`str`	The name of the column used to determine unique groups for aggregation.	required
`input_column`	`str \| Iterable[str]`	A column or a list of columns, whose unique values will be counted for each unique group during aggregation.	required
`output_column`	`str`	The name of the column containing the aggregation results. By default "Unique counts" is used as the name of the output column.	`'Unique counts'`
`is_sorted`	`bool`	Indicates whether the input dataframe is already sorted with respect to the 'group_by' column.	`False`

Returns:

Type	Description
`DataFrame`	A dataframe with unique 'group_by' values as index and a unique counts column containing the number of unique values per group.

Example

>>> table = pd.DataFrame(
...     {
...         "ID": ["A", "A", "B", "C", "C", "C"],
...         "Peptide sequence": ["a", "a", "b", "c1", "c2", "c2"],
...     }
... )
>>> count_unique(table, group_by="ID", input_column="Peptide sequence")
   Unique counts
A              1
B              1
C              2

Source code in msreport\aggregate\summarize.py

def count_unique(
    table: pd.DataFrame,
    group_by: str,
    input_column: str | Iterable[str],
    output_column: str = "Unique counts",
    is_sorted: bool = False,
) -> pd.DataFrame:
    """Aggregates column(s) by counting unique values for each unique group.

    Note that empty strings and np.nan do not contribute to the unique value count.

    Args:
        table: The input DataFrame used for aggregating on unique groups.
        group_by: The name of the column used to determine unique groups for
            aggregation.
        input_column: A column or a list of columns, whose unique values will be counted
            for each unique group during aggregation.
        output_column: The name of the column containing the aggregation results. By
            default "Unique counts" is used as the name of the output column.
        is_sorted: Indicates whether the input dataframe is already sorted with respect
            to the 'group_by' column.

    Returns:
        A dataframe with unique 'group_by' values as index and a unique counts column
        containing the number of unique values per group.

    Example:
        >>> table = pd.DataFrame(
        ...     {
        ...         "ID": ["A", "A", "B", "C", "C", "C"],
        ...         "Peptide sequence": ["a", "a", "b", "c1", "c2", "c2"],
        ...     }
        ... )
        >>> count_unique(table, group_by="ID", input_column="Peptide sequence")
           Unique counts
        A              1
        B              1
        C              2
    """
    aggregation, groups = aggregate_unique_groups(
        table, group_by, input_column, CONDENSE.count_unique, is_sorted
    )
    return pd.DataFrame(columns=[output_column], data=aggregation, index=groups)
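
Since `aggregate_unique_groups` is not listed on this page, a pandas-only approximation may help to check expectations. The sketch below is an assumption, not the library's implementation: it masks empty strings as missing values and relies on `nunique`, which ignores NaN by default. The table is made up and includes an empty string to show that it is not counted.

```python
import pandas as pd

table = pd.DataFrame(
    {
        "ID": ["A", "A", "B", "C", "C", "C"],
        "Peptide sequence": ["a", "", "b", "c1", "c2", "c2"],
    }
)

sequences = table["Peptide sequence"]
# Turn empty strings into NaN so that they are ignored like missing values.
cleaned = sequences.where(sequences != "")
counts = cleaned.groupby(table["ID"]).nunique().to_frame("Unique counts")
print(counts)
```

Group "A" yields a count of 1 because the empty string does not contribute, and group "C" yields 2 because "c2" is counted once.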

join_unique

join_unique(
    table: DataFrame,
    group_by: str,
    input_column: str | Iterable[str],
    output_column: str = "Unique values",
    sep: str = ";",
    is_sorted: bool = False,
) -> DataFrame

Aggregates column(s) by concatenating unique values for each unique group.

Note that empty strings and np.nan are not included in the joined values.

Parameters:

Name	Type	Description	Default
`table`	`DataFrame`	The input DataFrame used for aggregating on unique groups.	required
`group_by`	`str`	The name of the column used to determine unique groups for aggregation.	required
`input_column`	`str \| Iterable[str]`	A column or a list of columns, whose unique values will be joined into a single string for each unique group.	required
`output_column`	`str`	The name of the column containing the aggregation results. By default "Unique values" is used as the name of the output column.	`'Unique values'`
`sep`	`str`	The separator string used to join multiple unique values together. Default is ";".	`';'`
`is_sorted`	`bool`	Indicates whether the input dataframe is already sorted with respect to the 'group_by' column.	`False`

Returns:

Type	Description
`DataFrame`	A dataframe with unique 'group_by' values as index and a unique values column containing the joined unique values per group. Unique values are sorted and joined with the specified separator.

Example

>>> table = pd.DataFrame(
...     {
...         "ID": ["A", "A", "B", "C", "C", "C"],
...         "Peptide sequence": ["a", "", "b", "c1", "c2", "c2"],
...     }
... )
>>> join_unique(table, group_by="ID", input_column="Peptide sequence")
  Unique values
A             a
B             b
C         c1;c2

Source code in msreport\aggregate\summarize.py

def join_unique(
    table: pd.DataFrame,
    group_by: str,
    input_column: str | Iterable[str],
    output_column: str = "Unique values",
    sep: str = ";",
    is_sorted: bool = False,
) -> pd.DataFrame:
    """Aggregates column(s) by concatenating unique values for each unique group.

    Note that empty strings and np.nan are not included in the joined values.

    Args:
        table: The input DataFrame used for aggregating on unique groups.
        group_by: The name of the column used to determine unique groups for
            aggregation.
        input_column: A column or a list of columns, whose unique values will be joined
            into a single string for each unique group.
        output_column: The name of the column containing the aggregation results. By
            default "Unique values" is used as the name of the output column.
        sep: The separator string used to join multiple unique values together. Default
            is ";".
        is_sorted: Indicates whether the input dataframe is already sorted with respect
            to the 'group_by' column.

    Returns:
        A dataframe with unique 'group_by' values as index and a unique values column
        containing the joined unique values per group. Unique values are sorted and
        joined with the specified separator.

    Example:
        >>> table = pd.DataFrame(
        ...     {
        ...         "ID": ["A", "A", "B", "C", "C", "C"],
        ...         "Peptide sequence": ["a", "", "b", "c1", "c2", "c2"],
        ...     }
        ... )
        >>> join_unique(table, group_by="ID", input_column="Peptide sequence")
          Unique values
        A             a
        B             b
        C         c1;c2
    """
    aggregation, groups = aggregate_unique_groups(
        table,
        group_by,
        input_column,
        lambda x: CONDENSE.join_unique_str(x, sep=sep),
        is_sorted,
    )
    return pd.DataFrame(columns=[output_column], data=aggregation, index=groups)

sum_columns

sum_columns(
    table: DataFrame,
    group_by: str,
    samples: Iterable[str],
    input_tag: str,
    output_tag: Optional[str] = None,
    is_sorted: bool = False,
) -> DataFrame

Aggregates column(s) by summing up values for each unique group.

Parameters:

Name	Type	Description	Default
`table`	`DataFrame`	The input DataFrame used for aggregating on unique groups.	required
`group_by`	`str`	The name of the column used to determine unique groups for aggregation.	required
`samples`	`Iterable[str]`	List of sample names that appear in columns of the table as substrings.	required
`input_tag`	`str`	Substring of column names, which is used together with the sample names to determine the columns whose values will be summarized for each unique group.	required
`output_tag`	`Optional[str]`	Optional, allows changing the output column names by replacing the 'input_tag' with the 'output_tag'. If not specified the names of the columns that were used for aggregation will be used in the returned dataframe.	`None`
`is_sorted`	`bool`	Indicates whether the input dataframe is already sorted with respect to the 'group_by' column.	`False`

Returns:

Type	Description
`DataFrame`	A dataframe with unique 'group_by' values as index and one column per sample. The columns contain the summed group values per sample.

Example

>>> table = pd.DataFrame(
...     {
...         "ID": ["A", "A", "B", "C", "C", "C"],
...         "Col S1": [1, 1, 1, 1, 1, 1],
...         "Col S2": [2, 2, 2, 2, 2, 2],
...     }
... )
>>> sum_columns(table, "ID", samples=["S1", "S2"], input_tag="Col")
   Col S1  Col S2
A       2       4
B       1       2
C       3       6

Source code in msreport\aggregate\summarize.py

def sum_columns(
    table: pd.DataFrame,
    group_by: str,
    samples: Iterable[str],
    input_tag: str,
    output_tag: Optional[str] = None,
    is_sorted: bool = False,
) -> pd.DataFrame:
    """Aggregates column(s) by summing up values for each unique group.

    Args:
        table: The input DataFrame used for aggregating on unique groups.
        group_by: The name of the column used to determine unique groups for
            aggregation.
        samples: List of sample names that appear in columns of the table as substrings.
        input_tag: Substring of column names, which is used together with the sample
            names to determine the columns whose values will be summarized for each
            unique group.
        output_tag: Optional, allows changing the output column names by replacing the
            'input_tag' with the 'output_tag'. If not specified the names of the columns
            that were used for aggregation will be used in the returned dataframe.
        is_sorted: Indicates whether the input dataframe is already sorted with respect
            to the 'group_by' column.

    Returns:
        A dataframe with unique 'group_by' values as index and one column per sample.
        The columns contain the summed group values per sample.

    Example:
        >>> table = pd.DataFrame(
        ...     {
        ...         "ID": ["A", "A", "B", "C", "C", "C"],
        ...         "Col S1": [1, 1, 1, 1, 1, 1],
        ...         "Col S2": [2, 2, 2, 2, 2, 2],
        ...     }
        ... )
        >>> sum_columns(table, "ID", samples=["S1", "S2"], input_tag="Col")
           Col S1  Col S2
        A       2       4
        B       1       2
        C       3       6
    """
    output_tag = input_tag if output_tag is None else output_tag
    columns = find_sample_columns(table, input_tag, samples)
    aggregation, groups = aggregate_unique_groups(
        table, group_by, columns, CONDENSE.sum_per_column, is_sorted
    )
    output_columns = [column.replace(input_tag, output_tag) for column in columns]
    return pd.DataFrame(columns=output_columns, data=aggregation, index=groups)
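
The column selection and renaming steps above can be approximated with plain pandas. The following is a minimal sketch, not the msreport implementation: `sum_columns_sketch` is a hypothetical name, and the column-matching rule (a column is selected when it contains both the input tag and a sample name) is an assumption, since `find_sample_columns` is not shown in this section.

```python
import pandas as pd


def sum_columns_sketch(table, group_by, samples, input_tag, output_tag=None):
    """Hypothetical pandas-only stand-in for sum_columns."""
    output_tag = input_tag if output_tag is None else output_tag
    # Assumption: a column qualifies if it contains the tag and any sample name.
    columns = [
        column
        for column in table.columns
        if input_tag in column and any(sample in column for sample in samples)
    ]
    # Sum the selected columns within each unique group.
    summed = table.groupby(group_by, sort=True)[columns].sum()
    # Rename the output columns by swapping the input tag for the output tag.
    summed.columns = [column.replace(input_tag, output_tag) for column in columns]
    summed.index.name = None
    return summed


table = pd.DataFrame(
    {
        "ID": ["A", "A", "B", "C", "C", "C"],
        "Col S1": [1, 1, 1, 1, 1, 1],
        "Col S2": [2, 2, 2, 2, 2, 2],
    }
)
result = sum_columns_sketch(
    table, "ID", samples=["S1", "S2"], input_tag="Col", output_tag="Sum"
)
print(result)
```

With `output_tag="Sum"` the result columns are named "Sum S1" and "Sum S2"; the summed values are the same as in the docstring example.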

sum_columns_maxlfq

sum_columns_maxlfq(
    table: DataFrame,
    group_by: str,
    samples: Iterable[str],
    input_tag: str,
    output_tag: Optional[str] = None,
    is_sorted: bool = False,
) -> DataFrame

Aggregates column(s) by applying the MaxLFQ summation approach to unique groups.

This function estimates abundance profiles from sample columns using pairwise median ratios and least-squares regression. It then selects abundance profiles with finite values and the corresponding input columns, and scales the abundance profiles so that their total sum is equal to the total sum of the corresponding input columns.
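
To illustrate the idea for a single group, the sketch below computes pairwise median log-ratios between samples, solves for a log-abundance profile by least squares, and rescales the profile to the total input intensity. It is a simplified illustration under stated assumptions, not the code behind `CONDENSE.sum_by_median_ratio_regression`: the name `maxlfq_like_sum` is hypothetical, all intensities are assumed to be positive or NaN, there are at least two samples, and every pair of samples shares at least one row with values in both.

```python
import itertools

import numpy as np


def maxlfq_like_sum(intensities: np.ndarray) -> np.ndarray:
    """Hypothetical sketch: rows are ions or peptides, columns are samples."""
    log_values = np.log(intensities)
    n_samples = log_values.shape[1]
    pairs = list(itertools.combinations(range(n_samples), 2))

    # One equation per sample pair: profile[i] - profile[j] = median log-ratio.
    design = np.zeros((len(pairs), n_samples))
    ratios = np.zeros(len(pairs))
    for row, (i, j) in enumerate(pairs):
        design[row, i] = 1.0
        design[row, j] = -1.0
        # Rows with a missing value in either sample are ignored by nanmedian.
        ratios[row] = np.nanmedian(log_values[:, i] - log_values[:, j])

    # The system is only defined up to an additive constant; lstsq returns
    # the minimum-norm solution, which is sufficient because of the rescaling.
    log_profile, *_ = np.linalg.lstsq(design, ratios, rcond=None)
    profile = np.exp(log_profile)

    # Scale the profile so that its sum equals the total input intensity.
    return profile * np.nansum(intensities) / profile.sum()


group_c = np.array([[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]])
result = maxlfq_like_sum(group_c)
print(result)  # approximately [3. 6.]
```

For group "C" of the example table this yields approximately 3.0 and 6.0, in line with the documented output of `sum_columns_maxlfq`.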

Parameters:

Name	Type	Description	Default
`table`	`DataFrame`	The input DataFrame used for aggregating on unique groups.	required
`group_by`	`str`	The name of the column used to determine unique groups for aggregation.	required
`samples`	`Iterable[str]`	List of sample names that appear in columns of the table as substrings.	required
`input_tag`	`str`	Substring of column names, which is used together with the sample names to determine the columns whose values will be summarized for each unique group.	required
`output_tag`	`Optional[str]`	Optional, allows changing the output column names by replacing the 'input_tag' with the 'output_tag'. If not specified, the names of the columns that were used for aggregation will be used in the returned dataframe.	`None`
`is_sorted`	`bool`	Indicates whether the input dataframe is already sorted with respect to the 'group_by' column.	`False`

Returns:

Type	Description
`DataFrame`	A dataframe with unique 'group_by' values as index and one column per sample. The columns contain the summed group values per sample.

Example

>>> table = pd.DataFrame(
...     {
...         "ID": ["A", "A", "B", "C", "C", "C"],
...         "Col S1": [1, 1, 1, 1, 1, 1],
...         "Col S2": [2, 2, 2, 2, 2, 2],
...     }
... )
>>> sum_columns_maxlfq(table, "ID", samples=["S1", "S2"], input_tag="Col")
   Col S1  Col S2
A     2.0     4.0
B     1.0     2.0
C     3.0     6.0

Source code in msreport\aggregate\summarize.py

def sum_columns_maxlfq(
    table: pd.DataFrame,
    group_by: str,
    samples: Iterable[str],
    input_tag: str,
    output_tag: Optional[str] = None,
    is_sorted: bool = False,
) -> pd.DataFrame:
    """Aggregates column(s) by applying the MaxLFQ summation approach to unique groups.

    This function estimates abundance profiles from sample columns using pairwise median
    ratios and least-squares regression. It then selects abundance profiles with finite
    values and the corresponding input columns, and scales the abundance profiles so
    that their total sum is equal to the total sum of the corresponding input columns.

    Args:
        table: The input DataFrame used for aggregating on unique groups.
        group_by: The name of the column used to determine unique groups for
            aggregation.
        samples: List of sample names that appear in columns of the table as substrings.
        input_tag: Substring of column names, which is used together with the sample
            names to determine the columns whose values will be summarized for each
            unique group.
        output_tag: Optional, allows changing the output column names by replacing the
            'input_tag' with the 'output_tag'. If not specified the names of the columns
            that were used for aggregation will be used in the returned dataframe.
        is_sorted: Indicates whether the input dataframe is already sorted with respect
            to the 'group_by' column.

    Returns:
        A dataframe with unique 'group_by' values as index and one column per sample.
        The columns contain the summed group values per sample.

    Example:
        >>> table = pd.DataFrame(
        ...     {
        ...         "ID": ["A", "A", "B", "C", "C", "C"],
        ...         "Col S1": [1, 1, 1, 1, 1, 1],
        ...         "Col S2": [2, 2, 2, 2, 2, 2],
        ...     }
        ... )
        >>> sum_columns_maxlfq(table, "ID", samples=["S1", "S2"], input_tag="Col")
           Col S1  Col S2
        A     2.0     4.0
        B     1.0     2.0
        C     3.0     6.0
    """
    output_tag = input_tag if output_tag is None else output_tag
    columns = find_sample_columns(table, input_tag, samples)
    aggregation, groups = aggregate_unique_groups(
        table, group_by, columns, CONDENSE.sum_by_median_ratio_regression, is_sorted
    )
    output_columns = [column.replace(input_tag, output_tag) for column in columns]
    return pd.DataFrame(columns=output_columns, data=aggregation, index=groups)

aggregate_unique_groups

aggregate_unique_groups(
    table: DataFrame,
    group_by: str,
    columns_to_aggregate: str | Iterable[str],
    condenser: Callable,
    is_sorted: bool,
) -> tuple[ndarray, ndarray]

Aggregates column(s) by applying a condenser function to unique groups.

The function returns two arrays containing the aggregated values and the corresponding group names. This function can be used for example to summarize data from an ion table to a peptide, protein or modification table. Suitable condenser functions can be found in the module msreport.aggregate.condense.

Parameters:

Name	Type	Description	Default
`table`	`DataFrame`	The input dataframe used for aggregating on unique groups.	required
`group_by`	`str`	The name of the column used to determine unique groups for aggregation.	required
`columns_to_aggregate`	`str \| Iterable[str]`	A column or a list of columns, which will be passed to the condenser function for applying an aggregation to each unique group.	required
`condenser`	`Callable`	Function that is applied to each group for generating the aggregation result. If multiple columns are specified for aggregation, the input array for the condenser function will be two-dimensional, with the first dimension corresponding to rows and the second to columns. E.g. an array with 3 rows and 2 columns: np.array([[1, 'a'], [2, 'b'], [3, 'c']])	required
`is_sorted`	`bool`	Indicates whether the input dataframe is already sorted with respect to the 'group_by' column.	required

Returns:

Type	Description
`ndarray`	The first array contains the aggregation results of each unique group.
`ndarray`	The second array contains the corresponding group names.

Source code in msreport\aggregate\summarize.py

def aggregate_unique_groups(
    table: pd.DataFrame,
    group_by: str,
    columns_to_aggregate: str | Iterable[str],
    condenser: Callable,
    is_sorted: bool,
) -> tuple[np.ndarray, np.ndarray]:
    """Aggregates column(s) by applying a condenser function to unique groups.

    The function returns two arrays containing the aggregated values and the
    corresponding group names. This function can be used for example to summarize data
    from an ion table to a peptide, protein or modification table. Suitable condenser
    functions can be found in the module msreport.aggregate.condense.

    Args:
        table: The input dataframe used for aggregating on unique groups.
        group_by: The name of the column used to determine unique groups for
            aggregation.
        columns_to_aggregate: A column or a list of columns, which will be passed to the
            condenser function for applying an aggregation to each unique group.
        condenser: Function that is applied to each group for generating the
            aggregation result. If multiple columns are specified for aggregation,
            the input array for the condenser function will be two-dimensional, with the
            first dimension corresponding to rows and the second to columns. E.g. an
            array with 3 rows and 2 columns: np.array([[1, 'a'], [2, 'b'], [3, 'c']])
        is_sorted: Indicates whether the input dataframe is already sorted with respect
            to the 'group_by' column.

    Returns:
        Two numpy arrays, the first array contains the aggregation results of each
        unique group and the second array contains the corresponding group names.
    """
    group_start_indices, group_names, table = _prepare_grouping_indices(
        table, group_by, is_sorted
    )
    array = table[columns_to_aggregate].to_numpy()
    aggregation_result = np.array(
        [condenser(i) for i in np.split(array, group_start_indices[1:])]
    )
    return aggregation_result, group_names
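
The split-and-condense pattern used above can be tried in isolation. The sketch below is a simplified illustration, not the library code: the private helper `_prepare_grouping_indices` is not shown in this section, so a stable sort combined with `np.unique(..., return_index=True)` stands in for it as an assumption, and both `aggregate_groups_sketch` and `max_per_column` are hypothetical names.

```python
import numpy as np
import pandas as pd


def aggregate_groups_sketch(table, group_by, columns, condenser):
    """Hypothetical stand-in for aggregate_unique_groups."""
    # Assumption: sorting by the group column makes each group contiguous.
    sorted_table = table.sort_values(group_by, kind="stable")
    # In a sorted array, the first occurrence of each name is the group start.
    group_names, start_indices = np.unique(
        sorted_table[group_by].to_numpy(), return_index=True
    )
    array = sorted_table[columns].to_numpy()
    # Splitting at every start index except the first gives one chunk per group.
    chunks = np.split(array, start_indices[1:])
    return np.array([condenser(chunk) for chunk in chunks]), group_names


def max_per_column(array):
    """Hypothetical condenser: highest value of each column in a group."""
    return array.max(axis=0)


table = pd.DataFrame(
    {"ID": ["B", "A", "A", "C"], "S1": [5, 1, 3, 2], "S2": [6, 4, 2, 8]}
)
values, names = aggregate_groups_sketch(table, "ID", ["S1", "S2"], max_per_column)
print(names)
print(values)
```

Because a list of two columns is passed, each condenser call receives a two-dimensional array with one row per group member and one column per selected table column.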