Aggregate

A comprehensive set of tools for aggregating and reshaping tabular proteomics data.

The aggregation module contains submodules that offer functionalities to transform data from lower levels of abstraction (e.g. ions, peptides) to higher levels (e.g. peptides, proteins, PTMs) through various summarization and condensation techniques. It also includes methods for reshaping tables from "long" to "wide" format, a common prerequisite for aggregation. The MaxLFQ algorithm is integrated for specific quantitative summarizations, enabling users to build customized, higher-level data tables.

Modules:

Name Description
condense

Low-level functions for aggregating numerical and string data.

pivot

Functionalities for reshaping tabular quantitative proteomics data.

summarize

High-level functions for aggregating quantitative proteomics data.

condense

Low-level functions for aggregating numerical and string data.

This module defines fundamental "condenser" functions that operate directly on NumPy arrays. These functions are designed to be applied to groups of data, performing operations such as summing values, finding maximum/minimum, counting or joining unique elements, and calculating abundance profiles. It includes the core implementations for MaxLFQ summation.

Functions:

Name Description
join_str

Returns a joined string of sorted values from the array.

join_str_per_column

Returns for each column a joined string of sorted values.

join_unique_str

Returns a joined string of unique sorted values from the array.

join_unique_str_per_column

Returns for each column a joined string of unique sorted values.

sum

Returns sum of values from one or multiple columns.

sum_per_column

Returns for each column the sum of values.

maximum

Returns the highest finite value from one or multiple columns.

maximum_per_column

Returns for each column the highest finite value.

minimum

Returns the lowest finite value from one or multiple columns.

minimum_per_column

Returns for each column the lowest finite value.

count_unique

Returns the number of unique values from one or multiple columns.

count_unique_per_column

Returns for each column the number of unique values.

profile_by_median_ratio_regression

Calculates abundance profiles by lstsq regression of pair-wise median ratios.

sum_by_median_ratio_regression

Calculates summed abundance by lstsq regression of pair-wise median ratios.

join_str

join_str(array: ndarray, sep: str = ';') -> str

Returns a joined string of sorted values from the array.

Note that empty strings or np.nan are not included in the joined string.

Source code in msreport\aggregate\condense.py
def join_str(array: np.ndarray, sep: str = ";") -> str:
    """Returns a joined string of sorted values from the array.

    Note that empty strings or np.nan are not included in the joined string.
    """
    elements = []
    for value in array.flatten():
        if value != "" and not (isinstance(value, float) and np.isnan(value)):
            elements.append(str(value))
    return sep.join(sorted(elements))

join_str_per_column

join_str_per_column(
    array: ndarray, sep: str = ";"
) -> ndarray

Returns for each column a joined string of sorted values.

Note that empty strings or np.nan are not included in the joined string.

Source code in msreport\aggregate\condense.py
def join_str_per_column(array: np.ndarray, sep: str = ";") -> np.ndarray:
    """Returns for each column a joined string of sorted values.

    Note that empty strings or np.nan are not included in the joined string.
    """
    return np.array([join_str(i, sep=sep) for i in array.transpose()])

join_unique_str

join_unique_str(array: ndarray, sep: str = ';') -> str

Returns a joined string of unique sorted values from the array.

Note that empty strings or np.nan are not included in the joined string.

Source code in msreport\aggregate\condense.py
def join_unique_str(array: np.ndarray, sep: str = ";") -> str:
    """Returns a joined string of unique sorted values from the array."""
    elements = []
    for value in array.flatten():
        if value != "" and not (isinstance(value, float) and np.isnan(value)):
            elements.append(str(value))
    return sep.join(sorted(set(elements)))

join_unique_str_per_column

join_unique_str_per_column(
    array: ndarray, sep: str = ";"
) -> ndarray

Returns for each column a joined string of unique sorted values.

Source code in msreport\aggregate\condense.py
def join_unique_str_per_column(array: np.ndarray, sep: str = ";") -> np.ndarray:
    """Returns for each column a joined strings of unique sorted values."""
    return np.array([join_unique_str(i, sep=sep) for i in array.transpose()])

sum

sum(array: ndarray) -> float

Returns sum of values from one or multiple columns.

Note that if no finite values are present in the array np.nan is returned.

Source code in msreport\aggregate\condense.py
def sum(array: np.ndarray) -> float:
    """Returns sum of values from one or multiple columns.

    Note that if no finite values are present in the array np.nan is returned.
    """
    array = array.flatten()
    if np.isfinite(array).any():
        return np.nansum(array)
    else:
        return np.nan

sum_per_column

sum_per_column(array: ndarray) -> ndarray

Returns for each column the sum of values.

Note that if no finite values are present in a column np.nan is returned.

Source code in msreport\aggregate\condense.py
def sum_per_column(array: np.ndarray) -> np.ndarray:
    """Returns for each column the sum of values.

    Note that if no finite values are present in a column np.nan is returned.
    """
    return np.array([sum(i) for i in array.transpose()])

maximum

maximum(array: ndarray) -> float

Returns the highest finite value from one or multiple columns.

Source code in msreport\aggregate\condense.py
def maximum(array: np.ndarray) -> float:
    """Returns the highest finitevalue from one or multiple columns."""
    array = array.flatten()
    if np.isfinite(array).any():
        return np.nanmax(array)
    else:
        return np.nan

maximum_per_column

maximum_per_column(array: ndarray) -> ndarray

Returns for each column the highest finite value.

Source code in msreport\aggregate\condense.py
def maximum_per_column(array: np.ndarray) -> np.ndarray:
    """Returns for each column the highest finite value."""
    return np.array([maximum(i) for i in array.transpose()])

minimum

minimum(array: ndarray) -> float

Returns the lowest finite value from one or multiple columns.

Source code in msreport\aggregate\condense.py
def minimum(array: np.ndarray) -> float:
    """Returns the lowest finite value from one or multiple columns."""
    array = array.flatten()
    if np.isfinite(array).any():
        return np.nanmin(array)
    else:
        return np.nan

minimum_per_column

minimum_per_column(array: ndarray) -> ndarray

Returns for each column the lowest finite value.

Source code in msreport\aggregate\condense.py
def minimum_per_column(array: np.ndarray) -> np.ndarray:
    """Returns for each column the lowest finite value."""
    return np.array([minimum(i) for i in array.transpose()])

count_unique

count_unique(array: ndarray) -> int

Returns the number of unique values from one or multiple columns.

Note that empty strings or np.nan are not counted as unique values.

Source code in msreport\aggregate\condense.py
def count_unique(array: np.ndarray) -> int:
    """Returns the number of unique values from one or multiple columns.

    Note that empty strings or np.nan are not counted as unique values.
    """
    unique_elements = {
        x for x in array.flatten() if not (isinstance(x, float) and np.isnan(x))
    }
    unique_elements.discard("")

    return len(unique_elements)

count_unique_per_column

count_unique_per_column(array: ndarray) -> ndarray

Returns for each column the number of unique values.

Note that empty strings or np.nan are not counted as unique values.

Source code in msreport\aggregate\condense.py
def count_unique_per_column(array: np.ndarray) -> np.ndarray:
    """Returns for each column the number of unique values.

    Note that empty strings or np.nan are not counted as unique values.
    """
    if array.size > 0:
        return np.array([count_unique(i) for i in array.transpose()])
    else:
        return np.full(array.shape[0], 0)

profile_by_median_ratio_regression

profile_by_median_ratio_regression(
    array: ndarray,
) -> ndarray

Calculates abundance profiles by lstsq regression of pair-wise median ratios.

The function performs a least squares regression of pair-wise median ratios to calculate estimated abundance profiles.

Parameters:

Name Type Description Default
array ndarray

A two-dimensional array containing abundance values, with the first dimension corresponding to rows and the second dimension to columns. Abundance values must not be log transformed.

required

Returns:

Type Description
ndarray

An array containing estimated abundance profiles, with length equal to the number of columns in the input array.

Source code in msreport\aggregate\condense.py
def profile_by_median_ratio_regression(array: np.ndarray) -> np.ndarray:
    """Calculates abundance profiles by lstsq regression of pair-wise median ratios.

    The function performs a least squares regression of pair-wise median ratios to
    calculate estimated abundance profiles.

    Args:
        array: A two-dimensional array containing abundance values, with the first
            dimension corresponding to rows and the second dimension to columns.
            Abundance values must not be log transformed.

    Returns:
        An array containing estimated abundance profiles, with length equal to the
        number of columns in the input array.
    """
    ratio_matrix = MAXLFQ.calculate_pairwise_median_log_ratio_matrix(
        array, log_transformed=False
    )
    coef_matrix, ratio_array, initial_rows = MAXLFQ.prepare_coefficient_matrix(
        ratio_matrix
    )
    log_profile = MAXLFQ.log_profiles_by_lstsq(coef_matrix, ratio_array)
    profile = np.power(2, log_profile)
    return profile

sum_by_median_ratio_regression

sum_by_median_ratio_regression(array: ndarray) -> ndarray

Calculates summed abundance by lstsq regression of pair-wise median ratios.

The function performs a least squares regression of pair-wise median ratios to calculate estimated abundance profiles. These profiles are then scaled based on the input array such that the columns with finite profile values are used and the sum of the scaled profiles matches the sum of the input array.

Parameters:

Name Type Description Default
array ndarray

A two-dimensional array containing abundance values, with the first dimension corresponding to rows and the second dimension to columns. Abundance values must not be log transformed.

required

Returns:

Type Description
ndarray

An array containing summed abundance estimates, with length equal to the number of columns in the input array.

Source code in msreport\aggregate\condense.py
def sum_by_median_ratio_regression(array: np.ndarray) -> np.ndarray:
    """Calculates summed abundance by lstsq regression of pair-wise median ratios.

    The function performs a least squares regression of pair-wise median ratios to
    calculate estimated abundance profiles. These profiles are then scaled based on the
    input array such that the columns with finite profile values are used and the sum of
    the scaled profiles matches the sum of the input array.

    Args:
        array: A two-dimensional array containing abundance values, with the first
            dimension corresponding to rows and the second dimension to columns.
            Abundance values must not be log transformed.

    Returns:
        An array containing summed abundance estimates, with length equal to the number
        of columns in the input array.
    """
    profile = profile_by_median_ratio_regression(array)
    scaled_profile = profile
    if np.isfinite(profile).any():
        profile_mask = np.isfinite(profile)
        scaled_profile[profile_mask] = profile[profile_mask] * (
            np.nansum(array[:, profile_mask]) / np.nansum(profile[profile_mask])
        )

    return scaled_profile

pivot

Functionalities for reshaping tabular quantitative proteomics data.

This module offers methods to transform data from a "long" format into a "wide" format, which is a common and often necessary step before aggregation or analysis. It supports pivoting data based on specified index and grouping columns, and can handle both quantitative values and annotation columns.

Functions:

Name Description
pivot_table

Generates a pivoted table in wide format.

pivot_column

Returns a reshaped dataframe, generated by pivoting the table on one column.

join_unique

Returns a new dataframe with unique values from a column and grouped by 'index'.

pivot_table

pivot_table(
    long_table: DataFrame,
    index: str,
    group_by: str,
    annotation_columns: Iterable[str],
    pivoting_columns: Iterable[str],
) -> DataFrame

Generates a pivoted table in wide format.

Parameters:

Name Type Description Default
long_table DataFrame

Dataframe in long format that is used to generate a table in wide format.

required
index str

One or multiple column names that are used to group the table for pivoting.

required
group_by str

Column that is used to split the table on its unique entries.

required
annotation_columns Iterable[str]

Each column generates a new column in the pivoted table. Entries from each annotation column are aggregated for each group created by the column(s) specified by 'index' and unique values are joined together with ";" as separator.

required
pivoting_columns Iterable[str]

Columns that are combined with unique entries from 'group_by' to generate new columns in the pivoted table.

required

Returns:

Type Description
DataFrame

A reshaped, pivoted table with length equal to the number of unique values in the 'index' column.

Example

>>> table = pd.DataFrame(
...     {
...         "ID": ["A", "B", "C", "B", "C", "D"],
...         "Sample": ["S1", "S1", "S1", "S2", "S2", "S2"],
...         "Annotation": ["A", "B", "C", "B", "C", "D"],
...         "Quant": [1.0, 1.0, 1.0, 2.0, 2.0, 2.0],
...     }
... )
>>> pivot_table(table, "ID", "Sample", ["Annotation"], ["Quant"])
  ID  Annotation  Quant S1  Quant S2
0  A           A       1.0       NaN
1  B           B       1.0       2.0
2  C           C       1.0       2.0
3  D           D       NaN       2.0

Source code in msreport\aggregate\pivot.py
def pivot_table(
    long_table: pd.DataFrame,
    index: str,
    group_by: str,
    annotation_columns: Iterable[str],
    pivoting_columns: Iterable[str],
) -> pd.DataFrame:
    """Generates a pivoted table in wide format.

    Args:
        long_table: Dataframe in long format that is used to generate a table in wide
            format.
        index: One or multiple column names that are used to group the table for
            pivoting.
        group_by: Column that is used to split the table on its unique entries.
        annotation_columns: Each column generates a new column in the pivoted table.
            Entries from each annotation column are aggregated for each group created by
            the column(s) specified by 'index' and unique values are joined together
            with ";" as separator.
        pivoting_columns: Columns that are combined with unique entries from 'group_by'
            to generate new columns in the pivoted table.

    Returns:
        A reshaped, pivot table with length equal to unique values from the 'index'
        column.

    Example:
        >>> table = pd.DataFrame(
        ...     {
        ...         "ID": ["A", "B", "C", "B", "C", "D"],
        ...         "Sample": ["S1", "S1", "S1", "S2", "S2", "S2"],
        ...         "Annotation": ["A", "B", "C", "B", "C", "D"],
        ...         "Quant": [1.0, 1.0, 1.0, 2.0, 2.0, 2.0],
        ...     }
        ... )
        >>> pivot_table(table, "ID", "Sample", ["Annotation"], ["Quant"])
          ID  Annotation  Quant S1  Quant S2
        0  A           A       1.0       NaN
        1  B           B       1.0       2.0
        2  C           C       1.0       2.0
        3  D           D       NaN       2.0
    """
    sub_tables = []
    for column in annotation_columns:
        sub_tables.append(join_unique(long_table, index, column))
    for column in pivoting_columns:
        sub_tables.append(pivot_column(long_table, index, group_by, column))

    wide_table = msreport.helper.join_tables(sub_tables, reset_index=True)
    return wide_table

pivot_column

pivot_column(
    table: DataFrame,
    index: str | Iterable[str],
    group_by: str,
    values: str,
) -> DataFrame

Returns a reshaped dataframe, generated by pivoting the table on one column.

Uses unique values from the specified 'index' to form the index axis of the new dataframe. Unique values from the 'group_by' column are used to split the data and generate new columns that are filled with values from the 'values' column. The column names are composed of the 'values' column and the unique entries from 'group_by'.

Parameters:

Name Type Description Default
table DataFrame

Dataframe that is used to generate the pivoted table.

required
index str | Iterable[str]

One or multiple column names that are used as the new index.

required
group_by str

Column that is used to split the table, each unique entry from this column generates a new column in the pivoted table.

required
values str

Column whose values are used to populate the pivoted table.

required

Returns:

Type Description
DataFrame

The pivoted dataframe.

Example

>>> table = pd.DataFrame(
...     {
...         "ID": ["A", "A", "B", "B"],
...         "Sample": ["S1", "S2", "S1", "S2"],
...         "Entries": [1.0, 2.0, 1.0, 2.0],
...     }
... )
>>> pivot_column(table, "ID", "Sample", "Entries")
    Entries S1  Entries S2
ID
A          1.0         2.0
B          1.0         2.0

Source code in msreport\aggregate\pivot.py
def pivot_column(
    table: pd.DataFrame, index: str | Iterable[str], group_by: str, values: str
) -> pd.DataFrame:
    """Returns a reshaped dataframe, generated by pivoting the table on one column.

    Uses unique values from the specified 'index' to form the index axis of the new
    dataframe. Unique values from the 'group_by' column are used to split the data and
    generate new columns that are filled with values from the 'values' column. The
    column names are composed of the 'values' column and the unique entries from
    'group_by'.

    Args:
        table: Dataframe that is used to generate the pivoted table.
        index: One or multiple column names that are used as the new index.
        group_by: Column that is used to split the table, each unique entry from this
            column generates a new column in the pivoted table.
        values: Column whose values are used to populate the pivoted table.

    Returns:
        The pivoted dataframe.

    Example:
        >>> table = pd.DataFrame(
        ...     {
        ...         "ID": ["A", "A", "B", "B"],
        ...         "Sample": ["S1", "S2", "S1", "S2"],
        ...         "Entries": [1.0, 2.0, 1.0, 2.0],
        ...     }
        ... )
        >>> pivot_column(table, "ID", "Sample", "Entries")
            Entries S1  Entries S2
        ID
        A          1.0         2.0
        B          1.0         2.0
    """
    pivot = table.pivot(index=index, columns=group_by, values=values)
    pivot.columns = [f"{values} {sample_column}" for sample_column in pivot.columns]
    return pivot

join_unique

join_unique(
    table: DataFrame,
    index: str | Iterable[str],
    values: str,
) -> DataFrame

Returns a new dataframe with unique values from a column and grouped by 'index'.

Parameters:

Name Type Description Default
table DataFrame

Input dataframe from which to generate the new dataframe.

required
index str | Iterable[str]

One or multiple column names to group the table by.

required
values str

Column which is used to extract unique values.

required

Returns:

Type Description
DataFrame

A dataframe with a single column named 'values', where the unique values of the column specified by 'values' are joined together with ";" for each group created by the column(s) specified by 'index'.

Example

>>> table = pd.DataFrame(
...     {
...         "ID": ["A", "A", "B", "B"],
...         "Annotation": ["A1", "A1", "B1", "B1"],
...     }
... )
>>> join_unique(table, "ID", "Annotation")
    Annotation
ID
A           A1
B           B1

Source code in msreport\aggregate\pivot.py
def join_unique(
    table: pd.DataFrame, index: str | Iterable[str], values: str
) -> pd.DataFrame:
    """Returns a new dataframe with unique values from a column and grouped by 'index'.

    Args:
        table: Input dataframe from which to generate the new dataframe.
        index: One or multiple column names to group the table by.
        values: Column which is used to extract unique values.

    Returns:
        A dataframe with a single column named 'values', where the unique values of the
            column specified by 'values' are joined together with ";" for each group
            created by the column(s) specified by 'index'.

    Example:
        >>> table = pd.DataFrame(
        ...     {
        ...         "ID": ["A", "A", "B", "B"],
        ...         "Annotation": ["A1", "A1", "B1", "B1"],
        ...     }
        ... )
        >>> join_unique(table, "ID", "Annotation")
            Annotation
        ID
        A           A1
        B           B1
    """
    series = table.groupby(index)[values].agg(
        lambda x: CONDENSE.join_unique_str(x.to_numpy())
    )
    new_df = pd.DataFrame(series)
    new_df.columns = [values]
    return new_df

summarize

High-level functions for aggregating quantitative proteomics data.

This module offers functions to summarize data from a lower level of abstraction (e.g. ions, peptides) to a higher level (e.g., peptides, proteins, PTMs). It operates directly on pandas DataFrames, allowing users to specify a grouping column and the columns to be summarized. These functions often leverage low-level condenser operations defined in msreport.aggregate.condense. It includes specific functions for MaxLFQ summation, as well as general counting, joining, and summing of columns.

Functions:

Name Description
count_unique

Aggregates column(s) by counting unique values for each unique group.

join_unique

Aggregates column(s) by concatenating unique values for each unique group.

sum_columns

Aggregates column(s) by summing up values for each unique group.

sum_columns_maxlfq

Aggregates column(s) by applying the MaxLFQ summation approach to each unique group.

aggregate_unique_groups

Aggregates column(s) by applying a condenser function to unique groups.

count_unique

count_unique(
    table: DataFrame,
    group_by: str,
    input_column: str | Iterable[str],
    output_column: str = "Unique counts",
    is_sorted: bool = False,
) -> DataFrame

Aggregates column(s) by counting unique values for each unique group.

Note that empty strings and np.nan do not contribute to the unique value count.

Parameters:

Name Type Description Default
table DataFrame

The input DataFrame used for aggregating on unique groups.

required
group_by str

The name of the column used to determine unique groups for aggregation.

required
input_column str | Iterable[str]

A column or a list of columns, whose unique values will be counted for each unique group during aggregation.

required
output_column str

The name of the column containing the aggregation results. By default "Unique counts" is used as the name of the output column.

'Unique counts'
is_sorted bool

Indicates whether the input dataframe is already sorted with respect to the 'group_by' column.

False

Returns:

Type Description
DataFrame

A dataframe with unique 'group_by' values as index and a unique counts column containing the number of unique values per group.

Example

>>> table = pd.DataFrame(
...     {
...         "ID": ["A", "A", "B", "C", "C", "C"],
...         "Peptide sequence": ["a", "a", "b", "c1", "c2", "c2"],
...     }
... )
>>> count_unique(table, group_by="ID", input_column="Peptide sequence")
   Unique counts
A              1
B              1
C              2

Source code in msreport\aggregate\summarize.py
def count_unique(
    table: pd.DataFrame,
    group_by: str,
    input_column: str | Iterable[str],
    output_column: str = "Unique counts",
    is_sorted: bool = False,
) -> pd.DataFrame:
    """Aggregates column(s) by counting unique values for each unique group.

    Note that empty strings and np.nan do not contribute to the unique value count.

    Args:
        table: The input DataFrame used for aggregating on unique groups.
        group_by: The name of the column used to determine unique groups for
            aggregation.
        input_column: A column or a list of columns, whose unique values will be counted
            for each unique group during aggregation.
        output_column: The name of the column containing the aggregation results. By
            default "Unique counts" is used as the name of the output column.
        is_sorted: Indicates whether the input dataframe is already sorted with respect
            to the 'group_by' column.

    Returns:
        A dataframe with unique 'group_by' values as index and a unique counts column
        containing the number of unique values per group.

    Example:
        >>> table = pd.DataFrame(
        ...     {
        ...         "ID": ["A", "A", "B", "C", "C", "C"],
        ...         "Peptide sequence": ["a", "a", "b", "c1", "c2", "c2"],
        ...     }
        ... )
        >>> count_unique(table, group_by="ID", input_column="Peptide sequence")
           Unique counts
        A              1
        B              1
        C              2
    """
    aggregation, groups = aggregate_unique_groups(
        table, group_by, input_column, CONDENSE.count_unique, is_sorted
    )
    return pd.DataFrame(columns=[output_column], data=aggregation, index=groups)

join_unique

join_unique(
    table: DataFrame,
    group_by: str,
    input_column: str | Iterable[str],
    output_column: str = "Unique values",
    sep: str = ";",
    is_sorted: bool = False,
) -> DataFrame

Aggregates column(s) by concatenating unique values for each unique group.

Note that empty strings and np.nan do not contribute to the unique value count.

Parameters:

Name Type Description Default
table DataFrame

The input DataFrame used for aggregating on unique groups.

required
group_by str

The name of the column used to determine unique groups for aggregation.

required
input_column str | Iterable[str]

A column or a list of columns, whose unique values will be joined into a single string for each unique group.

required
output_column str

The name of the column containing the aggregation results. By default "Unique values" is used as the name of the output column.

'Unique values'
sep str

The separator string used to join multiple unique values together. Default is ";".

';'
is_sorted bool

Indicates whether the input dataframe is already sorted with respect to the 'group_by' column.

False

Returns:

Type Description
DataFrame

A dataframe with unique 'group_by' values as index and a unique values column containing the joined unique values per group. Unique values are sorted and joined with the specified separator.

Example

>>> table = pd.DataFrame(
...     {
...         "ID": ["A", "A", "B", "C", "C", "C"],
...         "Peptide sequence": ["a", "", "b", "c1", "c2", "c2"],
...     }
... )
>>> join_unique(table, group_by="ID", input_column="Peptide sequence")
  Unique values
A             a
B             b
C         c1;c2

Source code in msreport\aggregate\summarize.py
def join_unique(
    table: pd.DataFrame,
    group_by: str,
    input_column: str | Iterable[str],
    output_column: str = "Unique values",
    sep: str = ";",
    is_sorted: bool = False,
) -> pd.DataFrame:
    """Aggregates column(s) by concatenating unique values for each unique group.

    Note that empty strings and np.nan do not contribute to the unique value count.

    Args:
        table: The input DataFrame used for aggregating on unique groups.
        group_by: The name of the column used to determine unique groups for
            aggregation.
        input_column: A column or a list of columns, whose unique values will be joined
            into a single string for each unique group.
        output_column: The name of the column containing the aggregation results. By
            default "Unique values" is used as the name of the output column.
        sep: The separator string used to join multiple unique values together. Default
            is ";".
        is_sorted: Indicates whether the input dataframe is already sorted with respect
            to the 'group_by' column.

    Returns:
        A dataframe with unique 'group_by' values as index and a unique values column
        containing the joined unique values per group. Unique values are sorted and
        joined with the specified separator.

    Example:
        >>> table = pd.DataFrame(
        ...     {
        ...         "ID": ["A", "A", "B", "C", "C", "C"],
        ...         "Peptide sequence": ["a", "", "b", "c1", "c2", "c2"],
        ...     }
        ... )
        >>> join_unique(table, group_by="ID", input_column="Peptide sequence")
          Unique values
        A             a
        B             b
        C         c1;c2
    """
    aggregation, groups = aggregate_unique_groups(
        table,
        group_by,
        input_column,
        lambda x: CONDENSE.join_unique_str(x, sep=sep),
        is_sorted,
    )
    return pd.DataFrame(columns=[output_column], data=aggregation, index=groups)

sum_columns

sum_columns(
    table: DataFrame,
    group_by: str,
    samples: Iterable[str],
    input_tag: str,
    output_tag: Optional[str] = None,
    is_sorted: bool = False,
) -> DataFrame

Aggregates column(s) by summing up values for each unique group.

Parameters:

Name Type Description Default
table DataFrame

The input DataFrame used for aggregating on unique groups.

required
group_by str

The name of the column used to determine unique groups for aggregation.

required
samples Iterable[str]

List of sample names that appear in columns of the table as substrings.

required
input_tag str

Substring of column names, which is used together with the sample names to determine the columns whose values will be summarized for each unique group.

required
output_tag Optional[str]

Optional, allows changing the output column names by replacing the 'input_tag' with the 'output_tag'. If not specified, the names of the columns that were used for aggregation will be used in the returned dataframe.

None
is_sorted bool

Indicates whether the input dataframe is already sorted with respect to the 'group_by' column.

False

Returns:

Type Description
DataFrame

A dataframe with unique 'group_by' values as index and one column per sample. The columns contain the summed group values per sample.

Example

>>> table = pd.DataFrame(
...     {
...         "ID": ["A", "A", "B", "C", "C", "C"],
...         "Col S1": [1, 1, 1, 1, 1, 1],
...         "Col S2": [2, 2, 2, 2, 2, 2],
...     }
... )
>>> sum_columns(table, "ID", samples=["S1", "S2"], input_tag="Col")
   Col S1  Col S2
A       2       4
B       1       2
C       3       6

Source code in msreport\aggregate\summarize.py
def sum_columns(
    table: pd.DataFrame,
    group_by: str,
    samples: Iterable[str],
    input_tag: str,
    output_tag: Optional[str] = None,
    is_sorted: bool = False,
) -> pd.DataFrame:
    """Aggregates column(s) by summing up values for each unique group.

    Args:
        table: The input DataFrame used for aggregating on unique groups.
        group_by: The name of the column used to determine unique groups for
            aggregation.
        samples: List of sample names that appear in columns of the table as substrings.
        input_tag: Substring of column names, which is used together with the sample
            names to determine the columns whose values will be summarized for each
            unique group.
        output_tag: Optional; allows changing the output column names by replacing the
            'input_tag' with the 'output_tag'. If not specified, the names of the
            columns that were used for aggregation are kept in the returned dataframe.
        is_sorted: Indicates whether the input dataframe is already sorted with respect
            to the 'group_by' column.

    Returns:
        A dataframe with unique 'group_by' values as index and one column per sample.
        The columns contain the summed group values per sample.

    Example:
        >>> table = pd.DataFrame(
        ...     {
        ...         "ID": ["A", "A", "B", "C", "C", "C"],
        ...         "Col S1": [1, 1, 1, 1, 1, 1],
        ...         "Col S2": [2, 2, 2, 2, 2, 2],
        ...     }
        ... )
        >>> sum_columns(table, "ID", samples=["S1", "S2"], input_tag="Col")
           Col S1  Col S2
        A       2       4
        B       1       2
        C       3       6
    """
    output_tag = input_tag if output_tag is None else output_tag
    columns = find_sample_columns(table, input_tag, samples)
    aggregation, groups = aggregate_unique_groups(
        table, group_by, columns, CONDENSE.sum_per_column, is_sorted
    )
    output_columns = [column.replace(input_tag, output_tag) for column in columns]
    return pd.DataFrame(columns=output_columns, data=aggregation, index=groups)
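For the complete-data case, the per-group summation performed by `sum_columns` is conceptually equivalent to a plain pandas groupby-sum over the matched sample columns. A minimal sketch using the docstring's example table (the column names are illustrative; `sum_columns` additionally resolves the columns from `input_tag` and `samples` and can rename them via `output_tag`):

```python
import pandas as pd

table = pd.DataFrame(
    {
        "ID": ["A", "A", "B", "C", "C", "C"],
        "Col S1": [1, 1, 1, 1, 1, 1],
        "Col S2": [2, 2, 2, 2, 2, 2],
    }
)

# Conceptual equivalent of sum_columns(table, "ID", samples=["S1", "S2"],
# input_tag="Col"): group rows by "ID" and sum each matched sample column.
result = table.groupby("ID")[["Col S1", "Col S2"]].sum()
print(result)
```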

sum_columns_maxlfq

sum_columns_maxlfq(
    table: DataFrame,
    group_by: str,
    samples: Iterable[str],
    input_tag: str,
    output_tag: Optional[str] = None,
    is_sorted: bool = False,
) -> DataFrame

Aggregates column(s) by applying the MaxLFQ summation approach to unique groups.

This function estimates abundance profiles from sample columns using pairwise median ratios and least-squares regression. It then selects the abundance profiles with finite values together with the corresponding input columns, and scales the abundance profiles so that their total sum equals the total sum of those input columns.

Parameters:

Name Type Description Default
table DataFrame

The input DataFrame used for aggregating on unique groups.

required
group_by str

The name of the column used to determine unique groups for aggregation.

required
samples Iterable[str]

List of sample names that appear in columns of the table as substrings.

required
input_tag str

Substring of column names, which is used together with the sample names to determine the columns whose values will be summarized for each unique group.

required
output_tag Optional[str]

Optional; allows changing the output column names by replacing the 'input_tag' with the 'output_tag'. If not specified, the names of the columns that were used for aggregation are kept in the returned dataframe.

None
is_sorted bool

Indicates whether the input dataframe is already sorted with respect to the 'group_by' column.

False

Returns:

Type Description
DataFrame

A dataframe with unique 'group_by' values as index and one column per sample. The columns contain the summed group values per sample.

Example

>>> table = pd.DataFrame(
...     {
...         "ID": ["A", "A", "B", "C", "C", "C"],
...         "Col S1": [1, 1, 1, 1, 1, 1],
...         "Col S2": [2, 2, 2, 2, 2, 2],
...     }
... )
>>> sum_columns_maxlfq(table, "ID", samples=["S1", "S2"], input_tag="Col")
   Col S1  Col S2
A     2.0     4.0
B     1.0     2.0
C     3.0     6.0

Source code in msreport\aggregate\summarize.py
def sum_columns_maxlfq(
    table: pd.DataFrame,
    group_by: str,
    samples: Iterable[str],
    input_tag: str,
    output_tag: Optional[str] = None,
    is_sorted: bool = False,
) -> pd.DataFrame:
    """Aggregates column(s) by applying the MaxLFQ summation approach to unique group.

    This function estimates abundance profiles from sample columns using pairwise
    median ratios and least-squares regression. It then selects the abundance
    profiles with finite values together with the corresponding input columns, and
    scales the abundance profiles so that their total sum equals the total sum of
    those input columns.

    Args:
        table: The input DataFrame used for aggregating on unique groups.
        group_by: The name of the column used to determine unique groups for
            aggregation.
        samples: List of sample names that appear in columns of the table as substrings.
        input_tag: Substring of column names, which is used together with the sample
            names to determine the columns whose values will be summarized for each
            unique group.
        output_tag: Optional; allows changing the output column names by replacing the
            'input_tag' with the 'output_tag'. If not specified, the names of the
            columns that were used for aggregation are kept in the returned dataframe.
        is_sorted: Indicates whether the input dataframe is already sorted with respect
            to the 'group_by' column.

    Returns:
        A dataframe with unique 'group_by' values as index and one column per sample.
        The columns contain the summed group values per sample.

    Example:
        >>> table = pd.DataFrame(
        ...     {
        ...         "ID": ["A", "A", "B", "C", "C", "C"],
        ...         "Col S1": [1, 1, 1, 1, 1, 1],
        ...         "Col S2": [2, 2, 2, 2, 2, 2],
        ...     }
        ... )
        >>> sum_columns_maxlfq(table, "ID", samples=["S1", "S2"], input_tag="Col")
           Col S1  Col S2
        A     2.0     4.0
        B     1.0     2.0
        C     3.0     6.0
    """
    output_tag = input_tag if output_tag is None else output_tag
    columns = find_sample_columns(table, input_tag, samples)
    aggregation, groups = aggregate_unique_groups(
        table, group_by, columns, CONDENSE.sum_by_median_ratio_regression, is_sorted
    )
    output_columns = [column.replace(input_tag, output_tag) for column in columns]
    return pd.DataFrame(columns=output_columns, data=aggregation, index=groups)
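The pairwise median-ratio regression described in the docstring can be sketched for the complete-data case. `maxlfq_profile` below is a hypothetical helper, not the library's `CONDENSE.sum_by_median_ratio_regression`; it assumes strictly positive, finite values and omits the missing-value handling of the real implementation:

```python
import itertools

import numpy as np


def maxlfq_profile(matrix: np.ndarray) -> np.ndarray:
    """Estimate a per-sample abundance profile for one group (rows x samples)."""
    n_samples = matrix.shape[1]
    log_m = np.log2(matrix)
    coefficients, ratios = [], []
    # One equation per sample pair (i, j): a_i - a_j = median log2 ratio.
    for i, j in itertools.combinations(range(n_samples), 2):
        row = np.zeros(n_samples)
        row[i], row[j] = 1.0, -1.0
        coefficients.append(row)
        ratios.append(np.median(log_m[:, i] - log_m[:, j]))
    # Least-squares solution for the log2 abundances, then back-transform.
    log_profile = np.linalg.lstsq(
        np.array(coefficients), np.array(ratios), rcond=None
    )[0]
    profile = 2.0**log_profile
    # Scale the profile so its total equals the total input signal.
    return profile * (matrix.sum() / profile.sum())


group = np.array([[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]])
print(maxlfq_profile(group))  # approximately [3. 6.]
```

This reproduces the docstring example: for a group whose rows consistently show a 1:2 ratio between S1 and S2, the scaled profile matches the plain column sums.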

aggregate_unique_groups

aggregate_unique_groups(
    table: DataFrame,
    group_by: str,
    columns_to_aggregate: str | Iterable[str],
    condenser: Callable,
    is_sorted: bool,
) -> tuple[ndarray, ndarray]

Aggregates column(s) by applying a condenser function to unique groups.

The function returns two arrays containing the aggregated values and the corresponding group names. It can be used, for example, to summarize data from an ion table into a peptide, protein, or modification table. Suitable condenser functions can be found in the module msreport.aggregate.condense.

Parameters:

Name Type Description Default
table DataFrame

The input dataframe used for aggregating on unique groups.

required
group_by str

The name of the column used to determine unique groups for aggregation.

required
columns_to_aggregate str | Iterable[str]

A column or a list of columns, which will be passed to the condenser function for applying an aggregation to each unique group.

required
condenser Callable

Function that is applied to each group for generating the aggregation result. If multiple columns are specified for aggregation, the input array for the condenser function will be two-dimensional, with the first dimension corresponding to rows and the second to columns. E.g. an array with 3 rows and 2 columns: np.array([[1, 'a'], [2, 'b'], [3, 'c']])

required
is_sorted bool

Indicates whether the input dataframe is already sorted with respect to the 'group_by' column.

required

Returns:

Type Description
tuple[ndarray, ndarray]

Two numpy arrays: the first contains the aggregation results of each unique group and the second contains the corresponding group names.

Source code in msreport\aggregate\summarize.py
def aggregate_unique_groups(
    table: pd.DataFrame,
    group_by: str,
    columns_to_aggregate: str | Iterable[str],
    condenser: Callable,
    is_sorted: bool,
) -> tuple[np.ndarray, np.ndarray]:
    """Aggregates column(s) by applying a condenser function to unique groups.

    The function returns two arrays containing the aggregated values and the
    corresponding group names. It can be used, for example, to summarize data from an
    ion table into a peptide, protein, or modification table. Suitable condenser
    functions can be found in the module msreport.aggregate.condense.

    Args:
        table: The input dataframe used for aggregating on unique groups.
        group_by: The name of the column used to determine unique groups for
            aggregation.
        columns_to_aggregate: A column or a list of columns, which will be passed to the
            condenser function for applying an aggregation to each unique group.
        condenser: Function that is applied to each group for generating the
            aggregation result. If multiple columns are specified for aggregation,
            the input array for the condenser function will be two-dimensional, with
            the first dimension corresponding to rows and the second to columns. E.g.
            an array with 3 rows and 2 columns: np.array([[1, 'a'], [2, 'b'], [3, 'c']])
        is_sorted: Indicates whether the input dataframe is already sorted with respect
            to the 'group_by' column.

    Returns:
        Two numpy arrays: the first array contains the aggregation results of each
        unique group and the second array contains the corresponding group names.
    """
    group_start_indices, group_names, table = _prepare_grouping_indices(
        table, group_by, is_sorted
    )
    array = table[columns_to_aggregate].to_numpy()
    aggregation_result = np.array(
        [condenser(i) for i in np.split(array, group_start_indices[1:])]
    )
    return aggregation_result, group_names
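The split-and-condense mechanics used here can be sketched independently of the library's helpers. A minimal, self-contained version of the same idea (sort by the group column, locate each group's start index, split the value array at those boundaries, and condense each chunk); the table and column names are illustrative:

```python
import numpy as np
import pandas as pd

table = pd.DataFrame({"ID": ["B", "A", "A", "C"], "Value": [10, 1, 2, 30]})

# Sort by the group column so that each group occupies a contiguous block.
table = table.sort_values("ID", kind="stable")

# np.unique on the sorted column yields the group names and the index at
# which each group's block starts.
group_names, start_indices = np.unique(table["ID"].to_numpy(), return_index=True)

# Split the value array at the group boundaries and condense each chunk,
# here with a simple sum as the condenser.
chunks = np.split(table["Value"].to_numpy(), start_indices[1:])
aggregated = [int(np.sum(chunk)) for chunk in chunks]

result = {str(name): total for name, total in zip(group_names, aggregated)}
print(result)  # {'A': 3, 'B': 10, 'C': 30}
```

Any callable that maps an array chunk to a single result can take the place of the sum, which is exactly the role of the condenser functions in msreport.aggregate.condense.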