Aggregate
A comprehensive set of tools for aggregating and reshaping tabular proteomics data.
The aggregation module contains submodules that transform data from lower levels of
abstraction (e.g. ions, peptides) to higher levels (e.g. peptides, proteins, PTMs)
through various summarization and condensation techniques. It also includes methods
for reshaping tables from "long" to "wide" format, a common prerequisite for
aggregation. The MaxLFQ algorithm is integrated for specific quantitative
summarizations, enabling users to build customized, higher-level data tables.
Modules:
Name | Description |
---|---|
condense | Low-level functions for aggregating numerical and string data. |
pivot | Functionalities for reshaping tabular quantitative proteomics data. |
summarize | High-level functions for aggregating quantitative proteomics data. |
condense
Low-level functions for aggregating numerical and string data.
This module defines fundamental "condenser" functions that operate directly on NumPy arrays. These functions are designed to be applied to groups of data, performing operations such as summing values, finding maximum/minimum, counting or joining unique elements, and calculating abundance profiles. It includes the core implementations for MaxLFQ summation.
Functions:
Name | Description |
---|---|
join_str | Returns a joined string of sorted values from the array. |
join_str_per_column | Returns for each column a joined string of sorted values. |
join_unique_str | Returns a joined string of unique sorted values from the array. |
join_unique_str_per_column | Returns for each column a joined string of unique sorted values. |
sum | Returns sum of values from one or multiple columns. |
sum_per_column | Returns for each column the sum of values. |
maximum | Returns the highest finite value from one or multiple columns. |
maximum_per_column | Returns for each column the highest finite value. |
minimum | Returns the lowest finite value from one or multiple columns. |
minimum_per_column | Returns for each column the lowest finite value. |
count_unique | Returns the number of unique values from one or multiple columns. |
count_unique_per_column | Returns for each column the number of unique values. |
profile_by_median_ratio_regression | Calculates abundance profiles by lstsq regression of pair-wise median ratios. |
sum_by_median_ratio_regression | Calculates summed abundance by lstsq regression of pair-wise median ratios. |
join_str
Returns a joined string of sorted values from the array.
Note that empty strings or np.nan are not included in the joined string.
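The documented behavior (sorted values, empty strings and np.nan skipped) can be illustrated with a minimal sketch. The helper `join_sorted` below is hypothetical and is not the msreport implementation; flattening the input and the default separator ";" are assumptions taken from the neighbouring `*_per_column` signatures.

```python
import numpy as np


def join_sorted(values, sep=";"):
    """Join non-empty, non-NaN entries of `values` in sorted order (illustrative)."""
    flat = np.asarray(values, dtype=object).ravel()
    # A float NaN is the only value that is not equal to itself.
    kept = [str(v) for v in flat if v != "" and v == v]
    return sep.join(sorted(kept))


joined = join_sorted(["b", "", "a", np.nan, "b"])
print(joined)  # a;b;b
```

Duplicates are kept here; the `join_unique_str` variant would additionally drop repeated entries.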
Source code in msreport\aggregate\condense.py
join_str_per_column
join_str_per_column(array: ndarray, sep: str = ";") -> ndarray
Returns for each column a joined string of sorted values.
Note that empty strings or np.nan are not included in the joined string.
Source code in msreport\aggregate\condense.py
join_unique_str
Returns a joined string of unique sorted values from the array.
Source code in msreport\aggregate\condense.py
join_unique_str_per_column
join_unique_str_per_column(array: ndarray, sep: str = ";") -> ndarray
Returns for each column a joined string of unique sorted values.
Source code in msreport\aggregate\condense.py
sum
sum(array: ndarray) -> float
Returns sum of values from one or multiple columns.
Note that if no finite values are present in the array np.nan is returned.
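The note above describes a sum that falls back to np.nan when nothing can be summed. A minimal sketch of that idea follows; `finite_sum` is a hypothetical helper, and treating every non-finite entry (NaN and infinities) as missing is an assumption, not something this page states about msreport.

```python
import numpy as np


def finite_sum(array):
    """Sum the finite entries of `array`; return NaN when there are none."""
    values = np.asarray(array, dtype=float)
    finite = values[np.isfinite(values)]
    return float(finite.sum()) if finite.size else float("nan")


total = finite_sum([[1.0, np.nan], [2.0, 3.0]])  # 6.0
empty = finite_sum([np.nan, np.nan])             # nan
```

Returning NaN instead of 0.0 keeps "not quantified" distinguishable from "quantified as zero" in downstream tables.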
Source code in msreport\aggregate\condense.py
sum_per_column
sum_per_column(array: ndarray) -> ndarray
Returns for each column the sum of values.
Note that if no finite values are present in a column np.nan is returned.
Source code in msreport\aggregate\condense.py
maximum
maximum(array: ndarray) -> float
Returns the highest finite value from one or multiple columns.
Source code in msreport\aggregate\condense.py
maximum_per_column
maximum_per_column(array: ndarray) -> ndarray
Returns for each column the highest finite value.
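As an illustration of a per-column "highest finite value", the following sketch masks non-finite entries before taking the maximum. `finite_max_per_column` is a hypothetical name, and reporting NaN for a column without finite values is an assumption; this page does not say what msreport returns in that case.

```python
import numpy as np


def finite_max_per_column(array):
    """Return the highest finite value of each column, NaN if a column has none."""
    values = np.asarray(array, dtype=float)
    if values.ndim == 1:
        values = values.reshape(-1, 1)
    # Replace non-finite entries by -inf so they can never win the maximum.
    masked = np.where(np.isfinite(values), values, -np.inf)
    result = masked.max(axis=0)
    # Assumption: columns without any finite entry are reported as NaN.
    result[np.isneginf(result)] = np.nan
    return result


array = np.array([[1.0, np.nan], [np.inf, np.nan], [3.0, np.nan]])
col_max = finite_max_per_column(array)  # first column: 3.0, second column: nan
```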
Source code in msreport\aggregate\condense.py
minimum
minimum(array: ndarray) -> float
Returns the lowest finite value from one or multiple columns.
Source code in msreport\aggregate\condense.py
minimum_per_column
minimum_per_column(array: ndarray) -> ndarray
Returns for each column the lowest finite value.
Source code in msreport\aggregate\condense.py
count_unique
count_unique(array: ndarray) -> int
Returns the number of unique values from one or multiple columns.
Note that empty strings or np.nan are not counted as unique values.
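A short sketch of the counting rule described above. The helper `count_unique_values` is hypothetical, and pooling all columns into one flat collection before counting is an assumption about how "one or multiple columns" is handled.

```python
import numpy as np


def count_unique_values(array):
    """Count distinct entries, ignoring empty strings and NaN (illustrative)."""
    flat = np.asarray(array, dtype=object).ravel()
    # v == v is False only for NaN, so NaN entries are dropped here.
    unique = {v for v in flat if v != "" and v == v}
    return len(unique)


n_unique = count_unique_values([["a", "b"], ["a", ""], [np.nan, "c"]])  # 3
```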
Source code in msreport\aggregate\condense.py
count_unique_per_column
count_unique_per_column(array: ndarray) -> ndarray
Returns for each column the number of unique values.
Note that empty strings or np.nan are not counted as unique values.
Source code in msreport\aggregate\condense.py
profile_by_median_ratio_regression
profile_by_median_ratio_regression(array: ndarray) -> ndarray
Calculates abundance profiles by lstsq regression of pair-wise median ratios.
The function performs a least squares regression of pair-wise median ratios to calculate estimated abundance profiles.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
array | ndarray | A two-dimensional array containing abundance values, with the first dimension corresponding to rows and the second dimension to columns. Abundance values must not be log transformed. | required |
Returns:
Type | Description |
---|---|
ndarray | An array containing estimated abundance profiles, with length equal to the number of columns in the input array. |
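The general idea of a least squares regression on pair-wise median ratios can be sketched as follows. This is not the msreport code: `profile_sketch` is a hypothetical function, working in log2 space, skipping column pairs without shared rows, and using the minimum-norm `numpy.linalg.lstsq` solution to fix the otherwise arbitrary scale are all assumptions made for illustration.

```python
import itertools

import numpy as np


def profile_sketch(array):
    """Estimate a relative abundance profile from pair-wise median ratios."""
    values = np.asarray(array, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        log_values = np.log2(values)
    log_values[~np.isfinite(log_values)] = np.nan
    n_cols = log_values.shape[1]

    equations, ratios = [], []
    for i, j in itertools.combinations(range(n_cols), 2):
        diff = log_values[:, i] - log_values[:, j]
        diff = diff[np.isfinite(diff)]
        if diff.size == 0:
            continue  # no row is quantified in both columns
        equation = np.zeros(n_cols)
        equation[i], equation[j] = 1.0, -1.0  # encodes x_i - x_j = median log ratio
        equations.append(equation)
        ratios.append(np.median(diff))

    profile = np.full(n_cols, np.nan)
    if not equations:
        return profile
    design = np.array(equations)
    covered = np.abs(design).sum(axis=0) > 0  # columns that appear in any pair
    solution, *_ = np.linalg.lstsq(design[:, covered], np.array(ratios), rcond=None)
    profile[covered] = 2.0 ** solution
    return profile


array = np.array([[1.0, 2.0, 4.0], [10.0, 20.0, 40.0]])
profile = profile_sketch(array)
relative = profile / profile[0]  # approximately [1, 2, 4]
```

Only the ratios between profile entries are meaningful in this sketch; the absolute scale is a by-product of the minimum-norm solution.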
Source code in msreport\aggregate\condense.py
sum_by_median_ratio_regression
sum_by_median_ratio_regression(array: ndarray) -> ndarray
Calculates summed abundance by lstsq regression of pair-wise median ratios.
The function performs a least squares regression of pair-wise median ratios to calculate estimated abundance profiles. These profiles are then scaled based on the input array such that the columns with finite profile values are used and the sum of the scaled profiles matches the sum of the input array.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
array | ndarray | A two-dimensional array containing abundance values, with the first dimension corresponding to rows and the second dimension to columns. Abundance values must not be log transformed. | required |
Returns:
Type | Description |
---|---|
ndarray | An array containing summed abundance estimates, with length equal to the number of columns in the input array. |
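The scaling step described above (use only columns with a finite profile value, then match the total of the input) can be illustrated in isolation. `scale_profile_to_sum` is a hypothetical helper and the profile is passed in by hand; ignoring NaN entries of the input when computing its total is an assumption.

```python
import numpy as np


def scale_profile_to_sum(profile, array):
    """Scale `profile` so its sum equals the sum of the matching input columns."""
    profile = np.asarray(profile, dtype=float)
    values = np.asarray(array, dtype=float)
    usable = np.isfinite(profile)
    scaled = np.full(profile.shape, np.nan)
    if not usable.any():
        return scaled
    # Assumption: missing input values are ignored when computing the total.
    total = np.nansum(values[:, usable])
    scaled[usable] = profile[usable] * total / profile[usable].sum()
    return scaled


array = np.array([[1.0, 2.0, np.nan], [3.0, 6.0, np.nan]])
profile = np.array([1.0, 2.0, np.nan])
scaled = scale_profile_to_sum(profile, array)  # [4.0, 8.0, nan]
```

The 1:2 ratio of the profile is preserved while the scaled values add up to 12, the total of the two usable input columns.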
Source code in msreport\aggregate\condense.py
pivot
Functionalities for reshaping tabular quantitative proteomics data.
This module offers methods to transform data from a "long" format into a "wide" format, which is a common and often necessary step before aggregation or analysis. It supports pivoting data based on specified index and grouping columns, and can handle both quantitative values and annotation columns.
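For readers unfamiliar with the "long" and "wide" layouts, the reshaping can be sketched with plain pandas. This uses `DataFrame.pivot` directly rather than the functions of this module, and the table contents and the "Quant <sample>" column naming are made up for illustration.

```python
import pandas as pd

# Long format: one row per (ID, Sample) combination.
long_table = pd.DataFrame(
    {
        "ID": ["A", "A", "B", "B"],
        "Sample": ["S1", "S2", "S1", "S2"],
        "Quant": [1.0, 2.0, 1.0, 2.0],
    }
)

# Wide format: one row per ID and one quantitative column per sample.
wide = long_table.pivot(index="ID", columns="Sample", values="Quant")
wide.columns = [f"Quant {sample}" for sample in wide.columns]
print(wide)
```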
Functions:
Name | Description |
---|---|
pivot_table | Generates a pivoted table in wide format. |
pivot_column | Returns a reshaped dataframe, generated by pivoting the table on one column. |
join_unique | Returns a new dataframe with unique values from a column and grouped by 'index'. |
pivot_table
pivot_table(
    long_table: DataFrame,
    index: str,
    group_by: str,
    annotation_columns: Iterable[str],
    pivoting_columns: Iterable[str],
) -> DataFrame
Generates a pivoted table in wide format.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
long_table | DataFrame | Dataframe in long format that is used to generate a table in wide format. | required |
index | str | One or multiple column names that are used to group the table for pivoting. | required |
group_by | str | Column that is used to split the table on its unique entries. | required |
annotation_columns | Iterable[str] | Each column generates a new column in the pivoted table. Entries from each annotation column are aggregated for each group created by the column(s) specified by 'index' and unique values are joined together with ";" as separator. | required |
pivoting_columns | Iterable[str] | Columns that are combined with unique entries from 'group_by' to generate new columns in the pivoted table. | required |
Returns:
Type | Description |
---|---|
DataFrame | A reshaped pivot table with length equal to the number of unique values from the 'index' column. |
Example
    >>> table = pd.DataFrame(
    ...     {
    ...         "ID": ["A", "B", "C", "B", "C", "D"],
    ...         "Sample": ["S1", "S1", "S1", "S2", "S2", "S2"],
    ...         "Annotation": ["A", "B", "C", "B", "C", "D"],
    ...         "Quant": [1.0, 1.0, 1.0, 2.0, 2.0, 2.0],
    ...     }
    ... )
    >>> pivot_table(table, "ID", "Sample", ["Annotation"], ["Quant"])
      ID Annotation  Quant S1  Quant S2
    0  A          A       1.0       NaN
    1  B          B       1.0       2.0
    2  C          C       1.0       2.0
    3  D          D       NaN       2.0
Source code in msreport\aggregate\pivot.py
pivot_column
pivot_column(
    table: DataFrame,
    index: str | Iterable[str],
    group_by: str,
    values: str,
) -> DataFrame
Returns a reshaped dataframe, generated by pivoting the table on one column.
Uses unique values from the specified 'index' to form the index axis of the new dataframe. Unique values from the 'group_by' column are used to split the data and generate new columns that are filled with values from the 'values' column. The column names are composed of the 'values' column and the unique entries from 'group_by'.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
table | DataFrame | Dataframe that is used to generate the pivoted table. | required |
index | str \| Iterable[str] | One or multiple column names that are used as the new index. | required |
group_by | str | Column that is used to split the table; each unique entry from this column generates a new column in the pivoted table. | required |
values | str | Column whose values are used to populate the pivoted table. | required |
Returns:
Type | Description |
---|---|
DataFrame | The pivoted dataframe. |
Example
    >>> table = pd.DataFrame(
    ...     {
    ...         "ID": ["A", "A", "B", "B"],
    ...         "Sample": ["S1", "S2", "S1", "S2"],
    ...         "Entries": [1.0, 2.0, 1.0, 2.0],
    ...     }
    ... )
    >>> pivot_column(table, "ID", "Sample", "Entries")
        Entries S1  Entries S2
    ID
    A          1.0         2.0
    B          1.0         2.0
Source code in msreport\aggregate\pivot.py
join_unique
Returns a new dataframe with unique values from a column and grouped by 'index'.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
table | DataFrame | Input dataframe from which to generate the new dataframe. | required |
index | str \| Iterable[str] | One or multiple column names to group the table by. | required |
values | str | Column which is used to extract unique values. | required |
Returns:
Type | Description |
---|---|
DataFrame | A dataframe with a single column named 'values', where the unique values of the column specified by 'values' are joined together with ";" for each group created by the column(s) specified by 'index'. |
Example
    >>> table = pd.DataFrame(
    ...     {
    ...         "ID": ["A", "A", "B", "B"],
    ...         "Annotation": ["A1", "A1", "B1", "B1"],
    ...     }
    ... )
    >>> join_unique(table, "ID", "Annotation")
       Annotation
    ID
    A          A1
    B          B1
Source code in msreport\aggregate\pivot.py
summarize
High-level functions for aggregating quantitative proteomics data.
This module offers functions to summarize data from a lower level of abstraction (e.g.
ions, peptides) to a higher level (e.g. peptides, proteins, PTMs). It operates directly
on pandas DataFrames, allowing users to specify a grouping column and the columns to be
summarized. These functions often leverage low-level condenser operations defined in
msreport.aggregate.condense. It includes specific functions for MaxLFQ summation, as
well as general counting, joining, and summing of columns.
Functions:
Name | Description |
---|---|
count_unique | Aggregates column(s) by counting unique values for each unique group. |
join_unique | Aggregates column(s) by concatenating unique values for each unique group. |
sum_columns | Aggregates column(s) by summing up values for each unique group. |
sum_columns_maxlfq | Aggregates column(s) by applying the MaxLFQ summation approach to unique groups. |
aggregate_unique_groups | Aggregates column(s) by applying a condenser function to unique groups. |
count_unique
count_unique(
    table: DataFrame,
    group_by: str,
    input_column: str | Iterable[str],
    output_column: str = "Unique counts",
    is_sorted: bool = False,
) -> DataFrame
Aggregates column(s) by counting unique values for each unique group.
Note that empty strings and np.nan do not contribute to the unique value count.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
table | DataFrame | The input DataFrame used for aggregating on unique groups. | required |
group_by | str | The name of the column used to determine unique groups for aggregation. | required |
input_column | str \| Iterable[str] | A column or a list of columns, whose unique values will be counted for each unique group during aggregation. | required |
output_column | str | The name of the column containing the aggregation results. By default "Unique counts" is used as the name of the output column. | 'Unique counts' |
is_sorted | bool | Indicates whether the input dataframe is already sorted with respect to the 'group_by' column. | False |
Returns:
Type | Description |
---|---|
DataFrame | A dataframe with unique 'group_by' values as index and a unique counts column containing the number of unique values per group. |
Example
    >>> table = pd.DataFrame(
    ...     {
    ...         "ID": ["A", "A", "B", "C", "C", "C"],
    ...         "Peptide sequence": ["a", "a", "b", "c1", "c2", "c2"],
    ...     }
    ... )
    >>> count_unique(table, group_by="ID", input_column="Peptide sequence")
       Unique counts
    A              1
    B              1
    C              2
Source code in msreport\aggregate\summarize.py
join_unique
join_unique(
    table: DataFrame,
    group_by: str,
    input_column: str | Iterable[str],
    output_column: str = "Unique values",
    sep: str = ";",
    is_sorted: bool = False,
) -> DataFrame
Aggregates column(s) by concatenating unique values for each unique group.
Note that empty strings and np.nan are not included in the joined values.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
table | DataFrame | The input DataFrame used for aggregating on unique groups. | required |
group_by | str | The name of the column used to determine unique groups for aggregation. | required |
input_column | str \| Iterable[str] | A column or a list of columns, whose unique values will be joined into a single string for each unique group. | required |
output_column | str | The name of the column containing the aggregation results. By default "Unique values" is used as the name of the output column. | 'Unique values' |
sep | str | The separator string used to join multiple unique values together. Default is ";". | ';' |
is_sorted | bool | Indicates whether the input dataframe is already sorted with respect to the 'group_by' column. | False |
Returns:
Type | Description |
---|---|
DataFrame | A dataframe with unique 'group_by' values as index and a unique values column containing the joined unique values per group. Unique values are sorted and joined with the specified separator. |
Example
    >>> table = pd.DataFrame(
    ...     {
    ...         "ID": ["A", "A", "B", "C", "C", "C"],
    ...         "Peptide sequence": ["a", "", "b", "c1", "c2", "c2"],
    ...     }
    ... )
    >>> join_unique(table, group_by="ID", input_column="Peptide sequence")
       Unique values
    A              a
    B              b
    C          c1;c2
Source code in msreport\aggregate\summarize.py
sum_columns
sum_columns(
    table: DataFrame,
    group_by: str,
    samples: Iterable[str],
    input_tag: str,
    output_tag: Optional[str] = None,
    is_sorted: bool = False,
) -> DataFrame
Aggregates column(s) by summing up values for each unique group.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
table | DataFrame | The input DataFrame used for aggregating on unique groups. | required |
group_by | str | The name of the column used to determine unique groups for aggregation. | required |
samples | Iterable[str] | List of sample names that appear in columns of the table as substrings. | required |
input_tag | str | Substring of column names, which is used together with the sample names to determine the columns whose values will be summarized for each unique group. | required |
output_tag | Optional[str] | Optional, allows changing the output column names by replacing the 'input_tag' with the 'output_tag'. If not specified, the names of the columns that were used for aggregation will be used in the returned dataframe. | None |
is_sorted | bool | Indicates whether the input dataframe is already sorted with respect to the 'group_by' column. | False |
Returns:
Type | Description |
---|---|
DataFrame | A dataframe with unique 'group_by' values as index and one column per sample. The columns contain the summed group values per sample. |
Example
    >>> table = pd.DataFrame(
    ...     {
    ...         "ID": ["A", "A", "B", "C", "C", "C"],
    ...         "Col S1": [1, 1, 1, 1, 1, 1],
    ...         "Col S2": [2, 2, 2, 2, 2, 2],
    ...     }
    ... )
    >>> sum_columns(table, "ID", samples=["S1", "S2"], input_tag="Col")
       Col S1  Col S2
    A       2       4
    B       1       2
    C       3       6
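How 'samples', 'input_tag' and 'output_tag' interact can be sketched with plain pandas. This is not the msreport implementation: the rule "a column is selected when its name contains the tag and one of the sample names" is an assumption derived from the parameter descriptions, and the column names are made up.

```python
import pandas as pd

table = pd.DataFrame(
    {
        "ID": ["A", "A", "B"],
        "Intensity S1": [1.0, 1.0, 1.0],
        "Intensity S2": [2.0, 2.0, 2.0],
        "Other S1": [9.0, 9.0, 9.0],
    }
)
samples, input_tag, output_tag = ["S1", "S2"], "Intensity", "Sum"

# Assumption: select columns whose name contains the tag and a sample name.
columns = [
    col
    for col in table.columns
    if input_tag in col and any(sample in col for sample in samples)
]

# min_count=1 yields NaN instead of 0 for groups without any value.
summed = table.groupby("ID")[columns].sum(min_count=1)

# Mimic 'output_tag' by replacing the input tag in the column names.
summed.columns = [col.replace(input_tag, output_tag) for col in summed.columns]
print(summed)
```

The "Other S1" column is left out because it does not contain the input tag.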
Source code in msreport\aggregate\summarize.py
sum_columns_maxlfq
sum_columns_maxlfq(
    table: DataFrame,
    group_by: str,
    samples: Iterable[str],
    input_tag: str,
    output_tag: Optional[str] = None,
    is_sorted: bool = False,
) -> DataFrame
Aggregates column(s) by applying the MaxLFQ summation approach to unique groups.
This function estimates abundance profiles from sample columns using pairwise median ratios and least square regression. It then selects abundance profiles with finite values and the corresponding input columns and scales the abundance profiles so that their total sum is equal to the total sum of the corresponding input columns.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
table | DataFrame | The input DataFrame used for aggregating on unique groups. | required |
group_by | str | The name of the column used to determine unique groups for aggregation. | required |
samples | Iterable[str] | List of sample names that appear in columns of the table as substrings. | required |
input_tag | str | Substring of column names, which is used together with the sample names to determine the columns whose values will be summarized for each unique group. | required |
output_tag | Optional[str] | Optional, allows changing the output column names by replacing the 'input_tag' with the 'output_tag'. If not specified, the names of the columns that were used for aggregation will be used in the returned dataframe. | None |
is_sorted | bool | Indicates whether the input dataframe is already sorted with respect to the 'group_by' column. | False |
Returns:
Type | Description |
---|---|
DataFrame | A dataframe with unique 'group_by' values as index and one column per sample. The columns contain the summed group values per sample. |
Example
    >>> table = pd.DataFrame(
    ...     {
    ...         "ID": ["A", "A", "B", "C", "C", "C"],
    ...         "Col S1": [1, 1, 1, 1, 1, 1],
    ...         "Col S2": [2, 2, 2, 2, 2, 2],
    ...     }
    ... )
    >>> sum_columns_maxlfq(table, "ID", samples=["S1", "S2"], input_tag="Col")
       Col S1  Col S2
    A     2.0     4.0
    B     1.0     2.0
    C     3.0     6.0
Source code in msreport\aggregate\summarize.py
aggregate_unique_groups
aggregate_unique_groups(
    table: DataFrame,
    group_by: str,
    columns_to_aggregate: str | Iterable[str],
    condenser: Callable,
    is_sorted: bool,
) -> tuple[ndarray, ndarray]
Aggregates column(s) by applying a condenser function to unique groups.
The function returns two arrays containing the aggregated values and the corresponding group names. It can be used, for example, to summarize data from an ion table to a peptide, protein or modification table. Suitable condenser functions can be found in the module msreport.aggregate.condense.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
table | DataFrame | The input dataframe used for aggregating on unique groups. | required |
group_by | str | The name of the column used to determine unique groups for aggregation. | required |
columns_to_aggregate | str \| Iterable[str] | A column or a list of columns, which will be passed to the condenser function for applying an aggregation to each unique group. | required |
condenser | Callable | Function that is applied to each group for generating the aggregation result. If multiple columns are specified for aggregation, the input array for the condenser function will be two dimensional, with the first dimension corresponding to rows and the second to the column. E.g. an array with 3 rows and 2 columns: np.array([[1, 'a'], [2, 'b'], [3, 'c']]) | required |
is_sorted | bool | Indicates whether the input dataframe is already sorted with respect to the 'group_by' column. | required |
Returns:
Type | Description |
---|---|
tuple[ndarray, ndarray] | Two numpy arrays, the first array contains the aggregation results of each unique group and the second array contains the corresponding group names. |
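The pattern of applying a condenser to the array of each unique group can be sketched with a pandas groupby. `aggregate_groups_sketch` and `per_column_sum` are hypothetical stand-ins, not msreport functions, and the sketch ignores the 'is_sorted' optimization entirely.

```python
import numpy as np
import pandas as pd


def aggregate_groups_sketch(table, group_by, columns, condenser):
    """Apply `condenser` to the NumPy array of each unique group (illustrative)."""
    names, results = [], []
    for name, group in table.groupby(group_by, sort=True):
        names.append(name)
        # With several columns the condenser receives a 2D array (rows x columns).
        results.append(condenser(group[columns].to_numpy()))
    return np.array(results), np.array(names)


def per_column_sum(array):
    """Stand-in condenser: NaN-ignoring sum of each column."""
    return np.nansum(array, axis=0)


table = pd.DataFrame(
    {
        "ID": ["A", "A", "B", "C", "C", "C"],
        "Col S1": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
        "Col S2": [2.0, 2.0, 2.0, 2.0, 2.0, 2.0],
    }
)
values, groups = aggregate_groups_sketch(
    table, "ID", ["Col S1", "Col S2"], per_column_sum
)
```

Here `values` holds one row of per-sample sums per group and `groups` holds the matching group names, mirroring the two-array return described above.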
Source code in msreport\aggregate\summarize.py