Helper

A collection of widely used helper and utility functions.

This module re-exports commonly used functions from various msreport.helper submodules for convenience.

Functions:

apply_intensity_cutoff: Sets values below the threshold to NA.
find_columns: Returns a list of column names containing the substring.
find_sample_columns: Returns column names that contain the substring and any entry of 'samples'.
guess_design: Extracts sample name, experiment, and replicate from specified sample columns.
intensities_in_logspace: Evaluates whether intensities are likely to be log transformed.
keep_rows_by_partial_match: Filter a table to keep only rows partially matching any of the specified values.
remove_rows_by_partial_match: Filter a table to remove rows partially matching any of the specified values.
rename_mq_reporter_channels: Renames reporter channel numbers with sample names.
rename_sample_columns: Renames sample names according to the mapping in a cautious manner.

apply_intensity_cutoff

apply_intensity_cutoff(
    table: DataFrame, column_tag: str, threshold: float
) -> None

Sets values below the threshold to NA.

Parameters:

table (DataFrame, required): Dataframe to which the intensity cutoff is applied.

column_tag (str, required): Substring used to identify intensity columns from the 'table' to which the intensity cutoff is applied.

threshold (float, required): Values below the threshold will be set to NA.
Source code in msreport\helper\table.py
def apply_intensity_cutoff(
    table: pd.DataFrame, column_tag: str, threshold: float
) -> None:
    """Sets values below the threshold to NA.

    Args:
        table: Dataframe to which the intensity cutoff is applied.
        column_tag: Substring used to identify intensity columns from the 'table' to
            which the intensity cutoff is applied.
        threshold: Values below the threshold will be set to NA.
    """
    for column in find_columns(table, column_tag):
        table.loc[table[column] < threshold, column] = np.nan
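
A minimal usage sketch (hypothetical column names and values; the import assumes the convenience re-exports described at the top of this page). The table is modified in place:

import pandas as pd
from msreport.helper import apply_intensity_cutoff

table = pd.DataFrame({
    "Intensity Sample_1": [1500.0, 80.0, 3200.0],
    "Intensity Sample_2": [95.0, 2100.0, 4000.0],
    "Protein": ["P1", "P2", "P3"],
})

# All columns containing "Intensity" are processed; values below 100 become NaN.
apply_intensity_cutoff(table, column_tag="Intensity", threshold=100)
# table["Intensity Sample_1"] is now [1500.0, NaN, 3200.0]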

find_columns

find_columns(
    table: DataFrame,
    substring: str,
    must_be_substring: bool = False,
) -> list[str]

Returns a list of column names containing the substring.

Parameters:

table (DataFrame, required): Columns of this dataframe are queried.

substring (str, required): String that must be part of column names.

must_be_substring (bool, default False): If true, column names that are exactly equal to the substring are not reported.

Returns:

list[str]: A list of column names.

Source code in msreport\helper\table.py
def find_columns(
    table: pd.DataFrame, substring: str, must_be_substring: bool = False
) -> list[str]:
    """Returns a list column names containing the substring.

    Args:
        table: Columns of this datafram are queried.
        substring: String that must be part of column names.
        must_be_substring: If true than column names are not reported if they
            are exactly equal to the substring.

    Returns:
        A list of column names.
    """
    matched_columns = [col for col in table.columns if substring in col]
    if must_be_substring:
        matched_columns = [col for col in matched_columns if col != substring]
    return matched_columns
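
A short usage sketch with hypothetical column names:

import pandas as pd
from msreport.helper import find_columns

df = pd.DataFrame(columns=["Intensity", "Intensity A", "Intensity B", "Sequence"])

find_columns(df, "Intensity")
# -> ['Intensity', 'Intensity A', 'Intensity B']

find_columns(df, "Intensity", must_be_substring=True)
# -> ['Intensity A', 'Intensity B']; the exact match 'Intensity' is dropped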

find_sample_columns

find_sample_columns(
    table: DataFrame, substring: str, samples: Iterable[str]
) -> list[str]

Returns column names that contain the substring and any entry of 'samples'.

Parameters:

table (DataFrame, required): Columns of this dataframe are queried.

substring (str, required): String that must be part of column names.

samples (Iterable[str], required): List of strings from which at least one must be present in matched columns.

Returns:

list[str]: A list of column names containing the substring and any entry of 'samples'. Columns are returned in the order of entries in 'samples'.

Source code in msreport\helper\table.py
def find_sample_columns(
    table: pd.DataFrame, substring: str, samples: Iterable[str]
) -> list[str]:
    """Returns column names that contain the substring and any entry of 'samples'.

    Args:
        table: Columns of this dataframe are queried.
        substring: String that must be part of column names.
        samples: List of strings from which at least one must be present in matched
            columns.

    Returns:
        A list of column names containing the substring and any entry of 'samples'.
        Columns are returned in the order of entries in 'samples'.
    """
    WHITESPACE_CHARS = " ."

    matched_columns = []
    substring_columns = find_columns(table, substring)
    for sample in samples:
        sample_columns = [c for c in substring_columns if sample in c]
        for col in sample_columns:
            column_remainder = (
                col.replace(substring, "").replace(sample, "").strip(WHITESPACE_CHARS)
            )
            if column_remainder == "":
                matched_columns.append(col)
                break
    return matched_columns
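
A short usage sketch with hypothetical column and sample names. A column is matched for a sample only when nothing but the substring and the sample name remains, so "Intensity AB" is not returned for sample "A":

import pandas as pd
from msreport.helper import find_sample_columns

df = pd.DataFrame(columns=["Intensity A", "Intensity AB", "iBAQ A", "Sequence"])

find_sample_columns(df, "Intensity", samples=["AB", "A"])
# -> ['Intensity AB', 'Intensity A'], ordered like the entries in 'samples'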

guess_design

guess_design(table: DataFrame, tag: str) -> DataFrame

Extracts sample name, experiment, and replicate from specified sample columns.

"Total" and "Combined", and their lower case variants, are not allowed as sample names and will be ignored.

First, a subset of columns containing a column tag is identified. Then sample names are extracted by removing the column tag from each column name. Finally, sample names are split into experiment and replicate at the last underscore.

This requires that the naming of samples follows a specific convention. Sample names must begin with the experiment name, followed by an underscore and a unique identifier of the sample, for example the replicate number. The experiment name can also contain underscores, as it is split only by the last underscore.

For example "ExpA_r1" would be split into experiment "ExpA" and replicate "r1", "Exp_A_1" would be experiment "Exp_A" and replicate "1".

Parameters:

table (DataFrame, required): Dataframe whose columns are used for extracting sample names.

tag (str, required): Column names containing the 'tag' are selected for sample extraction.

Returns:

DataFrame: A dataframe containing the columns "Sample", "Experiment", and "Replicate".

Source code in msreport\helper\table.py
def guess_design(table: pd.DataFrame, tag: str) -> pd.DataFrame:
    """Extracts sample name, experiment, and replicate from specified sample columns.

    "Total" and "Combined", and their lower case variants, are not allowed as sample
    names and will be ignored.

    First, a subset of columns containing a column tag is identified. Then sample names
    are extracted by removing the column tag from each column name. Finally, sample
    names are split into experiment and replicate at the last underscore.

    This requires that the naming of samples follows a specific convention. Sample names
    must begin with the experiment name, followed by an underscore and a unique
    identifier of the sample, for example the replicate number. The experiment name can
    also contain underscores, as it is split only by the last underscore.

    For example "ExpA_r1" would be split into experiment "ExpA" and replicate "r1",
    "Exp_A_1" would be experiment "Exp_A" and replicate "1".

    Args:
        table: Dataframe whose columns are used for extracting sample names.
        tag: Column names containing the 'tag' are selected for sample extraction.

    Returns:
        A dataframe containing the columns "Sample", "Experiment", and "Replicate"
    """
    sample_entries = []
    for column in find_columns(table, tag, must_be_substring=True):
        sample = column.replace(tag, "").strip()
        if sample.lower() in ["total", "combined"]:
            continue
        experiment = "_".join(sample.split("_")[:-1])
        experiment = experiment if experiment else sample
        replicate = sample.split("_")[-1]
        replicate = replicate if replicate != sample else "-1"
        sample_entries.append([sample, experiment, replicate])
    design = pd.DataFrame(sample_entries, columns=["Sample", "Experiment", "Replicate"])
    return design
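
A short usage sketch with hypothetical column names that follow the naming convention described above:

import pandas as pd
from msreport.helper import guess_design

df = pd.DataFrame(columns=[
    "Intensity Ctrl_r1", "Intensity Ctrl_r2", "Intensity KO_A_r1",
    "Intensity Total", "Sequence",
])

design = guess_design(df, tag="Intensity ")
# design rows: ("Ctrl_r1", "Ctrl", "r1"), ("Ctrl_r2", "Ctrl", "r2"),
# ("KO_A_r1", "KO_A", "r1"); the "Total" column is ignored.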

intensities_in_logspace

intensities_in_logspace(
    data: Union[DataFrame, ndarray, Iterable],
) -> bool

Evaluates whether intensities are likely to be log transformed.

Assumes that intensities are log transformed if all values are smaller than or equal to 64. Intensity values (and intensity peak areas) reported by tandem mass spectrometry typically range from 10^3 to 10^12. To reach log2 transformed values greater than 64, intensities would need to be higher than 10^19, which is very unlikely to ever be encountered.

Parameters:

data (Union[DataFrame, ndarray, Iterable], required): Dataset that contains only intensity values; can be any iterable, a numpy.array, or a pandas.DataFrame. Multiple dimensions or columns are allowed.

Returns:

bool: True if intensity values in 'data' appear to be log transformed.

Source code in msreport\helper\table.py
def intensities_in_logspace(data: Union[pd.DataFrame, np.ndarray, Iterable]) -> bool:
    """Evaluates whether intensities are likely to be log transformed.

    Assumes that intensities are log transformed if all values are smaller than or
    equal to 64. Intensity values (and intensity peak areas) reported by tandem mass
    spectrometry typically range from 10^3 to 10^12. To reach log2 transformed values
    greater than 64, intensities would need to be higher than 10^19, which is very
    unlikely to ever be encountered.

    Args:
        data: Dataset that contains only intensity values, can be any iterable,
            a numpy.array or a pandas.DataFrame, multiple dimensions or columns
            are allowed.

    Returns:
        True if intensity values in 'data' appear to be log transformed.
    """
    data = np.array(data, dtype=float)
    mask = np.isfinite(data)
    return bool(np.all(data[mask].flatten() <= 64))
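
A short sketch of the heuristic on made-up values:

import numpy as np
from msreport.helper import intensities_in_logspace

raw = np.array([[1.2e6, 3.4e7], [np.nan, 5.6e8]])

intensities_in_logspace(raw)           # -> False, values far exceed 64
intensities_in_logspace(np.log2(raw))  # -> True, all finite values are <= 64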

keep_rows_by_partial_match

keep_rows_by_partial_match(
    table: DataFrame, column: str, values: Iterable[str]
) -> DataFrame

Filter a table to keep only rows partially matching any of the specified values.

Parameters:

table (DataFrame, required): The input table that will be filtered.

column (str, required): The name of the column in the 'table' whose entries are checked for partial matches to the values. This column must have the datatype 'str'.

values (Iterable[str], required): An iterable of strings that are used to filter the table. Any of the specified values must have at least a partial match to an entry from the specified 'column' for a row to be kept in the filtered table.

Returns:

DataFrame: A new DataFrame containing only the rows that have a partial or complete match with any of the specified 'values'.

Example

df = pd.DataFrame({"Modifications": ["phos", "acetyl;phos", "acetyl"]}) keep_rows_by_partial_match(df, "Modifications", ["phos"]) Modifications 0 phos 1 acetyl;phos

Source code in msreport\helper\table.py
def keep_rows_by_partial_match(
    table: pd.DataFrame, column: str, values: Iterable[str]
) -> pd.DataFrame:
    """Filter a table to keep only rows partially matching any of the specified values.

    Args:
        table: The input table that will be filtered.
        column: The name of the column in the 'table' whose entries are checked for
            partial matches to the values. This column must have the datatype 'str'.
        values: An iterable of strings that are used to filter the table. Any of
            the specified values must have at least a partial match to an entry from
            the specified 'column' for a row to be kept in the filtered table.

    Returns:
        A new DataFrame containing only the rows that have a partial or complete match
        with any of the specified 'values'.

    Example:
        >>> df = pd.DataFrame({"Modifications": ["phos", "acetyl;phos", "acetyl"]})
        >>> keep_rows_by_partial_match(df, "Modifications", ["phos"])
          Modifications
        0          phos
        1   acetyl;phos
    """
    value_masks = [table[column].str.contains(value, regex=False) for value in values]
    target_mask = np.any(value_masks, axis=0)
    filtered_table = table[target_mask].copy()
    return filtered_table

remove_rows_by_partial_match

remove_rows_by_partial_match(
    table: DataFrame, column: str, values: Iterable[str]
) -> DataFrame

Filter a table to remove rows partially matching any of the specified values.

Parameters:

table (DataFrame, required): The input table that will be filtered.

column (str, required): The name of the column in the 'table' whose entries are checked for partial matches to the values. This column must have the datatype 'str'.

values (Iterable[str], required): An iterable of strings that are used to filter the table. Any of the specified values must have at least a partial match to an entry from the specified 'column' for a row to be removed in the filtered table.

Returns:

DataFrame: A new DataFrame containing no rows that have a partial or complete match with any of the specified 'values'.

Example

>>> df = pd.DataFrame({"Modifications": ["phos", "acetyl;phos", "acetyl"]})
>>> remove_rows_by_partial_match(df, "Modifications", ["phos"])
  Modifications
2        acetyl

Source code in msreport\helper\table.py
def remove_rows_by_partial_match(
    table: pd.DataFrame, column: str, values: Iterable[str]
) -> pd.DataFrame:
    """Filter a table to remove rows partially matching any of the specified values.

    Args:
        table: The input table that will be filtered.
        column: The name of the column in the 'table' whose entries are checked for
            partial matches to the values. This column must have the datatype 'str'.
        values: An iterable of strings that are used to filter the table. Any of
            the specified values must have at least a partial match to an entry from
            the specified 'column' for a row to be removed in the filtered table.

    Returns:
        A new DataFrame containing no rows that have a partial or complete match with
        any of the specified 'values'.

    Example:
        >>> df = pd.DataFrame({"Modifications": ["phos", "acetyl;phos", "acetyl"]})
        >>> remove_rows_by_partial_match(df, "Modifications", ["phos"])
          Modifications
        2        acetyl
    """
    value_masks = [table[column].str.contains(value, regex=False) for value in values]
    target_mask = ~np.any(value_masks, axis=0)
    filtered_table = table[target_mask].copy()
    return filtered_table

rename_mq_reporter_channels

rename_mq_reporter_channels(
    table: DataFrame, channel_names: Sequence[str]
) -> None

Renames reporter channel numbers with sample names.

MaxQuant writes reporter channel names either in the format "Reporter intensity 1" or "Reporter intensity 1 Experiment Name", depending on whether an experiment name was specified. Renames "Reporter intensity", "Reporter intensity count", and "Reporter intensity corrected" columns.

NOTE: This might not work for the peptides.txt table, as there are columns present with the experiment name and also without it.

Source code in msreport\helper\table.py
def rename_mq_reporter_channels(
    table: pd.DataFrame, channel_names: Sequence[str]
) -> None:
    """Renames reporter channel numbers with sample names.

    MaxQuant writes reporter channel names either in the format "Reporter intensity 1"
    or "Reporter intensity 1 Experiment Name", dependent on whether an experiment name
    was specified. Renames "Reporter intensity", "Reporter intensity count", and
    "Reporter intensity corrected" columns.

    NOTE: This might not work for the peptides.txt table, as there are columns present
    with the experiment name and also without it.
    """
    pattern = re.compile("Reporter intensity [0-9]+")
    reporter_columns = list(filter(pattern.match, table.columns.tolist()))
    assert len(reporter_columns) == len(channel_names)

    column_mapping = {}
    base_name = "Reporter intensity "
    for column, channel_name in zip(reporter_columns, channel_names):
        for tag in ["", "count ", "corrected "]:
            old_column = column.replace(f"{base_name}", f"{base_name}{tag}")
            new_column = f"{base_name}{tag}{channel_name}"
            column_mapping[old_column] = new_column
    table.rename(columns=column_mapping, inplace=True)
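
A short usage sketch with made-up channel and sample names, using columns without an experiment name suffix:

import pandas as pd
from msreport.helper import rename_mq_reporter_channels

table = pd.DataFrame(columns=[
    "Reporter intensity 1", "Reporter intensity 2",
    "Reporter intensity corrected 1", "Reporter intensity corrected 2",
    "Reporter intensity count 1", "Reporter intensity count 2",
])

rename_mq_reporter_channels(table, channel_names=["Ctrl_1", "Treat_1"])
# Columns are renamed in place, e.g. "Reporter intensity 1" becomes
# "Reporter intensity Ctrl_1" and "Reporter intensity count 2" becomes
# "Reporter intensity count Treat_1".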

rename_sample_columns

rename_sample_columns(
    table: DataFrame, mapping: dict[str, str]
) -> DataFrame

Renames sample names according to the mapping in a cautious manner.

In general, this function allows the use of 'mapping' with keys that are substrings of any other keys, as well as values that are substrings of any of the keys.

Importantly, if the mapping keys (sample names) are substrings of other column names within the table, unintended renaming of those columns will occur. For instance, when renaming columns ["Abundance", "Intensity A"] with the mapping {"A": "Sample Alpha"}, the columns will be renamed to ["Sample Alphabundance", "Intensity Sample Alpha"].

Parameters:

table (DataFrame, required): Dataframe whose columns will be renamed.

mapping (dict[str, str], required): A mapping of old to new sample names that will be used to replace matching substrings in the columns from table.

Returns:

DataFrame: A copy of the table with renamed columns.

Source code in msreport\helper\table.py
def rename_sample_columns(table: pd.DataFrame, mapping: dict[str, str]) -> pd.DataFrame:
    """Renames sample names according to the mapping in a cautious manner.

    In general, this function allows the use of 'mapping' with keys that are substrings
    of any other keys, as well as values that are substrings of any of the keys.

    Importantly, if the mapping keys (sample names) are substrings of other column names
    within the table, unintended renaming of those columns will occur. For instance,
    when renaming columns ["Abundance", "Intensity A"] with the mapping
    {"A": "Sample Alpha"}, the columns will be renamed to ["Sample Alphabundance",
    "Intensity Sample Alpha"].

    Args:
        table: Dataframe whose columns will be renamed.
        mapping: A mapping of old to new sample names that will be used to replace
            matching substrings in the columns from table.

    Returns:
        A copy of the table with renamed columns.
    """
    sorted_mapping_keys = sorted(mapping, key=len, reverse=True)

    renamed_columns = []
    for column in table.columns:
        for sample_name in sorted_mapping_keys:
            if sample_name in column:
                column = column.replace(sample_name, mapping[sample_name])
                break
        renamed_columns.append(column)

    renamed_table = table.copy()
    renamed_table.columns = renamed_columns
    return renamed_table
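
A short usage sketch with hypothetical column names, showing both the cautious renaming and the caveat described above:

import pandas as pd
from msreport.helper import rename_sample_columns

table = pd.DataFrame(columns=["Intensity S1", "Intensity S10", "iBAQ S1", "Sequence"])

renamed = rename_sample_columns(table, {"S1": "Ctrl_1", "S10": "Treat_2"})
# -> columns: ['Intensity Ctrl_1', 'Intensity Treat_2', 'iBAQ Ctrl_1', 'Sequence']
# Longer keys are applied first, so "S1" does not clip into "S10"; however, a key
# that is a substring of an unrelated column name would still be replaced there.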