Reader

Provides tools for importing and standardizing quantitative proteomics data.

This module offers software-specific reader classes to import raw result tables (e.g., proteins, peptides, ions) from various proteomics software (MaxQuant, FragPipe, Spectronaut) and convert them into a standardized msreport format. Additionally, it provides functions for annotating imported data with biological metadata, such as protein information (e.g., sequence length, molecular weight) and peptide positions, extracted from a ProteinDatabase (FASTA file).

New columns added to imported protein tables:

- Representative protein
- Leading proteins
- Protein reported by software

Standardized column names for quantitative values (if available in the software output):

- Spectral count "sample name"
- Unique spectral count "sample name"
- Total spectral count "sample name"
- Intensity "sample name"
- LFQ intensity "sample name"
- iBAQ intensity "sample name"
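A minimal usage sketch of the intended workflow. The reader construction and import calls follow the class documentation on this page; the commented-out annotation calls only name the helper functions listed below, and their exact signatures are assumptions.

import msreport.reader as reader

# Import standardized tables from a MaxQuant "txt" folder (placeholder path).
mq_reader = reader.MaxQuantReader("path/to/maxquant/txt")
protein_table = mq_reader.import_proteins(drop_protein_info=True)
peptide_table = mq_reader.import_peptides()

# Annotation with a ProteinDatabase built from the search FASTA file;
# the signatures of these helpers are assumptions for illustration only.
# protein_table = reader.add_protein_annotation(protein_table, protein_db)
# peptide_table = reader.add_peptide_positions(peptide_table, protein_db)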

Classes:

- Protein: Abstract protein entry.
- ProteinDatabase: Abstract protein database.
- ResultReader: Base reader class; not functional by itself.
- MaxQuantReader: MaxQuant result reader.
- FragPipeReader: FragPipe result reader.
- SpectronautReader: Spectronaut result reader.

Functions:

- sort_leading_proteins: Returns a copy of 'table' with sorted leading proteins.
- add_protein_annotation: Uses a FASTA protein database to add protein annotation columns.
- add_protein_site_annotation: Uses a FASTA protein database to add protein site annotation columns.
- add_leading_proteins_annotation: Uses a FASTA protein database to add leading protein annotation columns.
- add_protein_site_identifiers: Adds a "Protein site identifier" column to the 'table'.
- add_sequence_coverage: Calculates "Sequence coverage" and adds a new column to the 'protein_table'.
- add_ibaq_intensities: Adds iBAQ intensity columns to the 'table'.
- add_peptide_positions: Adds peptide "Start position" and "End position" columns to the table.
- add_protein_modifications: Adds a "Protein sites" column.
- propagate_representative_protein: Propagates the "Representative protein" column from the source to the target table.
- extract_sample_names: Extracts sample names from columns containing the 'tag' substring.
- extract_maxquant_localization_probabilities: Extracts localization probabilities from a MaxQuant "Probabilities" entry.
- extract_fragpipe_localization_probabilities: Extracts localization probabilities from a FragPipe "Localization" entry.
- extract_spectronaut_localization_probabilities: Extracts localization probabilities from a Spectronaut localization entry.

Protein

Bases: Protocol

Abstract protein entry

ProteinDatabase

Bases: Protocol

Abstract protein database

ResultReader

ResultReader()

Base reader class; not functional by itself.

Source code in msreport\reader.py
def __init__(self):
    self.data_directory = ""
    self.filenames = {}

MaxQuantReader

MaxQuantReader(
    directory: str,
    isobar: bool = False,
    contaminant_tag: str = "CON__",
)

Bases: ResultReader

MaxQuant result reader.

Methods:

- import_proteins: Reads a "proteinGroups.txt" file and returns a processed dataframe, conforming to the MsReport naming convention.
- import_peptides: Reads a "peptides.txt" file and returns a processed dataframe, conforming to the MsReport naming convention.
- import_ion_evidence: Reads an "evidence.txt" file and returns a processed dataframe, conforming to the MsReport naming convention.

Attributes:

Name Type Description
default_filenames dict[str, str]

(class attribute) Look up of filenames for the result files generated by MaxQuant.

sample_column_tags list[str]

(class attribute) Column tags for which an additional column is present per sample.

column_mapping dict[str, str]

(class attribute) Used to rename original column names from MaxQuant according to the MsReport naming convention.

column_tag_mapping OrderedDict[str, str]

(class attribute) Mapping of original sample column tags from MaxQuant to column tags according to the MsReport naming convention, used to replace column names containing the original column tag.

protein_info_columns list[str]

(class attribute) List of columns that contain protein specific information. Used to allow removing all protein specific information prior to changing the representative protein.

protein_info_tags list[str]

(class attribute) List of tags present in columns that contain protein specific information per sample.

data_directory str

Location of the MaxQuant "txt" folder

filenames dict[str, str]

Look up of filenames generated by MaxQuant

contamination_tag str

Substring present in protein IDs to identify them as potential contaminants.

Parameters:

Name Type Description Default
directory str

Location of the MaxQuant "txt" folder.

required
isobar bool

Set to True if quantification strategy was TMT, iTRAQ or similar.

False
contaminant_tag str

Prefix of Protein ID entries to identify contaminants.

'CON__'
Source code in msreport\reader.py
def __init__(
    self, directory: str, isobar: bool = False, contaminant_tag: str = "CON__"
) -> None:
    """Initializes the MaxQuantReader.

    Args:
        directory: Location of the MaxQuant "txt" folder.
        isobar: Set to True if quantification strategy was TMT, iTRAQ or similar.
        contaminant_tag: Prefix of Protein ID entries to identify contaminants.
    """
    self._add_data_directory(directory)
    self.filenames: dict[str, str] = self.default_filenames
    self._isobar: bool = isobar
    self._contaminant_tag: str = contaminant_tag

import_proteins

import_proteins(
    filename: Optional[str] = None,
    rename_columns: bool = True,
    prefix_column_tags: bool = True,
    drop_decoy: bool = True,
    drop_idbysite: bool = True,
    drop_protein_info: bool = False,
) -> DataFrame

Reads a "proteinGroups.txt" file and returns a processed dataframe.

Adds three new protein entry columns to comply with the MsReport convention: "Protein reported by software", "Leading proteins", "Representative protein".

"Protein reported by software" contains the first protein ID from the "Majority protein IDs" column. "Leading proteins" contain all entries from the "Majority protein IDs" column that have the same and highest number of mapped peptides in the "Peptide counts (all)" column, multiple protein entries are separated by ";". "Representative protein" contains the first entry form "Leading proteins".

Several columns in the "proteinGroups.txt" file contain information specific to the reported protein entry. If leading proteins are re-sorted later, it is recommended to remove columns containing protein-specific information by setting 'drop_protein_info=True'.
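The selection rule described above can be illustrated on a single, hypothetical protein group row. This sketch is for illustration only and is not the library's implementation (which is shown in the source code below).

majority_ids = "P12345;Q67890;A11111"   # "Majority protein IDs"
peptide_counts = "12;12;3"              # "Peptide counts (all)"

ids = majority_ids.split(";")
counts = [int(count) for count in peptide_counts.split(";")]
highest = max(counts)

# Keep all IDs that share the highest peptide count.
leading_proteins = ";".join(pid for pid, count in zip(ids, counts) if count == highest)
representative_protein = leading_proteins.split(";")[0]
# leading_proteins -> "P12345;Q67890", representative_protein -> "P12345"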

Parameters:

Name Type Description Default
filename Optional[str]

allows specifying an alternative filename, otherwise the default filename is used.

None
rename_columns bool

If True, columns are renamed according to the MsReport convention; default True.

True
prefix_column_tags bool

If True, column tags such as "Intensity" are added in front of the sample names, e.g. "Intensity sample_name". If False, column tags are added afterwards, e.g. "Sample_name Intensity"; default True.

True
drop_decoy bool

If True, decoy entries are removed and the "Reverse" column is dropped; default True.

True
drop_idbysite bool

If True, protein groups that were only identified by site are removed and the "Only identified by site" column is dropped; default True.

True
drop_protein_info bool

If True, columns containing protein-specific information, such as "Gene names", "Sequence coverage [%]" or "iBAQ peptides", are removed. See MaxQuantReader.protein_info_columns and MaxQuantReader.protein_info_tags for a full list of columns that will be removed. Default False.

False

Returns:

Type Description
DataFrame

A dataframe containing the processed protein table.

Source code in msreport\reader.py
def import_proteins(
    self,
    filename: Optional[str] = None,
    rename_columns: bool = True,
    prefix_column_tags: bool = True,
    drop_decoy: bool = True,
    drop_idbysite: bool = True,
    drop_protein_info: bool = False,
) -> pd.DataFrame:
    """Reads a "proteinGroups.txt" file and returns a processed dataframe.

    Adds three new protein entry columns to comply with the MsReport convention:
    "Protein reported by software", "Leading proteins", "Representative protein".

    "Protein reported by software" contains the first protein ID from the "Majority
    protein IDs" column. "Leading proteins" contain all entries from the "Majority
    protein IDs" column that have the same and highest number of mapped peptides in
    the "Peptide counts (all)" column, multiple protein entries are separated by
    ";". "Representative protein" contains the first entry form "Leading proteins".

    Several columns in the "combined_protein.tsv" file contain information specific
    for the protein entry of the "Protein" column. If leading proteins will be
    re-sorted later, it is recommended to remove columns containing protein specific
    information by setting 'drop_protein_info=True'.

    Args:
        filename: allows specifying an alternative filename, otherwise the default
            filename is used.
        rename_columns: If True, columns are renamed according to the MsReport
            convention; default True.
        prefix_column_tags: If True, column tags such as "Intensity" are added
            in front of the sample names, e.g. "Intensity sample_name". If False,
            column tags are added afterwards, e.g. "Sample_name Intensity"; default
            True.
        drop_decoy: If True, decoy entries are removed and the "Reverse" column is
            dropped; default True.
        drop_idbysite: If True, protein groups that were only identified by site are
            removed and the "Only identified by site" columns is dropped; default
            True.
        drop_protein_info: If True, columns containing protein specific information,
            such as "Gene names", "Sequence coverage [%]" or "iBAQ peptides". See
            MaxQuantReader.protein_info_columns and MaxQuantReader.protein_info_tags
            for a full list of columns that will be removed. Default False.

    Returns:
        A dataframe containing the processed protein table.
    """
    df = self._read_file("proteins" if filename is None else filename)
    df = self._add_protein_entries(df)

    if drop_decoy:
        df = self._drop_decoy(df)
    if drop_idbysite:
        df = self._drop_idbysite(df)
    if drop_protein_info:
        df = self._drop_columns(df, self.protein_info_columns)
        for tag in self.protein_info_tags:
            df = self._drop_columns_by_tag(df, tag)
    if rename_columns:
        df = self._rename_columns(df, prefix_column_tags)
    return df

import_peptides

import_peptides(
    filename: Optional[str] = None,
    rename_columns: bool = True,
    prefix_column_tags: bool = True,
    drop_decoy: bool = True,
) -> DataFrame

Reads a "peptides.txt" file and returns a processed dataframe.

Adds new columns to comply with the MsReport convention: "Protein reported by software" and "Representative protein", both contain the first entry from "Leading razor protein".

Parameters:

Name Type Description Default
filename Optional[str]

allows specifying an alternative filename, otherwise the default filename is used.

None
rename_columns bool

If True, columns are renamed according to the MsReport convention; default True.

True
prefix_column_tags bool

If True, column tags such as "Intensity" are added in front of the sample names, e.g. "Intensity sample_name". If False, column tags are added afterwards, e.g. "Sample_name Intensity"; default True.

True
drop_decoy bool

If True, decoy entries are removed and the "Reverse" column is dropped; default True.

True

Returns:

Type Description
DataFrame

A dataframe containing the processed peptide table.

Source code in msreport\reader.py
def import_peptides(
    self,
    filename: Optional[str] = None,
    rename_columns: bool = True,
    prefix_column_tags: bool = True,
    drop_decoy: bool = True,
) -> pd.DataFrame:
    """Reads a "peptides.txt" file and returns a processed dataframe.

    Adds new columns to comply with the MsReport convention:
    "Protein reported by software" and "Representative protein", both contain the
    first entry from "Leading razor protein".

    Args:
        filename: allows specifying an alternative filename, otherwise the default
            filename is used.
        rename_columns: If True, columns are renamed according to the MsReport
            convention; default True.
        prefix_column_tags: If True, column tags such as "Intensity" are added
            in front of the sample names, e.g. "Intensity sample_name". If False,
            column tags are added afterwards, e.g. "Sample_name Intensity"; default
            True.
        drop_decoy: If True, decoy entries are removed and the "Reverse" column is
            dropped; default True.

    Returns:
        A dataframe containing the processed peptide table.
    """
    # TODO: not tested
    df = self._read_file("peptides" if filename is None else filename)
    df["Protein reported by software"] = _extract_protein_ids(
        df["Leading razor protein"]
    )
    df["Representative protein"] = df["Protein reported by software"]
    # Note that _add_protein_entries would need to be adapted for the peptide table.
    # df = self._add_protein_entries(df)
    if drop_decoy:
        df = self._drop_decoy(df)
    if rename_columns:
        df = self._rename_columns(df, prefix_column_tags)
    return df

import_ion_evidence

import_ion_evidence(
    filename: Optional[str] = None,
    rename_columns: bool = True,
    rewrite_modifications: bool = True,
    drop_decoy: bool = True,
) -> DataFrame

Reads an "evidence.txt" file and returns a processed dataframe.

Adds new columns to comply with the MsReport convention: "Modified sequence", "Modifications", and "Modification localization string". "Protein reported by software" and "Representative protein" both contain the first entry from "Leading razor protein". "Ion ID" contains unique entries for each ion, generated by concatenating the "Modified sequence" and "Charge" columns and, if present, the "Compensation voltage" column.

"Modified sequence" entries contain modifications within square brackets. "Modification" entries are strings in the form of "position:modification_tag", multiple modifications are joined by ";". An example for a modified sequence and a modification entry: "PEPT[Phospho]IDO[Oxidation]", "4:Phospho;7:Oxidation".

"Modification localization string" contains localization probabilities in the format "Mod1@Site1:Probability1,Site2:Probability2;Mod2@Site3:Probability3", e.g. "15.9949@11:1.000;79.9663@3:0.200,4:0.800". Refer to msreport.peptidoform.make_localization_string for details.

Parameters:

Name Type Description Default
filename Optional[str]

Allows specifying an alternative filename, otherwise the default filename is used.

None
rename_columns bool

If True, columns are renamed according to the MsReport convention; default True.

True
rewrite_modifications bool

If True, the peptide format in "Modified sequence" is changed according to the MsReport convention, and a "Modifications" column is added that contains the amino acid positions of all modifications. Requires 'rename_columns' to be True. Default True.

True
drop_decoy bool

If True, decoy entries are removed and the "Reverse" column is dropped; default True.

True

Returns:

Type Description
DataFrame

A dataframe containing the processed ion table.

Source code in msreport\reader.py
def import_ion_evidence(
    self,
    filename: Optional[str] = None,
    rename_columns: bool = True,
    rewrite_modifications: bool = True,
    drop_decoy: bool = True,
) -> pd.DataFrame:
    """Reads an "evidence.txt" file and returns a processed dataframe.

    Adds new columns to comply with the MsReport convention. "Modified sequence",
    "Modifications columns", "Modification localization string". "Protein reported
    by software" and "Representative protein", both contain the first entry from
    "Leading razor protein". "Ion ID" contains unique entries for each ion, which
    are generated by concatenating the "Modified sequence" and "Charge" columns, and
    if present, the "Compensation voltage" column.

    "Modified sequence" entries contain modifications within square brackets.
    "Modification" entries are strings in the form of "position:modification_tag",
    multiple modifications are joined by ";". An example for a modified sequence and
    a modification entry: "PEPT[Phospho]IDO[Oxidation]", "4:Phospho;7:Oxidation".

    "Modification localization string" contains localization probabilities in the
    format "Mod1@Site1:Probability1,Site2:Probability2;Mod2@Site3:Probability3",
    e.g. "15.9949@11:1.000;79.9663@3:0.200,4:0.800". Refer to
    `msreport.peptidoform.make_localization_string` for details.

    Args:
        filename: Allows specifying an alternative filename, otherwise the default
            filename is used.
        rename_columns: If True, columns are renamed according to the MsReport
            convention; default True.
        rewrite_modifications: If True, the peptide format in "Modified sequence" is
            changed according to the MsReport convention, and a "Modifications" is
            added to contains the amino acid position for all modifications.
            Requires 'rename_columns' to be true. Default True.
        drop_decoy: If True, decoy entries are removed and the "Reverse" column is
            dropped; default True.

    Returns:
        A dataframe containing the processed ion table.
    """
    # TODO: not tested
    df = self._read_file("ion_evidence" if filename is None else filename)
    df["Protein reported by software"] = _extract_protein_ids(
        df["Leading razor protein"]
    )
    df["Representative protein"] = df["Protein reported by software"]

    if drop_decoy:
        df = self._drop_decoy(df)
    if rename_columns:
        # Actually there are no column tags as the table is in long format
        df = self._rename_columns(df, prefix_tag=True)
    if rewrite_modifications and rename_columns:
        df = self._add_peptide_modification_entries(df)
        df = self._add_modification_localization_string(df)
        df["Ion ID"] = df["Modified sequence"] + "_c" + df["Charge"].astype(str)
        if "Compensation voltage" in df.columns:
            _cv = df["Compensation voltage"].astype(str)
            df["Ion ID"] = df["Ion ID"] + "_cv" + _cv
    return df

FragPipeReader

FragPipeReader(
    directory: str,
    isobar: bool = False,
    sil: bool = False,
    contaminant_tag: str = "contam_",
)

Bases: ResultReader

FragPipe result reader.

Methods:

- import_design: Depending on the quantification strategy, imports either the manifest file or the experiment annotation file and returns a processed design dataframe.
- import_manifest: Reads a "fragpipe-files.fp-manifest" file and returns a processed design dataframe.
- import_experiment_annotation: Reads an "experiment_annotation" file and returns a processed design dataframe.
- import_proteins: Reads a "combined_protein.tsv" or "protein.tsv" file and returns a processed dataframe, conforming to the MsReport naming convention.
- import_peptides: Reads a "combined_peptide.tsv" or "peptide.tsv" file and returns a processed dataframe, conforming to the MsReport naming convention.
- import_ions: Reads a "combined_ion.tsv" or "ion.tsv" file and returns a processed dataframe, conforming to the MsReport naming convention.
- import_ion_evidence: Reads and concatenates all "ion.tsv" files and returns a processed dataframe, conforming to the MsReport naming convention.
- import_psm_evidence: Reads and concatenates all "psm.tsv" files and returns a processed dataframe, conforming to the MsReport naming convention.

Attributes:

Name Type Description
default_filenames dict[str, str]

(class attribute) Look up of default filenames of the result files generated by FragPipe.

isobar_filenames dict[str, str]

(class attribute) Look up of default filenames of the result files generated by FragPipe, which are relevant when using isobaric quantification.

sample_column_tags list[str]

(class attribute) Tags (column name substrings) that identify sample columns. Sample columns are those for which one unique column is present per sample, for example intensity columns.

column_mapping dict[str, str]

(class attribute) Used to rename original column names from FragPipe according to the MsReport naming convention.

column_tag_mapping OrderedDict[str, str]

(class attribute) Mapping of original sample column tags from FragPipe to column tags according to the MsReport naming convention, used to replace column names containing the original column tag.

protein_info_columns list[str]

(class attribute) List of columns that contain information specific to the leading protein.

protein_info_tags list[str]

(class attribute) List of substrings present in columns that contain information specific to the leading protein.

data_directory str

Location of the folder containing FragPipe result files.

filenames dict[str, str]

Look up of FragPipe result filenames used for importing protein or other tables.

contamination_tag str

Substring present in protein IDs to identify them as potential contaminants.

Parameters:

Name Type Description Default
directory str

Location of the FragPipe result folder

required
isobar bool

Set to True if quantification strategy was TMT, iTRAQ or similar; default False.

False
sil bool

Set to True if the FragPipe result files are from a stable isotope labeling experiment, such as SILAC; default False.

False
contaminant_tag str

Prefix of Protein ID entries to identify contaminants; default "contam_".

'contam_'
Source code in msreport\reader.py
def __init__(
    self,
    directory: str,
    isobar: bool = False,
    sil: bool = False,
    contaminant_tag: str = "contam_",
) -> None:
    """Initializes the FragPipeReader.

    Args:
        directory: Location of the FragPipe result folder
        isobar: Set to True if quantification strategy was TMT, iTRAQ or similar;
            default False.
        sil: Set to True if the FragPipe result files are from a stable isotope
            labeling experiment, such as SILAC; default False.
        contaminant_tag: Prefix of Protein ID entries to identify contaminants;
            default "contam_".
    """
    if sil and isobar:
        raise ValueError("Cannot set both 'isobar' and 'sil' to True.")
    self._add_data_directory(directory)
    self._isobar: bool = isobar
    self._sil: bool = sil
    self._contaminant_tag: str = contaminant_tag

    self.filenames = self.default_filenames.copy()
    if sil:
        self.filenames.update(self.sil_filenames)
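A minimal usage sketch; the folder path is a placeholder and all arguments shown are documented above.

from msreport.reader import FragPipeReader

# Label-free FragPipe results
fp_reader = FragPipeReader("path/to/fragpipe_results", contaminant_tag="contam_")
design = fp_reader.import_design(sort=True)                 # uses the fp-manifest file
protein_table = fp_reader.import_proteins(drop_protein_info=True)

# For TMT/iTRAQ results, set isobar=True so that import_design reads the
# experiment annotation file instead of the manifest.
# fp_reader = FragPipeReader("path/to/fragpipe_results", isobar=True)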

import_design

import_design(sort: bool = False) -> DataFrame

Reads the experimental design file and returns a processed design dataframe.

Depending on the quantification strategy (isobaric or label-free/sil), either the experiment annotation file or the manifest file is imported.

Parameters:

Name Type Description Default
sort bool

If True, the design dataframe is sorted by "Experiment" and "Replicate"; default False.

False
Source code in msreport\reader.py
def import_design(self, sort: bool = False) -> pd.DataFrame:
    """Reads the experimental design file and returns a processed design dataframe.

    Depending on the quantification strategy (isobaric or label-free/sil), either
    the experiment annotation file or the manifest file is imported.

    Args:
        sort: If True, the design dataframe is sorted by "Experiment" and
            "Replicate"; default False.
    """
    if self._isobar:
        return self.import_experiment_annotation(sort=sort)
    else:
        return self.import_manifest(sort=sort)

import_manifest

import_manifest(
    filename: Optional[str] = None, sort: bool = False
) -> DataFrame

Reads a 'fp-manifest' file and returns a processed design dataframe.

The manifest columns "Path", "Experiment", and "Bioreplicate" are mapped to the design table columns "Rawfile", "Experiment", and "Replicate". The "Rawfile" column is extracted as the filename from the full path. The "Sample" column is generated by combining "Experiment" and "Replicate" with an underscore (e.g., "Experiment_Replicate"), except when "Replicate" is empty, in which case "Sample" is set to "Experiment". If "Experiment" is missing, it is set to "exp" by default.

Parameters:

Name Type Description Default
filename Optional[str]

Allows specifying an alternative filename, otherwise the default filename is used.

None
sort bool

If True, the design dataframe is sorted by "Experiment" and "Replicate"; default False.

False

Returns:

Type Description
DataFrame

A dataframe containing the processed design table with columns:

DataFrame

"Sample", "Experiment", "Replicate", "Rawfile".

Raises:

Type Description
FileNotFoundError

If the specified manifest file does not exist.
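For illustration, this is how a single, hypothetical manifest row is turned into a design row following the rules described above (not the library's implementation, which is shown in the source code below).

path, experiment, bioreplicate = "D:\\raw\\run_01.raw", "ctrl", "1"

rawfile = path.replace("\\", "/").split("/")[-1]       # -> "run_01.raw"
experiment = experiment if experiment else "exp"       # missing experiment -> "exp"
sample = (experiment + "_" + bioreplicate) if bioreplicate else experiment
# sample -> "ctrl_1"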

Source code in msreport\reader.py
def import_manifest(
    self, filename: Optional[str] = None, sort: bool = False
) -> pd.DataFrame:
    """Read a 'fp-manifest' file and returns a processed design dataframe.

    The manifest columns "Path", "Experiment", and "Bioreplicate" are mapped to the
    design table columns "Rawfile", "Experiment", and "Replicate". The "Rawfile"
    column is extracted as the filename from the full path. The "Sample" column is
    generated by combining "Experiment" and "Replicate" with an underscore
    (e.g., "Experiment_Replicate"), except when "Replicate" is empty, in which case
    "Sample" is set to "Experiment". If "Experiment" is missing, it is set to "exp"
    by default.

    Args:
        filename: Allows specifying an alternative filename, otherwise the default
            filename is used.
        sort: If True, the design dataframe is sorted by "Experiment" and
            "Replicate"; default False.

    Returns:
        A dataframe containing the processed design table with columns:
        "Sample", "Experiment", "Replicate", "Rawfile".

    Raises:
        FileNotFoundError: If the specified manifest file does not exist.
    """
    if filename is None:
        filepath = os.path.join(self.data_directory, self.filenames["manifest"])
    else:
        filepath = os.path.join(self.data_directory, filename)
    if not os.path.exists(filepath):
        raise FileNotFoundError(
            f"File '{filepath}' does not exist. Please check the file path."
        )
    fp_manifest = (
        pd.read_csv(
            filepath, sep="\t", header=None, na_values=[""], keep_default_na=False
        )
        .fillna("")
        .astype(str)
    )
    fp_manifest.columns = ["Path", "Experiment", "Bioreplicate", "Data type"]

    design = pd.DataFrame(
        {
            "Sample": "",
            "Experiment": fp_manifest["Experiment"],
            "Replicate": fp_manifest["Bioreplicate"],
            "Rawfile": fp_manifest["Path"].apply(
                # Required to handle Windows and Unix style paths on either system
                lambda x: x.replace("\\", "/").split("/")[-1]
            ),
        }
    )
    # FragPipe uses "exp" for missing 'Experiment' values
    design.loc[design["Experiment"] == "", "Experiment"] = "exp"
    # FragPipe combines 'Experiment' + "_" + 'Replicate' into 'Sample', except when
    # 'Replicate' is empty, in which case 'Sample' is set to 'Experiment'.
    design["Sample"] = design["Experiment"] + "_" + design["Replicate"]
    design.loc[design["Replicate"] == "", "Sample"] = design["Experiment"]

    if sort:
        design.sort_values(by=["Experiment", "Replicate"], inplace=True)
        design.reset_index(drop=True, inplace=True)
    return design

import_experiment_annotation

import_experiment_annotation(
    filename: Optional[str] = None, sort: bool = False
) -> DataFrame

Reads an 'experiment_annotation' file and returns a processed design dataframe.

The annotation columns "sample", "channel", and "plex" are mapped to the design table columns "Sample", "Channel", and "Plex". The "Experiment" and "Replicate" columns are extracted from the "Sample" column by splitting at the last underscore; if there is no underscore, "Replicate" is set to an empty string.

Note that this convention of splitting the "Sample" column does conform to the FragPipe convention, but FragPipe does not enforce it for the experiment annotation file.

Parameters:

Name Type Description Default
filename Optional[str]

Allows specifying an alternative filename, otherwise the default filename is used.

None
sort bool

If True, the design dataframe is sorted by "Experiment" and "Replicate"; default False.

False

Returns:

Type Description
DataFrame

A dataframe containing the processed design table with columns:

DataFrame

"Sample", "Experiment", "Replicate", "Channel", and "Plex".

Raises:

Type Description
FileNotFoundError

If the specified experiment annotation file does not exist.
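For illustration, the splitting convention described above applied to a single, hypothetical "sample" entry (not the library's implementation, which is shown in the source code below).

sample = "ctrl_heavy_2"
parts = sample.rsplit("_", 1)                      # split at the last underscore
experiment = parts[0]                              # -> "ctrl_heavy"
replicate = parts[1] if len(parts) > 1 else ""     # -> "2"; "" if no underscore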

Source code in msreport\reader.py
def import_experiment_annotation(
    self, filename: Optional[str] = None, sort: bool = False
) -> pd.DataFrame:
    """Read a 'experiment_annotation' file and returns a processed design dataframe.

    The annotation columns "sample", "channel", and "plex" are mapped to the design
    table columns "Sample", "Channel", and "Plex". The "Experiment" and "Replicate"
    columns are extracted from the "Sample" column by splitting at the last
    underscore, if there is no underscore, "Replicate" is set to an empty string.

    Note that this convention of splitting the "Sample" column does confirm to the
    FragPipe convention, but FragPipe does not enforce it for the experiment
    annotation file.

    Args:
        filename: Allows specifying an alternative filename, otherwise the default
            filename is used.
        sort: If True, the design dataframe is sorted by "Experiment" and
            "Replicate"; default False.

    Returns:
        A dataframe containing the processed design table with columns:
        "Sample", "Experiment", "Replicate", "Channel", and "Plex".

    Raises:
        FileNotFoundError: If the specified manifest file does not exist.
    """
    if filename is None:
        filepath = os.path.join(
            self.data_directory, self.filenames["experiment_annotation"]
        )
    else:
        filepath = os.path.join(self.data_directory, filename)
    if not os.path.exists(filepath):
        raise FileNotFoundError(
            f"File '{filepath}' does not exist. Please check the file path."
        )

    annotation = pd.read_csv(filepath, sep="\t")

    design = pd.DataFrame(
        {
            "Sample": annotation["sample"],
            "Experiment": annotation["sample"].str.rsplit("_", n=1).str[0],
            "Replicate": annotation["sample"].str.rsplit("_", n=1).str[1],
            "Channel": annotation["channel"],
            "Plex": annotation["plex"],
        }
    )
    design["Replicate"] = design["Replicate"].fillna("")

    if sort:
        design.sort_values(by=["Experiment", "Replicate"], inplace=True)
        design.reset_index(drop=True, inplace=True)

    return design

import_proteins

import_proteins(
    filename: Optional[str] = None,
    rename_columns: bool = True,
    prefix_column_tags: bool = True,
    drop_protein_info: bool = False,
) -> DataFrame

Reads a "combined_protein.tsv" or "protein.tsv" file and returns a processed dataframe.

Adds four protein entry columns to comply with the MsReport convention: "Protein reported by software", "Leading proteins", "Representative protein", "Potential contaminant".

"Protein reported by software" contains the protein ID extracted from the "Protein" column. "Leading proteins" contains the combined protein IDs extracted from the "Protein" and "Indistinguishable Proteins" columns, multiple entries are separated by ";". "Representative protein" contains the first entry form "Leading proteins".

Several columns in the "combined_protein.tsv" file contain information specific for the protein entry of the "Protein" column. If leading proteins will be re-sorted later, it is recommended to remove columns containing protein specific information by setting 'drop_protein_info=True'..

Parameters:

Name Type Description Default
filename Optional[str]

Allows specifying an alternative filename, otherwise the default filename is used.

None
rename_columns bool

If True, columns are renamed according to the MsReport convention; default True.

True
prefix_column_tags bool

If True, column tags such as "Intensity" are added in front of the sample names, e.g. "Intensity sample_name". If False, column tags are added afterwards, e.g. "Sample_name Intensity"; default True.

True
drop_protein_info bool

If True, columns containing protein-specific information, such as "Gene" or "Protein Length", are removed. See FragPipeReader.protein_info_columns and FragPipeReader.protein_info_tags for a full list of columns that will be removed. Default False.

False

Returns:

Type Description
DataFrame

A dataframe containing the processed protein table.
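A brief usage sketch; the folder path is a placeholder. Dropping the protein-specific columns is recommended when the leading proteins are re-sorted afterwards, e.g. with sort_leading_proteins, whose exact signature is not shown on this page and is therefore only hinted at in a comment.

from msreport.reader import FragPipeReader

fp_reader = FragPipeReader("path/to/fragpipe_results")
protein_table = fp_reader.import_proteins(drop_protein_info=True)
# protein_table = sort_leading_proteins(protein_table, ...)  # assumed call, see the function list above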

Source code in msreport\reader.py
def import_proteins(
    self,
    filename: Optional[str] = None,
    rename_columns: bool = True,
    prefix_column_tags: bool = True,
    drop_protein_info: bool = False,
) -> pd.DataFrame:
    """Reads a "combined_protein.tsv" or "protein.tsv" file and returns a processed
    dataframe.

    Adds four protein entry columns to comply with the MsReport convention:
    "Protein reported by software", "Leading proteins", "Representative protein",
    "Potential contaminant".

    "Protein reported by software" contains the protein ID extracted from the
    "Protein" column. "Leading proteins" contains the combined protein IDs extracted
    from the "Protein" and "Indistinguishable Proteins" columns, multiple entries
    are separated by ";". "Representative protein" contains the first entry form
    "Leading proteins".

    Several columns in the "combined_protein.tsv" file contain information specific
    for the protein entry of the "Protein" column. If leading proteins will be
    re-sorted later, it is recommended to remove columns containing protein specific
    information by setting 'drop_protein_info=True'..

    Args:
        filename: Allows specifying an alternative filename, otherwise the default
            filename is used.
        rename_columns: If True, columns are renamed according to the MsReport
            convention; default True.
        prefix_column_tags: If True, column tags such as "Intensity" are added
            in front of the sample names, e.g. "Intensity sample_name". If False,
            column tags are added afterwards, e.g. "Sample_name Intensity"; default
            True.
        drop_protein_info: If True, columns containing protein specific information,
            such as "Gene" or "Protein Length". See
            FragPipeReader.protein_info_columns and FragPipeReader.protein_info_tags
            for a full list of columns that will be removed. Default False.

    Returns:
        A dataframe containing the processed protein table.
    """
    df = self._read_file("proteins" if filename is None else filename)
    df = self._add_protein_entries(df)
    if drop_protein_info:
        df = self._drop_columns(df, self.protein_info_columns)
        for tag in self.protein_info_tags:
            df = self._drop_columns_by_tag(df, tag)
    if rename_columns:
        df = self._rename_columns(df, prefix_column_tags)
    return df

import_peptides

import_peptides(
    filename: Optional[str] = None,
    rename_columns: bool = True,
    prefix_column_tags: bool = True,
) -> DataFrame

Reads a "combined_peptides.txt" file and returns a processed dataframe.

Adds a new column to comply with the MsReport convention: "Protein reported by software"

Parameters:

Name Type Description Default
filename Optional[str]

allows specifying an alternative filename, otherwise the default filename is used.

None
rename_columns bool

If True, columns are renamed according to the MsReport convention; default True.

True
prefix_column_tags bool

If True, column tags such as "Intensity" are added in front of the sample names, e.g. "Intensity sample_name". If False, column tags are added afterwards, e.g. "Sample_name Intensity"; default True.

True

Returns:

Type Description
DataFrame

A dataframe containing the processed peptide table.

Source code in msreport\reader.py
def import_peptides(
    self,
    filename: Optional[str] = None,
    rename_columns: bool = True,
    prefix_column_tags: bool = True,
) -> pd.DataFrame:
    """Reads a "combined_peptides.txt" file and returns a processed dataframe.

    Adds a new column to comply with the MsReport convention:
    "Protein reported by software"

    Args:
        filename: allows specifying an alternative filename, otherwise the default
            filename is used.
        rename_columns: If True, columns are renamed according to the MsReport
            convention; default True.
        prefix_column_tags: If True, column tags such as "Intensity" are added
            in front of the sample names, e.g. "Intensity sample_name". If False,
            column tags are added afterwards, e.g. "Sample_name Intensity"; default
            True.

    Returns:
        A dataframe containing the processed peptide table.
    """
    # TODO: not tested
    df = self._read_file("peptides" if filename is None else filename)
    df["Protein reported by software"] = _extract_protein_ids(df["Protein"])
    df["Representative protein"] = df["Protein reported by software"]
    df["Mapped Proteins"] = self._collect_mapped_proteins(df)
    # Note that _add_protein_entries would need to be adapted for the peptide table.
    # df = self._add_protein_entries(df)
    if rename_columns:
        df = self._rename_columns(df, prefix_column_tags)
    return df

import_ions

import_ions(
    filename: Optional[str] = None,
    rename_columns: bool = True,
    rewrite_modifications: bool = True,
    prefix_column_tags: bool = True,
) -> DataFrame

Reads a "combined_ion.tsv" or "ion.tsv" file and returns a processed dataframe.

Adds new columns to comply with the MsReport convention: "Modified sequence" and "Modifications". "Protein reported by software" and "Representative protein" both contain the protein ID extracted from the "Protein" column. "Ion ID" contains unique entries for each ion, generated by concatenating the "Modified sequence" and "Charge" columns and, if present, the "Compensation voltage" column.

"Modified sequence" entries contain modifications within square brackets. "Modification" entries are strings in the form of "position:modification_text"; multiple modifications are joined by ";". An example for a modified sequence and a modification entry: "PEPT[Phospho]IDO[Oxidation]", "4:Phospho;7:Oxidation".

Note that currently the format of the modification itself, as well as the site localization probabilities, are not modified, and no protein site entries are added.

Parameters:

Name Type Description Default
filename Optional[str]

Allows specifying an alternative filename, otherwise the default filename is used.

None
rename_columns bool

If True, columns are renamed according to the MsReport convention; default True.

True
rewrite_modifications bool

If True, the peptide format in "Modified sequence" is changed according to the MsReport convention, and a "Modifications" column is added that contains the amino acid positions of all modifications. Requires 'rename_columns' to be True. Default True.

True
prefix_column_tags bool

If True, column tags such as "Intensity" are added in front of the sample names, e.g. "Intensity sample_name". If False, column tags are added afterwards, e.g. "Sample_name Intensity"; default True.

True

Returns:

Type Description
DataFrame

A DataFrame containing the processed ion table.

Source code in msreport\reader.py
def import_ions(
    self,
    filename: Optional[str] = None,
    rename_columns: bool = True,
    rewrite_modifications: bool = True,
    prefix_column_tags: bool = True,
) -> pd.DataFrame:
    """Reads a "combined_ion.tsv" or "ion.tsv" file and returns a processed
    dataframe.

    Adds new columns to comply with the MsReport convention. "Modified sequence"
    and "Modifications columns". "Protein reported by software" and "Representative
    protein", both contain the first entry from "Leading razor protein". "Ion ID"
    contains unique entries for each ion, which are generated by concatenating the
    "Modified sequence" and "Charge" columns, and if present, the
    "Compensation voltage" column.

    "Modified sequence" entries contain modifications within square brackets.
    "Modification" entries are strings in the form of "position:modification_text",
    multiple modifications are joined by ";". An example for a modified sequence and
    a modification entry: "PEPT[Phospho]IDO[Oxidation]", "4:Phospho;7:Oxidation".

    Note that currently the format of the modification itself, as well as the
    site localization probability are not modified; and no protein site entries are
    added.

    Args:
        filename: Allows specifying an alternative filename, otherwise the default
            filename is used.
        rename_columns: If True, columns are renamed according to the MsReport
            convention; default True.
        rewrite_modifications: If True, the peptide format in "Modified sequence" is
            changed according to the MsReport convention, and a "Modifications" is
            added to contains the amino acid position for all modifications.
            Requires 'rename_columns' to be true. Default True.
        prefix_column_tags: If True, column tags such as "Intensity" are added
            in front of the sample names, e.g. "Intensity sample_name". If False,
            column tags are added afterwards, e.g. "Sample_name Intensity"; default
            True.

    Returns:
        A DataFrame containing the processed ion table.
    """
    # TODO: not tested #
    df = self._read_file("ions" if filename is None else filename)

    # FUTURE: replace this by _add_protein_entries(df, False) if FragPipe adds
    #         'Indistinguishable Proteins' to the ion table.
    df["Protein reported by software"] = _extract_protein_ids(df["Protein"])
    df["Representative protein"] = df["Protein reported by software"]
    df["Mapped Proteins"] = self._collect_mapped_proteins(df)

    if rename_columns:
        df = self._rename_columns(df, prefix_column_tags)
    if rewrite_modifications and rename_columns:
        df = self._add_peptide_modification_entries(df)
        df = self._add_modification_localization_string(df, prefix_column_tags)
        df["Ion ID"] = df["Modified sequence"] + "_c" + df["Charge"].astype(str)
        if "Compensation voltage" in df.columns:
            _cv = df["Compensation voltage"].astype(str)
            df["Ion ID"] = df["Ion ID"] + "_cv" + _cv

    return df

import_ion_evidence

import_ion_evidence(
    filename: Optional[str] = None,
    rename_columns: bool = True,
    rewrite_modifications: bool = True,
    prefix_column_tags: bool = True,
) -> DataFrame

Reads and concatenates all "ion.tsv" files and returns a processed dataframe.

Adds new columns to comply with the MsReport convention: "Modified sequence", "Modifications", and "Modification localization string". "Protein reported by software" and "Representative protein" both contain the protein ID extracted from the "Protein" column. "Ion ID" contains unique entries for each ion, generated by concatenating the "Modified sequence" and "Charge" columns and, if present, the "Compensation voltage" column.

"Modified sequence" entries contain modifications within square brackets. "Modification" entries are strings in the form of "position:modification_text", multiple modifications are joined by ";". An example for a modified sequence and a modification entry: "PEPT[Phospho]IDO[Oxidation]", "4:Phospho;7:Oxidation".

"Modification localization string" contains localization probabilities in the format "Mod1@Site1:Probability1,Site2:Probability2;Mod2@Site3:Probability3", e.g. "15.9949@11:1.000;79.9663@3:0.200,4:0.800". Refer to msreport.peptidoform.make_localization_string for details.

Parameters:

Name Type Description Default
filename Optional[str]

Allows specifying an alternative filename, otherwise the default filename is used.

None
rename_columns bool

If True, columns are renamed according to the MsReport convention; default True.

True
rewrite_modifications bool

If True, the peptide format in "Modified sequence" is changed according to the MsReport convention, and a "Modifications" column is added that contains the amino acid positions of all modifications. Requires 'rename_columns' to be True. Default True.

True
prefix_column_tags bool

If True, column tags such as "Intensity" are added in front of the sample names, e.g. "Intensity sample_name". If False, column tags are added afterwards, e.g. "Sample_name Intensity"; default True.

True

Returns:

Type Description
DataFrame

A DataFrame containing the processed ion table.
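A brief usage sketch; the method collects the "ion.tsv" files from the per-sample subfolders of the result directory and adds a "Sample" column with the subfolder name (see the source code below). The folder path is a placeholder.

from msreport.reader import FragPipeReader

fp_reader = FragPipeReader("path/to/fragpipe_results")
ion_evidence = fp_reader.import_ion_evidence(rewrite_modifications=True)
samples = ion_evidence["Sample"].unique()   # one entry per sample subfolder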

Source code in msreport\reader.py
def import_ion_evidence(
    self,
    filename: Optional[str] = None,
    rename_columns: bool = True,
    rewrite_modifications: bool = True,
    prefix_column_tags: bool = True,
) -> pd.DataFrame:
    """Reads and concatenates all "ion.tsv" files and returns a processed dataframe.

    Adds new columns to comply with the MsReport convention. "Modified sequence",
    "Modifications", and "Modification localization string" columns. "Protein
    reported by software" and "Representative protein", both contain the first entry
    from "Leading razor protein". "Ion ID" contains unique entries for each ion,
    which are generated by concatenating the "Modified sequence" and "Charge"
    columns, and if present, the "Compensation voltage" column.

    "Modified sequence" entries contain modifications within square brackets.
    "Modification" entries are strings in the form of "position:modification_text",
    multiple modifications are joined by ";". An example for a modified sequence and
    a modification entry: "PEPT[Phospho]IDO[Oxidation]", "4:Phospho;7:Oxidation".

    "Modification localization string" contains localization probabilities in the
    format "Mod1@Site1:Probability1,Site2:Probability2;Mod2@Site3:Probability3",
    e.g. "15.9949@11:1.000;79.9663@3:0.200,4:0.800". Refer to
    `msreport.peptidoform.make_localization_string` for details.

    Args:
        filename: Allows specifying an alternative filename, otherwise the default
            filename is used.
        rename_columns: If True, columns are renamed according to the MsReport
            convention; default True.
        rewrite_modifications: If True, the peptide format in "Modified sequence" is
            changed according to the MsReport convention, and a "Modifications" is
            added to contains the amino acid position for all modifications.
            Requires 'rename_columns' to be true. Default True.
        prefix_column_tags: If True, column tags such as "Intensity" are added
            in front of the sample names, e.g. "Intensity sample_name". If False,
            column tags are added afterwards, e.g. "Sample_name Intensity"; default
            True.

    Returns:
        A DataFrame containing the processed ion table.
    """
    # TODO: not tested #

    # --- Get paths of all ion.tsv files --- #
    if filename is None:
        filename = self.default_filenames["ion_evidence"]

    ion_table_paths = []
    for path in pathlib.Path(self.data_directory).iterdir():
        ion_table_path = path / filename
        if path.is_dir() and ion_table_path.exists():
            ion_table_paths.append(ion_table_path)

    # --- like self._read_file --- #
    ion_tables = []
    for filepath in ion_table_paths:
        table = pd.read_csv(filepath, sep="\t", low_memory=False)
        str_cols = table.select_dtypes(include=["object"]).columns
        table.loc[:, str_cols] = table.loc[:, str_cols].fillna("")

        table["Sample"] = filepath.parent.name
        ion_tables.append(table)
    df = pd.concat(ion_tables, ignore_index=True)

    # --- Process dataframe --- #
    df["Ion ID"] = df["Modified Sequence"] + "_c" + df["Charge"].astype(str)
    if "Compensation Voltage" in df.columns:
        df["Ion ID"] = df["Ion ID"] + "_cv" + df["Compensation Voltage"].astype(str)
    # FUTURE: replace this by _add_protein_entries(df, False) if FragPipe adds
    #         'Indistinguishable Proteins' to the ion table.
    df["Protein reported by software"] = _extract_protein_ids(df["Protein"])
    df["Representative protein"] = df["Protein reported by software"]
    df["Mapped Proteins"] = self._collect_mapped_proteins(df)

    if rename_columns:
        df = self._rename_columns(df, prefix_column_tags)
    if rewrite_modifications and rename_columns:
        df = self._add_peptide_modification_entries(df)
        df = self._add_modification_localization_string(df, prefix_column_tags)
    return df

import_psm_evidence

import_psm_evidence(
    filename: Optional[str] = None,
    rename_columns: bool = True,
    rewrite_modifications: bool = True,
) -> DataFrame

Concatenate all "psm.tsv" files and return a processed dataframe.

Parameters:

Name Type Description Default
filename Optional[str]

Allows specifying an alternative filename, otherwise the default filename is used.

None
rename_columns bool

If True, columns are renamed according to the MsReport convention; default True.

True
rewrite_modifications bool

If True, the peptide format in "Modified sequence" is changed according to the MsReport convention, and a "Modifications" column is added that contains the amino acid positions of all modifications. Requires 'rename_columns' to be True. Default True.

True

Returns:

Type Description
DataFrame

A DataFrame containing the processed psm evidence tables.

Source code in msreport\reader.py
def import_psm_evidence(
    self,
    filename: Optional[str] = None,
    rename_columns: bool = True,
    rewrite_modifications: bool = True,
) -> pd.DataFrame:
    """Concatenate all "psm.tsv" files and return a processed dataframe.

    Args:
        filename: Allows specifying an alternative filename, otherwise the default
            filename is used.
        rename_columns: If True, columns are renamed according to the MsReport
            convention; default True.
        rewrite_modifications: If True, the peptide format in "Modified sequence" is
            changed according to the MsReport convention, and a "Modifications" is
            added to contains the amino acid position for all modifications.
            Requires 'rename_columns' to be true. Default True.

    Returns:
        A DataFrame containing the processed psm evidence tables.
    """
    if filename is None:
        filename = self.default_filenames["psm_evidence"]

    psm_table_paths = []
    for path in pathlib.Path(self.data_directory).iterdir():
        psm_table_path = path / filename
        if path.is_dir() and psm_table_path.exists():
            psm_table_paths.append(psm_table_path)

    psm_tables = []
    for filepath in psm_table_paths:
        table = pd.read_csv(filepath, sep="\t", low_memory=False)
        str_cols = table.select_dtypes(include=["object"]).columns
        table.loc[:, str_cols] = table.loc[:, str_cols].fillna("")

        table["Sample"] = filepath.parent.name
        psm_tables.append(table)
    df = pd.concat(psm_tables, ignore_index=True)

    df["Protein reported by software"] = _extract_protein_ids(df["Protein"])
    df["Representative protein"] = df["Protein reported by software"]
    df["Mapped Proteins"] = self._collect_mapped_proteins(df)

    if rename_columns:
        df = self._rename_columns(df, prefix_tag=True)
    if rewrite_modifications and rename_columns:
        mod_entries = _generate_modification_entries_from_assigned_modifications(
            df["Peptide sequence"], df["Assigned Modifications"]
        )
        df["Modified sequence"] = mod_entries["Modified sequence"]
        df["Modifications"] = mod_entries["Modifications"]
        df = self._add_modification_localization_string_to_psm_evidence(df)
    return df

SpectronautReader

SpectronautReader(
    directory: str, contaminant_tag: str = "contam_"
)

Bases: ResultReader

Spectronaut result reader.

Methods:

- import_proteins: Reads an LFQ protein report file and returns a processed dataframe, conforming to the MsReport naming convention.
- import_design: Reads a ConditionSetup file and returns a processed dataframe, containing the default columns of an MsReport experimental design table.

Attributes:

Name Type Description
default_filetags dict[str, str]

(class attribute) Look up of default file tags for the outputs generated by Spectronaut.

sample_column_tags list[str]

(class attribute) Tags (column name substrings) that identify sample columns. Sample columns are those for which one unique column is present per sample, for example intensity columns.

column_mapping dict[str, str]

(class attribute) Used to rename original column names from Spectronaut according to the MsReport naming convention.

column_tag_mapping OrderedDict[str, str]

(class attribute) Mapping of original sample column tags from Spectronaut to column tags according to the MsReport naming convention, used to replace column names containing the original column tag.

protein_info_columns list[str]

(class attribute) List of columns that contain information specific to the leading protein.

protein_info_tags list[str]

(class attribute) List of substrings present in columns that contain information specific to the leading protein.

data_directory str

Location of the folder containing Spectronaut result files.

filetags dict[str, str]

Look up of file tags used for matching files during the import of protein or other tables.

contamination_tag str

Substring present in protein IDs to identify them as potential contaminants.

Parameters:

Name Type Description Default
directory str

Location of the Spectronaut result folder.

required
contaminant_tag str

Prefix of Protein ID entries to identify contaminants; default "contam_".

'contam_'
Source code in msreport\reader.py
def __init__(self, directory: str, contaminant_tag: str = "contam_") -> None:
    """Initializes the SpectronautReader.

    Args:
        directory: Location of the Spectronaut result folder.
        contaminant_tag: Prefix of Protein ID entries to identify contaminants;
            default "contam_".
    """
    self.data_directory = directory
    self.filetags: dict[str, str] = self.default_filetags
    self.filenames = {}
    self._contaminant_tag: str = contaminant_tag

import_design

import_design(
    filename: Optional[str] = None,
    filetag: Optional[str] = None,
) -> DataFrame

Reads a ConditionSetup file and returns an experimental design table.

The following columns from the Spectronaut ConditionSetup file will be imported into the design table and renamed:
Replicate -> Replicate
Condition -> Experiment
File Name -> Filename
Run Label -> Run label

In addition, a "Sample" column is added, containing the values from the Experiment and Replicate columns separated by an underscore.

If neither filename nor filetag is specified, the default file tag "conditionsetup" is used to select a file from the data directory. If no file or multiple files match, an exception is thrown. The check for the presence of the file tag is not case sensitive.

Parameters:

Name Type Description Default
filename Optional[str]

Optional, allows specifying a specific file that will be imported.

None
filetag Optional[str]

Optional, can be used to select a file that contains the filetag as a substring, instead of specifying a filename.

None

Returns:

Type Description
DataFrame

A dataframe containing the processed design table.

Source code in msreport\reader.py
def import_design(
    self, filename: Optional[str] = None, filetag: Optional[str] = None
) -> pd.DataFrame:
    """Reads a ConditionSetup file and returns an experimental design table.

    The following columns from the Spectronaut ConditionSetup file will be imported
    to the design table and renamed:
        Replicate -> Replicate
        Condition -> Experiment
        File Name -> Filename
        Run Label -> Run label

    In addition, a "Sample" is added containing values from the Experiment and
    Replicate columns, separated by an underscore.

    If neither filename nor filetag is specified, the default file tag
    "conditionsetup" is used to select a file from the data directory. If no file
    or multiple files match, an exception is thrown. The check for the presence of
    the file tag is not case sensitive.

    Args:
        filename: Optional, allows specifying a specific file that will be imported.
        filetag: Optional, can be used to select a file that contains the filetag as
            a substring, instead of specifying a filename.

    Returns:
        A dataframe containing the processed design table.
    """
    filetag = self.filetags["design"] if filetag is None else filetag
    filenames = _find_matching_files(
        self.data_directory,
        filename=filename,
        filetag=filetag,
        extensions=["xls", "tsv", "csv"],
    )
    if len(filenames) == 0:
        raise FileNotFoundError("No matching file found.")
    elif len(filenames) > 1:
        exception_message_lines = [
            f"Multiple matching files found in: {self.data_directory}",
            "One of the report filenames must be specified manually:",
        ]
        exception_message_lines.extend(filenames)
        exception_message = "\n".join(exception_message_lines)
        raise ValueError(exception_message)
    else:
        filename = filenames[0]

    df = self._read_file(filename)
    df["Sample"] = df["Condition"].astype(str) + "_" + df["Replicate"].astype(str)
    df = pd.DataFrame(
        {
            "Sample": df["Sample"].astype(str),
            "Replicate": df["Replicate"].astype(str),
            "Experiment": df["Condition"].astype(str),
            "Filename": df["File Name"].astype(str),
            "Run label": df["Run Label"].astype(str),
        }
    )
    return df
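
A short usage sketch for the design import; the result folder path is a placeholder, and the printed columns are the ones listed above.

from msreport.reader import SpectronautReader

reader = SpectronautReader("path/to/spectronaut_results", contaminant_tag="contam_")
design = reader.import_design()  # selects the single file matching the "conditionsetup" tag
print(design[["Sample", "Replicate", "Experiment", "Filename", "Run label"]])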

import_proteins

import_proteins(
    filename: Optional[str] = None,
    filetag: Optional[str] = None,
    rename_columns: bool = True,
    prefix_column_tags: bool = True,
    drop_protein_info: bool = True,
) -> DataFrame

Reads a Spectronaut protein report file and returns a processed DataFrame.

Adds four protein entry columns to comply with the MsReport convention: "Protein reported by software", "Leading proteins", "Representative protein", "Potential contaminant".

"Protein reported by software" and "Representative protein" contain the first entry from the "PG.ProteinAccessions" column, and "Leading proteins" contains all entries from this column. Multiple leading protein entries are separated by ";".

Several columns in the Spectronaut report file can contain information specific for the leading protein entry. If leading proteins will be re-sorted later, it is recommended to remove columns containing protein specific information by setting 'drop_protein_info=True'.

Parameters:

Name Type Description Default
filename Optional[str]

Optional, allows specifying a specific file that will be imported.

None
filetag Optional[str]

Optional, can be used to select a file that contains the filetag as a substring, instead of specifying a filename.

None
rename_columns bool

If True, columns are renamed according to the MsReport convention; default True.

True
prefix_column_tags bool

If True, column tags such as "Intensity" are added in front of the sample names, e.g. "Intensity sample_name". If False, column tags are added afterwards, e.g. "Sample_name Intensity"; default True.

True
drop_protein_info bool

If True, columns containing protein-specific information, such as "Gene" or "Protein Length", are removed. See SpectronautReader.protein_info_columns and SpectronautReader.protein_info_tags for a full list of columns that will be removed. Default True.

True

Returns:

Type Description
DataFrame

A dataframe containing the processed protein table.

Source code in msreport\reader.py
def import_proteins(
    self,
    filename: Optional[str] = None,
    filetag: Optional[str] = None,
    rename_columns: bool = True,
    prefix_column_tags: bool = True,
    drop_protein_info: bool = True,
) -> pd.DataFrame:
    """Reads a Spectronaut protein report file and returns a processed DataFrame.

    Adds four protein entry columns to comply with the MsReport convention:
    "Protein reported by software", "Leading proteins", "Representative protein",
    "Potential contaminant".

    "Protein reported by software" and "Representative protein" contain the first
    entry from the "PG.ProteinAccessions" column, and "Leading proteins" contains
    all entries from this column. Multiple leading protein entries are separated by
    ";".

    Several columns in the Spectronaut report file can contain information specific
    for the leading protein entry. If leading proteins will be re-sorted later, it
    is recommended to remove columns containing protein specific information by
    setting 'drop_protein_info=True'.

    Args:
        filename: Optional, allows specifying a specific file that will be imported.
        filetag: Optional, can be used to select a file that contains the filetag as
            a substring, instead of specifying a filename.
        rename_columns: If True, columns are renamed according to the MsReport
            convention; default True.
        prefix_column_tags: If True, column tags such as "Intensity" are added
            in front of the sample names, e.g. "Intensity sample_name". If False,
            column tags are added afterwards, e.g. "Sample_name Intensity"; default
            True.
        drop_protein_info: If True, columns containing protein-specific information,
            such as "Gene" or "Protein Length", are removed. See
            SpectronautReader.protein_info_columns and
            SpectronautReader.protein_info_tags for a full list of columns that will
            be removed. Default True.

    Returns:
        A dataframe containing the processed protein table.
    """
    filetag = self.filetags["proteins"] if filetag is None else filetag
    filenames = _find_matching_files(
        self.data_directory,
        filename=filename,
        filetag=filetag,
        extensions=["xls", "tsv", "csv"],
    )
    if len(filenames) == 0:
        raise FileNotFoundError("No matching file found.")
    elif len(filenames) > 1:
        exception_message_lines = [
            f"Multiple matching files found in: {self.data_directory}",
            "One of the report filenames must be specified manually:",
        ]
        exception_message_lines.extend(filenames)
        exception_message = "\n".join(exception_message_lines)
        raise ValueError(exception_message)
    else:
        filename = filenames[0]

    df = self._read_file(filename)
    df = self._tidy_up_sample_columns(df)
    df = self._add_protein_entries(df)
    if drop_protein_info:
        df = self._drop_columns(df, self.protein_info_columns)
        for tag in self.protein_info_tags:
            df = self._drop_columns_by_tag(df, tag)
    if rename_columns:
        df = self._rename_columns(df, prefix_column_tags)
    return df
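
Continuing the sketch above, the protein report can be imported as follows. Dropping protein-specific columns is recommended whenever leading proteins will be re-sorted later.

proteins = reader.import_proteins(drop_protein_info=True)

# With prefix_column_tags=True, quantitative columns follow the "<tag> <sample name>"
# pattern, for example "Intensity sample_1".
intensity_columns = [col for col in proteins.columns if col.startswith("Intensity ")]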

import_peptides

import_peptides(
    filename: Optional[str] = None,
    filetag: Optional[str] = None,
    rename_columns: bool = True,
    prefix_column_tags: bool = True,
) -> DataFrame

Reads a Spectronaut peptide report file and returns a processed DataFrame.

Uses and renames the following Spectronaut report columns: PG.ProteinAccessions, PEP.Quantity, PEP.StrippedSequence, and PEP.AllOccurringProteinAccessions

Adds four protein entry columns to comply with the MsReport convention: "Protein reported by software", "Leading proteins", "Representative protein", "Potential contaminant".

"Protein reported by software" and "Representative protein" contain the first entry from the "PG.ProteinAccessions" column, and "Leading proteins" contains all entries from this column. Multiple leading protein entries are separated by ";".

Parameters:

Name Type Description Default
filename Optional[str]

Optional, allows specifying a specific file that will be imported.

None
filetag Optional[str]

Optional, can be used to select a file that contains the filetag as a substring, instead of specifying a filename.

None
rename_columns bool

If True, columns are renamed according to the MsReport convention; default True.

True
prefix_column_tags bool

If True, column tags such as "Intensity" are added in front of the sample names, e.g. "Intensity sample_name". If False, column tags are added afterwards, e.g. "Sample_name Intensity"; default True.

True

Returns:

Type Description
DataFrame

A dataframe containing the processed peptide table.

Source code in msreport\reader.py
def import_peptides(
    self,
    filename: Optional[str] = None,
    filetag: Optional[str] = None,
    rename_columns: bool = True,
    prefix_column_tags: bool = True,
) -> pd.DataFrame:
    """Reads a Spectronaut peptide report file and returns a processed DataFrame.

    Uses and renames the following Spectronaut report columns:
    PG.ProteinAccessions, PEP.Quantity, PEP.StrippedSequence, and
    PEP.AllOccurringProteinAccessions

    Adds four protein entry columns to comply with the MsReport convention:
    "Protein reported by software", "Leading proteins", "Representative protein",
    "Potential contaminant".

    "Protein reported by software" and "Representative protein" contain the first
    entry from the "PG.ProteinAccessions" column, and "Leading proteins" contains
    all entries from this column. Multiple leading protein entries are separated by
    ";".

    Args:
        filename: Optional, allows specifying a specific file that will be imported.
        filetag: Optional, can be used to select a file that contains the filetag as
            a substring, instead of specifying a filename.
        rename_columns: If True, columns are renamed according to the MsReport
            convention; default True.
        prefix_column_tags: If True, column tags such as "Intensity" are added
            in front of the sample names, e.g. "Intensity sample_name". If False,
            column tags are added afterwards, e.g. "Sample_name Intensity"; default
            True.

    Returns:
        A dataframe containing the processed peptide table.
    """
    filenames = _find_matching_files(
        self.data_directory,
        filename=filename,
        filetag=filetag,
        extensions=["xls", "tsv", "csv"],
    )
    if len(filenames) == 0:
        raise FileNotFoundError("No matching file found.")
    elif len(filenames) > 1:
        exception_message_lines = [
            f"Multiple matching files found in: {self.data_directory}",
            "One of the report filenames must be specified manually:",
        ]
        exception_message_lines.extend(filenames)
        exception_message = "\n".join(exception_message_lines)
        raise ValueError(exception_message)
    else:
        filename = filenames[0]

    df = self._read_file(filename)
    df = self._tidy_up_sample_columns(df)
    df = self._add_protein_entries(df)
    if rename_columns:
        df = self._rename_columns(df, prefix_column_tags)
    return df
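
A corresponding sketch for the peptide report, reusing the reader from the design example above.

peptides = reader.import_peptides()

# The MsReport protein entry columns are added during import.
print(peptides[["Representative protein", "Leading proteins", "Potential contaminant"]].head())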

import_ion_evidence

import_ion_evidence(
    filename: Optional[str] = None,
    filetag: Optional[str] = None,
    rename_columns: bool = True,
    rewrite_modifications: bool = True,
) -> DataFrame

Reads an ion evidence file (long format) and returns a processed dataframe.

Adds new columns to comply with the MsReport convention. "Protein reported by software" and "Representative protein", both contain the first entry from "PG.ProteinAccessions". "Ion ID" contains unique entries for each ion, which are generated by concatenating the "Modified sequence" and "Charge" columns, and if present, the "Compensation voltage" column.

"Modified sequence" entries contain modifications within square brackets. "Modification" entries are strings in the form of "position:modification_tag", multiple modifications are joined by ";". An example for a modified sequence and a modification entry: "PEPT[Phospho]IDO[Oxidation]", "4:Phospho;7:Oxidation".

"Modification localization string" contains localization probabilities in the format "Mod1@Site1:Probability1,Site2:Probability2;Mod2@Site3:Probability3", e.g. "15.9949@11:1.000;79.9663@3:0.200,4:0.800". Refer to msreport.peptidoform.make_localization_string for details.

Parameters:

Name Type Description Default
filename Optional[str]

Optional, allows specifying a specific file that will be imported.

None
filetag Optional[str]

Optional, can be used to select a file that contains the filetag as a substring, instead of specifying a filename.

None
rename_columns bool

If True, columns are renamed according to the MsReport convention; default True.

True
rewrite_modifications bool

If True, the peptide format in "Modified sequence" is changed according to the MsReport convention, and a "Modifications" column is added that contains the amino acid positions of all modifications. Requires 'rename_columns' to be True. Default True.

True

Returns:

Type Description
DataFrame

A dataframe containing the processed ion table.

Source code in msreport\reader.py
def import_ion_evidence(
    self,
    filename: Optional[str] = None,
    filetag: Optional[str] = None,
    rename_columns: bool = True,
    rewrite_modifications: bool = True,
) -> pd.DataFrame:
    """Reads an ion evidence file (long format) and returns a processed dataframe.

    Adds new columns to comply with the MsReport convention. "Protein reported
    by software" and "Representative protein", both contain the first entry from
    "PG.ProteinAccessions". "Ion ID" contains unique entries for each ion, which are
    generated by concatenating the "Modified sequence" and "Charge" columns, and if
    present, the "Compensation voltage" column.

    "Modified sequence" entries contain modifications within square brackets.
    "Modification" entries are strings in the form of "position:modification_tag",
    multiple modifications are joined by ";". An example for a modified sequence and
    a modification entry: "PEPT[Phospho]IDO[Oxidation]", "4:Phospho;7:Oxidation".

    "Modification localization string" contains localization probabilities in the
    format "Mod1@Site1:Probability1,Site2:Probability2;Mod2@Site3:Probability3",
    e.g. "15.9949@11:1.000;79.9663@3:0.200,4:0.800". Refer to
    `msreport.peptidoform.make_localization_string` for details.

    Args:
        filename: Optional, allows specifying a specific file that will be imported.
        filetag: Optional, can be used to select a file that contains the filetag as
            a substring, instead of specifying a filename.
        rename_columns: If True, columns are renamed according to the MsReport
            convention; default True.
        rewrite_modifications: If True, the peptide format in "Modified sequence" is
            changed according to the MsReport convention, and a "Modifications" column
            is added that contains the amino acid positions of all modifications.
            Requires 'rename_columns' to be True. Default True.

    Returns:
        A dataframe containing the processed ion table.
    """
    filenames = _find_matching_files(
        self.data_directory,
        filename=filename,
        filetag=filetag,
        extensions=["xls", "tsv", "csv"],
    )
    if len(filenames) == 0:
        raise FileNotFoundError("No matching file found.")
    elif len(filenames) > 1:
        exception_message_lines = [
            f"Multiple matching files found in: {self.data_directory}",
            "One of the report filenames must be specified manually:",
        ]
        exception_message_lines.extend(filenames)
        exception_message = "\n".join(exception_message_lines)
        raise ValueError(exception_message)
    else:
        filename = filenames[0]
    df = self._read_file(filename)
    df = self._tidy_up_sample_columns(df)
    df = self._add_protein_entries(df)
    if rename_columns:
        df = self._rename_columns(df, True)
    if rewrite_modifications and rename_columns:
        df = self._add_peptide_modification_entries(df)
        df = self._add_modification_localization_string(df)
        df["Ion ID"] = df["Modified sequence"] + "_c" + df["Charge"].astype(str)
        if "Compensation voltage" in df.columns:
            _cv = df["Compensation voltage"].astype(str)
            df["Ion ID"] = df["Ion ID"] + "_cv" + _cv

    return df
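
A sketch for the ion evidence import, reusing the reader from above; the "Ion ID" format in the comment follows the concatenation shown in the source code.

ions = reader.import_ion_evidence()

# "Ion ID" joins the modified sequence and charge, plus the compensation voltage if present,
# e.g. "PEPT[Phospho]IDK_c2" or "PEPT[Phospho]IDK_c2_cv-45.0".
print(ions[["Ion ID", "Modified sequence", "Modifications"]].head())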

sort_leading_proteins

sort_leading_proteins(
    table: DataFrame,
    alphanumeric: bool = True,
    penalize_contaminants: bool = True,
    special_proteins: Optional[list[str]] = None,
    database_order: Optional[list[str]] = None,
) -> DataFrame

Returns a copy of 'table' with sorted leading proteins.

"Leading proteins" are sorted according to the selected options. The first entry of the sorted leading proteins is selected as the new "Representative protein". If the columns are present, also the entries of "Leading proteins database origin" and "Leading potential contaminants" are reordered, and "Potential contaminant" is reassigned according to the representative protein.

Additional protein annotation columns, refering to a representative protein that has been changed, will no longer be valid. It is therefore recommended to remove all columns containing protein specific information by enabling 'drop_protein_info' during the import of protein tables or to update protein annotation columns if possible.

Parameters:

Name Type Description Default
table DataFrame

Dataframe in which "Leading proteins" will be sorted.

required
alphanumeric bool

If True, protein entries are sorted alphanumerically.

True
penalize_contaminants bool

If True, protein contaminants are sorted to the back.

True
special_proteins Optional[list[str]]

Optional, allows specifying a list of protein IDs that will always be sorted to the beginning.

None
database_order Optional[list[str]]

Optional, allows specifying an order of protein databases that will be considered for sorting. Database names that are not present in 'database_order' are sorted to the end. The protein database of a fasta entry is written in the very beginning of the fasta header, e.g. "sp" from the fasta header ">sp|P60709|ACTB_HUMAN Actin".

None

Returns:

Type Description
DataFrame

A copy of the 'table', containing sorted leading protein entries.

Source code in msreport\reader.py
def sort_leading_proteins(
    table: pd.DataFrame,
    alphanumeric: bool = True,
    penalize_contaminants: bool = True,
    special_proteins: Optional[list[str]] = None,
    database_order: Optional[list[str]] = None,
) -> pd.DataFrame:
    """Returns a copy of 'table' with sorted leading proteins.

    "Leading proteins" are sorted according to the selected options. The first entry
    of the sorted leading proteins is selected as the new "Representative protein". If
    the columns are present, also the entries of "Leading proteins database origin" and
    "Leading potential contaminants" are reordered, and "Potential contaminant" is
    reassigned according to the representative protein.

    Additional protein annotation columns referring to a representative protein that has
    been changed will no longer be valid. It is therefore recommended to remove all
    columns containing protein-specific information by enabling 'drop_protein_info'
    during the import of protein tables or to update protein annotation columns if
    possible.

    Args:
        table: Dataframe in which "Leading proteins" will be sorted.
        alphanumeric: If True, protein entries are sorted alpha numerical.
        penalize_contaminants: If True, protein contaminants are sorted to the back.
        special_proteins: Optional, allows specifying a list of protein IDs that
            will always be sorted to the beginning.
        database_order: Optional, allows specifying an order of protein databases that
            will be considered for sorting. Database names that are not present in
            'database_order' are sorted to the end. The protein database of a fasta
            entry is written in the very beginning of the fasta header, e.g. "sp" from
            the fasta header ">sp|P60709|ACTB_HUMAN Actin".

    Returns:
        A copy of the 'table', containing sorted leading protein entries.
    """
    sorted_entries = defaultdict(list)
    contaminants_present = "Leading potential contaminants" in table
    db_origins_present = "Leading proteins database origin" in table

    if database_order is not None:
        database_encoding: dict[str, int] = defaultdict(lambda: 999)
        database_encoding.update({db: i for i, db in enumerate(database_order)})
    if penalize_contaminants is not None:
        contaminant_encoding = {"False": 0, "True": 1, False: 0, True: 1}

    for _, row in table.iterrows():
        protein_ids = row["Leading proteins"].split(";")

        sorting_info: list[list] = [[] for _ in protein_ids]
        if special_proteins is not None:
            for i, _id in enumerate(protein_ids):
                sorting_info[i].append(_id not in special_proteins)
        if penalize_contaminants:
            for i, is_contaminant in enumerate(
                row["Leading potential contaminants"].split(";")
            ):
                sorting_info[i].append(contaminant_encoding[is_contaminant])
        if database_order is not None:
            for i, db_origin in enumerate(
                row["Leading proteins database origin"].split(";")
            ):
                sorting_info[i].append(database_encoding[db_origin])
        if alphanumeric:
            for i, _id in enumerate(protein_ids):
                sorting_info[i].append(_id)
        sorting_order = [
            i[0] for i in sorted(enumerate(sorting_info), key=lambda x: x[1])
        ]

        protein_ids = [protein_ids[i] for i in sorting_order]
        sorted_entries["Representative protein"].append(protein_ids[0])
        sorted_entries["Leading proteins"].append(";".join(protein_ids))

        if contaminants_present:
            contaminants = row["Leading potential contaminants"].split(";")
            contaminants = [contaminants[i] for i in sorting_order]
            potential_contaminant = contaminants[0] == "True"
            contaminants = ";".join(contaminants)
            sorted_entries["Potential contaminant"].append(potential_contaminant)
            sorted_entries["Leading potential contaminants"].append(contaminants)

        if db_origins_present:
            db_origins = row["Leading proteins database origin"].split(";")
            db_origins = ";".join([db_origins[i] for i in sorting_order])
            sorted_entries["Leading proteins database origin"].append(db_origins)

    sorted_table = table.copy()
    for key in sorted_entries:
        sorted_table[key] = sorted_entries[key]
    return sorted_table
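
A self-contained sketch of the sorting behaviour with a minimal two-row table; the protein accessions are arbitrary examples.

import pandas as pd

from msreport.reader import sort_leading_proteins

table = pd.DataFrame(
    {
        "Leading proteins": ["Q9Y230;P60709", "contam_P02768;P12345"],
        "Leading potential contaminants": ["False;False", "True;False"],
    }
)
sorted_table = sort_leading_proteins(table, alphanumeric=True, penalize_contaminants=True)

# Row 1: "Representative protein" becomes "P60709" (alphanumeric sort of the two IDs).
# Row 2: "Representative protein" becomes "P12345" because the contaminant entry is pushed
# to the back before the alphanumeric comparison, and "Potential contaminant" is set to False.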

add_protein_annotation

add_protein_annotation(
    table: DataFrame,
    protein_db: ProteinDatabase,
    id_column: str = "Representative protein",
    gene_name: bool = False,
    protein_name: bool = False,
    protein_entry: bool = False,
    protein_length: bool = False,
    molecular_weight: bool = False,
    fasta_header: bool = False,
    ibaq_peptides: bool = False,
    database_origin: bool = False,
) -> DataFrame

Uses a FASTA protein database to add protein annotation columns.

Parameters:

Name Type Description Default
table DataFrame

Dataframe to which the protein annotations are added.

required
protein_db ProteinDatabase

A protein database containing entries from one or multiple FASTA files.

required
id_column str

Column in 'table' that contains protein uniprot IDs, which will be used to look up entries in the 'protein_db'.

'Representative protein'
gene_name bool

If True, adds a "Gene name" column.

False
protein_name bool

If True, adds "Protein name" column.

False
protein_entry bool

If True, adds "Protein entry name" column.

False
protein_length bool

If True, adds a "Protein length" column.

False
molecular_weight bool

If True, adds a "Molecular weight [kDa]" column. The molecular weight is calculated as the monoisotopic mass in kilo Dalton, rounded to two decimal places. Note that there is an opinionated behaviour for non-standard amino acids code. "O" is Pyrrolysine, "U" is Selenocysteine, "B" is treated as "N", "Z" is treated as "Q", and "X" is ignored.

False
fasta_header bool

If True, adds a "Fasta header" column.

False
ibaq_peptides bool

If True, adds a "iBAQ peptides" columns. The number of iBAQ peptides is calculated as the theoretical number of tryptic peptides with a length between 7 and 30.

False
database_origin bool

If True, adds a "Database origin" column.

False

Returns:

Type Description
DataFrame

The updated 'table' dataframe.

Source code in msreport\reader.py
def add_protein_annotation(
    table: pd.DataFrame,
    protein_db: ProteinDatabase,
    id_column: str = "Representative protein",
    gene_name: bool = False,
    protein_name: bool = False,
    protein_entry: bool = False,
    protein_length: bool = False,
    molecular_weight: bool = False,
    fasta_header: bool = False,
    ibaq_peptides: bool = False,
    database_origin: bool = False,
) -> pd.DataFrame:
    """Uses a FASTA protein database to add protein annotation columns.

    Args:
        table: Dataframe to which the protein annotations are added.
        protein_db: A protein database containing entries from one or multiple FASTA
            files.
        id_column: Column in 'table' that contains protein uniprot IDs, which will be
            used to look up entries in the 'protein_db'.
        gene_name: If True, adds a "Gene name" column.
        protein_name: If True, adds "Protein name" column.
        protein_entry: If True, adds "Protein entry name" column.
        protein_length: If True, adds a "Protein length" column.
        molecular_weight: If True, adds a "Molecular weight [kDa]" column. The molecular
            weight is calculated as the monoisotopic mass in kilo Dalton, rounded to two
            decimal places. Note that there is an opinionated behaviour for non-standard
            amino acids code. "O" is Pyrrolysine, "U" is Selenocysteine, "B" is treated
            as "N", "Z" is treated as "Q", and "X" is ignored.
        fasta_header: If True, adds a "Fasta header" column.
        ibaq_peptides: If True, adds an "iBAQ peptides" column. The number of iBAQ
            peptides is calculated as the theoretical number of tryptic peptides with
            a length between 7 and 30.
        database_origin: If True, adds a "Database origin" column.

    Returns:
        The updated 'table' dataframe.
    """
    # not tested #
    proteins = table[id_column].to_list()

    proteins_not_in_db = []
    for protein_id in proteins:
        if protein_id not in protein_db:
            proteins_not_in_db.append(protein_id)
    if proteins_not_in_db:
        warnings.warn(
            f"Some proteins could not be annotated: {repr(proteins_not_in_db)}",
            ProteinsNotInFastaWarning,
            stacklevel=2,
        )

    annotations = {}
    if gene_name:
        annotations["Gene name"] = _create_protein_annotations_from_db(
            proteins, protein_db, _get_annotation_gene_name, ""
        )
    if protein_name:
        annotations["Protein name"] = _create_protein_annotations_from_db(
            proteins, protein_db, _get_annotation_protein_name, ""
        )
    if protein_entry:
        annotations["Protein entry name"] = _create_protein_annotations_from_db(
            proteins, protein_db, _get_annotation_protein_entry_name, ""
        )
    if protein_length:
        annotations["Protein length"] = _create_protein_annotations_from_db(
            proteins, protein_db, _get_annotation_sequence_length, -1
        )
    if molecular_weight:
        annotations["Molecular weight [kDa]"] = _create_protein_annotations_from_db(
            proteins, protein_db, _get_annotation_molecular_weight, np.nan
        )
    if fasta_header:
        annotations["Fasta header"] = _create_protein_annotations_from_db(
            proteins, protein_db, _get_annotation_fasta_header, ""
        )
    if database_origin:
        annotations["Database origin"] = _create_protein_annotations_from_db(
            proteins, protein_db, _get_annotation_db_origin, ""
        )
    if ibaq_peptides:
        annotations["iBAQ peptides"] = _create_protein_annotations_from_db(
            proteins, protein_db, _get_annotation_ibaq_peptides, -1
        )
    for column in annotations.keys():
        table[column] = annotations[column]
    return table
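
A hedged sketch of the annotation step: 'protein_table' is assumed to come from one of the readers above, and 'protein_db' is assumed to be a ProteinDatabase built from the search FASTA file (its construction is outside this module and not shown here).

from msreport.reader import add_protein_annotation

protein_table = add_protein_annotation(
    protein_table,
    protein_db,
    gene_name=True,
    protein_length=True,
    molecular_weight=True,
    ibaq_peptides=True,
)
# Proteins missing from the FASTA trigger a ProteinsNotInFastaWarning and receive the
# fallback values ("" for text columns, -1 for counts, NaN for the molecular weight).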

add_protein_site_annotation

add_protein_site_annotation(
    table: DataFrame,
    protein_db: ProteinDatabase,
    protein_column: str = "Representative protein",
    site_column: str = "Protein site",
) -> DataFrame

Uses a FASTA protein database to add protein site annotation columns.

Adds the columns "Modified residue", which corresponds to the amino acid at the protein site position, and "Sequence window", which contains sequence windows of eleven amino acids surrounding the protein site. Sequence windows are centered on the respective protein site; missing amino acids due to the position being close to the beginning or end of the protein sequence are substituted with "-".

Parameters:

Name Type Description Default
table DataFrame

Dataframe to which the protein site annotations are added.

required
protein_db ProteinDatabase

A protein database containing entries from one or multiple FASTA files.

required
protein_column str

Column in 'table' that contains protein identifiers, which will be used to look up entries in the 'protein_db'.

'Representative protein'
site_column str

Column in 'table' that contains protein sites, which will be used to extract information from the protein sequence. Protein sites are one-indexed, meaning the first amino acid of the protein is position 1.

'Protein site'

Returns:

Type Description
DataFrame

The updated 'table' dataframe.

Source code in msreport\reader.py
def add_protein_site_annotation(
    table: pd.DataFrame,
    protein_db: ProteinDatabase,
    protein_column: str = "Representative protein",
    site_column: str = "Protein site",
) -> pd.DataFrame:
    """Uses a FASTA protein database to add protein site annotation columns.

    Adds the columns "Modified residue", which corresponds to the amino acid at the
    protein site position, and "Sequence window", which contains sequence windows of
    eleven amino acids surrounding the protein site. Sequence windows are centered on
    the respective protein site; missing amino acids due to the position being close to
    the beginning or end of the protein sequence are substituted with "-".

    Args:
        table: Dataframe to which the protein site annotations are added.
        protein_db: A protein database containing entries from one or multiple FASTA
            files.
        protein_column: Column in 'table' that contains protein identifiers, which will
            be used to look up entries in the 'protein_db'.
        site_column: Column in 'table' that contains protein sites, which will be used
            to extract information from the protein sequence. Protein sites are
            one-indexed, meaning the first amino acid of the protein is position 1.

    Returns:
        The updated 'table' dataframe.
    """
    # TODO not tested
    proteins = table[protein_column].to_list()
    proteins_not_in_db = []
    for protein_id in proteins:
        if protein_id not in protein_db:
            proteins_not_in_db.append(protein_id)
    if proteins_not_in_db:
        warnings.warn(
            f"Some proteins could not be annotated: {repr(proteins_not_in_db)}",
            ProteinsNotInFastaWarning,
            stacklevel=2,
        )

    annotations: dict[str, list[str]] = {
        "Modified residue": [],
        "Sequence window": [],
    }
    for protein, site in zip(table[protein_column], table[site_column]):
        protein_sequence = protein_db[protein].sequence

        modified_residue = protein_sequence[site - 1]
        annotations["Modified residue"].append(modified_residue)

        sequence_window = extract_window_around_position(protein_sequence, site)
        annotations["Sequence window"].append(sequence_window)

    for column, annotation_values in annotations.items():
        table[column] = annotation_values
    return table
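
Because the function only needs membership tests and item access that returns objects with a 'sequence' attribute, a plain dict of SimpleNamespace objects is sufficient for a self-contained illustration; the accession and sequence below are made up.

from types import SimpleNamespace

import pandas as pd

from msreport.reader import add_protein_site_annotation

protein_db = {"P12345": SimpleNamespace(sequence="MKTAYIAKQRQISFVKSHF")}
table = pd.DataFrame({"Representative protein": ["P12345"], "Protein site": [5]})
table = add_protein_site_annotation(table, protein_db)

# "Modified residue" is "Y" (the amino acid at position 5); "Sequence window" holds the
# eleven-residue window centred on the site, padded with "-" where it runs past a terminus.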

add_leading_proteins_annotation

add_leading_proteins_annotation(
    table: DataFrame,
    protein_db: ProteinDatabase,
    id_column: str = "Leading proteins",
    gene_name: bool = False,
    protein_entry: bool = False,
    protein_length: bool = False,
    fasta_header: bool = False,
    ibaq_peptides: bool = False,
    database_origin: bool = False,
) -> DataFrame

Uses a FASTA protein database to add leading protein annotation columns.

Generates protein annotations for multi-protein entries, where each entry can contain one or multiple protein IDs; multiple protein IDs are separated by ";".

Parameters:

Name Type Description Default
table DataFrame

Dataframe to which the protein annotations are added.

required
protein_db ProteinDatabase

A protein database containing entries from one or multiple FASTA files.

required
id_column str

Column in 'table' that contains leading protein uniprot IDs, which will be used to look up entries in the 'protein_db'.

'Leading proteins'
gene_name bool

If True, adds a "Leading proteins gene name" column.

False
protein_entry bool

If True, adds "Leading proteins entry name" column.

False
protein_length bool

If True, adds a "Leading proteins length" column.

False
fasta_header bool

If True, adds a "Leading proteins fasta header" column.

False
ibaq_peptides bool

If True, adds a "Leading proteins iBAQ peptides" columns. The number of iBAQ peptides is calculated as the theoretical number of tryptic peptides with a length between 7 and 30.

False
database_origin bool

If True, adds a "Leading proteins database origin" column.

False

Returns:

Type Description
DataFrame

The updated 'table' dataframe.

Source code in msreport\reader.py
def add_leading_proteins_annotation(
    table: pd.DataFrame,
    protein_db: ProteinDatabase,
    id_column: str = "Leading proteins",
    gene_name: bool = False,
    protein_entry: bool = False,
    protein_length: bool = False,
    fasta_header: bool = False,
    ibaq_peptides: bool = False,
    database_origin: bool = False,
) -> pd.DataFrame:
    """Uses a FASTA protein database to add leading protein annotation columns.

    Generates protein annotations for multi protein entries, where each entry can
    contain one or multiple protein ids, multiple protein ids are separated by ";".

    Args:
        table: Dataframe to which the protein annotations are added.
        protein_db: A protein database containing entries from one or multiple FASTA
            files.
        id_column: Column in 'table' that contains leading protein uniprot IDs, which
            will be used to look up entries in the 'protein_db'.
        gene_name: If True, adds a "Leading proteins gene name" column.
        protein_entry: If True, adds "Leading proteins entry name" column.
        protein_length: If True, adds a "Leading proteins length" column.
        fasta_header: If True, adds a "Leading proteins fasta header" column.
        ibaq_peptides: If True, adds a "Leading proteins iBAQ peptides" columns. The
            number of iBAQ peptides is calculated as the theoretical number of tryptic
            peptides with a length between 7 and 30.
        database_origin: If True, adds a "Leading proteins database origin" column.

    Returns:
        The updated 'table' dataframe.
    """
    # not tested #
    leading_protein_entries = table[id_column].to_list()

    proteins_not_in_db = []
    for leading_entry in leading_protein_entries:
        for protein_id in leading_entry.split(";"):
            if protein_id not in protein_db:
                proteins_not_in_db.append(protein_id)
    if proteins_not_in_db:
        warnings.warn(
            f"Some proteins could not be annotated: {repr(proteins_not_in_db)}",
            ProteinsNotInFastaWarning,
            stacklevel=2,
        )

    annotations = {}
    if gene_name:
        annotation = _create_multi_protein_annotations_from_db(
            leading_protein_entries, protein_db, _get_annotation_gene_name
        )
        annotations["Leading proteins gene name"] = annotation
    if protein_entry:
        annotation = _create_multi_protein_annotations_from_db(
            leading_protein_entries, protein_db, _get_annotation_protein_entry_name
        )
        annotations["Leading proteins entry name"] = annotation
    if protein_length:
        annotation = _create_multi_protein_annotations_from_db(
            leading_protein_entries, protein_db, _get_annotation_sequence_length
        )
        annotations["Leading proteins length"] = annotation
    if fasta_header:
        annotation = _create_multi_protein_annotations_from_db(
            leading_protein_entries, protein_db, _get_annotation_fasta_header
        )
        annotations["Leading proteins fasta header"] = annotation
    if ibaq_peptides:
        annotation = _create_multi_protein_annotations_from_db(
            leading_protein_entries, protein_db, _get_annotation_ibaq_peptides
        )
        annotations["Leading proteins iBAQ peptides"] = annotation
    if database_origin:
        annotation = _create_multi_protein_annotations_from_db(
            leading_protein_entries, protein_db, _get_annotation_db_origin
        )
        annotations["Leading proteins database origin"] = annotation
    for column in annotations.keys():
        table[column] = annotations[column]
    return table
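
A brief sketch, reusing the 'protein_db' and 'protein_table' assumed above; each annotation entry mirrors the multi-protein structure of the "Leading proteins" column.

from msreport.reader import add_leading_proteins_annotation

protein_table = add_leading_proteins_annotation(
    protein_table, protein_db, gene_name=True, database_origin=True
)
# Adds the "Leading proteins gene name" and "Leading proteins database origin" columns.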

add_protein_site_identifiers

add_protein_site_identifiers(
    table: DataFrame,
    protein_db: ProteinDatabase,
    site_column: str,
    protein_name_column: str,
)

Adds a "Protein site identifier" column to the 'table'.

The "Protein site identifier" is generated by concatenating the protein name with the amino acid and position of the protein site or sites, e.g. "P12345 - S123" or "P12345 - S123 / T125". The amino acid is extracted from the protein sequence at the position of the site. If the protein name is not available, the "Representative protein" entry is used instead.

Parameters:

Name Type Description Default
table DataFrame

Dataframe to which the protein site identifiers are added.

required
protein_db ProteinDatabase

A protein database containing entries from one or multiple FASTA files. Protein identifiers in the 'table' column "Representative protein" are used to look up entries in the 'protein_db'.

required
site_column str

Column in 'table' that contains protein site positions. Positions are one-indexed, meaning the first amino acid of the protein is position 1. Multiple sites in a single entry should be separated by ";".

required
protein_name_column str

Column in 'table' that contains protein names, which will be used to generate the identifier. If no name is available, the accession is used instead.

required

Raises:

Type Description
ValueError

If the "Representative protein", 'protein_name_column' or 'site_column' is not found in the 'table'.

Source code in msreport\reader.py
def add_protein_site_identifiers(
    table: pd.DataFrame,
    protein_db: ProteinDatabase,
    site_column: str,
    protein_name_column: str,
):
    """Adds a "Protein site identifier" column to the 'table'.

    The "Protein site identifier" is generated by concatenating the protein name
    with the amino acid and position of the protein site or sites, e.g. "P12345 - S123"
    or "P12345 - S123 / T125". The amino acid is extracted from the protein sequence at
    the position of the site. If the protein name is not available, the
    "Representative protein" entry is used instead.

    Args:
        table: Dataframe to which the protein site identifiers are added.
        protein_db: A protein database containing entries from one or multiple FASTA
            files. Protein identifiers in the 'table' column "Representative protein"
            are used to look up entries in the 'protein_db'.
        site_column: Column in 'table' that contains protein site positions. Positions
            are one-indexed, meaning the first amino acid of the protein is position 1.
            Multiple sites in a single entry should be separated by ";".
        protein_name_column: Column in 'table' that contains protein names, which will
            be used to generate the identifier. If no name is available, the accession
            is used instead.

    Raises:
        ValueError: If the "Representative protein", 'protein_name_column' or
            'site_column' is not found in the 'table'.
    """
    if site_column not in table.columns:
        raise ValueError(f"Column '{site_column}' not found in the table.")
    if protein_name_column not in table.columns:
        raise ValueError(f"Column '{protein_name_column}' not found in the table.")
    if "Representative protein" not in table.columns:
        raise ValueError("Column 'Representative protein' not found in the table.")

    site_identifiers = []
    for accession, sites, name in zip(
        table["Representative protein"],
        table[site_column].astype(str),
        table[protein_name_column],
    ):
        protein_sequence = protein_db[accession].sequence
        protein_identifier = name if name else accession
        aa_sites = []
        for site in sites.split(";"):
            aa = protein_sequence[int(site) - 1]
            aa_sites.append(f"{aa}{site}")
        aa_site_tag = " / ".join(aa_sites)
        site_identifier = f"{protein_identifier} - {aa_site_tag}"
        site_identifiers.append(site_identifier)
    table["Protein site identifier"] = site_identifiers

add_sequence_coverage

add_sequence_coverage(
    protein_table: DataFrame,
    peptide_table: DataFrame,
    id_column: str = "Protein reported by software",
) -> None

Calculates "Sequence coverage" and adds a new column to the 'protein_table'.

Sequence coverage is represented as a percentage, with values ranging from 0 to 100. Requires the columns "Start position" and "End position" in the 'peptide_table', and "Protein length" in the 'protein_table'. For protein entries where the sequence coverage cannot be calculated, a NaN value is added.

Parameters:

Name Type Description Default
protein_table DataFrame

Dataframe to which the "Sequence coverage" column is added.

required
peptide_table DataFrame

Dataframe which contains peptide information required for calculation of the protein sequence coverage.

required
id_column str

Column used to match entries between the 'protein_table' and the 'peptide_table', must be present in both tables. Default "Protein reported by software".

'Protein reported by software'
Source code in msreport\reader.py
def add_sequence_coverage(
    protein_table: pd.DataFrame,
    peptide_table: pd.DataFrame,
    id_column: str = "Protein reported by software",
) -> None:
    """Calculates "Sequence coverage" and adds a new column to the 'protein_table'.

    Sequence coverage is represented as a percentage, with values ranging from 0 to 100.
    Requires the columns "Start position" and "End position" in the 'peptide_table', and
    "Protein length" in the 'protein_table'. For protein entries where the sequence
    coverage cannot be calculated, a value of -1 is added.

    Args:
        protein_table: Dataframe to which the "Sequence coverage" column is added.
        peptide_table: Dataframe which contains peptide information required for
            calculation of the protein sequence coverage.
        id_column: Column used to match entries between the 'protein_table' and the
            'peptide_table', must be present in both tables. Default
            "Protein reported by software".
    """
    peptide_positions = {}
    for protein_id, peptide_group in peptide_table.groupby(by=id_column):
        positions = list(
            zip(peptide_group["Start position"], peptide_group["End position"])
        )
        peptide_positions[protein_id] = sorted(positions)

    sequence_coverages = []
    for protein_id, protein_length in zip(
        protein_table[id_column], protein_table["Protein length"]
    ):
        can_calculate_coverage = True
        if protein_id not in peptide_positions:
            can_calculate_coverage = False
        if protein_length < 1:
            can_calculate_coverage = False
        try:
            protein_length = int(protein_length)
        except ValueError:
            can_calculate_coverage = False

        if can_calculate_coverage:
            sequence_coverage = helper.calculate_sequence_coverage(
                protein_length, peptide_positions[protein_id], ndigits=1
            )
        else:
            sequence_coverage = np.nan
        sequence_coverages.append(sequence_coverage)
    protein_table["Sequence coverage"] = sequence_coverages

add_ibaq_intensities

add_ibaq_intensities(
    table: DataFrame,
    normalize: bool = True,
    ibaq_peptide_column: str = "iBAQ peptides",
    intensity_tag: str = "Intensity",
    ibaq_tag: str = "iBAQ intensity",
) -> None

Adds iBAQ intensity columns to the 'table'.

Requires a column containing the theoretical number of iBAQ peptides.

Parameters:

Name Type Description Default
table DataFrame

Dataframe to which the iBAQ intensity columns are added.

required
normalize bool

Scales iBAQ intensities per sample so that the sum of all iBAQ intensities is equal to the sum of all Intensities.

True
ibaq_peptide_column str

Column in 'table' containing the number of iBAQ peptides. No iBAQ intensity is calculated for rows with negative values or zero in the ibaq_peptide_column.

'iBAQ peptides'
intensity_tag str

Substring used to identify intensity columns from the 'table' that are used to calculate iBAQ intensities.

'Intensity'
ibaq_tag str

Substring used for naming the new 'table' columns containing the calculated iBAQ intensities. The column names are generated by replacing the 'intensity_tag' with the 'ibaq_tag'.

'iBAQ intensity'
Source code in msreport\reader.py
def add_ibaq_intensities(
    table: pd.DataFrame,
    normalize: bool = True,
    ibaq_peptide_column: str = "iBAQ peptides",
    intensity_tag: str = "Intensity",
    ibaq_tag: str = "iBAQ intensity",
) -> None:
    """Adds iBAQ intensity columns to the 'table'.

    Requires a column containing the theoretical number of iBAQ peptides.

    Args:
        table: Dataframe to which the iBAQ intensity columns are added.
        normalize: Scales iBAQ intensities per sample so that the sum of all iBAQ
            intensities is equal to the sum of all Intensities.
        ibaq_peptide_column: Column in 'table' containing the number of iBAQ peptides.
            No iBAQ intensity is calculated for rows with negative values or zero in the
            ibaq_peptide_column.
        intensity_tag: Substring used to identify intensity columns from the 'table'
            that are used to calculate iBAQ intensities.
        ibaq_tag: Substring used for naming the new 'table' columns containing the
            calculated iBAQ intensities. The column names are generated by replacing
            the 'intensity_tag' with the 'ibaq_tag'.
    """
    for intensity_column in helper.find_columns(table, intensity_tag):
        ibaq_column = intensity_column.replace(intensity_tag, ibaq_tag)
        valid = table[ibaq_peptide_column] > 0

        table[ibaq_column] = np.nan
        table.loc[valid, ibaq_column] = (
            table.loc[valid, intensity_column] / table.loc[valid, ibaq_peptide_column]
        )

        if normalize:
            total_intensity = table.loc[valid, intensity_column].sum()
            total_ibaq = table.loc[valid, ibaq_column].sum()
            factor = total_intensity / total_ibaq
            table.loc[valid, ibaq_column] = table.loc[valid, ibaq_column] * factor
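
A self-contained sketch; the sample column name "Intensity sample_1" is a made-up example that follows the naming convention used throughout this module.

import pandas as pd

from msreport.reader import add_ibaq_intensities

table = pd.DataFrame({"iBAQ peptides": [10, 0], "Intensity sample_1": [1000.0, 500.0]})
add_ibaq_intensities(table, normalize=False)

# "iBAQ intensity sample_1" is 100.0 for the first row (1000 / 10); the second row stays
# NaN because its "iBAQ peptides" value is not greater than zero.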

add_peptide_positions

add_peptide_positions(
    table: DataFrame,
    protein_db: ProteinDatabase,
    peptide_column: str = "Peptide sequence",
    protein_column: str = "Representative protein",
) -> None

Adds peptide "Start position" and "End position" positions to the table.

For entries where the protein is absent from the FASTA or the peptide sequence could not be matched to the protein sequence, start and end positions of -1 are added.

Parameters:

Name Type Description Default
table DataFrame

Dataframe to which the protein annotations are added.

required
protein_db ProteinDatabase

A protein database containing entries from one or multiple FASTA files.

required
peptide_column str

Column in 'table' that contains the peptide sequence. Peptide sequences must only contain amino acids and no other symbols.

'Peptide sequence'
protein_column str

Column in 'table' that contains protein IDs that are used to find matching entries in the FASTA files.

'Representative protein'
Source code in msreport\reader.py
def add_peptide_positions(
    table: pd.DataFrame,
    protein_db: ProteinDatabase,
    peptide_column: str = "Peptide sequence",
    protein_column: str = "Representative protein",
) -> None:
    """Adds peptide "Start position" and "End position" positions to the table.

    For entries where the protein is absent from the FASTA or the peptide sequence
    could not be matched to the protein sequence, start and end positions of -1 are
    added.

    Args:
        table: Dataframe to which the protein annotations are added.
        protein_db: A protein database containing entries from one or multiple FASTA
            files.
        peptide_column: Column in 'table' that contains the peptide sequence. Peptide
            sequences must only contain amino acids and no other symbols.
        protein_column: Column in 'table' that contains protein IDs that are used to
            find matching entries in the FASTA files.
    """
    # not tested #
    peptide_positions: dict[str, list[int]] = {"Start position": [], "End position": []}
    proteins_not_in_db = []
    for peptide, protein_id in zip(table[peptide_column], table[protein_column]):
        if protein_id in protein_db:
            sequence = protein_db[protein_id].sequence
            start = sequence.find(peptide) + 1
            end = start + len(peptide) - 1
            if start == 0:
                start, end = -1, -1
        else:
            proteins_not_in_db.append(protein_id)
            start, end = -1, -1
        peptide_positions["Start position"].append(start)
        peptide_positions["End position"].append(end)

    for key in peptide_positions:
        table[key] = peptide_positions[key]

    if proteins_not_in_db:
        warnings.warn(
            f"Some peptides could not be annotated: {repr(proteins_not_in_db)}",
            ProteinsNotInFastaWarning,
            stacklevel=2,
        )
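
A usage sketch (not part of the module documentation): any mapping whose values expose a 'sequence' attribute satisfies the ProteinDatabase protocol used by this function; in practice the database is loaded from FASTA files. The protein IDs and sequences below are invented.

from types import SimpleNamespace

import pandas as pd
import msreport.reader as reader

# Stand-in protein database with a single entry "P1" (made-up sequence).
protein_db = {"P1": SimpleNamespace(sequence="MKTAYIAKQRQISFVK")}

peptide_table = pd.DataFrame(
    {
        "Peptide sequence": ["AYIAK", "QRQISFVK", "ELVISLIVES"],
        "Representative protein": ["P1", "P1", "P2"],
    }
)

reader.add_peptide_positions(peptide_table, protein_db)
# "AYIAK" maps to positions 4-8 of P1 and "QRQISFVK" to 9-16; the last row gets
# -1 / -1 because "P2" is missing from the database, which also triggers a
# ProteinsNotInFastaWarning.
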

add_protein_modifications

add_protein_modifications(table: DataFrame)

Adds a "Protein sites" column.

To generate the "Protein modifications" the positions from the "Modifications" column are increase according to the peptide positions ("Start position"] column).

Parameters:

Name Type Description Default
table DataFrame

Dataframe to which the "Protein modifications" column is added.

required
Source code in msreport\reader.py
def add_protein_modifications(table: pd.DataFrame):
    """Adds a "Protein sites" column.

    To generate the "Protein modifications" the positions from the "Modifications"
    column are increase according to the peptide positions ("Start position"] column).

    Args:
        table: Dataframe to which the "Protein modifications" column is added.
    """
    protein_modification_entries = []
    for mod_entry, start_pos in zip(table["Modifications"], table["Start position"]):
        if mod_entry:
            protein_mods = []
            for peptide_site, mod in [m.split(":") for m in mod_entry.split(";")]:
                protein_site = int(peptide_site) + start_pos - 1
                protein_mods.append([str(protein_site), mod])
            protein_mod_string = ";".join([f"{pos}:{mod}" for pos, mod in protein_mods])
        else:
            protein_mod_string = ""
        protein_modification_entries.append(protein_mod_string)
    table["Protein modifications"] = protein_modification_entries

propagate_representative_protein

propagate_representative_protein(
    target_table: DataFrame, source_table: DataFrame
) -> None

Propagates "Representative protein" column from the source to the target table.

The column "Protein reported by software" is used to match entries between the two tables. Then entries from "Representative protein" are propagated from the 'source_table' to matching rows in the 'target_table'.

Parameters:

Name Type Description Default
target_table DataFrame

Dataframe to which "Representative protein" entries will be added.

required
source_table DataFrame

Dataframe from which "Representative protein" entries are propagated.

required
Source code in msreport\reader.py
def propagate_representative_protein(
    target_table: pd.DataFrame, source_table: pd.DataFrame
) -> None:
    """Propagates "Representative protein" column from the source to the target table.

    The column "Protein reported by software" is used to match entries between the two
    tables. Then entries from "Representative protein" are propagated from the
    'source_table' to matching rows in the 'target_table'.

    Args:
        target_table: Dataframe to which "Representative protein" entries will be added.
        source_table: Dataframe from which "Representative protein" entries are
            propagated.
    """
    # not tested #
    protein_lookup = {}
    for old, new in zip(
        source_table["Protein reported by software"],
        source_table["Representative protein"],
    ):
        protein_lookup[old] = new

    new_protein_ids = []
    for old in target_table["Protein reported by software"]:
        new_protein_ids.append(protein_lookup[old] if old in protein_lookup else old)
    target_table["Representative protein"] = new_protein_ids

extract_sample_names

extract_sample_names(df: DataFrame, tag: str) -> list[str]

Extracts sample names from columns containing the 'tag' substring.

Sample names are extracted from column names containing the 'tag' string by splitting the column name at the 'tag' and stripping leading and trailing whitespace from the resulting strings.

Parameters:

Name Type Description Default
df DataFrame

Column names from this dataframe are used for extracting sample names.

required
tag str

Column names containing the 'tag' are selected for extracting sample names.

required

Returns:

Type Description
list[str]

A list of sample names.

Source code in msreport\reader.py
def extract_sample_names(df: pd.DataFrame, tag: str) -> list[str]:
    """Extracts sample names from columns containing the 'tag' substring.

    Sample names are extracted from column names containing the 'tag' string by
    splitting the column name at the 'tag' and stripping leading and trailing
    whitespace from the resulting strings.

    Args:
        df: Column names from this dataframe are used for extracting sample names.
        tag: Column names containing the 'tag' are selected for extracting sample names.

    Returns:
        A list of sample names.
    """
    columns = helper.find_columns(df, tag)
    sample_names = _find_remaining_substrings(columns, tag)
    return sample_names
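
A usage sketch, assuming the helper functions behave as described in the docstring; the column names follow the standardized 'Intensity "sample name"' convention and the sample names are invented.

import pandas as pd
import msreport.reader as reader

df = pd.DataFrame(
    columns=["Representative protein", "Intensity Sample_A", "Intensity Sample_B"]
)
print(reader.extract_sample_names(df, tag="Intensity"))
# ['Sample_A', 'Sample_B']
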

extract_maxquant_localization_probabilities

extract_maxquant_localization_probabilities(
    localization_entry: str,
) -> dict[int, float]

Extract localization probabilities from a MaxQuant "Probabilities" entry.

Parameters:

Name Type Description Default
localization_entry str

Entry from the "Probabilities" columns of a MaxQuant msms.txt, evidence.txt or Sites.txt table.

required

Returns:

Type Description
dict[int, float]

A dictionary of {position: probability} mappings. Positions are one-indexed, which means that the first amino acid position is 1.

Example:

>>> extract_maxquant_localization_probabilities("IRT(0.989)AMNS(0.011)IER")
{3: 0.989, 7: 0.011}

Source code in msreport\reader.py
def extract_maxquant_localization_probabilities(
    localization_entry: str,
) -> dict[int, float]:
    """Extract localization probabilites from a MaxQuant "Probabilities" entry.

    Args:
        localization_entry: Entry from the "Probabilities" columns of a MaxQuant
            msms.txt, evidence.txt or Sites.txt table.

    Returns:
        A dictionary of {position: probability} mappings. Positions are one-indexed,
        which means that the first amino acid position is 1.

    Example:
    >>> extract_maxquant_localization_probabilities("IRT(0.989)AMNS(0.011)IER")
    {3: 0.989, 7: 0.011}
    """
    _, probabilities = msreport.peptidoform.parse_modified_sequence(
        localization_entry, "(", ")"
    )
    site_probabilities = {
        site: float(probability) for site, probability in probabilities
    }
    return site_probabilities

extract_fragpipe_localization_probabilities

extract_fragpipe_localization_probabilities(
    localization_entry: str,
) -> dict

Extract localization probabilities from a FragPipe "Localization" entry.

Parameters:

Name Type Description Default
localization_entry str

Entry from the "Localization" column of a FragPipe ions.tsv or combined_ions.tsv table.

required

Returns:

Type Description
dict

A dictionary of modifications containing a dictionary of {position: probability} mappings. Positions are one-indexed, which means that the first amino acid position is 1.

Example:

>>> extract_fragpipe_localization_probabilities(
...     "M:15.9949@FIM(1.000)TPTLK;STY:79.9663@FIMT(0.334)PT(0.666)LK;"
... )
{'15.9949': {3: 1.0}, '79.9663': {4: 0.334, 6: 0.666}}

Source code in msreport\reader.py
def extract_fragpipe_localization_probabilities(localization_entry: str) -> dict:
    """Extract localization probabilites from a FragPipe "Localization" entry.

    Args:
        localization_entry: Entry from the "Localization" column of a FragPipe
            ions.tsv or combined_ions.tsv table.

    Returns:
        A dictionary of modifications containing a dictionary of {position: probability}
        mappings. Positions are one-indexed, which means that the first amino acid
        position is 1.

    Example:
    >>> extract_fragpipe_localization_probabilities(
    ...     "M:15.9949@FIM(1.000)TPTLK;STY:79.9663@FIMT(0.334)PT(0.666)LK;"
    ... )
    {'15.9949': {3: 1.0}, '79.9663': {4: 0.334, 6: 0.666}}
    """
    modification_probabilities: dict[str, dict[int, float]] = {}
    for modification_entry in filter(None, localization_entry.split(";")):
        specified_modification, probability_sequence = modification_entry.split("@")
        _, modification = specified_modification.split(":")
        _, probabilities = msreport.peptidoform.parse_modified_sequence(
            probability_sequence, "(", ")"
        )
        if modification not in modification_probabilities:
            modification_probabilities[modification] = {}
        modification_probabilities[modification].update(
            {site: float(probability) for site, probability in probabilities}
        )
    return modification_probabilities

extract_spectronaut_localization_probabilities

extract_spectronaut_localization_probabilities(
    localization_entry: str,
) -> dict

Extract localization probabilities from a Spectronaut localization entry.

Parameters:

Name Type Description Default
localization_entry str

Entry from the "EG.PTMLocalizationProbabilities" column of a spectronaut elution group (EG) output table.

required

Returns:

Type Description
dict

A dictionary of modifications containing a dictionary of {position: probability} mappings. Positions are one-indexed, which means that the first amino acid position is 1.

Example:

>>> extract_spectronaut_localization_probabilities(
...     "_HM[Oxidation (M): 100%]S[Phospho (STY): 45.5%]GS[Phospho (STY): 54.5%]PG_"
... )
{'Oxidation (M)': {2: 1.0}, 'Phospho (STY)': {3: 0.455, 5: 0.545}}

Source code in msreport\reader.py
def extract_spectronaut_localization_probabilities(localization_entry: str) -> dict:
    """Extract localization probabilites from a Spectronaut localization entry.

    Args:
        localization_entry: Entry from the "EG.PTMLocalizationProbabilities" column of a
            Spectronaut elution group (EG) output table.

    Returns:
        A dictionary of modifications containing a dictionary of {position: probability}
        mappings. Positions are one-indexed, which means that the first amino acid
        position is 1.

    Example:
    >>> extract_spectronaut_localization_probabilities(
    ...     "_HM[Oxidation (M): 100%]S[Phospho (STY): 45.5%]GS[Phospho (STY): 54.5%]PG_"
    ... )
    {'Oxidation (M)': {2: 1.0}, 'Phospho (STY)': {3: 0.455, 5: 0.545}}
    """
    modification_probabilities: dict[str, dict[int, float]] = {}
    localization_entry = localization_entry.strip("_")
    _, raw_probability_entries = msreport.peptidoform.parse_modified_sequence(
        localization_entry, "[", "]"
    )

    for site, mod_probability_entry in raw_probability_entries:
        modification, probability_entry = mod_probability_entry.split(": ")
        if modification not in modification_probabilities:
            modification_probabilities[modification] = {}
        probability = float(probability_entry.replace("%", "")) / 100.0
        modification_probabilities[modification][site] = probability
    return modification_probabilities