
Reader

Provides tools for importing and standardizing quantitative proteomics data.

This module offers software-specific reader classes to import raw result tables (e.g., proteins, peptides, ions) from various proteomics software (MaxQuant, FragPipe, Spectronaut) and convert them into a standardized msreport format. Additionally, it provides functions for annotating imported data with biological metadata, such as protein information (e.g., sequence length, molecular weight) and peptide positions, extracted from a ProteinDatabase (FASTA file).

New columns added to imported protein tables:

- Representative protein
- Leading proteins
- Protein reported by software

Standardized column names for quantitative values (if available in the software output):

- Spectral count "sample name"
- Unique spectral count "sample name"
- Total spectral count "sample name"
- Intensity "sample name"
- LFQ intensity "sample name"
- iBAQ intensity "sample name"
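
A minimal workflow sketch (illustrative only): the paths are placeholders, and the annotation calls are commented out because their exact signatures are not shown on this page.

from msreport import reader

# Placeholder location of a MaxQuant "txt" results folder.
mq = reader.MaxQuantReader("path/to/maxquant/txt")
proteins = mq.import_proteins()  # protein table in the standardized msreport format
peptides = mq.import_peptides()  # peptide table in the standardized msreport format

# Annotation functions use a ProteinDatabase built from a FASTA file; the exact
# call signatures are assumptions and therefore only indicated here.
# proteins = reader.add_protein_annotation(proteins, protein_db)
# peptides = reader.add_peptide_positions(peptides, protein_db)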

Classes:

- Protein: Abstract protein entry.
- ProteinDatabase: Abstract protein database.
- ResultReader: Base reader class; not functional by itself.
- MaxQuantReader: MaxQuant result reader.
- FragPipeReader: FragPipe result reader.
- SpectronautReader: Spectronaut result reader.

Functions:

- sort_leading_proteins: Returns a copy of 'table' with sorted leading proteins.
- add_protein_annotation: Uses a FASTA protein database to add protein annotation columns.
- add_protein_site_annotation: Uses a FASTA protein database to add protein site annotation columns.
- add_leading_proteins_annotation: Uses a FASTA protein database to add leading protein annotation columns.
- add_protein_site_identifiers: Adds a "Protein site identifier" column to the 'table'.
- add_sequence_coverage: Calculates "Sequence coverage" and adds a new column to the 'protein_table'.
- add_ibaq_intensities: Adds iBAQ intensity columns to the 'table'.
- add_peptide_positions: Adds peptide "Start position" and "End position" columns to the table.
- add_protein_modifications: Adds a "Protein sites" column.
- propagate_representative_protein: Propagates the "Representative protein" column from the source to the target table.
- extract_sample_names: Extracts sample names from columns containing the 'tag' substring.
- extract_maxquant_localization_probabilities: Extracts localization probabilities from a MaxQuant "Probabilities" entry.
- extract_fragpipe_localization_probabilities: Extracts localization probabilities from a FragPipe "Localization" entry.
- extract_spectronaut_localization_probabilities: Extracts localization probabilities from a Spectronaut localization entry.

Protein

Bases: Protocol

Abstract protein entry

ProteinDatabase

Bases: Protocol

Abstract protein database

ResultReader

ResultReader()

Base reader class; not functional by itself.

Source code in msreport\reader.py
def __init__(self):
    self.data_directory = ""
    self.filenames = {}

MaxQuantReader

MaxQuantReader(
    directory: str,
    isobar: bool = False,
    contaminant_tag: str = "CON__",
)

Bases: ResultReader

MaxQuant result reader.

Methods:

- import_proteins: Reads a "proteinGroups.txt" file and returns a processed dataframe conforming to the MsReport naming convention.
- import_peptides: Reads a "peptides.txt" file and returns a processed dataframe conforming to the MsReport naming convention.
- import_ion_evidence: Reads an "evidence.txt" file and returns a processed dataframe conforming to the MsReport naming convention.

Attributes:

Name Type Description
default_filenames dict[str, str]

(class attribute) Look up of filenames for the result files generated by MaxQuant.

sample_column_tags list[str]

(class attribute) Column tags for which an additional column is present per sample.

column_mapping dict[str, str]

(class attribute) Used to rename original column names from MaxQuant according to the MsReport naming convention.

column_tag_mapping OrderedDict[str, str]

(class attribute) Mapping of original sample column tags from MaxQuant to column tags according to the MsReport naming convention, used to replace column names containing the original column tag.

protein_info_columns list[str]

(class attribute) List of columns that contain protein specific information. Used to allow removing all protein specific information prior to changing the representative protein.

protein_info_tags list[str]

(class attribute) List of tags present in columns that contain protein specific information per sample.

data_directory str

Location of the MaxQuant "txt" folder

filenames dict[str, str]

Lookup of filenames generated by MaxQuant.

contamination_tag str

Substring present in protein IDs to identify them as potential contaminants.

Parameters:

Name Type Description Default
directory str

Location of the MaxQuant "txt" folder.

required
isobar bool

Set to True if quantification strategy was TMT, iTRAQ or similar.

False
contaminant_tag str

Prefix of Protein ID entries to identify contaminants.

'CON__'
Source code in msreport\reader.py
def __init__(
    self, directory: str, isobar: bool = False, contaminant_tag: str = "CON__"
) -> None:
    """Initializes the MaxQuantReader.

    Args:
        directory: Location of the MaxQuant "txt" folder.
        isobar: Set to True if quantification strategy was TMT, iTRAQ or similar.
        contaminant_tag: Prefix of Protein ID entries to identify contaminants.
    """
    self._add_data_directory(directory)
    self.filenames: dict[str, str] = self.default_filenames
    self._isobar: bool = isobar
    self._contaminant_tag: str = contaminant_tag
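
A brief instantiation sketch (the directory and the alternative filename are placeholders):

from msreport.reader import MaxQuantReader

mq = MaxQuantReader("path/to/maxquant/txt")                   # label-free results
mq_tmt = MaxQuantReader("path/to/maxquant/txt", isobar=True)  # TMT/iTRAQ results

# Import methods also accept an explicit filename instead of the default one.
proteins = mq.import_proteins(filename="proteinGroups_subset.txt")  # hypothetical file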

import_proteins

import_proteins(
    filename: Optional[str] = None,
    rename_columns: bool = True,
    prefix_column_tags: bool = True,
    drop_decoy: bool = True,
    drop_idbysite: bool = True,
    drop_protein_info: bool = False,
) -> DataFrame

Reads a "proteinGroups.txt" file and returns a processed dataframe.

Adds three new protein entry columns to comply with the MsReport convention: "Protein reported by software", "Leading proteins", "Representative protein".

"Protein reported by software" contains the first protein ID from the "Majority protein IDs" column. "Leading proteins" contain all entries from the "Majority protein IDs" column that have the same and highest number of mapped peptides in the "Peptide counts (all)" column, multiple protein entries are separated by ";". "Representative protein" contains the first entry form "Leading proteins".

Several columns in the "combined_protein.tsv" file contain information specific for the protein entry of the "Protein" column. If leading proteins will be re-sorted later, it is recommended to remove columns containing protein specific information by setting 'drop_protein_info=True'.

Parameters:

Name Type Description Default
filename Optional[str]

allows specifying an alternative filename, otherwise the default filename is used.

None
rename_columns bool

If True, columns are renamed according to the MsReport convention; default True.

True
prefix_column_tags bool

If True, column tags such as "Intensity" are added in front of the sample names, e.g. "Intensity sample_name". If False, column tags are added afterwards, e.g. "Sample_name Intensity"; default True.

True
drop_decoy bool

If True, decoy entries are removed and the "Reverse" column is dropped; default True.

True
drop_idbysite bool

If True, protein groups that were only identified by site are removed and the "Only identified by site" column is dropped; default True.

True
drop_protein_info bool

If True, columns containing protein-specific information, such as "Gene names", "Sequence coverage [%]", or "iBAQ peptides", are removed. See MaxQuantReader.protein_info_columns and MaxQuantReader.protein_info_tags for a full list of columns that will be removed. Default False.

False

Returns:

Type Description
DataFrame

A dataframe containing the processed protein table.

Source code in msreport\reader.py
def import_proteins(
    self,
    filename: Optional[str] = None,
    rename_columns: bool = True,
    prefix_column_tags: bool = True,
    drop_decoy: bool = True,
    drop_idbysite: bool = True,
    drop_protein_info: bool = False,
) -> pd.DataFrame:
    """Reads a "proteinGroups.txt" file and returns a processed dataframe.

    Adds three new protein entry columns to comply with the MsReport convention:
    "Protein reported by software", "Leading proteins", "Representative protein".

    "Protein reported by software" contains the first protein ID from the "Majority
    protein IDs" column. "Leading proteins" contain all entries from the "Majority
    protein IDs" column that have the same and highest number of mapped peptides in
    the "Peptide counts (all)" column, multiple protein entries are separated by
    ";". "Representative protein" contains the first entry form "Leading proteins".

    Several columns in the "combined_protein.tsv" file contain information specific
    for the protein entry of the "Protein" column. If leading proteins will be
    re-sorted later, it is recommended to remove columns containing protein specific
    information by setting 'drop_protein_info=True'.

    Args:
        filename: allows specifying an alternative filename, otherwise the default
            filename is used.
        rename_columns: If True, columns are renamed according to the MsReport
            convention; default True.
        prefix_column_tags: If True, column tags such as "Intensity" are added
            in front of the sample names, e.g. "Intensity sample_name". If False,
            column tags are added afterwards, e.g. "Sample_name Intensity"; default
            True.
        drop_decoy: If True, decoy entries are removed and the "Reverse" column is
            dropped; default True.
        drop_idbysite: If True, protein groups that were only identified by site are
            removed and the "Only identified by site" columns is dropped; default
            True.
        drop_protein_info: If True, columns containing protein specific information,
            such as "Gene names", "Sequence coverage [%]" or "iBAQ peptides". See
            MaxQuantReader.protein_info_columns and MaxQuantReader.protein_info_tags
            for a full list of columns that will be removed. Default False.

    Returns:
        A dataframe containing the processed protein table.
    """
    df = self._read_file("proteins" if filename is None else filename)
    df = self._add_protein_entries(df)

    if drop_decoy:
        df = self._drop_decoy(df)
    if drop_idbysite:
        df = self._drop_idbysite(df)
    if drop_protein_info:
        df = self._drop_columns(df, self.protein_info_columns)
        for tag in self.protein_info_tags:
            df = self._drop_columns_by_tag(df, tag)
    if rename_columns:
        df = self._rename_columns(df, prefix_column_tags)
    return df
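
As recommended above, protein-specific columns can be dropped when leading proteins will be re-sorted afterwards. This is a hedged sketch of that flow; the signatures of sort_leading_proteins and add_leading_proteins_annotation are assumptions based on the function summaries at the top of this page.

from msreport import reader

mq = reader.MaxQuantReader("path/to/maxquant/txt")  # placeholder path
proteins = mq.import_proteins(drop_protein_info=True)

# Re-sort leading proteins and re-derive protein-level annotation from a
# FASTA-based ProteinDatabase afterwards; the calls are indicative only.
# proteins = reader.sort_leading_proteins(proteins)
# proteins = reader.add_leading_proteins_annotation(proteins, protein_db)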

import_peptides

import_peptides(
    filename: Optional[str] = None,
    rename_columns: bool = True,
    prefix_column_tags: bool = True,
    drop_decoy: bool = True,
) -> DataFrame

Reads a "peptides.txt" file and returns a processed dataframe.

Adds new columns to comply with the MsReport convention: "Protein reported by software" and "Representative protein", both contain the first entry from "Leading razor protein".

Parameters:

Name Type Description Default
filename Optional[str]

allows specifying an alternative filename, otherwise the default filename is used.

None
rename_columns bool

If True, columns are renamed according to the MsReport convention; default True.

True
prefix_column_tags bool

If True, column tags such as "Intensity" are added in front of the sample names, e.g. "Intensity sample_name". If False, column tags are added afterwards, e.g. "Sample_name Intensity"; default True.

True
drop_decoy bool

If True, decoy entries are removed and the "Reverse" column is dropped; default True.

True

Returns:

Type Description
DataFrame

A dataframe containing the processed peptide table.

Source code in msreport\reader.py
def import_peptides(
    self,
    filename: Optional[str] = None,
    rename_columns: bool = True,
    prefix_column_tags: bool = True,
    drop_decoy: bool = True,
) -> pd.DataFrame:
    """Reads a "peptides.txt" file and returns a processed dataframe.

    Adds new columns to comply with the MsReport convention:
    "Protein reported by software" and "Representative protein", both contain the
    first entry from "Leading razor protein".

    Args:
        filename: allows specifying an alternative filename, otherwise the default
            filename is used.
        rename_columns: If True, columns are renamed according to the MsReport
            convention; default True.
        prefix_column_tags: If True, column tags such as "Intensity" are added
            in front of the sample names, e.g. "Intensity sample_name". If False,
            column tags are added afterwards, e.g. "Sample_name Intensity"; default
            True.
        drop_decoy: If True, decoy entries are removed and the "Reverse" column is
            dropped; default True.

    Returns:
        A dataframe containing the processed peptide table.
    """
    # TODO: not tested
    df = self._read_file("peptides" if filename is None else filename)
    df["Protein reported by software"] = _extract_protein_ids(
        df["Leading razor protein"]
    )
    df["Representative protein"] = df["Protein reported by software"]
    # Note that _add_protein_entries would need to be adapted for the peptide table.
    # df = self._add_protein_entries(df)
    if drop_decoy:
        df = self._drop_decoy(df)
    if rename_columns:
        df = self._rename_columns(df, prefix_column_tags)
    return df
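
A short sketch for the peptide table; the add_peptide_positions signature is an assumption based on its summary at the top of this page.

# 'mq' is a MaxQuantReader instance as shown above.
peptides = mq.import_peptides(drop_decoy=True)

# Peptide "Start position" and "End position" columns can be derived afterwards
# from a FASTA-based ProteinDatabase; the call is indicative only.
# peptides = reader.add_peptide_positions(peptides, protein_db)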

import_ion_evidence

import_ion_evidence(
    filename: Optional[str] = None,
    rename_columns: bool = True,
    rewrite_modifications: bool = True,
    drop_decoy: bool = True,
) -> DataFrame

Reads an "evidence.txt" file and returns a processed dataframe.

Adds new columns to comply with the MsReport convention: "Modified sequence", "Modifications", and "Modification localization string". "Protein reported by software" and "Representative protein" both contain the first entry from "Leading razor protein". "Ion ID" contains unique entries for each ion, generated by concatenating the "Modified sequence" and "Charge" columns and, if present, the "Compensation voltage" column.

"Modified sequence" entries contain modifications within square brackets. "Modification" entries are strings in the form of "position:modification_tag", multiple modifications are joined by ";". An example for a modified sequence and a modification entry: "PEPT[Phospho]IDO[Oxidation]", "4:Phospho;7:Oxidation".

"Modification localization string" contains localization probabilities in the format "Mod1@Site1:Probability1,Site2:Probability2;Mod2@Site3:Probability3", e.g. "15.9949@11:1.000;79.9663@3:0.200,4:0.800". Refer to msreport.peptidoform.make_localization_string for details.

Parameters:

Name Type Description Default
filename Optional[str]

Allows specifying an alternative filename, otherwise the default filename is used.

None
rename_columns bool

If True, columns are renamed according to the MsReport convention; default True.

True
rewrite_modifications bool

If True, the peptide format in "Modified sequence" is changed according to the MsReport convention, and a "Modifications" column is added that contains the amino acid positions of all modifications. Requires 'rename_columns' to be True. Default True.

True
drop_decoy bool

If True, decoy entries are removed and the "Reverse" column is dropped; default True.

True

Returns:

Type Description
DataFrame

A dataframe containing the processed ion table.

Source code in msreport\reader.py
def import_ion_evidence(
    self,
    filename: Optional[str] = None,
    rename_columns: bool = True,
    rewrite_modifications: bool = True,
    drop_decoy: bool = True,
) -> pd.DataFrame:
    """Reads an "evidence.txt" file and returns a processed dataframe.

    Adds new columns to comply with the MsReport convention. "Modified sequence",
    "Modifications columns", "Modification localization string". "Protein reported
    by software" and "Representative protein", both contain the first entry from
    "Leading razor protein". "Ion ID" contains unique entries for each ion, which
    are generated by concatenating the "Modified sequence" and "Charge" columns, and
    if present, the "Compensation voltage" column.

    "Modified sequence" entries contain modifications within square brackets.
    "Modification" entries are strings in the form of "position:modification_tag",
    multiple modifications are joined by ";". An example for a modified sequence and
    a modification entry: "PEPT[Phospho]IDO[Oxidation]", "4:Phospho;7:Oxidation".

    "Modification localization string" contains localization probabilities in the
    format "Mod1@Site1:Probability1,Site2:Probability2;Mod2@Site3:Probability3",
    e.g. "15.9949@11:1.000;79.9663@3:0.200,4:0.800". Refer to
    `msreport.peptidoform.make_localization_string` for details.

    Args:
        filename: Allows specifying an alternative filename, otherwise the default
            filename is used.
        rename_columns: If True, columns are renamed according to the MsReport
            convention; default True.
        rewrite_modifications: If True, the peptide format in "Modified sequence" is
            changed according to the MsReport convention, and a "Modifications" is
            added to contains the amino acid position for all modifications.
            Requires 'rename_columns' to be true. Default True.
        drop_decoy: If True, decoy entries are removed and the "Reverse" column is
            dropped; default True.

    Returns:
        A dataframe containing the processed ion table.
    """
    # TODO: not tested
    df = self._read_file("ion_evidence" if filename is None else filename)
    df["Protein reported by software"] = _extract_protein_ids(
        df["Leading razor protein"]
    )
    df["Representative protein"] = df["Protein reported by software"]

    if drop_decoy:
        df = self._drop_decoy(df)
    if rename_columns:
        # Actually there are no column tags as the table is in long format
        df = self._rename_columns(df, prefix_tag=True)
    if rewrite_modifications and rename_columns:
        df = self._add_peptide_modification_entries(df)
        df = self._add_modification_localization_string(df)
        df["Ion ID"] = df["Modified sequence"] + "_c" + df["Charge"].astype(str)
        if "Compensation voltage" in df.columns:
            _cv = df["Compensation voltage"].astype(str)
            df["Ion ID"] = df["Ion ID"] + "_cv" + _cv
    return df
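
The "Ion ID" construction described above can be illustrated in isolation; this is a sketch of the same concatenation rule, not a call into the reader.

modified_sequence = "PEPT[Phospho]IDO[Oxidation]"
charge = 2
compensation_voltage = -45.0  # only appended when the column is present

ion_id = f"{modified_sequence}_c{charge}"
ion_id += f"_cv{compensation_voltage}"
# ion_id == "PEPT[Phospho]IDO[Oxidation]_c2_cv-45.0"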

FragPipeReader

FragPipeReader(
    directory: str,
    isobar: bool = False,
    sil: bool = False,
    contaminant_tag: str = "contam_",
)

Bases: ResultReader

FragPipe result reader.

Methods:

- import_design: Reads a "fragpipe-files.fp-manifest" file and returns a processed design dataframe.
- import_proteins: Reads a "combined_protein.tsv" or "protein.tsv" file and returns a processed dataframe conforming to the MsReport naming convention.
- import_peptides: Reads a "combined_peptide.tsv" or "peptide.tsv" file and returns a processed dataframe conforming to the MsReport naming convention.
- import_ions: Reads a "combined_ion.tsv" or "ion.tsv" file and returns a processed dataframe conforming to the MsReport naming convention.
- import_ion_evidence: Reads and concatenates all "ion.tsv" files and returns a processed dataframe conforming to the MsReport naming convention.
- import_psm_evidence: Reads and concatenates all "psm.tsv" files and returns a processed dataframe.

Attributes:

Name Type Description
default_filenames dict[str, str]

(class attribute) Look up of default filenames of the result files generated by FragPipe.

isobar_filenames dict[str, str]

(class attribute) Look up of default filenames of the result files generated by FragPipe, which are relevant when using isobaric quantification.

sample_column_tags list[str]

(class attribute) Tags (column name substrings) that identify sample columns. Sample columns are those for which one unique column is present per sample, for example intensity columns.

column_mapping dict[str, str]

(class attribute) Used to rename original column names from FragPipe according to the MsReport naming convention.

column_tag_mapping OrderedDict[str, str]

(class attribute) Mapping of original sample column tags from FragPipe to column tags according to the MsReport naming convention, used to replace column names containing the original column tag.

protein_info_columns list[str]

(class attribute) List of columns that contain information specific to the leading protein.

protein_info_tags list[str]

(class attribute) List of substrings present in columns that contain information specific to the leading protein.

data_directory str

Location of the folder containing FragPipe result files.

filenames dict[str, str]

Look up of FragPipe result filenames used for importing protein or other tables.

contamination_tag str

Substring present in protein IDs to identify them as potential contaminants.

Parameters:

Name Type Description Default
directory str

Location of the FragPipe result folder

required
isobar bool

Set to True if quantification strategy was TMT, iTRAQ or similar; default False.

False
sil bool

Set to True if the FragPipe result files are from a stable isotope labeling experiment, such as SILAC; default False.

False
contaminant_tag str

Prefix of Protein ID entries to identify contaminants; default "contam_".

'contam_'
Source code in msreport\reader.py
def __init__(
    self,
    directory: str,
    isobar: bool = False,
    sil: bool = False,
    contaminant_tag: str = "contam_",
) -> None:
    """Initializes the FragPipeReader.

    Args:
        directory: Location of the FragPipe result folder
        isobar: Set to True if quantification strategy was TMT, iTRAQ or similar;
            default False.
        sil: Set to True if the FragPipe result files are from a stable isotope
            labeling experiment, such as SILAC; default False.
        contaminant_tag: Prefix of Protein ID entries to identify contaminants;
            default "contam_".
    """
    if sil and isobar:
        raise ValueError("Cannot set both 'isobar' and 'sil' to True.")
    self._add_data_directory(directory)
    self._isobar: bool = isobar
    self._sil: bool = sil
    self._contaminant_tag: str = contaminant_tag
    if isobar:
        self.filenames = self.isobar_filenames
    elif sil:
        self.filenames = self.sil_filenames
    else:
        self.filenames = self.default_filenames
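
A brief instantiation sketch for the three quantification modes (the folder path is a placeholder); note that 'isobar' and 'sil' are mutually exclusive.

from msreport.reader import FragPipeReader

fp_lfq = FragPipeReader("path/to/fragpipe_results")                # label-free (default)
fp_tmt = FragPipeReader("path/to/fragpipe_results", isobar=True)   # TMT/iTRAQ
fp_sil = FragPipeReader("path/to/fragpipe_results", sil=True)      # SILAC-style labeling

# Combining both labeling modes is rejected:
# FragPipeReader("path/to/fragpipe_results", isobar=True, sil=True)  # raises ValueError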

import_design

import_design(
    filename: Optional[str] = None, sort: bool = False
) -> DataFrame

Reads an 'fp-manifest' file and returns a processed design dataframe.

The manifest columns "Path", "Experiment", and "Bioreplicate" are mapped to the design table columns "Rawfile", "Experiment", and "Replicate". The "Rawfile" column is extracted as the filename from the full path. The "Sample" column is generated by combining "Experiment" and "Replicate" with an underscore (e.g., "Experiment_Replicate"), except when "Replicate" is empty, in which case "Sample" is set to "Experiment". If "Experiment" is missing, it is set to "exp" by default.

Parameters:

Name Type Description Default
filename Optional[str]

Allows specifying an alternative filename, otherwise the default filename is used.

None
sort bool

If True, the design dataframe is sorted by "Experiment" and "Replicate"; default False.

False

Returns:

Type Description
DataFrame

A dataframe containing the processed design table with columns:

DataFrame

"Sample", "Experiment", "Replicate", "Rawfile".

Raises:

Type Description
FileNotFoundError

If the specified manifest file does not exist.

Source code in msreport\reader.py
def import_design(
    self, filename: Optional[str] = None, sort: bool = False
) -> pd.DataFrame:
    """Read a 'fp-manifest' file and returns a processed design dataframe.

    The manifest columns "Path", "Experiment", and "Bioreplicate" are mapped to the
    design table columns "Rawfile", "Experiment", and "Replicate". The "Rawfile"
    column is extracted as the filename from the full path. The "Sample" column is
    generated by combining "Experiment" and "Replicate" with an underscore
    (e.g., "Experiment_Replicate"), except when "Replicate" is empty, in which case
    "Sample" is set to "Experiment". If "Experiment" is missing, it is set to "exp"
    by default.

    Args:
        filename: Allows specifying an alternative filename, otherwise the default
            filename is used.
        sort: If True, the design dataframe is sorted by "Experiment" and
            "Replicate"; default False.

    Returns:
        A dataframe containing the processed design table with columns:
        "Sample", "Experiment", "Replicate", "Rawfile".

    Raises:
        FileNotFoundError: If the specified manifest file does not exist.
    """
    if filename is None:
        filepath = os.path.join(self.data_directory, self.filenames["design"])
    else:
        filepath = os.path.join(self.data_directory, filename)
    if not os.path.exists(filepath):
        raise FileNotFoundError(
            f"File '{filepath}' does not exist. Please check the file path."
        )
    fp_manifest = (
        pd.read_csv(
            filepath, sep="\t", header=None, na_values=[""], keep_default_na=False
        )
        .fillna("")
        .astype(str)
    )
    fp_manifest.columns = ["Path", "Experiment", "Bioreplicate", "Data type"]

    design = pd.DataFrame(
        {
            "Sample": "",
            "Experiment": fp_manifest["Experiment"],
            "Replicate": fp_manifest["Bioreplicate"],
            "Rawfile": fp_manifest["Path"].apply(
                # Required to handle Windows and Unix style paths on either system
                lambda x: x.replace("\\", "/").split("/")[-1]
            ),
        }
    )
    # FragPipe uses "exp" for missing 'Experiment' values
    design.loc[design["Experiment"] == "", "Experiment"] = "exp"
    # FragPipe combines 'Experiment' + "_" + 'Replicate' into 'Sample', except when
    # 'Replicate' is empty, in which case 'Sample' is set to 'Experiment'.
    design["Sample"] = design["Experiment"] + "_" + design["Replicate"]
    design.loc[design["Replicate"] == "", "Sample"] = design["Experiment"]

    if sort:
        design.sort_values(by=["Experiment", "Replicate"], inplace=True)
        design.reset_index(drop=True, inplace=True)
    return design
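
The sample naming rule described above can be summarized with a small usage sketch (values are illustrative).

# 'fp_lfq' is the FragPipeReader instance from the sketch above.
design = fp_lfq.import_design(sort=True)

# For a manifest row with Experiment="KO" and Bioreplicate="1", the resulting design
# row is Sample="KO_1", Experiment="KO", Replicate="1". With an empty Bioreplicate
# the Sample equals the Experiment, and a missing Experiment is replaced by "exp",
# mirroring FragPipe's own behavior.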

import_proteins

import_proteins(
    filename: Optional[str] = None,
    rename_columns: bool = True,
    prefix_column_tags: bool = True,
    drop_protein_info: bool = False,
) -> DataFrame

Reads a "combined_protein.tsv" or "protein.tsv" file and returns a processed dataframe.

Adds four protein entry columns to comply with the MsReport convention: "Protein reported by software", "Leading proteins", "Representative protein", "Potential contaminant".

"Protein reported by software" contains the protein ID extracted from the "Protein" column. "Leading proteins" contains the combined protein IDs extracted from the "Protein" and "Indistinguishable Proteins" columns, multiple entries are separated by ";". "Representative protein" contains the first entry form "Leading proteins".

Several columns in the "combined_protein.tsv" file contain information specific for the protein entry of the "Protein" column. If leading proteins will be re-sorted later, it is recommended to remove columns containing protein specific information by setting 'drop_protein_info=True'..

Parameters:

Name Type Description Default
filename Optional[str]

Allows specifying an alternative filename, otherwise the default filename is used.

None
rename_columns bool

If True, columns are renamed according to the MsReport convention; default True.

True
prefix_column_tags bool

If True, column tags such as "Intensity" are added in front of the sample names, e.g. "Intensity sample_name". If False, column tags are added afterwards, e.g. "Sample_name Intensity"; default True.

True
drop_protein_info bool

If True, columns containing protein-specific information, such as "Gene" or "Protein Length", are removed. See FragPipeReader.protein_info_columns and FragPipeReader.protein_info_tags for a full list of columns that will be removed. Default False.

False

Returns:

Type Description
DataFrame

A dataframe containing the processed protein table.

Source code in msreport\reader.py
def import_proteins(
    self,
    filename: Optional[str] = None,
    rename_columns: bool = True,
    prefix_column_tags: bool = True,
    drop_protein_info: bool = False,
) -> pd.DataFrame:
    """Reads a "combined_protein.tsv" or "protein.tsv" file and returns a processed
    dataframe.

    Adds four protein entry columns to comply with the MsReport convention:
    "Protein reported by software", "Leading proteins", "Representative protein",
    "Potential contaminant".

    "Protein reported by software" contains the protein ID extracted from the
    "Protein" column. "Leading proteins" contains the combined protein IDs extracted
    from the "Protein" and "Indistinguishable Proteins" columns, multiple entries
    are separated by ";". "Representative protein" contains the first entry form
    "Leading proteins".

    Several columns in the "combined_protein.tsv" file contain information specific
    for the protein entry of the "Protein" column. If leading proteins will be
    re-sorted later, it is recommended to remove columns containing protein specific
    information by setting 'drop_protein_info=True'..

    Args:
        filename: Allows specifying an alternative filename, otherwise the default
            filename is used.
        rename_columns: If True, columns are renamed according to the MsReport
            convention; default True.
        prefix_column_tags: If True, column tags such as "Intensity" are added
            in front of the sample names, e.g. "Intensity sample_name". If False,
            column tags are added afterwards, e.g. "Sample_name Intensity"; default
            True.
        drop_protein_info: If True, columns containing protein specific information,
            such as "Gene" or "Protein Length". See
            FragPipeReader.protein_info_columns and FragPipeReader.protein_info_tags
            for a full list of columns that will be removed. Default False.

    Returns:
        A dataframe containing the processed protein table.
    """
    df = self._read_file("proteins" if filename is None else filename)
    df = self._add_protein_entries(df)
    if drop_protein_info:
        df = self._drop_columns(df, self.protein_info_columns)
        for tag in self.protein_info_tags:
            df = self._drop_columns_by_tag(df, tag)
    if rename_columns:
        df = self._rename_columns(df, prefix_column_tags)
    return df
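
A usage sketch (placeholder path); the extract_sample_names signature is an assumption based on its summary at the top of this page.

import msreport.reader as reader

fp = reader.FragPipeReader("path/to/fragpipe_results")
proteins = fp.import_proteins(drop_protein_info=True)

# Sample names can afterwards be recovered from the standardized intensity columns;
# the exact call is indicative only.
# samples = reader.extract_sample_names(proteins, tag="LFQ intensity")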

import_peptides

import_peptides(
    filename: Optional[str] = None,
    rename_columns: bool = True,
    prefix_column_tags: bool = True,
) -> DataFrame

Reads a "combined_peptides.txt" file and returns a processed dataframe.

Adds a new column to comply with the MsReport convention: "Protein reported by software"

Parameters:

Name Type Description Default
filename Optional[str]

allows specifying an alternative filename, otherwise the default filename is used.

None
rename_columns bool

If True, columns are renamed according to the MsReport convention; default True.

True
prefix_column_tags bool

If True, column tags such as "Intensity" are added in front of the sample names, e.g. "Intensity sample_name". If False, column tags are added afterwards, e.g. "Sample_name Intensity"; default True.

True

Returns:

Type Description
DataFrame

A dataframe containing the processed peptide table.

Source code in msreport\reader.py
def import_peptides(
    self,
    filename: Optional[str] = None,
    rename_columns: bool = True,
    prefix_column_tags: bool = True,
) -> pd.DataFrame:
    """Reads a "combined_peptides.txt" file and returns a processed dataframe.

    Adds a new column to comply with the MsReport convention:
    "Protein reported by software"

    Args:
        filename: allows specifying an alternative filename, otherwise the default
            filename is used.
        rename_columns: If True, columns are renamed according to the MsReport
            convention; default True.
        prefix_column_tags: If True, column tags such as "Intensity" are added
            in front of the sample names, e.g. "Intensity sample_name". If False,
            column tags are added afterwards, e.g. "Sample_name Intensity"; default
            True.

    Returns:
        A dataframe containing the processed peptide table.
    """
    # TODO: not tested
    df = self._read_file("peptides" if filename is None else filename)
    df["Protein reported by software"] = _extract_protein_ids(df["Protein"])
    df["Representative protein"] = df["Protein reported by software"]
    df["Mapped Proteins"] = self._collect_mapped_proteins(df)
    # Note that _add_protein_entries would need to be adapted for the peptide table.
    # df = self._add_protein_entries(df)
    if rename_columns:
        df = self._rename_columns(df, prefix_column_tags)
    return df

import_ions

import_ions(
    filename: Optional[str] = None,
    rename_columns: bool = True,
    rewrite_modifications: bool = True,
    prefix_column_tags: bool = True,
) -> DataFrame

Reads a "combined_ion.tsv" or "ion.tsv" file and returns a processed dataframe.

Adds new columns to comply with the MsReport convention: "Modified sequence" and "Modifications". "Protein reported by software" and "Representative protein" both contain the protein ID extracted from the "Protein" column. "Ion ID" contains unique entries for each ion, generated by concatenating the "Modified sequence" and "Charge" columns and, if present, the "Compensation voltage" column.

"Modified sequence" entries contain modifications within square brackets. "Modification" entries are strings in the form of "position:modification_text", multiple modifications are joined by ";". An example for a modified sequence and a modification entry: "PEPT[Phospho]IDO[Oxidation]", "4:Phospho;7:Oxidation".

Note that currently the format of the modification itself, as well as the site localization probability are not modified; and no protein site entries are added.

Parameters:

Name Type Description Default
filename Optional[str]

Allows specifying an alternative filename, otherwise the default filename is used.

None
rename_columns bool

If True, columns are renamed according to the MsReport convention; default True.

True
rewrite_modifications bool

If True, the peptide format in "Modified sequence" is changed according to the MsReport convention, and a "Modifications" column is added that contains the amino acid positions of all modifications. Requires 'rename_columns' to be True. Default True.

True
prefix_column_tags bool

If True, column tags such as "Intensity" are added in front of the sample names, e.g. "Intensity sample_name". If False, column tags are added afterwards, e.g. "Sample_name Intensity"; default True.

True

Returns:

Type Description
DataFrame

A DataFrame containing the processed ion table.

Source code in msreport\reader.py
def import_ions(
    self,
    filename: Optional[str] = None,
    rename_columns: bool = True,
    rewrite_modifications: bool = True,
    prefix_column_tags: bool = True,
) -> pd.DataFrame:
    """Reads a "combined_ion.tsv" or "ion.tsv" file and returns a processed
    dataframe.

    Adds new columns to comply with the MsReport convention. "Modified sequence"
    and "Modifications columns". "Protein reported by software" and "Representative
    protein", both contain the first entry from "Leading razor protein". "Ion ID"
    contains unique entries for each ion, which are generated by concatenating the
    "Modified sequence" and "Charge" columns, and if present, the
    "Compensation voltage" column.

    "Modified sequence" entries contain modifications within square brackets.
    "Modification" entries are strings in the form of "position:modification_text",
    multiple modifications are joined by ";". An example for a modified sequence and
    a modification entry: "PEPT[Phospho]IDO[Oxidation]", "4:Phospho;7:Oxidation".

    Note that currently the format of the modification itself, as well as the
    site localization probability are not modified; and no protein site entries are
    added.

    Args:
        filename: Allows specifying an alternative filename, otherwise the default
            filename is used.
        rename_columns: If True, columns are renamed according to the MsReport
            convention; default True.
        rewrite_modifications: If True, the peptide format in "Modified sequence" is
            changed according to the MsReport convention, and a "Modifications" is
            added to contains the amino acid position for all modifications.
            Requires 'rename_columns' to be true. Default True.
        prefix_column_tags: If True, column tags such as "Intensity" are added
            in front of the sample names, e.g. "Intensity sample_name". If False,
            column tags are added afterwards, e.g. "Sample_name Intensity"; default
            True.

    Returns:
        A DataFrame containing the processed ion table.
    """
    # TODO: not tested #
    df = self._read_file("ions" if filename is None else filename)

    # FUTURE: replace this by _add_protein_entries(df, False) if FragPipe adds
    #         'Indistinguishable Proteins' to the ion table.
    df["Protein reported by software"] = _extract_protein_ids(df["Protein"])
    df["Representative protein"] = df["Protein reported by software"]
    df["Mapped Proteins"] = self._collect_mapped_proteins(df)

    if rename_columns:
        df = self._rename_columns(df, prefix_column_tags)
    if rewrite_modifications and rename_columns:
        df = self._add_peptide_modification_entries(df)
        df = self._add_modification_localization_string(df, prefix_column_tags)
        df["Ion ID"] = df["Modified sequence"] + "_c" + df["Charge"].astype(str)
        if "Compensation voltage" in df.columns:
            _cv = df["Compensation voltage"].astype(str)
            df["Ion ID"] = df["Ion ID"] + "_cv" + _cv

    return df

import_ion_evidence

import_ion_evidence(
    filename: Optional[str] = None,
    rename_columns: bool = True,
    rewrite_modifications: bool = True,
    prefix_column_tags: bool = True,
) -> DataFrame

Reads and concatenates all "ion.tsv" files and returns a processed dataframe.

Adds new columns to comply with the MsReport convention: "Modified sequence", "Modifications", and "Modification localization string". "Protein reported by software" and "Representative protein" both contain the protein ID extracted from the "Protein" column. "Ion ID" contains unique entries for each ion, generated by concatenating the "Modified sequence" and "Charge" columns and, if present, the "Compensation voltage" column.

"Modified sequence" entries contain modifications within square brackets. "Modification" entries are strings in the form of "position:modification_text", multiple modifications are joined by ";". An example for a modified sequence and a modification entry: "PEPT[Phospho]IDO[Oxidation]", "4:Phospho;7:Oxidation".

"Modification localization string" contains localization probabilities in the format "Mod1@Site1:Probability1,Site2:Probability2;Mod2@Site3:Probability3", e.g. "15.9949@11:1.000;79.9663@3:0.200,4:0.800". Refer to msreport.peptidoform.make_localization_string for details.

Parameters:

Name Type Description Default
filename Optional[str]

Allows specifying an alternative filename, otherwise the default filename is used.

None
rename_columns bool

If True, columns are renamed according to the MsReport convention; default True.

True
rewrite_modifications bool

If True, the peptide format in "Modified sequence" is changed according to the MsReport convention, and a "Modifications" column is added that contains the amino acid positions of all modifications. Requires 'rename_columns' to be True. Default True.

True
prefix_column_tags bool

If True, column tags such as "Intensity" are added in front of the sample names, e.g. "Intensity sample_name". If False, column tags are added afterwards, e.g. "Sample_name Intensity"; default True.

True

Returns:

Type Description
DataFrame

A DataFrame containing the processed ion table.

Source code in msreport\reader.py
def import_ion_evidence(
    self,
    filename: Optional[str] = None,
    rename_columns: bool = True,
    rewrite_modifications: bool = True,
    prefix_column_tags: bool = True,
) -> pd.DataFrame:
    """Reads and concatenates all "ion.tsv" files and returns a processed dataframe.

    Adds new columns to comply with the MsReport convention. "Modified sequence",
    "Modifications", and "Modification localization string" columns. "Protein
    reported by software" and "Representative protein", both contain the first entry
    from "Leading razor protein". "Ion ID" contains unique entries for each ion,
    which are generated by concatenating the "Modified sequence" and "Charge"
    columns, and if present, the "Compensation voltage" column.

    "Modified sequence" entries contain modifications within square brackets.
    "Modification" entries are strings in the form of "position:modification_text",
    multiple modifications are joined by ";". An example for a modified sequence and
    a modification entry: "PEPT[Phospho]IDO[Oxidation]", "4:Phospho;7:Oxidation".

    "Modification localization string" contains localization probabilities in the
    format "Mod1@Site1:Probability1,Site2:Probability2;Mod2@Site3:Probability3",
    e.g. "15.9949@11:1.000;79.9663@3:0.200,4:0.800". Refer to
    `msreport.peptidoform.make_localization_string` for details.

    Args:
        filename: Allows specifying an alternative filename, otherwise the default
            filename is used.
        rename_columns: If True, columns are renamed according to the MsReport
            convention; default True.
        rewrite_modifications: If True, the peptide format in "Modified sequence" is
            changed according to the MsReport convention, and a "Modifications" is
            added to contains the amino acid position for all modifications.
            Requires 'rename_columns' to be true. Default True.
        prefix_column_tags: If True, column tags such as "Intensity" are added
            in front of the sample names, e.g. "Intensity sample_name". If False,
            column tags are added afterwards, e.g. "Sample_name Intensity"; default
            True.

    Returns:
        A DataFrame containing the processed ion table.
    """
    # TODO: not tested #

    # --- Get paths of all ion.tsv files --- #
    if filename is None:
        filename = self.default_filenames["ion_evidence"]

    ion_table_paths = []
    for path in pathlib.Path(self.data_directory).iterdir():
        ion_table_path = path / filename
        if path.is_dir() and ion_table_path.exists():
            ion_table_paths.append(ion_table_path)

    # --- like self._read_file --- #
    ion_tables = []
    for filepath in ion_table_paths:
        table = pd.read_csv(filepath, sep="\t", low_memory=False)
        str_cols = table.select_dtypes(include=["object"]).columns
        table.loc[:, str_cols] = table.loc[:, str_cols].fillna("")

        table["Sample"] = filepath.parent.name
        ion_tables.append(table)
    df = pd.concat(ion_tables, ignore_index=True)

    # --- Process dataframe --- #
    df["Ion ID"] = df["Modified Sequence"] + "_c" + df["Charge"].astype(str)
    if "Compensation Voltage" in df.columns:
        df["Ion ID"] = df["Ion ID"] + "_cv" + df["Compensation Voltage"].astype(str)
    # FUTURE: replace this by _add_protein_entries(df, False) if FragPipe adds
    #         'Indistinguishable Proteins' to the ion table.
    df["Protein reported by software"] = _extract_protein_ids(df["Protein"])
    df["Representative protein"] = df["Protein reported by software"]
    df["Mapped Proteins"] = self._collect_mapped_proteins(df)

    if rename_columns:
        df = self._rename_columns(df, prefix_column_tags)
    if rewrite_modifications and rename_columns:
        df = self._add_peptide_modification_entries(df)
        df = self._add_modification_localization_string(df, prefix_column_tags)
    return df
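
A sketch of the per-sample layout this method expects (directory names are placeholders); each sample sub-folder contributes one "ion.tsv", and the sub-folder name is written into the "Sample" column.

# Expected layout (illustrative):
#   fragpipe_results/
#       sample_A/ion.tsv
#       sample_B/ion.tsv
# 'fp' is the FragPipeReader instance from the sketch above.
ions = fp.import_ion_evidence()
# 'ions' now holds the concatenated tables with "Sample" set to "sample_A"/"sample_B".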

import_psm_evidence

import_psm_evidence(
    filename: Optional[str] = None,
    rename_columns: bool = True,
    rewrite_modifications: bool = True,
) -> DataFrame

Concatenate all "psm.tsv" files and return a processed dataframe.

Parameters:

Name Type Description Default
filename Optional[str]

Allows specifying an alternative filename, otherwise the default filename is used.

None
rename_columns bool

If True, columns are renamed according to the MsReport convention; default True.

True
rewrite_modifications bool

If True, the peptide format in "Modified sequence" is changed according to the MsReport convention, and a "Modifications" column is added that contains the amino acid positions of all modifications. Requires 'rename_columns' to be True. Default True.

True

Returns:

Type Description
DataFrame

A DataFrame containing the processed psm evidence tables.

Source code in msreport\reader.py
def import_psm_evidence(
    self,
    filename: Optional[str] = None,
    rename_columns: bool = True,
    rewrite_modifications: bool = True,
) -> pd.DataFrame:
    """Concatenate all "psm.tsv" files and return a processed dataframe.

    Args:
        filename: Allows specifying an alternative filename, otherwise the default
            filename is used.
        rename_columns: If True, columns are renamed according to the MsReport
            convention; default True.
        rewrite_modifications: If True, the peptide format in "Modified sequence" is
            changed according to the MsReport convention, and a "Modifications" is
            added to contains the amino acid position for all modifications.
            Requires 'rename_columns' to be true. Default True.

    Returns:
        A DataFrame containing the processed psm evidence tables.
    """
    if filename is None:
        filename = self.default_filenames["psm_evidence"]

    psm_table_paths = []
    for path in pathlib.Path(self.data_directory).iterdir():
        psm_table_path = path / filename
        if path.is_dir() and psm_table_path.exists():
            psm_table_paths.append(psm_table_path)

    psm_tables = []
    for filepath in psm_table_paths:
        table = pd.read_csv(filepath, sep="\t", low_memory=False)
        str_cols = table.select_dtypes(include=["object"]).columns
        table.loc[:, str_cols] = table.loc[:, str_cols].fillna("")

        table["Sample"] = filepath.parent.name
        psm_tables.append(table)
    df = pd.concat(psm_tables, ignore_index=True)

    df["Protein reported by software"] = _extract_protein_ids(df["Protein"])
    df["Representative protein"] = df["Protein reported by software"]
    df["Mapped Proteins"] = self._collect_mapped_proteins(df)

    if rename_columns:
        df = self._rename_columns(df, prefix_tag=True)
    if rewrite_modifications and rename_columns:
        mod_entries = _generate_modification_entries_from_assigned_modifications(
            df["Peptide sequence"], df["Assigned Modifications"]
        )
        df["Modified sequence"] = mod_entries["Modified sequence"]
        df["Modifications"] = mod_entries["Modifications"]
    return df

SpectronautReader

SpectronautReader(
    directory: str, contaminant_tag: str = "contam_"
)

Bases: ResultReader

Spectronaut result reader.

Methods:

- import_proteins: Reads an LFQ protein report file and returns a processed dataframe conforming to the MsReport naming convention.
- import_design: Reads a ConditionSetup file and returns a processed dataframe containing the default columns of an MsReport experimental design table.

Attributes:

Name Type Description
default_filetags dict[str, str]

(class attribute) Look up of default file tags for the outputs generated by Spectronaut.

sample_column_tags list[str]

(class attribute) Tags (column name substrings) that identify sample columns. Sample columns are those for which one unique column is present per sample, for example intensity columns.

column_mapping dict[str, str]

(class attribute) Used to rename original column names from Spectronaut according to the MsReport naming convention.

column_tag_mapping OrderedDict[str, str]

(class attribute) Mapping of original sample column tags from Spectronaut to column tags according to the MsReport naming convention, used to replace column names containing the original column tag.

protein_info_columns list[str]

(class attribute) List of columns that contain information specific to the leading protein.

protein_info_tags list[str]

(class attribute) List of substrings present in columns that contain information specific to the leading protein.

data_directory str

Location of the folder containing Spectronaut result files.

filetags dict[str, str]

Look up of file tags used for matching files during the import of protein or other tables.

contamination_tag str

Substring present in protein IDs to identify them as potential contaminants.

Parameters:

Name Type Description Default
directory str

Location of the Spectronaut result folder.

required
contaminant_tag str

Prefix of Protein ID entries to identify contaminants; default "contam_".

'contam_'
Source code in msreport\reader.py
def __init__(self, directory: str, contaminant_tag: str = "contam_") -> None:
    """Initializes the SpectronautReader.

    Args:
        directory: Location of the Spectronaut result folder.
        contaminant_tag: Prefix of Protein ID entries to identify contaminants;
            default "contam_".
    """
    self.data_directory = directory
    self.filetags: dict[str, str] = self.default_filetags
    self.filenames = {}
    self._contaminant_tag: str = contaminant_tag
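
A brief instantiation sketch (the folder path is a placeholder). File selection is tag based, so the report files only need to contain the expected tag in their name.

from msreport.reader import SpectronautReader

sn = SpectronautReader("path/to/spectronaut_results")
design = sn.import_design()      # matches a file whose name contains "conditionsetup"
proteins = sn.import_proteins()  # matches the default protein report file tag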

import_design

import_design(
    filename: Optional[str] = None,
    filetag: Optional[str] = None,
) -> DataFrame

Reads a ConditionSetup file and returns an experimental design table.

The following columns from the Spectronaut ConditionSetup file will be imported to the design table and renamed:

- Replicate -> Replicate
- Condition -> Experiment
- File Name -> Filename
- Run Label -> Run label

In addition, a "Sample" is added containing values from the Experiment and Replicate columns, separated by an underscore.

If neither filename nor filetag is specified, the default file tag "conditionsetup" is used to select a file from the data directory. If no file or multiple files match, an exception is thrown. The check for the presence of the file tag is not case sensitive.

Parameters:

Name Type Description Default
filename Optional[str]

Optional, allows specifying a specific file that will be imported.

None
filetag Optional[str]

Optional, can be used to select a file that contains the filetag as a substring, instead of specifying a filename.

None

Returns:

Type Description
DataFrame

A dataframe containing the processed design table.

Source code in msreport\reader.py
def import_design(
    self, filename: Optional[str] = None, filetag: Optional[str] = None
) -> pd.DataFrame:
    """Reads a ConditionSetup file and returns an experimental design table.

    The following columns from the Spectronaut ConditionSetup file will be imported
    to the design table and renamed:
        Replicate -> Replicate
        Condition -> Experiment
        File Name -> Filename
        Run Label -> Run label

    In addition, a "Sample" is added containing values from the Experiment and
    Replicate columns, separated by an underscore.

    If neither filename nor filetag is specified, the default file tag
    "conditionsetup" is used to select a file from the data directory. If no file
    or multiple files match, an exception is thrown. The check for the presence of
    the file tag is not case sensitive.

    Args:
        filename: Optional, allows specifying a specific file that will be imported.
        filetag: Optional, can be used to select a file that contains the filetag as
            a substring, instead of specifying a filename.

    Returns:
        A dataframe containing the processed design table.
    """
    filetag = self.filetags["design"] if filetag is None else filetag
    filenames = _find_matching_files(
        self.data_directory,
        filename=filename,
        filetag=filetag,
        extensions=["xls", "tsv", "csv"],
    )
    if len(filenames) == 0:
        raise FileNotFoundError("No matching file found.")
    elif len(filenames) > 1:
        exception_message_lines = [
            f"Multiple matching files found in: {self.data_directory}",
            "One of the report filenames must be specified manually:",
        ]
        exception_message_lines.extend(filenames)
        exception_message = "\n".join(exception_message_lines)
        raise ValueError(exception_message)
    else:
        filename = filenames[0]

    df = self._read_file(filename)
    df["Sample"] = df["Condition"].astype(str) + "_" + df["Replicate"].astype(str)
    df = pd.DataFrame(
        {
            "Sample": df["Sample"].astype(str),
            "Replicate": df["Replicate"].astype(str),
            "Experiment": df["Condition"].astype(str),
            "Filename": df["File Name"].astype(str),
            "Run label": df["Run Label"].astype(str),
        }
    )
    return df
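
A small sketch of the resulting design table (values are illustrative).

# 'sn' is the SpectronautReader instance from the sketch above.
design = sn.import_design()
# Columns: "Sample", "Replicate", "Experiment", "Filename", "Run label".
# For Condition="WT" and Replicate="2", the derived Sample is "WT_2".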

import_proteins

import_proteins(
    filename: Optional[str] = None,
    filetag: Optional[str] = None,
    rename_columns: bool = True,
    prefix_column_tags: bool = True,
    drop_protein_info: bool = True,
) -> DataFrame

Reads a Spectronaut protein report file and returns a processed DataFrame.

Adds four protein entry columns to comply with the MsReport convention: "Protein reported by software", "Leading proteins", "Representative protein", "Potential contaminant".

"Protein reported by software" and "Representative protein" contain the first entry from the "PG.ProteinAccessions" column, and "Leading proteins" contains all entries from this column. Multiple leading protein entries are separated by ";".

Several columns in the Spectronaut report file can contain information specific for the leading protein entry. If leading proteins will be re-sorted later, it is recommended to remove columns containing protein specific information by setting 'drop_protein_info=True'.

Parameters:

Name Type Description Default
filename Optional[str]

Optional, allows specifying a specific file that will be imported.

None
filetag Optional[str]

Optional, can be used to select a file that contains the filetag as a substring, instead of specifying a filename.

None
rename_columns bool

If True, columns are renamed according to the MsReport convention; default True.

True
prefix_column_tags bool

If True, column tags such as "Intensity" are added in front of the sample names, e.g. "Intensity sample_name". If False, column tags are added afterwards, e.g. "Sample_name Intensity"; default True.

True
drop_protein_info bool

If True, columns containing protein-specific information, such as "Gene" or "Protein Length", are removed. See SpectronautReader.protein_info_columns and SpectronautReader.protein_info_tags for a full list of columns that will be removed. Default True.

True

Returns:

Type Description
DataFrame

A dataframe containing the processed protein table.

Source code in msreport\reader.py
def import_proteins(
    self,
    filename: Optional[str] = None,
    filetag: Optional[str] = None,
    rename_columns: bool = True,
    prefix_column_tags: bool = True,
    drop_protein_info: bool = True,
) -> pd.DataFrame:
    """Reads a Spectronaut protein report file and returns a processed DataFrame.

    Adds four protein entry columns to comply with the MsReport convention:
    "Protein reported by software", "Leading proteins", "Representative protein",
    "Potential contaminant".

    "Protein reported by software" and "Representative protein" contain the first
    entry from the "PG.ProteinAccessions" column, and "Leading proteins" contains
    all entries from this column. Multiple leading protein entries are separated by
    ";".

    Several columns in the Spectronaut report file can contain information specific
    for the leading protein entry. If leading proteins will be re-sorted later, it
    is recommended to remove columns containing protein specific information by
    setting 'drop_protein_info=True'.

    Args:
        filename: Optional, allows specifying a specific file that will be imported.
        filetag: Optional, can be used to select a file that contains the filetag as
            a substring, instead of specifying a filename.
        rename_columns: If True, columns are renamed according to the MsReport
            convention; default True.
        prefix_column_tags: If True, column tags such as "Intensity" are added
            in front of the sample names, e.g. "Intensity sample_name". If False,
            column tags are added afterwards, e.g. "Sample_name Intensity"; default
            True.
        drop_protein_info: If True, columns containing protein-specific information,
            such as "Gene" or "Protein Length", are removed. See
            SpectronautReader.protein_info_columns and
            SpectronautReader.protein_info_tags for a full list of columns that will
            be removed. Default True.

    Returns:
        A dataframe containing the processed protein table.
    """
    filetag = self.filetags["proteins"] if filetag is None else filetag
    filenames = _find_matching_files(
        self.data_directory,
        filename=filename,
        filetag=filetag,
        extensions=["xls", "tsv", "csv"],
    )
    if len(filenames) == 0:
        raise FileNotFoundError("No matching file found.")
    elif len(filenames) > 1:
        exception_message_lines = [
            f"Multiple matching files found in: {self.data_directory}",
            "One of the report filenames must be specified manually:",
        ]
        exception_message_lines.extend(filenames)
        exception_message = "\n".join(exception_message_lines)
        raise ValueError(exception_message)
    else:
        filename = filenames[0]

    df = self._read_file(filename)
    df = self._tidy_up_sample_columns(df)
    df = self._add_protein_entries(df)
    if drop_protein_info:
        df = self._drop_columns(df, self.protein_info_columns)
        for tag in self.protein_info_tags:
            df = self._drop_columns_by_tag(df, tag)
    if rename_columns:
        df = self._rename_columns(df, prefix_column_tags)
    return df
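
Example (illustrative sketch, assuming a Spectronaut protein report in a hypothetical data directory):

from msreport.reader import SpectronautReader

reader = SpectronautReader("path/to/spectronaut_results")  # hypothetical directory

# Dropping protein-specific columns keeps the table consistent if leading
# proteins are later re-sorted with sort_leading_proteins().
proteins = reader.import_proteins(drop_protein_info=True)
print(proteins[["Representative protein", "Leading proteins"]].head())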

import_peptides

import_peptides(
    filename: Optional[str] = None,
    filetag: Optional[str] = None,
    rename_columns: bool = True,
    prefix_column_tags: bool = True,
) -> DataFrame

Reads a Spectronaut peptide report file and returns a processed DataFrame.

Uses and renames the following Spectronaut report columns: PG.ProteinAccessions, PEP.Quantity, PEP.StrippedSequence, and PEP.AllOccurringProteinAccessions

Adds four protein entry columns to comply with the MsReport convention: "Protein reported by software", "Leading proteins", "Representative protein", "Potential contaminant".

"Protein reported by software" and "Representative protein" contain the first entry from the "PG.ProteinAccessions" column, and "Leading proteins" contains all entries from this column. Multiple leading protein entries are separated by ";".

Parameters:

Name Type Description Default
filename Optional[str]

Optional, allows specifying a specific file that will be imported.

None
filetag Optional[str]

Optional, can be used to select a file that contains the filetag as a substring, instead of specifying a filename.

None
rename_columns bool

If True, columns are renamed according to the MsReport convention; default True.

True
prefix_column_tags bool

If True, column tags such as "Intensity" are added in front of the sample names, e.g. "Intensity sample_name". If False, column tags are added afterwards, e.g. "Sample_name Intensity"; default True.

True

Returns:

Type Description
DataFrame

A dataframe containing the processed peptide table.

Source code in msreport\reader.py
def import_peptides(
    self,
    filename: Optional[str] = None,
    filetag: Optional[str] = None,
    rename_columns: bool = True,
    prefix_column_tags: bool = True,
) -> pd.DataFrame:
    """Reads a Spectronaut peptide report file and returns a processed DataFrame.

    Uses and renames the following Spectronaut report columns:
    PG.ProteinAccessions, PEP.Quantity, PEP.StrippedSequence, and
    PEP.AllOccurringProteinAccessions

    Adds four protein entry columns to comply with the MsReport convention:
    "Protein reported by software", "Leading proteins", "Representative protein",
    "Potential contaminant".

    "Protein reported by software" and "Representative protein" contain the first
    entry from the "PG.ProteinAccessions" column, and "Leading proteins" contains
    all entries from this column. Multiple leading protein entries are separated by
    ";".

    Args:
        filename: Optional, allows specifying a specific file that will be imported.
        filetag: Optional, can be used to select a file that contains the filetag as
            a substring, instead of specifying a filename.
        rename_columns: If True, columns are renamed according to the MsReport
            convention; default True.
        prefix_column_tags: If True, column tags such as "Intensity" are added
            in front of the sample names, e.g. "Intensity sample_name". If False,
            column tags are added afterwards, e.g. "Sample_name Intensity"; default
            True.

    Returns:
        A dataframe containing the processed peptide table.
    """
    filenames = _find_matching_files(
        self.data_directory,
        filename=filename,
        filetag=filetag,
        extensions=["xls", "tsv", "csv"],
    )
    if len(filenames) == 0:
        raise FileNotFoundError("No matching file found.")
    elif len(filenames) > 1:
        exception_message_lines = [
            f"Multiple matching files found in: {self.data_directory}",
            "One of the report filenames must be specified manually:",
        ]
        exception_message_lines.extend(filenames)
        exception_message = "\n".join(exception_message_lines)
        raise ValueError(exception_message)
    else:
        filename = filenames[0]

    df = self._read_file(filename)
    df = self._tidy_up_sample_columns(df)
    df = self._add_protein_entries(df)
    if rename_columns:
        df = self._rename_columns(df, prefix_column_tags)
    return df

import_ion_evidence

import_ion_evidence(
    filename: Optional[str] = None,
    filetag: Optional[str] = None,
    rename_columns: bool = True,
    rewrite_modifications: bool = True,
) -> DataFrame

Reads an ion evidence file (long format) and returns a processed dataframe.

Adds new columns to comply with the MsReport convention. "Protein reported by software" and "Representative protein" both contain the first entry from "PG.ProteinAccessions". "Ion ID" contains unique entries for each ion, which are generated by concatenating the "Modified sequence" and "Charge" columns, and if present, the "Compensation voltage" column.

"Modified sequence" entries contain modifications within square brackets. "Modification" entries are strings in the form of "position:modification_tag", multiple modifications are joined by ";". An example for a modified sequence and a modification entry: "PEPT[Phospho]IDO[Oxidation]", "4:Phospho;7:Oxidation".

"Modification localization string" contains localization probabilities in the format "Mod1@Site1:Probability1,Site2:Probability2;Mod2@Site3:Probability3", e.g. "15.9949@11:1.000;79.9663@3:0.200,4:0.800". Refer to msreport.peptidoform.make_localization_string for details.

Parameters:

Name Type Description Default
filename Optional[str]

Optional, allows specifying a specific file that will be imported.

None
filetag Optional[str]

Optional, can be used to select a file that contains the filetag as a substring, instead of specifying a filename.

None
rename_columns bool

If True, columns are renamed according to the MsReport convention; default True.

True
rewrite_modifications bool

If True, the peptide format in "Modified sequence" is changed according to the MsReport convention, and a "Modifications" column is added that contains the amino acid positions of all modifications. Requires 'rename_columns' to be True. Default True.

True

Returns:

Type Description
DataFrame

A dataframe containing the processed ion table.

Source code in msreport\reader.py
def import_ion_evidence(
    self,
    filename: Optional[str] = None,
    filetag: Optional[str] = None,
    rename_columns: bool = True,
    rewrite_modifications: bool = True,
) -> pd.DataFrame:
    """Reads an ion evidence file (long format) and returns a processed dataframe.

    Adds new columns to comply with the MsReport convention. "Protein reported
    by software" and "Representative protein" both contain the first entry from
    "PG.ProteinAccessions". "Ion ID" contains unique entries for each ion, which are
    generated by concatenating the "Modified sequence" and "Charge" columns, and if
    present, the "Compensation voltage" column.

    "Modified sequence" entries contain modifications within square brackets.
    "Modification" entries are strings in the form of "position:modification_tag",
    multiple modifications are joined by ";". An example for a modified sequence and
    a modification entry: "PEPT[Phospho]IDO[Oxidation]", "4:Phospho;7:Oxidation".

    "Modification localization string" contains localization probabilities in the
    format "Mod1@Site1:Probability1,Site2:Probability2;Mod2@Site3:Probability3",
    e.g. "15.9949@11:1.000;79.9663@3:0.200,4:0.800". Refer to
    `msreport.peptidoform.make_localization_string` for details.

    Args:
        filename: Optional, allows specifying a specific file that will be imported.
        filetag: Optional, can be used to select a file that contains the filetag as
            a substring, instead of specifying a filename.
        rename_columns: If True, columns are renamed according to the MsReport
            convention; default True.
        rewrite_modifications: If True, the peptide format in "Modified sequence" is
            changed according to the MsReport convention, and a "Modifications" column
            is added that contains the amino acid positions of all modifications.
            Requires 'rename_columns' to be True. Default True.

    Returns:
        A dataframe containing the processed ion table.
    """
    filenames = _find_matching_files(
        self.data_directory,
        filename=filename,
        filetag=filetag,
        extensions=["xls", "tsv", "csv"],
    )
    if len(filenames) == 0:
        raise FileNotFoundError("No matching file found.")
    elif len(filenames) > 1:
        exception_message_lines = [
            f"Multiple matching files found in: {self.data_directory}",
            "One of the report filenames must be specified manually:",
        ]
        exception_message_lines.extend(filenames)
        exception_message = "\n".join(exception_message_lines)
        raise ValueError(exception_message)
    else:
        filename = filenames[0]
    df = self._read_file(filename)
    df = self._tidy_up_sample_columns(df)
    df = self._add_protein_entries(df)
    if rename_columns:
        df = self._rename_columns(df, True)
    if rewrite_modifications and rename_columns:
        df = self._add_peptide_modification_entries(df)
        df = self._add_modification_localization_string(df)
        df["Ion ID"] = df["Modified sequence"] + "_c" + df["Charge"].astype(str)
        if "Compensation voltage" in df.columns:
            _cv = df["Compensation voltage"].astype(str)
            df["Ion ID"] = df["Ion ID"] + "_cv" + _cv

    return df
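
Example (illustrative sketch, assuming a long-format ion evidence report in a hypothetical data directory):

from msreport.reader import SpectronautReader

reader = SpectronautReader("path/to/spectronaut_results")  # hypothetical directory
ions = reader.import_ion_evidence()

# With the default options, each row receives an "Ion ID" such as
# "PEPT[Phospho]IDK_c2" (modified sequence plus charge, plus the compensation
# voltage when that column is present).
print(ions[["Ion ID", "Modifications", "Modification localization string"]].head())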

sort_leading_proteins

sort_leading_proteins(
    table: DataFrame,
    alphanumeric: bool = True,
    penalize_contaminants: bool = True,
    special_proteins: Optional[list[str]] = None,
    database_order: Optional[list[str]] = None,
) -> DataFrame

Returns a copy of 'table' with sorted leading proteins.

"Leading proteins" are sorted according to the selected options. The first entry of the sorted leading proteins is selected as the new "Representative protein". If the columns are present, also the entries of "Leading proteins database origin" and "Leading potential contaminants" are reordered, and "Potential contaminant" is reassigned according to the representative protein.

Additional protein annotation columns that refer to a representative protein that has been changed will no longer be valid. It is therefore recommended to remove all columns containing protein-specific information by enabling 'drop_protein_info' during the import of protein tables, or to update protein annotation columns if possible.

Parameters:

Name Type Description Default
table DataFrame

Dataframe in which "Leading proteins" will be sorted.

required
alphanumeric bool

If True, protein entries are sorted alphanumerically.

True
penalize_contaminants bool

If True, protein contaminants are sorted to the back.

True
special_proteins Optional[list[str]]

Optional, allows specifying a list of protein IDs that will always be sorted to the beginning.

None
database_order Optional[list[str]]

Optional, allows specifying an order of protein databases that will be considered for sorting. Database names that are not present in 'database_order' are sorted to the end. The protein database of a fasta entry is written in the very beginning of the fasta header, e.g. "sp" from the fasta header ">sp|P60709|ACTB_HUMAN Actin".

None

Returns:

Type Description
DataFrame

A copy of the 'table', containing sorted leading protein entries.

Source code in msreport\reader.py
def sort_leading_proteins(
    table: pd.DataFrame,
    alphanumeric: bool = True,
    penalize_contaminants: bool = True,
    special_proteins: Optional[list[str]] = None,
    database_order: Optional[list[str]] = None,
) -> pd.DataFrame:
    """Returns a copy of 'table' with sorted leading proteins.

    "Leading proteins" are sorted according to the selected options. The first entry
    of the sorted leading proteins is selected as the new "Representative protein". If
    the columns are present, the entries of "Leading proteins database origin" and
    "Leading potential contaminants" are also reordered, and "Potential contaminant" is
    reassigned according to the representative protein.

    Additional protein annotation columns, referring to a representative protein that has
    been changed, will no longer be valid. It is therefore recommended to remove all
    columns containing protein specific information by enabling 'drop_protein_info'
    during the import of protein tables or to update protein annotation columns if
    possible.

    Args:
        table: Dataframe in which "Leading proteins" will be sorted.
        alphanumeric: If True, protein entries are sorted alphanumerically.
        penalize_contaminants: If True, protein contaminants are sorted to the back.
        special_proteins: Optional, allows specifying a list of protein IDs that
            will always be sorted to the beginning.
        database_order: Optional, allows specifying an order of protein databases that
            will be considered for sorting. Database names that are not present in
            'database_order' are sorted to the end. The protein database of a fasta
            entry is written in the very beginning of the fasta header, e.g. "sp" from
            the fasta header ">sp|P60709|ACTB_HUMAN Actin".

    Returns:
        A copy of the 'table', containing sorted leading protein entries.
    """
    sorted_entries = defaultdict(list)
    contaminants_present = "Leading potential contaminants" in table
    db_origins_present = "Leading proteins database origin" in table

    if database_order is not None:
        database_encoding: dict[str, int] = defaultdict(lambda: 999)
        database_encoding.update({db: i for i, db in enumerate(database_order)})
    if penalize_contaminants is not None:
        contaminant_encoding = {"False": 0, "True": 1, False: 0, True: 1}

    for _, row in table.iterrows():
        protein_ids = row["Leading proteins"].split(";")

        sorting_info: list[list] = [[] for _ in protein_ids]
        if special_proteins is not None:
            for i, _id in enumerate(protein_ids):
                sorting_info[i].append(_id not in special_proteins)
        if penalize_contaminants:
            for i, is_contaminant in enumerate(
                row["Leading potential contaminants"].split(";")
            ):
                sorting_info[i].append(contaminant_encoding[is_contaminant])
        if database_order is not None:
            for i, db_origin in enumerate(
                row["Leading proteins database origin"].split(";")
            ):
                sorting_info[i].append(database_encoding[db_origin])
        if alphanumeric:
            for i, _id in enumerate(protein_ids):
                sorting_info[i].append(_id)
        sorting_order = [
            i[0] for i in sorted(enumerate(sorting_info), key=lambda x: x[1])
        ]

        protein_ids = [protein_ids[i] for i in sorting_order]
        sorted_entries["Representative protein"].append(protein_ids[0])
        sorted_entries["Leading proteins"].append(";".join(protein_ids))

        if contaminants_present:
            contaminants = row["Leading potential contaminants"].split(";")
            contaminants = [contaminants[i] for i in sorting_order]
            potential_contaminant = contaminants[0] == "True"
            contaminants = ";".join(contaminants)
            sorted_entries["Potential contaminant"].append(potential_contaminant)
            sorted_entries["Leading potential contaminants"].append(contaminants)

        if db_origins_present:
            db_origins = row["Leading proteins database origin"].split(";")
            db_origins = ";".join([db_origins[i] for i in sorting_order])
            sorted_entries["Leading proteins database origin"].append(db_origins)

    sorted_table = table.copy()
    for key in sorted_entries:
        sorted_table[key] = sorted_entries[key]
    return sorted_table
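
Example (self-contained sketch with made-up accessions; contaminant penalization is disabled here because the toy table has no "Leading potential contaminants" column):

import pandas as pd
from msreport.reader import sort_leading_proteins

table = pd.DataFrame({"Leading proteins": ["Q9Y6K9;P04637", "P60709"]})
sorted_table = sort_leading_proteins(
    table, alphanumeric=True, penalize_contaminants=False
)
# Alphanumeric sorting moves P04637 to the front of the first entry, so it
# becomes the new "Representative protein" for that row.
print(sorted_table[["Representative protein", "Leading proteins"]])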

add_protein_annotation

add_protein_annotation(
    table: DataFrame,
    protein_db: ProteinDatabase,
    id_column: str = "Representative protein",
    gene_name: bool = False,
    protein_name: bool = False,
    protein_entry: bool = False,
    protein_length: bool = False,
    molecular_weight: bool = False,
    fasta_header: bool = False,
    ibaq_peptides: bool = False,
    database_origin: bool = False,
) -> DataFrame

Uses a FASTA protein database to add protein annotation columns.

Parameters:

Name Type Description Default
table DataFrame

Dataframe to which the protein annotations are added.

required
protein_db ProteinDatabase

A protein database containing entries from one or multiple FASTA files.

required
id_column str

Column in 'table' that contains protein uniprot IDs, which will be used to look up entries in the 'protein_db'.

'Representative protein'
gene_name bool

If True, adds a "Gene name" column.

False
protein_name bool

If True, adds "Protein name" column.

False
protein_entry bool

If True, adds "Protein entry name" column.

False
protein_length bool

If True, adds a "Protein length" column.

False
molecular_weight bool

If True, adds a "Molecular weight [kDa]" column. The molecular weight is calculated as the monoisotopic mass in kilo Dalton, rounded to two decimal places. Note that there is an opinionated behaviour for non-standard amino acids code. "O" is Pyrrolysine, "U" is Selenocysteine, "B" is treated as "N", "Z" is treated as "Q", and "X" is ignored.

False
fasta_header bool

If True, adds a "Fasta header" column.

False
ibaq_peptides bool

If True, adds a "iBAQ peptides" columns. The number of iBAQ peptides is calculated as the theoretical number of tryptic peptides with a length between 7 and 30.

False
database_origin bool

If True, adds a "Database origin" column.

False

Returns:

Type Description
DataFrame

The updated 'table' dataframe.

Source code in msreport\reader.py
def add_protein_annotation(
    table: pd.DataFrame,
    protein_db: ProteinDatabase,
    id_column: str = "Representative protein",
    gene_name: bool = False,
    protein_name: bool = False,
    protein_entry: bool = False,
    protein_length: bool = False,
    molecular_weight: bool = False,
    fasta_header: bool = False,
    ibaq_peptides: bool = False,
    database_origin: bool = False,
) -> pd.DataFrame:
    """Uses a FASTA protein database to add protein annotation columns.

    Args:
        table: Dataframe to which the protein annotations are added.
        protein_db: A protein database containing entries from one or multiple FASTA
            files.
        id_column: Column in 'table' that contains protein uniprot IDs, which will be
            used to look up entries in the 'protein_db'.
        gene_name: If True, adds a "Gene name" column.
        protein_name: If True, adds "Protein name" column.
        protein_entry: If True, adds "Protein entry name" column.
        protein_length: If True, adds a "Protein length" column.
        molecular_weight: If True, adds a "Molecular weight [kDa]" column. The molecular
            weight is calculated as the monoisotopic mass in kilodalton, rounded to two
            decimal places. Note that there is an opinionated behaviour for non-standard
            amino acid codes: "O" is Pyrrolysine, "U" is Selenocysteine, "B" is treated
            as "N", "Z" is treated as "Q", and "X" is ignored.
        fasta_header: If True, adds a "Fasta header" column.
        ibaq_peptides: If True, adds an "iBAQ peptides" column. The number of iBAQ
            peptides is calculated as the theoretical number of tryptic peptides with
            a length between 7 and 30.
        database_origin: If True, adds a "Database origin" column.

    Returns:
        The updated 'table' dataframe.
    """
    # not tested #
    proteins = table[id_column].to_list()

    proteins_not_in_db = []
    for protein_id in proteins:
        if protein_id not in protein_db:
            proteins_not_in_db.append(protein_id)
    if proteins_not_in_db:
        warnings.warn(
            f"Some proteins could not be annotated: {repr(proteins_not_in_db)}",
            ProteinsNotInFastaWarning,
            stacklevel=2,
        )

    annotations = {}
    if gene_name:
        annotations["Gene name"] = _create_protein_annotations_from_db(
            proteins, protein_db, _get_annotation_gene_name, ""
        )
    if protein_name:
        annotations["Protein name"] = _create_protein_annotations_from_db(
            proteins, protein_db, _get_annotation_protein_name, ""
        )
    if protein_entry:
        annotations["Protein entry name"] = _create_protein_annotations_from_db(
            proteins, protein_db, _get_annotation_protein_entry_name, ""
        )
    if protein_length:
        annotations["Protein length"] = _create_protein_annotations_from_db(
            proteins, protein_db, _get_annotation_sequence_length, -1
        )
    if molecular_weight:
        annotations["Molecular weight [kDa]"] = _create_protein_annotations_from_db(
            proteins, protein_db, _get_annotation_molecular_weight, np.nan
        )
    if fasta_header:
        annotations["Fasta header"] = _create_protein_annotations_from_db(
            proteins, protein_db, _get_annotation_fasta_header, ""
        )
    if database_origin:
        annotations["Database origin"] = _create_protein_annotations_from_db(
            proteins, protein_db, _get_annotation_db_origin, ""
        )
    if ibaq_peptides:
        annotations["iBAQ peptides"] = _create_protein_annotations_from_db(
            proteins, protein_db, _get_annotation_ibaq_peptides, -1
        )
    for column in annotations.keys():
        table[column] = annotations[column]
    return table
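
Usage sketch: 'protein_db' is assumed to be a ProteinDatabase built from the search FASTA file elsewhere, and 'proteins' an imported protein table; neither is created here.

from msreport.reader import add_protein_annotation

proteins = add_protein_annotation(
    proteins,
    protein_db,
    gene_name=True,
    protein_length=True,
    ibaq_peptides=True,
)
# Adds "Gene name", "Protein length" and "iBAQ peptides" columns; proteins that
# are missing from the FASTA file trigger a ProteinsNotInFastaWarning.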

add_protein_site_annotation

add_protein_site_annotation(
    table: DataFrame,
    protein_db: ProteinDatabase,
    protein_column: str = "Representative protein",
    site_column: str = "Protein site",
) -> DataFrame

Uses a FASTA protein database to add protein site annotation columns.

Adds the columns "Modified residue", which corresponds to the amino acid at the protein site position, and "Sequence window", which contains sequence windows of eleven amino acids surrounding the protein site. Sequence windows are centered on the respective protein site; missing amino acids due to the position being close to the beginning or end of the protein sequence are substituted with "-".

Parameters:

Name Type Description Default
table DataFrame

Dataframe to which the protein site annotations are added.

required
protein_db ProteinDatabase

A protein database containing entries from one or multiple FASTA files.

required
protein_column str

Column in 'table' that contains protein identifiers, which will be used to look up entries in the 'protein_db'.

'Representative protein'
site_column str

Column in 'table' that contains protein sites, which will be used to extract information from the protein sequence. Protein sites are one-indexed, meaning the first amino acid of the protein is position 1.

'Protein site'

Returns:

Type Description
DataFrame

The updated 'table' dataframe.

Source code in msreport\reader.py
def add_protein_site_annotation(
    table: pd.DataFrame,
    protein_db: ProteinDatabase,
    protein_column: str = "Representative protein",
    site_column: str = "Protein site",
) -> pd.DataFrame:
    """Uses a FASTA protein database to add protein site annotation columns.

    Adds the columns "Modified residue", which corresponds to the amino acid at the
    protein site position, and "Sequence window", which contains sequence windows of
    eleven amino acids surrounding the protein site. Sequence windows are centered on
    the respective protein site; missing amino acids due to the position being close to
    the beginning or end of the protein sequence are substituted with "-".

    Args:
        table: Dataframe to which the protein site annotations are added.
        protein_db: A protein database containing entries from one or multiple FASTA
            files.
        protein_column: Column in 'table' that contains protein identifiers, which will
            be used to look up entries in the 'protein_db'.
        site_column: Column in 'table' that contains protein sites, which will be used
            to extract information from the protein sequence. Protein sites are
            one-indexed, meaning the first amino acid of the protein is position 1.

    Returns:
        The updated 'table' dataframe.
    """
    # TODO not tested
    proteins = table[protein_column].to_list()
    proteins_not_in_db = []
    for protein_id in proteins:
        if protein_id not in protein_db:
            proteins_not_in_db.append(protein_id)
    if proteins_not_in_db:
        warnings.warn(
            f"Some proteins could not be annotated: {repr(proteins_not_in_db)}",
            ProteinsNotInFastaWarning,
            stacklevel=2,
        )

    annotations: dict[str, list[str]] = {
        "Modified residue": [],
        "Sequence window": [],
    }
    for protein, site in zip(table[protein_column], table[site_column]):
        protein_sequence = protein_db[protein].sequence

        modified_residue = protein_sequence[site - 1]
        annotations["Modified residue"].append(modified_residue)

        sequence_window = extract_window_around_position(protein_sequence, site)
        annotations["Sequence window"].append(sequence_window)

    for column, annotation_values in annotations.items():
        table[column] = annotation_values
    return table

add_leading_proteins_annotation

add_leading_proteins_annotation(
    table: DataFrame,
    protein_db: ProteinDatabase,
    id_column: str = "Leading proteins",
    gene_name: bool = False,
    protein_entry: bool = False,
    protein_length: bool = False,
    fasta_header: bool = False,
    ibaq_peptides: bool = False,
    database_origin: bool = False,
) -> DataFrame

Uses a FASTA protein database to add leading protein annotation columns.

Generates protein annotations for multi-protein entries, where each entry can contain one or multiple protein IDs; multiple protein IDs are separated by ";".

Parameters:

Name Type Description Default
table DataFrame

Dataframe to which the protein annotations are added.

required
protein_db ProteinDatabase

A protein database containing entries from one or multiple FASTA files.

required
id_column str

Column in 'table' that contains leading protein uniprot IDs, which will be used to look up entries in the 'protein_db'.

'Leading proteins'
gene_name bool

If True, adds a "Leading proteins gene name" column.

False
protein_entry bool

If True, adds "Leading proteins entry name" column.

False
protein_length bool

If True, adds a "Leading proteins length" column.

False
fasta_header bool

If True, adds a "Leading proteins fasta header" column.

False
ibaq_peptides bool

If True, adds a "Leading proteins iBAQ peptides" columns. The number of iBAQ peptides is calculated as the theoretical number of tryptic peptides with a length between 7 and 30.

False
database_origin bool

If True, adds a "Leading proteins database origin" column.

False

Returns:

Type Description
DataFrame

The updated 'table' dataframe.

Source code in msreport\reader.py
def add_leading_proteins_annotation(
    table: pd.DataFrame,
    protein_db: ProteinDatabase,
    id_column: str = "Leading proteins",
    gene_name: bool = False,
    protein_entry: bool = False,
    protein_length: bool = False,
    fasta_header: bool = False,
    ibaq_peptides: bool = False,
    database_origin: bool = False,
) -> pd.DataFrame:
    """Uses a FASTA protein database to add leading protein annotation columns.

    Generates protein annotations for multi-protein entries, where each entry can
    contain one or multiple protein IDs; multiple protein IDs are separated by ";".

    Args:
        table: Dataframe to which the protein annotations are added.
        protein_db: A protein database containing entries from one or multiple FASTA
            files.
        id_column: Column in 'table' that contains leading protein uniprot IDs, which
            will be used to look up entries in the 'protein_db'.
        gene_name: If True, adds a "Leading proteins gene name" column.
        protein_entry: If True, adds "Leading proteins entry name" column.
        protein_length: If True, adds a "Leading proteins length" column.
        fasta_header: If True, adds a "Leading proteins fasta header" column.
        ibaq_peptides: If True, adds a "Leading proteins iBAQ peptides" columns. The
            number of iBAQ peptides is calculated as the theoretical number of tryptic
            peptides with a length between 7 and 30.
        database_origin: If True, adds a "Leading proteins database origin" column.

    Returns:
        The updated 'table' dataframe.
    """
    # not tested #
    leading_protein_entries = table[id_column].to_list()

    proteins_not_in_db = []
    for leading_entry in leading_protein_entries:
        for protein_id in leading_entry.split(";"):
            if protein_id not in protein_db:
                proteins_not_in_db.append(protein_id)
    if proteins_not_in_db:
        warnings.warn(
            f"Some proteins could not be annotated: {repr(proteins_not_in_db)}",
            ProteinsNotInFastaWarning,
            stacklevel=2,
        )

    annotations = {}
    if gene_name:
        annotation = _create_multi_protein_annotations_from_db(
            leading_protein_entries, protein_db, _get_annotation_gene_name
        )
        annotations["Leading proteins gene name"] = annotation
    if protein_entry:
        annotation = _create_multi_protein_annotations_from_db(
            leading_protein_entries, protein_db, _get_annotation_protein_entry_name
        )
        annotations["Leading proteins entry name"] = annotation
    if protein_length:
        annotation = _create_multi_protein_annotations_from_db(
            leading_protein_entries, protein_db, _get_annotation_sequence_length
        )
        annotations["Leading proteins length"] = annotation
    if fasta_header:
        annotation = _create_multi_protein_annotations_from_db(
            leading_protein_entries, protein_db, _get_annotation_fasta_header
        )
        annotations["Leading proteins fasta header"] = annotation
    if ibaq_peptides:
        annotation = _create_multi_protein_annotations_from_db(
            leading_protein_entries, protein_db, _get_annotation_ibaq_peptides
        )
        annotations["Leading proteins iBAQ peptides"] = annotation
    if database_origin:
        annotation = _create_multi_protein_annotations_from_db(
            leading_protein_entries, protein_db, _get_annotation_db_origin
        )
        annotations["Leading proteins database origin"] = annotation
    for column in annotations.keys():
        table[column] = annotations[column]
    return table

add_protein_site_identifiers

add_protein_site_identifiers(
    table: DataFrame,
    protein_db: ProteinDatabase,
    site_column: str,
    protein_name_column: str,
)

Adds a "Protein site identifier" column to the 'table'.

The "Protein site identifier" is generated by concatenating the protein name with the amino acid and position of the protein site or sites, e.g. "P12345 - S123" or "P12345 - S123 / T125". The amino acid is extracted from the protein sequence at the position of the site. If the protein name is not available, the "Representative protein" entry is used instead.

Parameters:

Name Type Description Default
table DataFrame

Dataframe to which the protein site identifiers are added.

required
protein_db ProteinDatabase

A protein database containing entries from one or multiple FASTA files. Protein identifiers in the 'table' column "Representative protein" are used to look up entries in the 'protein_db'.

required
site_column str

Column in 'table' that contains protein site positions. Positions are one-indexed, meaning the first amino acid of the protein is position 1. Multiple sites in a single entry should be separated by ";".

required
protein_name_column str

Column in 'table' that contains protein names, which will be used to generate the identifier. If no name is available, the accession is used instead.

required

Raises:

Type Description
ValueError

If the "Representative protein", 'protein_name_column' or 'site_column' is not found in the 'table'.

Source code in msreport\reader.py
def add_protein_site_identifiers(
    table: pd.DataFrame,
    protein_db: ProteinDatabase,
    site_column: str,
    protein_name_column: str,
):
    """Adds a "Protein site identifier" column to the 'table'.

    The "Protein site identifier" is generated by concatenating the protein name
    with the amino acid and position of the protein site or sites, e.g. "P12345 - S123"
    or "P12345 - S123 / T125". The amino acid is extracted from the protein sequence at
    the position of the site. If the protein name is not available, the
    "Representative protein" entry is used instead.

    Args:
        table: Dataframe to which the protein site identifiers are added.
        protein_db: A protein database containing entries from one or multiple FASTA
            files. Protein identifiers in the 'table' column "Representative protein"
            are used to look up entries in the 'protein_db'.
        site_column: Column in 'table' that contains protein site positions. Positions
            are one-indexed, meaning the first amino acid of the protein is position 1.
            Multiple sites in a single entry should be separated by ";".
        protein_name_column: Column in 'table' that contains protein names, which will
            be used to generate the identifier. If no name is available, the accession
            is used instead.

    Raises:
        ValueError: If the "Representative protein", 'protein_name_column' or
            'site_column' is not found in the 'table'.
    """
    if site_column not in table.columns:
        raise ValueError(f"Column '{site_column}' not found in the table.")
    if protein_name_column not in table.columns:
        raise ValueError(f"Column '{protein_name_column}' not found in the table.")
    if "Representative protein" not in table.columns:
        raise ValueError("Column 'Representative protein' not found in the table.")

    site_identifiers = []
    for accession, sites, name in zip(
        table["Representative protein"],
        table[site_column].astype(str),
        table[protein_name_column],
    ):
        protein_sequence = protein_db[accession].sequence
        protein_identifier = name if name else accession
        aa_sites = []
        for site in sites.split(";"):
            aa = protein_sequence[int(site) - 1]
            aa_sites.append(f"{aa}{site}")
        aa_site_tag = " / ".join(aa_sites)
        site_identifier = f"{protein_identifier} - {aa_site_tag}"
        site_identifiers.append(site_identifier)
    table["Protein site identifier"] = site_identifiers

add_sequence_coverage

add_sequence_coverage(
    protein_table: DataFrame,
    peptide_table: DataFrame,
    id_column: str = "Protein reported by software",
) -> None

Calculates "Sequence coverage" and adds a new column to the 'protein_table'.

Sequence coverage is represented as a percentage, with values ranging from 0 to 100. Requires the columns "Start position" and "End position" in the 'peptide_table', and "Protein length" in the 'protein_table'. For protein entries where the sequence coverage cannot be calculated, a value of NaN is added.

Parameters:

Name Type Description Default
protein_table DataFrame

Dataframe to which the "Sequence coverage" column is added.

required
peptide_table DataFrame

Dataframe which contains peptide information required for calculation of the protein sequence coverage.

required
id_column str

Column used to match entries between the 'protein_table' and the 'peptide_table', must be present in both tables. Default "Protein reported by software".

'Protein reported by software'
Source code in msreport\reader.py
def add_sequence_coverage(
    protein_table: pd.DataFrame,
    peptide_table: pd.DataFrame,
    id_column: str = "Protein reported by software",
) -> None:
    """Calculates "Sequence coverage" and adds a new column to the 'protein_table'.

    Sequence coverage is represented as a percentage, with values ranging from 0 to 100.
    Requires the columns "Start position" and "End position" in the 'peptide_table', and
    "Protein length" in the 'protein_table'. For protein entries where the sequence
    coverage cannot be calculated, a value of NaN is added.

    Args:
        protein_table: Dataframe to which the "Sequence coverage" column is added.
        peptide_table: Dataframe which contains peptide information required for
            calculation of the protein sequence coverage.
        id_column: Column used to match entries between the 'protein_table' and the
            'peptide_table', must be present in both tables. Default
            "Protein reported by software".
    """
    peptide_positions = {}
    for protein_id, peptide_group in peptide_table.groupby(by=id_column):
        positions = list(
            zip(peptide_group["Start position"], peptide_group["End position"])
        )
        peptide_positions[protein_id] = sorted(positions)

    sequence_coverages = []
    for protein_id, protein_length in zip(
        protein_table[id_column], protein_table["Protein length"]
    ):
        can_calculate_coverage = True
        if protein_id not in peptide_positions:
            can_calculate_coverage = False
        if protein_length < 1:
            can_calculate_coverage = False
        try:
            protein_length = int(protein_length)
        except ValueError:
            can_calculate_coverage = False

        if can_calculate_coverage:
            sequence_coverage = helper.calculate_sequence_coverage(
                protein_length, peptide_positions[protein_id], ndigits=1
            )
        else:
            sequence_coverage = np.nan
        sequence_coverages.append(sequence_coverage)
    protein_table["Sequence coverage"] = sequence_coverages

add_ibaq_intensities

add_ibaq_intensities(
    table: DataFrame,
    normalize: bool = True,
    ibaq_peptide_column: str = "iBAQ peptides",
    intensity_tag: str = "Intensity",
    ibaq_tag: str = "iBAQ intensity",
) -> None

Adds iBAQ intensity columns to the 'table'.

Requires a column containing the theoretical number of iBAQ peptides.

Parameters:

Name Type Description Default
table DataFrame

Dataframe to which the iBAQ intensity columns are added.

required
normalize bool

Scales iBAQ intensities per sample so that the sum of all iBAQ intensities is equal to the sum of all Intensities.

True
ibaq_peptide_column str

Column in 'table' containing the number of iBAQ peptides. No iBAQ intensity is calculated for rows with negative values or zero in the ibaq_peptide_column.

'iBAQ peptides'
intensity_tag str

Substring used to identify intensity columns from the 'table' that are used to calculate iBAQ intensities.

'Intensity'
ibaq_tag str

Substring used for naming the new 'table' columns containing the calculated iBAQ intensities. The column names are generated by replacing the 'intensity_tag' with the 'ibaq_tag'.

'iBAQ intensity'
Source code in msreport\reader.py
def add_ibaq_intensities(
    table: pd.DataFrame,
    normalize: bool = True,
    ibaq_peptide_column: str = "iBAQ peptides",
    intensity_tag: str = "Intensity",
    ibaq_tag: str = "iBAQ intensity",
) -> None:
    """Adds iBAQ intensity columns to the 'table'.

    Requires a column containing the theoretical number of iBAQ peptides.

    Args:
        table: Dataframe to which the iBAQ intensity columns are added.
        normalize: Scales iBAQ intensities per sample so that the sum of all iBAQ
            intensities is equal to the sum of all Intensities.
        ibaq_peptide_column: Column in 'table' containing the number of iBAQ peptides.
            No iBAQ intensity is calculated for rows with negative values or zero in the
            ibaq_peptide_column.
        intensity_tag: Substring used to identify intensity columns from the 'table'
            that are used to calculate iBAQ intensities.
        ibaq_tag: Substring used for naming the new 'table' columns containing the
            calculated iBAQ intensities. The column names are generated by replacing
            the 'intensity_tag' with the 'ibaq_tag'.
    """
    for intensity_column in helper.find_columns(table, intensity_tag):
        ibaq_column = intensity_column.replace(intensity_tag, ibaq_tag)
        valid = table[ibaq_peptide_column] > 0

        table[ibaq_column] = np.nan
        table.loc[valid, ibaq_column] = (
            table.loc[valid, intensity_column] / table.loc[valid, ibaq_peptide_column]
        )

        if normalize:
            total_intensity = table.loc[valid, intensity_column].sum()
            total_ibaq = table.loc[valid, ibaq_column].sum()
            factor = total_intensity / total_ibaq
            table.loc[valid, ibaq_column] = table.loc[valid, ibaq_column] * factor

add_peptide_positions

add_peptide_positions(
    table: DataFrame,
    protein_db: ProteinDatabase,
    peptide_column: str = "Peptide sequence",
    protein_column: str = "Representative protein",
) -> None

Adds peptide "Start position" and "End position" positions to the table.

For entries where the protein is absent from the FASTA or the peptide sequence could not be matched to the protein sequence, start and end positions of -1 are added.

Parameters:

Name Type Description Default
table DataFrame

Dataframe to which the protein annotations are added.

required
protein_db ProteinDatabase

A protein database containing entries from one or multiple FASTA files.

required
peptide_column str

Column in 'table' that contains the peptide sequence. Peptide sequences must only contain amino acids and no other symbols.

'Peptide sequence'
protein_column str

Column in 'table' that contains protein IDs that are used to find matching entries in the FASTA files.

'Representative protein'
Source code in msreport\reader.py
def add_peptide_positions(
    table: pd.DataFrame,
    protein_db: ProteinDatabase,
    peptide_column: str = "Peptide sequence",
    protein_column: str = "Representative protein",
) -> None:
    """Adds peptide "Start position" and "End position" positions to the table.

    For entries where the protein is absent from the FASTA or the peptide sequence
    could not be matched to the protein sequence, start and end positions of -1 are
    added.

    Args:
        table: Dataframe to which the protein annotations are added.
        protein_db: A protein database containing entries from one or multiple FASTA
            files.
        peptide_column: Column in 'table' that contains the peptide sequence. Peptide
            sequences must only contain amino acids and no other symbols.
        protein_column: Column in 'table' that contains protein IDs that are used to
            find matching entries in the FASTA files.
    """
    # not tested #
    peptide_positions: dict[str, list[int]] = {"Start position": [], "End position": []}
    proteins_not_in_db = []
    for peptide, protein_id in zip(table[peptide_column], table[protein_column]):
        if protein_id in protein_db:
            sequence = protein_db[protein_id].sequence
            start = sequence.find(peptide) + 1
            end = start + len(peptide) - 1
            if start == 0:
                start, end = -1, -1
        else:
            proteins_not_in_db.append(protein_id)
            start, end = -1, -1
        peptide_positions["Start position"].append(start)
        peptide_positions["End position"].append(end)

    for key in peptide_positions:
        table[key] = peptide_positions[key]

    if proteins_not_in_db:
        warnings.warn(
            f"Some peptides could not be annotated: {repr(proteins_not_in_db)}",
            ProteinsNotInFastaWarning,
            stacklevel=2,
        )
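
Example (self-contained sketch): the protein database is a minimal, hypothetical stand-in providing only the membership test and ".sequence" attribute used by this function; the sequence and peptides are made up.

from dataclasses import dataclass
import pandas as pd
from msreport.reader import add_peptide_positions

@dataclass
class FakeProtein:  # hypothetical stand-in for a ProteinDatabase entry
    sequence: str

protein_db = {"P12345": FakeProtein("MKTAYIAKQRQISFVKSHFSR")}
table = pd.DataFrame(
    {
        "Peptide sequence": ["QRQISFVK", "WWWWWWW"],
        "Representative protein": ["P12345", "P12345"],
    }
)
add_peptide_positions(table, protein_db)
# "QRQISFVK" starts at position 9 and ends at position 16 of the toy sequence;
# the second peptide does not match and receives -1 for both positions.
print(table[["Start position", "End position"]].values.tolist())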

add_protein_modifications

add_protein_modifications(table: DataFrame)

Adds a "Protein sites" column.

To generate the "Protein modifications" the positions from the "Modifications" column are increase according to the peptide positions ("Start position"] column).

Parameters:

Name Type Description Default
table DataFrame

Dataframe to which the "Protein modifications" column is added.

required
Source code in msreport\reader.py
def add_protein_modifications(table: pd.DataFrame):
    """Adds a "Protein sites" column.

    To generate the "Protein modifications" the positions from the "Modifications"
    column are increase according to the peptide positions ("Start position"] column).

    Args:
        table: Dataframe to which the "Protein modifications" column is added.
    """
    protein_modification_entries = []
    for mod_entry, start_pos in zip(table["Modifications"], table["Start position"]):
        if mod_entry:
            protein_mods = []
            for peptide_site, mod in [m.split(":") for m in mod_entry.split(";")]:
                protein_site = int(peptide_site) + start_pos - 1
                protein_mods.append([str(protein_site), mod])
            protein_mod_string = ";".join([f"{pos}:{mod}" for pos, mod in protein_mods])
        else:
            protein_mod_string = ""
        protein_modification_entries.append(protein_mod_string)
    table["Protein modifications"] = protein_modification_entries

propagate_representative_protein

propagate_representative_protein(
    target_table: DataFrame, source_table: DataFrame
) -> None

Propagates "Representative protein" column from the source to the target table.

The column "Protein reported by software" is used to match entries between the two tables. Then entries from "Representative protein" are propagated from the 'source_table' to matching rows in the 'target_table'.

Parameters:

Name Type Description Default
target_table DataFrame

Dataframe to which "Representative protein" entries will be added.

required
source_table DataFrame

Dataframe from which "Representative protein" entries are propagated.

required
Source code in msreport\reader.py
def propagate_representative_protein(
    target_table: pd.DataFrame, source_table: pd.DataFrame
) -> None:
    """Propagates "Representative protein" column from the source to the target table.

    The column "Protein reported by software" is used to match entries between the two
    tables. Then entries from "Representative protein" are propagated from the
    'source_table' to matching rows in the 'target_table'.

    Args:
        target_table: Dataframe to which "Representative protein" entries will be added.
        source_table: Dataframe from which "Representative protein" entries are
            propagated.
    """
    # not tested #
    protein_lookup = {}
    for old, new in zip(
        source_table["Protein reported by software"],
        source_table["Representative protein"],
    ):
        protein_lookup[old] = new

    new_protein_ids = []
    for old in target_table["Protein reported by software"]:
        new_protein_ids.append(protein_lookup[old] if old in protein_lookup else old)
    target_table["Representative protein"] = new_protein_ids

extract_sample_names

extract_sample_names(df: DataFrame, tag: str) -> list[str]

Extracts sample names from columns containing the 'tag' substring.

Sample names are extracted from column names containing the 'tag' string, by splitting the column name with the 'tag', and removing all trailing and leading white spaces from the resulting strings.

Parameters:

Name Type Description Default
df DataFrame

Column names from this dataframe are used for extracting sample names.

required
tag str

Column names containing the 'tag' are selected for extracting sample names.

required

Returns:

Type Description
list[str]

A list of sample names.

Source code in msreport\reader.py
def extract_sample_names(df: pd.DataFrame, tag: str) -> list[str]:
    """Extracts sample names from columns containing the 'tag' substring.

    Sample names are extracted from column names containing the 'tag' string, by
    splitting the column name with the 'tag', and removing all trailing and leading
    white spaces from the resulting strings.

    Args:
        df: Column names from this dataframe are used for extracting sample names.
        tag: Column names containing the 'tag' are selected for extracting sample names.

    Returns:
        A list of sample names.
    """
    columns = helper.find_columns(df, tag)
    sample_names = _find_remaining_substrings(columns, tag)
    return sample_names
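
For example, with made-up column names (the exact matching rules are delegated to helper.find_columns), the following would be expected:

import pandas as pd

df = pd.DataFrame(columns=["Intensity sample_A", "Intensity sample_B", "Protein IDs"])
print(extract_sample_names(df, "Intensity"))  # expected: ['sample_A', 'sample_B']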

extract_maxquant_localization_probabilities

extract_maxquant_localization_probabilities(
    localization_entry: str,
) -> dict[int, float]

Extract localization probabilities from a MaxQuant "Probabilities" entry.

Parameters:

- localization_entry (str, required): Entry from the "Probabilities" columns of a MaxQuant msms.txt, evidence.txt or Sites.txt table.

Returns:

- dict[int, float]: A dictionary of {position: probability} mappings. Positions are one-indexed, which means that the first amino acid position is 1.

Example:

>>> extract_maxquant_localization_probabilities("IRT(0.989)AMNS(0.011)IER")
{3: 0.989, 7: 0.011}

Source code in msreport\reader.py
def extract_maxquant_localization_probabilities(
    localization_entry: str,
) -> dict[int, float]:
    """Extract localization probabilites from a MaxQuant "Probabilities" entry.

    Args:
        localization_entry: Entry from the "Probabilities" columns of a MaxQuant
            msms.txt, evidence.txt or Sites.txt table.

    Returns:
        A dictionary of {position: probability} mappings. Positions are one-indexed,
        which means that the first amino acid position is 1.

    Example:
    >>> extract_maxquant_localization_probabilities("IRT(0.989)AMNS(0.011)IER")
    {3: 0.989, 7: 0.011}
    """
    _, probabilities = msreport.peptidoform.parse_modified_sequence(
        localization_entry, "(", ")"
    )
    site_probabilities = {
        site: float(probability) for site, probability in probabilities
    }
    return site_probabilities
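
A small follow-up sketch that picks the position with the highest localization probability from the example entry above:

probs = extract_maxquant_localization_probabilities("IRT(0.989)AMNS(0.011)IER")
best_site = max(probs, key=probs.get)  # position with the highest probability
print(best_site, probs[best_site])     # -> 3 0.989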

extract_fragpipe_localization_probabilities

extract_fragpipe_localization_probabilities(
    localization_entry: str,
) -> dict

Extract localization probabilities from a FragPipe "Localization" entry.

Parameters:

- localization_entry (str, required): Entry from the "Localization" column of a FragPipe ions.tsv or combined_ions.tsv table.

Returns:

- dict: A dictionary of modifications, each containing a dictionary of {position: probability} mappings. Positions are one-indexed, which means that the first amino acid position is 1.

Example:

>>> extract_fragpipe_localization_probabilities(
...     "M:15.9949@FIM(1.000)TPTLK;STY:79.9663@FIMT(0.334)PT(0.666)LK;"
... )
{'15.9949': {3: 1.0}, '79.9663': {4: 0.334, 6: 0.666}}

Source code in msreport\reader.py
def extract_fragpipe_localization_probabilities(localization_entry: str) -> dict:
    """Extract localization probabilites from a FragPipe "Localization" entry.

    Args:
        localization_entry: Entry from the "Localization" column of a FragPipe
            ions.tsv or combined_ions.tsv table.

    Returns:
        A dictionary of modifications containing a dictionary of {position: probability}
        mappings. Positions are one-indexed, which means that the first amino acid
        position is 1.

    Example:
    >>> extract_fragpipe_localization_probabilities(
    ...     "M:15.9949@FIM(1.000)TPTLK;STY:79.9663@FIMT(0.334)PT(0.666)LK;"
    ... )
    {'15.9949': {3: 1.0}, '79.9663': {4: 0.334, 6: 0.666}}
    """
    modification_probabilities: dict[str, dict[int, float]] = {}
    for modification_entry in filter(None, localization_entry.split(";")):
        specified_modification, probability_sequence = modification_entry.split("@")
        _, modification = specified_modification.split(":")
        _, probabilities = msreport.peptidoform.parse_modified_sequence(
            probability_sequence, "(", ")"
        )
        if modification not in modification_probabilities:
            modification_probabilities[modification] = {}
        modification_probabilities[modification].update(
            {site: float(probability) for site, probability in probabilities}
        )
    return modification_probabilities
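
As a sketch, the returned mapping can be filtered per modification, for example keeping only phospho sites (mass shift "79.9663") above an arbitrarily chosen probability cutoff:

probs = extract_fragpipe_localization_probabilities(
    "M:15.9949@FIM(1.000)TPTLK;STY:79.9663@FIMT(0.334)PT(0.666)LK;"
)
confident_phospho = {site: p for site, p in probs.get("79.9663", {}).items() if p > 0.5}
print(confident_phospho)  # -> {6: 0.666}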

extract_spectronaut_localization_probabilities

extract_spectronaut_localization_probabilities(
    localization_entry: str,
) -> dict

Extract localization probabilities from a Spectronaut localization entry.

Parameters:

- localization_entry (str, required): Entry from the "EG.PTMLocalizationProbabilities" column of a Spectronaut elution group (EG) output table.

Returns:

- dict: A dictionary of modifications, each containing a dictionary of {position: probability} mappings. Positions are one-indexed, which means that the first amino acid position is 1.

Example:

>>> extract_spectronaut_localization_probabilities(
...     "_HM[Oxidation (M): 100%]S[Phospho (STY): 45.5%]GS[Phospho (STY): 54.5%]PG_"
... )
{'Oxidation (M)': {2: 1.0}, 'Phospho (STY)': {3: 0.455, 5: 0.545}}

Source code in msreport\reader.py
def extract_spectronaut_localization_probabilities(localization_entry: str) -> dict:
    """Extract localization probabilites from a Spectronaut localization entry.

    Args:
        localization_entry: Entry from the "EG.PTMLocalizationProbabilities" column of a
            Spectronaut elution group (EG) output table.

    Returns:
        A dictionary of modifications containing a dictionary of {position: probability}
        mappings. Positions are one-indexed, which means that the first amino acid
        position is 1.

    Example:
    >>> extract_spectronaut_localization_probabilities(
    ...     "_HM[Oxidation (M): 100%]S[Phospho (STY): 45.5%]GS[Phospho (STY): 54.5%]PG_"
    ... )
    {'Oxidation (M)': {2: 1.0}, 'Phospho (STY)': {3: 0.455, 5: 0.545}}
    """
    modification_probabilities: dict[str, dict[int, float]] = {}
    localization_entry = localization_entry.strip("_")
    _, raw_probability_entries = msreport.peptidoform.parse_modified_sequence(
        localization_entry, "[", "]"
    )

    for site, mod_probability_entry in raw_probability_entries:
        modification, probability_entry = mod_probability_entry.split(": ")
        if modification not in modification_probabilities:
            modification_probabilities[modification] = {}
        probability = float(probability_entry.replace("%", "")) / 100.0
        modification_probabilities[modification][site] = probability
    return modification_probabilities
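
Similarly, a sketch that keeps only confidently localized sites per modification (the 0.75 cutoff is chosen arbitrarily for illustration):

probs = extract_spectronaut_localization_probabilities(
    "_HM[Oxidation (M): 100%]S[Phospho (STY): 45.5%]GS[Phospho (STY): 54.5%]PG_"
)
confident = {
    mod: [site for site, p in sites.items() if p >= 0.75]
    for mod, sites in probs.items()
}
print(confident)  # -> {'Oxidation (M)': [2], 'Phospho (STY)': []}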