Normalize

Transformer classes for normalizing and transforming quantitative proteomics data.

This module defines transformer classes for normalizing and scaling quantitative values in tabular data. These include normalizers such as MedianNormalizer, ModeNormalizer, and LowessNormalizer, as well as scalers such as PercentageScaler and ZscoreScaler. A specialized CategoricalNormalizer is also provided, which, when appropriately fitted and applied, can be used for more complex transformations such as iBAQ or site-to-protein normalization.

These transformers can be fitted to a table containing quantitative values to learn parameters. Once fitted, they can then be applied to another table to adjust its values. The transformation returns a new copy of the table with the normalized/scaled values, leaving the original table unchanged.
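
The fit and transform workflow described above can be sketched with a small stand-in class. `ColumnMeanCenterer` is hypothetical and not part of msreport; it only mirrors the `fit`, `is_fitted`, `get_fits`, and `transform` methods documented on this page.

```python
import pandas as pd


class ColumnMeanCenterer:
    """Hypothetical transformer, used only to illustrate the interface."""

    def __init__(self):
        self._sample_fits: dict[str, float] = {}

    def fit(self, table: pd.DataFrame) -> "ColumnMeanCenterer":
        # Learn one parameter per column and store it under the column name.
        self._sample_fits = table.mean().to_dict()
        return self

    def is_fitted(self) -> bool:
        return bool(self._sample_fits)

    def get_fits(self) -> dict[str, float]:
        return self._sample_fits.copy()

    def transform(self, table: pd.DataFrame) -> pd.DataFrame:
        # Work on a copy so that the input table stays unchanged.
        result = table.copy()
        for column in result.columns:
            result[column] = result[column] - self._sample_fits[column]
        return result


fit_table = pd.DataFrame({"A": [1.0, 3.0], "B": [10.0, 14.0]})
other_table = pd.DataFrame({"A": [2.0, 5.0], "B": [12.0, 13.0]})

# Parameters are learned from one table and applied to another one.
centerer = ColumnMeanCenterer().fit(fit_table)
centered = centerer.transform(other_table)
```

Because `fit` returns the instance, fitting and transforming can be chained, and `other_table` keeps its original values.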

Classes:

Name	Description
`FixedValueNormalizer`	Normalization by a constant normalization factor for each sample.
`ValueDependentNormalizer`	Normalization with a value dependent fit for each sample.
`SumNormalizer`	Normalizer that uses the sum of all values in each sample for normalization.
`MedianNormalizer`	A FixedValueNormalizer that uses the median as the fitting function.
`ModeNormalizer`	A FixedValueNormalizer that uses the mode as the fitting function.
`LowessNormalizer`	A ValueDependentNormalizer that uses lowess as the fitting function.
`CategoricalNormalizer`	Normalize samples based on category-dependent reference values.
`PercentageScaler`	Transform column values to percentages by dividing them by the column sum.
`ZscoreScaler`	Normalize samples by z-score scaling.
`Log2Transformer`	Apply log2 transformation to column values.

FixedValueNormalizer

FixedValueNormalizer(
    center_function: Callable, comparison: str
)

Normalization by a constant normalization factor for each sample.

Expects log transformed intensity values.

Parameters:

Name	Type	Description	Default
`center_function`	`Callable`	A function that accepts a sequence of values and returns a center value such as the median.	required
`comparison`	`str`	Must be "paired" or "reference". When "paired" is specified the normalization values are first calculated for each column pair. Then an optimal normalization value for each column is calculated by solving a matrix of linear equations of the column pair values with least squares. When "reference" is selected, a pseudo-reference sample is generated by calculating the mean value for each row. Only rows with valid values in all columns are used. Normalization values are then calculated by comparing each column to the pseudo-reference sample.	required

Methods:

Name	Description
`fit`	Fits the FixedValueNormalizer.
`is_fitted`	Returns True if the FixedValueNormalizer has been fitted.
`get_fits`	Returns a dictionary containing the fitted center values per sample.
`transform`	Applies a fixed value normalization to each column of the table.

Source code in msreport\normalize.py

def __init__(self, center_function: Callable, comparison: str):
    """Initializes the FixedValueNormalizer.

    Args:
        center_function: A function that accepts a sequence of values and
            returns a center value such as the median.
        comparison: Must be "paired" or "reference". When "paired" is specified
            the normalization values are first calculated for each column pair. Then
            an optimal normalization value for each column is calculated by solving
            a matrix of linear equations of the column pair values with least
            squares. When "reference" is selected, a pseudo-reference sample is
            generated by calculating the mean value for each row. Only rows with
            valid values in all columns are used. Normalization values are then
            calculated by comparing each column to the pseudo-reference sample.
    """
    if comparison not in ["paired", "reference"]:
        raise ValueError(
            f'"comparison" = {comparison} not allowed. '
            'Must be either "paired" or "reference".'
        )
    self._comparison_mode: str = comparison
    self._fit_function: Callable = center_function
    self._sample_fits: dict[str, float] = {}
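
The "reference" comparison can be illustrated with a minimal sketch. The private fitting methods are not shown on this page, so the function name `fit_with_pseudo_reference` and the exact way the deviations are computed (column minus pseudo-reference) are assumptions based on the parameter description only.

```python
import numpy as np
import pandas as pd


def fit_with_pseudo_reference(table, center_function=np.median):
    """Sketch of the "reference" mode; the name is hypothetical."""
    # Keep only rows with finite values in every column.
    complete = table[np.isfinite(table).all(axis=1)]
    # The pseudo-reference sample is the row-wise mean.
    reference = complete.mean(axis=1)
    # One center value per column, computed from the deviations
    # between the column and the pseudo-reference (assumed direction).
    return {
        column: float(center_function(complete[column] - reference))
        for column in complete.columns
    }


table = pd.DataFrame(
    {"A": [10.0, 12.0, 14.0, np.nan], "B": [12.0, 14.0, 16.0, 20.0]}
)
fits = fit_with_pseudo_reference(table)
```

In this example the last row is ignored because column "A" is missing there. Subtracting the resulting values from each column would align both samples with the pseudo-reference.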

fit

fit(table: DataFrame) -> Self

Fits the FixedValueNormalizer.

Parameters:

Name	Type	Description	Default
`table`	`DataFrame`	Dataframe used to calculate normalization values for each column. The normalization values are stored with the column names.	required

Returns:

Type	Description
`Self`	Returns the instance itself.

Source code in msreport\normalize.py

def fit(self, table: pd.DataFrame) -> Self:
    """Fits the FixedValueNormalizer.

    Args:
        table: Dataframe used to calculate normalization values for each column.
            The normalization values are stored with the column names.

    Returns:
        Returns the instance itself.
    """
    if self._comparison_mode == "paired":
        self._fit_with_paired_samples(table)
    elif self._comparison_mode == "reference":
        self._fit_with_pseudo_reference(table)
    return self
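
For the "paired" mode, the following sketch shows one way to turn pairwise center values into a single value per column with least squares. `_fit_with_paired_samples` is not shown on this page, so the function below, its name, and the way the equations are set up are assumptions and may differ from the msreport implementation.

```python
import itertools

import numpy as np
import pandas as pd


def fit_with_paired_samples(table, center_function=np.median):
    """Sketch of the "paired" mode; the name and details are assumptions."""
    columns = list(table.columns)
    rows, pair_values = [], []
    for i, j in itertools.combinations(range(len(columns)), 2):
        difference = (table[columns[i]] - table[columns[j]]).to_numpy(dtype=float)
        difference = difference[np.isfinite(difference)]
        # Each pair contributes one equation: fit[i] - fit[j] = center(difference).
        coefficients = np.zeros(len(columns))
        coefficients[i], coefficients[j] = 1.0, -1.0
        rows.append(coefficients)
        pair_values.append(center_function(difference))
    # The least-squares solution reconciles all pairwise values. The system is
    # rank deficient, so lstsq returns the minimum-norm solution, whose
    # entries sum to zero.
    solution, *_ = np.linalg.lstsq(np.array(rows), np.array(pair_values), rcond=None)
    return dict(zip(columns, solution.tolist()))


base = np.array([10.0, 12.0, 15.0, 11.0])
table = pd.DataFrame({"A": base, "B": base + 2.0, "C": base - 1.0})
paired_fits = fit_with_paired_samples(table)
normalized = table - pd.Series(paired_fits)
```

After subtracting the fitted values, the three columns of this toy table become identical, because they only differed by constant offsets.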

is_fitted

is_fitted() -> bool

Returns True if the FixedValueNormalizer has been fitted.

Source code in msreport\normalize.py

def is_fitted(self) -> bool:
    """Returns True if the FixedValueNormalizer has been fitted."""
    return True if self._sample_fits else False

get_fits

get_fits() -> dict[str, float]

Returns a dictionary containing the fitted center values per sample.

Raises:

Type	Description
`NotFittedError`	If the FixedValueNormalizer has not been fitted yet.

Source code in msreport\normalize.py

def get_fits(self) -> dict[str, float]:
    """Returns a dictionary containing the fitted center values per sample.

    Raises:
        NotFittedError: If the FixedValueNormalizer has not been fitted yet.
    """
    _confirm_is_fitted(self)
    return self._sample_fits.copy()

transform

transform(table: DataFrame) -> DataFrame

Applies a fixed value normalization to each column of the table.

Parameters:

Name	Type	Description	Default
`table`	`DataFrame`	The data to normalize. Each column name must correspond to a column name from the table that was used for the fitting.	required

Returns:

Type	Description
`DataFrame`	Transformed dataframe.

Raises:

Type	Description
`NotFittedError`	If the FixedValueNormalizer has not been fitted yet.

Source code in msreport\normalize.py

def transform(self, table: pd.DataFrame) -> pd.DataFrame:
    """Applies a fixed value normalization to each column of the table.

    Args:
        table: The data to normalize. Each column name must correspond to a column
            name from the table that was used for the fitting.

    Returns:
        Transformed dataframe.

    Raises:
        NotFittedError: If the FixedValueNormalizer has not been fitted yet.
    """
    _confirm_is_fitted(self)

    _table = table.copy()
    for column in _table.columns:
        column_data = np.array(_table[column], dtype=float)
        mask = np.isfinite(column_data)
        column_data[mask] = column_data[mask] - self._sample_fits[column]

        _table[column] = column_data
    return _table
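
A short worked example of the masking step in `transform`, using the same operations on a toy table. The values in `sample_fits` are made up for illustration.

```python
import numpy as np
import pandas as pd

sample_fits = {"A": 1.0, "B": -0.5}  # hypothetical fitted values
table = pd.DataFrame({"A": [10.0, np.nan, 12.0], "B": [9.5, 10.5, np.inf]})

normalized = table.copy()
for column in normalized.columns:
    column_data = np.array(normalized[column], dtype=float)
    # Only finite entries are shifted; NaN and infinite values stay as they are.
    mask = np.isfinite(column_data)
    column_data[mask] = column_data[mask] - sample_fits[column]
    normalized[column] = column_data
```

Since the fitted value is looked up by column name, a column that was not present during fitting would raise a `KeyError` in this loop.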

ValueDependentNormalizer

ValueDependentNormalizer(
    fit_function: Callable[[Iterable, Iterable], ndarray],
)

Normalization with a value dependent fit for each sample.

Expects log transformed intensity values.

Parameters:

Name	Type	Description	Default
`fit_function`	`Callable[[Iterable, Iterable], ndarray]`	A function that accepts two sequences of values with equal length, with the first sequence being the observed samples values and the second the reference values. The function must return a numpy array with two columns. The first column contains the values and the second column the fitted deviations.	required

Methods:

Name	Description
`fit`	Fits the ValueDependentNormalizer.
`is_fitted`	Returns True if the ValueDependentNormalizer has been fitted.
`get_fits`	Returns a dictionary containing lists of fitting data per sample.
`transform`	Applies a value dependent normalization to each column of the table.

Source code in msreport\normalize.py

def __init__(self, fit_function: Callable[[Iterable, Iterable], np.ndarray]):
    """Initializes the ValueDependentNormalizer.

    Args:
        fit_function: A function that accepts two sequences of values with equal
            length, with the first sequence being the observed samples values and
            the second the reference values. The function must return a numpy array
            with two columns. The first column contains the values and the second
            column the fitted deviations.
    """
    self._sample_fits: dict[str, np.ndarray] = {}
    self._fit_function = fit_function
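
To illustrate the contract of `fit_function`, here is a sketch of a custom function that returns the required two-column array. `linear_deviation_fit` is hypothetical, and it assumes that a deviation is defined as the observed value minus the reference value, which the description above does not state explicitly.

```python
import numpy as np


def linear_deviation_fit(values, reference_values):
    """Hypothetical fit function following the described two-column contract."""
    values = np.asarray(values, dtype=float)
    # Assumed definition: deviation = observed value - reference value.
    deviations = values - np.asarray(reference_values, dtype=float)
    # Fit a straight line describing how the deviation depends on the value.
    slope, intercept = np.polyfit(values, deviations, deg=1)
    sorted_values = np.sort(values)
    fitted_deviations = slope * sorted_values + intercept
    # Column 0: values, column 1: fitted deviation at each value.
    return np.column_stack([sorted_values, fitted_deviations])


observed = [16.0, 10.0, 14.0, 12.0]
reference = [15.5, 9.5, 13.5, 11.5]
fit = linear_deviation_fit(observed, reference)
```

The values in the first column are sorted in increasing order, which matters if the fit is later evaluated by interpolation.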

fit

fit(table: DataFrame) -> Self

Fits the ValueDependentNormalizer.

Parameters:

Name	Type	Description	Default
`table`	`DataFrame`	Dataframe used to calculate normalization arrays for each column.	required

Returns:

Type	Description
`Self`	Returns the instance itself.

Source code in msreport\normalize.py

def fit(self, table: pd.DataFrame) -> Self:
    """Fits the ValueDependentNormalizer.

    Args:
        table: Dataframe used to calculate normalization arrays for each column.

    Returns:
        Returns the instance itself.
    """
    self._fit_with_pseudo_reference(table)
    return self

is_fitted

is_fitted() -> bool

Returns True if the ValueDependentNormalizer has been fitted.

Source code in msreport\normalize.py

def is_fitted(self) -> bool:
    """Returns True if the ValueDependentNormalizer has been fitted."""
    return True if self._sample_fits else False

get_fits

get_fits() -> dict[str, ndarray]

Returns a dictionary containing lists of fitting data per sample.

Returns:

Type	Description
`dict[str, ndarray]`	A dictionary mapping sample names to fitting data. Fitting data is a sequence of [intensity, deviation at this intensity] pairs.

Raises:

Type	Description
`NotFittedError`	If the ValueDependentNormalizer has not been fitted yet.

Source code in msreport\normalize.py

def get_fits(self) -> dict[str, np.ndarray]:
    """Returns a dictionary containing lists of fitting data per sample.

    Returns:
        A dictionary mapping sample names to fitting data. Fitting data is a
        sequence of [intensity, deviation at this intensity] pairs.

    Raises:
        NotFittedError: If the ValueDependentNormalizer has not been fitted yet.
    """
    _confirm_is_fitted(self)
    return self._sample_fits.copy()

transform

transform(table: DataFrame) -> DataFrame

Applies a value dependent normalization to each column of the table.

Parameters:

Name	Type	Description	Default
`table`	`DataFrame`	The data to normalize. Each column name must correspond to a column name from the table that was used for the fitting.	required

Returns:

Type	Description
`DataFrame`	Transformed dataframe.

Raises:

Type	Description
`NotFittedError`	If the ValueDependentNormalizer has not been fitted yet.

Source code in msreport\normalize.py

def transform(self, table: pd.DataFrame) -> pd.DataFrame:
    """Applies a value dependent normalization to each column of the table.

    Args:
        table: The data to normalize. Each column name must correspond to a column
            name from the table that was used for the fitting.

    Returns:
        Transformed dataframe.

    Raises:
        NotFittedError: If the ValueDependentNormalizer has not been fitted yet.
    """
    _confirm_is_fitted(self)

    _table = table.copy()
    for column in _table.columns:
        column_data = np.array(_table[column], dtype=float)
        mask = np.isfinite(column_data)

        sample_fit = self._sample_fits[column]
        fit_values, fit_deviations = [np.array(i) for i in zip(*sample_fit)]
        column_data[mask] = column_data[mask] - np.interp(
            column_data[mask], fit_values, fit_deviations
        )

        _table[column] = column_data
    return _table
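
The interpolation step in `transform` can be traced with a small example. The fit array below is made up; it stands for one entry of the fitted data, with values in the first column and deviations in the second.

```python
import numpy as np

# Hypothetical fit: deviation 0.0 at value 10 and deviation 1.0 at value 20.
sample_fit = np.array([[10.0, 0.0], [20.0, 1.0]])
fit_values, fit_deviations = sample_fit[:, 0], sample_fit[:, 1]

column_data = np.array([15.0, 25.0, np.nan])
mask = np.isfinite(column_data)
# np.interp interpolates linearly between fit points and uses the deviation
# of the nearest edge point for values outside the fitted range.
column_data[mask] = column_data[mask] - np.interp(
    column_data[mask], fit_values, fit_deviations
)
```

The value 15 lies halfway between the fit points and is reduced by 0.5, the value 25 lies above the fitted range and is reduced by the edge deviation of 1.0, and the NaN entry is left untouched.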

SumNormalizer

SumNormalizer()

Normalizer that uses the sum of all values in each sample for normalization.

Expects log2-transformed intensity values. To obtain normalization factors, the sum of non-log2-transformed values is calculated for each sample, then divided by the average of all sample sums and log2-transformed.

Methods:

Name	Description
`fit`	Fits the SumNormalizer and returns a fitted instance.
`is_fitted`	Returns True if the Transformer has been fitted.
`get_fits`	Returns a dictionary containing the fitted center values per sample.
`transform`	Applies the sum normalization to each column of the table.

Source code in msreport\normalize.py

def __init__(self):
    """Initializes the SumNormalizer."""
    self._sample_fits: dict[str, float] = {}

fit

fit(table: DataFrame) -> Self

Fits the SumNormalizer and returns a fitted instance.

Parameters:

Name	Type	Description	Default
`table`	`DataFrame`	Dataframe used to calculate normalization values for each column.	required

Returns:

Type	Description
`Self`	Returns the instance itself.

Source code in msreport\normalize.py

def fit(self, table: pd.DataFrame) -> Self:
    """Fits the SumNormalizer and returns a fitted instance.

    Args:
        table: Dataframe used to calculate normalization values for each column.

    Returns:
        Returns the instance itself.
    """
    _sums = np.power(2, table).sum()
    _log2_fits = np.log2(_sums.divide(_sums.mean()))
    self._sample_fits = _log2_fits.to_dict()
    return self
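
A worked example of the calculation in `fit`, followed by the subtraction that `transform` performs, using a toy table of log2 values.

```python
import numpy as np
import pandas as pd

table = pd.DataFrame({"A": [1.0, 2.0], "B": [2.0, 3.0]})

# Sums of the non-log2 values: A = 2 + 4 = 6, B = 4 + 8 = 12.
sums = np.power(2, table).sum()
# Each sum is divided by the average sum (9) and log2-transformed.
fits = np.log2(sums.divide(sums.mean()))

# Subtracting the fitted values in log2 space equalizes the linear sums.
normalized = table - fits
normalized_sums = np.power(2, normalized).sum()
```

After normalization, the sum of the non-log2 values of each sample equals the average of the original sample sums.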

is_fitted

is_fitted() -> bool

Returns True if the Transformer has been fitted.

Source code in msreport\normalize.py

def is_fitted(self) -> bool:
    """Returns True if the Transformer has been fitted."""
    return True if self._sample_fits else False

get_fits

get_fits() -> dict[str, float]

Returns a dictionary containing the fitted center values per sample.

Raises:

Type	Description
`NotFittedError`	If the SumNormalizer has not been fitted yet.

Source code in msreport\normalize.py

def get_fits(self) -> dict[str, float]:
    """Returns a dictionary containing the fitted center values per sample.

    Raises:
        NotFittedError: If the SumNormalizer has not been fitted yet.
    """
    _confirm_is_fitted(self)
    return self._sample_fits.copy()

transform

transform(table: DataFrame) -> DataFrame

Applies the sum normalization to each column of the table.

Source code in msreport\normalize.py

def transform(self, table: pd.DataFrame) -> pd.DataFrame:
    """Applies the sum normalization to each column of the table."""
    _confirm_is_fitted(self)

    _table = table.copy()
    for column in _table.columns:
        column_data = np.array(_table[column], dtype=float)
        mask = np.isfinite(column_data)
        column_data[mask] = column_data[mask] - self._sample_fits[column]

        _table[column] = column_data
    return _table

MedianNormalizer

MedianNormalizer()

Bases: FixedValueNormalizer

A FixedValueNormalizer that uses the median as the fitting function.

Use MedianNormalizer.fit(table: pd.DataFrame) to fit the normalizer, and then MedianNormalizer.transform(table: pd.DataFrame) with the fitted normalizer to apply the normalization.

Source code in msreport\normalize.py

def __init__(self):
    """Initializes the MedianNormalizer."""
    super(MedianNormalizer, self).__init__(
        center_function=np.median, comparison="paired"
    )

ModeNormalizer

ModeNormalizer()

Bases: FixedValueNormalizer

A FixedValueNormalizer that uses the mode as the fitting function.

Use ModeNormalizer.fit(table: pd.DataFrame) to fit the normalizer, and then ModeNormalizer.transform(table: pd.DataFrame) with the fitted normalizer to apply the normalization.

Source code in msreport\normalize.py

def __init__(self):
    """Initializes the ModeNormalizer."""
    super(ModeNormalizer, self).__init__(
        center_function=msreport.helper.mode, comparison="paired"
    )

LowessNormalizer

LowessNormalizer()

Bases: ValueDependentNormalizer

A ValueDependentNormalizer that uses lowess as the fitting function.

Use LowessNormalizer.fit(table: pd.DataFrame) to fit the normalizer, and then LowessNormalizer.transform(table: pd.DataFrame) with the fitted normalizer to apply the normalization.

Source code in msreport\normalize.py

def __init__(self):
    """Initializes the LowessNormalizer."""
    super(LowessNormalizer, self).__init__(fit_function=_value_dependent_fit_lowess)

CategoricalNormalizer

CategoricalNormalizer(category_column: str)

Normalize samples based on category-dependent reference values.

Values from the reference table are used for normalization of the corresponding categories in the table that will be transformed. The normalization is applied to each column of the input table based on the category of each row.

The reference table must not contain NaN values and values in the sample columns must be log-transformed. The table to be transformed must contain the same category_column as the reference table and only include sample columns that were used for fitting. Values from categories not present in the reference table will be set to NaN. The table sample columns must also be log-transformed.

Parameters:

Name	Type	Description	Default
`category_column`	`str`	The name of the column containing the categories. This column must be present in the reference table and the table to be transformed.	required

Methods:

Name	Description
`is_fitted`	Returns True if the CategoricalNormalizer has been fitted.
`fit`	Fits the CategoricalNormalizer to a reference table.
`get_fits`	Returns a copy of the reference table used for fitting.
`get_category_column`	Returns the name of the category column.
`transform`	Applies a category dependent normalization to the table.

Source code in msreport\normalize.py

def __init__(self, category_column: str):
    """Initializes a new instance of the CategoricalNormalizer class.

    Args:
        category_column: The name of the column containing the categories. This
            column must be present in the reference table and the table to be
            transformed.
    """
    self._fitted_table: pd.DataFrame = pd.DataFrame()
    self._category_column: str = category_column

is_fitted

is_fitted() -> bool

Returns True if the CategoricalNormalizer has been fitted.

Source code in msreport\normalize.py

def is_fitted(self) -> bool:
    """Returns True if the CategoricalNormalizer has been fitted."""
    return not self._fitted_table.empty

fit

fit(reference_table: DataFrame) -> Self

Fits the CategoricalNormalizer to a reference table.

Parameters:

Name	Type	Description	Default
`reference_table`	`DataFrame`	The reference table used for fitting.	required

Returns:

Type	Description
`Self`	Returns the instance itself.

Raises:

Type	Description
`ValueError`	If the reference table contains NaN values.

Source code in msreport\normalize.py

def fit(self, reference_table: pd.DataFrame) -> Self:
    """Fits the CategoricalNormalizer to a reference table.

    Args:
        reference_table: The reference table used for fitting.

    Returns:
        Returns the instance itself.

    Raises:
        ValueError: If the reference table contains NaN values.
    """
    if reference_table.isna().values.any():
        raise ValueError("Input table contains NaN values")
    reference_table = reference_table.set_index(self.get_category_column())
    self._fitted_table = reference_table
    return self

get_fits

get_fits() -> DataFrame

Returns a copy of the reference table used for fitting.

Raises:

Type	Description
`NotFittedError`	If the CategoricalNormalizer has not been fitted yet.

Source code in msreport\normalize.py

def get_fits(self) -> pd.DataFrame:
    """Returns a copy of the reference table used for fitting.

    Raises:
        NotFittedError: If the CategoricalNormalizer has not been fitted yet.
    """
    _confirm_is_fitted(self)
    return self._fitted_table.copy()

get_category_column

get_category_column() -> str

Returns the name of the category column.

Source code in msreport\normalize.py

def get_category_column(self) -> str:
    """Returns the name of the category column."""
    return self._category_column

transform

transform(table: DataFrame) -> DataFrame

Applies a category dependent normalization to the table.

Parameters:

Name	Type	Description	Default
`table`	`DataFrame`	The table to normalize.	required

Returns:

Type	Description
`DataFrame`	The normalized table.

Raises:

Type	Description
`KeyError`	If the input table contains columns not present in the reference table.
`NotFittedError`	If the CategoricalNormalizer has not been fitted yet.

Source code in msreport\normalize.py

def transform(self, table: pd.DataFrame) -> pd.DataFrame:
    """Applies a category dependent normalization to the table.

    Args:
        table: The table to normalize.

    Returns:
        The normalized table.

    Raises:
        KeyError: If the input table contains columns not present in the reference
            table.
        NotFittedError: If the CategoricalNormalizer has not been fitted yet.
    """
    _confirm_is_fitted(self)

    original_index = table.index
    table = table.set_index(self.get_category_column(), drop=True, inplace=False)

    if not table.columns.isin(self._fitted_table).all():
        raise KeyError("The `table` contains columns not present in the fits")

    valid_categories = table.index.isin(self._fitted_table.index)
    sub_table = table[valid_categories]
    values_for_fitting = self._fitted_table.loc[sub_table.index, sub_table.columns]

    transformed_table = table.copy()
    transformed_table[~valid_categories] = np.nan
    transformed_table[valid_categories] = sub_table.sub(values_for_fitting, axis=1)

    transformed_table.reset_index(inplace=True)
    transformed_table.index = original_index
    return transformed_table
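
As an illustration of a category-dependent normalization such as site-to-protein normalization, the following sketch subtracts protein reference values from site values. `categorical_normalize` is a hypothetical, simplified re-implementation for demonstration and not the msreport code; the column names and values are made up.

```python
import pandas as pd


def categorical_normalize(table, reference_table, category_column):
    """Sketch of a category-dependent normalization; not the msreport code."""
    reference = reference_table.set_index(category_column)
    sample_columns = [c for c in table.columns if c != category_column]
    # Look up the reference row for the category of each table row. Categories
    # that are missing from the reference yield rows filled with NaN.
    reference_values = reference.reindex(index=table[category_column])
    reference_values = reference_values[sample_columns]
    result = table.copy()
    result[sample_columns] = (
        table[sample_columns].to_numpy() - reference_values.to_numpy()
    )
    return result


# Log-transformed protein intensities act as the reference.
proteins = pd.DataFrame(
    {"Protein": ["P1", "P2"], "S1": [10.0, 5.0], "S2": [11.0, 5.0]}
)
# Log-transformed site intensities; "P3" has no reference entry.
sites = pd.DataFrame(
    {"Protein": ["P1", "P1", "P3"], "S1": [12.0, 9.0, 1.0], "S2": [12.0, 10.0, 1.0]}
)
normalized_sites = categorical_normalize(sites, proteins, "Protein")
```

Both sites of "P1" are corrected by the same protein values, and the row belonging to "P3" becomes NaN because that category is absent from the reference table.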

PercentageScaler

Transform column values to percentages by dividing them by the column sum.

Methods:

Name	Description
`fit`	Returns the instance itself.
`is_fitted`	Always returns True because the Scaler does not need to be fitted.
`get_fits`	Returns an empty dictionary.
`transform`	Transforms column values into percentages by dividing them by the column sum.

fit

fit(table: DataFrame) -> Self

Returns the instance itself.

Source code in msreport\normalize.py

def fit(self, table: pd.DataFrame) -> Self:
    """Returns the instance itself."""
    return self

is_fitted

is_fitted() -> bool

Always returns True because the Scaler does not need to be fitted.

Source code in msreport\normalize.py

def is_fitted(self) -> bool:
    """Always returns True because the Scaler does not need to be fitted."""
    return True

get_fits

get_fits() -> dict

Returns an empty dictionary.

Source code in msreport\normalize.py

def get_fits(self) -> dict:
    """Returns an empty dictionary."""
    return {}

transform

transform(table: DataFrame) -> DataFrame

Transforms column values into percentages by dividing them by the column sum.

Parameters:

Name	Type	Description	Default
`table`	`DataFrame`	The table whose column values are scaled.	required

Returns:

Type	Description
`DataFrame`	A copy of the table containing the scaled values.

Source code in msreport\normalize.py

def transform(self, table: pd.DataFrame) -> pd.DataFrame:
    """Transforms column values into percentages by dividing them by the column sum.

    Args:
        table: The table whose column values are scaled.

    Returns:
        A copy of the table containing the scaled values.
    """
    return table.divide(table.sum(axis=0), axis=1)
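
A small example of the division shown in the source above, applied to a toy table.

```python
import pandas as pd

table = pd.DataFrame({"A": [1.0, 3.0], "B": [2.0, 2.0]})

# Each column is divided by its own sum: A by 4, B by 4.
scaled = table.divide(table.sum(axis=0), axis=1)
```

With this expression the scaled values of each column sum to 1, so they are fractions of the column total; multiplying by 100 would express them on a 0 to 100 scale.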

ZscoreScaler

ZscoreScaler(with_mean: bool = True, with_std: bool = True)

Normalize samples by z-score scaling.

Parameters:

Name	Type	Description	Default
`with_mean`	`bool`	If True, center row values by subtracting the row mean.	`True`
`with_std`	`bool`	If True, scale row values by dividing by the row std.	`True`

Methods:

Name	Description
`fit`	Returns the instance itself.
`is_fitted`	Always returns True because the ZscoreScaler does not need to be fitted.
`get_fits`	Returns a dictionary containing the parameters 'with_mean' and 'with_std'.
`transform`	Applies a z-score normalization to each row of the table.

Source code in msreport\normalize.py

def __init__(self, with_mean: bool = True, with_std: bool = True):
    """Initializes a new instance of the ZscoreScaler class.

    Args:
        with_mean: If True, center row values by subtracting the row mean.
        with_std: If True, scale row values by dividing by the row std.
    """
    self._with_mean = with_mean
    self._with_std = with_std

fit

fit(table: DataFrame) -> Self

Returns the instance itself.

Source code in msreport\normalize.py

def fit(self, table: pd.DataFrame) -> Self:
    """Returns the instance itself."""
    return self

is_fitted

is_fitted() -> bool

Always returns True because the ZscoreScaler does not need to be fitted.

Source code in msreport\normalize.py

def is_fitted(self) -> bool:
    """Always returns True because the ZscoreScaler does not need to be fitted."""
    return True

get_fits

get_fits() -> dict

Returns a dictionary containing the parameters 'with_mean' and 'with_std'.

Source code in msreport\normalize.py

def get_fits(self) -> dict:
    """Returns a dictionary containing the parameters 'with_mean' and 'with_std'."""
    return {"with_mean": self._with_mean, "with_std": self._with_std}

transform

transform(table: DataFrame) -> DataFrame

Applies a z-score normalization to each row of the table.

Parameters:

Name	Type	Description	Default
`table`	`DataFrame`	The table used to scale row values.	required

Returns:

Type	Description
`DataFrame`	A copy of the table containing the scaled values.

Source code in msreport\normalize.py

def transform(self, table: pd.DataFrame) -> pd.DataFrame:
    """Applies a z-score normalization to each row of the table.

    Args:
        table: The table used to scale row values.

    Returns:
        A copy of the table containing the scaled values.
    """
    scaled_table = table.copy()
    if self._with_mean:
        scaled_table = scaled_table.subtract(scaled_table.mean(axis=1), axis=0)
    if self._with_std:
        scaled_table = scaled_table.divide(scaled_table.std(axis=1, ddof=0), axis=0)
    return scaled_table
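
The two steps of the z-score scaling can be followed on a toy table, using the same pandas calls as the source above with both `with_mean` and `with_std` enabled.

```python
import numpy as np
import pandas as pd

table = pd.DataFrame(
    {"S1": [1.0, 10.0], "S2": [2.0, 10.0], "S3": [3.0, 40.0]}
)

# Center each row by subtracting the row mean (2 and 20).
centered = table.subtract(table.mean(axis=1), axis=0)
# Scale each row by its population standard deviation (ddof=0).
scaled = centered.divide(centered.std(axis=1, ddof=0), axis=0)
```

Each row of `scaled` has a mean of 0 and a standard deviation of 1 when calculated with `ddof=0`. A row with identical values in all columns has a standard deviation of 0, so the division does not produce finite values for it.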

Log2Transformer

Apply log2 transformation to column values.

Methods:

Name	Description
`fit`	Returns the instance itself.
`is_fitted`	Always returns True because the Log2Transformer does not need to be fitted.
`transform`	Applies a log2 transformation to each column of the table.

fit

fit(table: DataFrame) -> Self

Returns the instance itself.

Source code in msreport\normalize.py

def fit(self, table: pd.DataFrame) -> Self:
    """Returns the instance itself."""
    return self

is_fitted

is_fitted() -> bool

Always returns True because the Log2Transformer does not need to be fitted.

Source code in msreport\normalize.py

def is_fitted(self) -> bool:
    """Always returns True because the Log2Transformer does not need to be fitted."""
    return True

transform

transform(table: DataFrame) -> DataFrame

Applies a log2 transformation to each column of the table.

Zero values are replaced with NaN before the transformation to avoid an error during the log2 calculation.

Source code in msreport\normalize.py

def transform(self, table: pd.DataFrame) -> pd.DataFrame:
    """Applies a log2 transformation to each column of the table.

    Zero values are replaced with NaN before the transformation to avoid an error
    during the log2 calculation.
    """
    return pd.DataFrame(np.log2(table.replace({0: np.nan})))
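
A brief example of the zero handling, using the same expression as the source above on a toy table.

```python
import numpy as np
import pandas as pd

table = pd.DataFrame({"A": [0.0, 1.0, 8.0]})

# Zeros are replaced with NaN first, so they stay missing after the log2 step.
log2_table = pd.DataFrame(np.log2(table.replace({0: np.nan})))
```

The zero entry becomes NaN, while 1 and 8 are transformed to 0 and 3.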