Impute

Transformer classes for imputing missing values in quantitative proteomics data.

This module defines transformer classes that can be fitted to a table containing quantitative values to learn imputation parameters. Once fitted, these transformers can then be applied to another table to transform it by filling in missing values. The transformation returns a new copy of the table with the imputed values, leaving the original table unchanged.
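
The fit/transform pattern described above can be sketched without the library itself. The following is a minimal illustration, assuming pandas and NumPy are installed; `ConstantImputerSketch` is a hypothetical stand-in written for this example and is not part of msreport.

```python
import numpy as np
import pandas as pd


class ConstantImputerSketch:
    """Hypothetical stand-in that mimics the fit/transform pattern."""

    def __init__(self, fill_value: float = 0.0):
        self.fill_value = fill_value
        self._fill_values = {}

    def fit(self, table: pd.DataFrame) -> "ConstantImputerSketch":
        # Learn one fill value per column of the fitted table.
        self._fill_values = dict.fromkeys(table.columns, self.fill_value)
        return self

    def transform(self, table: pd.DataFrame) -> pd.DataFrame:
        # Work on a copy so the caller's table is left unchanged.
        result = table.copy()
        for column in result.columns:
            values = np.array(result[column], dtype=float)
            values[~np.isfinite(values)] = self._fill_values[column]
            result[column] = values
        return result


reference = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, np.nan]})
other = pd.DataFrame({"A": [np.nan, 5.0], "B": [6.0, np.nan]})

# Fit on one table, then apply the learned parameters to another table.
imputer = ConstantImputerSketch(fill_value=0.0).fit(reference)
imputed = imputer.transform(other)
```

After the call, `imputed` has no missing values, while `other` still contains its original NaN entries.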

Classes:

Name Description
FixedValueImputer

Imputer for completing missing values with a fixed value.

GaussianImputer

Imputer for completing missing values by drawing from a gaussian distribution.

PerseusImputer

Imputer for completing missing values as implemented in Perseus.

FixedValueImputer

FixedValueImputer(
    strategy: str,
    fill_value: float = 0.0,
    column_wise: bool = True,
)

Imputer for completing missing values with a fixed value.

Replaces missing values either with a constant value or with an integer that is smaller than the minimum value, where the minimum is taken per column or across the whole table.

Parameters:

Name Type Description Default
strategy str

The imputation strategy.
- If "constant", replace missing values with 'fill_value'.
- If "below", replace missing values with an integer that is smaller than the minimal value of the fitted dataframe. Minimal values are calculated per column if 'column_wise' is True, otherwise the minimal value is calculated for all columns.

required
fill_value float

When strategy is "constant", 'fill_value' is used to replace all missing values.

0.0
column_wise bool

If True, imputation is performed independently for each column, otherwise the whole dataframe is imputed together. Default True.

True
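
How the "below" strategy picks its fill value depends on the private helper `_calculate_integer_below_min`, whose source is not shown on this page. The sketch below assumes one plausible rule (take the floor of the smallest observed value, and step down by one if the minimum is already an integer). `integer_below_min` is a hypothetical name, and the library's actual rule may differ.

```python
import math

import numpy as np
import pandas as pd


def integer_below_min(values) -> int:
    """Hypothetical rule: largest integer strictly below the observed minimum."""
    minimum = float(np.nanmin(np.asarray(values, dtype=float)))
    below = math.floor(minimum)
    if below == minimum:  # minimum is already an integer, so step down once more
        below -= 1
    return int(below)


table = pd.DataFrame({"A": [10.5, 12.0, np.nan], "B": [8.0, np.nan, 9.3]})

# column_wise=True: one fill value per column
per_column = {column: integer_below_min(table[column]) for column in table.columns}

# column_wise=False: a single fill value shared by all columns
shared = dict.fromkeys(table.columns, integer_below_min(table))
```

Under this assumed rule, column "A" would be filled with 10 and column "B" with 7 when working per column, and both columns would be filled with 7 when the minimum is taken over the whole table.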

Methods:

Name Description
fit

Fits the FixedValueImputer.

is_fitted

Returns True if the FixedValueImputer has been fitted.

transform

Impute all missing values in 'table'.

Source code in msreport\impute.py
def __init__(
    self,
    strategy: str,
    fill_value: float = 0.0,
    column_wise: bool = True,
):
    """Initializes the FixedValueImputer.

    Args:
        strategy: The imputation strategy.
            - If "constant", replace missing values with 'fill_value'.
            - If "below", replace missing values with an integer that is smaller
              than the minimal value of the fitted dataframe. Minimal values are
              calculated per column if 'column_wise' is True, otherwise the minimal
              value is calculated for all columns.
        fill_value: When strategy is "constant", 'fill_value' is used to replace all
            missing values.
        column_wise: If True, imputation is performed independently for each column,
            otherwise the whole dataframe is imputed together. Default True.

    """
    self.strategy = strategy
    self.fill_value = fill_value
    self.column_wise = column_wise
    self._sample_fill_values: dict[str, float] = {}

fit

fit(table: DataFrame) -> Self

Fits the FixedValueImputer.

Parameters:

Name Type Description Default
table DataFrame

Input Dataframe for generating fill values for each column.

required

Returns:

Type Description
Self

Returns the fitted FixedValueImputer instance.

Source code in msreport\impute.py
def fit(self, table: pd.DataFrame) -> Self:
    """Fits the FixedValueImputer.

    Args:
        table: Input Dataframe for generating fill values for each column.

    Returns:
        Returns the fitted FixedValueImputer instance.
    """
    if self.strategy == "constant":
        fill_values = dict.fromkeys(table.columns, self.fill_value)
    elif self.strategy == "below":
        if self.column_wise:
            fill_values = {}
            for column in table.columns:
                fill_values[column] = _calculate_integer_below_min(table[column])
        else:
            int_below_min = _calculate_integer_below_min(table)
            fill_values = dict.fromkeys(table.columns, int_below_min)
    self._sample_fill_values = fill_values
    return self
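
In the `fit` source shown above, `fill_values` is only assigned inside the "constant" and "below" branches, so any other `strategy` string reaches the final assignment with the name unbound. The sketch below reproduces that branching in a simplified stand-in and shows one possible guard; both `pick_fill_value` and `validated_strategy` are hypothetical and not part of msreport.

```python
def pick_fill_value(strategy: str, fill_value: float = 0.0) -> float:
    """Simplified stand-in for the branching in fit(); not msreport code."""
    if strategy == "constant":
        value = fill_value
    elif strategy == "below":
        value = -1.0  # placeholder for the computed integer below the minimum
    # No else branch: 'value' stays unbound for any other strategy.
    return value


def validated_strategy(strategy: str) -> str:
    """Hypothetical guard that rejects unsupported strategies early."""
    allowed = ("constant", "below")
    if strategy not in allowed:
        raise ValueError(f"strategy must be one of {allowed}, got {strategy!r}")
    return strategy


try:
    pick_fill_value("median")
    error_name = None
except UnboundLocalError as error:
    # Python reports the unbound local variable, not the invalid strategy.
    error_name = type(error).__name__
```

A check like `validated_strategy` would surface a misspelled strategy with a descriptive message instead of an `UnboundLocalError`.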

is_fitted

is_fitted() -> bool

Returns True if the FixedValueImputer has been fitted.

Source code in msreport\impute.py
def is_fitted(self) -> bool:
    """Returns True if the FixedValueImputer has been fitted."""
    return len(self._sample_fill_values) != 0

transform

transform(table: DataFrame) -> DataFrame

Impute all missing values in 'table'.

Parameters:

Name Type Description Default
table DataFrame

A dataframe of numeric values that will be completed. Each column name must correspond to a column name from the table that was used for the fitting.

required

Returns:

Type Description
DataFrame

'table' with imputed missing values.

Source code in msreport\impute.py
def transform(self, table: pd.DataFrame) -> pd.DataFrame:
    """Impute all missing values in 'table'.

    Args:
        table: A dataframe of numeric values that will be completed. Each column
            name must correspond to a column name from the table that was used for
            the fitting.

    Returns:
        'table' with imputed missing values.
    """
    _confirm_is_fitted(self)

    _table = table.copy()
    for column in _table.columns:
        column_data = np.array(_table[column], dtype=float)
        mask = ~np.isfinite(column_data)
        column_data[mask] = self._sample_fill_values[column]
        _table[column] = column_data
    return _table
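
The mask in `transform` is built with `~np.isfinite`, so positive and negative infinity are treated as missing alongside NaN. A short illustration of that masking step, assuming only NumPy and using made-up values:

```python
import numpy as np

column_data = np.array([1.5, np.nan, np.inf, -np.inf, 4.0], dtype=float)

# True wherever the value is NaN or +/- infinity
mask = ~np.isfinite(column_data)

# Replace every masked entry with the same fill value
column_data[mask] = 0.0
```

Only the two finite values, 1.5 and 4.0, are left untouched.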

GaussianImputer

GaussianImputer(
    mu: float, sigma: float, seed: Optional[int] = None
)

Imputer for completing missing values by drawing from a gaussian distribution.

Parameters:

Name Type Description Default
mu float

Mean of the gaussian distribution.

required
sigma float

Standard deviation of the gaussian distribution, must be positive.

required
seed Optional[int]

Optional, allows specifying a number for initializing the random number generator. Using the same seed for the same input table will generate the same set of imputed values each time. Default is None, which results in different imputed values being generated each time.

None
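
The effect of `seed` can be illustrated outside the class. The sketch below uses the same two NumPy calls that appear in the `transform` source (`np.random.seed` followed by `np.random.normal`); the helper name `draw` is hypothetical.

```python
import numpy as np


def draw(mu: float, sigma: float, size: int, seed=None) -> np.ndarray:
    """Hypothetical helper: seed the global generator, then draw samples."""
    np.random.seed(seed)
    return np.random.normal(loc=mu, scale=sigma, size=size)


# Same seed, same parameters: identical values
first = draw(mu=15.0, sigma=0.5, size=4, seed=42)
second = draw(mu=15.0, sigma=0.5, size=4, seed=42)

# seed=None reseeds from fresh entropy, so values differ between runs
unseeded = draw(mu=15.0, sigma=0.5, size=4)
```

Because `np.random.seed` resets NumPy's global generator, seeding in this way also affects later draws from `np.random` elsewhere in the same process.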

Methods:

Name Description
fit

Fits the GaussianImputer, although this is not necessary.

is_fitted

Always returns True, as the GaussianImputer does not need to be fitted.

transform

Impute all missing values in 'table'.

Source code in msreport\impute.py
def __init__(self, mu: float, sigma: float, seed: Optional[int] = None):
    """Initializes the GaussianImputer.

    Args:
        mu: Mean of the gaussian distribution.
        sigma: Standard deviation of the gaussian distribution, must be positive.
        seed: Optional, allows specifying a number for initializing the random
            number generator. Using the same seed for the same input table will
            generate the same set of imputed values each time. Default is None,
            which results in different imputed values being generated each time.
    """
    self.mu = mu
    self.sigma = sigma
    self.seed = seed

fit

fit(table: DataFrame) -> Self

Fits the GaussianImputer, although this is not necessary.

Parameters:

Name Type Description Default
table DataFrame

Input Dataframe for fitting.

required

Returns:

Type Description
Self

Returns the fitted GaussianImputer instance.

Source code in msreport\impute.py
def fit(self, table: pd.DataFrame) -> Self:
    """Fits the GaussianImputer, although this is not necessary.

    Args:
        table: Input Dataframe for fitting.

    Returns:
        Returns the fitted GaussianImputer instance.
    """
    return self

is_fitted

is_fitted() -> bool

Always returns True, as the GaussianImputer does not need to be fitted.

Source code in msreport\impute.py
def is_fitted(self) -> bool:
    """Always returns True, as the GaussianImputer does not need to be fitted."""
    return True

transform

transform(table: DataFrame) -> DataFrame

Impute all missing values in 'table'.

Parameters:

Name Type Description Default
table DataFrame

A dataframe of numeric values that will be completed. Each column name must correspond to a column name from the table that was used for the fitting.

required

Returns:

Type Description
DataFrame

'table' with imputed missing values.

Source code in msreport\impute.py
def transform(self, table: pd.DataFrame) -> pd.DataFrame:
    """Impute all missing values in 'table'.

    Args:
        table: A dataframe of numeric values that will be completed. Each column
            name must correspond to a column name from the table that was used for
            the fitting.

    Returns:
        'table' with imputed missing values.
    """
    _confirm_is_fitted(self)
    np.random.seed(self.seed)

    _table = table.copy()
    for column in _table.columns:
        column_data = np.array(_table[column], dtype=float)
        mask = ~np.isfinite(column_data)
        column_data[mask] = np.random.normal(
            loc=self.mu, scale=self.sigma, size=mask.sum()
        )
        _table[column] = column_data
    return _table
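
The steps of `transform` can be traced on a small table. The function below is a standalone sketch that mirrors the source shown above; `gaussian_impute` is a hypothetical name and the input values are made up.

```python
import numpy as np
import pandas as pd


def gaussian_impute(
    table: pd.DataFrame, mu: float, sigma: float, seed=None
) -> pd.DataFrame:
    """Hypothetical function mirroring the steps of transform() shown above."""
    np.random.seed(seed)
    result = table.copy()
    for column in result.columns:
        values = np.array(result[column], dtype=float)
        mask = ~np.isfinite(values)
        # Draw exactly one random value per missing entry in this column
        values[mask] = np.random.normal(loc=mu, scale=sigma, size=mask.sum())
        result[column] = values
    return result


table = pd.DataFrame({"A": [20.1, np.nan, 22.4], "B": [np.nan, np.nan, 19.8]})
imputed = gaussian_impute(table, mu=15.0, sigma=0.5, seed=1)
```

Observed values are carried over unchanged, and only the three missing entries receive random draws.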

PerseusImputer

PerseusImputer(
    median_downshift: float = 1.8,
    std_width: float = 0.3,
    column_wise: bool = True,
    seed: Optional[int] = None,
)

Imputer for completing missing values as implemented in Perseus.

Perseus-style imputation replaces missing values by random numbers drawn from a normal distribution. Sigma and mu of this distribution are calculated from the standard deviation and median of the observed values.

Parameters:

Name Type Description Default
median_downshift float

Number of standard deviations by which the observed median is shifted down to calculate mu of the normal distribution. Default is 1.8.

1.8
std_width float

Factor for adjusting the standard deviation of the observed values to obtain sigma of the normal distribution. Default is 0.3

0.3
column_wise bool

If True, imputation is performed independently for each column, otherwise the whole dataframe is imputed together. Default True.

True
seed Optional[int]

Optional, allows specifying a number for initializing the random number generator. Using the same seed for the same input table will generate the same set of imputed values each time. Default is None, which results in different imputed values being generated each time.

None
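
The relationship between these parameters and the resulting distribution can be worked through for a single column. The sketch below applies the formulas used in `fit` (mu = median - std * median_downshift, sigma = std * std_width) to made-up values. Note that `np.nanstd` computes the population standard deviation (ddof=0) by default.

```python
import numpy as np

observed = np.array([20.0, 22.0, 24.0, np.nan])
median_downshift = 1.8
std_width = 0.3

median = np.nanmedian(observed)  # 22.0, the NaN entry is ignored
std = np.nanstd(observed)        # population standard deviation of 20, 22, 24

mu = median - std * median_downshift  # centre of the imputation distribution
sigma = std * std_width               # width of the imputation distribution
```

With these values, `std` is about 1.63, so `mu` is about 19.06 and `sigma` about 0.49: the imputed values come from a narrow distribution centred below the observed median.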

Methods:

Name Description
fit

Fits the PerseusImputer.

is_fitted

Returns True if the PerseusImputer has been fitted.

transform

Impute all missing values in 'table'.

Source code in msreport\impute.py
def __init__(
    self,
    median_downshift: float = 1.8,
    std_width: float = 0.3,
    column_wise: bool = True,
    seed: Optional[int] = None,
):
    """Initializes the PerseusImputer.

    Args:
        median_downshift: Number of standard deviations by which the observed
            median is shifted down to calculate mu of the normal distribution.
            Default is 1.8.
        std_width: Factor for adjusting the standard deviation of the observed
            values to obtain sigma of the normal distribution. Default is 0.3.
        column_wise: If True, imputation is performed independently for each column,
            otherwise the whole dataframe is imputed together. Default True.
        seed: Optional, allows specifying a number for initializing the random
            number generator. Using the same seed for the same input table will
            generate the same set of imputed values each time. Default is None,
            which results in different imputed values being generated each time.

    """
    self.median_downshift = median_downshift
    self.std_width = std_width
    self.column_wise = column_wise
    self.seed = seed
    self._column_params: dict[str, dict[str, float]] = {}

fit

fit(table: DataFrame) -> Self

Fits the PerseusImputer.

Parameters:

Name Type Description Default
table DataFrame

Input Dataframe for calculating mu and sigma of the gaussian distribution.

required

Returns:

Type Description
Self

Returns the fitted PerseusImputer instance.

Source code in msreport\impute.py
def fit(self, table: pd.DataFrame) -> Self:
    """Fits the PerseusImputer.

    Args:
        table: Input Dataframe for calculating mu and sigma of the gaussian
            distribution.

    Returns:
        Returns the fitted PerseusImputer instance.
    """
    for column in table.columns:
        if self.column_wise:
            median = np.nanmedian(table[column])
            std = np.nanstd(table[column])
        else:
            median = np.nanmedian(table)
            std = np.nanstd(table)

        mu = median - (std * self.median_downshift)
        sigma = std * self.std_width

        self._column_params[column] = {"mu": mu, "sigma": sigma}
    return self
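
The difference between the two `column_wise` settings comes down to which values feed the median and standard deviation. The sketch below compares both on made-up data, assuming NumPy and pandas; it calls `to_numpy()` explicitly, whereas the source above passes the table directly.

```python
import numpy as np
import pandas as pd

table = pd.DataFrame({"A": [10.0, 12.0, np.nan], "B": [20.0, 22.0, 24.0]})

# column_wise=True: statistics are computed from each column separately
per_column = {
    column: (
        np.nanmedian(table[column].to_numpy()),
        np.nanstd(table[column].to_numpy()),
    )
    for column in table.columns
}

# column_wise=False: one median and one standard deviation for the whole table
global_stats = (np.nanmedian(table.to_numpy()), np.nanstd(table.to_numpy()))
```

In both cases `fit` stores one entry per column. With `column_wise=False` every column receives the same mu and sigma, here derived from a median of 20.0 over all five observed values.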

is_fitted

is_fitted() -> bool

Returns True if the PerseusImputer has been fitted.

Source code in msreport\impute.py
def is_fitted(self) -> bool:
    """Returns True if the PerseusImputer has been fitted."""
    return len(self._column_params) != 0

transform

transform(table: DataFrame) -> DataFrame

Impute all missing values in 'table'.

Parameters:

Name Type Description Default
table DataFrame

A dataframe of numeric values that will be completed. Each column name must correspond to a column name from the table that was used for the fitting.

required

Returns:

Type Description
DataFrame

'table' with imputed missing values.

Source code in msreport\impute.py
def transform(self, table: pd.DataFrame) -> pd.DataFrame:
    """Impute all missing values in 'table'.

    Args:
        table: A dataframe of numeric values that will be completed. Each column
            name must correspond to a column name from the table that was used for
            the fitting.

    Returns:
        'table' with imputed missing values.
    """
    _confirm_is_fitted(self)
    np.random.seed(self.seed)

    _table = table.copy()
    for column in _table.columns:
        column_data = np.array(_table[column], dtype=float)
        mask = ~np.isfinite(column_data)
        column_data[mask] = np.random.normal(
            loc=self._column_params[column]["mu"],
            scale=self._column_params[column]["sigma"],
            size=mask.sum(),
        )
        _table[column] = column_data
    return _table