DataCleaner

class qf_lib.common.utils.data_cleaner.DataCleaner(dataframe: SimpleReturnsDataFrame, threshold: float = 0.05)[source]

Bases: object

Cleans data that is partially incomplete (e.g. contains gaps).

Parameters:
  • dataframe (SimpleReturnsDataFrame) – DataFrame of simple returns to clean. Any column whose share of missing values exceeds the threshold is removed from the result.

  • threshold (float) – upper limit on the share of missing data in a series. If the share of missing values in a series exceeds this limit, the series is removed. It is a relative value (e.g. 0.02 corresponds to 2% of the observations in the series).
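
The threshold rule described above can be illustrated with a small, library-independent sketch. It does not use qf_lib and is not the library's implementation: the helper name `drop_sparse_columns`, the dict-of-lists layout, and the use of NaN as the missing-value marker are all assumptions made for the example.

```python
import math

def drop_sparse_columns(columns, threshold=0.05):
    """Keep only columns whose share of missing (NaN) values does not exceed threshold.

    Hypothetical helper for illustration; `columns` maps a column name to a list of floats.
    """
    kept = {}
    for name, values in columns.items():
        if not values:
            continue  # assumption: an empty column carries no data, so it is dropped
        missing_share = sum(1 for v in values if math.isnan(v)) / len(values)
        if missing_share <= threshold:
            kept[name] = values
    return kept

nan = math.nan
returns = {
    "A": [0.01, -0.02, 0.00, 0.03, 0.01],  # 0 of 5 values missing (0%)
    "B": [0.02, nan, 0.01, -0.01, 0.00],   # 1 of 5 values missing (20%)
    "C": [nan, nan, nan, 0.01, 0.02],      # 3 of 5 values missing (60%)
}

# With a 25% limit only column "C" exceeds the threshold and is removed.
cleaned = drop_sparse_columns(returns, threshold=0.25)
print(sorted(cleaned))  # ['A', 'B']
```

With the default limit of 0.05 used in the class signature, column "B" in this toy data would be removed as well, since 20% of its values are missing.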

Methods:

proxy_using_regression(benchmark_tms, ...)

Removes columns from the DataFrame which have too many missing values, then fills the remaining gaps using regression with the benchmark.

proxy_using_value(proxy_value)

Removes columns from the DataFrame which have too many missing values, then fills the remaining gaps with the given proxy_value.

proxy_using_regression(benchmark_tms: QFSeries, columns_type: type) SimpleReturnsDataFrame[source]

Removes columns from the DataFrame which have too many missing values. The missing data in the remaining columns is then filled in using a regression against the benchmark.

Parameters:
  • benchmark_tms (QFSeries) – benchmark series against which the columns are regressed in order to proxy the missing data in the DataFrame.

  • columns_type (type) – type of each column (e.g. PricesSeries, LogReturnsSeries)

Returns:

The completed DataFrame. It can still contain missing data, because regression cannot always fill every gap (e.g. when a value is missing in the original series and there is no corresponding benchmark value).

Return type:

SimpleReturnsDataFrame
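
To make the idea of proxying by regression concrete, here is a minimal plain-Python sketch. It is an assumption about the general technique, not a description of how qf_lib implements it: the helper name `fill_with_regression` is hypothetical, and the sketch fits an ordinary least-squares line (intercept and slope) on the observations where both the series and the benchmark are available.

```python
import math

def fill_with_regression(series, benchmark):
    """Fill NaNs in `series` with intercept + slope * benchmark value.

    Hypothetical helper for illustration; the line is fitted by ordinary
    least squares on the points where both inputs are present.
    """
    pairs = [(b, s) for b, s in zip(benchmark, series)
             if not math.isnan(b) and not math.isnan(s)]
    if len(pairs) < 2:
        return list(series)  # not enough overlapping data to fit a line
    mean_b = sum(b for b, _ in pairs) / len(pairs)
    mean_s = sum(s for _, s in pairs) / len(pairs)
    var_b = sum((b - mean_b) ** 2 for b, _ in pairs)
    if var_b == 0:
        return list(series)  # a constant benchmark gives no slope estimate
    slope = sum((b - mean_b) * (s - mean_s) for b, s in pairs) / var_b
    intercept = mean_s - slope * mean_b

    filled = []
    for b, s in zip(benchmark, series):
        if math.isnan(s) and not math.isnan(b):
            filled.append(intercept + slope * b)  # proxy from the benchmark
        else:
            filled.append(s)  # keep original value, or NaN if benchmark is missing too
    return filled

nan = math.nan
benchmark = [0.01, 0.02, nan, -0.01, 0.03]
series = [0.02, nan, nan, -0.02, 0.06]  # equals 2 * benchmark where both exist

filled = fill_with_regression(series, benchmark)
print(filled)
```

In this toy data the second value is proxied from the benchmark (approximately 0.04), while the third stays missing because the benchmark has no value at that position, which is the situation the Returns note describes.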

proxy_using_value(proxy_value: float) SimpleReturnsDataFrame[source]

Removes columns from the DataFrame which have too many missing values. The missing data in the remaining columns is then filled with the given proxy_value.

Parameters:

proxy_value (float) – value with which all the missing data should be filled

Returns:

The completed DataFrame, with no missing data.

Return type:

SimpleReturnsDataFrame
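
The two steps of this method (drop sparse columns, then fill the remaining gaps with a constant) can be sketched in plain Python as follows. This is an illustrative approximation rather than qf_lib code: the function name `clean_with_value`, the dict-of-lists layout, and NaN as the missing-value marker are assumptions.

```python
import math

def clean_with_value(columns, threshold, proxy_value):
    """Drop columns with too many NaNs, then replace remaining NaNs with proxy_value.

    Hypothetical helper for illustration; `columns` maps a column name to a list of floats.
    """
    result = {}
    for name, values in columns.items():
        if not values:
            continue  # assumption: empty columns are dropped
        missing_share = sum(1 for v in values if math.isnan(v)) / len(values)
        if missing_share > threshold:
            continue  # too many gaps: the column is removed
        result[name] = [proxy_value if math.isnan(v) else v for v in values]
    return result

nan = math.nan
returns = {
    "A": [0.01, -0.02, 0.00, 0.03, 0.01],  # complete
    "B": [0.02, nan, 0.01, -0.01, 0.00],   # 20% missing, kept and filled
    "C": [nan, nan, nan, 0.01, 0.02],      # 60% missing, removed
}

completed = clean_with_value(returns, threshold=0.25, proxy_value=0.0)
print(completed["B"])  # [0.02, 0.0, 0.01, -0.01, 0.0]
```

Because every remaining gap receives the same constant, the result contains no missing values, unlike the regression-based variant.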