src.fairreckitlib.data.set.dataset

This module contains a dataset definition for accessing a dataset and related data tables.

Classes:

Dataset: class wrapper of the user events, user-item matrices and related tables.

Functions:

add_dataset_columns: add columns from the dataset matrix/user/item tables to a dataframe.

This program has been developed by students from the bachelor Computer Science at Utrecht University within the Software Project course. © Copyright Utrecht University (Department of Information and Computing Sciences)
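The Dataset class defined below expects each user-item matrix in a CSR-compatible layout with 'user', 'item' and 'rating' fields (plus an optional 'timestamp'). A minimal sketch of such a matrix as a pandas DataFrame, using hypothetical data:

```python
import pandas as pd

# Hypothetical standardized matrix: 'user' and 'item' IDs are contiguous
# integers starting at 0, and ratings are positive floats.
matrix = pd.DataFrame({
    'user':   [0, 0, 1, 2],
    'item':   [0, 1, 1, 2],
    'rating': [4.0, 1.0, 5.0, 3.0],
})

# CSR compatibility: IDs must range from 0 to the number of unique users/items.
assert matrix['user'].max() == matrix['user'].nunique() - 1
assert matrix['item'].max() == matrix['item'].nunique() - 1
assert (matrix['rating'] > 0).all()
```

A matrix in this form can be handed directly to sparse-matrix constructors that take (data, (row, col)) triples.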

"""This module contains a dataset definition for accessing a dataset and related data tables.

Classes:

    Dataset: class wrapper of the user events, user-item matrices and related tables.

Functions:

    add_dataset_columns: add columns from the dataset matrix/user/item tables to a dataframe.

This program has been developed by students from the bachelor Computer Science at
Utrecht University within the Software Project course.
© Copyright Utrecht University (Department of Information and Computing Sciences)
"""

import os
from typing import Any, Dict, List, Optional, Union

import pandas as pd

from .dataset_config import DatasetConfig, DatasetMatrixConfig, DatasetTableConfig


class Dataset:
    """Wrapper class for a FairRecKit dataset.

    A dataset is used for carrying out recommender system experiments.
    Each dataset has a strong affinity with a database structure consisting of
    multiple tables.
    The standardized matrix is a pandas.DataFrame stored in a '.tsv' file.
    The (derived sparse) matrix is used in experiments and needs to be
    in a CSR-compatible format, meaning three fields:

    1) 'user': IDs range from 0 to the number of unique users.
    2) 'item': IDs range from 0 to the number of unique items. An item can be
        various things (e.g. an artist, an album, a track or a movie).
    3) 'rating': floating-point data describing the rating a user has given an item.
        There are two types of ratings, namely explicit and implicit, and both
        are expected to be greater than zero.

    The matrix has one optional field:

    4) 'timestamp': when present, it can be used to split the matrix on a temporal basis.

    A dataset has two main tables that are connected to the 'user' and 'item' fields.
    Indirection arrays are available when user and/or item IDs do not match up in
    their corresponding tables. These two tables can be used in an experiment to
    filter rows based on various table column criteria.
    Any additional tables can be added for accessibility/compatibility with the FRK
    recommender system.

    Public methods:

    get_available_columns
    get_available_event_tables
    get_available_matrices
    get_available_tables
    get_matrices_info
    get_matrix_config
    get_matrix_file_path
    get_name
    get_table_config
    get_table_info
    load_item_indices
    load_matrix
    load_user_indices
    read_matrix
    read_table
    resolve_item_ids
    resolve_user_ids
    """

    def __init__(self, data_dir: str, config: DatasetConfig):
        """Construct the dataset.

        Args:
            data_dir: directory where the dataset is stored.
            config: the dataset configuration.
        """
        if not os.path.isdir(data_dir):
            raise IOError('Unknown dataset directory: ' + data_dir)

        self.data_dir = data_dir
        self.config = config

    def get_available_columns(self, matrix_name: str) -> Dict[str, List[str]]:
        """Get the available table column names of this dataset.

        Args:
            matrix_name: the name of the matrix to get the available columns of.

        Returns:
            a dictionary with table names as keys and column names as values.
        """
        return self.config.get_available_columns(matrix_name)

    def get_available_event_tables(self) -> List[str]:
        """Get the available event table names in the dataset.

        Returns:
            a list of event table names.
        """
        return list(self.config.events.keys())

    def get_available_matrices(self) -> List[str]:
        """Get the available matrix names in the dataset.

        Returns:
            a list of matrix names.
        """
        return list(self.config.matrices.keys())

    def get_available_tables(self) -> List[str]:
        """Get the available table names in the dataset.

        Returns:
            a list of table names.
        """
        return list(self.config.tables.keys())

    def get_matrices_info(self) -> Dict[str, Any]:
        """Get the information on the dataset's available matrices.

        Returns:
            a dictionary containing the matrices' information keyed by matrix name.
        """
        return {
            matrix_name: matrix_config.to_yml_format()
            for matrix_name, matrix_config in self.config.matrices.items()
        }

    def get_matrix_config(self, matrix_name: str) -> Optional[DatasetMatrixConfig]:
        """Get the configuration of a dataset's matrix.

        Args:
            matrix_name: the name of the matrix to get the configuration of.

        Returns:
            the configuration of the matrix or None when not available.
        """
        return self.config.matrices.get(matrix_name)

    def get_matrix_file_path(self, matrix_name: str) -> Optional[str]:
        """Get the file path where the matrix with the specified name is stored.

        Args:
            matrix_name: the name of the matrix to get the file path of.

        Returns:
            the path of the dataset's matrix file or None when not available.
        """
        if matrix_name not in self.config.matrices:
            return None

        return os.path.join(
            self.data_dir,
            self.config.matrices[matrix_name].table.file.name
        )

    def get_name(self) -> str:
        """Get the name of the dataset.

        Returns:
            the dataset name.
        """
        return self.config.dataset_name

    def get_table_config(self, table_name: str) -> Optional[DatasetTableConfig]:
        """Get the configuration of the dataset table with the specified name.

        Args:
            table_name: name of the table to retrieve the configuration of.

        Returns:
            the table configuration or None when not available.
        """
        return self.config.tables.get(table_name)

    def get_table_info(self) -> Dict[str, Any]:
        """Get the information on the dataset's available tables.

        Returns:
            a dictionary containing the table information keyed by table name.
        """
        return {
            table_name: table_config.to_yml_format()
            for table_name, table_config in self.config.tables.items()
        }

    def load_matrix(self, matrix_name: str) -> Optional[pd.DataFrame]:
        """Load the standardized user-item matrix of the dataset.

        Args:
            matrix_name: the name of the matrix to load.

        Returns:
            the loaded user-item matrix or None when not available.
        """
        matrix_config = self.get_matrix_config(matrix_name)
        if matrix_config is None:
            return None

        return matrix_config.load_matrix(self.data_dir)

    def load_item_indices(self, matrix_name: str) -> Optional[List[int]]:
        """Load the item indices.

        Optional indirection array for the item IDs that do not match up with
        the corresponding data table.

        Args:
            matrix_name: the name of the matrix to load the item indices of.

        Raises:
            KeyError: when the matrix with the specified name does not exist.

        Returns:
            the indirection array or None when not needed.
        """
        matrix_config = self.get_matrix_config(matrix_name)
        if matrix_config is None:
            raise KeyError('Unknown matrix configuration to load item indices from')

        return matrix_config.item.load_indices(self.data_dir)

    def load_user_indices(self, matrix_name: str) -> Optional[List[int]]:
        """Load the user indices.

        Optional indirection array for the user IDs that do not match up with
        the corresponding data table.

        Args:
            matrix_name: the name of the matrix to load the user indices of.

        Raises:
            KeyError: when the matrix with the specified name does not exist.

        Returns:
            the indirection array or None when not needed.
        """
        matrix_config = self.get_matrix_config(matrix_name)
        if matrix_config is None:
            raise KeyError('Unknown matrix configuration to load user indices from')

        return matrix_config.user.load_indices(self.data_dir)

    def read_matrix(
            self,
            matrix_name: str,
            columns: Optional[List[Union[int, str]]] = None,
            chunk_size: Optional[int] = None) -> Optional[pd.DataFrame]:
        """Read the matrix with the specified name from the dataset.

        Args:
            matrix_name: the name of the matrix to load.
            columns: subset list of columns to load or None to load all.
                All elements must either be integer indices or
                strings that correspond to one of the available table columns.
            chunk_size: reads the matrix in chunks as an iterator or
                the entire table when None.

        Returns:
            the resulting matrix dataframe (iterator) or None when not available.
        """
        matrix_config = self.get_matrix_config(matrix_name)
        if matrix_config is None:
            return None

        table_config = matrix_config.table
        return table_config.read_table(self.data_dir, columns=columns, chunk_size=chunk_size)

    def read_table(
            self,
            table_name: str,
            columns: Optional[List[Union[int, str]]] = None,
            chunk_size: Optional[int] = None) -> Optional[pd.DataFrame]:
        """Read the table with the specified name from the dataset.

        Args:
            table_name: name of the table to read.
            columns: subset list of columns to load or None to load all.
                All elements must either be integer indices or
                strings that correspond to one of the available table columns.
            chunk_size: reads the table in chunks as an iterator or
                the entire table when None.

        Returns:
            the resulting table dataframe (iterator) or None when not available.
        """
        table_config = self.get_table_config(table_name)
        if table_config is None:
            return None

        return table_config.read_table(self.data_dir, columns=columns, chunk_size=chunk_size)

    def resolve_item_ids(
            self,
            matrix_name: str,
            items: Union[int, List[int]]) -> Union[int, List[int]]:
        """Resolve the specified item ID(s).

        The item ID(s) of a dataset need to be resolved when it contains
        an indirection array; otherwise the ID(s) are returned unchanged.

        Args:
            matrix_name: the name of the matrix to resolve the item indices of.
            items: source ID(s) to convert.

        Raises:
            KeyError: when the matrix with the specified name does not exist.

        Returns:
            the resolved item ID(s).
        """
        item_indices = self.load_item_indices(matrix_name)
        if item_indices is None:
            return items

        return item_indices[items]

    def resolve_user_ids(
            self,
            matrix_name: str,
            users: Union[int, List[int]]) -> Union[int, List[int]]:
        """Resolve the specified user ID(s).

        The user ID(s) of a dataset need to be resolved when it contains
        an indirection array; otherwise the ID(s) are returned unchanged.

        Args:
            matrix_name: the name of the matrix to resolve the user indices of.
            users: source ID(s) to convert.

        Raises:
            KeyError: when the matrix with the specified name does not exist.

        Returns:
            the resolved user ID(s).
        """
        user_indices = self.load_user_indices(matrix_name)
        if user_indices is None:
            return users

        return user_indices[users]


def add_dataset_columns(
        dataset: Dataset,
        matrix_name: str,
        dataframe: pd.DataFrame,
        column_names: List[str]) -> pd.DataFrame:
    """Add the specified columns from the dataset to the dataframe.

    Args:
        dataset: the dataset related to the dataframe.
        matrix_name: the name of the dataset matrix.
        dataframe: dataframe with at least the 'user' and/or 'item' columns.
        column_names: a list of strings to indicate which
            user and/or item columns need to be added. Any values that are not
            present in the dataset tables are ignored.

    Returns:
        the resulting dataframe with the added columns that exist in the dataset.
    """
    matrix_config = dataset.get_matrix_config(matrix_name)
    user_key = matrix_config.user.key
    item_key = matrix_config.item.key
    user_item_key = [user_key, item_key]

    for table_name, table_columns in dataset.get_available_columns(matrix_name).items():
        columns = [c for c in column_names if c in table_columns]
        # skip a table that does not contain any of the needed columns
        if len(columns) == 0:
            continue

        table_config = dataset.get_table_config(table_name)

        # add matrix columns
        if table_name == 'matrix':
            dataframe = pd.merge(
                dataframe,
                dataset.read_matrix(matrix_name, columns=matrix_config.table.primary_key + columns),
                how='left',
                left_on=['user', 'item'],
                right_on=matrix_config.table.primary_key
            )
            dataframe.drop(matrix_config.table.primary_key, inplace=True, axis=1)
        # add user columns
        elif table_config.primary_key == [user_key]:
            dataframe[user_key] = dataset.resolve_user_ids(matrix_name, dataframe['user'])
            dataframe = pd.merge(
                dataframe,
                dataset.read_table(table_name, columns=table_config.primary_key + columns),
                how='left',
                on=user_key
            )
            dataframe.drop(user_key, inplace=True, axis=1)
        # add item columns
        elif table_config.primary_key == [item_key]:
            dataframe[item_key] = dataset.resolve_item_ids(matrix_name, dataframe['item'])
            dataframe = pd.merge(
                dataframe,
                dataset.read_table(table_name, columns=table_config.primary_key + columns),
                how='left',
                on=item_key
            )
            dataframe.drop(item_key, inplace=True, axis=1)
        # add user-item columns
        elif table_config.primary_key == user_item_key:
            dataframe[user_key] = dataset.resolve_user_ids(matrix_name, dataframe['user'])
            dataframe[item_key] = dataset.resolve_item_ids(matrix_name, dataframe['item'])
            dataframe = pd.merge(
                dataframe,
                dataset.read_table(table_name, columns=table_config.primary_key + columns),
                how='left',
                on=user_item_key
            )
            dataframe.drop(user_item_key, inplace=True, axis=1)

    return dataframe
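The user and item branches of add_dataset_columns come down to the same pattern: attach the table's key column, left-merge on it, then drop the key again. A standalone sketch of that pattern with a hypothetical user table (plain pandas, no Dataset instance needed):

```python
import pandas as pd

# Matrix-style dataframe with the required 'user' and 'item' columns.
dataframe = pd.DataFrame({'user': [0, 1, 1], 'item': [10, 10, 11]})

# Hypothetical user table keyed by 'user_id', with one extra column to add.
user_table = pd.DataFrame({'user_id': [0, 1], 'gender': ['f', 'm']})

# Resolve matrix user IDs to table keys (identity here: no indirection array).
dataframe['user_id'] = dataframe['user']

# Left merge keeps every matrix row; afterwards the key column is dropped.
dataframe = pd.merge(dataframe, user_table, how='left', on='user_id')
dataframe.drop('user_id', inplace=True, axis=1)

print(list(dataframe.columns))  # ['user', 'item', 'gender']
```

The left merge guarantees the dataframe keeps its row count even for users missing from the table (their added columns become NaN).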

def get_matrices_info(self) -> Dict[str, Any]:
136    def get_matrices_info(self) -> Dict[str, Any]:
137        """Get the information on the dataset's available matrices.
138
139        Returns:
140            a dictionary containing the matrices' information keyed by matrix name.
141        """
142        info = {}
143
144        for matrix_name, matrix_config in self.config.matrices.items():
145            info[matrix_name] = matrix_config.to_yml_format()
146
147        return info

Get the information on the dataset's available matrices.

Returns: a dictionary containing the matrices' information keyed by matrix name.

def get_matrix_config( self, matrix_name: str) -> Optional[src.fairreckitlib.data.set.dataset_config.DatasetMatrixConfig]:
149    def get_matrix_config(self, matrix_name: str) -> Optional[DatasetMatrixConfig]:
150        """Get the configuration of a dataset's matrix.
151
152        Args:
153            matrix_name: the name of the matrix to get the configuration of.
154
155        Returns:
156            the configuration of the matrix or None when not available.
157        """
158        return self.config.matrices.get(matrix_name)

Get the configuration of a dataset's matrix.

Args: matrix_name: the name of the matrix to get the configuration of.

Returns: the configuration of the matrix or None when not available.

def get_matrix_file_path(self, matrix_name: str) -> Optional[str]:
160    def get_matrix_file_path(self, matrix_name: str) -> Optional[str]:
161        """Get the file path where the matrix with the specified name is stored.
162
163        Args:
164            matrix_name: the name of the matrix to get the file path of.
165
166        Returns:
167            the path of the dataset's matrix file or None when not available.
168        """
169        if matrix_name not in self.config.matrices:
170            return None
171
172        return os.path.join(
173            self.data_dir,
174            self.config.matrices[matrix_name].table.file.name
175        )

Get the file path where the matrix with the specified name is stored.

Args: matrix_name: the name of the matrix to get the file path of.

Returns: the path of the dataset's matrix file or None when not available.

def get_name(self) -> str:
177    def get_name(self) -> str:
178        """Get the name of the dataset.
179
180        Returns:
181            the dataset name.
182        """
183        return self.config.dataset_name

Get the name of the dataset.

Returns: the dataset name.

def get_table_config( self, table_name: str) -> Optional[src.fairreckitlib.data.set.dataset_config.DatasetTableConfig]:
185    def get_table_config(self, table_name: str) -> Optional[DatasetTableConfig]:
186        """Get the configuration of the dataset table with the specified name.
187
188        Args:
189            table_name: name of the table to retrieve the configuration of.
190
191        Returns:
192            the table configuration or None when not available.
193        """
194        return self.config.tables.get(table_name)

Get the configuration of the dataset table with the specified name.

Args: table_name: name of the table to retrieve the configuration of.

Returns: the table configuration or None when not available.

def get_table_info(self) -> Dict[str, Any]:
196    def get_table_info(self) -> Dict[str, Any]:
197        """Get the information on the dataset's available tables.
198
199        Returns:
200            a dictionary containing the table information keyed by table name.
201        """
202        info = {}
203
204        for table_name, table_config in self.config.tables.items():
205            info[table_name] = table_config.to_yml_format()
206
207        return info

Get the information on the dataset's available tables.

Returns: a dictionary containing the table information keyed by table name.

def load_matrix(self, matrix_name: str) -> Optional[pandas.core.frame.DataFrame]:
209    def load_matrix(self, matrix_name: str) -> Optional[pd.DataFrame]:
210        """Load the standardized user-item matrix of the dataset.
211
212        Args:
213            matrix_name: the name of the matrix to load.
214
215        Returns:
216            the loaded user-item matrix or None when not available.
217        """
218        matrix_config = self.get_matrix_config(matrix_name)
219        if matrix_config is None:
220            return None
221
222        return matrix_config.load_matrix(self.data_dir)

Load the standardized user-item matrix of the dataset.

Args: matrix_name: the name of the matrix to load.

Returns: the loaded user-item matrix or None when not available.

def load_item_indices(self, matrix_name: str) -> Optional[List[int]]:
224    def load_item_indices(self, matrix_name: str) -> Optional[List[int]]:
225        """Load the item indices.
226
227        Optional indirection array of the item IDs that do not match up in
228        the corresponding data table.
229
230        Args:
231            matrix_name: the name of the matrix to load the item indices of.
232
233        Raises:
234            KeyError: when the matrix with the specified name does not exist.
235
236        Returns:
237            the indirection array or None when not needed.
238        """
239        matrix_config = self.get_matrix_config(matrix_name)
240        if not matrix_config:
241            raise KeyError('Unknown matrix configuration to load item indices from')
242
243        return matrix_config.item.load_indices(self.data_dir)

Load the item indices.

Optional indirection array of the item IDs that do not match up in the corresponding data table.

Args: matrix_name: the name of the matrix to load the item indices of.

Raises: KeyError: when the matrix with the specified name does not exist.

Returns: the indirection array or None when not needed.

def load_user_indices(self, matrix_name: str) -> Optional[List[int]]:
245    def load_user_indices(self, matrix_name: str) -> Optional[List[int]]:
246        """Load the user indices.
247
248        Optional indirection array of the user IDs that do not match up in
249        the corresponding data table.
250
251        Args:
252            matrix_name: the name of the matrix to load the user indices of.
253
254        Raises:
255            KeyError: when the matrix with the specified name does not exist.
256
257        Returns:
258            the indirection array or None when not needed.
259        """
260        matrix_config = self.get_matrix_config(matrix_name)
261        if not matrix_config:
262            raise KeyError('Unknown matrix configuration to load user indices from')
263
264        return matrix_config.user.load_indices(self.data_dir)

Load the user indices.

Optional indirection array of the user IDs that do not match up in the corresponding data table.

Args: matrix_name: the name of the matrix to load the user indices of.

Raises: KeyError: when the matrix with the specified name does not exist.

Returns: the indirection array or None when not needed.

def read_matrix( self, matrix_name: str, columns: List[Union[str, int]] = None, chunk_size: int = None) -> Optional[pandas.core.frame.DataFrame]:
266    def read_matrix(
267            self,
268            matrix_name: str,
269            columns: List[Union[int,str]]=None,
270            chunk_size: int=None) -> Optional[pd.DataFrame]:
271        """Read the matrix with the specified name from the dataset.
272
273        Args:
274            matrix_name: the name of the matrix to load.
275            columns: subset list of columns to load or None to load all.
276                All elements must either be integer indices or
277                strings that correspond to one of the available table columns.
278            chunk_size: reads the matrix in chunks as an iterator or
279                the entire matrix when None.
280
281        Returns:
282            the resulting matrix dataframe (iterator) or None when not available.
283        """
284        matrix_config = self.get_matrix_config(matrix_name)
285        if matrix_config is None:
286            return None
287
288        table_config = matrix_config.table
289        return table_config.read_table(self.data_dir, columns=columns, chunk_size=chunk_size)

Read the matrix with the specified name from the dataset.

Args: matrix_name: the name of the matrix to load. columns: subset list of columns to load or None to load all. All elements must either be integer indices or strings that correspond to one of the available table columns. chunk_size: reads the matrix in chunks as an iterator or the entire matrix when None.

Returns: the resulting matrix dataframe (iterator) or None when not available.
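The chunk_size behaviour can be illustrated with pandas' own chunked reader, which is the likely mechanism underneath: a chunk size yields an iterator of DataFrames instead of a single frame. The in-memory '.tsv' below is a stand-in for a real dataset matrix file.

```python
import io
import pandas as pd

# A small in-memory '.tsv' stands in for a dataset matrix file.
tsv = "user\titem\trating\n0\t0\t4.0\n1\t0\t3.0\n2\t1\t5.0\n"

# With chunksize set, read_csv returns an iterator of DataFrames,
# each holding at most two rows here, rather than one full frame.
chunks = pd.read_csv(io.StringIO(tsv), sep='\t', chunksize=2)
total_rows = sum(len(chunk) for chunk in chunks)
```

Iterating in chunks keeps peak memory bounded when the matrix is too large to load at once.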

def read_table( self, table_name: str, columns: List[Union[str, int]] = None, chunk_size: int = None) -> Optional[pandas.core.frame.DataFrame]:
291    def read_table(
292            self,
293            table_name: str,
294            columns: List[Union[int,str]]=None,
295            chunk_size: int=None) -> Optional[pd.DataFrame]:
296        """Read the table with the specified name from the dataset.
297
298        Args:
299            table_name: name of the table to read.
300            columns: subset list of columns to load or None to load all.
301                All elements must either be integer indices or
302                strings that correspond to one of the available table columns.
303            chunk_size: reads the table in chunks as an iterator or
304                the entire table when None.
305
306        Returns:
307            the resulting table dataframe (iterator) or None when not available.
308        """
309        table_config = self.get_table_config(table_name)
310        if table_config is None:
311            return None
312
313        return table_config.read_table(self.data_dir, columns=columns, chunk_size=chunk_size)

Read the table with the specified name from the dataset.

Args: table_name: name of the table to read. columns: subset list of columns to load or None to load all. All elements must either be integer indices or strings that correspond to one of the available table columns. chunk_size: reads the table in chunks as an iterator or the entire table when None.

Returns: the resulting table dataframe (iterator) or None when not available.

def resolve_item_ids( self, matrix_name: str, items: Union[int, List[int]]) -> Union[int, List[int]]:
315    def resolve_item_ids(
316            self,
317            matrix_name: str,
318            items: Union[int,List[int]]) -> Union[int,List[int]]:
319        """Resolve the specified item ID(s).
320
321        The item ID(s) of a dataset need to be resolved when it contains
322        an indirection array, otherwise ID(s) are returned unchanged.
323
324        Args:
325            matrix_name: the name of the matrix to resolve the item indices of.
326            items: source ID(s) to convert.
327
328        Raises:
329            KeyError: when the matrix with the specified name does not exist.
330
331        Returns:
332            the resolved item ID(s).
333        """
334        item_indices = self.load_item_indices(matrix_name)
335        if item_indices is None:
336            return items
337
338        return item_indices[items]

Resolve the specified item ID(s).

The item ID(s) of a dataset need to be resolved when the dataset contains an indirection array; otherwise the ID(s) are returned unchanged.

Args: matrix_name: the name of the matrix to resolve the item indices of. items: source ID(s) to convert.

Raises: KeyError: when the matrix with the specified name does not exist.

Returns: the resolved item ID(s).

def resolve_user_ids( self, matrix_name: str, users: Union[int, List[int]]) -> Union[int, List[int]]:
340    def resolve_user_ids(
341            self,
342            matrix_name: str,
343            users: Union[int,List[int]]) -> Union[int,List[int]]:
344        """Resolve the specified user ID(s).
345
346        The user ID(s) of a dataset need to be resolved when it contains
347        an indirection array, otherwise ID(s) are returned unchanged.
348
349        Args:
350            matrix_name: the name of the matrix to resolve the user indices of.
351            users: source ID(s) to convert.
352
353        Raises:
354            KeyError: when the matrix with the specified name does not exist.
355
356        Returns:
357            the resolved user ID(s).
358        """
359        user_indices = self.load_user_indices(matrix_name)
360        if user_indices is None:
361            return users
362
363        return user_indices[users]

Resolve the specified user ID(s).

The user ID(s) of a dataset need to be resolved when the dataset contains an indirection array; otherwise the ID(s) are returned unchanged.

Args: matrix_name: the name of the matrix to resolve the user indices of. users: source ID(s) to convert.

Raises: KeyError: when the matrix with the specified name does not exist.

Returns: the resolved user ID(s).
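The indirection mechanism behind resolve_user_ids and resolve_item_ids can be sketched in isolation. The array below is hypothetical data: position i holds the original table ID of matrix user i, mirroring how load_user_indices may return None when no re-indexing was needed.

```python
import numpy as np

# Hypothetical indirection array: matrix user ID i maps to the
# original table ID user_indices[i].
user_indices = np.array([1042, 57, 3310])

def resolve(users, indices):
    """Return the original ID(s); unchanged when no indirection exists."""
    if indices is None:
        return users
    # NumPy fancy indexing handles a scalar and an array alike.
    return indices[users]

single = resolve(1, user_indices)
many = resolve(np.array([0, 2]), user_indices)
passthrough = resolve(5, None)
```

Using a NumPy array for the indirection makes the scalar and batched cases share one code path.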

def add_dataset_columns( dataset: src.fairreckitlib.data.set.dataset.Dataset, matrix_name: str, dataframe: pandas.core.frame.DataFrame, column_names: List[str]) -> pandas.core.frame.DataFrame:
366def add_dataset_columns(
367        dataset: Dataset,
368        matrix_name: str,
369        dataframe: pd.DataFrame,
370        column_names: List[str]) -> pd.DataFrame:
371    """Add the specified columns from the dataset to the dataframe.
372
373    Args:
374        dataset: the dataset related to the dataframe.
375        matrix_name: the name of the dataset matrix.
376        dataframe: with at least the 'user' and/or 'item' columns.
377        column_names: a list of strings to indicate which
378            user and/or item columns need to be added. Any values that are not
379            present in the dataset tables are ignored.
380
381    Returns:
382        the resulting dataframe with the added columns that exist in the dataset.
383    """
384    for table_name, table_columns in dataset.get_available_columns(matrix_name).items():
385        columns = [c for c in column_names if c in table_columns]
386        # skip table that does not contain any needed columns
387        if len(columns) == 0:
388            continue
389
390        matrix_config = dataset.get_matrix_config(matrix_name)
391        table_config = dataset.get_table_config(table_name)
392
393        user_key = matrix_config.user.key
394        item_key = matrix_config.item.key
395        user_item_key = [user_key, item_key]
396
397        # add matrix columns
398        if table_name == 'matrix':
399            dataframe = pd.merge(
400                dataframe,
401                dataset.read_matrix(matrix_name, columns=matrix_config.table.primary_key + columns),
402                how='left',
403                left_on=['user', 'item'],
404                right_on=matrix_config.table.primary_key
405            )
406            dataframe.drop(matrix_config.table.primary_key, inplace=True, axis=1)
407        # add user columns
408        elif table_config.primary_key == [user_key]:
409            dataframe[user_key] = dataset.resolve_user_ids(matrix_name, dataframe['user'])
410            dataframe = pd.merge(
411                dataframe,
412                dataset.read_table(table_name, columns=table_config.primary_key + columns),
413                how='left',
414                on=user_key
415            )
416            dataframe.drop(user_key, inplace=True, axis=1)
417        # add item columns
418        elif table_config.primary_key == [item_key]:
419            dataframe[item_key] = dataset.resolve_item_ids(matrix_name, dataframe['item'])
420            dataframe = pd.merge(
421                dataframe,
422                dataset.read_table(table_name, columns=table_config.primary_key + columns),
423                how='left',
424                on=item_key
425            )
426            dataframe.drop(item_key, inplace=True, axis=1)
427        # add user-item columns
428        elif table_config.primary_key == user_item_key:
429            dataframe[user_key] = dataset.resolve_user_ids(matrix_name, dataframe['user'])
430            dataframe[item_key] = dataset.resolve_item_ids(matrix_name, dataframe['item'])
431            dataframe = pd.merge(
432                dataframe,
433                dataset.read_table(table_name, columns=table_config.primary_key + columns),
434                how='left',
435                on=user_item_key
436            )
437            dataframe.drop(user_item_key, inplace=True, axis=1)
438
439    return dataframe

Add the specified columns from the dataset to the dataframe.

Args: dataset: the dataset related to the dataframe. matrix_name: the name of the dataset matrix. dataframe: with at least the 'user' and/or 'item' columns. column_names: a list of strings to indicate which user and/or item columns need to be added. Any values that are not present in the dataset tables are ignored.

Returns: the resulting dataframe with the added columns that exist in the dataset.
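The merge-then-drop pattern add_dataset_columns applies per table can be shown with plain pandas. The user table, its 'user_id' key, and the ID mapping below are hypothetical; in the real function the mapping comes from resolve_user_ids.

```python
import pandas as pd

# Hypothetical user table keyed on 'user_id', standing in for a
# dataset table whose primary key is the user key.
user_table = pd.DataFrame({
    'user_id': [10, 11],
    'gender': ['f', 'm'],
})
dataframe = pd.DataFrame({'user': [0, 1, 0], 'item': [5, 6, 7]})

# Resolve matrix user IDs to table IDs (indirection assumed here).
dataframe['user_id'] = dataframe['user'].map({0: 10, 1: 11})

# A left merge keeps every matrix row even without a table match,
# after which the temporary join key is dropped again.
dataframe = pd.merge(dataframe, user_table, how='left', on='user_id')
dataframe = dataframe.drop('user_id', axis=1)
```

The left merge is what guarantees the dataframe keeps its original row count; rows without a table entry simply get NaN in the added columns.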