src.fairreckitlib.data.set.dataset

This module contains a dataset definition for accessing a dataset and related data tables.

Classes:

Dataset: class wrapper of the user events, user-item matrices and related tables.

Functions:

add_dataset_columns: add columns from the dataset matrix/user/item tables to a dataframe.

This program has been developed by students from the bachelor Computer Science at Utrecht University within the Software Project course. © Copyright Utrecht University (Department of Information and Computing Sciences)
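The Dataset class defined below expects each user-item matrix in a CSR-compatible layout with 'user', 'item' and 'rating' fields (plus an optional 'timestamp'). A minimal sketch of such a matrix as a pandas DataFrame, using hypothetical data:

```python
import pandas as pd

# Hypothetical standardized matrix: 'user' and 'item' IDs are contiguous
# integers starting at 0, and ratings are positive floats.
matrix = pd.DataFrame({
    'user':   [0, 0, 1, 2],
    'item':   [0, 1, 1, 2],
    'rating': [4.0, 1.0, 5.0, 3.0],
})

# CSR compatibility: IDs must range from 0 to the number of unique users/items.
assert matrix['user'].max() == matrix['user'].nunique() - 1
assert matrix['item'].max() == matrix['item'].nunique() - 1
assert (matrix['rating'] > 0).all()
```

A matrix in this form can be handed directly to sparse-matrix constructors that take (data, (row, col)) triples.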

"""This module contains a dataset definition for accessing a dataset and related data tables.

Classes:

    Dataset: class wrapper of the user events, user-item matrices and related tables.

Functions:

    add_dataset_columns: add columns from the dataset matrix/user/item tables to a dataframe.

This program has been developed by students from the bachelor Computer Science at
Utrecht University within the Software Project course.
© Copyright Utrecht University (Department of Information and Computing Sciences)
"""

import os
from typing import Any, Dict, List, Optional, Union

import pandas as pd

from .dataset_config import DatasetConfig, DatasetMatrixConfig, DatasetTableConfig


class Dataset:
    """Wrapper class for a FairRecKit dataset.

    A dataset is used for carrying out recommender system experiments.
    Each dataset has a strong affinity with a database structure consisting of
    multiple tables.
    The standardized matrix is a pandas.DataFrame stored in a '.tsv' file.
    The (derived sparse) matrix is used in experiments and needs to be
    in a CSR-compatible format, meaning three fields:

    1) 'user': IDs range from 0 to the number of unique users.
    2) 'item': IDs range from 0 to the number of unique items. An item can be
        various things (e.g. an artist, an album, a track or a movie).
    3) 'rating': floating-point data describing the rating a user has given an item.
        There are two types of ratings, namely explicit and implicit, and both
        are expected to be greater than zero.

    The matrix has one optional field:

    4) 'timestamp': when present, it can be used to split the matrix on a temporal basis.

    A dataset has two main tables that are connected to the 'user' and 'item' fields.
    Indirection arrays are available when user and/or item IDs do not match up in
    their corresponding tables. These two tables can be used in an experiment to
    filter rows based on various table column criteria.
    Any additional tables can be added for accessibility/compatibility with the FRK
    recommender system.

    Public methods:

    get_available_columns
    get_available_event_tables
    get_available_matrices
    get_available_tables
    get_matrices_info
    get_matrix_config
    get_matrix_file_path
    get_name
    get_table_config
    get_table_info
    load_item_indices
    load_matrix
    load_user_indices
    read_matrix
    read_table
    resolve_item_ids
    resolve_user_ids
    """

    def __init__(self, data_dir: str, config: DatasetConfig):
        """Construct the dataset.

        Args:
            data_dir: directory where the dataset is stored.
            config: the dataset configuration.
        """
        if not os.path.isdir(data_dir):
            raise IOError('Unknown dataset directory: ' + data_dir)

        self.data_dir = data_dir
        self.config = config

    def get_available_columns(self, matrix_name: str) -> Dict[str, List[str]]:
        """Get the available table column names of this dataset.

        Args:
            matrix_name: the name of the matrix to get the available columns of.

        Returns:
            a dictionary with table names as keys and column names as values.
        """
        return self.config.get_available_columns(matrix_name)

    def get_available_event_tables(self) -> List[str]:
        """Get the available event table names in the dataset.

        Returns:
            a list of event table names.
        """
        return list(self.config.events.keys())

    def get_available_matrices(self) -> List[str]:
        """Get the available matrix names in the dataset.

        Returns:
            a list of matrix names.
        """
        return list(self.config.matrices.keys())

    def get_available_tables(self) -> List[str]:
        """Get the available table names in the dataset.

        Returns:
            a list of table names.
        """
        return list(self.config.tables.keys())

    def get_matrices_info(self) -> Dict[str, Any]:
        """Get the information on the dataset's available matrices.

        Returns:
            a dictionary containing the matrices' information keyed by matrix name.
        """
        return {
            matrix_name: matrix_config.to_yml_format()
            for matrix_name, matrix_config in self.config.matrices.items()
        }

    def get_matrix_config(self, matrix_name: str) -> Optional[DatasetMatrixConfig]:
        """Get the configuration of a dataset's matrix.

        Args:
            matrix_name: the name of the matrix to get the configuration of.

        Returns:
            the configuration of the matrix or None when not available.
        """
        return self.config.matrices.get(matrix_name)

    def get_matrix_file_path(self, matrix_name: str) -> Optional[str]:
        """Get the file path where the matrix with the specified name is stored.

        Args:
            matrix_name: the name of the matrix to get the file path of.

        Returns:
            the path of the dataset's matrix file or None when not available.
        """
        if matrix_name not in self.config.matrices:
            return None

        return os.path.join(
            self.data_dir,
            self.config.matrices[matrix_name].table.file.name
        )

    def get_name(self) -> str:
        """Get the name of the dataset.

        Returns:
            the dataset name.
        """
        return self.config.dataset_name

    def get_table_config(self, table_name: str) -> Optional[DatasetTableConfig]:
        """Get the configuration of the dataset table with the specified name.

        Args:
            table_name: name of the table to retrieve the configuration of.

        Returns:
            the table configuration or None when not available.
        """
        return self.config.tables.get(table_name)

    def get_table_info(self) -> Dict[str, Any]:
        """Get the information on the dataset's available tables.

        Returns:
            a dictionary containing the table information keyed by table name.
        """
        return {
            table_name: table_config.to_yml_format()
            for table_name, table_config in self.config.tables.items()
        }

    def load_matrix(self, matrix_name: str) -> Optional[pd.DataFrame]:
        """Load the standardized user-item matrix of the dataset.

        Args:
            matrix_name: the name of the matrix to load.

        Returns:
            the loaded user-item matrix or None when not available.
        """
        matrix_config = self.get_matrix_config(matrix_name)
        if matrix_config is None:
            return None

        return matrix_config.load_matrix(self.data_dir)

    def load_item_indices(self, matrix_name: str) -> Optional[List[int]]:
        """Load the item indices.

        Optional indirection array for the item IDs that do not match up with
        the corresponding data table.

        Args:
            matrix_name: the name of the matrix to load the item indices of.

        Raises:
            KeyError: when the matrix with the specified name does not exist.

        Returns:
            the indirection array or None when not needed.
        """
        matrix_config = self.get_matrix_config(matrix_name)
        if matrix_config is None:
            raise KeyError('Unknown matrix configuration to load item indices from')

        return matrix_config.item.load_indices(self.data_dir)

    def load_user_indices(self, matrix_name: str) -> Optional[List[int]]:
        """Load the user indices.

        Optional indirection array for the user IDs that do not match up with
        the corresponding data table.

        Args:
            matrix_name: the name of the matrix to load the user indices of.

        Raises:
            KeyError: when the matrix with the specified name does not exist.

        Returns:
            the indirection array or None when not needed.
        """
        matrix_config = self.get_matrix_config(matrix_name)
        if matrix_config is None:
            raise KeyError('Unknown matrix configuration to load user indices from')

        return matrix_config.user.load_indices(self.data_dir)

    def read_matrix(
            self,
            matrix_name: str,
            columns: Optional[List[Union[int, str]]] = None,
            chunk_size: Optional[int] = None) -> Optional[pd.DataFrame]:
        """Read the matrix with the specified name from the dataset.

        Args:
            matrix_name: the name of the matrix to load.
            columns: subset list of columns to load or None to load all.
                All elements must either be integer indices or
                strings that correspond to one of the available table columns.
            chunk_size: reads the matrix in chunks as an iterator or
                the entire table when None.

        Returns:
            the resulting matrix dataframe (iterator) or None when not available.
        """
        matrix_config = self.get_matrix_config(matrix_name)
        if matrix_config is None:
            return None

        table_config = matrix_config.table
        return table_config.read_table(self.data_dir, columns=columns, chunk_size=chunk_size)

    def read_table(
            self,
            table_name: str,
            columns: Optional[List[Union[int, str]]] = None,
            chunk_size: Optional[int] = None) -> Optional[pd.DataFrame]:
        """Read the table with the specified name from the dataset.

        Args:
            table_name: name of the table to read.
            columns: subset list of columns to load or None to load all.
                All elements must either be integer indices or
                strings that correspond to one of the available table columns.
            chunk_size: reads the table in chunks as an iterator or
                the entire table when None.

        Returns:
            the resulting table dataframe (iterator) or None when not available.
        """
        table_config = self.get_table_config(table_name)
        if table_config is None:
            return None

        return table_config.read_table(self.data_dir, columns=columns, chunk_size=chunk_size)

    def resolve_item_ids(
            self,
            matrix_name: str,
            items: Union[int, List[int]]) -> Union[int, List[int]]:
        """Resolve the specified item ID(s).

        The item ID(s) of a dataset need to be resolved when it contains
        an indirection array; otherwise the ID(s) are returned unchanged.

        Args:
            matrix_name: the name of the matrix to resolve the item indices of.
            items: source ID(s) to convert.

        Raises:
            KeyError: when the matrix with the specified name does not exist.

        Returns:
            the resolved item ID(s).
        """
        item_indices = self.load_item_indices(matrix_name)
        if item_indices is None:
            return items

        return item_indices[items]

    def resolve_user_ids(
            self,
            matrix_name: str,
            users: Union[int, List[int]]) -> Union[int, List[int]]:
        """Resolve the specified user ID(s).

        The user ID(s) of a dataset need to be resolved when it contains
        an indirection array; otherwise the ID(s) are returned unchanged.

        Args:
            matrix_name: the name of the matrix to resolve the user indices of.
            users: source ID(s) to convert.

        Raises:
            KeyError: when the matrix with the specified name does not exist.

        Returns:
            the resolved user ID(s).
        """
        user_indices = self.load_user_indices(matrix_name)
        if user_indices is None:
            return users

        return user_indices[users]


def add_dataset_columns(
        dataset: Dataset,
        matrix_name: str,
        dataframe: pd.DataFrame,
        column_names: List[str]) -> pd.DataFrame:
    """Add the specified columns from the dataset to the dataframe.

    Args:
        dataset: the dataset related to the dataframe.
        matrix_name: the name of the dataset matrix.
        dataframe: dataframe with at least the 'user' and/or 'item' columns.
        column_names: a list of strings to indicate which
            user and/or item columns need to be added. Any values that are not
            present in the dataset tables are ignored.

    Returns:
        the resulting dataframe with the added columns that exist in the dataset.
    """
    matrix_config = dataset.get_matrix_config(matrix_name)
    user_key = matrix_config.user.key
    item_key = matrix_config.item.key
    user_item_key = [user_key, item_key]

    for table_name, table_columns in dataset.get_available_columns(matrix_name).items():
        columns = [c for c in column_names if c in table_columns]
        # skip a table that does not contain any of the needed columns
        if len(columns) == 0:
            continue

        table_config = dataset.get_table_config(table_name)

        # add matrix columns
        if table_name == 'matrix':
            dataframe = pd.merge(
                dataframe,
                dataset.read_matrix(matrix_name, columns=matrix_config.table.primary_key + columns),
                how='left',
                left_on=['user', 'item'],
                right_on=matrix_config.table.primary_key
            )
            dataframe.drop(matrix_config.table.primary_key, inplace=True, axis=1)
        # add user columns
        elif table_config.primary_key == [user_key]:
            dataframe[user_key] = dataset.resolve_user_ids(matrix_name, dataframe['user'])
            dataframe = pd.merge(
                dataframe,
                dataset.read_table(table_name, columns=table_config.primary_key + columns),
                how='left',
                on=user_key
            )
            dataframe.drop(user_key, inplace=True, axis=1)
        # add item columns
        elif table_config.primary_key == [item_key]:
            dataframe[item_key] = dataset.resolve_item_ids(matrix_name, dataframe['item'])
            dataframe = pd.merge(
                dataframe,
                dataset.read_table(table_name, columns=table_config.primary_key + columns),
                how='left',
                on=item_key
            )
            dataframe.drop(item_key, inplace=True, axis=1)
        # add user-item columns
        elif table_config.primary_key == user_item_key:
            dataframe[user_key] = dataset.resolve_user_ids(matrix_name, dataframe['user'])
            dataframe[item_key] = dataset.resolve_item_ids(matrix_name, dataframe['item'])
            dataframe = pd.merge(
                dataframe,
                dataset.read_table(table_name, columns=table_config.primary_key + columns),
                how='left',
                on=user_item_key
            )
            dataframe.drop(user_item_key, inplace=True, axis=1)

    return dataframe
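The user and item branches of add_dataset_columns come down to the same pattern: attach the table's key column, left-merge on it, then drop the key again. A standalone sketch of that pattern with a hypothetical user table (plain pandas, no Dataset instance needed):

```python
import pandas as pd

# Matrix-style dataframe with the required 'user' and 'item' columns.
dataframe = pd.DataFrame({'user': [0, 1, 1], 'item': [10, 10, 11]})

# Hypothetical user table keyed by 'user_id', with one extra column to add.
user_table = pd.DataFrame({'user_id': [0, 1], 'gender': ['f', 'm']})

# Resolve matrix user IDs to table keys (identity here: no indirection array).
dataframe['user_id'] = dataframe['user']

# Left merge keeps every matrix row; afterwards the key column is dropped.
dataframe = pd.merge(dataframe, user_table, how='left', on='user_id')
dataframe.drop('user_id', inplace=True, axis=1)

print(list(dataframe.columns))  # ['user', 'item', 'gender']
```

The left merge guarantees the dataframe keeps its row count even for users missing from the table (their added columns become NaN).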

def get_matrices_info(self) -> Dict[str, Any]:
136    def get_matrices_info(self) -> Dict[str, Any]:
137        """Get the information on the dataset's available matrices.
138
139        Returns:
140            a dictionary containing the matrices' information keyed by matrix name.
141        """
142        info = {}
143
144        for matrix_name, matrix_config in self.config.matrices.items():
145            info[matrix_name] = matrix_config.to_yml_format()
146
147        return info

Get the information on the dataset's available matrices.

Returns: a dictionary containing the matrices' information keyed by matrix name.

def get_matrix_config( self, matrix_name: str) -> Optional[src.fairreckitlib.data.set.dataset_config.DatasetMatrixConfig]:
149    def get_matrix_config(self, matrix_name: str) -> Optional[DatasetMatrixConfig]:
150        """Get the configuration of a dataset's matrix.
151
152        Args:
153            matrix_name: the name of the matrix to get the configuration of.
154
155        Returns:
156            the configuration of the matrix or None when not available.
157        """
158        return self.config.matrices.get(matrix_name)

Get the configuration of a dataset's matrix.

Args: matrix_name: the name of the matrix to get the configuration of.

Returns: the configuration of the matrix or None when not available.

def get_matrix_file_path(self, matrix_name: str) -> Optional[str]:
160    def get_matrix_file_path(self, matrix_name: str) -> Optional[str]:
161        """Get the file path where the matrix with the specified name is stored.
162
163        Args:
164            matrix_name: the name of the matrix to get the file path of.
165
166        Returns:
167            the path of the dataset's matrix file or None when not available.
168        """
169        if matrix_name not in self.config.matrices:
170            return None
171
172        return os.path.join(
173            self.data_dir,
174            self.config.matrices[matrix_name].table.file.name
175        )

Get the file path where the matrix with the specified name is stored.

Args: matrix_name: the name of the matrix to get the file path of.

Returns: the path of the dataset's matrix file or None when not available.

def get_name(self) -> str:
177    def get_name(self) -> str:
178        """Get the name of the dataset.
179
180        Returns:
181            the dataset name.
182        """
183        return self.config.dataset_name

Get the name of the dataset.

Returns: the dataset name.

def get_table_config( self, table_name: str) -> Optional[src.fairreckitlib.data.set.dataset_config.DatasetTableConfig]:
185    def get_table_config(self, table_name: str) -> Optional[DatasetTableConfig]:
186        """Get the configuration of the dataset table with the specified name.
187
188        Args:
189            table_name: name of the table to retrieve the configuration of.
190
191        Returns:
192            the table configuration or None when not available.
193        """
194        return self.config.tables.get(table_name)

Get the configuration of the dataset table with the specified name.

Args: table_name: name of the table to retrieve the configuration of.

Returns: the table configuration or None when not available.

def get_table_info(self) -> Dict[str, Any]:
196    def get_table_info(self) -> Dict[str, Any]:
197        """Get the information on the dataset's available tables.
198
199        Returns:
200            a dictionary containing the table information keyed by table name.
201        """
202        info = {}
203
204        for table_name, table_config in self.config.tables.items():
205            info[table_name] = table_config.to_yml_format()
206
207        return info

Get the information on the dataset's available tables.

Returns: a dictionary containing the table information keyed by table name.

def load_matrix(self, matrix_name: str) -> Optional[pandas.core.frame.DataFrame]:
209    def load_matrix(self, matrix_name: str) -> Optional[pd.DataFrame]:
210        """Load the standardized user-item matrix of the dataset.
211
212        Args:
213            matrix_name: the name of the matrix to load.
214
215        Returns:
216            the loaded user-item matrix or None when not available.
217        """
218        matrix_config = self.get_matrix_config(matrix_name)
219        if matrix_config is None:
220            return None
221
222        return matrix_config.load_matrix(self.data_dir)

Load the standardized user-item matrix of the dataset.

Args: matrix_name: the name of the matrix to load.

Returns: the loaded user-item matrix or None when not available.

def load_item_indices(self, matrix_name: str) -> Optional[List[int]]:
224    def load_item_indices(self, matrix_name: str) -> Optional[List[int]]:
225        """Load the item indices.
226
227        Optional indirection array of the item IDs that do not match up in
228        the corresponding data table.
229
230        Args:
231            matrix_name: the name of the matrix to load the item indices of.
232
233        Raises:
234            KeyError: when the matrix with the specified name does not exist.
235
236        Returns:
237            the indirection array or None when not needed.
238        """
239        matrix_config = self.get_matrix_config(matrix_name)
240        if not matrix_config:
241            raise KeyError('Unknown matrix configuration to load item indices from')
242
243        return matrix_config.item.load_indices(self.data_dir)

Load the item indices.

Optional indirection array of the item IDs that do not match up in the corresponding data table.

Args: matrix_name: the name of the matrix to load the item indices of.

Raises: KeyError: when the matrix with the specified name does not exist.

Returns: the indirection array or None when not needed.

def load_user_indices(self, matrix_name: str) -> Optional[List[int]]:
245    def load_user_indices(self, matrix_name: str) -> Optional[List[int]]:
246        """Load the user indices.
247
248        Optional indirection array of the user IDs that do not match up in
249        the corresponding data table.
250
251        Args:
252            matrix_name: the name of the matrix to load the user indices of.
253
254        Raises:
255            KeyError: when the matrix with the specified name does not exist.
256
257        Returns:
258            the indirection array or None when not needed.
259        """
260        matrix_config = self.get_matrix_config(matrix_name)
261        if not matrix_config:
262            raise KeyError('Unknown matrix configuration to load user indices from')
263
264        return matrix_config.user.load_indices(self.data_dir)

Load the user indices.

Optional indirection array of the user IDs that do not match up in the corresponding data table.

Args: matrix_name: the name of the matrix to load the user indices of.

Raises: KeyError: when the matrix with the specified name does not exist.

Returns: the indirection array or None when not needed.

def read_matrix( self, matrix_name: str, columns: List[Union[str, int]] = None, chunk_size: int = None) -> Optional[pandas.core.frame.DataFrame]:
266    def read_matrix(
267            self,
268            matrix_name: str,
269            columns: List[Union[int,str]]=None,
270            chunk_size: int=None) -> Optional[pd.DataFrame]:
271        """Read the matrix with the specified name from the dataset.
272
273        Args:
274            matrix_name: the name of the matrix to load.
275            columns: subset list of columns to load or None to load all.
276                All elements must either be integer indices or
277                strings that correspond to one of the available table columns.
278            chunk_size: reads the matrix in chunks as an iterator or
279                the entire matrix when None.
280
281        Returns:
282            the resulting matrix dataframe (iterator) or None when not available.
283        """
284        matrix_config = self.get_matrix_config(matrix_name)
285        if matrix_config is None:
286            return None
287
288        table_config = matrix_config.table
289        return table_config.read_table(self.data_dir, columns=columns, chunk_size=chunk_size)

Read the matrix with the specified name from the dataset.

Args: matrix_name: the name of the matrix to load. columns: subset list of columns to load or None to load all. All elements must either be integer indices or strings that correspond to one of the available table columns. chunk_size: reads the matrix in chunks as an iterator or the entire matrix when None.

Returns: the resulting matrix dataframe (iterator) or None when not available.
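The chunk_size behaviour can be illustrated with pandas' own chunked reader, which is the likely mechanism underneath: a chunk size yields an iterator of DataFrames instead of a single frame. The in-memory '.tsv' below is a stand-in for a real dataset matrix file.

```python
import io
import pandas as pd

# A small in-memory '.tsv' stands in for a dataset matrix file.
tsv = "user\titem\trating\n0\t0\t4.0\n1\t0\t3.0\n2\t1\t5.0\n"

# With chunksize set, read_csv returns an iterator of DataFrames,
# each holding at most two rows here, rather than one full frame.
chunks = pd.read_csv(io.StringIO(tsv), sep='\t', chunksize=2)
total_rows = sum(len(chunk) for chunk in chunks)
```

Iterating in chunks keeps peak memory bounded when the matrix is too large to load at once.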

def read_table( self, table_name: str, columns: List[Union[str, int]] = None, chunk_size: int = None) -> Optional[pandas.core.frame.DataFrame]:
291    def read_table(
292            self,
293            table_name: str,
294            columns: List[Union[int,str]]=None,
295            chunk_size: int=None) -> Optional[pd.DataFrame]:
296        """Read the table with the specified name from the dataset.
297
298        Args:
299            table_name: name of the table to read.
300            columns: subset list of columns to load or None to load all.
301                All elements must either be integer indices or
302                strings that correspond to one of the available table columns.
303            chunk_size: reads the table in chunks as an iterator or
304                the entire table when None.
305
306        Returns:
307            the resulting table dataframe (iterator) or None when not available.
308        """
309        table_config = self.get_table_config(table_name)
310        if table_config is None:
311            return None
312
313        return table_config.read_table(self.data_dir, columns=columns, chunk_size=chunk_size)

Read the table with the specified name from the dataset.

Args: table_name: name of the table to read. columns: subset list of columns to load or None to load all. All elements must either be integer indices or strings that correspond to one of the available table columns. chunk_size: reads the table in chunks as an iterator or the entire table when None.

Returns: the resulting table dataframe (iterator) or None when not available.

def resolve_item_ids( self, matrix_name: str, items: Union[int, List[int]]) -> Union[int, List[int]]:
315    def resolve_item_ids(
316            self,
317            matrix_name: str,
318            items: Union[int,List[int]]) -> Union[int,List[int]]:
319        """Resolve the specified item ID(s).
320
321        The item ID(s) of a dataset need to be resolved when it contains
322        an indirection array, otherwise ID(s) are returned unchanged.
323
324        Args:
325            matrix_name: the name of the matrix to resolve the item indices of.
326            items: source ID(s) to convert.
327
328        Raises:
329            KeyError: when the matrix with the specified name does not exist.
330
331        Returns:
332            the resolved item ID(s).
333        """
334        item_indices = self.load_item_indices(matrix_name)
335        if item_indices is None:
336            return items
337
338        return item_indices[items]

Resolve the specified item ID(s).

The item ID(s) of a dataset need to be resolved when the dataset contains an indirection array; otherwise the ID(s) are returned unchanged.

Args: matrix_name: the name of the matrix to resolve the item indices of. items: source ID(s) to convert.

Raises: KeyError: when the matrix with the specified name does not exist.

Returns: the resolved item ID(s).

def resolve_user_ids( self, matrix_name: str, users: Union[int, List[int]]) -> Union[int, List[int]]:
340    def resolve_user_ids(
341            self,
342            matrix_name: str,
343            users: Union[int,List[int]]) -> Union[int,List[int]]:
344        """Resolve the specified user ID(s).
345
346        The user ID(s) of a dataset need to be resolved when it contains
347        an indirection array, otherwise ID(s) are returned unchanged.
348
349        Args:
350            matrix_name: the name of the matrix to resolve the user indices of.
351            users: source ID(s) to convert.
352
353        Raises:
354            KeyError: when the matrix with the specified name does not exist.
355
356        Returns:
357            the resolved user ID(s).
358        """
359        user_indices = self.load_user_indices(matrix_name)
360        if user_indices is None:
361            return users
362
363        return user_indices[users]

Resolve the specified user ID(s).

The user ID(s) of a dataset need to be resolved when the dataset contains an indirection array; otherwise the ID(s) are returned unchanged.

Args: matrix_name: the name of the matrix to resolve the user indices of. users: source ID(s) to convert.

Raises: KeyError: when the matrix with the specified name does not exist.

Returns: the resolved user ID(s).
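The indirection mechanism behind resolve_user_ids and resolve_item_ids can be sketched in isolation. The array below is hypothetical data: position i holds the original table ID of matrix user i, mirroring how load_user_indices may return None when no re-indexing was needed.

```python
import numpy as np

# Hypothetical indirection array: matrix user ID i maps to the
# original table ID user_indices[i].
user_indices = np.array([1042, 57, 3310])

def resolve(users, indices):
    """Return the original ID(s); unchanged when no indirection exists."""
    if indices is None:
        return users
    # NumPy fancy indexing handles a scalar and an array alike.
    return indices[users]

single = resolve(1, user_indices)
many = resolve(np.array([0, 2]), user_indices)
passthrough = resolve(5, None)
```

Using a NumPy array for the indirection makes the scalar and batched cases share one code path.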

def add_dataset_columns( dataset: src.fairreckitlib.data.set.dataset.Dataset, matrix_name: str, dataframe: pandas.core.frame.DataFrame, column_names: List[str]) -> pandas.core.frame.DataFrame:
366def add_dataset_columns(
367        dataset: Dataset,
368        matrix_name: str,
369        dataframe: pd.DataFrame,
370        column_names: List[str]) -> pd.DataFrame:
371    """Add the specified columns from the dataset to the dataframe.
372
373    Args:
374        dataset: the dataset related to the dataframe.
375        matrix_name: the name of the dataset matrix.
376        dataframe: with at least the 'user' and/or 'item' columns.
377        column_names: a list of strings to indicate which
378            user and/or item columns need to be added. Any values that are not
379            present in the dataset tables are ignored.
380
381    Returns:
382        the resulting dataframe with the added columns that exist in the dataset.
383    """
384    for table_name, table_columns in dataset.get_available_columns(matrix_name).items():
385        columns = [c for c in column_names if c in table_columns]
386        # skip table that does not contain any needed columns
387        if len(columns) == 0:
388            continue
389
390        matrix_config = dataset.get_matrix_config(matrix_name)
391        table_config = dataset.get_table_config(table_name)
392
393        user_key = matrix_config.user.key
394        item_key = matrix_config.item.key
395        user_item_key = [user_key, item_key]
396
397        # add matrix columns
398        if table_name == 'matrix':
399            dataframe = pd.merge(
400                dataframe,
401                dataset.read_matrix(matrix_name, columns=matrix_config.table.primary_key + columns),
402                how='left',
403                left_on=['user', 'item'],
404                right_on=matrix_config.table.primary_key
405            )
406            dataframe.drop(matrix_config.table.primary_key, inplace=True, axis=1)
407        # add user columns
408        elif table_config.primary_key == [user_key]:
409            dataframe[user_key] = dataset.resolve_user_ids(matrix_name, dataframe['user'])
410            dataframe = pd.merge(
411                dataframe,
412                dataset.read_table(table_name, columns=table_config.primary_key + columns),
413                how='left',
414                on=user_key
415            )
416            dataframe.drop(user_key, inplace=True, axis=1)
417        # add item columns
418        elif table_config.primary_key == [item_key]:
419            dataframe[item_key] = dataset.resolve_item_ids(matrix_name, dataframe['item'])
420            dataframe = pd.merge(
421                dataframe,
422                dataset.read_table(table_name, columns=table_config.primary_key + columns),
423                how='left',
424                on=item_key
425            )
426            dataframe.drop(item_key, inplace=True, axis=1)
427        # add user-item columns
428        elif table_config.primary_key == user_item_key:
429            dataframe[user_key] = dataset.resolve_user_ids(matrix_name, dataframe['user'])
430            dataframe[item_key] = dataset.resolve_item_ids(matrix_name, dataframe['item'])
431            dataframe = pd.merge(
432                dataframe,
433                dataset.read_table(table_name, columns=table_config.primary_key + columns),
434                how='left',
435                on=user_item_key
436            )
437            dataframe.drop(user_item_key, inplace=True, axis=1)
438
439    return dataframe

Add the specified columns from the dataset to the dataframe.

Args: dataset: the dataset related to the dataframe. matrix_name: the name of the dataset matrix. dataframe: with at least the 'user' and/or 'item' columns. column_names: a list of strings to indicate which user and/or item columns need to be added. Any values that are not present in the dataset tables are ignored.

Returns: the resulting dataframe with the added columns that exist in the dataset.
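The merge-then-drop pattern add_dataset_columns applies per table can be shown with plain pandas. The user table, its 'user_id' key, and the ID mapping below are hypothetical; in the real function the mapping comes from resolve_user_ids.

```python
import pandas as pd

# Hypothetical user table keyed on 'user_id', standing in for a
# dataset table whose primary key is the user key.
user_table = pd.DataFrame({
    'user_id': [10, 11],
    'gender': ['f', 'm'],
})
dataframe = pd.DataFrame({'user': [0, 1, 0], 'item': [5, 6, 7]})

# Resolve matrix user IDs to table IDs (indirection assumed here).
dataframe['user_id'] = dataframe['user'].map({0: 10, 1: 11})

# A left merge keeps every matrix row even without a table match,
# after which the temporary join key is dropped again.
dataframe = pd.merge(dataframe, user_table, how='left', on='user_id')
dataframe = dataframe.drop('user_id', axis=1)
```

The left merge is what guarantees the dataframe keeps its original row count; rows without a table entry simply get NaN in the added columns.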