src.fairreckitlib.data.set.dataset
This module contains a dataset definition for accessing a dataset and related data tables.
Classes:
Dataset: class wrapper of the user events, user-item matrices and related tables.
Functions:
add_dataset_columns: add columns from the dataset matrix/user/item tables to a dataframe.
This program has been developed by students from the bachelor Computer Science at Utrecht University within the Software Project course. © Copyright Utrecht University (Department of Information and Computing Sciences)
"""This module contains a dataset definition for accessing a dataset and related data tables.

Classes:

    Dataset: class wrapper of the user events, user-item matrices and related tables.

Functions:

    add_dataset_columns: add columns from the dataset matrix/user/item tables to a dataframe.

This program has been developed by students from the bachelor Computer Science at
Utrecht University within the Software Project course.
© Copyright Utrecht University (Department of Information and Computing Sciences)
"""

import os
from typing import Any, Dict, List, Optional, Union

import pandas as pd

from .dataset_config import DatasetConfig, DatasetMatrixConfig, DatasetTableConfig


class Dataset:
    """Wrapper class for a FairRecKit dataset.

    A dataset is used for carrying out recommender system experiments.
    Each dataset has a strong affinity with a database structure consisting of
    multiple tables.
    The standardized matrix is a pandas.DataFrame stored in a '.tsv' file.
    The (derived sparse) matrix is used in experiments and needs to be
    in a CSR-compatible format, meaning three fields:

    1) 'user': IDs range from 0 to the number of unique users.
    2) 'item': IDs range from 0 to the number of unique items. An item can be
       various things (e.g. an artist, an album, a track, a movie, etc.).
    3) 'rating': floating-point data describing the rating a user has given an item.
       There are two types of ratings, namely explicit and implicit, and both
       are expected to be greater than zero.

    The matrix has one optional field:

    4) 'timestamp': when present it can be used to split the matrix on a temporal basis.

    A dataset has two main tables that are connected to the 'user' and 'item' fields.
    Indirection arrays are available when user and/or item IDs do not match up with
    their corresponding tables. These two tables can be used in an experiment to
    filter any rows based on various table header criteria.
    Any additional tables can be added for accessibility/compatibility with the FRK
    recommender system.

    Public methods:

        get_available_columns
        get_available_event_tables
        get_available_matrices
        get_available_tables
        get_matrices_info
        get_matrix_config
        get_matrix_file_path
        get_name
        get_table_config
        get_table_info
        load_item_indices
        load_matrix
        load_user_indices
        read_matrix
        read_table
        resolve_item_ids
        resolve_user_ids
    """

    def __init__(self, data_dir: str, config: DatasetConfig):
        """Construct the dataset.

        Args:
            data_dir: directory where the dataset is stored.
            config: the dataset configuration.
        """
        if not os.path.isdir(data_dir):
            raise IOError('Unknown dataset directory: ' + data_dir)

        self.data_dir = data_dir
        self.config = config

    def get_available_columns(self, matrix_name: str) -> Dict[str, List[str]]:
        """Get the available table column names of this dataset.

        Args:
            matrix_name: the name of the matrix to get the available columns of.

        Returns:
            a dictionary with table names as keys and column names as values.
        """
        return self.config.get_available_columns(matrix_name)

    def get_available_event_tables(self) -> List[str]:
        """Get the available event table names in the dataset.

        Returns:
            a list of event table names.
        """
        event_table_names = []

        for table_name, _ in self.config.events.items():
            event_table_names.append(table_name)

        return event_table_names

    def get_available_matrices(self) -> List[str]:
        """Get the available matrix names in the dataset.

        Returns:
            a list of matrix names.
        """
        matrix_names = []

        for matrix_name, _ in self.config.matrices.items():
            matrix_names.append(matrix_name)

        return matrix_names

    def get_available_tables(self) -> List[str]:
        """Get the available table names in the dataset.

        Returns:
            a list of table names.
        """
        table_names = []

        for table_name, _ in self.config.tables.items():
            table_names.append(table_name)

        return table_names

    def get_matrices_info(self) -> Dict[str, Any]:
        """Get the information on the dataset's available matrices.

        Returns:
            a dictionary containing the matrices' information keyed by matrix name.
        """
        info = {}

        for matrix_name, matrix_config in self.config.matrices.items():
            info[matrix_name] = matrix_config.to_yml_format()

        return info

    def get_matrix_config(self, matrix_name: str) -> Optional[DatasetMatrixConfig]:
        """Get the configuration of a dataset's matrix.

        Args:
            matrix_name: the name of the matrix to get the configuration of.

        Returns:
            the configuration of the matrix or None when not available.
        """
        return self.config.matrices.get(matrix_name)

    def get_matrix_file_path(self, matrix_name: str) -> Optional[str]:
        """Get the file path where the matrix with the specified name is stored.

        Args:
            matrix_name: the name of the matrix to get the file path of.

        Returns:
            the path of the dataset's matrix file or None when not available.
        """
        if matrix_name not in self.config.matrices:
            return None

        return os.path.join(
            self.data_dir,
            self.config.matrices[matrix_name].table.file.name
        )

    def get_name(self) -> str:
        """Get the name of the dataset.

        Returns:
            the dataset name.
        """
        return self.config.dataset_name

    def get_table_config(self, table_name: str) -> Optional[DatasetTableConfig]:
        """Get the configuration of the dataset table with the specified name.

        Args:
            table_name: name of the table to retrieve the configuration of.

        Returns:
            the table configuration or None when not available.
        """
        return self.config.tables.get(table_name)

    def get_table_info(self) -> Dict[str, Any]:
        """Get the information on the dataset's available tables.

        Returns:
            a dictionary containing the table information keyed by table name.
        """
        info = {}

        for table_name, table_config in self.config.tables.items():
            info[table_name] = table_config.to_yml_format()

        return info

    def load_matrix(self, matrix_name: str) -> Optional[pd.DataFrame]:
        """Load the standardized user-item matrix of the dataset.

        Args:
            matrix_name: the name of the matrix to load.

        Returns:
            the loaded user-item matrix or None when not available.
        """
        matrix_config = self.get_matrix_config(matrix_name)
        if matrix_config is None:
            return None

        return matrix_config.load_matrix(self.data_dir)

    def load_item_indices(self, matrix_name: str) -> Optional[List[int]]:
        """Load the item indices.

        Optional indirection array for item IDs that do not match up with
        the corresponding data table.

        Args:
            matrix_name: the name of the matrix to load the item indices of.

        Raises:
            KeyError: when the matrix with the specified name does not exist.

        Returns:
            the indirection array or None when not needed.
        """
        matrix_config = self.get_matrix_config(matrix_name)
        if not matrix_config:
            raise KeyError('Unknown matrix configuration to load item indices from')

        return matrix_config.item.load_indices(self.data_dir)

    def load_user_indices(self, matrix_name: str) -> Optional[List[int]]:
        """Load the user indices.

        Optional indirection array for user IDs that do not match up with
        the corresponding data table.

        Args:
            matrix_name: the name of the matrix to load the user indices of.

        Raises:
            KeyError: when the matrix with the specified name does not exist.

        Returns:
            the indirection array or None when not needed.
        """
        matrix_config = self.get_matrix_config(matrix_name)
        if not matrix_config:
            raise KeyError('Unknown matrix configuration to load user indices from')

        return matrix_config.user.load_indices(self.data_dir)

    def read_matrix(
            self,
            matrix_name: str,
            columns: Optional[List[Union[int, str]]] = None,
            chunk_size: Optional[int] = None) -> Optional[pd.DataFrame]:
        """Read the matrix with the specified name from the dataset.

        Args:
            matrix_name: the name of the matrix to read.
            columns: subset list of columns to load, or None to load all.
                All elements must either be integer indices or strings that
                correspond to one of the available table columns.
            chunk_size: reads the matrix in chunks as an iterator,
                or the entire matrix when None.

        Returns:
            the resulting matrix dataframe (iterator) or None when not available.
        """
        matrix_config = self.get_matrix_config(matrix_name)
        if matrix_config is None:
            return None

        table_config = matrix_config.table
        return table_config.read_table(self.data_dir, columns=columns, chunk_size=chunk_size)

    def read_table(
            self,
            table_name: str,
            columns: Optional[List[Union[int, str]]] = None,
            chunk_size: Optional[int] = None) -> Optional[pd.DataFrame]:
        """Read the table with the specified name from the dataset.

        Args:
            table_name: name of the table to read.
            columns: subset list of columns to load, or None to load all.
                All elements must either be integer indices or strings that
                correspond to one of the available table columns.
            chunk_size: reads the table in chunks as an iterator,
                or the entire table when None.

        Returns:
            the resulting table dataframe (iterator) or None when not available.
        """
        table_config = self.get_table_config(table_name)
        if table_config is None:
            return None

        return table_config.read_table(self.data_dir, columns=columns, chunk_size=chunk_size)

    def resolve_item_ids(
            self,
            matrix_name: str,
            items: Union[int, List[int]]) -> Union[int, List[int]]:
        """Resolve the specified item ID(s).

        The item ID(s) of a dataset need to be resolved when it contains
        an indirection array; otherwise the ID(s) are returned unchanged.

        Args:
            matrix_name: the name of the matrix to resolve the item indices of.
            items: source ID(s) to convert.

        Raises:
            KeyError: when the matrix with the specified name does not exist.

        Returns:
            the resolved item ID(s).
        """
        item_indices = self.load_item_indices(matrix_name)
        if item_indices is None:
            return items

        return item_indices[items]

    def resolve_user_ids(
            self,
            matrix_name: str,
            users: Union[int, List[int]]) -> Union[int, List[int]]:
        """Resolve the specified user ID(s).

        The user ID(s) of a dataset need to be resolved when it contains
        an indirection array; otherwise the ID(s) are returned unchanged.

        Args:
            matrix_name: the name of the matrix to resolve the user indices of.
            users: source ID(s) to convert.

        Raises:
            KeyError: when the matrix with the specified name does not exist.

        Returns:
            the resolved user ID(s).
        """
        user_indices = self.load_user_indices(matrix_name)
        if user_indices is None:
            return users

        return user_indices[users]


def add_dataset_columns(
        dataset: Dataset,
        matrix_name: str,
        dataframe: pd.DataFrame,
        column_names: List[str]) -> pd.DataFrame:
    """Add the specified columns from the dataset to the dataframe.

    Args:
        dataset: the dataset related to the dataframe.
        matrix_name: the name of the dataset matrix.
        dataframe: dataframe with at least the 'user' and/or 'item' columns.
        column_names: a list of strings to indicate which user and/or item
            columns need to be added. Any values that are not present in
            the dataset tables are ignored.

    Returns:
        the resulting dataframe with the added columns that exist in the dataset.
    """
    for table_name, table_columns in dataset.get_available_columns(matrix_name).items():
        columns = [c for c in column_names if c in table_columns]
        # skip tables that do not contain any of the needed columns
        if len(columns) == 0:
            continue

        matrix_config = dataset.get_matrix_config(matrix_name)
        table_config = dataset.get_table_config(table_name)

        user_key = matrix_config.user.key
        item_key = matrix_config.item.key
        user_item_key = [user_key, item_key]

        # add matrix columns
        if table_name == 'matrix':
            dataframe = pd.merge(
                dataframe,
                dataset.read_matrix(matrix_name, columns=matrix_config.table.primary_key + columns),
                how='left',
                left_on=['user', 'item'],
                right_on=matrix_config.table.primary_key
            )
            dataframe.drop(matrix_config.table.primary_key, inplace=True, axis=1)
        # add user columns
        elif table_config.primary_key == [user_key]:
            dataframe[user_key] = dataset.resolve_user_ids(matrix_name, dataframe['user'])
            dataframe = pd.merge(
                dataframe,
                dataset.read_table(table_name, columns=table_config.primary_key + columns),
                how='left',
                on=user_key
            )
            dataframe.drop(user_key, inplace=True, axis=1)
        # add item columns
        elif table_config.primary_key == [item_key]:
            dataframe[item_key] = dataset.resolve_item_ids(matrix_name, dataframe['item'])
            dataframe = pd.merge(
                dataframe,
                dataset.read_table(table_name, columns=table_config.primary_key + columns),
                how='left',
                on=item_key
            )
            dataframe.drop(item_key, inplace=True, axis=1)
        # add user-item columns
        elif table_config.primary_key == user_item_key:
            dataframe[user_key] = dataset.resolve_user_ids(matrix_name, dataframe['user'])
            dataframe[item_key] = dataset.resolve_item_ids(matrix_name, dataframe['item'])
            dataframe = pd.merge(
                dataframe,
                dataset.read_table(table_name, columns=table_config.primary_key + columns),
                how='left',
                on=user_item_key
            )
            dataframe.drop(user_item_key, inplace=True, axis=1)

    return dataframe
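The standardized matrix format described in the class docstring can be illustrated with a small sketch. The column names and the tab-separated '.tsv' round trip follow the docstring; the concrete on-disk schema and file names are determined by the dataset configuration, and the example values are made up:

```python
import io

import pandas as pd

# A minimal standardized matrix as described in the class docstring:
# 'user' and 'item' IDs start at 0, 'rating' is a positive float, and
# 'timestamp' is optional.
matrix = pd.DataFrame({
    'user': [0, 0, 1, 2],
    'item': [0, 1, 1, 2],
    'rating': [4.0, 1.5, 3.0, 5.0],
    'timestamp': [1111, 1112, 1200, 1300],
})

# The standardized matrix is stored as a tab-separated '.tsv' file;
# an in-memory buffer stands in for the file here.
buffer = io.StringIO()
matrix.to_csv(buffer, sep='\t', index=False)
buffer.seek(0)
restored = pd.read_csv(buffer, sep='\t')
```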
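The indirection arrays behind `resolve_user_ids`/`resolve_item_ids` work by plain indexing (`item_indices[items]` in the source). A sketch with invented IDs, assuming the loaded indices support NumPy-style indexing so that both a single ID and a list of IDs can be resolved in one expression:

```python
import numpy as np

# Hypothetical indirection array: matrix item ID i maps to the
# table ID stored at item_indices[i].
item_indices = np.array([101, 205, 307, 409])

# Mirrors 'item_indices[items]' in Dataset.resolve_item_ids:
single = item_indices[2]         # a single matrix ID -> 307
several = item_indices[[0, 3]]   # a list of matrix IDs -> array([101, 409])
```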
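The merge pattern used by `add_dataset_columns` for the user-column branch can be sketched in isolation: resolve the matrix's user IDs to the table's key (identity here, i.e. no indirection array), left-merge the table on that key, then drop the key column again. The table and column names (`user_id`, `user_gender`) are hypothetical:

```python
import pandas as pd

# Frame with the 'user'/'item' columns expected by add_dataset_columns.
dataframe = pd.DataFrame({
    'user': [0, 1, 0],
    'item': [5, 6, 7],
    'rating': [3.0, 4.0, 2.0],
})

# Hypothetical user table keyed by 'user_id' with one extra column.
user_table = pd.DataFrame({'user_id': [0, 1], 'user_gender': ['f', 'm']})

# Resolve matrix user IDs to table keys (identity: no indirection array),
# left-merge on the key, then drop the key column.
dataframe['user_id'] = dataframe['user']
dataframe = pd.merge(dataframe, user_table, how='left', on='user_id')
dataframe.drop('user_id', inplace=True, axis=1)
```

The left join keeps every row of the original frame even when a user is missing from the table (the added column is then NaN), which matches the `how='left'` choice in the source.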
Wrapper class for a FairRecKit dataset.
A dataset is used for carrying out recommender system experiments. Each dataset has a strong affinity with a database structure consisting of multiple tables. The standardized matrix is a pandas.DataFrame stored in a '.tsv' file. The (derived sparse) matrix is used in experiments and needs to be in a CSR compatible format, meaning three fields:
1) 'user': IDs range from 0 to the amount of unique users. 2) 'item': IDs range from 0 to the amount of unique items. An item can be various of things (e.g. an artist, an album, a track, a movie, etc.) 3) 'rating': floating-point data describing the rating a user has given an item. There are two types of ratings, namely explicit or implicit, and both are expected to be greater than zero.
The matrix has one optional field which is:
4) 'timestamp': when present can be used to split the matrix on temporal basis.
A dataset has two main tables that are connected to the 'user' and 'item' fields. Indirection arrays are available when user and/or item IDs do not match up in their corresponding tables. These two tables can be used in an experiment to filter any rows based on various table header criteria. Any additional tables can be added for accessibility/compatibility with the FRK recommender system.
Public methods:
get_available_columns get_available_event_tables get_available_matrices get_available_tables get_matrices_info get_matrix_config get_matrix_file_path get_name get_table_config get_table_info load_matrix read_matrix read_table resolve_item_ids resolve_user_ids
72 def __init__(self, data_dir: str, config: DatasetConfig): 73 """Construct the dataset. 74 75 Args: 76 data_dir: directory where the dataset is stored. 77 config: the dataset configuration. 78 """ 79 if not os.path.isdir(data_dir): 80 raise IOError('Unknown dataset directory: ' + data_dir) 81 82 self.data_dir = data_dir 83 self.config = config
Construct the dataset.
Args: data_dir: directory where the dataset is stored. config: the dataset configuration.
85 def get_available_columns(self, matrix_name: str) -> Dict[str, List[str]]: 86 """Get the available table column names of this dataset. 87 88 Args: 89 matrix_name: the name of the matrix to get the available columns of. 90 91 Returns: 92 a dictionary with table name as keys and column names as values. 93 94 """ 95 return self.config.get_available_columns(matrix_name)
Get the available table column names of this dataset.
Args: matrix_name: the name of the matrix to get the available columns of.
Returns: a dictionary with table name as keys and column names as values.
97 def get_available_event_tables(self) -> List[str]: 98 """Get the available event table names in the dataset. 99 100 Returns: 101 a list of event table names. 102 """ 103 event_table_names = [] 104 105 for table_name, _ in self.config.events.items(): 106 event_table_names.append(table_name) 107 108 return event_table_names
Get the available event table names in the dataset.
Returns: a list of event table names.
110 def get_available_matrices(self) -> List[str]: 111 """Get the available matrix names in the dataset. 112 113 Returns: 114 a list of matrix names. 115 """ 116 matrix_names = [] 117 118 for matrix_name, _ in self.config.matrices.items(): 119 matrix_names.append(matrix_name) 120 121 return matrix_names
Get the available matrix names in the dataset.
Returns: a list of matrix names.
123 def get_available_tables(self) -> List[str]: 124 """Get the available table names in the dataset. 125 126 Returns: 127 a list of table names. 128 """ 129 table_names = [] 130 131 for table_name, _ in self.config.tables.items(): 132 table_names.append(table_name) 133 134 return table_names
Get the available table names in the dataset.
Returns: a list of table names.
136 def get_matrices_info(self) -> Dict[str, Any]: 137 """Get the information on the dataset's available matrices. 138 139 Returns: 140 a dictionary containing the matrices' information keyed by matrix name. 141 """ 142 info = {} 143 144 for matrix_name, matrix_config in self.config.matrices.items(): 145 info[matrix_name] = matrix_config.to_yml_format() 146 147 return info
Get the information on the dataset's available matrices.
Returns: a dictionary containing the matrices' information keyed by matrix name.
149 def get_matrix_config(self, matrix_name: str) -> Optional[DatasetMatrixConfig]: 150 """Get the configuration of a dataset's matrix. 151 152 Args: 153 matrix_name: the name of the matrix to get the configuration of. 154 155 Returns: 156 the configuration of the matrix or None when not available. 157 """ 158 return self.config.matrices.get(matrix_name)
Get the configuration of a dataset's matrix.
Args: matrix_name: the name of the matrix to get the configuration of.
Returns: the configuration of the matrix or None when not available.
160 def get_matrix_file_path(self, matrix_name: str) -> Optional[str]: 161 """Get the file path where the matrix with the specified name is stored. 162 163 Args: 164 matrix_name: the name of the matrix to get the file path of. 165 166 Returns: 167 the path of the dataset's matrix file or None when not available. 168 """ 169 if matrix_name not in self.config.matrices: 170 return None 171 172 return os.path.join( 173 self.data_dir, 174 self.config.matrices[matrix_name].table.file.name 175 )
Get the file path where the matrix with the specified name is stored.
Args: matrix_name: the name of the matrix to get the file path of.
Returns: the path of the dataset's matrix file or None when not available.
177 def get_name(self) -> str: 178 """Get the name of the dataset. 179 180 Returns: 181 the dataset name. 182 """ 183 return self.config.dataset_name
Get the name of the dataset.
Returns: the dataset name.
185 def get_table_config(self, table_name: str) -> Optional[DatasetTableConfig]: 186 """Get the configuration of the dataset table with the specified name. 187 188 Args: 189 table_name: name of the table to retrieve the configuration of. 190 191 Returns: 192 the table configuration or None when not available. 193 """ 194 return self.config.tables.get(table_name)
Get the configuration of the dataset table with the specified name.
Args: table_name: name of the table to retrieve the configuration of.
Returns: the table configuration or None when not available.
196 def get_table_info(self) -> Dict[str, Any]: 197 """Get the information on the dataset's available tables. 198 199 Returns: 200 a dictionary containing the table information keyed by table name. 201 """ 202 info = {} 203 204 for table_name, table_config in self.config.tables.items(): 205 info[table_name] = table_config.to_yml_format() 206 207 return info
Get the information on the dataset's available tables.
Returns: a dictionary containing the table information keyed by table name.
209 def load_matrix(self, matrix_name: str) -> Optional[pd.DataFrame]: 210 """Load the standardized user-item matrix of the dataset. 211 212 Args: 213 matrix_name: the name of the matrix to load. 214 215 Returns: 216 the loaded user-item matrix or None when not available. 217 """ 218 matrix_config = self.get_matrix_config(matrix_name) 219 if matrix_config is None: 220 return None 221 222 return matrix_config.load_matrix(self.data_dir)
    def load_item_indices(self, matrix_name: str) -> Optional[List[int]]:
        """Load the item indices.

        Optional indirection array of the item IDs that do not match up in
        the corresponding data table.

        Args:
            matrix_name: the name of the matrix to load the item indices of.

        Raises:
            KeyError: when the matrix with the specified name does not exist.

        Returns:
            the indirection array or None when not needed.
        """
        matrix_config = self.get_matrix_config(matrix_name)
        if not matrix_config:
            raise KeyError('Unknown matrix configuration to load item indices from')

        return matrix_config.item.load_indices(self.data_dir)
    def load_user_indices(self, matrix_name: str) -> Optional[List[int]]:
        """Load the user indices.

        Optional indirection array of the user IDs that do not match up in
        the corresponding data table.

        Args:
            matrix_name: the name of the matrix to load the user indices of.

        Raises:
            KeyError: when the matrix with the specified name does not exist.

        Returns:
            the indirection array or None when not needed.
        """
        matrix_config = self.get_matrix_config(matrix_name)
        if not matrix_config:
            raise KeyError('Unknown matrix configuration to load user indices from')

        return matrix_config.user.load_indices(self.data_dir)
    def read_matrix(
            self,
            matrix_name: str,
            columns: Optional[List[Union[int, str]]]=None,
            chunk_size: Optional[int]=None) -> Optional[pd.DataFrame]:
        """Read the matrix with the specified name from the dataset.

        Args:
            matrix_name: the name of the matrix to load.
            columns: subset list of columns to load or None to load all.
                All elements must either be integer indices or
                strings that correspond to one of the available table columns.
            chunk_size: reads the matrix in chunks as an iterator or
                the entire matrix when None.

        Returns:
            the resulting matrix dataframe (iterator) or None when not available.
        """
        matrix_config = self.get_matrix_config(matrix_name)
        if matrix_config is None:
            return None

        table_config = matrix_config.table
        return table_config.read_table(self.data_dir, columns=columns, chunk_size=chunk_size)
    def read_table(
            self,
            table_name: str,
            columns: Optional[List[Union[int, str]]]=None,
            chunk_size: Optional[int]=None) -> Optional[pd.DataFrame]:
        """Read the table with the specified name from the dataset.

        Args:
            table_name: name of the table to read.
            columns: subset list of columns to load or None to load all.
                All elements must either be integer indices or
                strings that correspond to one of the available table columns.
            chunk_size: reads the table in chunks as an iterator or
                the entire table when None.

        Returns:
            the resulting table dataframe (iterator) or None when not available.
        """
        table_config = self.get_table_config(table_name)
        if table_config is None:
            return None

        return table_config.read_table(self.data_dir, columns=columns, chunk_size=chunk_size)
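The chunked reading that `read_table` and `read_matrix` expose mirrors the `chunksize` behaviour of pandas: with a chunk size the call yields an iterator of dataframes instead of one frame. A minimal sketch with made-up file contents (the real tables live in '.tsv' files under `data_dir`):

```python
import io

import pandas as pd

# stand-in for a small dataset table stored as tab-separated values
tsv = "user\titem\trating\n0\t10\t4.0\n1\t11\t3.5\n2\t10\t5.0\n"

# chunksize=None -> a single DataFrame with all rows
whole = pd.read_csv(io.StringIO(tsv), sep='\t')

# chunksize=2 -> an iterator of DataFrames of at most 2 rows each
chunks = list(pd.read_csv(io.StringIO(tsv), sep='\t', chunksize=2))

print(len(whole))                 # 3
print([len(c) for c in chunks])   # [2, 1]
```

Iterating in chunks keeps memory bounded for large matrices at the cost of per-chunk overhead.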
    def resolve_item_ids(
            self,
            matrix_name: str,
            items: Union[int, List[int]]) -> Union[int, List[int]]:
        """Resolve the specified item ID(s).

        Item ID(s) need to be resolved when the dataset contains an
        indirection array; otherwise the ID(s) are returned unchanged.

        Args:
            matrix_name: the name of the matrix to resolve the item indices of.
            items: source ID(s) to convert.

        Raises:
            KeyError: when the matrix with the specified name does not exist.

        Returns:
            the resolved item ID(s).
        """
        item_indices = self.load_item_indices(matrix_name)
        if item_indices is None:
            return items

        return item_indices[items]
    def resolve_user_ids(
            self,
            matrix_name: str,
            users: Union[int, List[int]]) -> Union[int, List[int]]:
        """Resolve the specified user ID(s).

        User ID(s) need to be resolved when the dataset contains an
        indirection array; otherwise the ID(s) are returned unchanged.

        Args:
            matrix_name: the name of the matrix to resolve the user indices of.
            users: source ID(s) to convert.

        Raises:
            KeyError: when the matrix with the specified name does not exist.

        Returns:
            the resolved user ID(s).
        """
        user_indices = self.load_user_indices(matrix_name)
        if user_indices is None:
            return users

        return user_indices[users]
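The `user_indices[users]` lookup in `resolve_user_ids` is a plain indexing step: position `i` of the indirection array holds the original table ID of matrix user `i`, so indexing with a scalar or a list of compact matrix IDs yields the original ID(s). A sketch with a hypothetical indirection array (in the real dataset it comes from `load_user_indices`):

```python
import numpy as np

# hypothetical indirection array: matrix user i maps to original ID user_indices[i]
user_indices = np.array([105, 42, 993])

def resolve(users):
    """Map compact matrix user ID(s) to the original table ID(s)."""
    return user_indices[users]

print(resolve(1))        # 42
print(resolve([0, 2]))   # [105 993]
```

With a numpy array this handles both the scalar and the list case in one expression, which is why the resolve methods can pass `items`/`users` through unchanged.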
def add_dataset_columns(
        dataset: Dataset,
        matrix_name: str,
        dataframe: pd.DataFrame,
        column_names: List[str]) -> pd.DataFrame:
    """Add the specified columns from the dataset to the dataframe.

    Args:
        dataset: the dataset related to the dataframe.
        matrix_name: the name of the dataset matrix.
        dataframe: dataframe with at least the 'user' and/or 'item' columns.
        column_names: a list of strings to indicate which user and/or item
            columns need to be added. Any values that are not present in the
            dataset tables are ignored.

    Returns:
        the resulting dataframe with the added columns that exist in the dataset.
    """
    for table_name, table_columns in dataset.get_available_columns(matrix_name).items():
        columns = [c for c in column_names if c in table_columns]
        # skip tables that do not contain any of the needed columns
        if len(columns) == 0:
            continue

        matrix_config = dataset.get_matrix_config(matrix_name)
        table_config = dataset.get_table_config(table_name)

        user_key = matrix_config.user.key
        item_key = matrix_config.item.key
        user_item_key = [user_key, item_key]

        # add matrix columns
        if table_name == 'matrix':
            dataframe = pd.merge(
                dataframe,
                dataset.read_matrix(matrix_name, columns=matrix_config.table.primary_key + columns),
                how='left',
                left_on=['user', 'item'],
                right_on=matrix_config.table.primary_key
            )
            dataframe.drop(matrix_config.table.primary_key, inplace=True, axis=1)
        # add user columns
        elif table_config.primary_key == [user_key]:
            dataframe[user_key] = dataset.resolve_user_ids(matrix_name, dataframe['user'])
            dataframe = pd.merge(
                dataframe,
                dataset.read_table(table_name, columns=table_config.primary_key + columns),
                how='left',
                on=user_key
            )
            dataframe.drop(user_key, inplace=True, axis=1)
        # add item columns
        elif table_config.primary_key == [item_key]:
            dataframe[item_key] = dataset.resolve_item_ids(matrix_name, dataframe['item'])
            dataframe = pd.merge(
                dataframe,
                dataset.read_table(table_name, columns=table_config.primary_key + columns),
                how='left',
                on=item_key
            )
            dataframe.drop(item_key, inplace=True, axis=1)
        # add user-item columns
        elif table_config.primary_key == user_item_key:
            dataframe[user_key] = dataset.resolve_user_ids(matrix_name, dataframe['user'])
            dataframe[item_key] = dataset.resolve_item_ids(matrix_name, dataframe['item'])
            dataframe = pd.merge(
                dataframe,
                dataset.read_table(table_name, columns=table_config.primary_key + columns),
                how='left',
                on=user_item_key
            )
            dataframe.drop(user_item_key, inplace=True, axis=1)

    return dataframe
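Each branch of `add_dataset_columns` follows the same merge-and-drop pattern: resolve the IDs to the table's key, left-merge the requested columns, then drop the temporary key column. A sketch of the user-table branch with made-up frames (the key name `user_id` is a hypothetical stand-in for `matrix_config.user.key`; the resolve step is the identity here because the IDs already match):

```python
import pandas as pd

# stand-ins for a result dataframe and a dataset user table
result = pd.DataFrame({'user': [0, 1], 'item': [10, 11], 'score': [0.9, 0.7]})
user_table = pd.DataFrame({'user_id': [0, 1], 'gender': ['f', 'm'], 'age': [23, 31]})

# resolve_user_ids would translate 'user' to the table key; identity here
result['user_id'] = result['user']

# left-merge only the requested column, keyed on the table's primary key
result = pd.merge(result, user_table[['user_id', 'gender']], how='left', on='user_id')

# drop the temporary key column again
result.drop('user_id', inplace=True, axis=1)

print(list(result.columns))  # ['user', 'item', 'score', 'gender']
```

The left merge guarantees that every row of the original dataframe survives, even when a user or item has no entry in the dataset table (those rows get NaN for the added columns).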