src.fairreckitlib.data.set.dataset_sampling
This module contains functionality to create a sample of an existing dataset.
Functions:
create_dataset_sample: create a sample of a dataset.
create_dataset_table_samples: create table samples for a map of key indices.
create_matrix_sample_config: create a sample matrix configuration.
create_matrix_sample: create a sample of a dataset's matrix.
This program has been developed by students from the bachelor Computer Science at Utrecht University within the Software Project course. © Copyright Utrecht University (Department of Information and Computing Sciences)
```python
"""This module contains functionality to create a sample of an existing dataset.

Functions:

    create_dataset_sample: create a sample of a dataset.
    create_dataset_table_samples: create table samples for a map of key indices.
    create_matrix_sample_config: create a sample matrix configuration.
    create_matrix_sample: create a sample of a dataset's matrix.

This program has been developed by students from the bachelor Computer Science at
Utrecht University within the Software Project course.
© Copyright Utrecht University (Department of Information and Computing Sciences)
"""

import os
from typing import Dict, List, Optional, Tuple

import pandas as pd

from ...core.io.io_utility import save_yml
from .dataset import Dataset
from .dataset_constants import DATASET_CONFIG_FILE
from .dataset_config import DatasetMatrixConfig, DatasetIndexConfig, RatingMatrixConfig
from .dataset_config import DatasetConfig, DatasetTableConfig, create_dataset_table_config


def create_dataset_sample(
        output_dir: str,
        dataset: Dataset,
        num_users: int,
        num_items: int) -> Dataset:
    """Create a sample of the specified dataset.

    Look at the 'create_matrix_sample' function for specifics on how the
    matrices of the dataset are sampled. All tables, except the events, that are related
    to the user/item keys that are present in the sample matrices are sampled as well.
    The generated dataset sample is stored in the output directory before returning it.
    This function raises an IOError when the sample directory already exists
    inside the specified output directory.

    Args:
        output_dir: the path to the directory where the dataset sample will be stored.
        dataset: the dataset to create a sample of.
        num_users: the number of users in the created sample matrices.
        num_items: the number of items in the created sample matrices.

    Returns:
        the resulting sample dataset.
    """
    sample_dir = os.path.join(output_dir, dataset.get_name() + '-Sample')
    if os.path.isdir(sample_dir):
        raise IOError('Failed to create sample, directory already exists.')

    os.mkdir(sample_dir)

    sample_matrices = {}
    key_id_map = {}

    for matrix_name in dataset.get_available_matrices():
        # create and add matrix sample
        sample_matrix_config = create_matrix_sample_config(
            sample_dir,
            dataset,
            matrix_name,
            num_users,
            num_items
        )
        sample_matrices[matrix_name] = sample_matrix_config

        # append user key indices to the key map
        # (pd.concat is used because Series.append was removed in pandas 2.0)
        user_keys = key_id_map.get(sample_matrix_config.user.key, [])
        user_indices = pd.Series(sample_matrix_config.user.load_indices(sample_dir))
        user_keys = pd.concat([pd.Series(user_keys), user_indices])
        key_id_map[sample_matrix_config.user.key] = user_keys.unique()

        # append item key indices to the key map
        item_keys = key_id_map.get(sample_matrix_config.item.key, [])
        item_indices = pd.Series(sample_matrix_config.item.load_indices(sample_dir))
        item_keys = pd.concat([pd.Series(item_keys), item_indices])
        key_id_map[sample_matrix_config.item.key] = item_keys.unique()

    # create sample tables for the key map that contains all the needed indices of all matrices
    sample_tables = create_dataset_table_samples(sample_dir, dataset, key_id_map)

    # create and save dataset configuration
    sample_dataset_config = DatasetConfig(
        dataset.get_name() + '-Sample',
        {},
        sample_matrices,
        sample_tables
    )
    save_yml(os.path.join(sample_dir, DATASET_CONFIG_FILE), sample_dataset_config.to_yml_format())

    return Dataset(sample_dir, sample_dataset_config)


def create_dataset_table_samples(
        output_dir: str,
        dataset: Dataset,
        key_id_map: Dict[str, List[int]]) -> Dict[str, DatasetTableConfig]:
    """Create table samples for the specified dataset and key map.

    The key map is used to identify which tables of the dataset are sampled.
    A table is considered to be a candidate if the key in the map matches the
    primary key of the table. Any rows that do not contain the needed indices
    in the key map are filtered.

    Args:
        output_dir: the path to the directory where the sample tables will be stored.
        dataset: the dataset to create sample tables from.
        key_id_map: a dictionary containing a table key paired with a list of indices
            that are related to these table keys.

    Returns:
        a dictionary with the resulting table sample configurations, keyed by table names.
    """
    sample_tables = {}

    for table_name in dataset.get_available_tables():
        table_config = dataset.get_table_config(table_name)

        table = dataset.read_table(table_name)
        table_modified = False

        for key_id, key_id_list in key_id_map.items():
            # filter unwanted table rows when the primary key matches
            if table_config.primary_key == [key_id]:
                table = table[table[key_id].isin(key_id_list)]
                table_modified = True

        if table_modified:
            # store the table sample
            sample_table_config = create_dataset_table_config(
                dataset.get_name() + '_' + table_name + '.tsv.bz2',
                table_config.primary_key,
                table_config.columns,
                compression='bz2',
                encoding=table_config.file.options.encoding,
                foreign_keys=table_config.foreign_keys,
                num_records=len(table)
            )
            sample_table_config.save_table(table, output_dir)
            # add sample table configuration
            sample_tables[table_name] = sample_table_config

    return sample_tables


def create_matrix_sample_config(
        output_dir: str,
        dataset: Dataset,
        matrix_name: str,
        num_users: int,
        num_items: int) -> Optional[DatasetMatrixConfig]:
    """Create a dataset matrix sample configuration.

    Look at the 'create_matrix_sample' function for specifics on how the
    matrix is sampled. The generated matrix and user/item indirection arrays are
    stored in the output directory and the corresponding configuration is returned.

    Args:
        output_dir: the path to the directory where the sample matrix will be stored.
        dataset: the dataset to create a sample matrix from.
        matrix_name: the name of the matrix to create a sample of.
        num_users: the number of users in the created sample matrix.
        num_items: the number of items in the created sample matrix.

    Returns:
        the sample matrix configuration or None when the specified matrix does not exist.
    """
    matrix_config = dataset.get_matrix_config(matrix_name)
    if matrix_config is None:
        return None

    sample, users, items = create_matrix_sample(dataset, matrix_name, num_users, num_items)

    # create the user indices config and save the array
    user_index_config = DatasetIndexConfig(
        matrix_name + '_user_indices.hdf5',
        matrix_config.user.key,
        len(users)
    )
    user_index_config.save_indices(output_dir, list(users))

    # create the item indices config and save the array
    item_index_config = DatasetIndexConfig(
        matrix_name + '_item_indices.hdf5',
        matrix_config.item.key,
        len(items)
    )
    item_index_config.save_indices(output_dir, list(items))

    # create the sample matrix table config and save the table
    sample_table_config = create_dataset_table_config(
        dataset.get_name() + '_' + matrix_name + '.tsv.bz2',
        matrix_config.table.primary_key,
        matrix_config.table.columns,
        compression='bz2',
        encoding=matrix_config.table.file.options.encoding,
        foreign_keys=matrix_config.table.foreign_keys,
        num_records=len(sample)
    )
    sample_table_config.save_table(sample, output_dir)

    # the rating range is recomputed from the first column of the sample
    return DatasetMatrixConfig(
        sample_table_config,
        RatingMatrixConfig(
            float(sample[matrix_config.table.columns[0]].min()),
            float(sample[matrix_config.table.columns[0]].max()),
            matrix_config.ratings.rating_type
        ),
        user_index_config,
        item_index_config
    )


def create_matrix_sample(
        dataset: Dataset,
        matrix_name: str,
        num_users: int,
        num_items: int) -> Tuple[pd.DataFrame, List[int], List[int]]:
    """Create a sample for the specified matrix.

    Extracts a sample with the first occurring users and items until the
    specified amounts are reached, so these amounts are only used as an indication.
    No additional users/items are generated when the dataset matrix has
    fewer of them available than is specified. Moreover, due to the sparsity of the
    matrix it can turn out that the resulting matrix is very close to, but not
    exactly, the specified amounts.

    Args:
        dataset: the dataset to create a sample matrix from.
        matrix_name: the name of the matrix to create a sample of.
        num_users: the number of users in the created sample matrix.
        num_items: the number of items in the created sample matrix.

    Returns:
        the sample matrix, the unique user and unique item indices.
    """
    matrix_config = dataset.get_matrix_config(matrix_name)

    # clamp num users/items
    matrix_users = min(matrix_config.user.num_records, num_users)
    matrix_items = min(matrix_config.item.num_records, num_items)

    # prepare sample dataframe
    matrix_columns = matrix_config.table.primary_key + matrix_config.table.columns
    matrix_sample = pd.DataFrame(columns=matrix_columns)

    user_key = matrix_config.user.key
    item_key = matrix_config.item.key

    # create sample in chunks for very big matrices
    for matrix in dataset.read_matrix(matrix_name, chunk_size=50000000):
        matrix_sample = pd.concat([matrix_sample, matrix])
        # stop reading once more users and items than requested have been seen
        if len(matrix_sample[user_key].unique()) > matrix_users and \
            len(matrix_sample[item_key].unique()) > matrix_items:
            break

    # users may not be numbered from 0...num_users
    unique_users = matrix_sample[user_key].unique()
    matrix_sample = pd.merge(
        matrix_sample,
        pd.DataFrame(list(enumerate(unique_users)), columns=['user', user_key]),
        how='left',
        on=user_key
    )
    # remove any users above the threshold
    matrix_sample = matrix_sample[matrix_sample['user'] < matrix_users]
    # recalculate the indirection array
    unique_users = dataset.resolve_user_ids(matrix_name, matrix_sample[user_key].unique().tolist())

    # items may not be numbered from 0...num_items
    unique_items = matrix_sample[item_key].unique()
    matrix_sample = pd.merge(
        matrix_sample,
        pd.DataFrame(list(enumerate(unique_items)), columns=['item', item_key]),
        how='left',
        on=item_key
    )
    # remove any items above the threshold
    matrix_sample = matrix_sample[matrix_sample['item'] < matrix_items]
    # recalculate and resolve the indirection array
    unique_items = dataset.resolve_item_ids(matrix_name, matrix_sample[item_key].unique().tolist())

    # create sample by removing the extra columns that were added
    matrix_sample = matrix_sample[['user', 'item'] + matrix_config.table.columns]

    return matrix_sample, unique_users, unique_items
```
create_dataset_sample(output_dir, dataset, num_users, num_items) -> Dataset
Create a sample of the specified dataset.
See the 'create_matrix_sample' function for specifics on how the matrices of the dataset are sampled. Every table, except the events, whose primary key is a user/item key present in the sample matrices is sampled as well. The generated dataset sample is stored in a '<dataset name>-Sample' directory inside the output directory before it is returned. This function raises an IOError when that sample directory already exists.
Args:
    output_dir: the path to the directory where the dataset sample will be stored.
    dataset: the dataset to create a sample of.
    num_users: the number of users in the created sample matrices.
    num_items: the number of items in the created sample matrices.
Returns:
    the resulting sample dataset.
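The two bookkeeping steps described above, guarding the sample directory and collecting key indices across matrices, can be sketched in plain Python. This is an illustration under assumptions rather than the library's code: the helper names `make_sample_dir` and `add_key_indices` and the dataset name `ML-100K` are hypothetical, and an ordered `dict` stands in for the pandas `unique()` call used in the module.

```python
import os
import tempfile
from typing import Dict, Iterable, List


def make_sample_dir(output_dir: str, dataset_name: str) -> str:
    """Create '<output_dir>/<dataset_name>-Sample' and refuse to reuse an existing one."""
    sample_dir = os.path.join(output_dir, dataset_name + '-Sample')
    if os.path.isdir(sample_dir):
        raise IOError('Failed to create sample, directory already exists.')
    os.mkdir(sample_dir)
    return sample_dir


def add_key_indices(key_id_map: Dict[str, List[int]], key: str, indices: Iterable[int]) -> None:
    """Merge indices into the entry for key, keeping first-seen order without duplicates."""
    merged = list(key_id_map.get(key, [])) + list(indices)
    key_id_map[key] = list(dict.fromkeys(merged))


with tempfile.TemporaryDirectory() as output_dir:
    sample_dir = make_sample_dir(output_dir, 'ML-100K')  # hypothetical dataset name
    try:
        make_sample_dir(output_dir, 'ML-100K')
        second_call_failed = False
    except IOError:
        # the sample directory from the first call is still there
        second_call_failed = True

key_id_map: Dict[str, List[int]] = {}
add_key_indices(key_id_map, 'user_id', [3, 1, 2])  # indices from a first matrix
add_key_indices(key_id_map, 'user_id', [2, 5])     # a second matrix sharing the user key
add_key_indices(key_id_map, 'item_id', [10, 11])
```

Because the guard checks the '-Sample' subdirectory rather than the output directory itself, the same output directory can hold samples of several datasets, but sampling the same dataset twice requires removing the earlier sample first.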
create_dataset_table_samples(output_dir, dataset, key_id_map) -> Dict[str, DatasetTableConfig]
Create table samples for the specified dataset and key map.
The key map is used to identify which tables of the dataset are sampled. A table is a candidate when its primary key consists of exactly one key that is present in the map. Rows whose value for that key is not among the indices listed in the map are filtered out. Tables without a matching primary key are not part of the result.
Args:
    output_dir: the path to the directory where the sample tables will be stored.
    dataset: the dataset to create sample tables from.
    key_id_map: a dictionary containing a table key paired with a list of indices that are related to these table keys.
Returns:
    a dictionary with the resulting table sample configurations, keyed by table names.
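The candidate rule and the row filter can be illustrated without the library. The sketch below is a plain-Python analogue of the pandas `isin` filter in the listing, with tables modelled as lists of dictionaries; the function name `sample_table` and the table contents are made up for the example.

```python
from typing import Dict, List, Optional

Row = Dict[str, object]


def sample_table(
        rows: List[Row],
        primary_key: List[str],
        key_id_map: Dict[str, List[int]]) -> Optional[List[Row]]:
    """Return the filtered rows, or None when no key in the map matches the primary key."""
    for key_id, wanted in key_id_map.items():
        # only a single-column primary key equal to the map key qualifies
        if primary_key == [key_id]:
            wanted_set = set(wanted)
            return [row for row in rows if row[key_id] in wanted_set]
    return None


key_id_map = {'user_id': [1, 3], 'item_id': [10]}

user_table = [
    {'user_id': 1, 'age': 31},
    {'user_id': 2, 'age': 45},
    {'user_id': 3, 'age': 27},
]
# a composite primary key never equals a single map key, so this table is skipped
event_table = [
    {'user_id': 1, 'item_id': 10, 'timestamp': 1000},
]

sampled_users = sample_table(user_table, ['user_id'], key_id_map)
sampled_events = sample_table(event_table, ['user_id', 'item_id'], key_id_map)
```

Under this rule a table with a composite primary key is skipped rather than filtered, which is consistent with the statement that event tables are not sampled.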
create_matrix_sample_config(output_dir, dataset, matrix_name, num_users, num_items) -> Optional[DatasetMatrixConfig]
Create a dataset matrix sample configuration.
See the 'create_matrix_sample' function for specifics on how the matrix is sampled. The generated matrix and the user/item indirection arrays are stored in the output directory and the corresponding configuration is returned.
Args:
    output_dir: the path to the directory where the sample matrix will be stored.
    dataset: the dataset to create a sample matrix from.
    matrix_name: the name of the matrix to create a sample of.
    num_users: the number of users in the created sample matrix.
    num_items: the number of items in the created sample matrix.
Returns:
    the sample matrix configuration or None when the specified matrix does not exist.
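As a rough picture of what the returned configuration records, the sketch below collects the same facts in a plain dictionary: the three file names built in the source listing, the rating range recomputed from the sample, and the record counts. The function `describe_matrix_sample`, its dictionary keys, and the dataset and matrix names are hypothetical; the real function returns a `DatasetMatrixConfig` instead.

```python
from typing import Dict, List, Union


def describe_matrix_sample(
        dataset_name: str,
        matrix_name: str,
        ratings: List[float],
        users: List[int],
        items: List[int]) -> Dict[str, Union[str, float, int]]:
    """Summarise a matrix sample; raises ValueError when ratings is empty."""
    return {
        # file names follow the patterns used in the source listing
        'matrix_file': dataset_name + '_' + matrix_name + '.tsv.bz2',
        'user_index_file': matrix_name + '_user_indices.hdf5',
        'item_index_file': matrix_name + '_item_indices.hdf5',
        # the rating range comes from the sample, not from the full matrix
        'rating_min': float(min(ratings)),
        'rating_max': float(max(ratings)),
        'num_records': len(ratings),
        'num_users': len(users),
        'num_items': len(items),
    }


summary = describe_matrix_sample(
    'ML-100K',            # hypothetical dataset name
    'user-movie-rating',  # hypothetical matrix name
    [4.0, 3.0, 1.0],
    [10, 20],
    [100, 101],
)
```

Recomputing the range from the sample means that a sample can report a narrower rating range than its parent dataset when the extreme ratings are not among the sampled rows.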
create_matrix_sample(dataset, matrix_name, num_users, num_items) -> Tuple[pd.DataFrame, List[int], List[int]]
Create a sample for the specified matrix.
The sample keeps the users and items that occur first in the matrix until the specified amounts are reached, so num_users and num_items are only an indication. No additional users/items are generated when the dataset matrix has fewer of them available than specified. Moreover, because the matrix is sparse, the resulting sample can end up very close to, but not exactly at, the specified amounts.
Args:
    dataset: the dataset to create a sample matrix from.
    matrix_name: the name of the matrix to create a sample of.
    num_users: the number of users in the created sample matrix.
    num_items: the number of items in the created sample matrix.
Returns:
    the sample matrix, the unique user indices and the unique item indices.
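The first-occurrence strategy, and why the result may miss the requested amounts, is easier to see on a tiny matrix. The following sketch follows the order of operations in the source listing as far as it can be read there (number users, cut users, number the remaining items, cut items), but uses plain tuples instead of a pandas DataFrame and skips the chunked reading and the id resolution step. The function name `sample_first_occurring` and the data are assumptions for illustration.

```python
from typing import Dict, List, Tuple

Rating = Tuple[int, int, float]  # (user key, item key, rating)


def sample_first_occurring(
        ratings: List[Rating],
        num_users: int,
        num_items: int) -> Tuple[List[Rating], List[int], List[int]]:
    """Keep the rows of the first num_users users, then of the first num_items items."""
    # number users in order of first appearance and drop users past the threshold
    user_index: Dict[int, int] = {}
    for user, _item, _rating in ratings:
        user_index.setdefault(user, len(user_index))
    kept = [row for row in ratings if user_index[row[0]] < num_users]
    users = list(dict.fromkeys(row[0] for row in kept))

    # number items among the remaining rows and drop items past the threshold
    item_index: Dict[int, int] = {}
    for _user, item, _rating in kept:
        item_index.setdefault(item, len(item_index))
    kept = [row for row in kept if item_index[row[1]] < num_items]
    items = list(dict.fromkeys(row[1] for row in kept))

    # replace the original keys by the zero-based indices
    sample = [(user_index[u], item_index[i], r) for u, i, r in kept]
    return sample, users, items


ratings = [
    (10, 100, 4.0),
    (10, 101, 3.0),
    (20, 102, 5.0),  # user 20 only rated item 102
    (30, 100, 2.0),
]
sample, users, items = sample_first_occurring(ratings, num_users=2, num_items=2)
```

Here two users were requested and two are reported in `users`, yet the sample rows only contain user index 0, because the only rating of user 20 belongs to an item that fell past the item threshold. This is one way the sparsity of the matrix can leave the sample short of the requested amounts.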