src.fairreckitlib.data.set.dataset_sampling

This module contains functionality to create a sample of an existing dataset.

Functions:

create_dataset_sample: create a sample of a dataset.
create_dataset_table_samples: create table samples for a map of key indices.
create_matrix_sample_config: create a sample matrix configuration.
create_matrix_sample: create a sample of a dataset's matrix.

This program has been developed by students from the bachelor Computer Science at Utrecht University within the Software Project course. © Copyright Utrecht University (Department of Information and Computing Sciences)

"""This module contains functionality to create a sample of an existing dataset.

Functions:

    create_dataset_sample: create a sample of a dataset.
    create_dataset_table_samples: create table samples for a map of key indices.
    create_matrix_sample_config: create a sample matrix configuration.
    create_matrix_sample: create a sample of a dataset's matrix.

This program has been developed by students from the bachelor Computer Science at
Utrecht University within the Software Project course.
© Copyright Utrecht University (Department of Information and Computing Sciences)
"""

import os
from typing import Dict, List, Optional, Tuple

import pandas as pd

from ...core.io.io_utility import save_yml
from .dataset import Dataset
from .dataset_constants import DATASET_CONFIG_FILE
from .dataset_config import DatasetMatrixConfig, DatasetIndexConfig, RatingMatrixConfig
from .dataset_config import DatasetConfig, DatasetTableConfig, create_dataset_table_config


def create_dataset_sample(
        output_dir: str,
        dataset: Dataset,
        num_users: int,
        num_items: int) -> Dataset:
    """Create a sample of the specified dataset.

    Look at the 'create_matrix_sample' function for specifics on how the
    matrices of the dataset are sampled. All tables, except the events, that are related
    to the user/item keys that are present in the sample matrices are sampled as well.
    The generated dataset sample is stored in a '<dataset name>-Sample' subdirectory
    of the output directory before returning it.
    This function raises an IOError when that sample directory already exists.

    Args:
        output_dir: the path to the directory where the dataset sample will be stored.
        dataset: the dataset to create a sample of.
        num_users: the number of users in the created sample matrices.
        num_items: the number of items in the created sample matrices.

    Returns:
        the resulting sample dataset.
    """
    sample_dir = os.path.join(output_dir, dataset.get_name() + '-Sample')
    if os.path.isdir(sample_dir):
        raise IOError('Failed to create sample, directory already exists.')

    os.mkdir(sample_dir)

    sample_matrices = {}
    key_id_map = {}

    for matrix_name in dataset.get_available_matrices():
        # create and add matrix sample
        sample_matrix_config = create_matrix_sample_config(
            sample_dir,
            dataset,
            matrix_name,
            num_users,
            num_items
        )
        sample_matrices[matrix_name] = sample_matrix_config

        # append user key indices to the key map
        # (pd.concat replaces Series.append, which was removed in pandas 2.0)
        user_keys = key_id_map.get(sample_matrix_config.user.key, [])
        user_indices = pd.Series(sample_matrix_config.user.load_indices(sample_dir))
        user_keys = pd.concat([pd.Series(user_keys, dtype=user_indices.dtype), user_indices])
        key_id_map[sample_matrix_config.user.key] = user_keys.unique()

        # append item key indices to the key map
        item_keys = key_id_map.get(sample_matrix_config.item.key, [])
        item_indices = pd.Series(sample_matrix_config.item.load_indices(sample_dir))
        item_keys = pd.concat([pd.Series(item_keys, dtype=item_indices.dtype), item_indices])
        key_id_map[sample_matrix_config.item.key] = item_keys.unique()

    # create sample tables for the key map that contains all the needed indices of all matrices
    sample_tables = create_dataset_table_samples(sample_dir, dataset, key_id_map)

    # create and save dataset configuration
    sample_dataset_config = DatasetConfig(
        dataset.get_name() + '-Sample',
        {},
        sample_matrices,
        sample_tables
    )
    save_yml(os.path.join(sample_dir, DATASET_CONFIG_FILE), sample_dataset_config.to_yml_format())

    return Dataset(sample_dir, sample_dataset_config)


def create_dataset_table_samples(
        output_dir: str,
        dataset: Dataset,
        key_id_map: Dict[str, List[int]]) -> Dict[str, DatasetTableConfig]:
    """Create table samples for the specified dataset and key map.

    The key map is used to identify which tables of the dataset are sampled.
    A table is considered to be a candidate if a key in the map matches the
    primary key of the table. Rows whose primary key is not among the indices
    in the key map are filtered out.

    Args:
        output_dir: the path to the directory where the sample tables will be stored.
        dataset: the dataset to create sample tables from.
        key_id_map: a dictionary containing a table key paired with a list of indices
            that are related to these table keys.

    Returns:
        a dictionary with the resulting table sample configurations, keyed by table names.
    """
    sample_tables = {}

    for table_name in dataset.get_available_tables():
        table_config = dataset.get_table_config(table_name)

        table = dataset.read_table(table_name)
        table_modified = False

        for key_id, key_id_list in key_id_map.items():
            # filter unwanted table rows when the primary key matches
            if table_config.primary_key == [key_id]:
                table = table[table[key_id].isin(key_id_list)]
                table_modified = True

        if table_modified:
            # store the table sample
            sample_table_config = create_dataset_table_config(
                dataset.get_name() + '_' + table_name + '.tsv.bz2',
                table_config.primary_key,
                table_config.columns,
                compression='bz2',
                encoding=table_config.file.options.encoding,
                foreign_keys=table_config.foreign_keys,
                num_records=len(table)
            )
            sample_table_config.save_table(table, output_dir)
            # add sample table configuration
            sample_tables[table_name] = sample_table_config

    return sample_tables


def create_matrix_sample_config(
        output_dir: str,
        dataset: Dataset,
        matrix_name: str,
        num_users: int,
        num_items: int) -> Optional[DatasetMatrixConfig]:
    """Create a dataset matrix sample configuration.

    Look at the 'create_matrix_sample' function for specifics on how the
    matrix is sampled. The generated matrix and user/item indirection arrays are
    stored in the output directory and the corresponding configuration is returned.

    Args:
        output_dir: the path to the directory where the sample matrix will be stored.
        dataset: the dataset to create a sample matrix from.
        matrix_name: the name of the matrix to create a sample of.
        num_users: the number of users in the created sample matrix.
        num_items: the number of items in the created sample matrix.

    Returns:
        the sample matrix configuration or None when the specified matrix does not exist.
    """
    matrix_config = dataset.get_matrix_config(matrix_name)
    if matrix_config is None:
        return None

    sample, users, items = create_matrix_sample(dataset, matrix_name, num_users, num_items)

    # create the user indices config and save the array
    user_index_config = DatasetIndexConfig(
        matrix_name + '_user_indices.hdf5',
        matrix_config.user.key,
        len(users)
    )
    user_index_config.save_indices(output_dir, list(users))

    # create the item indices config and save the array
    item_index_config = DatasetIndexConfig(
        matrix_name + '_item_indices.hdf5',
        matrix_config.item.key,
        len(items)
    )
    item_index_config.save_indices(output_dir, list(items))

    # create the sample matrix table config and save the table
    sample_table_config = create_dataset_table_config(
        dataset.get_name() + '_' + matrix_name + '.tsv.bz2',
        matrix_config.table.primary_key,
        matrix_config.table.columns,
        compression='bz2',
        encoding=matrix_config.table.file.options.encoding,
        foreign_keys=matrix_config.table.foreign_keys,
        num_records=len(sample)
    )
    sample_table_config.save_table(sample, output_dir)

    # the rating range is recomputed from the first column of the sample
    return DatasetMatrixConfig(
        sample_table_config,
        RatingMatrixConfig(
            float(sample[matrix_config.table.columns[0]].min()),
            float(sample[matrix_config.table.columns[0]].max()),
            matrix_config.ratings.rating_type
        ),
        user_index_config,
        item_index_config
    )


def create_matrix_sample(
        dataset: Dataset,
        matrix_name: str,
        num_users: int,
        num_items: int) -> Tuple[pd.DataFrame, List[int], List[int]]:
    """Create a sample for the specified matrix.

    Extracts a sample of the first occurring users and items until the
    specified amounts are reached; these amounts are only an indication.
    No additional users/items are generated when the dataset matrix has
    fewer available than specified. Moreover, due to the sparsity of the
    matrix, the resulting sample can end up very close to, but not exactly at,
    the specified amounts.

    Args:
        dataset: the dataset to create a sample matrix from.
        matrix_name: the name of the matrix to create a sample of.
        num_users: the number of users in the created sample matrix.
        num_items: the number of items in the created sample matrix.

    Returns:
        the sample matrix, the unique user indices and the unique item indices.
    """
    matrix_config = dataset.get_matrix_config(matrix_name)

    # clamp num users/items
    matrix_users = min(matrix_config.user.num_records, num_users)
    matrix_items = min(matrix_config.item.num_records, num_items)

    # prepare sample dataframe
    matrix_columns = matrix_config.table.primary_key + matrix_config.table.columns
    matrix_sample = pd.DataFrame(columns=matrix_columns)

    user_key = matrix_config.user.key
    item_key = matrix_config.item.key

    # create sample in chunks for very big matrices
    for matrix in dataset.read_matrix(matrix_name, chunk_size=50000000):
        matrix_sample = pd.concat([matrix_sample, matrix])
        # stop reading once both thresholds are exceeded
        if len(matrix_sample[user_key].unique()) > matrix_users and \
                len(matrix_sample[item_key].unique()) > matrix_items:
            break

    # users may not be numbered from 0...num_users
    unique_users = matrix_sample[user_key].unique()
    matrix_sample = pd.merge(
        matrix_sample,
        pd.DataFrame(list(enumerate(unique_users)), columns=['user', user_key]),
        how='left',
        on=user_key
    )
    # remove any users above the threshold
    matrix_sample = matrix_sample[matrix_sample['user'] < matrix_users]
    # recalculate and resolve the indirection array
    unique_users = dataset.resolve_user_ids(matrix_name, matrix_sample[user_key].unique().tolist())

    # items may not be numbered from 0...num_items
    unique_items = matrix_sample[item_key].unique()
    matrix_sample = pd.merge(
        matrix_sample,
        pd.DataFrame(list(enumerate(unique_items)), columns=['item', item_key]),
        how='left',
        on=item_key
    )
    # remove any items above the threshold
    matrix_sample = matrix_sample[matrix_sample['item'] < matrix_items]
    # recalculate and resolve the indirection array
    unique_items = dataset.resolve_item_ids(matrix_name, matrix_sample[item_key].unique().tolist())

    # create sample by removing the extra columns that were added
    matrix_sample = matrix_sample[['user', 'item'] + matrix_config.table.columns]

    return matrix_sample, unique_users, unique_items

def create_dataset_sample(output_dir: str, dataset: src.fairreckitlib.data.set.dataset.Dataset, num_users: int, num_items: int) -> src.fairreckitlib.data.set.dataset.Dataset:

Create a sample of the specified dataset.

Look at the 'create_matrix_sample' function for specifics on how the matrices of the dataset are sampled. All tables, except the events, that are related to the user/item keys that are present in the sample matrices are sampled as well. The generated dataset sample is stored in a '<dataset name>-Sample' subdirectory of the output directory before returning it. This function raises an IOError when that sample directory already exists.

Args:
    output_dir: the path to the directory where the dataset sample will be stored.
    dataset: the dataset to create a sample of.
    num_users: the number of users in the created sample matrices.
    num_items: the number of items in the created sample matrices.

Returns:
    the resulting sample dataset.
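The directory check can be tried in isolation with the minimal sketch below. It is a stand-in rather than the library code: `prepare_sample_dir` and the dataset name `'demo'` are hypothetical, and only the `'-Sample'` suffix, the `os.path.isdir` test and the `IOError` are taken from the module source. The check applies to the `<name>-Sample` subdirectory, and because `os.mkdir` does not create missing parents, the output directory itself is assumed to exist.

```python
import os
import tempfile


def prepare_sample_dir(output_dir: str, dataset_name: str) -> str:
    """Create '<output_dir>/<dataset_name>-Sample' (hypothetical helper)."""
    sample_dir = os.path.join(output_dir, dataset_name + '-Sample')
    if os.path.isdir(sample_dir):
        raise IOError('Failed to create sample, directory already exists.')
    # os.mkdir does not create missing parents, so output_dir must exist
    os.mkdir(sample_dir)
    return sample_dir


with tempfile.TemporaryDirectory() as output_dir:
    # the first call creates the sample directory
    first = prepare_sample_dir(output_dir, 'demo')
    try:
        # the second call finds the directory and refuses to overwrite it
        prepare_sample_dir(output_dir, 'demo')
        second_call_failed = False
    except IOError:
        second_call_failed = True
```

A caller that wants to regenerate a sample would therefore need to remove or rename the existing `-Sample` directory first.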

def create_dataset_table_samples(output_dir: str, dataset: src.fairreckitlib.data.set.dataset.Dataset, key_id_map: Dict[str, List[int]]) -> Dict[str, src.fairreckitlib.data.set.dataset_config.DatasetTableConfig]:

Create table samples for the specified dataset and key map.

The key map is used to identify which tables of the dataset are sampled. A table is considered to be a candidate if a key in the map matches the primary key of the table. Rows whose primary key is not among the indices in the key map are filtered out.

Args:
    output_dir: the path to the directory where the sample tables will be stored.
    dataset: the dataset to create sample tables from.
    key_id_map: a dictionary containing a table key paired with a list of indices that are related to these table keys.

Returns:
    a dictionary with the resulting table sample configurations, keyed by table names.
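The candidate rule and the row filter can be illustrated with plain Python. This is a sketch under assumptions, not the library implementation: it uses a list of dictionaries in place of the pandas table and its `isin` filter, and `filter_table_rows`, the column names and the values are hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def filter_table_rows(
        rows: List[dict],
        primary_key: List[str],
        key_id_map: Dict[str, Sequence[int]]) -> Optional[List[dict]]:
    """Return the kept rows, or None when no key in the map matches."""
    for key_id, key_id_list in key_id_map.items():
        # only a single-column primary key equal to the map key qualifies
        if primary_key == [key_id]:
            wanted = set(key_id_list)
            return [row for row in rows if row[key_id] in wanted]
    return None


users = [
    {'user_id': 1, 'age': 30},
    {'user_id': 2, 'age': 41},
    {'user_id': 3, 'age': 25},
]
key_id_map = {'user_id': [1, 3], 'item_id': [10]}

# primary key ['user_id'] matches a map key, so rows are filtered
kept = filter_table_rows(users, ['user_id'], key_id_map)
# a composite primary key matches no single map key, so the table is skipped
skipped = filter_table_rows(users, ['user_id', 'item_id'], key_id_map)
```

Because the comparison is against the whole primary key list, a table with a composite primary key is never a candidate, even if one of its columns appears in the key map.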

def create_matrix_sample_config(output_dir: str, dataset: src.fairreckitlib.data.set.dataset.Dataset, matrix_name: str, num_users: int, num_items: int) -> Optional[src.fairreckitlib.data.set.dataset_config.DatasetMatrixConfig]:

Create a dataset matrix sample configuration.

Look at the 'create_matrix_sample' function for specifics on how the matrix is sampled. The generated matrix and user/item indirection arrays are stored in the output directory and the corresponding configuration is returned.

Args:
    output_dir: the path to the directory where the sample matrix will be stored.
    dataset: the dataset to create a sample matrix from.
    matrix_name: the name of the matrix to create a sample of.
    num_users: the number of users in the created sample matrix.
    num_items: the number of items in the created sample matrix.

Returns:
    the sample matrix configuration or None when the specified matrix does not exist.
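As a rough illustration of what ends up in the output directory, the sketch below reproduces the file naming and the rating range computation from the module source. The helper names `sample_file_names` and `rating_range`, the dataset name `'demo'` and the matrix name `'user-movie-rating'` are hypothetical; the name patterns themselves are copied from the source.

```python
from typing import Dict, List, Tuple


def sample_file_names(dataset_name: str, matrix_name: str) -> Dict[str, str]:
    """File names used for one sampled matrix (hypothetical helper)."""
    return {
        'matrix': dataset_name + '_' + matrix_name + '.tsv.bz2',
        'user_indices': matrix_name + '_user_indices.hdf5',
        'item_indices': matrix_name + '_item_indices.hdf5',
    }


def rating_range(ratings: List[float]) -> Tuple[float, float]:
    """Minimum and maximum rating of the sample, as floats."""
    return float(min(ratings)), float(max(ratings))


names = sample_file_names('demo', 'user-movie-rating')
low, high = rating_range([4, 1, 5, 3])
```

Since the range is recomputed from the sampled rows rather than copied from the original configuration, a sample can report a narrower rating range than the dataset it was taken from.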

def create_matrix_sample(dataset: src.fairreckitlib.data.set.dataset.Dataset, matrix_name: str, num_users: int, num_items: int) -> Tuple[pandas.core.frame.DataFrame, List[int], List[int]]:

Create a sample for the specified matrix.

Extracts a sample of the first occurring users and items until the specified amounts are reached; these amounts are only an indication. No additional users/items are generated when the dataset matrix has fewer available than specified. Moreover, due to the sparsity of the matrix, the resulting sample can end up very close to, but not exactly at, the specified amounts.

Args:
    dataset: the dataset to create a sample matrix from.
    matrix_name: the name of the matrix to create a sample of.
    num_users: the number of users in the created sample matrix.
    num_items: the number of items in the created sample matrix.

Returns:
    the sample matrix, the unique user indices and the unique item indices.
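The first-occurrence sampling can be sketched without pandas to show why the result may fall short of the requested amounts. This is a simplified stand-in under assumptions: rows are `(user key, item key, rating)` tuples, dictionaries replace the `enumerate` and merge steps, chunked reading and id resolution are left out, and `sample_first_occurring` is a hypothetical name.

```python
from typing import List, Tuple

Row = Tuple[int, int, float]  # (user key, item key, rating)


def sample_first_occurring(
        rows: List[Row],
        num_users: int,
        num_items: int) -> List[Row]:
    """Keep rows of the first num_users users, then the first num_items items."""
    # number users by first appearance and drop those past the threshold
    user_index = {}
    for user, _, _ in rows:
        user_index.setdefault(user, len(user_index))
    rows = [row for row in rows if user_index[row[0]] < num_users]

    # number the items that remain, again by first appearance
    item_index = {}
    for _, item, _ in rows:
        item_index.setdefault(item, len(item_index))
    rows = [row for row in rows if item_index[row[1]] < num_items]

    # re-key the sample with the compact 0-based indices
    return [(user_index[u], item_index[i], r) for u, i, r in rows]


ratings = [(10, 7, 4.0), (10, 8, 3.0), (20, 9, 5.0), (30, 7, 2.0)]
sample = sample_first_occurring(ratings, num_users=2, num_items=2)
users_in_sample = {user for user, _, _ in sample}
```

In this example two users are requested, but the second user only rated an item that falls past the item threshold, so the final sample contains a single user. That is the sparsity effect the description refers to.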