neuralset.events.transforms.splitting.SimilaritySplitter
- class neuralset.events.transforms.splitting.SimilaritySplitter(*, extractor: BaseStatic, ratios: dict[str, float] = {'test': 0.25, 'train': 0.5, 'val': 0.25}, threshold: float = 0.2, norm_feats: bool = True)
Splits events by clustering them on the similarity of their static extractor embeddings. The class applies agglomerative clustering to precomputed event embeddings so that identical and similar events stay in the same split, which avoids data leakage between splits.
- Parameters:
extractor (BaseStatic) – A static feature extraction model that defines the type of event and provides methods to extract embeddings from events.
ratios (dict[str, float]) – A dictionary mapping each split name to the proportion of events assigned to it. The ratios must sum to 1.
threshold (float) – The threshold for the distance used in the agglomerative clustering. Events with a distance below this threshold are grouped into clusters.
norm_feats (bool) – If True, the extractor embeddings are normalized before computing the cosine similarity matrix.
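
The idea behind the parameters above can be illustrated with a minimal, standard-library-only sketch. It is not the neuralset implementation: the helper names (`cosine_distance`, `cluster_by_threshold`, `assign_splits`) are hypothetical, single-linkage merging is assumed because the linkage used by the class is not stated here, and the greedy assignment of whole clusters to splits is only one plausible way to honor `ratios`.

```python
import math
from collections import defaultdict


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def cluster_by_threshold(embeddings, threshold=0.2):
    """Single-linkage clustering (assumed linkage): items closer than
    `threshold` end up in the same cluster, tracked with union-find."""
    parent = list(range(len(embeddings)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine_distance(embeddings[i], embeddings[j]) < threshold:
                parent[find(i)] = find(j)

    clusters = defaultdict(list)
    for i in range(len(embeddings)):
        clusters[find(i)].append(i)
    return list(clusters.values())


def assign_splits(clusters, ratios):
    """Place whole clusters, largest first, into the split that is
    currently furthest below its target size (hypothetical strategy)."""
    if not math.isclose(sum(ratios.values()), 1.0):
        raise ValueError("ratios must sum to 1")
    total = sum(len(c) for c in clusters)
    counts = {name: 0 for name in ratios}
    assignment = {}
    for cluster in sorted(clusters, key=len, reverse=True):
        name = max(ratios, key=lambda n: ratios[n] * total - counts[n])
        counts[name] += len(cluster)
        for idx in cluster:
            assignment[idx] = name
    return assignment


# Toy embeddings: four pairs of near-duplicate 2-D vectors.
embeddings = [
    [1.0, 0.0], [0.99, 0.1],
    [0.0, 1.0], [0.1, 0.99],
    [-1.0, 0.0], [-0.99, -0.1],
    [0.0, -1.0], [0.1, -0.99],
]
ratios = {"test": 0.25, "train": 0.5, "val": 0.25}

clusters = cluster_by_threshold(embeddings, threshold=0.2)
assignment = assign_splits(clusters, ratios)
```

Because clusters are assigned as a unit, near-duplicate events can never straddle two splits; the trade-off is that the realized split sizes only approximate `ratios` when clusters are large or uneven.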