
Unanswered Questions

323 questions with no upvoted or accepted answers
8 votes
0 answers
51 views

How does SciPy's linkage() calculate centroid from pairwise distances?

I am learning about hierarchical clustering from SciPy's linkage documentation (which is much more understandable than the Wikipedia page). Some of the cluster ...
6 votes
0 answers
48 views

Extract flat clusters from hierarchy: Which "criterion" makes most logical sense for single-linkage?

I am looking ahead to using SciPy's fcluster to hierarchically cluster according to the single-linkage. Clusters can be long and meandering. In extracting a flat ...
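For readers unfamiliar with the workflow this question refers to, here is a minimal sketch, not an answer to which criterion is best. It builds a single-linkage hierarchy with SciPy's `linkage` and cuts it with `fcluster` using `criterion="distance"`. The toy points and the 1.5 threshold are assumptions made up for illustration; they do not come from the question.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy data: two chains of points on a line, far apart from each other.
points = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])

# Build the hierarchy with single linkage (nearest-neighbor merging).
Z = linkage(points, method="single")

# Cut the dendrogram at an assumed cophenetic-distance threshold of 1.5:
# observations whose merge distance is at most 1.5 end up in the same flat cluster.
labels = fcluster(Z, t=1.5, criterion="distance")
n_clusters = len(set(labels))
print(labels, n_clusters)
```

With `criterion="distance"`, a chain stays together as long as consecutive neighbors are within the threshold, which is one reason it is often paired with single linkage; other criteria such as `"maxclust"` are selected with the same call.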
6 votes
0 answers
134 views

Fixed-radius range search in non-Euclidean space

I'm trying to find the indexing data structure most suitable for my metric space: a set of IP-network-related data (IP addresses, ports, TCP flags, ...); the distance function is continuous, non-Euclidean ...
6 votes
3 answers
386 views

Anomaly detection using clustering of highly correlated Categorical data

My data has two columns, and both are highly correlated: e.g., if column1 has value ABC, column2 should be XYZ, i.e., ABC-->XYZ. If column2 has anything else, it's an anomaly. Likewise, there are thousands ...
5 votes
1 answer
368 views

Clustering time series based on monotonic similarity

Context: I am working on clustering 1500 time series of 500 observations each into a few clusters. The time series all share the same observed properties at different spatial locations, but ...
3 votes
0 answers
66 views

Finding clusters in sales data and predicting future sales based on those

I have monthly sales data from a set of online merchants that sell on an online shop using a cloud-based software solution. The data look something like this: month merchant_id shop_id shop_country ...
3 votes
0 answers
50 views

Scalable Clustering Strategies for 300M Address Variants: Validation and Deduplication

I need to cluster 300 million unstructured addresses for validation, ensuring variants (e.g., "55 Tower F. EST City" vs. "Tower F 55, EST City, SINGA ROAD") map to a group similar ...
3 votes
0 answers
44 views

Clustering metric - why is clustering accuracy (ACC) not as popular as ARI?

I have samples with their GT clusters. I want to measure the success of different clustering algorithms. It seems that, when GT is available, it is popular to use ARI (Adjusted Rand Index). I saw there ...
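As background for the ACC metric named in this question, here is a hedged sketch of one common definition: the best one-to-one matching between predicted clusters and ground-truth classes, found with the Hungarian algorithm via SciPy's `linear_sum_assignment`. The function name `clustering_accuracy` and the toy labels are hypothetical, and the asker may have a different definition of ACC in mind.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def clustering_accuracy(true_labels, pred_labels):
    """Fraction of samples matched under the best cluster-to-class mapping."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)

    # Re-index arbitrary label values to 0..k-1.
    true_ids, true_idx = np.unique(true_labels, return_inverse=True)
    pred_ids, pred_idx = np.unique(pred_labels, return_inverse=True)

    # contingency[i, j] = number of samples in predicted cluster i and true class j.
    contingency = np.zeros((len(pred_ids), len(true_ids)), dtype=np.int64)
    np.add.at(contingency, (pred_idx, true_idx), 1)

    # Hungarian algorithm: one-to-one assignment maximizing the matched count.
    rows, cols = linear_sum_assignment(contingency, maximize=True)
    return contingency[rows, cols].sum() / len(true_labels)


# Hypothetical labels: cluster ids are permuted and one sample is misplaced.
gt = [0, 0, 0, 1, 1, 1]
pred = [1, 1, 0, 0, 0, 0]
acc = clustering_accuracy(gt, pred)
print(acc)
```

Unlike ARI, this score requires solving an assignment problem and is not corrected for chance agreement, which may be relevant to the comparison the question raises.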
3 votes
0 answers
263 views

Cluster tabular data with text in some columns

Let's say I have the following features in my dataframe: user_id (integer), user_age (integer), is_student (binary), is_graduate (binary), salary (integer), resume (text, up to 1000 symbols). And also a few more ...
3 votes
0 answers
184 views

Clustering large set of images

I've got some big datasets of images (a few million each), and I would like to cluster them according to images' visual similarities. I've extracted a feature vector for each image; the space of ...
3 votes
1 answer
967 views

How to compare topics generated from topic modeling from different datasets?

I have two datasets on a similar theme; call them Dataset A and Dataset B. Using the top2vec model (https://github.com/ddangelov/Top2Vec) (https://arxiv.org/abs/2008.09470) on each dataset, I came ...
3 votes
0 answers
196 views

Clustering similar sequences using hidden Markov model

I have several sequences of different lengths. For example, ...
3 votes
2 answers
2k views

Clustering mixed data types - numeric, categorical, arrays, and text

I have a dataset with 4 types of data columns: ...
3 votes
4 answers
367 views

How to decide whom to market to? Clustering or decision tree?

I am working with a dataset that has enough observations and ~10 variables: half of the variables are numeric, the other half are categorical with 2-3 levels (demographics), one ID ...
3 votes
1 answer
67 views

Visualizing the difference of a set of strings

I have a distance metric on a collection of tens of thousands of strings. What would be an intuitive way to summarize how 'different' these strings are or when they overlap? My goal is, ...
