Unanswered Questions
323 questions with no upvoted or accepted answers
8
votes
0
answers
51
views
How does SciPy's linkage() calculate centroid from pairwise distances?
I am learning about hierarchical clustering from SciPy's linkage documentation (which is much more understandable than the Wikipedia page).
Some of the cluster ...
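For readers wondering how a centroid distance can be obtained without coordinates: one standard route is the Lance-Williams update for centroid linkage, which works on squared Euclidean distances. The sketch below is only an illustration of that identity, not a statement of how SciPy's linkage() is implemented; the function name `centroid_update` and the sample points are hypothetical.

```python
import math

def centroid_update(d_ki, d_kj, d_ij, n_i, n_j):
    """Distance from cluster k to the centroid of the merged cluster (i + j),
    computed only from pairwise Euclidean distances (Lance-Williams form)."""
    n = n_i + n_j
    squared = (
        (n_i / n) * d_ki ** 2
        + (n_j / n) * d_kj ** 2
        - (n_i * n_j / n ** 2) * d_ij ** 2
    )
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(squared, 0.0))

# Hypothetical singleton clusters, used only to check the identity.
i, j, k = (0.0, 0.0), (4.0, 0.0), (1.0, 3.0)

# Direct computation: build the merged centroid from coordinates.
merged_centroid = ((i[0] + j[0]) / 2, (i[1] + j[1]) / 2)
direct = math.dist(k, merged_centroid)

# Same quantity from pairwise distances alone.
via_distances = centroid_update(
    math.dist(k, i), math.dist(k, j), math.dist(i, j), n_i=1, n_j=1
)
```

The identity holds only for Euclidean distances, which is why centroid linkage is normally described as well defined for Euclidean input only.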
6
votes
0
answers
48
views
Extract flat clusters from hierarchy: Which "criterion" makes most logical sense for single-linkage?
I am planning to use SciPy's fcluster for hierarchical clustering with single linkage. Clusters can be long and meandering.
In extracting a flat ...
6
votes
0
answers
134
views
Fixed-radius range search in non-Euclidean space
I'm trying to find an indexing data structure most suitable for my metric space:
set of IP network related data (IP addresses, ports, TCP flags, ...),
distance function is continuous, non-Euclidean ...
6
votes
3
answers
386
views
Anomaly detection using clustering of highly correlated Categorical data
My data has two columns, and both are highly correlated: e.g., if column1 has the value ABC, column2 should be XYZ, i.e. ABC-->XYZ. If column2 has anything else, it's an anomaly. Likewise, there are thousands ...
5
votes
1
answer
368
views
Clustering time series based on monotonic similarity
Context
I am working on clustering 1500 time series of 500 observations each into a few clusters. The time series all share the same observed properties at different spatial locations, but ...
3
votes
0
answers
66
views
Finding clusters in sales data and predicting future sales based on those
I have monthly sales data from a set of online merchants that sell on an online shop using a cloud-based software solution. The data look something like this:
month
merchant_id
shop_id
shop_country
...
3
votes
0
answers
50
views
Scalable Clustering Strategies for 300M Address Variants: Validation and Deduplication
I need to cluster 300 million unstructured addresses for validation, ensuring variants (e.g., "55 Tower F. EST City" vs. "Tower F 55, EST City, SINGA ROAD") map to a group similar ...
3
votes
0
answers
44
views
Clustering metric: why is clustering accuracy (ACC) not as popular as ARI?
I have samples with their ground-truth (GT) clusters.
I want to measure the success of different clustering algorithms.
It seems that, when the GT is available, it is popular to use ARI (Adjusted Rand Index).
I saw there ...
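To make the two metrics in this question concrete, here is a minimal pure-Python sketch of both. It assumes the usual definitions: ARI from the pair-counting contingency table, and ACC as the best accuracy over one-to-one mappings of predicted labels to ground-truth labels. The brute-force matching is an illustrative simplification (practical implementations typically use the Hungarian algorithm), and the function names are hypothetical.

```python
from collections import Counter
from itertools import permutations
from math import comb

def adjusted_rand_index(truth, pred):
    """ARI computed from the contingency table of (truth, pred) label pairs."""
    n = len(truth)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

def clustering_accuracy(truth, pred):
    """Best accuracy over all one-to-one label mappings.
    Brute force, so only suitable for a handful of clusters."""
    true_labels = list(dict.fromkeys(truth))
    pred_labels = list(dict.fromkeys(pred))
    # Pad with None so surplus predicted clusters can stay unmatched.
    slots = true_labels + [None] * max(0, len(pred_labels) - len(true_labels))
    best = 0
    for perm in permutations(slots, len(pred_labels)):
        mapping = dict(zip(pred_labels, perm))
        hits = sum(mapping[p] == t for t, p in zip(truth, pred))
        best = max(best, hits)
    return best / len(truth)

truth = [0, 0, 0, 1, 1, 1]
pred = [1, 1, 0, 0, 0, 0]  # cluster ids are arbitrary; one point is misplaced
acc = clustering_accuracy(truth, pred)
ari = adjusted_rand_index(truth, pred)
```

Both scores equal 1.0 when the predicted clustering matches the ground truth up to a relabeling. They differ in that ARI is corrected for chance agreement, while ACC is not.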
3
votes
0
answers
263
views
Cluster tabular data with text in some columns
Let's say I have the following features in my dataframe:
user_id | user_age | is_student | is_graduate | salary | resume
integer | integer | binary | binary | integer | text (up to 1000 symbols)
And also a few more ...
3
votes
0
answers
184
views
Clustering large set of images
I've got some big datasets of images (a few million each), and I would like to cluster them according to images' visual similarities. I've extracted a feature vector for each image; the space of ...
3
votes
1
answer
967
views
How to compare topics generated from topic modeling from different datasets?
I have two datasets of a similar theme. Let's assume Dataset A and Dataset B. Using the top2vec model (https://github.com/ddangelov/Top2Vec) (https://arxiv.org/abs/2008.09470) on each dataset, I came ...
3
votes
0
answers
196
views
Clustering similar sequences using hidden markov model
I have several sequences of different lengths. For example,
...
3
votes
2
answers
2k
views
Clustering mixed data types - numeric, categorical, arrays, and text
I have a dataset with 4 types of data columns:
...
3
votes
4
answers
367
views
How to decide whom to market to? Clustering or Decision Tree?
I am working with a dataset that has enough observations and ~ 10 variables,
half of the variables are numeric
another half of the variables are categorical with 2-3 levels (demographics)
one ID ...
3
votes
1
answer
67
views
Visualizing the difference of a set of strings
I have a distance metric on a collection of strings on the order of tens of thousands. What would be an intuitive way to summarize how 'different' these strings are or when they overlap?
My goal is, ...