Questions tagged [text-mining]
Refers to a subset of data mining concerned with extracting information from data in the form of text by recognizing patterns. The goal of text mining is often to classify a given document into one of a number of categories automatically, and to improve this performance dynamically, making it an example of machine learning. One example of this type of text mining is the spam filter used for email.
636 questions
1
vote
0
answers
59
views
Statistical test for author determination in corpus
How do I determine authorship in a corpus written by anonymous authors?
Edit: Determining the number of distinct authors of a corpus of letters and whether the result differs significantly from random.
2
votes
0
answers
41
views
Unsupervised clustering of short texts with covariates
I posted this on the Data Science Stack Exchange and didn’t get any responses (that site seems pretty dead). So I’m trying here!
I'm working on a project where I have to categorise short texts. I don'...
0
votes
1
answer
688
views
Using Word Embeddings in Clustering and Topic Modelling
I am new to the field of NLP and would appreciate any guidance please. I am trying to understand how word embeddings can be used in clustering and topic modelling. If I create word embeddings for ...
0
votes
0
answers
96
views
Adjusted TF-IDF where many terms appear in every document
Struggling with something so hoped the brilliant minds of the internet could help me out.
I have a large dataset of job postings from which I have extracted the skill demand (no. of times a skill is ...
1
vote
0
answers
96
views
Similarity between two text corpora
I have two separate text corpora, and I would like to understand whether they are similar or not using cosine similarity.
I'm not sure how to approach this problem, but I was thinking as a ...
1
vote
1
answer
542
views
How to avoid underflow of the sentence probability when calculating the perplexity of a corpus
I am looking at this post How to find the perplexity of a corpus. I understand the whole post, but
the probability of a sentence appearing in a corpus, in a unigram model,
is given by $p(s)=\prod_{i=1}^{n} p(w_i)$, ...
1
vote
0
answers
34
views
Processing natural language that describes time frequency with R
I'm dealing with data that describes the onset frequency of a symptom. The text in each cell is not in the same format. For example:
...
1
vote
0
answers
61
views
Does average or max pooling actually summarise the sentence?
I am working on a multi-label text classification problem at work and adapted the model architecture from this notebook from the Toxic Comment Classification challenge on Kaggle.
I have trained the model, a ...
2
votes
1
answer
108
views
Text similarity for badly written text
Consider the following scenario:
Suppose two lists of words $L_{1}$ and $L_{2}$ are given. $L_{1}$ contains just badly written phrases (like '4ge' instead of 'age' or 'blwe' instead of 'blue', etc.). On ...
1
vote
1
answer
87
views
How to extract numerical features that separate documents belonging to two different classes well?
I have a group of texts belonging to two different classes. I would like to extract numerical features that separate the two classes well.
Right now I have implemented a classic TF-IDF with a document ...
2
votes
0
answers
59
views
ML-generated word choice to create distinct "speakers" [closed]
How hard a project would it be to use ML to assist a single author/script writer in writing dialog where each "speaker" sounds like a distinct person?
Is that something that a professional ...
1
vote
3
answers
545
views
How to improve a language model (e.g. BERT) on text unseen in training?
I am using a pre-trained language model for binary classification. I fine-tune the model by training on data from my downstream task. The results are good, almost 98% F-measure.
However, when I remove a ...
1
vote
0
answers
87
views
How to statistically compare the frequencies of two different words in a single corpus
Suppose I have a large corpus of text data and I would like to compare the frequencies of words $w_1$ and $w_2$. How would I go about testing whether or not their respective frequencies, $f_1$ and $...
0
votes
1
answer
132
views
Search, rank and recommend in large text datasets
Imagine you are Spotify and you have billions of songs. Assume that each of these songs is transcribed into text. How do you design your search and recommendation pipeline such that when somebody ...
0
votes
1
answer
80
views
How to extract FSAs from postal codes when there is no match?
I would like to extract Canadian FSAs from unstructured data. I want to pull only the first instance of each match.
The problem: Some data don't include a postal code and my function won't produce the ...