Questions tagged [text-mining]

Refers to a subset of data mining concerned with extracting information from text by recognizing patterns. The goal of text mining is often to classify a given document automatically into one of a number of categories, and to improve that performance dynamically, which makes it an example of machine learning. Spam filters for email are one example of this type of text mining.

1 vote
0 answers
59 views

How do I determine authorship in a corpus written by anonymous authors? Edit: Determining the number of distinct authors of a corpus of letters and whether it differs significantly from random.
guest01's user avatar
  • 11
2 votes
0 answers
41 views

I posted this on the Data Science Stack Exchange and didn’t get any responses (that site seems pretty dead). So I’m trying here! I'm working on a project where I have to categorise short texts. I don'...
James's user avatar
  • 45
0 votes
1 answer
688 views

I am new to the field of NLP and would appreciate any guidance please. I am trying to understand how word embeddings can be used in clustering and topic modelling. If I create word embeddings for ...
osckt's user avatar
  • 31
0 votes
0 answers
96 views

Struggling with something so hoped the brilliant minds of the internet could help me out. I have a large dataset of job postings from which I have extracted the skill demand (no. of times a skill is ...
Dandae's user avatar
  • 3
1 vote
0 answers
96 views

I have two separate corpora of text, and I would like to understand whether they are similar using cosine similarity. I'm not sure how to approach this problem, but I was thinking as a ...
user373562's user avatar
1 vote
1 answer
542 views

I am looking at the post "How to find the perplexity of a corpus". I understand the whole post, but the probability of a sentence appearing in a corpus, in a unigram model, is given by $p(s)=\prod_{i=1}^{n} p(w_i)$, ...
Qqqq's user avatar
  • 13
1 vote
0 answers
34 views

I'm dealing with data that describe the onset frequency of a symptom. The text in each cell is not in a consistent format. For example: ...
Ian Wang's user avatar
1 vote
0 answers
61 views

I am working on a multi-label text classification problem at work and adapted the model architecture from this notebook from the Toxic Comment Classification challenge on Kaggle. I have trained the model, a ...
Naveen Reddy Marthala's user avatar
2 votes
1 answer
108 views

Consider the following scenario: Suppose two lists of words $L_{1}$ and $L_{2}$ are given. $L_{1}$ contains only badly written phrases (like '4ge' instead of 'age' or 'blwe' instead of 'blue', etc.). On ...
Ramiro Hum-Sah's user avatar
1 vote
1 answer
87 views

I have a group of texts belonging to two different classes. I would like to extract numerical features that separate the two classes well. So far I have implemented a classic TF-IDF with a document ...
inginging's user avatar
2 votes
0 answers
59 views

How hard a project would it be to use ML to assist a single author/script writer in writing dialog where each "speaker" sounds like a distinct person? Is that something that a professional ...
BCS's user avatar
  • 131
1 vote
3 answers
545 views

I am using a pre-trained language model for binary classification. I fine-tune the model by training on data from my downstream task. The results are good, with almost 98% F-measure. However, when I remove a ...
Injy Sarhan's user avatar
1 vote
0 answers
87 views

Suppose I have a large corpus of text data and I would like to compare the frequencies of words $w_1$ and $w_2$. How would I go about testing whether or not their respective frequencies, $f_1$ and $...
Joshua's user avatar
  • 11
0 votes
1 answer
132 views

Imagine you are Spotify and you have billions of songs. Assume that each of these songs is transcribed into text. How do you design your search and recommendation pipeline such that when somebody ...
mhsnk's user avatar
  • 317
0 votes
1 answer
80 views

I would like to extract Canadian FSAs from unstructured data. I want to pull only the first instance of each match. The problem: some data don't include a postal code and my function won't produce the ...
sometimes_r's user avatar
