Text analytics is used to extract structured data from unstructured text sources like social media posts, reviews, emails and call center notes. It involves acquiring and preparing text data, then processing and analyzing it with algorithms like decision trees, Naive Bayes, support vector machines and k-nearest neighbors to extract terms, entities, concepts and sentiment. The results are then visualized to support data-driven decision making for applications like measuring customer opinions and providing search capabilities. Popular tools for text analytics include RapidMiner, KNIME, SPSS and R.
2. What is text analytics?
It is all about deriving high-quality structured
data for analysis from unstructured text.
3. Why is text analytics used?
It is used to measure customer opinions, product reviews and
feedback, and to provide search facilities, sentiment analysis and
entity modeling that support data-backed decision making.
4. What are the primary steps in text analytics?
Text acquisition and
preparation
Processing and analysis
Reporting
(visualization/presentation)
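The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the stop-word list, the sample reviews, and the function names `prepare`, `term_counts` and `report` are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration; real projects use a larger one.
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "to", "of", "my", "on"}

def prepare(text):
    """Preparation step: lowercase, tokenize, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def term_counts(documents):
    """Processing step: count how often each term occurs across documents."""
    counts = Counter()
    for doc in documents:
        counts.update(prepare(doc))
    return counts

def report(counts, top_n=3):
    """Reporting step: format the most frequent terms as printable lines."""
    return [f"{term}: {n}" for term, n in counts.most_common(top_n)]

# Made-up sample reviews standing in for acquired text data.
reviews = [
    "The flight was late and the staff was rude",
    "Rude staff, late flight, lost my luggage",
]
counts = term_counts(reviews)
for line in report(counts):
    print(line)
```

In practice the reporting step would feed a chart or dashboard rather than `print`, but the shape of the pipeline stays the same.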
5. For instance, social media chatter around
a brand can create a rapidly spiraling
impact (remember the post which showed a
Kentucky man being violently removed from
his United Airlines seat on an overbooked
flight? And how it led to a social media
disaster for the airline?).
6. In addition to social media data, other
examples include e-mail messages, call
center notes, and customer records.
10. Named entities
These are extracted to answer the ‘who’, ‘what’, or
‘where’. Some instances include name, location,
timestamp, or product.
11. Concept
These are extracted to answer the ‘about’ of a piece of
content. A concept describes the idea behind the content.
12. Sentiment
This is extracted to gauge the overall feeling around a
brand at the moment. The above United Airlines
example would (evidently) show negative sentiment, denoting
unhappy customers and potential business losses.
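One simple way to gauge sentiment is to count positive and negative words from a lexicon. The sketch below assumes a tiny, made-up lexicon (`POSITIVE`, `NEGATIVE`) and hypothetical helper names; real systems rely on much larger lexicons or on trained classifiers.

```python
import re

# Hypothetical mini-lexicons, chosen only for this illustration.
POSITIVE = {"good", "great", "love", "happy", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "unhappy", "violently", "disaster"}

def sentiment_score(text):
    """Return (#positive words - #negative words) for the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return pos - neg

def label(text):
    """Map the score to a coarse sentiment label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(label("A passenger was violently removed, what a disaster"))
print(label("Great crew, love this airline"))
```

Word counting ignores negation and sarcasm, which is why it only serves as a baseline.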
13. What type of tools/algorithms are used for text analytics?
Decision tree
Naive-Bayes
Support Vector Machine
K-nearest neighbours
Artificial Neural Networks
Fuzzy C-Means
LDA
14. Decision Trees
This is a classifier that repeatedly
splits data into smaller groups or
classes. It comes in handy for tasks
like classification or regression.
15. Popular algorithms in Decision trees
ID3: Iterative Dichotomiser 3 builds a decision tree
that splits data based on highest information gain
(the largest reduction in entropy) until every group has
homogeneous data.
C4.5: This algorithm also uses information gain and
entropy to classify data (just like ID3), normalizing
the gain into a gain ratio. Unlike ID3, it
accepts both continuous and discrete features and
handles incomplete data too.
CART: Classification and Regression Tree works much
like C4.5. One notable difference is that CART uses
Gini impurity (to assess the ‘purity’ or homogeneity of
the node) instead of the information gain/entropy used
by C4.5.
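The split criteria named above can be computed directly. The following sketch uses the standard definitions of entropy, information gain and Gini impurity; the function names and the spam/ham toy labels are assumptions for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def gini(labels):
    """Gini impurity: 0 for a pure node, higher for mixed nodes."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = len(parent)
    weighted = sum(len(c) / total * entropy(c) for c in children)
    return entropy(parent) - weighted

# Toy node with an even class mix, and two candidate splits.
parent = ["spam"] * 4 + ["ham"] * 4
perfect_split = [["spam"] * 4, ["ham"] * 4]
useless_split = [["spam", "spam", "ham", "ham"],
                 ["spam", "spam", "ham", "ham"]]

print(information_gain(parent, perfect_split))  # gain of the perfect split
print(information_gain(parent, useless_split))  # gain of the useless split
print(gini(parent))
```

A tree builder such as ID3 would evaluate every candidate feature this way and split on the one with the highest gain.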
16. Naive-Bayes
This is a popular technique to classify
text and documents into
categories (for example, whether to classify a
document as Sport or as Political
based on the occurrence of certain
words). It is a simple way to assign
class or category labels to instances
or cases.
17. Naive-Bayes
Rather than being a single distinct algorithm, it is a set of algorithms that work on
one underlying principle: given the class, the value of a given feature is assumed to be
independent of the value of any other feature.
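As a rough illustration of the Sport/Political example, here is a minimal multinomial Naive Bayes classifier with add-one (Laplace) smoothing. The training sentences and the names `train` and `predict` are invented for this sketch; it is one member of the Naive Bayes family, not the only variant.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (tokens, label). Returns the counts the model needs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict(tokens, model):
    """Pick the class with the highest log prior + summed log likelihoods."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for t in tokens:
            # Independence assumption: each word contributes its own factor.
            # Add-one smoothing keeps unseen words from zeroing the product.
            p = (word_counts[label][t] + 1) / (total_words + len(vocab))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training data for illustration only.
model = train([
    ("goal match team win".split(), "Sport"),
    ("team coach match".split(), "Sport"),
    ("election vote party".split(), "Political"),
    ("party minister vote law".split(), "Political"),
])
print(predict("match team".split(), model))
print(predict("vote law".split(), model))
```

Working in log space avoids numeric underflow when documents contain many words.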
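18. Support Vector Machines
This is a supervised machine learning
algorithm. It can be applied to
classification and regression
problems. Its essential component is the
kernel trick, which replaces dot products
between feature vectors with a kernel
function. This implicitly maps the data into
a higher-dimensional space where classes
that are not linearly separable in the
original features can be split by a linear
boundary.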
19. Applications of SVM
It is used in hypertext categorization, classification of images,
and facial recognition applications.
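The idea behind the kernel trick can be checked numerically: a kernel function returns the same value as a dot product in a higher-dimensional feature space, without ever building that space. The sketch below uses the degree-2 polynomial kernel on 2-D points as an example; the function names are illustrative, and this shows only the kernel identity, not a full SVM.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2-D."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, y):
    """Kernel K(x, y) = (x . y)^2, computed in the original 2-D space."""
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, 0.5)
explicit = dot(phi(x), phi(y))   # dot product in the 3-D feature space
implicit = poly_kernel(x, y)     # same value, no feature map needed
print(explicit, implicit)
```

For kernels such as the RBF kernel the corresponding feature space is infinite-dimensional, so computing the kernel directly is the only practical option.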
20. K Nearest Neighbors
k-NN is used in search tasks where
you are looking for something similar.
You determine similarity by creating
a vector representation of the items
and then comparing how similar or
dissimilar they are using a distance
metric like Euclidean distance.
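A minimal sketch of this similarity search follows. The catalog, its three made-up feature dimensions and the `k_nearest` helper are assumptions for illustration; real systems derive the vectors from data such as text embeddings or purchase history.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def k_nearest(query, items, k=2):
    """Return the names of the k items closest to the query vector."""
    ranked = sorted(items, key=lambda name: euclidean(query, items[name]))
    return ranked[:k]

# Hypothetical product vectors: (price level, sportiness, formality).
catalog = {
    "running shoes": (2.0, 9.0, 1.0),
    "trail shoes": (3.0, 8.0, 1.0),
    "dress shoes": (4.0, 1.0, 9.0),
    "loafers": (3.0, 2.0, 7.0),
}

query = (2.0, 8.8, 1.0)  # a shopper looking at something sporty
print(k_nearest(query, catalog, k=2))
```

Sorting the whole catalog is fine for a toy example; large catalogs usually need an index structure or approximate search.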
21. Applications of k-NN
The best example of k-NN’s prowess is an e-commerce site’s
product recommendation feature. You can also utilize k-NN to
do Concept Search (finding semantically similar documents).
22. Artificial Neural Networks
ANNs are primarily used for
classification with non-linear decision
boundaries. Loosely modeled on the working
of the human brain, an ANN passes its
inputs through hidden units (which
correspond to the neurons in the brain).
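XOR is a classic case where no single straight line separates the classes, yet a network with one hidden layer handles it. In the sketch below the weights are hand-picked for illustration rather than learned by training, and the step activation is a simplifying assumption.

```python
def step(z):
    """Threshold activation: the unit fires (1) when its input is positive."""
    return 1 if z > 0 else 0

def xor_network(x1, x2):
    """Two hidden units feed one output unit; weights are hand-set, not trained."""
    h_or = step(x1 + x2 - 0.5)    # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)   # fires only if both inputs are 1
    return step(h_or - h_and - 0.5)  # "OR but not AND" is XOR

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_network(a, b))
```

A trained network would find comparable weights automatically, typically with a smooth activation and gradient descent.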
24. Applications of ANN
Image compression, handwriting analysis, and stock exchange
movement prediction are some areas where ANNs are
useful.
25. Fuzzy C-Means
This is a useful form of clustering that
can add value when there are items
that can be a part of more than one
cluster. It works on the principle that
after the clustering is over, all items
in a cluster are as similar as possible
to each other.
26. Steps in Fuzzy C-Means
Pick: Pick a number of clusters into which the items can be categorized.
Assign: Assign each data point a coefficient for its membership in each cluster.
Repeat: Repeat until the change in the coefficients between two iterations is no more than the pre-defined sensitivity threshold.
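The pick, assign and repeat loop can be sketched for one-dimensional data as follows. This uses the standard fuzzy c-means update formulas with fuzzifier `m`; the function name, the explicit starting centers (the number of centers passed in is the number of clusters picked) and the sample points are assumptions for illustration.

```python
def fuzzy_c_means(points, centers, m=2.0, threshold=1e-4, max_iter=100):
    """1-D fuzzy c-means. Returns (centers, memberships[point][cluster])."""
    n, c = len(points), len(centers)
    memberships = [[0.0] * c for _ in range(n)]
    for _ in range(max_iter):
        # Assign: membership coefficient of every point in every cluster.
        new = []
        for x in points:
            dists = [abs(x - ctr) for ctr in centers]
            if 0.0 in dists:
                # Point sits exactly on a center: full membership there.
                hits = [1.0 if d == 0.0 else 0.0 for d in dists]
                row = [h / sum(hits) for h in hits]
            else:
                inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
                row = [v / sum(inv) for v in inv]
            new.append(row)
        # Largest coefficient change since the previous iteration.
        change = max(abs(a - b)
                     for old_row, new_row in zip(memberships, new)
                     for a, b in zip(old_row, new_row))
        memberships = new
        # Move each center to the membership-weighted mean of the points.
        centers = [
            sum(memberships[i][j] ** m * points[i] for i in range(n))
            / sum(memberships[i][j] ** m for i in range(n))
            for j in range(c)
        ]
        # Repeat until the coefficients stop changing by more than the threshold.
        if change <= threshold:
            break
    return centers, memberships

# Two tight groups plus one point (3.0) halfway between them.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 3.0]
centers, memberships = fuzzy_c_means(points, centers=[0.0, 6.0])
print(centers)
print(memberships[-1])  # the halfway point belongs partly to both clusters
```

The last point shows the benefit over hard clustering: it receives a split membership instead of being forced into one cluster.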
27. Applications of Fuzzy C-Means
Disciplines like Bioinformatics, healthcare, and economics
make use of fuzzy c-means with great success.
29. Primary steps in LDA
01 Provide an estimate of the potential number of topics
02 Algorithm assigns a word to a topic
03 Algorithm will check the accuracy of topic assignment in a loop
This helps in ensuring coherent topic clustering.
30. An example of LDA
Suppose there are three separate sentences.
1. I eat chicken and vegetables
2. Chickens are pets
3. My dog loves to eat chicken
With LDA, topic clustering for these 3 lines is done as follows:
• Sentence 1 = 100% Topic B
• Sentence 2 = 100% Topic A
• Sentence 3 = 33% Topic A and 67% Topic B
Now we infer that there are two clusters for sentence classification –
Pets (Topic A) and Food (Topic B).
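The LDA loop of assigning words to topics and then re-checking those assignments can be sketched with collapsed Gibbs sampling, which is one common way to fit LDA and not necessarily the method behind the percentages above. The function name, hyperparameters `alpha` and `beta`, the fixed seed and the stop-word-free tokens are all assumptions for this illustration.

```python
import random

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA. Returns per-document topic proportions."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []
    # Start by assigning every word to a random topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)
    # Loop: revisit each word and resample its topic given all the others.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + len(vocab) * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    # Smoothed share of each topic in each document.
    return [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]

# The three sentences, lowercased with common stop words removed by hand.
docs = [
    ["eat", "chicken", "vegetables"],
    ["chicken", "pets"],
    ["dog", "loves", "eat", "chicken"],
]
proportions = lda_gibbs(docs, num_topics=2)
for row in proportions:
    print([round(p, 2) for p in row])
```

On a corpus this small the proportions depend heavily on the seed and hyperparameters, so they will not necessarily match the illustrative percentages in the example.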
31. A pioneer in custom and large-scale web data extraction.
www.promptcloud.com | sales@promptcloud.com