Making It Easier to Understand Sentiments

Sentiment Analysis is the field concerned with using software to understand people's emotions: what they are saying, how they are saying it, and what they actually mean. It is very helpful for businesses that want to understand the sentiments of their customers, and it remains an active research area in the field of text mining.
In Sentiment Analysis, there are a few steps for detecting sentiments, which include:
1) Division of the dataset into train and test sets (a quick sketch of this step follows the list)
2) Pre-processing of the dataset
3) Feature extraction from the dataset
4) Classification of the dataset
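As a rough sketch of step 1, scikit-learn's train_test_split can divide the data. The tiny corpus and the 80/20 ratio below are placeholder assumptions, not values from this post:

```python
from sklearn.model_selection import train_test_split

# Placeholder tweets and labels (1 = positive, 0 = negative); real data
# would come from the collected tweet dataset.
texts = ["i loved my kindle", "this course feels wrong",
         "you will love your kindle", "this movie is not good"]
labels = [1, 0, 1, 0]

# Hold out a test set; the post does not state a split ratio,
# so a common 80/20 split is assumed here.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42)
```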

Pre-processing is the first step in text classification, and choosing the right pre-processing techniques can improve classification effectiveness. We evaluate the pre-processing techniques on the classification accuracy they yield and the number of features they produce. We find that techniques like lemmatization, removing numbers, and replacing contractions improve accuracy, while others like removing punctuation do not. We have used the following techniques for pre-processing:

This technique removes all non-ASCII characters and certain Unicode code points like "\u002c" and "\x06". It cleans the dataset by removing such unusable characters, which reduces the noise present in the dataset and improves accuracy.
Eg: “\u0006OMG !! ! @stellargirl I loved my Kindle” is converted to “OMG !! ! @stellargirl I loved my Kindle”
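A minimal sketch of this step in Python (the function name strip_non_ascii is my own, not from this post):

```python
import re

def strip_non_ascii(text):
    # Remove every character outside the printable ASCII range, which
    # also drops stray control characters such as "\x06".
    return re.sub(r"[^\x20-\x7e]", "", text)

print(strip_non_ascii("\x06OMG !! ! @stellargirl I loved my Kindle"))
# OMG !! ! @stellargirl I loved my Kindle
```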
This technique replaces usernames with AT_USER and URLs with URL. The words present in a username or URL may read as positive or negative and may affect classification accuracy. To avoid this, we replace every username and every URL with the common tokens AT_USER and URL respectively.
Eg: “\u0006OMG !! ! @stellargirl I loved my Kindle” is converted to “OMG !! ! AT_USERstellargirl I loved my Kindle”
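One possible implementation with regular expressions. Note that this sketch replaces the whole handle with AT_USER, while the example output above keeps the handle text after the token; the exact behavior is an assumption:

```python
import re

def replace_handles_and_urls(text):
    text = re.sub(r"(?:https?://|www\.)\S+", "URL", text)  # any URL -> URL
    text = re.sub(r"@\w+", "AT_USER", text)                # any @username -> AT_USER
    return text

print(replace_handles_and_urls("OMG !! ! @stellargirl I loved my Kindle http://example.com/x"))
# OMG !! ! AT_USER I loved my Kindle URL
```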
This technique expands abbreviations, i.e., converts them to their full forms. As many people use informal words, this technique is important: an unexpanded abbreviation may go undetected or be interpreted differently from its actual meaning, which affects accuracy. Thus this is an important pre-processing step for getting accurate output.
Eg: “OMG !! ! AT_USERstellargirl I loved my Kindle” is converted to “Oh My God !! ! AT_USERstellargirl I loved my Kindle”
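A dictionary lookup is one simple way to do this; the entries below are illustrative, since the post does not list its full abbreviation mapping:

```python
# Tiny illustrative abbreviation dictionary.
ABBREVIATIONS = {"omg": "Oh My God", "lol": "Laughing Out Loud"}

def expand_abbreviations(text):
    return " ".join(ABBREVIATIONS.get(w.lower(), w) for w in text.split())

print(expand_abbreviations("OMG !! ! AT_USERstellargirl I loved my Kindle"))
# Oh My God !! ! AT_USERstellargirl I loved my Kindle
```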
This technique expands contractions like won’t → will not, shouldn’t → should not, isn’t → is not, etc. This step is important because the classifier will not recognize these contractions as negatives, which would hurt the classification process and reduce accuracy. Thus it is necessary to replace such contractions to increase accuracy.
Eg: “AT_USER You’ll love your Kindle2.” is converted to “AT_USER you shall / you will love your Kindle2”.
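A sketch of the same idea, again with an illustrative map; a full implementation would cover many more contractions:

```python
import re

CONTRACTIONS = {"won't": "will not", "shouldn't": "should not",
                "isn't": "is not", "you'll": "you shall / you will"}

def replace_contractions(text):
    pattern = re.compile("|".join(re.escape(k) for k in CONTRACTIONS), re.IGNORECASE)
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(replace_contractions("AT_USER You'll love your Kindle2."))
# AT_USER you shall / you will love your Kindle2.
```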
This technique removes all the numbers present in the dataset. Numeric values in tweets are of no use for classifying positive or negative sentiments, so they are removed in this pre-processing step.
Eg: “AT_USER you shall / you will love your Kindle2.” is converted to “AT_USER you shall / you will love your Kindle.”.
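This step can be a one-line regular expression:

```python
import re

def remove_numbers(text):
    return re.sub(r"\d+", "", text)  # delete every run of digits

print(remove_numbers("AT_USER you shall / you will love your Kindle2."))
# AT_USER you shall / you will love your Kindle.
```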
This technique replaces repeated punctuation with “multi(punctuation name)”. Repeated punctuation marks can affect classification, so before punctuation is removed they are replaced with a multi(punctuation) token.
Eg: “Oh My God !! ! AT_USER I loved my Kindle.” is converted to “Oh My God multiExclamation ! AT_USER I loved my Kindle.”
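A possible implementation; only the exclamation mark appears in the post’s example, so the other token names are assumptions:

```python
import re

PUNCT_NAMES = {"!": "Exclamation", "?": "Question", ".": "Stop"}

def mark_repeated_punctuation(text):
    # Two or more of the same mark in a row become one multi<Name> token.
    return re.sub(r"([!?.])\1+", lambda m: "multi" + PUNCT_NAMES[m.group(1)], text)

print(mark_repeated_punctuation("Oh My God !! ! AT_USER I loved my Kindle."))
# Oh My God multiExclamation ! AT_USER I loved my Kindle.
```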
Dealing with negations (like “not good”) is a critical step in Sentiment Analysis. A negation word can influence the tone of all the words around it, and ignoring negations is one of the main causes of misclassification. In this phase, all negative constructs (can’t, don’t, isn’t, never, etc.) are rewritten so that the explicit token “not” appears. This technique enriches the classifier model with many negation bigram constructs that would otherwise be excluded due to their low frequency.
Eg: The sentence “This movie isn’t good for family” will be changed to “This movie is not good for family”.
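A sketch of this phase, using a small illustrative subset of the negative constructs (the expansion chosen for “never” is my own):

```python
import re

NEGATIONS = {"can't": "can not", "don't": "do not",
             "isn't": "is not", "never": "not ever"}

def handle_negations(text):
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in NEGATIONS) + r")\b",
                         re.IGNORECASE)
    return pattern.sub(lambda m: NEGATIONS[m.group(0).lower()], text)

print(handle_negations("This movie isn't good for family"))
# This movie is not good for family
```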
In this pre-processing technique, we remove punctuation marks like (, . ! : ;). The main reasoning is that the machine only needs words to learn from for prediction, so punctuation marks are not required. For some sentiments, removing punctuation increases prediction accuracy, but for others it decreases it: for example, an exclamation mark may signal an intensely positive or negative sentiment, and removing it loses that signal.
Eg: The sentence “i don’t understand i really don’t. this course feels wrong, hospital radio isn’t right, and i’m not happy.” will be changed to “i don’t understand i really don’t this course feels wrong hospital radio isn’t right and i’m not happy”.
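In Python this can be done with str.translate; the sketch assumes contractions were already expanded in the earlier step:

```python
import string

def remove_punctuation(text):
    # Drop every mark listed in string.punctuation.
    return text.translate(str.maketrans("", "", string.punctuation))

print(remove_punctuation("this course feels wrong, hospital radio is not right."))
# this course feels wrong hospital radio is not right
```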
This technique reliably increases accuracy. All characters are converted to lowercase letters. In reviews, people do not write sentences with perfect consistency; some characters are uppercase and some lowercase, which affects both learning and prediction accuracy. Therefore we first convert all sentences, from both the test and train datasets, to lowercase.
Eg: The sentence “I spilled milk all up in my Macbook.” will be changed to “i spilled milk all up in my macbook.”.
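This step needs nothing beyond Python’s built-in str.lower:

```python
def to_lowercase(text):
    return text.lower()

print(to_lowercase("I spilled milk all up in my Macbook."))
# i spilled milk all up in my macbook.
```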
Stopwords are function words that appear in sentences with high frequency, like it, this, and the. These words are needless for sentiment analysis because they do not contain any fruitful information for either learning or prediction. Removing stopwords will not increase accuracy, but it improves storage management, since sentences without stopwords require less space. The stopword list is not predefined; it can be changed by removing words or adding more to it.
Eg: The sentence “It’s time you changed direction! This is the answer! It’ll blow your socks off!” will be changed to “time changed direction ! is answer ! blow socks off!”.
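A sketch using NLTK’s English stopword list (assumes nltk is installed and the corpus downloaded):

```python
from nltk.corpus import stopwords  # needs: nltk.download("stopwords")

# The list is not fixed; entries can be added or removed to suit the task.
STOP_WORDS = set(stopwords.words("english"))

def remove_stopwords(text):
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)

print(remove_stopwords("this is the answer"))
# answer
```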
Stemming is the process of reducing inflected words to their root forms, mapping a group of words to the same stem even if the stem itself is not a valid word in the language. Stemming helps to recover the root forms of derived words.
Eg: Playing, Plays, and Played can all be stemmed to “Play”, as this is the root form.
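NLTK’s Porter stemmer reproduces this example (it lowercases by default):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["Playing", "Plays", "Played"]:
    print(word, "->", stemmer.stem(word))
# Playing -> play
# Plays -> play
# Played -> play
```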
Out of the many available supervised machine learning and deep learning algorithms, one algorithm can be chosen from each of the four most used categories: the Generalized Linear Models (GLM), the Naïve Bayes (NB), the Support Vector Machines (SVM), and the Neural Networks (NN). From the GLM family we choose the Logistic Regression algorithm, from NB we choose Bernoulli Naïve Bayes, and from the SVMs we choose the Linear SVC algorithm.

Logistic Regression is a popular algorithm that, despite its name, belongs to the Generalized Linear Models family; it is also known as Maximum Entropy. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function. The previous studies of Lin, Mao, and Zeng (2017) and Wu, Huang, and Yuan (2017) used Logistic Regression for sentiment classification in microblogging.
Naïve Bayes algorithms are the simplest probabilistic classification algorithms and are widely used in Sentiment Analysis. They are based on Bayes’ Theorem and assume complete independence between the variables. The Bernoulli algorithm is a variant of Naïve Bayes in which the weight of a term is 1 if it appears in the sentence and 0 if it does not. Its difference from Boolean Naïve Bayes is that it also takes into account terms that do not appear in the sentence. It is a fast algorithm that deals well with high dimensionality.
SVMs are among the most popular machine learning methods for linear classification problems (Cherkassky, 1997). They try to find a set of hyperplanes that separate the space into dimensions representing classes. These hyperplanes are chosen so as to maximize the distance from the nearest data point of each class. Linear SVC is the simplest and fastest SVM algorithm, and it assumes a linear separation between classes.
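A minimal scikit-learn sketch of the three chosen classifiers side by side; the tiny corpus, binary features, and default hyperparameters are placeholder assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import LinearSVC

# Placeholder corpus; real experiments would use the pre-processed tweets.
texts = ["i loved my kindle", "this movie is not good for family",
         "you will love your kindle", "this course feels wrong"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

X = CountVectorizer(binary=True).fit_transform(texts)  # 0/1 weights suit BernoulliNB

for clf in (LogisticRegression(), BernoulliNB(), LinearSVC()):
    clf.fit(X, labels)
    print(type(clf).__name__, clf.predict(X))
```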