Sentiment Analysis

I- What is Data Science?



DS.png
First of all, data science is a method of producing actionable intelligence from data using math, statistics, programming, and business expertise. Like any scientific method, it involves gathering data, identifying a problem, forming a hypothesis, and running tests. More specifically, data scientists follow a process of gathering and cleaning data (wrangling), investigation (exploratory data analysis), building automation using machine learning (feature engineering, model development, and deployment), delivering results (visualizations, reporting, storytelling), and maintenance. Practitioners typically spend roughly 70-80% of their time on wrangling and exploration, about 20% on machine learning models, and the small remainder on maintenance. Most importantly, this whole process should result in a valuable action or insight for the end-user, i.e. a business or customer!

1- Descriptive Statistics



  • Descriptive statistics are used to describe the basic features of the data in a study. They provide simple summaries about the sample and the measures. Together with simple graphics analysis, they form the basis of virtually every quantitative analysis of data.
  • Descriptive Statistics are used to present quantitative descriptions in a manageable form. In a research study we may have lots of measures. Or we may measure a large number of people on any measure. Descriptive statistics help us to simplify large amounts of data in a sensible way.
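As a quick, hypothetical illustration of descriptive statistics in Python (the sample values below are made up):

import pandas as pd

# A made-up sample of exam scores
scores = pd.Series([72, 85, 90, 66, 78, 95, 81])
print(scores.mean())      # central tendency
print(scores.median())
print(scores.std())       # dispersion around the mean
print(scores.describe())  # count, mean, std, min, quartiles, max in one call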

2- Inferential Statistics



  • With inferential statistics, we are trying to reach conclusions that extend beyond the immediate data alone. For instance, we use inferential statistics to try to infer from the sample data what the population might think. Or, we use inferential statistics to make judgments of the probability that an observed difference between groups is a dependable one, or one that might have happened by chance in this study. Thus, we use inferential statistics to make inferences from our data to more general conditions.
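As a minimal sketch of this idea, a two-sample t-test with SciPy estimates how likely an observed difference between two groups is to have happened by chance (the group measurements below are hypothetical):

import numpy as np
from scipy import stats

# Hypothetical measurements from two independent groups
group_a = np.array([5.1, 4.9, 5.3, 5.0, 5.2])
group_b = np.array([4.6, 4.8, 4.5, 4.7, 4.9])

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)  # a small p-value suggests the difference is dependable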

3- Data Mining

data_mining.png

Data mining is the process of finding anomalies, patterns, and correlations within large data sets to predict outcomes. Using a broad range of techniques, you can use this information to increase revenues, cut costs, improve customer relationships, reduce risks, and more. In short, data mining is about finding trends in a data set and using those trends to identify future patterns.

Data Mining vs. Data Science



  • Data Mining is an activity which is a part of a broader Knowledge Discovery in Databases (KDD) Process while Data Science is a field of study just like Applied Mathematics or Computer Science.
  • Often Data Science is looked upon in a broad sense while Data Mining is considered a niche.
  • Some activities under Data Mining such as statistical analysis, writing data flows and pattern recognition can intersect with Data Science. Hence, Data Mining becomes a subset of Data Science.
  • Machine Learning in Data Mining is used more in pattern recognition while in Data Science it has a more general use.

    | Basis for comparison | Data Mining | Data Science |
    | --- | --- | --- |
    | What is it? | A technique | An area |
    | Focus | Business process | Scientific study |
    | Goal | Make data more usable | Building data-centric products for an organization |
    | Output | Patterns | Varied |
    | Purpose | Finding previously unknown trends | Social analysis, building predictive models, unearthing unknown facts, and more |
    | Vocational perspective | Someone who can navigate data and has a statistical understanding can conduct data mining | A data scientist needs to understand Machine Learning, programming, infographic techniques, and the domain |
    | Extent | Can be a subset of Data Science, as mining activities are part of the Data Science pipeline | Multidisciplinary: consists of data visualization, computational social sciences, statistics, data mining, natural language processing, et cetera |
    | Deals with (the type of data) | Mostly structured | All forms of data: structured, semi-structured, and unstructured |
    | Other, less popular names | Data Archaeology, Information Harvesting, Information Discovery, Knowledge Extraction | Data-driven Science |

4- Machine Learning (overview)



AI.png

  • Machine learning (ML) is the study of computer algorithms that improve automatically through experience. It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or infeasible to develop conventional algorithms to perform the needed tasks.
  • Machine learning is closely related to computational statistics, which focuses on making predictions using computers.
  • The discipline of machine learning employs various approaches to teach computers to accomplish tasks where no fully satisfactory algorithm is available. In cases where vast numbers of potential answers exist, one approach is to label some of the correct answers as valid. This can then be used as training data for the computer to improve the algorithm(s) it uses to determine correct answers.

II- Machine Learning



machine-learning.png

In [1]:
import IPython
IPython.display.IFrame('https://www.mindmeister.com/maps/public_map_shell/1626104926/machine-learning-algorithms?width=850&height=450&z=auto&presentation=1',width=850,height=450)

1- Supervised



Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal).

supervised.png

a- Regression

Linear regression:

Linear regression is a linear approach to modeling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables). The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression. This term is distinct from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable.

regressioncurv.png
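A minimal scikit-learn sketch of simple linear regression, on made-up data:

import numpy as np
from sklearn.linear_model import LinearRegression

# One explanatory variable (simple linear regression); the values are hypothetical
X = np.array([[1], [2], [3], [4], [5]])   # shape (n_samples, n_features)
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])  # scalar response

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # fitted slope and intercept
print(model.predict([[6]]))           # prediction for a new observation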

Polynomial regression:

Polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modeled as an nth-degree polynomial in x. Polynomial regression fits a nonlinear relationship between the value of x and the corresponding conditional mean of y, denoted E(y | x). Although polynomial regression fits a nonlinear model to the data, as a statistical estimation problem it is linear, in the sense that the regression function E(y | x) is linear in the unknown parameters that are estimated from the data. For this reason, polynomial regression is considered to be a special case of multiple linear regression.

polynomial-reg.png
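Because the model is linear in its parameters, polynomial regression can be sketched as multiple linear regression on expanded features; a minimal example on made-up data:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data with a curved trend
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1.2, 4.1, 9.3, 15.8, 25.2])

# Expand x into [1, x, x^2]; the regression stays linear in the coefficients
X_poly = PolynomialFeatures(degree=2).fit_transform(X)
model = LinearRegression().fit(X_poly, y)
print(model.coef_, model.intercept_)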

b- Classification



Classification is the process of categorizing a given set of data into classes. It can be performed on either structured or unstructured data. The process starts with predicting the class of given data points; the classes are often referred to as targets, labels, or categories.

Classification predictive modeling is the task of approximating the mapping function from input variables to discrete output variables. The main goal is to identify which class/category new data will fall into.

classification.png

Logistic regression:

Logistic regression is a statistical model that, in its basic form, uses a logistic function to model a binary dependent variable, although many more complex extensions exist. In regression analysis, logistic regression (or logit regression) estimates the parameters of a logistic model (a form of binary regression).

Mathematically, a binary logistic model has a dependent variable with two possible values, such as pass/fail, represented by an indicator variable whose two values are labeled "0" and "1". In the logistic model, the log-odds (the logarithm of the odds) for the value labeled "1" is a linear combination of one or more independent variables ("predictors"); each independent variable can be binary (two classes, coded by an indicator variable) or continuous (any real value). The corresponding probability of the value labeled "1" can vary between 0 (certainly the value "0") and 1 (certainly the value "1"), hence the labeling; the function that converts log-odds to probability is the logistic function, hence the name. The unit of measurement for the log-odds scale is called a logit, from "logistic unit", hence the alternative names.

Analogous models with a different sigmoid function instead of the logistic function can also be used, such as the probit model. The defining characteristic of the logistic model is that increasing one of the independent variables multiplicatively scales the odds of the given outcome at a constant rate, with each independent variable having its own parameter; for a binary dependent variable this generalizes the odds ratio.

logistic-reg.png
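A minimal scikit-learn sketch of a binary logistic model (the pass/fail data below is made up):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied vs. pass (1) / fail (0)
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)
print(model.predict([[3.5]]))        # predicted class label
print(model.predict_proba([[3.5]]))  # probability of each class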

Decision Tree:

A decision tree is a flowchart-like structure in which each internal node represents a "test" on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label (the decision taken after computing each attribute's entropy and information gain). The paths from root to leaf represent classification rules.

In decision analysis, a decision tree and the closely related influence diagram are used as a visual and analytical decision support tool, where the expected values (or expected utility) of competing alternatives are calculated.

decision-tree.png
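A small sketch with scikit-learn's DecisionTreeClassifier on the classic iris dataset; export_text prints the learned tests and leaf labels as classification rules:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Each internal node is a test on an attribute; each leaf is a class label
print(export_text(tree, feature_names=list(iris.feature_names)))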

2- Unsupervised



Unsupervised learning is a type of machine learning that looks for previously undetected patterns in a data set with no pre-existing labels. When a dataset is provided without labels, the model learns useful properties of the dataset's structure and comes up with patterns or conclusions from the unlabeled data.

unsupervised.png

a- Clustering

k-means:

k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (the cluster center or centroid), which serves as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. It is popular for cluster analysis in data mining. k-means clustering minimizes within-cluster variances (squared Euclidean distances), but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances. Better Euclidean solutions can be found using, for instance, k-medians and k-medoids.

k-means.gif
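A minimal k-means sketch with scikit-learn, on made-up 2-D points:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical points forming two loose groups
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment for each observation
print(kmeans.cluster_centers_)  # the centroids (cluster means)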

hierarchical:

Hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm that groups similar objects into groups called clusters. The endpoint is a set of clusters, where each cluster is distinct from each other cluster, and the objects within each cluster are broadly similar to each other.

hierarch.gif
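A minimal hierarchical (agglomerative) clustering sketch with SciPy, on the same kind of made-up points:

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical points; linkage() builds the merge hierarchy bottom-up
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

Z = linkage(X, method='ward')                    # agglomerative merges
labels = fcluster(Z, t=2, criterion='maxclust')  # cut the tree into 2 clusters
print(labels)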

3- Natural Language Processing (NLP)

Natural Language Processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.

Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.

Natural-Language-Processing.png

III- Sentiment Analysis

1- Principal Data Cleaning Process (NLTK)

The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical Natural Language Processing (NLP) for English, written in the Python programming language. NLTK includes graphical demonstrations and sample data, and is accompanied by a book that explains the underlying concepts behind the language processing tasks the toolkit supports.

In this project, we are going to use the re (regular expressions) and NLTK libraries to perform data cleaning.
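As a preview, a minimal sketch of those cleaning steps on a made-up tweet (the full clear/clear1 functions appear in the implementation section below):

import re
from nltk.corpus import stopwords  # assumes nltk.download('stopwords') has run

text = "@user check https://example.com for news!"  # hypothetical tweet
text = re.sub(r'@[A-Za-z0-9]+', '', text)           # drop @mentions
text = re.sub(r'https?://\S+', '', text)            # drop hyperlinks
words = [w for w in text.split() if w.lower() not in stopwords.words('english')]
print(words)  # ['check', 'news!']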

2- Algorithms Implemented

Naive Bayes:

ML1.png

ML2.png

ML3.png

ML4.png

ML5.png

ML6.png
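In code terms, the multinomial Naive Bayes classifier used later in this notebook applies Bayes' theorem with a conditional-independence assumption over word counts; a tiny sketch on made-up count vectors:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up word-count vectors for 4 documents over a 3-word vocabulary
X = np.array([[2, 1, 0],
              [3, 0, 0],
              [0, 1, 3],
              [0, 2, 2]])
y = np.array([0, 0, 1, 1])  # class label of each document

clf = MultinomialNB().fit(X, y)
print(clf.predict([[1, 0, 2]]))        # most probable class
print(clf.predict_proba([[1, 0, 2]]))  # posterior P(class | counts)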

IV- Project Implementation

In [2]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Toggle Code"></form>''')

Importing Libraries

In [3]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import string
import pickle
#!pip install nltk
#!pip install WordCloud
#!pip install sklearn
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from wordcloud import WordCloud
from sklearn.feature_extraction.text import CountVectorizer
from jupyterthemes import jtplot
jtplot.style(theme='monokai', context='notebook', ticks=True, grid=False) 
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\salzk\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
In [4]:
language='english'

Reading Data

In [5]:
tweets_df=pd.read_csv("train.csv")
tweets_df.drop(['id'],axis=1,inplace=True)

Visualizing the First 20 Rows

In [6]:
tweets_df.head(20)
Out[6]:
label tweet
0 0 @user when a father is dysfunctional and is s...
1 0 @user @user thanks for #lyft credit i can't us...
2 0 bihday your majesty
3 0 #model i love u take with u all the time in ...
4 0 factsguide: society now #motivation
5 0 [2/2] huge fan fare and big talking before the...
6 0 @user camping tomorrow @user @user @user @use...
7 0 the next school year is the year for exams.ðŸ˜...
8 0 we won!!! love the land!!! #allin #cavs #champ...
9 0 @user @user welcome here ! i'm it's so #gr...
10 0 ↝ #ireland consumer price index (mom) climb...
11 0 we are so selfish. #orlando #standwithorlando ...
12 0 i get to see my daddy today!! #80days #getti...
13 1 @user #cnn calls #michigan middle school 'buil...
14 1 no comment! in #australia #opkillingbay #se...
15 0 ouch...junior is angry😐#got7 #junior #yugyo...
16 0 i am thankful for having a paner. #thankful #p...
17 1 retweet if you agree!
18 0 its #friday! 😀 smiles all around via ig use...
19 0 as we all know, essential oils are not made of...
In [7]:
tweets_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31964 entries, 0 to 31963
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   label   31964 non-null  int64 
 1   tweet   31964 non-null  object
dtypes: int64(1), object(1)
memory usage: 499.6+ KB
In [8]:
tweets_df.describe()
Out[8]:
label
count 31964.000000
mean 0.070204
std 0.255494
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000

Looking for Null Data

In [9]:
sns.heatmap(tweets_df.isnull(),yticklabels= False , cbar=False ,cmap="Blues")
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x2092570cd60>

Positive vs. Negative Sentences

In [10]:
sns.countplot(tweets_df["label"])
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x2092579fe50>

Word Cloud

In [11]:
def to_word_cloud(sentences):
    sentences = sentences.tolist()
    sentences = " ".join(sentences)  # join with spaces so words at tweet boundaries don't merge
    plt.figure(figsize=(20, 20))
    plt.imshow(WordCloud().generate(sentences))

    
to_word_cloud(tweets_df['tweet'])
In [12]:
positive = tweets_df[tweets_df['label']==0]["tweet"]
negative = tweets_df[tweets_df['label']==1]["tweet"]

Positive Word Cloud

In [13]:
to_word_cloud(positive)

Negative Word Cloud

In [14]:
to_word_cloud(negative)

Data Cleaning

In [15]:
def clear(sentence):
    sentence_clean = re.sub(r'@[A-Za-z0-9]+', '', sentence)  # Removing @mentions
    sentence_clean = re.sub(r'https?://\S+', '', sentence_clean)  # Removing hyperlinks
    sentence_clean = [char for char in sentence_clean if char not in string.punctuation]  # Removing punctuation
    sentence_clean = "".join(sentence_clean)
    sentence_clean = [word for word in sentence_clean.split() if word.lower() not in stopwords.words(language)]  # Removing stopwords
    return sentence_clean  # a list of cleaned tokens

def clear1(sentence):
    # Same cleaning steps as clear(), but returns a single string instead of a token list
    return " ".join(clear(sentence))
In [16]:
clear1("https://salah-zkara.codes/ @user #Search engines maintain lists of words, called 'stop words', which they consider unimportant.")
Out[16]:
'Search engines maintain lists words called stop words consider unimportant'
In [17]:
tweets_df_clean = tweets_df['tweet'].apply(clear)
In [18]:
print(tweets_df_clean[1])
['thanks', 'lyft', 'credit', 'cant', 'use', 'cause', 'dont', 'offer', 'wheelchair', 'vans', 'pdx', 'disapointed', 'getthanked']
In [19]:
print(tweets_df['tweet'][1])
@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx.    #disapointed #getthanked
In [20]:
positive_clear = positive.apply(clear1)

Positive Word Cloud after the Data Cleaning Process

In [21]:
to_word_cloud(positive_clear)
In [22]:
negative_clear = negative.apply(clear1)

Negative Word Cloud after the Data Cleaning Process

In [23]:
to_word_cloud(negative_clear)

Count Vectorization (Tokenization)

tokenization.png

In [24]:
sample_data = ['This is the first paper.','This paper is the second paper.','And this is the third one.','Is this the first paper?']
vectorizer = CountVectorizer()
aa = vectorizer.fit_transform(sample_data)
In [25]:
print(vectorizer.get_feature_names())
['and', 'first', 'is', 'one', 'paper', 'second', 'the', 'third', 'this']
In [26]:
print(aa.toarray().shape)
(4, 9)
In [27]:
vectorizer = CountVectorizer(analyzer = clear, dtype = 'uint8')
#analyzer: adding a stage before we apply a countvectorization
tweets_countvectorizer = vectorizer.fit_transform(tweets_df['tweet'])
In [28]:
X = tweets_countvectorizer.toarray()
y = tweets_df['label']
In [29]:
print("X shape: ",X.shape)
print("y shape: ",y.shape)
X shape:  (31964, 47325)
y shape:  (31964,)
In [30]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
In [31]:
from sklearn.naive_bayes import MultinomialNB

NB_classifier = MultinomialNB()
NB_classifier.fit(X_train, y_train)
Out[31]:
MultinomialNB()
In [32]:
'''outfile = open('classifier.pickle','wb')
pickle.dump(NB_classifier,outfile)
outfile.close()
infile = open('classifier.pickle','rb')
NB_classifier = pickle.load(infile)
infile.close()'''
Out[32]:
"outfile = open('classifier.pickle','wb')\npickle.dump(NB_classifier,outfile)\noutfile.close()\ninfile = open('classifier.pickle','rb')\nNB_classifier = pickle.load(infile)\ninfile.close()"

Confusion Matrix

ML7.png

In [33]:
from sklearn.metrics import classification_report, confusion_matrix
# Predicting the Test set results
y_predict_test = NB_classifier.predict(X_test)
cm = confusion_matrix(y_test, y_predict_test)
sns.heatmap(cm, annot=True)
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x2092598fa30>

Report

In [34]:
print(classification_report(y_test, y_predict_test))
              precision    recall  f1-score   support

           0       0.97      0.97      0.97      5938
           1       0.59      0.57      0.58       455

    accuracy                           0.94      6393
   macro avg       0.78      0.77      0.78      6393
weighted avg       0.94      0.94      0.94      6393

Testing our Model with Real Data

In [35]:
unique_words=vectorizer.get_feature_names()
In [36]:
'''outfile = open('unique_words.pickle','wb')
pickle.dump(unique_words,outfile)
outfile.close()
infile = open('unique_words.pickle','rb')
unique_words = pickle.load(infile)
infile.close()'''
Out[36]:
"outfile = open('unique_words.pickle','wb')\npickle.dump(unique_words,outfile)\noutfile.close()\ninfile = open('unique_words.pickle','rb')\nunique_words = pickle.load(infile)\ninfile.close()"
In [37]:
def Tokenization(sentence):
    # Build a word -> column-index map once instead of scanning the whole
    # vocabulary for every word
    word_index = {word: i for i, word in enumerate(unique_words)}
    A = np.zeros(len(unique_words), dtype='uint8')
    for word in sentence:
        if word in word_index:
            A[word_index[word]] = 1  # binary presence flag (CountVectorizer produces counts)
    return A.reshape(1, -1)  # shape (1, vocabulary size), as the classifier expects
In [38]:
def Tokenization_df(df):
    word_index = {word: i for i, word in enumerate(unique_words)}
    L = []
    for tokens in df:
        A = np.zeros(len(unique_words), dtype='uint8')
        for word in tokens:
            if word in word_index:
                A[word_index[word]] = 1  # binary presence flag per tweet
        L.append(A)
    return np.array(L, dtype='uint8')
In [39]:
test="at work: attorneys for white officer who shot #philandocastile remove black judge from presiding over trial."
test=clear(test)
A=Tokenization(test)
In [40]:
y_predict_test = NB_classifier.predict(A)
In [41]:
y_predict_test[0]
Out[41]:
1

Testing our Model with a Large Dataset

In [42]:
tweets_df_test=pd.read_csv("test.csv")
tweets_df_test.drop(['id'],axis=1,inplace=True)
tweets_df_test.head(10)
Out[42]:
tweet
0 #studiolife #aislife #requires #passion #dedic...
1 @user #white #supremacists want everyone to s...
2 safe ways to heal your #acne!! #altwaystohe...
3 is the hp and the cursed child book up for res...
4 3rd #bihday to my amazing, hilarious #nephew...
5 choose to be :) #momtips
6 something inside me dies 💦💿✨ eyes nes...
7 #finished#tattoo#inked#ink#loveit❤️ #❤ï¸...
8 @user @user @user i will never understand why...
9 #delicious #food #lovelife #capetown mannaep...
In [43]:
tweets_test = tweets_df_test['tweet'].apply(clear)
In [44]:
X1=Tokenization_df(tweets_test)
In [45]:
X1.shape
Out[45]:
(17197, 47325)
In [46]:
y_predict_test = NB_classifier.predict(X1)
In [47]:
tweets_df_test["label"]=np.transpose(y_predict_test,axes=0)
tweets_df_test[tweets_df_test["label"]==0].head(10)
Out[47]:
tweet label
0 #studiolife #aislife #requires #passion #dedic... 0
1 @user #white #supremacists want everyone to s... 0
2 safe ways to heal your #acne!! #altwaystohe... 0
3 is the hp and the cursed child book up for res... 0
4 3rd #bihday to my amazing, hilarious #nephew... 0
5 choose to be :) #momtips 0
6 something inside me dies 💦💿✨ eyes nes... 0
7 #finished#tattoo#inked#ink#loveit❤️ #❤ï¸... 0
8 @user @user @user i will never understand why... 0
9 #delicious #food #lovelife #capetown mannaep... 0
In [48]:
tweets_df_test[tweets_df_test["label"]==1].head(10)
Out[48]:
tweet label
19 thought factory: bbc neutrality on right wing ... 1
33 suppo the #taiji fisherman! no bullying! no ra... 1
42 @user @user trumps invested billions into saud... 1
81 @user .@user @user @user @user &lt;--- no more... 1
110 hey @user - a $14000 ivanka bracelet? do you f... 1
141 you might be a libtard if... #libtard #sjw #l... 1
159 #people aren't protesting #trump because a #re... 1
160 at work: attorneys for white officer who shot... 1
164 @user trump's long history of explained 1970'... 1
292 over 30,000 arrests daily in us;over 160,000 p... 1


Supervisor: Guezaz Azidine