First of all, data science is a method of deriving actionable intelligence from data using mathematics, statistics, programming, and business expertise. Like any scientific method, it involves gathering data, identifying a problem, forming a hypothesis, and running tests. More specifically, data scientists follow a process of gathering and cleaning data (wrangling), investigation (exploratory data analysis), building automation with machine learning (feature engineering, model development, and deployment), delivering results (visualizations, reporting, storytelling), and maintenance. Practitioners typically spend 70-80% of their time on wrangling and exploration, with most of the remainder going to machine learning models and a small share to maintenance. Most importantly, this whole process should result in a valuable action or insight for the end user, i.e. a business or customer!
Data mining is the process of finding anomalies, patterns, and correlations within large data sets in order to predict outcomes. Using a broad range of techniques, you can use this information to increase revenue, cut costs, improve customer relationships, reduce risk, and more. In short, data mining is about finding the trends in a data set and using those trends to identify future patterns.
| Basis for comparison | Data Mining | Data Science |
| --- | --- | --- |
| What is it? | A technique | A field |
| Focus | Business processes | Scientific study |
| Goal | Make data more usable | Building data-centric products for an organization |
| Output | Patterns | Varied |
| Purpose | Finding previously unknown trends | Social analysis, building predictive models, unearthing unknown facts, and more |
| Vocational perspective | Someone who can navigate data and has a statistical grounding can conduct data mining | A data scientist needs machine learning, programming, and data-visualization skills, plus domain knowledge |
| Extent | Data mining can be seen as a subset of data science, since mining activities are part of the data science pipeline | Multidisciplinary: data science encompasses data visualization, computational social science, statistics, data mining, natural language processing, et cetera |
| Deals with (type of data) | Mostly structured | All forms of data: structured, semi-structured, and unstructured |
| Other, less popular names | Data archaeology, information harvesting, information discovery, knowledge extraction | Data-driven science |
from IPython.display import IFrame

# Embed an interactive mind map of machine learning algorithms
IFrame('https://www.mindmeister.com/maps/public_map_shell/1626104926/machine-learning-algorithms?width=850&height=450&z=auto&presentation=1',
       width=850, height=450)
Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal).
Linear regression is a linear approach to modeling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables). The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression. This term is distinct from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable.
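As a minimal sketch of simple linear regression with scikit-learn (the same library used later in this notebook; the data here is synthetic, for illustration only):

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data following y = 3x + 4 plus noise (illustration only)
rng = np.random.RandomState(0)
X = rng.rand(100, 1) * 10                 # one explanatory variable -> simple linear regression
y = 3 * X.ravel() + 4 + rng.randn(100)

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)      # recovered slope and intercept, close to 3 and 4
print(model.predict([[5.0]]))             # predicted response at x = 5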
Polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modeled as an nth-degree polynomial in x. Polynomial regression fits a nonlinear relationship between the value of x and the corresponding conditional mean of y, denoted E(y | x). Although polynomial regression fits a nonlinear model to the data, as a statistical estimation problem it is linear, in the sense that the regression function E(y | x) is linear in the unknown parameters that are estimated from the data. For this reason, polynomial regression is considered to be a special case of multiple linear regression.
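Because the model is linear in its parameters, polynomial regression can be fit by expanding x into polynomial features and running ordinary linear regression on them; a minimal sketch with scikit-learn on synthetic data:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Synthetic cubic relationship (illustration only)
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-2, 2, size=(50, 1)), axis=0)
y = 0.5 * X.ravel() ** 3 - X.ravel() + rng.randn(50) * 0.1

# Expand x into [1, x, x^2, x^3], then fit a linear model on those columns
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(X, y)
print(model.predict([[1.5]]))             # estimated E(y | x = 1.5) under the fitted cubic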
Classification is the process of categorizing a given set of data into classes. It can be performed on both structured and unstructured data. The process starts with predicting the class of given data points; the classes are often referred to as targets, labels, or categories.
Classification predictive modeling is the task of approximating a mapping function from input variables to discrete output variables. The main goal is to identify which class/category new data will fall into.
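As one concrete example of such a mapping (a k-nearest-neighbours classifier on scikit-learn's built-in iris dataset; a sketch, not part of the project code below):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)                      # 3 flower classes, 4 numeric features
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(clf.predict(X_test[:5]))                         # discrete class labels for new data
print(clf.score(X_test, y_test))                       # accuracy on held-out data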
Logistic regression is a statistical model that, in its basic form, uses a logistic function to model a binary dependent variable, although many more complex extensions exist. In regression analysis, logistic regression (or logit regression) estimates the parameters of a logistic model (a form of binary regression). Mathematically, a binary logistic model has a dependent variable with two possible values, such as pass/fail, represented by an indicator variable whose two values are labeled "0" and "1".

In the logistic model, the log-odds (the logarithm of the odds) for the value labeled "1" is a linear combination of one or more independent variables ("predictors"); each independent variable can be binary (two classes, coded by an indicator variable) or continuous (any real value). The corresponding probability of the value labeled "1" can vary between 0 (certainly the value "0") and 1 (certainly the value "1"), hence the labeling; the function that converts log-odds to probability is the logistic function, hence the name. The unit of measurement for the log-odds scale is called a logit, from "logistic unit", hence the alternative names.

Analogous models with a different sigmoid function instead of the logistic function can also be used, such as the probit model. The defining characteristic of the logistic model is that increasing one of the independent variables multiplicatively scales the odds of the given outcome at a constant rate, with each independent variable having its own parameter; for a binary dependent variable, this generalizes the odds ratio.
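A minimal binary example with scikit-learn's LogisticRegression, using the built-in breast-cancer dataset (illustration only):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)             # binary target labeled 0/1
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print(clf.predict_proba(X_test[:3]))                   # probabilities produced by the logistic function
print(clf.coef_[:, :3])                                # a few per-predictor log-odds coefficients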
A decision tree is a flowchart-like structure in which each internal node represents a "test" on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label (the decision reached after evaluating the attributes, e.g. by entropy and information gain). The paths from root to leaf represent classification rules.
In decision analysis, a decision tree and the closely related influence diagram are used as a visual and analytical decision support tool, where the expected values (or expected utility) of competing alternatives are calculated.
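A small sketch with scikit-learn's DecisionTreeClassifier, using the entropy criterion mentioned above and printing the learned root-to-leaf rules:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, criterion='entropy').fit(iris.data, iris.target)

# Each printed path from root to leaf is one classification rule
print(export_text(tree, feature_names=iris.feature_names))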
Unsupervised learning is a type of machine learning that looks for previously undetected patterns in a data set with no pre-existing labels. When a dataset is provided without labels, the model learns useful properties of its structure and comes up with patterns or conclusions from the unlabeled data.
k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (the cluster center or centroid), which serves as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. It is popular for cluster analysis in data mining. k-means clustering minimizes within-cluster variances (squared Euclidean distances), but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances. For instance, better Euclidean solutions can be found using k-medians and k-medoids.
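A quick sketch of k-means with scikit-learn on two synthetic blobs:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Two synthetic clusters, one around (0, 0) and one around (5, 5)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 5])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)            # the two learned centroids (cluster means)
print(kmeans.predict([[0, 0], [5, 5]]))   # each point is assigned to its nearest centroid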
Hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm that groups similar objects into groups called clusters. The endpoint is a set of clusters, where each cluster is distinct from each other cluster, and the objects within each cluster are broadly similar to each other.
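A minimal sketch of agglomerative (bottom-up) hierarchical clustering with scikit-learn, on the same kind of synthetic blobs:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2), rng.randn(20, 2) + 5])

# Start with every point in its own cluster and repeatedly merge
# the closest pair until only two clusters remain
agg = AgglomerativeClustering(n_clusters=2, linkage='ward').fit(X)
print(agg.labels_)                        # cluster assignment per observation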
Natural Language Processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.
Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.
The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English, written in the Python programming language.
NLTK includes graphical demonstrations and sample data. It is accompanied by a book that explains the underlying concepts behind the language processing tasks supported by the toolkit.
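As a quick taste of the toolkit before the project code below (a sketch only; tokenization requires the 'punkt' resource):

import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "Search engines maintain lists of words they consider unimportant."
tokens = word_tokenize(text.lower())
filtered = [w for w in tokens if w not in stopwords.words('english')]
print(filtered)                            # content words with stopwords removed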
In this project, we are going to use the re (regular expressions) and NLTK libraries to perform data cleaning.
from IPython.display import HTML
HTML('''<script>
code_show = true;
function code_toggle() {
    if (code_show) {
        $('div.input').hide();
    } else {
        $('div.input').show();
    }
    code_show = !code_show;
}
$(document).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Toggle Code"></form>''')
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import string
import pickle

#!pip install nltk
#!pip install wordcloud
#!pip install scikit-learn
import re
import nltk
nltk.download('stopwords')   # stopword lists used during cleaning
from nltk.corpus import stopwords
from wordcloud import WordCloud
from sklearn.feature_extraction.text import CountVectorizer

from jupyterthemes import jtplot
jtplot.style(theme='monokai', context='notebook', ticks=True, grid=False)

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

language = 'english'   # language for the NLTK stopword list
tweets_df = pd.read_csv("train.csv")
tweets_df.drop(['id'], axis=1, inplace=True)   # the id column carries no predictive signal
tweets_df.head(20)

tweets_df.info()
tweets_df.describe()

# Check for missing values (a blank heatmap means none)
sns.heatmap(tweets_df.isnull(), yticklabels=False, cbar=False, cmap="Blues")

# Class balance: label 0 = positive tweets, label 1 = negative tweets
sns.countplot(x="label", data=tweets_df)
def to_word_cloud(sentences):
    # Join all tweets into one string; use a space so words from
    # adjacent tweets don't run together
    sentences = " ".join(sentences.tolist())
    plt.figure(figsize=(20, 20))
    plt.imshow(WordCloud().generate(sentences))
    plt.axis('off')

to_word_cloud(tweets_df['tweet'])
# Label 0 marks positive tweets; label 1 marks negative tweets
positive = tweets_df[tweets_df['label'] == 0]["tweet"]
negative = tweets_df[tweets_df['label'] == 1]["tweet"]

to_word_cloud(positive)
to_word_cloud(negative)
def clear(sentence):
    sentence_clean = re.sub(r'@[A-Za-z0-9]+', '', sentence)        # remove @mentions
    sentence_clean = re.sub(r'https?://\S+', '', sentence_clean)   # remove hyperlinks
    sentence_clean = "".join(char for char in sentence_clean
                             if char not in string.punctuation)    # remove punctuation
    return [word for word in sentence_clean.split()
            if word.lower() not in stopwords.words(language)]      # remove stopwords

def clear1(sentence):
    # Same cleaning as clear(), but returns a single string rather than a token list
    return " ".join(clear(sentence))
clear1("https://salah-zkara.codes/ @user #Search engines maintain lists of words, called 'stop words', which they consider unimportant.")
tweets_df_clean = tweets_df['tweet'].apply(clear)

# Compare a cleaned tweet with its original
print(tweets_df_clean[1])
print(tweets_df['tweet'][1])
positive_clear = positive.apply(clear1)
to_word_cloud(positive_clear)
negative_clear = negative.apply(clear1)
to_word_cloud(negative_clear)
# A toy corpus to show how CountVectorizer builds a bag-of-words matrix
sample_data = ['This is the first paper.',
               'This paper is the second paper.',
               'And this is the third one.',
               'Is this the first paper?']
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(sample_data)
print(vectorizer.get_feature_names_out())   # the learned vocabulary (get_feature_names() on scikit-learn < 1.0)
print(counts.toarray().shape)               # (n_documents, n_unique_words)
# analyzer=clear plugs the cleaning step in before counting;
# uint8 keeps the resulting dense matrix small in memory
vectorizer = CountVectorizer(analyzer=clear, dtype='uint8')
tweets_countvectorizer = vectorizer.fit_transform(tweets_df['tweet'])

X = tweets_countvectorizer.toarray()
y = tweets_df['label']
print("X shape: ", X.shape)
print("y shape: ", y.shape)
from sklearn.model_selection import train_test_split
# Hold out 20% of the data for evaluation (fixed seed for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

from sklearn.naive_bayes import MultinomialNB
NB_classifier = MultinomialNB()
NB_classifier.fit(X_train, y_train)
'''outfile = open('classifier.pickle','wb')
pickle.dump(NB_classifier,outfile)
outfile.close()
infile = open('classifier.pickle','rb')
NB_classifier = pickle.load(infile)
infile.close()'''
from sklearn.metrics import classification_report, confusion_matrix

# Predict the test set results and evaluate
y_predict_test = NB_classifier.predict(X_test)
cm = confusion_matrix(y_test, y_predict_test)
sns.heatmap(cm, annot=True, fmt='d')   # fmt='d' shows raw counts instead of scientific notation
print(classification_report(y_test, y_predict_test))
unique_words = vectorizer.get_feature_names_out()   # use get_feature_names() on scikit-learn < 1.0
'''outfile = open('unique_words.pickle','wb')
pickle.dump(unique_words,outfile)
outfile.close()
infile = open('unique_words.pickle','rb')
unique_words = pickle.load(infile)
infile.close()'''
def Tokenization(sentence):
    # Map one cleaned tweet (a token list) onto the training vocabulary as a
    # single binary presence vector shaped (1, n_unique_words)
    word_index = {word: i for i, word in enumerate(unique_words)}
    A = np.zeros(len(unique_words), dtype='uint8')
    for word in sentence:
        if word in word_index:
            A[word_index[word]] = 1
    return A.reshape(1, -1)
def Tokenization_df(df):
    # Vectorize a whole Series of cleaned tweets against the training
    # vocabulary, one binary presence row per tweet
    word_index = {word: i for i, word in enumerate(unique_words)}
    L = []
    for tokens in df:
        A = np.zeros(len(unique_words), dtype='uint8')
        for word in tokens:
            if word in word_index:
                A[word_index[word]] = 1
        L.append(A)
    return np.array(L, dtype='uint8')
test="at work: attorneys for white officer who shot #philandocastile remove black judge from presiding over trial."
test=clear(test)
A=Tokenization(test)
y_predict_test = NB_classifier.predict(A)
y_predict_test[0]
tweets_df_test = pd.read_csv("test.csv")
tweets_df_test.drop(['id'], axis=1, inplace=True)
tweets_df_test.head(10)

tweets_test = tweets_df_test['tweet'].apply(clear)
X1 = Tokenization_df(tweets_test)
X1.shape

y_predict_test = NB_classifier.predict(X1)
tweets_df_test["label"] = y_predict_test   # predictions are already a 1-D array; no transpose needed

tweets_df_test[tweets_df_test["label"] == 0].head(10)
tweets_df_test[tweets_df_test["label"] == 1].head(10)