MACHINE LEARNING AND STATISTICS WORLD

Posts

Showing posts from June, 2021

LINEAR DISCRIMINANT ANALYSIS IN PYTHON

June 22, 2021

Linear discriminant analysis (LDA) is a supervised dimensionality reduction algorithm. When a dataset has a large number of features, computation becomes expensive, so we opt for dimensionality reduction methods. The dataset I used is the seeds dataset from: https://archive.ics.uci.edu/ml/datasets/seeds

Here is the code:

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt

    #loading the dataset
    df = pd.read_csv('seeds_dataset.csv')
    df.head()
    X = df.iloc[:, 1:8].values
    y = df.iloc[:, 8].values

    #splitting the data into training and test sets
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

    #standardizing the values
    from sklearn.preprocessing import StandardScaler
    sc = StandardScaler()
    X_train = sc.fit_transform(X_train)
    X_test = sc.transform(X_test)

    #performing LDA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
    lda = LDA(n_components = 2)
    X_train = lda.fit_trans...
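To give a feel for what "supervised" dimensionality reduction means, here is a minimal standard-library sketch of the two-class Fisher discriminant direction, w = Sw^-1 (m1 - m0). It is not the scikit-learn implementation used in the post; the synthetic data, the function names and the restriction to two features and two classes are assumptions made for illustration.

```python
import random

def mean_vec(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def scatter(points, m):
    """2x2 scatter matrix: sum over points of (x - m)(x - m)^T."""
    sxx = sum((p[0] - m[0]) ** 2 for p in points)
    sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in points)
    syy = sum((p[1] - m[1]) ** 2 for p in points)
    return [[sxx, sxy], [sxy, syy]]

def fisher_direction(class0, class1):
    """Direction w = Sw^-1 (m1 - m0) that best separates two classes."""
    m0, m1 = mean_vec(class0), mean_vec(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    # Within-class scatter Sw = S0 + S1 (symmetric 2x2 matrix [[a, b], [b, d]])
    a = s0[0][0] + s1[0][0]
    b = s0[0][1] + s1[0][1]
    d = s0[1][1] + s1[1][1]
    det = a * d - b * b
    dx, dy = m1[0] - m0[0], m1[1] - m0[1]
    # Inverse of [[a, b], [b, d]] is (1/det) * [[d, -b], [-b, a]]
    return [(d * dx - b * dy) / det, (-b * dx + a * dy) / det]

# Synthetic data (an assumption): the classes differ only along the first feature
rng = random.Random(0)
class0 = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
class1 = [(rng.gauss(3, 1), rng.gauss(0, 1)) for _ in range(200)]

w = fisher_direction(class0, class1)
# Project each 2-D point onto w to get a 1-D representation
proj0 = [w[0] * x + w[1] * y for x, y in class0]
proj1 = [w[0] * x + w[1] * y for x, y in class1]
print("direction:", w)
print("projected means:", sum(proj0) / len(proj0), sum(proj1) / len(proj1))
```

Unlike PCA, the direction depends on the class labels: it is chosen to push the projected class means apart relative to the within-class spread.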

K-MEANS CLUSTERING

June 22, 2021

K-Means clustering is an unsupervised, centroid-based algorithm. It tries to minimize the sum of squared distances between the points in each cluster and that cluster's centroid. The dataset I used is the seeds dataset from: https://archive.ics.uci.edu/ml/datasets/seeds

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt

    #loading dataset
    df = pd.read_csv('seeds_dataset.csv')
    df.head()

    #taking compactness and perimeter columns
    z= df.iloc[:,[2,3]].values

    #applying elbow method to find the optimal number of clusters
    from sklearn.cluster import KMeans
    elbow_list= []
    for i in range(1, 11):
        kmeans = KMeans(n_clusters=i, init='k-means++', random_state= 42)
        kmeans.fit(z)
        elbow_list.append(kmeans.inertia_)
    plt.plot(range(1, 11), elbow_list)
    plt.title('The Elbow Method Graph')
    plt.xlabel('Number of clusters(k)')
    plt.ylabel('elbow_list')...
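The quantity plotted in the elbow method is the inertia: the sum of squared distances from each point to its nearest centroid. The sketch below computes it with a small standard-library version of Lloyd's algorithm. It uses plain random initialisation rather than k-means++, and the synthetic two-blob data and function names are assumptions for illustration, so the numbers will not match scikit-learn's.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=50, seed=42):
    """Lloyd's algorithm; returns (centroids, inertia)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random init, not k-means++
    for _ in range(iters):
        # Assignment step: attach every point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its members
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    inertia = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, inertia

# Synthetic data (an assumption): two well-separated blobs
rng = random.Random(0)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
points += [(rng.gauss(5, 1), rng.gauss(5, 1)) for _ in range(100)]

inertia_by_k = {k: kmeans(points, k)[1] for k in range(1, 5)}
for k, value in inertia_by_k.items():
    print(k, round(value, 1))
```

For data like this, the inertia should drop sharply from k=1 to k=2 and much more slowly afterwards; that bend is the "elbow".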

SIMULATION OF AUTOREGRESSIVE PROCESS AR(2) IN R

June 19, 2021

Here is the code in R for an AR(2) process:

    set.seed(2017)
    X.ts <- arima.sim(list(ar = c(.7, .2)), n=1000)
    par(mfrow=c(2,1))
    plot(X.ts,main="AR(2) Time Series, phi1=.7, phi2=.2")
    X.acf = acf(X.ts, main="Autocorrelation of AR(2) Time Series")
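As a cross-check in Python, the sketch below tests the standard AR(2) stationarity conditions (phi1 + phi2 < 1, phi2 - phi1 < 1, |phi2| < 1) for the coefficients used above and simulates the recursion directly. The function names and the burn-in length are assumptions, and Python's random generator will not reproduce R's arima.sim draws even with the same seed.

```python
import random

def ar2_is_stationary(phi1, phi2):
    """Stationarity conditions for X[t] = phi1*X[t-1] + phi2*X[t-2] + Z[t]."""
    return (phi1 + phi2 < 1) and (phi2 - phi1 < 1) and (abs(phi2) < 1)

def simulate_ar2(phi1, phi2, n, seed=2017, burn_in=500):
    """Simulate n values of an AR(2) process driven by N(0, 1) noise."""
    rng = random.Random(seed)
    x = [0.0, 0.0]
    for _ in range(n + burn_in):
        x.append(phi1 * x[-1] + phi2 * x[-2] + rng.gauss(0, 1))
    return x[-n:]  # drop the start-up values and the burn-in

print(ar2_is_stationary(0.7, 0.2))  # coefficients from the post
print(ar2_is_stationary(0.7, 0.4))  # phi1 + phi2 > 1, so not stationary

series = simulate_ar2(0.7, 0.2, n=1000)
print(len(series), series[:3])
```

With phi1 = 0.7 and phi2 = 0.2 all three conditions hold, which is why arima.sim accepts these coefficients.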

SIMULATION OF AUTOREGRESSIVE PROCESS AR(1) IN R

June 19, 2021

Here is the code in R for AR(1):

    set.seed(20190)
    n=10000
    phi = .6
    Z = rnorm(n,0,1)
    X=NULL
    X[1] = Z[1]
    for (t in 2:n) {
      X[t] = Z[t] + phi*X[t-1]
    }
    X.ts = ts(X)
    par(mfrow=c(2,1))
    plot(X.ts,main="AR(1) Time Series on White Noise, phi=.6")
    X.acf = acf(X.ts, main="AR(1) Time Series on White Noise, phi=.6")
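For a stationary AR(1) process the theoretical autocorrelation at lag k is phi^k, so the correlogram should decay geometrically. The Python sketch below mirrors the recursion above and compares the sample autocorrelation with phi^k. The function names and seed are assumptions, and the random draws differ from R's.

```python
import random

def simulate_ar1(phi, n, seed=0):
    """X[1] = Z[1], X[t] = Z[t] + phi * X[t-1], with Z ~ N(0, 1)."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1)]
    for _ in range(1, n):
        x.append(phi * x[-1] + rng.gauss(0, 1))
    return x

def sample_acf(x, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
            for k in range(max_lag + 1)]

phi = 0.6
x = simulate_ar1(phi, n=10000)
acf_values = sample_acf(x, max_lag=5)
for k, r in enumerate(acf_values):
    # sample value next to the theoretical phi**k
    print(k, round(r, 3), round(phi ** k, 3))
```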

SIMULATION OF MOVING AVERAGE PROCESS IN R

June 19, 2021

Here is the code in R for simulating a moving average process:

    #simulating MA(3) process
    noise = rnorm(10000)
    ma3= NULL
    for(i in 4:10000) {
      ma3[i] = noise[i] + 0.8*noise[i-1] + 0.5*noise[i-2] + 0.3*noise[i-3]
    }
    moving_average = ma3[4:10000]

    #changing the series into time series
    moving_average = ts(moving_average)
    par(mfrow=c(2,1))
    plot(moving_average, col='blue')
    acf(moving_average)

Conclusion: In the autocorrelation graph the ACF cuts off after lag 3 (the values at higher lags are not significantly different from zero), which is what we expect from an MA(3) process.
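The cut-off after lag 3 can also be checked by hand. For an MA(q) process with coefficients theta_0 = 1, theta_1, ..., theta_q, the theoretical autocorrelation at lag k is sum_i(theta_i * theta_(i+k)) / sum_i(theta_i^2), and it is exactly zero for k > q. A short Python sketch (the function name is an assumption) for the coefficients used above:

```python
def ma_acf(thetas, max_lag):
    """Theoretical ACF of X[t] = Z[t] + thetas[0]*Z[t-1] + thetas[1]*Z[t-2] + ..."""
    coeffs = [1.0] + list(thetas)
    denom = sum(c * c for c in coeffs)
    acf = []
    for k in range(max_lag + 1):
        # Empty sum (k > q) gives 0, i.e. the ACF cuts off after lag q
        num = sum(coeffs[i] * coeffs[i + k] for i in range(len(coeffs) - k))
        acf.append(num / denom)
    return acf

acf_values = ma_acf([0.8, 0.5, 0.3], max_lag=5)
for k, r in enumerate(acf_values):
    print(k, round(r, 3))
# lag 1 ≈ 0.682, lag 2 ≈ 0.374, lag 3 ≈ 0.152, lags 4 and 5 are 0
```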

SIMULATION OF A RANDOM WALK IN R

June 19, 2021

Here is the code in R for simulating a random walk:

    x=NULL
    x[1]=0
    for( i in 2:10000) {
      x[i]=x[i-1] + rnorm(1)
    }
    print(x)

    #converting it into a time series data
    random_walk = ts(x)
    plot(random_walk, main='visualization of a random walk' , xlab='days', ylab=' ')
    acf(random_walk)

The correlogram shows high autocorrelation that decays very slowly across lags, which indicates that the random walk is a non-stationary process.

    # making the series stationary by differencing the values
    z<-diff(random_walk)
    plot(z)
    # we get white noise
    acf(z)

Conclusion: After differencing, there is no significant autocorrelation at any nonzero lag. Thus we obtained a stationary series (white noise) by differencing the time series.
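The same experiment can be sketched in Python to put numbers on the two correlograms: the lag-1 sample autocorrelation of the walk should be close to 1, while that of the differenced series should be close to 0. The helper name and seed are assumptions for illustration.

```python
import random

def lag1_autocorr(x):
    """Sample autocorrelation at lag 1."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)
    c1 = sum(dev[t] * dev[t + 1] for t in range(n - 1))
    return c1 / c0

rng = random.Random(0)
n = 10000

# Random walk: each value is the previous value plus N(0, 1) noise
walk = [0.0]
for _ in range(1, n):
    walk.append(walk[-1] + rng.gauss(0, 1))

# First difference, the equivalent of diff() in R
diffs = [walk[t] - walk[t - 1] for t in range(1, n)]

walk_r1 = lag1_autocorr(walk)
diff_r1 = lag1_autocorr(diffs)
print("lag-1 autocorrelation of the walk:       ", round(walk_r1, 4))
print("lag-1 autocorrelation of the differences:", round(diff_r1, 4))
```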

ESTIMATION OF PI USING MONTE CARLO METHOD USING PYTHON

June 17, 2021

The value of pi is estimated with the Monte Carlo method by taking a square with sides of 1 unit and inscribing a circle of radius 0.5 units in it. The ratio of the area of the circle to the area of the square is pi/4, so multiplying the fraction of random points that land inside the circle by 4 gives an estimate of pi. The code below samples the quarter of a radius-1 circle that lies inside the unit square, which has the same area ratio of pi/4.

Python code:

    import random
    n=1000000
    c_points=0 #points inside circle
    s_points=0 #points inside square
    for i in range(n):
        x = random.uniform(0,1)
        y = random.uniform(0,1)
        d = x**2 + y**2
        if d<=1 :
            c_points +=1
        s_points +=1
    pi = 4*(c_points/s_points)
    print("pi value is:", pi)

Conclusion: The higher the value of n, the more accurate the estimate of pi tends to be; the typical error shrinks roughly in proportion to 1/sqrt(n).
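To see how the accuracy changes with n, the estimate can be wrapped in a function and run for several sample sizes. This is a small variation on the code above; the function name and the fixed seed are assumptions added for reproducibility.

```python
import math
import random

def estimate_pi(n, seed=0):
    """Monte Carlo estimate of pi from n points in the unit square."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n):
        x = rng.random()
        y = rng.random()
        if x * x + y * y <= 1.0:  # point falls inside the quarter circle
            inside += 1
    return 4 * inside / n

for n in (100, 10_000, 1_000_000):
    est = estimate_pi(n)
    print(n, est, "absolute error:", abs(est - math.pi))
```

Because the estimate is random, the error does not fall at every step, but on average it shrinks as n grows.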

APRIORI ALGORITHM IMPLEMENTATION IN PYTHON

June 17, 2021

The Apriori algorithm is based on the principle: "All subsets of a frequent itemset must be frequent, and all supersets of an infrequent itemset must be infrequent."

Here's the code for the Apriori algorithm in Python:

    #install mlxtend to use the apriori command (run in a shell): pip install mlxtend
    from mlxtend.frequent_patterns import apriori
    dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
               ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
               ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
               ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
               ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]
    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder
    t = TransactionEncoder()
    x = t.fit(dataset)....
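The principle quoted above is what lets Apriori prune candidates level by level. The following is a minimal standard-library sketch of that idea on the same transactions. It is not mlxtend's implementation; the function names and the 0.6 support threshold are assumptions chosen for illustration.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def apriori_sketch(transactions, min_support):
    """Level-wise search for frequent itemsets; returns {itemset: support}."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items
             if support([i], transactions) >= min_support]
    frequent = {}
    k = 1
    while level:
        for s in level:
            frequent[s] = support(s, transactions)
        k += 1
        # Candidates of size k are unions of two frequent (k-1)-itemsets
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent
        prev = set(level)
        candidates = [c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))]
        level = [c for c in candidates
                 if support(c, transactions) >= min_support]
    return frequent

dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

frequent = apriori_sketch(dataset, min_support=0.6)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

For example, {Eggs, Milk} appears in only 2 of the 5 transactions, so no 3-item candidate containing both is ever counted.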

DECISION TREE CLASSIFICATION MODEL

June 16, 2021

A decision tree classifier is a supervised machine learning technique. It has four main parts in its structure:

1. Root node: the main node of the tree, where the decision tree starts.
2. Branches: branches split a decision node into subtrees.
3. Leaf nodes: the final output nodes, which cannot be divided any further.
4. Decision nodes: nodes that can be divided further, either into other decision nodes or into leaf nodes.

For modeling I used the dataset from: https://data.world/uci/occupancy-detection

I used the training dataset to train the model and the test dataset to test it. The input variables are Temperature, Humidity, Light, CO2 and Humidity Ratio, and the output variable is Occupancy. Once the model is fitted, its accuracy is computed from the confusion matrix.

Here is the code for the Decision Tree Classifier:

    #loading the libraries and reading the dataset
    import numpy as nm
    import pandas as pd
    #read the data file...
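The four parts described above can be sketched as a small data structure. This is only an illustration of how a fitted tree is traversed, not how scikit-learn stores its trees. The class names, and especially the feature thresholds, are made up for the example and are not learned from the occupancy dataset.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Leaf:
    """Leaf node: holds the final predicted class."""
    label: int

@dataclass
class Decision:
    """Decision node: tests one feature and branches left or right."""
    feature: str
    threshold: float
    left: Union["Decision", Leaf]    # branch taken when value <= threshold
    right: Union["Decision", Leaf]   # branch taken when value > threshold

def predict(node, sample):
    """Walk from the root down the branches until a leaf is reached."""
    while isinstance(node, Decision):
        if sample[node.feature] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.label

# Hypothetical tree: the thresholds 300.0 and 500.0 are invented for illustration
root = Decision("Light", 300.0,
                Leaf(0),
                Decision("CO2", 500.0, Leaf(0), Leaf(1)))

print(predict(root, {"Light": 100.0, "CO2": 900.0}))
print(predict(root, {"Light": 400.0, "CO2": 900.0}))
```

Here `root` is the root node, the nested `Decision` is a decision node, the `left` and `right` attributes are the branches, and each `Leaf` is a leaf node.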

RANDOM FOREST CLASSIFIER

June 16, 2021

The random forest algorithm builds many decision trees on various subsets of the data and combines their outputs (for classification, by a majority vote or by averaging the predicted class probabilities) to improve accuracy. Adding more trees generally makes the predictions more stable, although the gain levels off after a point. It is an ensemble learning method.

I used the dataset from: https://data.world/uci/occupancy-detection

I used the training dataset to train the model and the test dataset to test it. The input variables are Temperature, Humidity, Light, CO2 and Humidity Ratio, and the output variable is Occupancy. Once the model is fitted, its accuracy is computed from the confusion matrix.

Here is the code for the Random Forest Classifier:

    import numpy as nm
    import matplotlib.pyplot as mtp
    import pandas as pd

    #read the data files
    train = pd.read_csv('datatraining.txt')
    test = pd.read_csv('datatest.txt')
    train.head()
    test.head()
    train.shape
    test.shape

    #looking if the data has any nan values
    train.isnull().sum()
    test.isnull().sum()
    #as there are no Nan values, cleaning of data is not required
    x_train = train.ilo...
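Two ideas from the description above, combining the trees' outputs and reading accuracy off a confusion matrix, can be sketched without any library. The per-tree votes below are made-up numbers, not results from the occupancy data, and the sketch uses a hard majority vote, whereas scikit-learn's RandomForestClassifier averages the trees' class probabilities.

```python
from collections import Counter

def majority_vote(votes):
    """Class predicted by the largest number of trees."""
    return Counter(votes).most_common(1)[0][0]

def confusion_matrix_binary(y_true, y_pred):
    """Returns [[TN, FP], [FN, TP]] for labels 0 and 1."""
    cm = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

def accuracy(cm):
    """Correct predictions (the diagonal) divided by all predictions."""
    total = sum(sum(row) for row in cm)
    return (cm[0][0] + cm[1][1]) / total

# Hypothetical votes from 3 trees for 4 samples (one inner list per sample)
tree_votes = [[1, 1, 0], [0, 0, 1], [1, 0, 1], [0, 0, 0]]
y_true = [1, 0, 0, 0]

y_pred = [majority_vote(v) for v in tree_votes]
cm = confusion_matrix_binary(y_true, y_pred)
print(y_pred)        # [1, 0, 1, 0]
print(cm)            # [[2, 1], [0, 1]]
print(accuracy(cm))  # 0.75
```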

USING ARIMA (AUTOREGRESSIVE INTEGRATED MOVING AVERAGE) MODEL TO PREDICT THE STOCK PRICE

June 16, 2021

ARIMA is a popular time series model used for forecasting. In this article we will use an ARIMA model to predict the stock price of Microsoft. The stock price data is downloaded from the Yahoo Finance website: open the website, search for Microsoft stock, go to history, and there you will find the option to download the stock prices. I used daily stock data for Microsoft from 01-01-2010 to 11-06-2021.

PS: To remove warnings I used:

    from warnings import simplefilter
    simplefilter(action='ignore', category = FutureWarning)

Also use 'from datetime import datetime' instead of 'from pandas import datetime'.

So here's the code:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from pandas.plotting import lag_plot
    from datetime import da...