Posts

Showing posts from June, 2021

LINEAR DISCRIMINANT ANALYSIS IN PYTHON

Linear discriminant analysis (LDA) is a supervised dimensionality reduction algorithm. When a dataset has many features, computation becomes expensive, so we opt for dimensionality reduction methods. The dataset I used is the seeds dataset from: https://archive.ics.uci.edu/ml/datasets/seeds

Here is the code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#loading the dataset
df = pd.read_csv('seeds_dataset.csv')
df.head()
X = df.iloc[:, 1:8].values
y = df.iloc[:, 8].values

#splitting the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

#standardizing the values
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

#performing LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(n_components = 2)
X_train = lda.fit_trans...
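
To get a feel for what LDA computes, here is a minimal pure-Python sketch of the two-class Fisher discriminant on made-up 2-D points (not the seeds data). The function names and the toy points are my own assumptions for illustration; scikit-learn's LDA handles the multi-class case and does much more.

```python
# Two-class Fisher discriminant on hypothetical 2-D points (illustration only).

def mean(points):
    """Component-wise mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def scatter(points, m):
    """2x2 scatter matrix: sum of (x - m)(x - m)^T over the points."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - m[0], p[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_direction(class_a, class_b):
    """Return w = Sw^{-1} (mean_b - mean_a), the projection direction."""
    ma, mb = mean(class_a), mean(class_b)
    sa, sb = scatter(class_a, ma), scatter(class_b, mb)
    # within-class scatter is the sum of the per-class scatter matrices
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [mb[0] - ma[0], mb[1] - ma[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]

def project(points, w):
    """Project each 2-D point onto the 1-D discriminant axis."""
    return [p[0] * w[0] + p[1] * w[1] for p in points]

# hypothetical toy classes
class_a = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0), (2.0, 1.0)]
class_b = [(6.0, 5.0), (7.0, 8.0), (8.0, 7.0), (7.0, 6.0)]

w = fisher_direction(class_a, class_b)
proj_a = project(class_a, w)
proj_b = project(class_b, w)
print("direction:", w)
print("classes separated on the 1-D axis:", max(proj_a) < min(proj_b))
```

The two features are reduced to a single coordinate while the classes stay apart, which is the idea behind using LDA for dimensionality reduction.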

K-MEANS CLUSTERING

K-Means clustering is an unsupervised, centroid-based algorithm. It tries to minimize the squared distance between the points in a cluster and the cluster centroid. The dataset I used is the seeds dataset from: https://archive.ics.uci.edu/ml/datasets/seeds

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#loading dataset
df = pd.read_csv('seeds_dataset.csv')
df.head()

#taking compactness and perimeter columns
z = df.iloc[:,[2,3]].values

#applying elbow method to find the optimal number of clusters
from sklearn.cluster import KMeans
elbow_list = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(z)
    elbow_list.append(kmeans.inertia_)

plt.plot(range(1, 11), elbow_list)
plt.title('The Elbow Method Graph')
plt.xlabel('Number of clusters(k)')
plt.ylabel('Inertia (within-cluster sum of squares)')
...
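
As a rough illustration of what the inertia in the elbow plot measures, here is a small pure-Python sketch of the basic K-Means loop on made-up points. The data, the function names, and the naive "first k points" initialisation are assumptions of mine; scikit-learn's KMeans uses k-means++ and is far more robust.

```python
# Basic K-Means (Lloyd's algorithm) on hypothetical 2-D points.

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100):
    # naive initialisation (assumption): use the first k points as centroids
    centroids = [points[i] for i in range(k)]
    labels = None
    for _ in range(max_iter):
        # assignment step: each point goes to its nearest centroid
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    # inertia: sum of squared distances of points to their assigned centroid
    inertia = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, inertia

# two visibly separate groups of made-up points
data = [(1, 1), (8, 8), (1, 2), (8, 9), (2, 1), (9, 8)]

inertias = {}
for k in range(1, 4):
    _, _, inertias[k] = kmeans(data, k)
    print(k, round(inertias[k], 3))
```

On this toy data the inertia drops sharply from k=1 to k=2 and only slightly after that, which is the "elbow" the plot is meant to reveal.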

SIMULATION OF AUTOREGRESSIVE PROCESS AR(2) IN R

Here is the code in R for an AR(2) process:

set.seed(2017)
X.ts <- arima.sim(list(ar = c(.7, .2)), n=1000)
par(mfrow=c(2,1))
plot(X.ts, main="AR(2) Time Series, phi1=.7, phi2=.2")
X.acf = acf(X.ts, main="Autocorrelation of AR(2) Time Series")
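
For comparison with the sample autocorrelation plot, the theoretical autocorrelations of a stationary AR(2) process follow the Yule-Walker recursion. Below is a small Python sketch of that recursion for phi1=.7 and phi2=.2; the function names are my own and this is only meant as a cross-check, not part of the R workflow.

```python
# Theoretical ACF of an AR(2) process X[t] = phi1*X[t-1] + phi2*X[t-2] + Z[t].

def is_stationary_ar2(phi1, phi2):
    """Standard stationarity conditions for an AR(2) process."""
    return (phi1 + phi2 < 1) and (phi2 - phi1 < 1) and (abs(phi2) < 1)

def ar2_acf(phi1, phi2, max_lag):
    """Yule-Walker recursion: rho[k] = phi1*rho[k-1] + phi2*rho[k-2]."""
    rho = [1.0, phi1 / (1.0 - phi2)]
    for k in range(2, max_lag + 1):
        rho.append(phi1 * rho[k - 1] + phi2 * rho[k - 2])
    return rho[:max_lag + 1]

phi1, phi2 = 0.7, 0.2
print("stationary:", is_stationary_ar2(phi1, phi2))
acf_values = ar2_acf(phi1, phi2, 5)
for lag, value in enumerate(acf_values):
    print(lag, round(value, 4))
```

The autocorrelations decay slowly rather than cutting off, which is the typical signature of an autoregressive process.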

SIMULATION OF AUTOREGRESSIVE PROCESS AR(1) IN R

Here is the code in R for an AR(1) process:

set.seed(20190)
n = 10000
phi = .6
Z = rnorm(n,0,1)
X = NULL
X[1] = Z[1]
for (t in 2:n) {
  X[t] = Z[t] + phi*X[t-1]
}
X.ts = ts(X)
par(mfrow=c(2,1))
plot(X.ts, main="AR(1) Time Series on White Noise, phi=.6")
X.acf = acf(X.ts, main="Autocorrelation of AR(1) Time Series, phi=.6")
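
The same recursion can be sketched in Python to check that the sample autocorrelation at lag k is close to phi^k, which is the theoretical value for an AR(1) process. The function names and the seed below are assumptions for illustration, and the random numbers will not match the R output.

```python
import random

def simulate_ar1(phi, n, seed=0):
    """Simulate X[t] = phi*X[t-1] + Z[t] with standard normal noise Z."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1)]
    for _ in range(1, n):
        x.append(phi * x[-1] + rng.gauss(0, 1))
    return x

def sample_acf(x, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(x)
    m = sum(x) / n
    c0 = sum((v - m) ** 2 for v in x)
    return [sum((x[t] - m) * (x[t + k] - m) for t in range(n - k)) / c0
            for k in range(max_lag + 1)]

phi = 0.6
series = simulate_ar1(phi, 10000, seed=0)
acf_hat = sample_acf(series, 3)
for lag, value in enumerate(acf_hat):
    # compare the estimate with the theoretical value phi**lag
    print(lag, round(value, 3), round(phi ** lag, 3))
```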

SIMULATION OF MOVING AVERAGE PROCESS IN R

Here is the code in R for simulating a moving average process:

#simulating MA(3) process
noise = rnorm(10000)
ma3 = NULL
for (i in 4:10000) {
  ma3[i] = noise[i] + 0.8*noise[i-1] + 0.5*noise[i-2] + 0.3*noise[i-3]
}
moving_average = ma3[4:10000]

#changing the series into time series
moving_average = ts(moving_average)
par(mfrow=c(2,1))
plot(moving_average, col='blue')
acf(moving_average)

Conclusion: In the autocorrelation graph, the autocorrelation cuts off after lag 3, which shows that the process is an MA(3).
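
The cut-off after lag 3 can also be checked without simulation. For an MA(q) process the autocovariance at lag k is the sum of products of coefficients k positions apart (for unit-variance noise), so it is exactly zero beyond lag q. Here is a short Python sketch using the coefficients from the R code; the function name is my own.

```python
# Theoretical ACF of an MA process with unit-variance white noise.

def ma_acf(theta, max_lag):
    """theta lists the coefficients, starting with 1 for the current noise term."""
    gamma0 = sum(t * t for t in theta)
    acf = []
    for k in range(max_lag + 1):
        # autocovariance at lag k; the range is empty (sum is 0) once k > q
        gamma_k = sum(theta[i] * theta[i + k] for i in range(len(theta) - k))
        acf.append(gamma_k / gamma0)
    return acf

theta = [1.0, 0.8, 0.5, 0.3]  # coefficients used in the R simulation above
ma3_acf = ma_acf(theta, 6)
for lag, value in enumerate(ma3_acf):
    print(lag, round(value, 4))
```

Lags 1 to 3 are non-zero and everything from lag 4 onward is zero, which matches what the sample correlogram shows up to sampling noise.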

SIMULATION OF A RANDOM WALK IN R

Here is the code in R for simulating a random walk:

x = NULL
x[1] = 0
for (i in 2:10000) {
  x[i] = x[i-1] + rnorm(1)
}
print(x)

#converting it into a time series data
random_walk = ts(x)
plot(random_walk, main='visualization of a random walk', xlab='days', ylab=' ')
acf(random_walk)

The correlogram shows a high autocorrelation that decays very slowly, which indicates that the random walk is a non-stationary process.

# making the series stationary by differencing the values
z <- diff(random_walk)
plot(z)

# we get white noise
acf(z)

Conclusion: After differencing, there is no significant autocorrelation at any non-zero lag. Thus we obtained a stationary series by differencing the time series.
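
The effect of differencing can be reproduced with a few lines of Python. This is only a sketch with my own function names and an arbitrary seed; it compares the lag-1 autocorrelation of the random walk with that of its first difference.

```python
import random

def random_walk(n, seed=0):
    """Cumulative sum of standard normal steps, starting at 0."""
    rng = random.Random(seed)
    walk = [0.0]
    for _ in range(1, n):
        walk.append(walk[-1] + rng.gauss(0, 1))
    return walk

def difference(series):
    """First difference: series[i] - series[i-1]."""
    return [series[i] - series[i - 1] for i in range(1, len(series))]

def lag1_autocorr(x):
    """Sample autocorrelation at lag 1."""
    n = len(x)
    m = sum(x) / n
    c0 = sum((v - m) ** 2 for v in x)
    return sum((x[t] - m) * (x[t + 1] - m) for t in range(n - 1)) / c0

walk = random_walk(5000, seed=0)
steps = difference(walk)
r_walk = lag1_autocorr(walk)
r_steps = lag1_autocorr(steps)
print("lag-1 autocorrelation of the walk:", round(r_walk, 3))
print("lag-1 autocorrelation after differencing:", round(r_steps, 3))
```

The walk has a lag-1 autocorrelation close to 1, while the differenced series behaves like white noise with a value close to 0.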

ESTIMATION OF PI USING THE MONTE CARLO METHOD IN PYTHON

The value of pi is estimated with the Monte Carlo method by taking a square with a side of 1 unit and inscribing a circle of radius 0.5 units in it. The ratio of the area of the circle to the area of the square is pi/4, so multiplying this ratio by 4 gives pi. The code below samples points uniformly in the unit square and counts those within distance 1 of the origin; this quarter of a unit circle covers the same pi/4 fraction of the square.

Python code:

import random

n = 1000000
c_points = 0  #points inside the quarter circle
s_points = 0  #points inside square
for i in range(n):
    x = random.uniform(0,1)
    y = random.uniform(0,1)
    d = x**2 + y**2
    if d <= 1:
        c_points += 1
    s_points += 1

pi = 4*(c_points/s_points)
print("pi value is:", pi)

Conclusion: The higher the value of n, the more accurate the estimate of pi tends to be.
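
To see how the accuracy changes with n, the estimate can be wrapped in a function and run for several sample sizes. This is a sketch with a hypothetical function name and fixed seeds of my choosing; on average the error shrinks roughly in proportion to 1/sqrt(n), although any single run can deviate from that.

```python
import math
import random

def estimate_pi(n, seed=0):
    """Monte Carlo estimate of pi from n uniform points in the unit square."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n):
        x = rng.uniform(0, 1)
        y = rng.uniform(0, 1)
        if x * x + y * y <= 1:
            inside += 1  # point falls inside the quarter circle
    return 4 * inside / n

for n in (100, 10_000, 200_000):
    estimate = estimate_pi(n, seed=1)
    print(n, estimate, "abs error:", abs(estimate - math.pi))
```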

APRIORI ALGORITHM IMPLEMENTATION IN PYTHON

The Apriori algorithm says: "All subsets of a frequent itemset must be frequent, and all supersets of an infrequent itemset must be infrequent."

Here's the code for the Apriori algorithm in Python:

#install mlxtend to use the apriori function
#pip install mlxtend

from mlxtend.frequent_patterns import apriori

dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
t = TransactionEncoder()
x = t.fit(dataset)....
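
To make the pruning rule concrete, here is a small pure-Python sketch of the level-wise search on the same five transactions. It is not the mlxtend implementation; the function names and the 0.6 minimum support are assumptions for illustration.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def apriori_sketch(raw_transactions, min_support):
    """Level-wise frequent itemset search with subset-based pruning."""
    transactions = [frozenset(t) for t in raw_transactions]
    items = sorted({item for t in transactions for item in t})
    frequent = {}
    current = [frozenset([item]) for item in items]  # 1-item candidates
    k = 1
    while current:
        # keep only the candidates that reach the minimum support
        level = {}
        for candidate in current:
            s = support(candidate, transactions)
            if s >= min_support:
                level[candidate] = s
        frequent.update(level)
        # build (k+1)-item candidates by joining frequent k-itemsets
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # prune: every (k-1)-item subset of a candidate must itself be frequent
        current = [c for c in joined
                   if all(frozenset(sub) in level for sub in combinations(c, k - 1))]
    return frequent

dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]

frequent_itemsets = apriori_sketch(dataset, min_support=0.6)
for itemset, s in sorted(frequent_itemsets.items(),
                         key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

Because 'Nutmeg' alone is infrequent at this threshold, no itemset containing it is ever counted, which is exactly the saving the Apriori principle provides.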

DECISION TREE CLASSIFICATION MODEL

A decision tree classifier is a supervised machine learning technique. It has four main parts in its structure:

1. Root node: The main node of the tree, where the decision tree starts.
2. Branches: A branch connects a decision node to one of its sub-trees.
3. Leaf nodes: Leaf nodes are the final output nodes, which cannot be divided any further.
4. Decision nodes: Decision nodes are the nodes that can be further divided into sub-trees, that is, into other decision nodes or leaf nodes.

For modeling I used the dataset from: https://data.world/uci/occupancy-detection

I used the training dataset to train the model and the test dataset to test it. I used Temperature, Humidity, Light, CO2 and Humidity Ratio as input variables, and Occupancy as the output variable. Once the model is fitted, the accuracy is obtained from the confusion matrix.

Here is the code for the Decision Tree Classifier:

#loading the libraries and reading the dataset
import numpy as nm
import pandas as pd

#read the data file...
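
The four parts listed above can be pictured with a tiny hand-built tree. The sketch below is only an illustration: the class, the function name, and especially the thresholds on Light and CO2 are made up by me and are not learned from the occupancy data.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    # a decision node sets feature/threshold and its two branches (left, right);
    # a leaf node only sets label
    feature: Optional[str] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: Optional[int] = None

    def is_leaf(self):
        return self.label is not None

def predict(node, sample):
    """Walk from the root along the branches until a leaf is reached."""
    while not node.is_leaf():
        if sample[node.feature] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.label

# hypothetical tree: thresholds are invented for illustration
tree = Node(feature="Light", threshold=365.0,          # root node
            left=Node(label=0),                         # leaf: not occupied
            right=Node(feature="CO2", threshold=500.0,  # decision node
                       left=Node(label=0),              # leaf: not occupied
                       right=Node(label=1)))            # leaf: occupied

print(predict(tree, {"Light": 100.0, "CO2": 900.0}))
print(predict(tree, {"Light": 450.0, "CO2": 800.0}))
```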

RANDOM FOREST CLASSIFIER

A random forest builds many decision trees on various subsets of the data and combines their outputs (a majority vote or an average of the trees' predictions) to improve accuracy. Adding more trees generally makes the predictions more stable, although the gain levels off beyond a certain number of trees. It is an ensemble learning method.

I used the dataset from: https://data.world/uci/occupancy-detection

I used the training dataset to train the model and the test dataset to test it. I used Temperature, Humidity, Light, CO2 and Humidity Ratio as input variables, and Occupancy as the output variable. Once the model is fitted, the accuracy is obtained from the confusion matrix.

Here is the code for the Random Forest Classifier:

import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

#read the data files
train = pd.read_csv('datatraining.txt')
test = pd.read_csv('datatest.txt')
train.head()
test.head()
train.shape
test.shape

#looking if the data has any nan values
train.isnull().sum()
test.isnull().sum()

#as there are no Nan values, cleaning of data is not required
x_train = train.ilo...
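
Two ideas from the description, combining the trees and reading the accuracy off the confusion matrix, can be sketched in plain Python. The per-tree predictions and true labels below are invented, the function names are my own, and the sketch uses a hard majority vote, whereas scikit-learn's random forest averages the trees' predicted class probabilities.

```python
from collections import Counter

def majority_vote(votes):
    """Most common class among the trees' predictions for one sample."""
    return Counter(votes).most_common(1)[0][0]

def confusion_matrix(y_true, y_pred):
    """2x2 matrix for binary labels: rows are true class, columns predicted."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total

# hypothetical predictions of three trees on four samples
tree_predictions = [[0, 1, 1, 0],
                    [0, 1, 0, 0],
                    [1, 1, 1, 0]]
y_true = [0, 1, 1, 1]  # hypothetical true occupancy labels

# zip(*...) groups the three votes that belong to each sample
forest_prediction = [majority_vote(votes) for votes in zip(*tree_predictions)]
cm = confusion_matrix(y_true, forest_prediction)
print(forest_prediction)
print(cm)
print(accuracy(cm))
```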

USING AN ARIMA (AUTOREGRESSIVE INTEGRATED MOVING AVERAGE) MODEL TO PREDICT STOCK PRICES

ARIMA is a popular time series model used for forecasting. In this article we will use an ARIMA model to predict the stock prices of Microsoft. The stock price data is downloaded from the Yahoo Finance website. Open the website, search for Microsoft stock, go to the history section, and there you will find the option to download the stock prices. I used daily stock data of Microsoft from 01-01-2010 to 11-06-2021.

PS: To remove warnings I used:

from warnings import simplefilter
simplefilter(action='ignore', category = FutureWarning)

Also use 'from datetime import datetime' instead of 'from pandas import datetime'.

So here's the code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import lag_plot
from datetime import da...
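
As a small illustration of the two tips in the PS, here is a sketch that silences FutureWarning and parses dates with the standard library's datetime. The helper name is hypothetical, and I am assuming the dates quoted in the post are written as day-month-year.

```python
from warnings import simplefilter
from datetime import datetime

# ignore FutureWarning messages, as described in the PS
simplefilter(action='ignore', category=FutureWarning)

def parse_date(text, fmt="%d-%m-%Y"):
    """Parse a date string; the default format assumes day-month-year."""
    return datetime.strptime(text, fmt)

start = parse_date("01-01-2010")
end = parse_date("11-06-2021")
print(start.date(), end.date())
print("days covered:", (end - start).days)
```

If the downloaded file stores dates in another layout, pass the matching format string as the second argument.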