Stock Price Prediction using Auto-ARIMA
A stock (also known as company’s ‘equity’) is a financial instrument that represents ownership in a company or corporation and represents a proportionate claim on its assets (what it owns) and earnings (what it generates in profits) — Investopedia
The stock market is a marketplace where company stocks are bought and sold. Every stock exchange has its own stock index. An index is an aggregate value calculated from a selection of stocks, which helps represent the market as a whole and gauge its movement over time. The stock market can have a huge impact on individuals and on a country’s economy as a whole. Predicting stock trends well can therefore help reduce the risk of loss and improve returns.
In this article we look at a forecasting method that can be used to predict the Apple (AAPL) stock price. Forecasting is the process of making predictions about the future based on past and present data. The method used in this article is the Auto-ARIMA model, whose grid search we use to select the optimal parameters for the model.
ARIMA
This section gives a quick introduction to ARIMA, a very popular statistical method for time series forecasting. ARIMA stands for AutoRegressive Integrated Moving Average. ARIMA models work on the following assumptions:
- The data series is stationary, which means that its mean and variance do not vary with time. A series can be made stationary by applying a log transformation or by differencing the series.
- The data provided as input must be a univariate series, since ARIMA uses the past values of the series to predict its future values.
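To illustrate the first assumption, here is a minimal sketch of the log transformation followed by first differencing. The series is a simulated, hypothetical price path (not the AAPL data), and the variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical price-like series: exponential of a random walk with a small drift
log_path = np.cumsum(0.001 + 0.02 * rng.normal(size=1000))
prices = 100 * np.exp(log_path)

# Log transformation followed by first differencing (this gives log returns)
log_returns = np.diff(np.log(prices))

# Compare how much the mean shifts between the first and second half of each series
half = len(prices) // 2
print("mean shift in price level:", abs(prices[:half].mean() - prices[half:].mean()))
print("mean shift in log returns:", abs(log_returns[:half].mean() - log_returns[half:].mean()))
```

The level of the simulated series wanders over time, while the differenced log series fluctuates around a roughly constant mean, which is the behaviour ARIMA expects.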
ARIMA has three components: AR (autoregressive term), I (differencing term) and MA (moving average term). Let us understand each of these components:
- The AR term refers to the past values used for forecasting the next value. It is defined by the parameter ‘p’ in ARIMA. The value of ‘p’ is typically determined using the PACF plot.
- The MA term defines the number of past forecast errors used to predict future values. It is represented by the parameter ‘q’ in ARIMA. The ACF plot is used to identify the appropriate ‘q’ value.
- The order of differencing, ‘d’, specifies the number of times the differencing operation is performed on the series to make it stationary. Tests such as ADF and KPSS can be used to determine whether the series is stationary and help in identifying the ‘d’ value.
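Putting the three terms together, an ARIMA(p, d, q) model can be written compactly as below. This is the standard textbook form rather than anything specific to this article: $B$ is the backshift operator ($B y_t = y_{t-1}$), $y'_t$ is the series after differencing $d$ times, $\phi_i$ and $\theta_j$ are the AR and MA coefficients, $c$ is a constant, and $\varepsilon_t$ is white noise.

```latex
y'_t = (1 - B)^d \, y_t, \qquad
y'_t = c + \sum_{i=1}^{p} \phi_i \, y'_{t-i} + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j} + \varepsilon_t
```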
Auto ARIMA
When fitting an ARIMA model, we need to make the series stationary and determine the values of the parameters p, d and q that optimise a certain metric. There are many methods to achieve this goal, and yet the correct parametrization of ARIMA models can be a tedious process that requires statistical expertise and time. In this article, we hope to overcome this issue by running a grid search with auto ARIMA in Python, which automatically selects the combination of (p, d, q) that gives the best score.
Read more about auto_arima.
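To give a feel for what such a search does, here is a deliberately simplified sketch using only NumPy. It is not how pmdarima is implemented. We assume d is fixed at 1, we only search over the AR order p (so q stays at 0), and we score each candidate with the AIC of a least-squares fit. The helper `ar_aic` and the simulated data are hypothetical.

```python
import numpy as np

def ar_aic(z, p, max_p):
    """AIC of an AR(p) model fitted by least squares.

    Every candidate is fitted on the same observations (the first max_p
    points are skipped) so that the AIC values are comparable.
    """
    z = np.asarray(z, dtype=float)
    target = z[max_p:]
    n = len(target)
    # Design matrix: a constant plus the p most recent lagged values
    cols = [np.ones(n)] + [z[max_p - k : len(z) - k] for k in range(1, p + 1)]
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ beta
    sigma2 = resid @ resid / n
    return n * np.log(sigma2) + 2 * (p + 1)  # Gaussian AIC up to a constant

# Simulate an AR(2) process and integrate it once, so that d = 1 is appropriate
rng = np.random.default_rng(1)
n_obs = 2000
eps = rng.normal(size=n_obs)
z_true = np.zeros(n_obs)
for t in range(2, n_obs):
    z_true[t] = 0.6 * z_true[t - 1] - 0.3 * z_true[t - 2] + eps[t]
y = np.cumsum(z_true)

d, max_p = 1, 4
z = np.diff(y, n=d)  # difference d times before fitting
scores = {p: ar_aic(z, p, max_p) for p in range(max_p + 1)}
best_p = min(scores, key=scores.get)
print("AIC by p:", scores)
print("selected p:", best_p)
```

auto_arima does considerably more than this: it also searches over q, fits full ARIMA models, and can choose d for us, but the idea of scoring each candidate order and keeping the best one is the same.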
Our Goals
The goal is to train an ARIMA model with optimal parameters that will forecast the closing price of the stocks on the test data.
Okay, let’s get started!
Step 1. Importing Required Libraries
import warnings
warnings.filterwarnings('ignore')

# Data Manipulation and Treatment
import numpy as np
import pandas as pd
import datetime as dt
from datetime import timedelta

# Plotting and Visualizations
import matplotlib.pyplot as plt
import seaborn as sns
!pip install plotly
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Scikit-Learn for Modeling
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error, mean_squared_log_error

# Statistics
import statsmodels.api as sm
from statsmodels.tsa.api import Holt, SimpleExpSmoothing, ExponentialSmoothing
from pmdarima import auto_arima
Step 2. Loading the Dataset
The dataset consists of stock market data for Apple Inc. and can be downloaded from Yahoo Finance. The data shows the stock price of Apple Inc. from 2019-01-02 to 2020-10-27. We choose the closing value for this analysis.
df = pd.read_csv('AAPL.csv', sep=",")
df.head()
df['Date'] = pd.to_datetime(df.Date,format='%Y-%m-%d')
df.index = df['Date']
df = df.drop('Date',axis=1)
df.head()
The ‘Date’ column is read from the CSV as a string, so we convert it to datetime and then set it as the index of the dataframe.
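As an alternative, pandas can parse the dates and set the index while reading the file. The sketch below uses a tiny in-memory CSV with made-up numbers in the usual Yahoo Finance column layout, so the values are purely illustrative.

```python
import io
import pandas as pd

# Made-up sample rows in the Yahoo Finance column layout (not real prices)
csv_text = """Date,Open,High,Low,Close,Adj Close,Volume
2019-01-02,100.0,101.0,99.0,100.5,100.5,1000
2019-01-03,100.5,102.0,100.0,101.5,101.5,1200
"""

# parse_dates converts the column to datetime, index_col makes it the index
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"], index_col="Date")
print(sample.index)
print(sample["Close"])
```

With the real file, the same two arguments could be passed to `pd.read_csv('AAPL.csv', ...)`.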
The dataframe is ready now! Let’s plot the data and move on to creating a proper model for our prediction.
Step 3. Visualizing the Data
fig = px.line(y=df.Close, x=df.index)
fig.update_layout(title_text='Stock Prices of APPLE',font=dict(size=12),
xaxis_title_text="Date", yaxis_title_text="Close")
fig.show()
Step 4. Testing for Stationarity
# Checking Stationarity - Dickey-Fuller Test
from statsmodels.tsa.stattools import adfuller
from matplotlib.pylab import rcParams

def test_stationarity(timeseries):
    # Determining rolling statistics over a window of 4 observations
    # (4 trading days for daily data)
    rolmean = timeseries.rolling(4).mean()
    rolstd = timeseries.rolling(4).std()

    # Plot rolling statistics
    plt.plot(timeseries, color='blue', label='Original')
    plt.plot(rolmean, color='red', label='Rolling Mean')
    plt.plot(rolstd, color='black', label='Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show(block=False)

    # Perform Dickey-Fuller test
    print('Results of Dickey-Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used'])
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)' % key] = value
    print(dfoutput)

    if dfoutput['p-value'] < 0.05:
        print('result : time series is stationary')
    else:
        print('result : time series is not stationary')

rcParams['figure.figsize'] = 20, 10
test_stationarity(df['Close'])
The p-value obtained is greater than the significance level of 0.05, and the ADF test statistic is greater than all of the critical values. We therefore fail to reject the null hypothesis that the series has a unit root, so we treat the time series as non-stationary. Hence, we need the “Integrated (I)” part of ARIMA, denoted by the value ‘d’, to difference the data into a stationary series while building the Auto ARIMA model.
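To see where the test statistic comes from, here is a simplified sketch of the basic (non-augmented) Dickey-Fuller regression in NumPy. It is only an illustration on simulated data: `adfuller` additionally includes lagged difference terms and computes proper p-values. The helper `dickey_fuller_stat` is hypothetical, and the value -2.86 is the approximate 5% critical value for the constant-only case in large samples.

```python
import numpy as np

def dickey_fuller_stat(y):
    """t-statistic of gamma in: diff(y)[t] = alpha + gamma * y[t-1] + error."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)                                   # first differences
    X = np.column_stack([np.ones(len(dy)), y[:-1]])   # constant + lagged level
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ beta
    sigma2 = resid @ resid / (len(dy) - X.shape[1])   # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)             # OLS covariance matrix
    return beta[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(0)
noise = rng.normal(size=500)            # stationary by construction
walk = np.cumsum(rng.normal(size=500))  # random walk, has a unit root

stat_noise = dickey_fuller_stat(noise)
stat_walk = dickey_fuller_stat(walk)
print("white noise statistic:", stat_noise)
print("random walk statistic:", stat_walk)
```

A strongly negative statistic (below the critical value) is evidence against a unit root. The white-noise series produces a far more negative statistic than the random walk, which mirrors the situation with our closing prices.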
# Checking Trend and Seasonality
from statsmodels.tsa.seasonal import seasonal_decompose

# Decompose the closing price only; recent statsmodels versions use `period` instead of `freq`
decomposition = seasonal_decompose(df['Close'], period=30)
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid

plt.subplot(411)
plt.plot(df['Close'], label='Original')
plt.legend(loc='best')
plt.subplot(412)
plt.plot(trend, label='Trend')
plt.legend(loc='best')
plt.subplot(413)
plt.plot(seasonal, label='Seasonality')
plt.legend(loc='best')
plt.subplot(414)
plt.plot(residual, label='Residuals')
plt.legend(loc='best')
plt.tight_layout()
plt.show()
Step 5. Fitting the Auto ARIMA Model
Now, we are going to create an auto ARIMA model and will train it with the closing price of the stock on the train data. So let us split the data into training and test set.
model_train=df.iloc[:int(df.shape[0]*0.80)]
valid=df.iloc[int(df.shape[0]*0.80):]
y_pred=valid.copy()
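Note that the split is chronological rather than random: the first 80% of the rows are used for training and the last 20% for validation, so the model never sees data from the future. A minimal sketch of the same idea on a toy array (the function name is ours):

```python
import numpy as np

def chronological_split(values, train_fraction=0.8):
    """Split an ordered series into an early training part and a later test part."""
    cut = int(len(values) * train_fraction)
    return values[:cut], values[cut:]

series = np.arange(10)  # stand-in for 10 consecutive observations
train, test = chronological_split(series)
print("train:", train)
print("test:", test)
```

Shuffling the rows before splitting would leak future prices into the training set, which is why time series are split by position instead.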
After that, we choose the parameters p, d and q for the ARIMA model. We are going to use Auto-ARIMA to get the best parameters without even plotting the ACF and PACF graphs.
In the auto ARIMA model, note that the lowercase p, d and q denote the non-seasonal orders. Here we search p and q values ranging from 1 to 3 and let auto ARIMA determine d. There are many other parameters in this model; to learn more about them, see the auto_arima documentation.
model_scores_r2=[]
model_scores_mse=[]
model_scores_rmse=[]
model_scores_mae=[]
model_scores_rmsle=[]

model_arima = auto_arima(model_train["Close"], trace=True, error_action='ignore',
                         start_p=1, start_q=1, max_p=3, max_q=3,
                         suppress_warnings=True, stepwise=False, seasonal=False)
model_arima.fit(model_train["Close"])
So the Auto-ARIMA model provided the value of p, d and q as 2, 1 and 3 respectively.
Step 6. Forecasting the Data
We now use the model trained in the previous step to forecast the closing price on the test data.
prediction_arima = model_arima.predict(len(valid))
# np.asarray drops any index on the forecast so the values align with y_pred by position
y_pred["ARIMA Model Prediction"] = np.asarray(prediction_arima)

# Evaluation metrics on the validation set
r2_arima = r2_score(y_pred["Close"], y_pred["ARIMA Model Prediction"])
mse_arima = mean_squared_error(y_pred["Close"], y_pred["ARIMA Model Prediction"])
rmse_arima = np.sqrt(mean_squared_error(y_pred["Close"], y_pred["ARIMA Model Prediction"]))
mae_arima = mean_absolute_error(y_pred["Close"], y_pred["ARIMA Model Prediction"])
rmsle_arima = np.sqrt(mean_squared_log_error(y_pred["Close"], y_pred["ARIMA Model Prediction"]))

model_scores_r2.append(r2_arima)
model_scores_mse.append(mse_arima)
model_scores_rmse.append(rmse_arima)
model_scores_mae.append(mae_arima)
model_scores_rmsle.append(rmsle_arima)

print("R Square Score ARIMA: ", r2_arima)
print("Mean Square Error ARIMA: ", mse_arima)
print("Root Mean Square Error ARIMA: ", rmse_arima)
print("Mean Absolute Error ARIMA: ", mae_arima)
print("Root Mean Squared Logarithmic Error ARIMA: ", rmsle_arima)

# Plot training data, validation data and the forecast
fig = go.Figure()
fig.add_trace(go.Scatter(x=model_train.index, y=model_train["Close"], mode='lines', name="Train Data for Stock Prices"))
fig.add_trace(go.Scatter(x=valid.index, y=valid["Close"], mode='lines', name="Validation Data for Stock Prices"))
fig.add_trace(go.Scatter(x=valid.index, y=y_pred["ARIMA Model Prediction"], mode='lines', name="Prediction for Stock Prices"))
fig.update_layout(title="ARIMA", xaxis_title="Date", yaxis_title="Close",
                  legend=dict(x=0, y=1, traceorder="normal"), font=dict(size=12))
fig.show()
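To make the error metrics used above concrete, here is how they can be computed by hand with NumPy on a tiny made-up example. The four values are invented for illustration and are not model output.

```python
import numpy as np

# Made-up actual and predicted values, for illustration only
actual = np.array([100.0, 102.0, 105.0, 103.0])
predicted = np.array([101.0, 101.0, 104.0, 106.0])

errors = actual - predicted
mse = np.mean(errors ** 2)            # (1 + 1 + 1 + 9) / 4 = 3.0
rmse = np.sqrt(mse)                   # square root of the MSE
mae = np.mean(np.abs(errors))         # (1 + 1 + 1 + 3) / 4 = 1.5
# RMSLE compares log(1 + value), so it measures relative rather than absolute error
rmsle = np.sqrt(np.mean((np.log1p(actual) - np.log1p(predicted)) ** 2))
# R squared: 1 minus the ratio of residual to total sum of squares
r2 = 1 - np.sum(errors ** 2) / np.sum((actual - actual.mean()) ** 2)

print("MSE:", mse, "RMSE:", rmse, "MAE:", mae)
print("RMSLE:", rmsle, "R2:", r2)
```

RMSE and MAE are expressed in the same unit as the price, which makes them the easiest to interpret for a stock forecast.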
An auto-ARIMA model uses past data to understand the pattern in the time series. Using these values, the model captured an increasing trend in the series. As is evident from the plot, the model has captured the trend in the series but does not model the seasonal part, since we set seasonal=False.
ARIMA_model_new_date=[]
ARIMA_model_new_prediction=[]
for i in range(1,14):
    ARIMA_model_new_date.append(df.index[-1]+timedelta(days=i))
    # Forecast len(valid)+i steps past the training data and keep the last value
    ARIMA_model_new_prediction.append(np.asarray(model_arima.predict(len(valid)+i))[-1])
pd.set_option('display.float_format', lambda x: '%.6f' % x)
model_predictions=pd.DataFrame(list(zip(ARIMA_model_new_date,ARIMA_model_new_prediction)), columns=["Dates","ARIMA Model Prediction"])
model_predictions
The result of the prediction:
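A side note on the forecast dates: `timedelta(days=i)` steps through calendar days, so the generated dates include weekends, when the market is closed. If trading days are preferred, one option is pandas’ `bdate_range`, sketched below. It assumes the last observation is 2020-10-27, as stated for this dataset, and it only skips weekends, not exchange holidays.

```python
import pandas as pd

last_observed = pd.Timestamp("2020-10-27")  # last date in the dataset (a Tuesday)

# 13 business days (Monday to Friday) following the last observation
future_days = pd.bdate_range(start=last_observed + pd.Timedelta(days=1), periods=13)
print(future_days)
```

These dates could then replace the ones built with `timedelta` when assembling the prediction table.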
Conclusion
In this article, we learned how to use the Auto ARIMA model. This approach comes in handy if you would like the model itself to determine the p, d, and q values. In the basic ARIMA model, we need to perform differencing and plot the ACF and PACF graphs to determine these values, which is time-consuming. However, it is still advisable to work through the statistical techniques and implement the basic ARIMA model to understand the intuition behind p, d, and q.