Exploratory Data Analysis#
Data Loading and Preliminary Checks:#
Load the dataset and perform basic checks for data types, missing values, and summary statistics.
import plotly.graph_objs as go
from plotly.subplots import make_subplots
import ipywidgets as widgets
from IPython.display import display
import random
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
warnings.filterwarnings("ignore")
file_path = 'final_dataset.xlsx'
data = pd.read_excel(file_path)
# Display the first few rows of the dataframe and its basic information
data_info = data.info()
data_head = data.head()
data_descriptive_stats = data.describe()
data_info, data_head, data_descriptive_stats
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26680 entries, 0 to 26679
Data columns (total 44 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 26680 non-null datetime64[ns]
1 Open 26680 non-null float64
2 High 26680 non-null float64
3 Low 26680 non-null float64
4 Close 26680 non-null float64
5 Volume 26680 non-null int64
6 Dividends 26680 non-null float64
7 Stock Splits 26680 non-null float64
8 Daily_Return 26680 non-null float64
9 Daily_Return_Percentage 26680 non-null float64
10 Target_1day 26680 non-null int64
11 Target_5days 26680 non-null int64
12 Target_30days 26680 non-null int64
13 Percentage_Return_1day_before 26680 non-null float64
14 Percentage_Return_5days_before 26680 non-null float64
15 Percentage_Return_15days_before 26680 non-null float64
16 Percentage_Return_30days_before 26680 non-null float64
17 Percentage_Return_90days_before 26680 non-null float64
18 Percentage_Return_180days_before 26680 non-null float64
19 Percentage_Return_360days_before 26680 non-null float64
20 Net Income 26680 non-null int64
21 Diluted EPS 26680 non-null float64
22 Total Revenue 26680 non-null int64
23 Normalized EBITDA 26680 non-null int64
24 Total Unusual Items 26680 non-null int64
25 Total Unusual Items Excluding Goodwill 26680 non-null int64
26 Operating Cash Flow 26680 non-null int64
27 Capital Expenditure 26680 non-null int64
28 Free Cash Flow 26680 non-null int64
29 Cash Flow From Continuing Operating Activities 26680 non-null int64
30 Cash Flow From Continuing Investing Activities 26680 non-null int64
31 Cash Flow From Continuing Financing Activities 26680 non-null int64
32 MA_5 26680 non-null float64
33 MA_10 26680 non-null float64
34 MA_30 26680 non-null float64
35 MA_50 26680 non-null float64
36 RSI 26680 non-null float64
37 MACD 26680 non-null float64
38 Signal_Line 26680 non-null float64
39 Bollinger_Mid_Band 26680 non-null float64
40 Bollinger_Upper_Band 26680 non-null float64
41 Bollinger_Lower_Band 26680 non-null float64
42 Volatility 26680 non-null float64
43 Ticker 26680 non-null object
dtypes: datetime64[ns](1), float64(27), int64(15), object(1)
memory usage: 9.0+ MB
(None,
Date Open High Low Close Volume \
0 2020-06-30 179.305945 183.533295 179.102439 182.802521 3102800
1 2020-07-01 183.968061 184.763579 180.859992 182.756287 2620100
2 2020-07-02 187.316608 187.779118 182.349254 182.598999 2699400
3 2020-07-06 186.243599 192.209977 186.049352 191.812225 3567700
4 2020-07-07 190.091704 190.285964 184.254827 184.412079 2853500
Dividends Stock Splits Daily_Return Daily_Return_Percentage ... \
0 0.0 0.0 3.496576 1.912761 ...
1 0.0 0.0 -1.211774 -0.663055 ...
2 0.0 0.0 -4.717609 -2.583590 ...
3 0.0 0.0 5.568627 2.903166 ...
4 0.0 0.0 -5.679625 -3.079855 ...
MA_30 MA_50 RSI MACD Signal_Line \
0 185.971379 177.444586 40.980241 0.824023 3.006285
1 186.614105 177.904132 52.500011 0.573091 2.519646
2 187.140969 178.320634 46.428544 0.357413 2.087199
3 188.016001 178.938500 50.786526 0.919320 1.853623
4 188.649571 179.372512 42.843182 0.758759 1.634651
Bollinger_Mid_Band Bollinger_Upper_Band Bollinger_Lower_Band Volatility \
0 190.221655 205.998882 174.444429 6.840469
1 189.620391 205.574176 173.666606 6.521267
2 188.814696 204.471975 173.157416 5.091692
3 188.326285 202.922613 173.729956 4.048404
4 187.334203 200.017990 174.650416 4.947823
Ticker
0 GS
1 GS
2 GS
3 GS
4 GS
[5 rows x 44 columns],
Date Open High \
count 26680 26680.000000 26680.000000
mean 2022-02-24 13:37:44.707646208 7915.455428 8004.236860
min 2020-06-30 00:00:00 2.240000 2.320000
25% 2021-04-28 00:00:00 50.731169 51.079580
50% 2022-02-24 00:00:00 134.755567 135.988548
75% 2022-12-22 00:00:00 230.284479 233.572506
max 2023-10-26 00:00:00 251559.330417 268268.048189
std NaN 33401.718334 33792.139828
Low Close Volume Dividends Stock Splits \
count 26680.000000 26680.000000 2.668000e+04 26680.000000 26680.000000
mean 7825.245416 7910.253675 2.326393e+07 0.854714 0.001034
min 2.150000 2.240000 0.000000e+00 0.000000 0.000000
25% 50.355870 50.688783 2.487925e+06 0.000000 0.000000
50% 133.559749 134.675926 9.155860e+06 0.000000 0.000000
75% 226.726494 230.135082 2.355792e+07 0.000000 0.000000
max 240884.308837 248310.390625 9.920910e+08 6000.000000 20.000000
std 33001.457927 33375.164501 4.600700e+07 51.367560 0.125511
Daily_Return Daily_Return_Percentage ... MA_10 \
count 26680.000000 26680.000000 ... 26680.000000
mean -5.201753 -0.038065 ... 7889.953801
min -12067.422080 -17.496112 ... 2.390000
25% -0.871245 -0.838023 ... 50.618732
50% 0.000000 0.000000 ... 134.850706
75% 0.800007 0.791362 ... 230.385297
max 16901.003171 20.396598 ... 237728.198437
std 519.596118 1.787562 ... 33293.881903
MA_30 MA_50 RSI MACD Signal_Line \
count 26680.000000 26680.000000 26680.000000 26680.000000 26680.000000
mean 7845.562875 7798.864681 51.034668 31.972182 32.220391
min 2.452667 2.665600 0.244804 -6875.778683 -6356.702216
25% 50.699711 50.513578 38.732184 -0.884412 -0.784096
50% 134.626343 134.448401 50.795485 0.067447 0.079939
75% 231.548368 231.453463 63.380262 1.560649 1.471068
max 229807.018229 224732.511563 97.252738 18118.585490 16295.612855
std 33115.586330 32932.846292 16.894598 758.849178 720.270199
Bollinger_Mid_Band Bollinger_Upper_Band Bollinger_Lower_Band \
count 26680.000000 26680.000000 26680.000000
mean 7867.662250 8307.270204 7428.054296
min 2.431500 2.564802 2.180620
25% 50.655560 52.418498 48.810667
50% 134.776992 140.944751 127.569949
75% 231.423839 246.643646 210.289578
max 232437.097656 271469.530941 217108.947693
std 33204.190146 35081.969662 31386.971076
Volatility
count 26680.000000
mean 104.641068
min 0.003811
25% 0.510268
50% 1.347093
75% 3.196869
max 11420.843858
std 522.629274
[8 rows x 43 columns])
The dataset contains 26,680 entries and 44 columns. Each row represents financial data for a specific date and ticker, with various financial metrics and technical indicators.
The key features include:
Date
: The date of the financial data.Open, High, Low, Close
: Stock price information.Volume
: Number of shares traded.Dividends, Stock Splits
: Corporate actions.Daily_Return
: The daily return of the stock.Target_1day, Target_5days, Target_30days
: Target variables for stock prediction.Various Financial Metrics: Like Net Income, Diluted EPS, Total Revenue, etc.
Technical Indicators: Such as
MA (Moving Averages)
,RSI (Relative Strength Index)
,MACD
,Bollinger Bands
, etc.Ticker: The stock symbol.
Here are some key observations:
Price Metrics (
Open, High, Low, Close
): There’s a wide range of values, as indicated by the standard deviation, suggesting significant variability among different stocks. The maximum values are quite high, which could be due to high-priced stocks or outliers.Volume: The trading volume varies greatly, with a large standard deviation. This is expected as trading volumes can differ significantly between stocks.
Dividends and Stock Splits: Most values are 0, but there are some extreme values (like a maximum dividend of 6000 and stock split of 20), which are likely outliers or specific events.
Daily Return: The average is close to zero, but the range (-0.27 to 0.37) indicates periods of high volatility.
Moving Averages (
MA_5, MA_10, MA_30, MA_50
): Similar to price metrics, there’s a wide range, indicating diverse stock behaviors.Technical Indicators (
RSI, MACD, Bollinger Bands
): The RSI ranges from 0.24 to 97.25, with an average around 51, indicating a balance between overbought and oversold conditions. MACD and Bollinger Bands also show a wide range, which is typical given their dependency on stock prices.Volatility: Varies between 0.05% to 17.87%, indicating some stocks are much more volatile than others.
sns.set_style("whitegrid")
# Selecting a subset of columns for distribution analysis
columns_to_plot = ['Open', 'High', 'Low', 'Close', 'Volume', 'Daily_Return',
'Net Income', 'Diluted EPS', 'Total Revenue', 'MA_5', 'MA_10', 'MA_30', 'MA_50', 'RSI', 'Signal_Line','MACD', 'Bollinger_Mid_Band', 'Bollinger_Upper_Band','Bollinger_Lower_Band', 'Volatility' ]
# Plotting distributions
fig, axes = plt.subplots(len(columns_to_plot), 1, figsize=(10, 20))
for i, col in enumerate(columns_to_plot):
sns.histplot(data[col], ax=axes[i], kde=True, bins=30)
axes[i].set_title(f'Distribution of {col}')
axes[i].set_ylabel('Frequency')
axes[i].set_xlabel(col)
plt.tight_layout()
plt.show()
The distributions for the key features like 'Open', 'High', 'Low', 'Close', 'Volume', 'Daily_Return', 'Net Income', 'Diluted EPS', and 'Total Revenue'
show that:
Stock Prices (
Open, High, Low, Close
): These might show skewed distributions, indicating a range of stock prices across different tickers. It’s common for stock price distributions to be non-normal, often skewed towards lower prices.Volume
: The trading volume is likely right-skewed, with more occurrences of lower trading volumes and fewer occurrences of extremely high volumes.Daily_Return
: The distribution of daily returns often resembles a normal distribution but can have heavy tails, indicating days with unusually high gains or losses.Financial Metrics (
Net Income, Diluted EPS, Total Revenue
): These also tend to be skewed, reflecting the varied financial health and sizes of different companies.
fig = px.scatter(data_frame=data,
x='Volume',
y='Daily_Return',
size='Volatility',
color='Ticker',
hover_data=['Open', 'Close', 'Net Income'],
title="Volume vs Daily Return with Volatility and Stock Information")
fig.update_layout(
plot_bgcolor='black', # Cambia il colore di sfondo dell'area del plot
paper_bgcolor='black', # Cambia il colore di sfondo intorno all'area del plot
font_color='white' # Cambia il colore del testo per migliorare la visibilità su sfondo nero
)
fig.update_traces(marker=dict(opacity=0.7))
# Ad esempio, per rendere i punti più piccoli, potresti provare un valore più alto di sizeref
fig.update_traces(marker=dict(sizeref=2 * max(data['Volatility']) / 40**2))
fig.show()
low_volatility = data[data['Volatility'] <= 20]
fig = px.scatter(data_frame=low_volatility,
x='Volume',
y='Daily_Return',
size='Volatility',
color='Ticker',
hover_data=['Open', 'Close', 'Net Income'],
title="Volume vs Daily Return with Volatility and Stock Information")
fig.update_layout(
plot_bgcolor='black', # Cambia il colore di sfondo dell'area del plot
paper_bgcolor='black', # Cambia il colore di sfondo intorno all'area del plot
font_color='white' # Cambia il colore del testo per migliorare la visibilità su sfondo nero
)
fig.update_traces(marker=dict(opacity=0.7))
The scatter plot vividly displays a spectrum of trading activity across various stocks, represented by unique colors for each ticker symbol. Notably, the ticker ‘005380KS’ appears frequently with a range of volumes, as indicated by the numerous green dots. These dots are primarily concentrated toward the lower end of the volume spectrum, suggesting that ‘005380KS’ typically trades with less volume. The daily returns for this ticker are mostly clustered near the zero mark, implying that extreme returns are not a common event for this stock.
The size of the dots, which represents the stock’s volatility, shows that larger dots, and hence higher volatility, are not confined to any single volume bracket or daily return level. This suggests that volatility can spike regardless of the trading volume or the direction of the return on a given day.
The plot also features other tickers like ‘AAPL’, indicated by light pink dots, and ‘AMZN’. These particular tickers appear less frequently but are scattered across a wider volume range. This pattern could suggest a more variable trading volume for these well-known tech giants.
The hover data feature of the plot allows viewers to delve into specific details such as the opening and closing prices, as well as the net income on particular days for each stock, providing a more comprehensive view of each stock’s performance. For example, hovering over a particularly large dot could reveal a day with significant price movement and higher volatility, possibly aligned with a corporate announcement or market event.
Correlation Analysis#
Examine the distribution of key features like 'Open', 'Close', 'Volume'
, etc.
# Dropping non-numerical columns (like 'Ticker') before computing the correlation matrix
numerical_data = data.select_dtypes(include=['float64', 'int64'])
correlation_matrix_numerical = numerical_data.corr()
# Plotting the heatmap for the numerical data
plt.figure(figsize=(15, 10))
sns.heatmap(correlation_matrix_numerical, cmap='coolwarm', annot=False, fmt=".1f")
plt.title('Correlation Matrix of Numerical Features')
plt.show()
The heatmap provides a visual representation of the correlation matrix, showing how different features are correlated with each other. However, due to the large number of features, it’s quite dense and a bit challenging to extract specific insights at this resolution.
For a more focused analysis, let’s zoom in on the correlations of the target variables ('Target_1day', 'Target_5days', 'Target_30days'
) with other features. This will give us a clearer view of which features might be more influential for predicting stock movements. I’ll extract and visualize these specific correlations.
# Focusing on the correlation of target variables with other features
target_features = ['Target_1day', 'Target_5days', 'Target_30days']
correlation_with_targets = correlation_matrix_numerical[target_features].drop(target_features)
# Plotting the correlations with target variables
plt.figure(figsize=(10, 12))
sns.heatmap(correlation_with_targets, cmap='coolwarm', annot=True, fmt=".2f")
plt.title('Correlation of Features with Target Variables')
plt.show()
The heatmap now focuses on the correlations between the target variables ('Target_1day', 'Target_5days', 'Target_30days'
) and other features in the dataset. Here’s what we can infer:
Certain features show a moderate to low level of correlation with the target variables.
It’s important to note that correlation does not imply causation, but it can give us an idea of which features might be useful in predictive modeling.
The correlations are relatively low, which is typical in financial datasets due to the complexity and unpredictability of stock price movements.
Next, let’s perform a time series analysis visualization. Given the financial nature of the data, it’s crucial to understand how certain key variables like
'Close', 'Volume'
, and technical indicators like'RSI', 'MACD'
change over time.
Time Series Analysis#
Let’s start by plotting all the stock performance plot foreach unique ticker to take a general insight on what we are studying.
import plotly.offline as pyo
pyo.init_notebook_mode(connected=True)
def plot_stock_performance(ticker, df):
df_ticker = df[df['Ticker'] == ticker]
fig = make_subplots(rows=2, cols=1, shared_xaxes=True,
vertical_spacing=0.1, subplot_titles=(f'{ticker} Stock Prices', f'{ticker} Volume'))
fig.add_trace(go.Candlestick(x=df_ticker['Date'],
open=df_ticker['Open'],
high=df_ticker['High'],
low=df_ticker['Low'],
close=df_ticker['Close'],
name='OHLC'), row=1, col=1)
fig.add_trace(go.Bar(x=df_ticker['Date'], y=df_ticker['Volume'], name='Volume'), row=2, col=1)
# layout
fig.update_layout(title=f'Stock Performance for {ticker}', xaxis_rangeslider_visible=False)
return fig
ticker_list = sorted(list(set(data["Ticker"])))
ticker_dropdown = widgets.Dropdown(
options=ticker_list,
value=random.choice(ticker_list), # Seleziona uno stock random all'inizio
description='Ticker:',
)
def update_plot(ticker):
fig = plot_stock_performance(ticker, data)
fig.show()
interactive_plot = widgets.interactive(update_plot, ticker=ticker_dropdown)
display(interactive_plot)
Technical Indicators Analysis#
ROI Analysis#
timeframes = [
"Percentage_Return_5days_before",
"Percentage_Return_15days_before",
"Percentage_Return_30days_before",
"Percentage_Return_90days_before",
"Percentage_Return_180days_before",
"Percentage_Return_360days_before",
]
# Melting the data for easier plotting
melted_data = data.melt(id_vars=['Ticker'], value_vars=timeframes, var_name='Timeframe', value_name='Return_Percentage')
plt.figure(figsize=(30, 20))
sns.boxplot(x='Timeframe', y='Return_Percentage', hue='Ticker', data=melted_data)
plt.title('Distribution of Returns Over Different Timeframes for All Stocks', fontsize=24)
plt.xlabel('Timeframe', fontsize=22)
plt.ylabel('Return Percentage (%)', fontsize=22)
plt.xticks(rotation=45, fontsize=20)
plt.yticks(fontsize=20)
plt.legend(title='Stock Ticker', loc='upper right', fontsize=18)
plt.grid(axis='y')
plt.show()
The boxplot suggests the following:
Return Distribution: The interquartile range (IQR) in the boxplots indicates the middle 50% of returns for each stock and timeframe, which is useful for assessing the typical return dispersion.
Volatility Trends: The length of the whiskers relative to the IQR can be used as a proxy for volatility. Longer whiskers relative to the IQR suggest higher volatility.
Timeframe Sensitivity: Stocks with increasing variability over longer timeframes may be more sensitive to macroeconomic factors or have higher inherent business risks.
Outlier Analysis: Points beyond the whiskers are outliers and may represent extreme events or anomalies. Clusters of outliers could indicate periods of market stress or company-specific issues.
Cross-sectional Analysis: Comparing boxes across different stocks for the same timeframe can provide insight into relative performance and risk.
RSI Analysis#
# Plotting the violin plot for RSI distribution of each stock
plt.figure(figsize=(20, 10))
sns.violinplot(x='Ticker', y='RSI', data=data, palette='Spectral')
plt.title('Distribuzione del RSI per Ogni Stock', fontsize=18)
plt.xlabel('Ticker del Stock', fontsize=16)
plt.ylabel('RSI', fontsize=16)
plt.axhline(70, ls='--', color='red') # Soglia di overbought
plt.axhline(30, ls='--', color='blue') # Soglia di oversold
plt.xticks(rotation=45, fontsize=14)
plt.yticks(fontsize=14)
plt.grid(axis='y')
plt.show()
The violin plot for the RSI distribution across various stocks offers a detailed visual representation of the momentum characteristics for each stock. Here are some specific technical insights:
Median RSI Values: Most stocks have median RSI values hovering around the mid-range (50), which is neutral territory, indicating no strong momentum in either direction for the median stock.
Overbought Stocks: Several stocks show parts of their distribution extending above the overbought threshold (70), indicating that these stocks frequently enter overbought territory. (For models)Can be important to analyze the frequency and duration of these occurrences for potential overbought conditions.
Oversold Stocks: A few stocks also extend below the oversold threshold (30). Persistent
RSI
levels below this line can suggest that a stock is being oversold, which might be a buying opportunity if other conditions are favorable.RSI Distribution Shape: The shape of the violins indicates the probability density and skewness of the RSI values. Wider sections represent a higher probability of the RSI assuming that value. Asymmetrical violins suggest a skew towards either overbought or oversold conditions.
RSI Variability: The varied width of the violins suggests differences in the variability of the
RSI
among the stocks. A wider violin indicates greater variability and potentially higher volatility in price movement.Potential Anomalies: The tails of the violins represent the extremes of the
RSI
distribution. Long tails touching or crossing the overbought and oversold thresholds could indicate outlier events or anomalies in stock behavior.Inter-stock Comparison: The plot allows for a quick comparison across stocks. Stocks with similar RSI distributions might be subject to similar trading dynamics or market perceptions.
Target Feature Analysis#
# Plotting the balance of each target feature side by side
fig, axes = plt.subplots(1, 3, figsize=(18, 5)) # 1 row, 3 columns
for i, col in enumerate(target_features):
sns.countplot(x=col, data=data, ax=axes[i])
axes[i].set_title(f'Balance of {col}')
plt.tight_layout()
plt.show()
The target features Target_1day, Target_5days, and Target_30days are relatively balanced with a slight majority of one class in each. No need to balanced the dataset.