Exploratory Data Analysis#

Data Loading and Preliminary Checks:#

Load the dataset and perform basic checks for data types, missing values, and summary statistics.

import plotly.graph_objs as go
from plotly.subplots import make_subplots
import ipywidgets as widgets
from IPython.display import display
import random
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
warnings.filterwarnings("ignore")
file_path = 'final_dataset.xlsx'
data = pd.read_excel(file_path)

# Display the first few rows of the dataframe and its basic information
data_info = data.info()
data_head = data.head()
data_descriptive_stats = data.describe()

data_info, data_head, data_descriptive_stats
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26680 entries, 0 to 26679
Data columns (total 44 columns):
 #   Column                                          Non-Null Count  Dtype         
---  ------                                          --------------  -----         
 0   Date                                            26680 non-null  datetime64[ns]
 1   Open                                            26680 non-null  float64       
 2   High                                            26680 non-null  float64       
 3   Low                                             26680 non-null  float64       
 4   Close                                           26680 non-null  float64       
 5   Volume                                          26680 non-null  int64         
 6   Dividends                                       26680 non-null  float64       
 7   Stock Splits                                    26680 non-null  float64       
 8   Daily_Return                                    26680 non-null  float64       
 9   Daily_Return_Percentage                         26680 non-null  float64       
 10  Target_1day                                     26680 non-null  int64         
 11  Target_5days                                    26680 non-null  int64         
 12  Target_30days                                   26680 non-null  int64         
 13  Percentage_Return_1day_before                   26680 non-null  float64       
 14  Percentage_Return_5days_before                  26680 non-null  float64       
 15  Percentage_Return_15days_before                 26680 non-null  float64       
 16  Percentage_Return_30days_before                 26680 non-null  float64       
 17  Percentage_Return_90days_before                 26680 non-null  float64       
 18  Percentage_Return_180days_before                26680 non-null  float64       
 19  Percentage_Return_360days_before                26680 non-null  float64       
 20  Net Income                                      26680 non-null  int64         
 21  Diluted EPS                                     26680 non-null  float64       
 22  Total Revenue                                   26680 non-null  int64         
 23  Normalized EBITDA                               26680 non-null  int64         
 24  Total Unusual Items                             26680 non-null  int64         
 25  Total Unusual Items Excluding Goodwill          26680 non-null  int64         
 26  Operating Cash Flow                             26680 non-null  int64         
 27  Capital Expenditure                             26680 non-null  int64         
 28  Free Cash Flow                                  26680 non-null  int64         
 29  Cash Flow From Continuing Operating Activities  26680 non-null  int64         
 30  Cash Flow From Continuing Investing Activities  26680 non-null  int64         
 31  Cash Flow From Continuing Financing Activities  26680 non-null  int64         
 32  MA_5                                            26680 non-null  float64       
 33  MA_10                                           26680 non-null  float64       
 34  MA_30                                           26680 non-null  float64       
 35  MA_50                                           26680 non-null  float64       
 36  RSI                                             26680 non-null  float64       
 37  MACD                                            26680 non-null  float64       
 38  Signal_Line                                     26680 non-null  float64       
 39  Bollinger_Mid_Band                              26680 non-null  float64       
 40  Bollinger_Upper_Band                            26680 non-null  float64       
 41  Bollinger_Lower_Band                            26680 non-null  float64       
 42  Volatility                                      26680 non-null  float64       
 43  Ticker                                          26680 non-null  object        
dtypes: datetime64[ns](1), float64(27), int64(15), object(1)
memory usage: 9.0+ MB
(None,
         Date        Open        High         Low       Close   Volume  \
 0 2020-06-30  179.305945  183.533295  179.102439  182.802521  3102800   
 1 2020-07-01  183.968061  184.763579  180.859992  182.756287  2620100   
 2 2020-07-02  187.316608  187.779118  182.349254  182.598999  2699400   
 3 2020-07-06  186.243599  192.209977  186.049352  191.812225  3567700   
 4 2020-07-07  190.091704  190.285964  184.254827  184.412079  2853500   
 
    Dividends  Stock Splits  Daily_Return  Daily_Return_Percentage  ...  \
 0        0.0           0.0      3.496576                 1.912761  ...   
 1        0.0           0.0     -1.211774                -0.663055  ...   
 2        0.0           0.0     -4.717609                -2.583590  ...   
 3        0.0           0.0      5.568627                 2.903166  ...   
 4        0.0           0.0     -5.679625                -3.079855  ...   
 
         MA_30       MA_50        RSI      MACD  Signal_Line  \
 0  185.971379  177.444586  40.980241  0.824023     3.006285   
 1  186.614105  177.904132  52.500011  0.573091     2.519646   
 2  187.140969  178.320634  46.428544  0.357413     2.087199   
 3  188.016001  178.938500  50.786526  0.919320     1.853623   
 4  188.649571  179.372512  42.843182  0.758759     1.634651   
 
    Bollinger_Mid_Band  Bollinger_Upper_Band  Bollinger_Lower_Band  Volatility  \
 0          190.221655            205.998882            174.444429    6.840469   
 1          189.620391            205.574176            173.666606    6.521267   
 2          188.814696            204.471975            173.157416    5.091692   
 3          188.326285            202.922613            173.729956    4.048404   
 4          187.334203            200.017990            174.650416    4.947823   
 
    Ticker  
 0      GS  
 1      GS  
 2      GS  
 3      GS  
 4      GS  
 
 [5 rows x 44 columns],
                                 Date           Open           High  \
 count                          26680   26680.000000   26680.000000   
 mean   2022-02-24 13:37:44.707646208    7915.455428    8004.236860   
 min              2020-06-30 00:00:00       2.240000       2.320000   
 25%              2021-04-28 00:00:00      50.731169      51.079580   
 50%              2022-02-24 00:00:00     134.755567     135.988548   
 75%              2022-12-22 00:00:00     230.284479     233.572506   
 max              2023-10-26 00:00:00  251559.330417  268268.048189   
 std                              NaN   33401.718334   33792.139828   
 
                  Low          Close        Volume     Dividends  Stock Splits  \
 count   26680.000000   26680.000000  2.668000e+04  26680.000000  26680.000000   
 mean     7825.245416    7910.253675  2.326393e+07      0.854714      0.001034   
 min         2.150000       2.240000  0.000000e+00      0.000000      0.000000   
 25%        50.355870      50.688783  2.487925e+06      0.000000      0.000000   
 50%       133.559749     134.675926  9.155860e+06      0.000000      0.000000   
 75%       226.726494     230.135082  2.355792e+07      0.000000      0.000000   
 max    240884.308837  248310.390625  9.920910e+08   6000.000000     20.000000   
 std     33001.457927   33375.164501  4.600700e+07     51.367560      0.125511   
 
        Daily_Return  Daily_Return_Percentage  ...          MA_10  \
 count  26680.000000             26680.000000  ...   26680.000000   
 mean      -5.201753                -0.038065  ...    7889.953801   
 min   -12067.422080               -17.496112  ...       2.390000   
 25%       -0.871245                -0.838023  ...      50.618732   
 50%        0.000000                 0.000000  ...     134.850706   
 75%        0.800007                 0.791362  ...     230.385297   
 max    16901.003171                20.396598  ...  237728.198437   
 std      519.596118                 1.787562  ...   33293.881903   
 
                MA_30          MA_50           RSI          MACD   Signal_Line  \
 count   26680.000000   26680.000000  26680.000000  26680.000000  26680.000000   
 mean     7845.562875    7798.864681     51.034668     31.972182     32.220391   
 min         2.452667       2.665600      0.244804  -6875.778683  -6356.702216   
 25%        50.699711      50.513578     38.732184     -0.884412     -0.784096   
 50%       134.626343     134.448401     50.795485      0.067447      0.079939   
 75%       231.548368     231.453463     63.380262      1.560649      1.471068   
 max    229807.018229  224732.511563     97.252738  18118.585490  16295.612855   
 std     33115.586330   32932.846292     16.894598    758.849178    720.270199   
 
        Bollinger_Mid_Band  Bollinger_Upper_Band  Bollinger_Lower_Band  \
 count        26680.000000          26680.000000          26680.000000   
 mean          7867.662250           8307.270204           7428.054296   
 min              2.431500              2.564802              2.180620   
 25%             50.655560             52.418498             48.810667   
 50%            134.776992            140.944751            127.569949   
 75%            231.423839            246.643646            210.289578   
 max         232437.097656         271469.530941         217108.947693   
 std          33204.190146          35081.969662          31386.971076   
 
          Volatility  
 count  26680.000000  
 mean     104.641068  
 min        0.003811  
 25%        0.510268  
 50%        1.347093  
 75%        3.196869  
 max    11420.843858  
 std      522.629274  
 
 [8 rows x 43 columns])

The dataset contains 26,680 entries and 44 columns. Each row represents financial data for a specific date and ticker, with various financial metrics and technical indicators.

The key features include:

  • Date: The date of the financial data.

  • Open, High, Low, Close: Stock price information.

  • Volume: Number of shares traded.

  • Dividends, Stock Splits: Corporate actions.

  • Daily_Return: The daily return of the stock.

  • Target_1day, Target_5days, Target_30days: Target variables for stock prediction.

  • Various Financial Metrics: Like Net Income, Diluted EPS, Total Revenue, etc.

  • Technical Indicators: Such as MA (Moving Averages), RSI (Relative Strength Index), MACD, Bollinger Bands, etc.

  • Ticker: The stock symbol.

Here are some key observations:

  • Price Metrics (Open, High, Low, Close): There’s a wide range of values, as indicated by the standard deviation, suggesting significant variability among different stocks. The maximum values are quite high, which could be due to high-priced stocks or outliers.

  • Volume: The trading volume varies greatly, with a large standard deviation. This is expected as trading volumes can differ significantly between stocks.

  • Dividends and Stock Splits: Most values are 0, but there are some extreme values (like a maximum dividend of 6000 and stock split of 20), which are likely outliers or specific events.

  • Daily Return: The average is close to zero, but the range (-0.27 to 0.37) indicates periods of high volatility.

  • Moving Averages (MA_5, MA_10, MA_30, MA_50): Similar to price metrics, there’s a wide range, indicating diverse stock behaviors.

  • Technical Indicators (RSI, MACD, Bollinger Bands): The RSI ranges from 0.24 to 97.25, with an average around 51, indicating a balance between overbought and oversold conditions. MACD and Bollinger Bands also show a wide range, which is typical given their dependency on stock prices.

  • Volatility: Varies between 0.05% to 17.87%, indicating some stocks are much more volatile than others.

sns.set_style("whitegrid")

# Selecting a subset of columns for distribution analysis
columns_to_plot = ['Open', 'High', 'Low', 'Close', 'Volume', 'Daily_Return', 
                   'Net Income', 'Diluted EPS', 'Total Revenue', 'MA_5', 'MA_10', 'MA_30', 'MA_50', 'RSI',	'Signal_Line','MACD', 'Bollinger_Mid_Band', 'Bollinger_Upper_Band','Bollinger_Lower_Band', 'Volatility' ]

# Plotting distributions
fig, axes = plt.subplots(len(columns_to_plot), 1, figsize=(10, 20))

for i, col in enumerate(columns_to_plot):
    sns.histplot(data[col], ax=axes[i], kde=True, bins=30)
    axes[i].set_title(f'Distribution of {col}')
    axes[i].set_ylabel('Frequency')
    axes[i].set_xlabel(col)

plt.tight_layout()
plt.show()
../_images/494363682b89fd27f51ec732077f1a1da80e00e790e86465ab292466334cacd6.png

The distributions for the key features like 'Open', 'High', 'Low', 'Close', 'Volume', 'Daily_Return', 'Net Income', 'Diluted EPS', and 'Total Revenue' show that:

  • Stock Prices (Open, High, Low, Close): These might show skewed distributions, indicating a range of stock prices across different tickers. It’s common for stock price distributions to be non-normal, often skewed towards lower prices.

  • Volume: The trading volume is likely right-skewed, with more occurrences of lower trading volumes and fewer occurrences of extremely high volumes.

  • Daily_Return: The distribution of daily returns often resembles a normal distribution but can have heavy tails, indicating days with unusually high gains or losses.

  • Financial Metrics (Net Income, Diluted EPS, Total Revenue): These also tend to be skewed, reflecting the varied financial health and sizes of different companies.

fig = px.scatter(data_frame=data, 
                 x='Volume', 
                 y='Daily_Return', 
                 size='Volatility', 
                 color='Ticker', 
                 hover_data=['Open', 'Close', 'Net Income'],
                 title="Volume vs Daily Return with Volatility and Stock Information")

fig.update_layout(
    plot_bgcolor='black', # Cambia il colore di sfondo dell'area del plot
    paper_bgcolor='black', # Cambia il colore di sfondo intorno all'area del plot
    font_color='white' # Cambia il colore del testo per migliorare la visibilità su sfondo nero
)

fig.update_traces(marker=dict(opacity=0.7))
# Ad esempio, per rendere i punti più piccoli, potresti provare un valore più alto di sizeref
fig.update_traces(marker=dict(sizeref=2 * max(data['Volatility']) / 40**2))



fig.show()
low_volatility = data[data['Volatility'] <= 20]

fig = px.scatter(data_frame=low_volatility, 
                 x='Volume', 
                 y='Daily_Return', 
                 size='Volatility', 
                 color='Ticker', 
                 hover_data=['Open', 'Close', 'Net Income'],
                 title="Volume vs Daily Return with Volatility and Stock Information")

fig.update_layout(
    plot_bgcolor='black', # Cambia il colore di sfondo dell'area del plot
    paper_bgcolor='black', # Cambia il colore di sfondo intorno all'area del plot
    font_color='white' # Cambia il colore del testo per migliorare la visibilità su sfondo nero
)

fig.update_traces(marker=dict(opacity=0.7))

The scatter plot vividly displays a spectrum of trading activity across various stocks, represented by unique colors for each ticker symbol. Notably, the ticker ‘005380KS’ appears frequently with a range of volumes, as indicated by the numerous green dots. These dots are primarily concentrated toward the lower end of the volume spectrum, suggesting that ‘005380KS’ typically trades with less volume. The daily returns for this ticker are mostly clustered near the zero mark, implying that extreme returns are not a common event for this stock.

The size of the dots, which represents the stock’s volatility, shows that larger dots, and hence higher volatility, are not confined to any single volume bracket or daily return level. This suggests that volatility can spike regardless of the trading volume or the direction of the return on a given day.

The plot also features other tickers like ‘AAPL’, indicated by light pink dots, and ‘AMZN’. These particular tickers appear less frequently but are scattered across a wider volume range. This pattern could suggest a more variable trading volume for these well-known tech giants.

The hover data feature of the plot allows viewers to delve into specific details such as the opening and closing prices, as well as the net income on particular days for each stock, providing a more comprehensive view of each stock’s performance. For example, hovering over a particularly large dot could reveal a day with significant price movement and higher volatility, possibly aligned with a corporate announcement or market event.

Correlation Analysis#

Examine the distribution of key features like 'Open', 'Close', 'Volume', etc.

# Dropping non-numerical columns (like 'Ticker') before computing the correlation matrix
numerical_data = data.select_dtypes(include=['float64', 'int64'])
correlation_matrix_numerical = numerical_data.corr()

# Plotting the heatmap for the numerical data
plt.figure(figsize=(15, 10))
sns.heatmap(correlation_matrix_numerical, cmap='coolwarm', annot=False, fmt=".1f")
plt.title('Correlation Matrix of Numerical Features')
plt.show()
../_images/88c3dee965ce7d10135416dc0dbe94131077b54578fac6a657159e3de8c31a02.png

The heatmap provides a visual representation of the correlation matrix, showing how different features are correlated with each other. However, due to the large number of features, it’s quite dense and a bit challenging to extract specific insights at this resolution.

For a more focused analysis, let’s zoom in on the correlations of the target variables ('Target_1day', 'Target_5days', 'Target_30days') with other features. This will give us a clearer view of which features might be more influential for predicting stock movements. I’ll extract and visualize these specific correlations.

# Focusing on the correlation of target variables with other features
target_features = ['Target_1day', 'Target_5days', 'Target_30days']
correlation_with_targets = correlation_matrix_numerical[target_features].drop(target_features)

# Plotting the correlations with target variables
plt.figure(figsize=(10, 12))
sns.heatmap(correlation_with_targets, cmap='coolwarm', annot=True, fmt=".2f")
plt.title('Correlation of Features with Target Variables')
plt.show()
../_images/826a7f6fc267d1057a40c4d26cd530ebe96e8def21270c3904ed4df290f6ca9f.png

The heatmap now focuses on the correlations between the target variables ('Target_1day', 'Target_5days', 'Target_30days') and other features in the dataset. Here’s what we can infer:

Certain features show a moderate to low level of correlation with the target variables.

  • It’s important to note that correlation does not imply causation, but it can give us an idea of which features might be useful in predictive modeling.

  • The correlations are relatively low, which is typical in financial datasets due to the complexity and unpredictability of stock price movements.

  • Next, let’s perform a time series analysis visualization. Given the financial nature of the data, it’s crucial to understand how certain key variables like 'Close', 'Volume', and technical indicators like 'RSI', 'MACD' change over time.

Time Series Analysis#

Let’s start by plotting all the stock performance plot foreach unique ticker to take a general insight on what we are studying.

import plotly.offline as pyo
pyo.init_notebook_mode(connected=True)
def plot_stock_performance(ticker, df):
    df_ticker = df[df['Ticker'] == ticker]

    fig = make_subplots(rows=2, cols=1, shared_xaxes=True,
                        vertical_spacing=0.1, subplot_titles=(f'{ticker} Stock Prices', f'{ticker} Volume'))

    fig.add_trace(go.Candlestick(x=df_ticker['Date'],
                                 open=df_ticker['Open'],
                                 high=df_ticker['High'],
                                 low=df_ticker['Low'],
                                 close=df_ticker['Close'],
                                 name='OHLC'), row=1, col=1)

    fig.add_trace(go.Bar(x=df_ticker['Date'], y=df_ticker['Volume'], name='Volume'), row=2, col=1)

    # layout
    fig.update_layout(title=f'Stock Performance for {ticker}', xaxis_rangeslider_visible=False)
    return fig

ticker_list = sorted(list(set(data["Ticker"])))
ticker_dropdown = widgets.Dropdown(
    options=ticker_list,
    value=random.choice(ticker_list), # Seleziona uno stock random all'inizio
    description='Ticker:',
)

def update_plot(ticker):
    fig = plot_stock_performance(ticker, data)
    fig.show()

interactive_plot = widgets.interactive(update_plot, ticker=ticker_dropdown)

display(interactive_plot)

Technical Indicators Analysis#

ROI Analysis#

timeframes = [
    "Percentage_Return_5days_before", 
    "Percentage_Return_15days_before", 
    "Percentage_Return_30days_before",
    "Percentage_Return_90days_before", 
    "Percentage_Return_180days_before",
    "Percentage_Return_360days_before",
]

# Melting the data for easier plotting
melted_data = data.melt(id_vars=['Ticker'], value_vars=timeframes, var_name='Timeframe', value_name='Return_Percentage')

plt.figure(figsize=(30, 20))
sns.boxplot(x='Timeframe', y='Return_Percentage', hue='Ticker', data=melted_data)
plt.title('Distribution of Returns Over Different Timeframes for All Stocks', fontsize=24)
plt.xlabel('Timeframe', fontsize=22)
plt.ylabel('Return Percentage (%)', fontsize=22)
plt.xticks(rotation=45, fontsize=20)
plt.yticks(fontsize=20)
plt.legend(title='Stock Ticker', loc='upper right', fontsize=18)
plt.grid(axis='y')
plt.show()
../_images/abad108df798b3ab072e7141f668edc7d72ef4a31cd9ae0e629cacf25e2b99c0.png

The boxplot suggests the following:

  • Return Distribution: The interquartile range (IQR) in the boxplots indicates the middle 50% of returns for each stock and timeframe, which is useful for assessing the typical return dispersion.

  • Volatility Trends: The length of the whiskers relative to the IQR can be used as a proxy for volatility. Longer whiskers relative to the IQR suggest higher volatility.

  • Timeframe Sensitivity: Stocks with increasing variability over longer timeframes may be more sensitive to macroeconomic factors or have higher inherent business risks.

  • Outlier Analysis: Points beyond the whiskers are outliers and may represent extreme events or anomalies. Clusters of outliers could indicate periods of market stress or company-specific issues.

  • Cross-sectional Analysis: Comparing boxes across different stocks for the same timeframe can provide insight into relative performance and risk.

RSI Analysis#


# Plotting the violin plot for RSI distribution of each stock
plt.figure(figsize=(20, 10))
sns.violinplot(x='Ticker', y='RSI', data=data, palette='Spectral')
plt.title('Distribuzione del RSI per Ogni Stock', fontsize=18)
plt.xlabel('Ticker del Stock', fontsize=16)
plt.ylabel('RSI', fontsize=16)
plt.axhline(70, ls='--', color='red')  # Soglia di overbought
plt.axhline(30, ls='--', color='blue') # Soglia di oversold
plt.xticks(rotation=45, fontsize=14)
plt.yticks(fontsize=14)
plt.grid(axis='y')
plt.show()
../_images/3d31972980a2028ee3161e561ffc319669de5d2f7c5559b7052b5502edc547d5.png

The violin plot for the RSI distribution across various stocks offers a detailed visual representation of the momentum characteristics for each stock. Here are some specific technical insights:

  • Median RSI Values: Most stocks have median RSI values hovering around the mid-range (50), which is neutral territory, indicating no strong momentum in either direction for the median stock.

  • Overbought Stocks: Several stocks show parts of their distribution extending above the overbought threshold (70), indicating that these stocks frequently enter overbought territory. (For models)Can be important to analyze the frequency and duration of these occurrences for potential overbought conditions.

  • Oversold Stocks: A few stocks also extend below the oversold threshold (30). Persistent RSI levels below this line can suggest that a stock is being oversold, which might be a buying opportunity if other conditions are favorable.

  • RSI Distribution Shape: The shape of the violins indicates the probability density and skewness of the RSI values. Wider sections represent a higher probability of the RSI assuming that value. Asymmetrical violins suggest a skew towards either overbought or oversold conditions.

  • RSI Variability: The varied width of the violins suggests differences in the variability of the RSI among the stocks. A wider violin indicates greater variability and potentially higher volatility in price movement.

  • Potential Anomalies: The tails of the violins represent the extremes of the RSI distribution. Long tails touching or crossing the overbought and oversold thresholds could indicate outlier events or anomalies in stock behavior.

  • Inter-stock Comparison: The plot allows for a quick comparison across stocks. Stocks with similar RSI distributions might be subject to similar trading dynamics or market perceptions.

Target Feature Analysis#


# Plotting the balance of each target feature side by side
fig, axes = plt.subplots(1, 3, figsize=(18, 5))  # 1 row, 3 columns

for i, col in enumerate(target_features):
    sns.countplot(x=col, data=data, ax=axes[i])
    axes[i].set_title(f'Balance of {col}')

plt.tight_layout()
plt.show()
../_images/2e18a1c8f1055d7473071a3deec9fa0f3b942cbeaf4ff629df5f85e29e902292.png

The target features Target_1day, Target_5days, and Target_30days are relatively balanced with a slight majority of one class in each. No need to balanced the dataset.