Supervised Learning

Model Selection:

The selected algorithms were chosen based on a combination of their characteristics and suitability for the task at hand. Here’s a breakdown of why each algorithm was included:

K-Nearest Neighbors (KNN):
- Strengths: Intuitive and has no explicit training phase; predictions come directly from the stored training examples.
- Task Suitability: Well-suited for tasks where proximity in feature space implies similarity in output.
Logistic Regression:
- Strengths: A simple, interpretable baseline for binary classification.
- Task Suitability: Appropriate for problems involving binary outcomes, common in financial predictions.
Random Forest:
- Strengths: Handles complex feature interactions and provides feature importance scores.
- Task Suitability: Robust for capturing intricate patterns and relationships in data.
Artificial Neural Networks:
- Strengths: Effective for capturing complex, non-linear patterns.
- Task Suitability: Suitable for tasks where intricate relationships and patterns are expected.

Reasons for Excluding Algorithms:

Linear Regression:
- Exclusion Reason: Limited in capturing complex stock price movements, may oversimplify relationships.
Support Vector Machines (SVM):
- Exclusion Reason: Requires careful parameter tuning, and may not be optimal for large datasets.
Decision Trees:
- Exclusion Reason: Prone to overfitting and may not generalize well to unseen data.

Training Strategy:

The remaining algorithms are each trained on three prediction horizons (Target 1 day, Target 5 days, Target 30 days). For predictions outside these categories, the case is classified as “Unsafe” and buying is not suggested.

In opting for classification instead of regression, our goal is to assess the platform’s efficacy in providing stock investment guidance. The classification approach categorizes outcomes into distinct classes, specifically evaluating whether the platform accurately advises on buying or avoiding stocks. This binary classification simplifies the evaluation of the platform’s reliability in offering actionable advice, focusing on clear and interpretable recommendations for investors.
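
To make the classification framing concrete, here is a minimal sketch of how a binary target for a given horizon could be derived from closing prices. The source does not show how `Target_1day`, `Target_5days`, and `Target_30days` were actually built, so the helper `make_binary_targets`, the labeling rule, and the use of -1 as a placeholder for days without a known future price are all assumptions made for illustration.

```python
def make_binary_targets(closes, horizon):
    """Label each day 1 if the close `horizon` days later is higher, else 0.

    Days without a known future close get -1 (hypothetical placeholder).
    """
    labels = []
    for t, price in enumerate(closes):
        future = t + horizon
        if future >= len(closes):
            labels.append(-1)      # future price not available yet
        elif closes[future] > price:
            labels.append(1)       # price rose: "buy"
        else:
            labels.append(0)       # price flat or fell: "avoid"
    return labels


closes = [100.0, 101.0, 99.0, 102.0, 103.0]  # toy closing prices
targets_1day = make_binary_targets(closes, horizon=1)
targets_3days = make_binary_targets(closes, horizon=3)
print(targets_1day)
print(targets_3days)
```

Under this rule, longer horizons leave more trailing rows without a label, which is one reason such rows would have to be filtered out before training.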

Data Preparation#

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras import models
from tensorflow.keras import layers
from tensorflow.keras import optimizers
from tensorflow.keras.callbacks import EarlyStopping
import warnings
from sklearn.preprocessing import StandardScaler
import seaborn as sns
from sklearn.metrics import confusion_matrix
warnings.filterwarnings("ignore")

df = pd.ExcelFile('final_dataset.xlsx').parse('Sheet1')
df.head()

	Date	Open	High	Low	Close	Volume	Daily_Return	Daily_Return_Percentage	...	MA_30	MA_50	RSI	MACD	Signal_Line	Bollinger_Mid_Band	Bollinger_Upper_Band	Bollinger_Lower_Band	Volatility	Ticker
0	2020-06-30	179.305945	183.533295	179.102439	182.802521	3102800	3.496576	1.912761	...	185.971379	177.444586	40.980241	0.824023	3.006285	190.221655	205.998882	174.444429	6.840469	GS
1	2020-07-01	183.968061	184.763579	180.859992	182.756287	2620100	-1.211774	-0.663055	...	186.614105	177.904132	52.500011	0.573091	2.519646	189.620391	205.574176	173.666606	6.521267	GS
2	2020-07-02	187.316608	187.779118	182.349254	182.598999	2699400	-4.717609	-2.583590	...	187.140969	178.320634	46.428544	0.357413	2.087199	188.814696	204.471975	173.157416	5.091692	GS
3	2020-07-06	186.243599	192.209977	186.049352	191.812225	3567700	5.568627	2.903166	...	188.016001	178.938500	50.786526	0.919320	1.853623	188.326285	202.922613	173.729956	4.048404	GS
4	2020-07-07	190.091704	190.285964	184.254827	184.412079	2853500	-5.679625	-3.079855	...	188.649571	179.372512	42.843182	0.758759	1.634651	187.334203	200.017990	174.650416	4.947823	GS

5 rows × 44 columns

Data Cleaning#

We first drop every row in which any of the three targets equals -1, then cast the columns listed below to float. A consistent numeric type ensures that the training algorithms can process these features without type errors or silent conversions.

# Drop rows where any of the three targets equals -1; .copy() avoids assigning to a view of the original frame
df = df[(df['Target_1day'] != -1) & (df['Target_5days'] != -1) & (df['Target_30days'] != -1)].copy()

# Columns that must be stored as float
float_columns = [
    'Volume', 'Target_1day', 'Target_5days', 'Target_30days',
    'Net Income', 'Total Revenue', 'Normalized EBITDA',
    'Total Unusual Items', 'Total Unusual Items Excluding Goodwill',
    'Operating Cash Flow', 'Capital Expenditure', 'Free Cash Flow',
    'Cash Flow From Continuing Operating Activities',
    'Cash Flow From Continuing Investing Activities',
    'Cash Flow From Continuing Financing Activities',
]
df[float_columns] = df[float_columns].astype(float)

Train, Validation, Test#

The dataset df is divided into features (X) and three target variables (Y_1, Y_2, and Y_3), the binary labels for the 1-, 5-, and 30-day horizons, respectively. The rows are sorted by date and split without shuffling: the most recent 20% form the test set, and the remaining 80% are split again into training (64% of all rows) and validation (16% of all rows) sets. This separation ensures that the models are trained, validated, and tested on distinct, chronologically ordered subsets of the data for each time horizon.
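
The effect of applying a 20% hold-out twice without shuffling can be sketched in plain Python. The helper `chronological_split` is hypothetical and only mirrors the idea of the two-stage split; it is not the scikit-learn implementation. Because the second split is taken from the remaining 80%, the effective proportions are 64% training, 16% validation, and 20% test.

```python
import math


def chronological_split(rows, test_fraction=0.2, valid_fraction=0.2):
    """Split time-ordered rows into train, validation and test without shuffling."""
    n_test = math.ceil(test_fraction * len(rows))
    train_valid = rows[:len(rows) - n_test]
    test = rows[len(rows) - n_test:]          # most recent rows

    n_valid = math.ceil(valid_fraction * len(train_valid))
    train = train_valid[:len(train_valid) - n_valid]
    valid = train_valid[len(train_valid) - n_valid:]
    return train, valid, test


rows = list(range(100))  # stand-in for 100 rows already sorted by date
train, valid, test = chronological_split(rows)
print(len(train), len(valid), len(test))
```

Every validation row is more recent than every training row, and every test row is more recent than both, which is the property a time-ordered evaluation relies on.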

df.sort_values(by='Date', inplace=True)
X = df.drop(['Date','Ticker','Target_1day', 'Target_5days', 'Target_30days'], axis=1)
Y_1 = df['Target_1day']
Y_2 = df['Target_5days']
Y_3 = df['Target_30days']

X_train_1_80, X_test_1, Y_train_1_80, Y_test_1 = train_test_split(X, Y_1, test_size=0.2, shuffle=False)
X_train_1, X_valid_1, Y_train_1, Y_valid_1 = train_test_split(X_train_1_80, Y_train_1_80, test_size=0.2, shuffle=False)


X_train_2_80, X_test_2, Y_train_2_80, Y_test_2 = train_test_split(X, Y_2, test_size=0.2, shuffle=False)
# train_test_split returns the training part first, then the held-out part
X_train_2, X_valid_2, Y_train_2, Y_valid_2 = train_test_split(X_train_2_80, Y_train_2_80, test_size=0.2, shuffle=False)

X_train_3_80, X_test_3, Y_train_3_80, Y_test_3 = train_test_split(X, Y_3, test_size=0.2, shuffle=False)
X_train_3, X_valid_3, Y_train_3, Y_valid_3 = train_test_split(X_train_3_80, Y_train_3_80, test_size=0.2, shuffle=False)

In addition to the training, validation, and test sets, we create a standardized version of each feature subset for every target variable; the scaled data is used by KNN and Logistic Regression. For each target, the StandardScaler is fitted on the 80% training-plus-validation portion and then applied to the training, validation, and test subsets, so the test set never influences the scaling statistics. The validation rows do contribute to those statistics, so the validation scores are not fully independent of the scaling step.
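
The principle behind leakage-free scaling is that the mean and standard deviation are estimated on one subset and then reused unchanged on the others. Below is a minimal NumPy sketch of that idea; the helpers `fit_standardizer` and `apply_standardizer` and the toy arrays are hypothetical and stand in for what `StandardScaler.fit` and `StandardScaler.transform` do.

```python
import numpy as np


def fit_standardizer(X_fit):
    """Return per-column mean and standard deviation computed on X_fit only."""
    mean = X_fit.mean(axis=0)
    std = X_fit.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return mean, std


def apply_standardizer(X, mean, std):
    """Scale X with statistics that were estimated elsewhere."""
    return (X - mean) / std


fit_rows = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])  # toy training rows
new_rows = np.array([[4.0, 40.0]])                            # toy held-out row

mean, std = fit_standardizer(fit_rows)
fit_rows_scaled = apply_standardizer(fit_rows, mean, std)
new_rows_scaled = apply_standardizer(new_rows, mean, std)
print(new_rows_scaled)
```

The held-out row is scaled with the training statistics, so it can land well outside the range seen during fitting; that is expected and is what keeps the evaluation honest.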

scaler = StandardScaler()

# Target 1 day
X_train_1_80_scaled = scaler.fit_transform(X_train_1_80)
X_valid_1_scaled = scaler.transform(X_valid_1)
X_test_1_scaled = scaler.transform(X_test_1)
X_train_1_scaled = scaler.transform(X_train_1)

# Target 5 days
X_train_2_80_scaled = scaler.fit_transform(X_train_2_80)
X_valid_2_scaled = scaler.transform(X_valid_2)
X_test_2_scaled = scaler.transform(X_test_2)
X_train_2_scaled = scaler.transform(X_train_2)

# Target 30 days
X_train_3_80_scaled = scaler.fit_transform(X_train_3_80)
X_valid_3_scaled = scaler.transform(X_valid_3)
X_test_3_scaled = scaler.transform(X_test_3)
X_train_3_scaled = scaler.transform(X_train_3)

In the following sections, we test each selected algorithm on the three targets and compare the models by accuracy, in order to determine which one is most effective for this problem.

KNN#

In this section, we explore the K-Nearest Neighbors (KNN) algorithm by varying the parameter K, which represents the number of neighbors considered for classification. The objective is to identify the optimal K value that produces the most accurate predictions for our specific stock value prediction problem. By systematically testing different K values, we aim to determine the configuration that yields the highest accuracy, providing valuable insights into the performance of the KNN algorithm in this context.
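
To illustrate what the parameter K controls, here is a minimal pure-Python sketch of a K-Nearest Neighbors prediction using Euclidean distance and a majority vote. The function `knn_predict` and the toy points are hypothetical; scikit-learn's `KNeighborsClassifier` is more general and much faster.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k):
    """Predict the majority label among the k training points nearest to `query`."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


train_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1)]
train_labels = [0, 0, 0, 1, 1]

pred_small_k = knn_predict(train_points, train_labels, query=(0.8, 0.9), k=1)
pred_large_k = knn_predict(train_points, train_labels, query=(0.8, 0.9), k=5)
print(pred_small_k, pred_large_k)
```

With k=1 the prediction follows the closest neighbor, while with k equal to the size of the training set it collapses to the overall majority class. Tuning K is a search between these two extremes.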

best_k = []
for i in [10,15,20,25,30,35,40,45,50]:
    print('K: ' + str(i) + '\n')
    # Target 1 day
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train_1_scaled, Y_train_1)
    train_acc_1 = accuracy_score(y_true= Y_train_1, y_pred= knn.predict(X_train_1_scaled))
    valid_acc_1 = accuracy_score(y_true= Y_valid_1, y_pred= knn.predict(X_valid_1_scaled))
    print("Train set 1: {:.2f}".format(train_acc_1))
    print('Validation set 1: {:.2f}'.format(valid_acc_1))
    # Target 5 days
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train_2_scaled, Y_train_2)
    train_acc_2 = accuracy_score(y_true= Y_train_2, y_pred= knn.predict(X_train_2_scaled))
    valid_acc_2 = accuracy_score(y_true= Y_valid_2, y_pred= knn.predict(X_valid_2_scaled))
    print("Train set 2: {:.2f}".format(train_acc_2))
    print('Validation set 2: {:.2f}'.format(valid_acc_2))
    # Target 30 days
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train_3_scaled, Y_train_3)
    train_acc_3 = accuracy_score(y_true= Y_train_3, y_pred= knn.predict(X_train_3_scaled))
    valid_acc_3 = accuracy_score(y_true= Y_valid_3, y_pred= knn.predict(X_valid_3_scaled))
    print("Train set 3: {:.2f}".format(train_acc_3))
    print('Validation set 3: {:.2f}'.format(valid_acc_3))
    print('\n')
    best_k.append([i, (valid_acc_1 + valid_acc_2 + valid_acc_3) / 3, (train_acc_1 + train_acc_2 + train_acc_3) / 3])
    
k = max(best_k, key=lambda x:x[1])[0]
print('Best K: ' + str(k) + '\n')
# Target 1 day
knn_1 = KNeighborsClassifier(n_neighbors=k)
knn_1.fit(X_train_1_80_scaled, Y_train_1_80)
test_acc_1 = accuracy_score(y_true= Y_test_1, y_pred= knn_1.predict(X_test_1_scaled))
print('Test set 1: {:.2f}'.format(test_acc_1))

# Target 5 days
knn_2 = KNeighborsClassifier(n_neighbors=k)
knn_2.fit(X_train_2_80_scaled, Y_train_2_80)
test_acc_2 = accuracy_score(y_true= Y_test_2, y_pred= knn_2.predict(X_test_2_scaled))
print('Test set 2: {:.2f}'.format(test_acc_2))

# Target 30 days
knn_3 = KNeighborsClassifier(n_neighbors=k)
knn_3.fit(X_train_3_80_scaled, Y_train_3_80)
test_acc_3 = accuracy_score(y_true= Y_test_3, y_pred= knn_3.predict(X_test_3_scaled))
print('Test set 3: {:.2f}'.format(test_acc_3))

# Total acc
print('Total acc: {:.2f}'.format((test_acc_1 + test_acc_2 + test_acc_3) / 3))

K: 10
Train set 1: 0.64
Validation set 1: 0.51
Train set 2: 0.72
Validation set 2: 0.50
Train set 3: 0.78
Validation set 3: 0.50


K: 15
Train set 1: 0.62
Validation set 1: 0.52
Train set 2: 0.69
Validation set 2: 0.50
Train set 3: 0.76
Validation set 3: 0.50


K: 20
Train set 1: 0.60
Validation set 1: 0.51
Train set 2: 0.67
Validation set 2: 0.50
Train set 3: 0.73
Validation set 3: 0.50


K: 25
Train set 1: 0.60
Validation set 1: 0.51
Train set 2: 0.65
Validation set 2: 0.51
Train set 3: 0.72
Validation set 3: 0.50


K: 30
Train set 1: 0.59
Validation set 1: 0.50
Train set 2: 0.64
Validation set 2: 0.50
Train set 3: 0.71
Validation set 3: 0.50


K: 35
Train set 1: 0.58
Validation set 1: 0.50
Train set 2: 0.63
Validation set 2: 0.50
Train set 3: 0.70
Validation set 3: 0.50


K: 40
Train set 1: 0.58
Validation set 1: 0.51
Train set 2: 0.63
Validation set 2: 0.50
Train set 3: 0.70
Validation set 3: 0.50


K: 45
Train set 1: 0.57
Validation set 1: 0.51
Train set 2: 0.62
Validation set 2: 0.50
Train set 3: 0.69
Validation set 3: 0.50


K: 50
Train set 1: 0.57
Validation set 1: 0.52
Train set 2: 0.62
Validation set 2: 0.50
Train set 3: 0.68
Validation set 3: 0.50


Best K: 15
Test set 1: 0.49
Test set 2: 0.51
Test set 3: 0.47
Total acc: 0.49

k_values = [item[0] for item in best_k]
validation_accuracy_values = [item[1] for item in best_k]
train_accuracy_values = [item[2] for item in best_k]

plt.plot(k_values, validation_accuracy_values, label='Validation')
plt.plot(k_values, train_accuracy_values, label='Train')
plt.xlabel('K')
plt.ylabel('Accuracy')
plt.title('Accuracy vs K')
plt.legend()


plt.tight_layout()
plt.show()

../_images/2d1c0772af089c4e2e1add6d7dd9cb3c9f7ef38f2fdf43f2cd4fb01def08107d.png

There are some important considerations to take into account:

Small k: A small k value implies that the prediction for a data point is heavily influenced by its immediate neighbors. This makes the model sensitive to local variations in the training data, which might not generalize well to unseen data. The results reflect this: training accuracy falls steadily as k grows (for the 30-day target, from 0.78 at k=10 to 0.68 at k=50), while validation accuracy stays near 0.50.
Total Accuracy: The total accuracy, which is the average of the test accuracies for the three target variables, is 0.49. This is no better than random chance.
Generalization Concerns: Although the model fits the training sets noticeably better than chance, the real test lies in its ability to generalize to unseen data. The test accuracies (0.49, 0.51, and 0.47) show that it does not generalize for any of the three time horizons.

Logistic Regression#

In this section, we explore the training process of Logistic Regression for the stock prediction problem, aiming to analyze its outcomes. Logistic Regression is a well-established algorithm for binary classification tasks, making it suitable for predicting whether users should buy or avoid stocks.
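
As a reminder of how a fitted Logistic Regression turns features into a buy/avoid decision, the sketch below applies a sigmoid to a weighted sum and thresholds the result at 0.5. The weights, bias, and helper `predict_buy` are hypothetical values chosen for illustration, not parameters learned from the dataset.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_buy(features, weights, bias, threshold=0.5):
    """Return (probability of class 1, predicted label) for one feature vector."""
    score = bias + math.fsum(w * x for w, x in zip(weights, features))
    probability = sigmoid(score)
    return probability, int(probability >= threshold)


# Hypothetical weights for two standardized features (illustration only)
weights, bias = [0.8, -0.5], 0.1

prob_up, label_up = predict_buy([1.0, 0.5], weights, bias)
prob_down, label_down = predict_buy([-1.0, 1.0], weights, bias)
print(f'{prob_up:.2f} -> {label_up}, {prob_down:.2f} -> {label_down}')
```

Because the decision boundary is linear in the features, this model can only separate the classes if a weighted combination of the indicators carries signal.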

# Define datasets
datasets = [(X_train_1_scaled, Y_train_1, X_valid_1_scaled, Y_valid_1, X_test_1_scaled, Y_test_1),
            (X_train_2_scaled, Y_train_2, X_valid_2_scaled, Y_valid_2, X_test_2_scaled, Y_test_2),
            (X_train_3_scaled, Y_train_3, X_valid_3_scaled, Y_valid_3, X_test_3_scaled, Y_test_3)]

total_test_acc = 0  # running sum of test accuracies (avoids shadowing the built-in sum)

# Loop through datasets
for i, (X_train, Y_train, X_valid, Y_valid, X_test, Y_test) in enumerate(datasets):

    # Train
    lr = LogisticRegression(max_iter=1500)
    lr.fit(X_train, Y_train)
    train_acc = accuracy_score(y_true=Y_train, y_pred=lr.predict(X_train))
    valid_acc = accuracy_score(y_true=Y_valid, y_pred=lr.predict(X_valid))

    # Test with the same fitted model (refitting on identical data would give the same result)
    test_acc = accuracy_score(y_true=Y_test, y_pred=lr.predict(X_test))
    print(f'Target {i + 1} - Train Accuracy: {train_acc:.2f} - Validation Accuracy: {valid_acc:.2f} - Test Accuracy: {test_acc:.2f}')
    total_test_acc += test_acc

# Total acc
print(f'Total acc: {total_test_acc / 3:.2f}')

Target 1 - Train Accuracy: 0.52 - Validation Accuracy: 0.49 - Test Accuracy: 0.52
Target 2 - Train Accuracy: 0.57 - Validation Accuracy: 0.49 - Test Accuracy: 0.52
Target 3 - Train Accuracy: 0.65 - Validation Accuracy: 0.52 - Test Accuracy: 0.50
Total acc: 0.51

The test accuracies are close across the three horizons (0.50 to 0.52), indicating consistent but weak predictive performance. The gap between training and validation accuracy is small for the 1-day target (0.52 vs. 0.49) and grows with the horizon (0.65 vs. 0.52 for 30 days), so overfitting is limited but not absent. Overall, the accuracy levels are low, indicating that the model is not capturing the underlying patterns in the data effectively. One possible cause is class imbalance, so we next check whether the targets are balanced.

# show the number of 0 and 1 for target 1,5 and 30 days
print('Target 1 day')
print(df['Target_1day'].value_counts())
print('Target 5 days')
print(df['Target_5days'].value_counts())
print('Target 30 days')
print(df['Target_30days'].value_counts())

Target 1 day
Target_1day
0.0    12886
1.0    12834
Name: count, dtype: int64
Target 5 days
Target_5days
1.0    13228
0.0    12492
Name: count, dtype: int64
Target 30 days
Target_30days
1.0    13353
0.0    12367
Name: count, dtype: int64

The classes are close to balanced (for each target, the majority class accounts for roughly 50% to 52% of the rows), so no rebalancing is needed.
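
The class counts also give a useful reference point: the accuracy of a trivial model that always predicts the most frequent class. The sketch below computes it from the counts printed above; note that these counts cover the whole dataset, so the baseline on the test split alone may differ slightly. The helper `majority_baseline` is a hypothetical name.

```python
# Class counts copied from the value_counts() output above
class_counts = {
    'Target_1day': {0: 12886, 1: 12834},
    'Target_5days': {0: 12492, 1: 13228},
    'Target_30days': {0: 12367, 1: 13353},
}


def majority_baseline(counts):
    """Accuracy of always predicting the most frequent class."""
    majority = max(counts[0], counts[1])
    return majority / (counts[0] + counts[1])


baselines = {name: majority_baseline(counts) for name, counts in class_counts.items()}
for name, acc in baselines.items():
    print(f'{name}: {acc:.3f}')
```

A model is only informative if it beats this baseline, so accuracies around 0.51 should be read against a reference of roughly 0.50 to 0.52 rather than against 0.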

Artificial Neural Networks#

In this phase of our analysis, we turn our attention to Artificial Neural Networks (ANN). This algorithm is a powerful tool for capturing complex, non-linear patterns in data, making it a candidate for our stock prediction problem. The objective is to assess its performance by experimenting with different configurations for the number of nodes per layer, the number of hidden layers, and the number of training epochs. By varying these parameters, we can observe how they impact the model’s predictive accuracy.
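
The networks in this section are trained with binary cross-entropy, which expects the model output to be a probability in (0, 1); a sigmoid output unit provides exactly that. The NumPy sketch below computes the loss for a few hypothetical outputs and compares it with the loss of a model that always answers 0.5. The helper names and the toy values are assumptions for illustration.

```python
import numpy as np


def sigmoid(z):
    """Squash raw network outputs into probabilities."""
    return 1.0 / (1.0 + np.exp(-z))


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))


y_true = np.array([1.0, 0.0, 1.0, 0.0])
logits = np.array([2.0, -1.0, 0.0, 0.5])  # raw outputs of a hypothetical last layer

loss = binary_crossentropy(y_true, sigmoid(logits))
chance_loss = binary_crossentropy(y_true, np.full(4, 0.5))
print(f'loss = {loss:.3f}, chance-level loss = {chance_loss:.3f}')
```

The chance-level loss equals ln 2, about 0.693. A validation loss that hovers around this value indicates that the network's probabilities carry little more information than a constant 0.5.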

# Initialize lists to store histories
histories = []

for target, X_train_80, X_train, X_valid, X_test, Y_train_80, Y_train, Y_valid, Y_test in [
    (1, X_train_1_80, X_train_1, X_valid_1, X_test_1, Y_train_1_80, Y_train_1, Y_valid_1, Y_test_1),
    (2, X_train_2_80, X_train_2, X_valid_2, X_test_2, Y_train_2_80, Y_train_2, Y_valid_2, Y_test_2),
    (3, X_train_3_80, X_train_3, X_valid_3, X_test_3, Y_train_3_80, Y_train_3, Y_valid_3, Y_test_3)
]:
    # Standardize the features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_valid_scaled = scaler.transform(X_valid)
    X_test_scaled = scaler.transform(X_test)
    
    # Initialize parameters
    best_acc = 0
    best_epoch = 0
    best_node = 0
    best_layer = 0 

    # Initialize history dictionary
    history_dict = {'accuracy': [], 'validation_accuracy': [], 'test_accuracy': [], 'loss': [], 'validation_loss': [], 'test_loss': []}
    
    # Tuning parameters
    for nodes in [8, 16, 24]:
        for n_layers in [1, 2, 3]:
            for epochs in [30, 50, 100]:
                

                model = models.Sequential()
                model.add(layers.Dense(nodes, activation='relu', input_shape=(X_train_scaled.shape[1],)))
                for _ in range(n_layers):
                    model.add(layers.Dense(nodes, activation='relu'))
                # A sigmoid output yields a probability in (0, 1), as binary cross-entropy expects
                model.add(layers.Dense(1, activation='sigmoid'))
                model.compile(optimizer=optimizers.Adam(learning_rate=0.01),
                              loss='binary_crossentropy',
                              metrics=['accuracy'])
                
                early_stopping = EarlyStopping(monitor='val_loss', patience=20, restore_best_weights=True)

                history = model.fit(X_train_scaled, Y_train, epochs=epochs, validation_data=(X_valid_scaled, Y_valid), callbacks=[early_stopping], verbose=0)

                # Store history information
                history_dict['accuracy'].extend(history.history['accuracy'])
                history_dict['validation_accuracy'].extend(history.history.get('val_accuracy', []))  # Use get to handle missing key
                test_eval = model.evaluate(X_test_scaled, Y_test, verbose=0)
                history_dict['test_accuracy'].append(test_eval[1])  # Evaluate on test set
                history_dict['loss'].extend(history.history['loss'])
                history_dict['validation_loss'].extend(history.history.get('val_loss', []))  # Use get to handle missing key
                history_dict['test_loss'].append(test_eval[0])  # Evaluate on test set

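                # Select hyperparameters on the validation split so the test set stays unseen during tuning
                valid_acc = model.evaluate(X_valid_scaled, Y_valid, verbose=0)[1]
                if valid_acc > best_acc:
                    best_acc = valid_acc
                    best_epoch = epochs
                    best_node = nodes
                    best_layer = n_layers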

    histories.append((target, history_dict, best_epoch, best_node, best_layer, best_acc))

    print(f'Target {target} - Best Epochs: {best_epoch}, Best Nodes: {best_node}, Best Layers: {best_layer}, Best Accuracy: {best_acc:.2f}')

Target 1 - Best Epochs: 30, Best Nodes: 8, Best Layers: 2, Best Accuracy: 0.52
Target 2 - Best Epochs: 50, Best Nodes: 8, Best Layers: 1, Best Accuracy: 0.52
Target 3 - Best Epochs: 50, Best Nodes: 8, Best Layers: 1, Best Accuracy: 0.52

datas = []

for target, X_train_80, X_train, X_valid, X_test, Y_train_80, Y_train, Y_valid, Y_test in [
    (1, X_train_1_80, X_train_1, X_valid_1, X_test_1, Y_train_1_80, Y_train_1, Y_valid_1, Y_test_1),
    (2, X_train_2_80, X_train_2, X_valid_2, X_test_2, Y_train_2_80, Y_train_2, Y_valid_2, Y_test_2),
    (3, X_train_3_80, X_train_3, X_valid_3, X_test_3, Y_train_3_80, Y_train_3, Y_valid_3, Y_test_3)
]:
    # Standardize the features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_valid_scaled = scaler.transform(X_valid)
    X_test_scaled = scaler.transform(X_test)
    
    nodes = histories[target-1][3]
    n_layers = histories[target-1][4]
    epochs = histories[target-1][2]
    
    # Build the model with the best configuration found during tuning
    model = models.Sequential()
    model.add(layers.Dense(nodes, activation='sigmoid', input_shape=(X_train_scaled.shape[1],)))
    for _ in range(n_layers):
        model.add(layers.Dense(nodes, activation='sigmoid'))
    model.add(layers.Dense(1, activation='sigmoid'))
    model.compile(optimizer=optimizers.Adam(learning_rate=0.01),
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    
    early_stopping = EarlyStopping(monitor='val_loss', patience=epochs, restore_best_weights=True)

    history = model.fit(X_train_scaled, Y_train, epochs=epochs, validation_data=(X_valid_scaled, Y_valid), callbacks=[early_stopping], verbose=0)

    # Store history information
    train_acc = history.history['accuracy']
    val_acc = history.history.get('val_accuracy', [])  # Use get to handle missing key
    test_eval = model.evaluate(X_test_scaled, Y_test, verbose=0)
    test_acc = test_eval[1] # Evaluate on test set
    train_loss = history.history['loss']
    val_loss = history.history.get('val_loss', [])  # Use get to handle missing key
    test_loss = test_eval[0]  # Evaluate on test set
    
    datas.append((target, train_acc, val_acc, test_acc, train_loss, val_loss, test_loss))

    print(f'Target {target} - Train Accuracy: {np.mean(train_acc):.2f} - Validation Accuracy: {np.mean(val_acc):.2f} - Test Accuracy: {np.mean(test_acc):.2f}')

Target 1 - Train Accuracy: 0.53 - Validation Accuracy: 0.50 - Test Accuracy: 0.51
Target 2 - Train Accuracy: 0.61 - Validation Accuracy: 0.50 - Test Accuracy: 0.51
Target 3 - Train Accuracy: 0.74 - Validation Accuracy: 0.52 - Test Accuracy: 0.47

# Plotting
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(15, 18))

for i, (target, train_acc, val_acc, test_acc, train_loss, val_loss, test_loss) in enumerate(datas):
    row = i
    col_acc = 0
    col_loss = 1
    
    axes[row, col_acc].plot(train_acc, label='Train')
    axes[row, col_acc].plot(val_acc, label='Validation')
    axes[row, col_acc].set_title(f"Target {target} - Accuracy History", fontsize=18)
    axes[row, col_acc].set_xlabel("Epochs", fontsize=14)
    axes[row, col_acc].legend(fontsize=12)
    
    axes[row, col_loss].plot(train_loss, label='Train')
    axes[row, col_loss].plot(val_loss, label='Validation')
    axes[row, col_loss].set_title(f"Target {target} - Loss History", fontsize=18)
    axes[row, col_loss].set_xlabel("Epochs", fontsize=14)
    axes[row, col_loss].legend(fontsize=12)

plt.tight_layout()
plt.show()

../_images/e2aad119e9663344933472ad7b69ec8e6dcf690f9b012eb9752c9e3d399f7602.png

The results indicate that the ANN algorithm is not performing well on the given dataset, with validation and test accuracy around 50% for all target variables. This suggests that the model is not capturing the underlying patterns in the data effectively. For the 1-day target, training, validation, and test accuracy are similar (0.53, 0.50, and 0.51), so there is little overfitting. For the 30-day target, however, a training accuracy of 0.74 against a test accuracy of 0.47 shows that the network fits the training data without generalizing.

Random Forest#

In this section of the document, we explore the Random Forest algorithm, which is a powerful tool for capturing complex patterns in data. The objective is to assess the performance of the Random Forest algorithm by experimenting with different configurations for the number of estimators and the minimum samples leaf. By varying these parameters, we can observe how they impact the model’s predictive accuracy.
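
One caveat when cross-validating time-ordered data: with an integer `cv` and a classifier, `cross_val_score` uses stratified K-fold without shuffling, so some folds are validated on rows that precede part of their training data. A walk-forward scheme avoids this. The generator below is a minimal, hypothetical sketch of such a scheme; scikit-learn's `TimeSeriesSplit` offers a ready-made splitter built on the same idea.

```python
def expanding_window_folds(n_samples, n_splits=5):
    """Yield (train_indices, valid_indices) where validation always follows training in time."""
    fold_size = n_samples // (n_splits + 1)
    for split in range(1, n_splits + 1):
        train_end = split * fold_size
        # The last fold absorbs any leftover rows
        valid_end = train_end + fold_size if split < n_splits else n_samples
        yield list(range(train_end)), list(range(train_end, valid_end))


folds = list(expanding_window_folds(n_samples=12, n_splits=3))
for train_idx, valid_idx in folds:
    print(f'train 0..{train_idx[-1]}  ->  validate {valid_idx[0]}..{valid_idx[-1]}')
```

Each fold trains on everything up to a cut-off and validates on the block right after it, which mimics how the model would be used on future prices.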

best_params = []
# For each target: the training split (for the train score) and the 80% portion (for cross-validation)
rf_datasets = [
    (X_train_1, Y_train_1, X_train_1_80, Y_train_1_80),
    (X_train_2, Y_train_2, X_train_2_80, Y_train_2_80),
    (X_train_3, Y_train_3, X_train_3_80, Y_train_3_80),
]
for n_estimators in [10, 25, 50]:
    print(f'N Estimators: {n_estimators}')
    for min_samples_leaf in [2, 5, 10]:
        print(f'Min Samples Leaf: {min_samples_leaf}')
        cv_means = []
        for t, (X_tr, Y_tr, X_tr_80, Y_tr_80) in enumerate(rf_datasets, start=1):
            rf = RandomForestClassifier(n_estimators=n_estimators, min_samples_leaf=min_samples_leaf)
            rf.fit(X_tr, Y_tr)
            train_acc = accuracy_score(y_true=Y_tr, y_pred=rf.predict(X_tr))
            # 5-fold cross-validation on the 80% portion serves as the validation score
            scores = cross_val_score(rf, X_tr_80, Y_tr_80, cv=5, scoring='accuracy', verbose=0)
            print(f"Train set {t}: {train_acc:.2f}")
            print(f'Validation set {t}: {scores.mean():.2f}')
            cv_means.append(scores.mean())
        print()
        best_params.append({
            'n_estimators': n_estimators,
            'min_samples_leaf': min_samples_leaf,
            'average_accuracy': np.mean(cv_means)
        })

# Find the best parameters
best_param_set = max(best_params, key=lambda x: x['average_accuracy'])
print(f'Best Parameters: {best_param_set}')

N Estimators: 10
Min Samples Leaf: 2
Train set 1: 0.97
Validation set 1: 0.50
Train set 2: 0.98
Validation set 2: 0.47
Train set 3: 0.99
Validation set 3: 0.52

Min Samples Leaf: 5
Train set 1: 0.91
Validation set 1: 0.49
Train set 2: 0.94
Validation set 2: 0.48
Train set 3: 0.97
Validation set 3: 0.52

Min Samples Leaf: 10
Train set 1: 0.84
Validation set 1: 0.50
Train set 2: 0.88
Validation set 2: 0.49
Train set 3: 0.95
Validation set 3: 0.53

N Estimators: 25
Min Samples Leaf: 2
Train set 1: 0.99
Validation set 1: 0.49
Train set 2: 0.99
Validation set 2: 0.48
Train set 3: 1.00
Validation set 3: 0.51

Min Samples Leaf: 5
Train set 1: 0.96
Validation set 1: 0.49
Train set 2: 0.96
Validation set 2: 0.48
Train set 3: 0.99
Validation set 3: 0.52

Min Samples Leaf: 10
Train set 1: 0.89
Validation set 1: 0.49
Train set 2: 0.92
Validation set 2: 0.48
Train set 3: 0.95
Validation set 3: 0.52

N Estimators: 50
Min Samples Leaf: 2
Train set 1: 1.00
Validation set 1: 0.49
Train set 2: 1.00
Validation set 2: 0.48
Train set 3: 1.00
Validation set 3: 0.52

Min Samples Leaf: 5
Train set 1: 0.98
Validation set 1: 0.49
Train set 2: 0.97
Validation set 2: 0.48
Train set 3: 0.98
Validation set 3: 0.52

Min Samples Leaf: 10
Train set 1: 0.91
Validation set 1: 0.49
Train set 2: 0.92
Validation set 2: 0.48
Train set 3: 0.96
Validation set 3: 0.53

Best Parameters: {'n_estimators': 10, 'min_samples_leaf': 10, 'average_accuracy': 0.5043248310143937}

j = best_param_set['n_estimators']
k = best_param_set['min_samples_leaf']
print('Best n estimators: ' + str(j))
print('Best min samples leaf: ' + str(k))
# Target 1 day
rm_1 = RandomForestClassifier(n_estimators=j, min_samples_leaf=k)
rm_1.fit(X_train_1_80, Y_train_1_80)
test_acc_1 = accuracy_score(y_true= Y_test_1, y_pred= rm_1.predict(X_test_1))
print('Test set 1: {:.2f}'.format(test_acc_1))

# Target 5 days
rm_2 = RandomForestClassifier(n_estimators=j, min_samples_leaf=k)
rm_2.fit(X_train_2_80, Y_train_2_80)
test_acc_2 = accuracy_score(y_true= Y_test_2, y_pred= rm_2.predict(X_test_2))
print('Test set 2: {:.2f}'.format(test_acc_2))

# Target 30 days
rm_3 = RandomForestClassifier(n_estimators=j, min_samples_leaf=k)
rm_3.fit(X_train_3_80, Y_train_3_80)
test_acc_3 = accuracy_score(y_true= Y_test_3, y_pred= rm_3.predict(X_test_3))
print('Test set 3: {:.2f}'.format(test_acc_3))

# Total acc
print('Total acc: {:.2f}'.format((test_acc_1 + test_acc_2 + test_acc_3) / 3))

Best n estimators: 10
Best min samples leaf: 10
Test set 1: 0.50
Test set 2: 0.51
Test set 3: 0.51
Total acc: 0.51

Having identified the best n_estimators and min_samples_leaf values for the Random Forest algorithm (stored as j and k in the code), we now examine the feature importance of our dataset. This analysis seeks to uncover which features contribute most to the predictive performance of the model.

The Random Forest algorithm provides a feature importance score for each input feature, indicating its contribution to the overall predictive accuracy. By understanding the importance of each feature, we can identify key variables that play a crucial role in predicting stock values over different time horizons (1 day, 5 days, and 30 days).

This investigation into feature importance will enhance our understanding of the underlying factors driving the model’s predictions and help us identify any redundant or less relevant features that may be excluded from future iterations of the model.

# Define the RandomForestClassifiers for each target
rf_models = [rm_1, rm_2, rm_3]
X_train_sets = [X_train_1_80, X_train_2_80, X_train_3_80]
# Create subplots
fig, axes = plt.subplots(1, 3, figsize=(15, 6))

# Iterate through targets
for target, rf_model, X_train_set, ax in zip(range(1, 4), rf_models, X_train_sets, axes):
    importances = rf_model.feature_importances_
    indices = np.argsort(importances)[::-1]

    # Calculate the cumulative importance, from the most to the least important feature
    cumulative_importance = np.cumsum(importances[indices])

    # Plotting
    ax.bar(range(X_train_set.shape[1]), importances[indices], align="center")
    ax.set_xticks(range(X_train_set.shape[1]))
    ax.set_xticklabels(X_train_set.columns[indices], rotation=90)
    ax.set_title(f"Feature Importance Target {target}")
    ax.set_xlabel("Feature Name")
    ax.set_ylabel("Importance")

plt.tight_layout()
plt.show()

../_images/e1c2587781a1006b44669081f18fb99de38cc380598963ec93285a7f30e51daa.png

After inspecting the feature importances, we decide to exclude the least important features: for each target, we keep the top-ranked features that together account for 90% of the cumulative importance and drop the rest. This is done to reduce the noise in the dataset and to improve the performance of the model.

We then train the model again with the new dataset and we compare the results with the previous ones.

# Function to select features based on cumulative importance
def select_features(importances, threshold=0.9):
    sorted_indices = importances.argsort()[::-1]
    cumulative_importance = 0
    selected_features = []

    for index in sorted_indices:
        cumulative_importance += importances[index]
        selected_features.append(index)
        if cumulative_importance >= threshold:
            break

    return selected_features

# Function to print confusion matrix
def plot_confusion_matrix(model, X_test, Y_test, target_number):
    # Predictions
    y_pred = model.predict(X_test)

    # Confusion matrix
    cm = confusion_matrix(Y_test, y_pred)

    # Plot confusion matrix
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Class 0', 'Class 1'], yticklabels=['Class 0', 'Class 1'])
    plt.title(f'Confusion Matrix - Target {target_number}')
    plt.xlabel('Predicted Label')
    plt.ylabel('True Label')
    plt.show()

feature_importances_1 = rm_1.feature_importances_
feature_importances_2 = rm_2.feature_importances_
feature_importances_3 = rm_3.feature_importances_

# Select features for each model
selected_features_1 = select_features(feature_importances_1)
selected_features_2 = select_features(feature_importances_2)
selected_features_3 = select_features(feature_importances_3)

# Create new datasets with selected features
new_x_1 = X.iloc[:, selected_features_1]
new_x_2 = X.iloc[:, selected_features_2]
new_x_3 = X.iloc[:, selected_features_3]

new_x_datasets = [(new_x_1, Y_1), (new_x_2, Y_2), (new_x_3, Y_3)]

# Test with the new datasets
test_accuracies = []
for i, (new_x, Y) in enumerate(new_x_datasets):
    # Split the data
    X_train_new, X_test_new, Y_train_new, Y_test_new = train_test_split(new_x, Y, test_size=0.2, shuffle=False)

    # Create a new RandomForestClassifier with the best parameters
    rf_new = RandomForestClassifier( n_estimators=j, min_samples_leaf=k)
    rf_new.fit(X_train_new, Y_train_new)

    # Test set accuracy
    test_acc_new = accuracy_score(y_true=Y_test_new, y_pred=rf_new.predict(X_test_new))
    print(f'Test set {i + 1} with new features: {test_acc_new:.2f}\n')

    test_accuracies.append(test_acc_new)
    plot_confusion_matrix(rf_new, X_test_new, Y_test_new, i + 1)

# Total acc with new features
total_acc_new = np.mean(test_accuracies)
print(f'Total acc with new features: {total_acc_new:.2f}')

Test set 1 with new features: 0.51

../_images/8319d838ea70d89542b7bf019c83de35e68d2839b3899ec09aa0a9bef09314e0.png

Test set 2 with new features: 0.51

../_images/2c603aa459ac24021254087822021eee9642b81582ceee0be5c63341155eec4d.png

Test set 3 with new features: 0.51

../_images/4783217c458f8ece04517a006c9f39b45956104e90fbe8d717bb0722922d5510.png

Total acc with new features: 0.51

Despite removing the less informative features from our dataset, the overall results remained relatively stable. The accuracy scores on the test sets for each target (1 day, 5 days, and 30 days) and the total accuracy did not exhibit significant changes.

This outcome suggests that the excluded features might not have played a substantial role in influencing the predictive performance of our models. While our analysis indicates that removing these features did not lead to a noticeable improvement, it underscores the importance of thorough feature engineering and continuous refinement to achieve optimal model performance.
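
Since the platform's advice is either "buy" or "avoid", the confusion matrices can be summarized further by precision (how often a "buy" call was right) and recall (how many rising stocks were flagged). The sketch below derives both from a 2x2 matrix in the layout produced by scikit-learn's `confusion_matrix` for labels 0 and 1. The counts are hypothetical and are not read from the figures above; `precision_recall` is an illustrative helper.

```python
def precision_recall(cm):
    """Compute precision and recall for class 1 from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts for illustration only
cm = [[40, 10],
      [20, 30]]
precision, recall = precision_recall(cm)
print(f'precision = {precision:.2f}, recall = {recall:.2f}')
```

For investment guidance, precision is arguably the more relevant of the two, because a wrong "buy" recommendation costs money while a missed opportunity does not.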

Conclusions#

After testing several supervised learning models, we find that none of them performed as expected. All of them reached test accuracy close to random chance (average test accuracy between 0.49 and 0.51), and there was no significant performance difference among the tested models. Logistic regression showed less overfitting than the other models, which suggests somewhat more stable predictions. In light of these outcomes, we turn to reinforcement learning as the next step in our study. The objective is to assess whether reinforcement learning can outperform supervised learning on this task and uncover patterns that the supervised models did not capture.
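
Whether an accuracy of 0.49 or 0.51 differs meaningfully from chance can be checked with a rough binomial approximation. The sketch below assumes a test set of 5,144 rows (20% of the 25,720 rows counted in the class-balance check) and treats each prediction as an independent coin flip, which time-series data only approximately satisfy. The helper `z_score` is hypothetical.

```python
import math

n_test = 5144   # assumed test-set size: 20% of 25,720 rows
chance = 0.5


def z_score(accuracy, n, p0=chance):
    """Number of standard errors between an observed accuracy and p0."""
    standard_error = math.sqrt(p0 * (1 - p0) / n)
    return (accuracy - p0) / standard_error


# Average test accuracies reported in the sections above
for name, acc in [('KNN', 0.49), ('Logistic Regression', 0.51), ('Random Forest', 0.51)]:
    print(f'{name}: z = {z_score(acc, n_test):+.2f}')
```

Under these assumptions, a difference of one percentage point corresponds to about 1.4 standard errors, below the conventional threshold of 2, which is consistent with the conclusion that the models are not distinguishable from chance.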