In this lecture we will build a neural net and train it to predict student performance at university. We will:
Build a simple, sequential neural net
Load, preprocess and use training data
Train a neural net
Save the trained model for future use
Use the neural net
from a Python program
by publishing it with FastAPI
We will use the PyCharm editor for this project.
Download PyCharm from https://www.jetbrains.com/pycharm/ and install it.
Create a PyCharm project named StudentPerformance. Leave the default settings and click Create.
On a Windows computer a folder with the name of the project, "StudentPerformance", will be created under C:\Users\<username>\PycharmProjects\StudentPerformance.
Within the project create 3 Python files named:
student_performance.py
In this file we will create the model, load the training data, train the model and save it for future use.
use_student_performance.py
In this file we will write a sample Python program that loads the saved model and uses it.
student_performance_api.py
In this file we will publish the model through FastAPI, so that it can be accessed over the network (IP:Port).
Explore the created folder
C:\Users\<username>\PycharmProjects\StudentPerformance
You should see this content.
Folder .venv stores the Python virtual environment, including the Python interpreter and the libraries used in the project.
Folder .idea is used by PyCharm to store project settings.
In this example we will use historical data of students enrolled in the university. For each student we recorded City, High School Grade, Gender and Performance.
We will use Gender, HighSchoolGrade and City as the input data of the neural net, and the student Performance as its output data.
Based on this historical data, we will try to predict a student's performance at the university from their gender, high school grade and city.
Download the training data by clicking Training Data Download.
After downloading the CSV file, create a folder named data within your PyCharm project and put the training data in this folder.
The content of the project folder should look something like:
In this project we will use TensorFlow, scikit-learn and pandas.
TensorFlow (incl. Keras)
Is a deep learning framework. It provides the functions to build, train and save neural network models.
It provides the Sequential model and Dense layer classes that we will use in our example.
It performs all the matrix math, gradient calculation and optimization during training.
It lets you compile the model (model.compile()), train the model (model.fit()) and evaluate it (model.evaluate()).
It saves the trained model (model.save() / .keras files) so you can load it later.
scikit-learn
It is a utility toolkit for classical ML preprocessing.
In our example we will use it to pre-process the training data: to encode labels into numbers using LabelEncoder(), and to split the data into training and test sets using train_test_split().
We will also save and load the label encoders using joblib.
pandas
Is a library that handles data loading, cleaning and tabular manipulation. We will use it to:
Load the CSV training data, e.g. read_csv("student_data.csv")
Manipulate columns
numpy
Is the numerical backbone of Python ML. TensorFlow and pandas rely on it.
numpy is installed implicitly by the other packages.
To install the packages, write in the PyCharm terminal:
> pip install tensorflow
> pip install scikit-learn
> pip install pandas
All packages are installed inside your project's virtual environment folder, .venv.
Below is the complete listing of the Python file that creates, trains, tests and saves a sequential neural net model to predict the performance of students at the university based on their gender, the city they come from and their high school grade.
In the sections below, parts of the code are explained along with the theoretical concepts.
Copy and paste the listing into PyCharm to get a working solution, but read the explanation of the code in the following sections to better understand it.
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import Dense
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow import keras
from tensorflow.keras.initializers import Zeros
# 1. Load dataset from CSV (make sure it's in your project folder or data/ subfolder)
df = pd.read_csv("data/student_training_data.csv")
#print(df.head())
# 2. Encode categorical variables
le_gender = LabelEncoder()
le_city = LabelEncoder()
le_perf = LabelEncoder()
#print (le_gender,le_city,le_perf)
df['Gender'] = le_gender.fit_transform(df['Gender'])
df['City'] = le_city.fit_transform(df['City'])
df['Performance'] = le_perf.fit_transform(df['Performance'])
#print(df.head())
X=df[["Gender","HighSchoolGrade","City"]].values
Y=df["Performance"].values
#print(X[0])
#print(Y[0])
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
#print(X_train.shape)
#print(X_test.shape)
#print(Y_train.shape)
#print(Y_test.shape)
#print(X_train[0])
#print(Y_train[0])
model = tf.keras.models.Sequential()
model.add(Dense(8, input_shape=(3,), activation="relu", bias_initializer=Zeros(), name="input_hidden"))
model.add(Dense(6, activation="relu", bias_initializer=Zeros(),name="hidden_layer_2"))
model.add(Dense(4, activation="relu", bias_initializer=Zeros(), name="hidden_layer_3"))
model.add(Dense(3, activation="softmax", bias_initializer=Zeros(), name="output_layer"))
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"]
)
print("\n Model Summary: \n")
model.summary()
# Print initial weights (before training)
for layer in model.layers:
    weights = layer.get_weights()
    print(f"\nLayer {layer.name} initial weights:")
    for w in weights:
        print(w.shape)
        print(w)
history = model.fit(X_train, Y_train, epochs=100, batch_size=16)
# Print weights (after training)
for layer in model.layers:
    weights = layer.get_weights()
    print(f"\nLayer {layer.name} weights after training:")
    for w in weights:
        print(w.shape)
        print(w)
loss, acc = model.evaluate(X_test, Y_test,verbose=0)
print("\nTest loss:", loss)
print("Test accuracy:", acc)
#example prediction
sample = pd.DataFrame([{
    "Gender": "Female",
    "HighSchoolGrade": 9,
    "City": "Tirana"
}])
sample["Gender"] = le_gender.transform(sample["Gender"])
sample["City"] = le_city.transform(sample["City"])
prediction = model.predict(sample.values)
pred_class = np.argmax(prediction)
print("Predicted Performance:", le_perf.inverse_transform([pred_class])[0])
# Save the trained model
model.save("student_model.keras")
# Save the encoders separately using joblib
import joblib
joblib.dump(le_gender, "le_gender.pkl")
joblib.dump(le_city, "le_city.pkl")
joblib.dump(le_perf, "le_perf.pkl")
print(" Model and encoders saved in project folder.")
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow import keras
from tensorflow.keras.initializers import Zeros
Training data should be stored in the data/student_training_data.csv file within the project folder.
df = pd.read_csv("data/student_training_data.csv")
print(df.describe())
The read_csv function loads the data from the CSV file into the df variable (a pandas DataFrame).
We can verify the loaded data using
print(df.head())
which would print the first 5 rows in the df variable, similar to the result below:
Gender HighSchoolGrade City Performance
0 Male 9 Shkodra excellent
1 Female 9 Tirana excellent
2 Male 5 Durres bad
3 Male 7 Durres good
4 Male 6 Durres bad
Or, we can use the describe() function to summarize the data. The default behavior of describe() is to summarize only numerical columns, so here it describes only the HighSchoolGrade column, which contains numbers (see the snippet after the output below for how to include the text columns).
HighSchoolGrade
count 500.000000
mean 7.462000
std 1.715354
min 5.000000
25% 6.000000
50% 8.000000
75% 9.000000
max 10.000000
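As an optional check (a small sketch using the same df), pandas also lets you include the text columns in the summary:
print(df.describe(include='all'))
This additionally reports the count, the number of unique values and the most frequent value for Gender, City and Performance.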
In order to be used in the model, the labels need to be encoded with numerical values.
Columns Gender, City and Performance contain text values, for example Gender: {Male, Female}, Performance: {bad, good, excellent}, City: {Tirana, Durres, ...}.
For each of these columns the distinct values have to be identified, and each unique text value has to be assigned a number.
To do that we will use a LabelEncoder() object, part of the scikit-learn library. First we need to create a LabelEncoder object for each column:
le_gender=LabelEncoder()
le_city=LabelEncoder()
le_perf=LabelEncoder()
The LabelEncoder object provides three methods that we can use to perform the encoding task.
fit()
The encoder learns all unique values and assigns a unique number to each value.
The fit() method returns the encoder object itself; it does not return the data.
example
le_city.fit(df['City'])
print(le_city.classes_)
would print the unique city values in our training data:
['Durres' 'Shkodra' 'Tirana' 'Vlora']
or
print("Mapping of labels to encoded values:")
for index, label in enumerate(le_city.classes_):
    print(f"{label} → {index}")
would print the labels and assigned number for each value
Mapping of labels to encoded values:
Durres → 0
Shkodra → 1
Tirana → 2
Vlora → 3
transform()
The LabelEncoder.transform() method encodes the labels to their assigned numbers.
It does not change the data it receives as input; instead it returns the array of assigned numbers.
encoded_cities=le_city.transform(df['City'])
print(encoded_cities)
prints
[1 2 0 0 0 2 1 1 2 1 2 1 3 3 0 3 3 3 2 3 3 0 3 3 1 0 0 0 0 0 0 2 3 0 3 3 2
0 1 2 1 3 3 2 2 1 0 1 2 1 2 1 2 .......]
fit_transform()
LabelEncoder's fit_transform() combines both methods: it identifies the unique values, assigns a number to each label and then returns the transformed data. It does not modify the data it receives as input; instead it returns the encoded data as an array.
for example
encoded_cities=le_city.fit_transform(df['City'])
print(encoded_cities)
would print
[1 2 0 0 0 2 1 1 2 1 2 1 3 3 0 3 3 3 2 3 3 0 3 3 1 0 0 0 0 0 0 2 3 0 3 3 2
0 1 2 1 3 3 2 2 1 0 1 2 1 2 1 2 .......]
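A fourth method worth knowing is inverse_transform(), which maps the numbers back to the original labels; the complete listing uses it to turn the predicted class back into a performance label. A minimal example, assuming le_city has already been fitted as above:
print(le_city.inverse_transform([0, 2]))
would print
['Durres' 'Tirana']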
The complete code to encode our training data:
print('Original Data: \n')
print(df.head())
# 2. Encode categorical variables
# 2.a create label encoder objects for each of the text columns
le_gender = LabelEncoder()
le_city = LabelEncoder()
le_perf = LabelEncoder()
#2.b encode values
df['Gender'] = le_gender.fit_transform(df['Gender'])
df['City'] = le_city.fit_transform(df['City'])
df['Performance'] = le_perf.fit_transform(df['Performance'])
print('Encoded Data: \n')
print(df.head())
The print commands show the original data and the encoded data:
Original Data:
Gender HighSchoolGrade City Performance
0 Male 9 Shkodra excellent
1 Female 9 Tirana excellent
2 Male 5 Durres bad
3 Male 7 Durres good
4 Male 6 Durres bad
Encoded Data:
Gender HighSchoolGrade City Performance
0 1 9 1 1
1 0 9 2 1
2 1 5 0 0
3 1 7 0 2
4 1 6 0 0
When you build a model, your goal is not to make it perfect on the data you already have — it’s to make it perform well on new, unseen data.
That’s why we split our dataset into two (or sometimes three) parts:
Training data → used to teach the model.
Test data → used to check how well the model learned general patterns.
What is training data
Training data is the portion of your dataset that the model uses to learn relationships between input and output
In our case:
Inputs: Gender, HighSchoolGrade, City
Outputs: Performance (bad, good, excellent)
During training, TensorFlow goes through these examples many times (epochs) and gradually adjusts the model's internal weights to minimize the error - a process called backpropagation.
Now the encoded data is stored in the 4-column DataFrame df.
In order to feed the data to the model for training, we need to divide it into input values X [Gender, HighSchoolGrade, City] and output values Y [Performance].
Given a student record as input X [Gender, HighSchoolGrade, City], the performance at the university is predicted as the output value Y [Performance].
Additionally,
the input and output data have to be split into training and test sets, for example taking 80% of the data for training and 20% for testing the model.
Test data is unseen by the model; when an unseen record is input, the model makes its best guess to predict the output value.
In code we use the train_test_split function from the scikit-learn library:
# divide data into input and output
X=df[["Gender","HighSchoolGrade","City"]].values
Y=df["Performance"].values
#split in train and test subsets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
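Since the training file contains 500 records (see the describe() output above), the 80/20 split should leave 400 records for training and 100 for testing. You can confirm this with the commented print statements from the listing:
print(X_train.shape)   # expected (400, 3)
print(X_test.shape)    # expected (100, 3)
print(Y_train.shape)   # expected (400,)
print(Y_test.shape)    # expected (100,)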
For this example we will create a sequential model, meaning that the output of the first layer is passed as input to the second layer of neurons, and so on until the output layer.
Each layer will be dense, meaning that each neuron of a layer is connected to every neuron of the next layer.
In our example we will create a sequential model and add 4 layers to it:
Input layer with 8 neurons
First hidden layer with 6 neurons
Second hidden layer with 4 neurons
Output layer with 3 neurons, one for each output class, Performance: {bad, good, excellent}
Code that creates the above model
model = tf.keras.models.Sequential()
model.add(Dense(8, input_shape=(3,), activation="relu", bias_initializer=Zeros(), name="input_hidden"))
model.add(Dense(6, activation="relu", bias_initializer=Zeros(),name="hidden_layer_2"))
model.add(Dense(4, activation="relu", bias_initializer=Zeros(), name="hidden_layer_3"))
model.add(Dense(3, activation="softmax", bias_initializer=Zeros(), name="output_layer"))
here
model = tf.keras.models.Sequential()
creates an empty sequential model.
Layers are added to the model sequentially using model.add(layer).
In this example we use Dense layers, meaning each layer is fully connected to the next one.
That is, each neuron in a layer receives input from every neuron in the previous layer.
Mathematically, a dense layer performs:
y=f(Wx+b)
where
W = weight matrix (learned multipliers)
x = input vector (your features)
b = bias vector
f = activation function (e.g., ReLU)
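To make the formula concrete, here is a small numpy sketch of what one dense layer computes (illustration only, with random weights, not the weights of our trained model):
import numpy as np

x = np.array([1.0, 9.0, 2.0])      # one encoded input record: Gender, HighSchoolGrade, City
W = np.random.rand(3, 8)           # weight matrix connecting 3 inputs to 8 neurons
b = np.zeros(8)                    # bias vector, initialized to zeros as in our model
y = np.maximum(0, x @ W + b)       # ReLU(Wx + b): the outputs of the 8 neurons
print(y.shape)                     # (8,)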
Dense() method
The Dense() method creates a layer of neurons.
It takes as input parameters:
number of neurons (units), e.g. the first layer has 8 neurons
input_shape=(x,), e.g. input_shape=(3,) means the layer expects inputs of 3 variables (x1, x2, x3), in our example (Gender, HighSchoolGrade, City). We fix only this dimension; the layer can receive a variable number of records of 3 variables.
activation function, e.g. activation="relu" or activation="softmax". The activation function controls the values returned by the neurons of the layer.
relu: Rectified Linear Unit, is the function f(x) = max(0, x), so it returns the computed value if it is positive, or 0 if it is negative. ReLU helps networks learn non-linear patterns while avoiding vanishing gradients.
softmax: one of the most important functions in machine learning, especially for classification problems like our student performance prediction (bad / good / excellent).
We use it in the output layer, which has 3 neurons, one for each class: bad, good, excellent.
kernel_initializer
kernel_initializer defines how the weight matrix of the layer will be initialized.
The weights can be initialized to:
Zeros(),
glorot_uniform (default),
random_normal,
uniform,
he_normal
bias_initializer
It defines the initial values for the bias vector b. The default is usually Zeros().
name, an optional name given to the layer.
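Putting these parameters together, the first layer of our model could equally be written with every parameter spelled out (a sketch equivalent to the line used in the listing; glorot_uniform is simply the default kernel initializer made explicit):
model.add(Dense(
    units=8,                              # number of neurons in the layer
    input_shape=(3,),                     # 3 input features: Gender, HighSchoolGrade, City
    activation="relu",                    # activation function
    kernel_initializer="glorot_uniform",  # default weight initialization
    bias_initializer=Zeros(),             # biases start at 0
    name="input_hidden"                   # optional layer name
))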
Let's explain the softmax activation function further.
Softmax takes a vector of numbers and converts them into probabilities, such that they:
are positive and
all sum to 1
For a vector z = [z1, z2, ..., zn], softmax computes
softmax(zi) = exp(zi) / (exp(z1) + exp(z2) + ... + exp(zn))
which guarantees that the values softmax(zi) sum to 1.
For example, suppose in our case the softmax function receives three input numbers
[2.0, 1.0, 0.1]
It will compute
softmax(2.0)=0.66
softmax(1.0)=0.24
softmax(0.1)=0.10
If we sum the values of the softmax function, the result is 1.
These numbers represent the probabilities that the model predicts for each output class. Remembering that the LabelEncoder assigned bad → 0, excellent → 1 and good → 2, this means:
66% chance → class 0 ("bad")
24% chance → class 1 ("excellent")
10% chance → class 2 ("good")
Using the formula above, the softmax activation takes as input the raw output numbers (called logits) of the neurons in the output layer and returns a vector of the same dimension, where the logits are replaced with probabilities.
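The numbers in the example above can be verified with a few lines of numpy (a small standalone sketch, independent of the model):
import numpy as np

logits = np.array([2.0, 1.0, 0.1])
probs = np.exp(logits) / np.sum(np.exp(logits))   # the softmax formula
print(probs)         # approximately [0.66 0.24 0.10]
print(probs.sum())   # 1.0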
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"]
)
This command defines how our model learns, how errors are measured, and how success is reported.
What does compile() do
compile() configures the model for training.
It tells TensorFlow three things:
How to update weights (the optimizer)
How to measure errors (the loss function)
How to report performance (metrics)
Let's briefly explain each of the items used in our model:
optimizer="adam"
The optimizer controls how the model weights are updated after each training step.
Adam, short for Adaptive Moment Estimation, is one of the most popular optimizers.
It combines the strengths of two classical methods:
Momentum - remembers previous gradients
RMSProp - scales the learning rate based on recent gradient magnitudes
It automatically adjusts the learning rate for each parameter, providing faster, more stable convergence.
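If you want to control Adam's learning rate explicitly instead of using the "adam" string shortcut, you can pass an optimizer object (optional; 0.001 is Adam's default learning rate):
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"]
)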
The loss function tells the model how to measure how wrong it is after each prediction. The optimizer tries to minimize this loss value. Since we are doing multi-class classification (3 output neurons, one correct class), we use cross-entropy, which measures the distance between the true label and the predicted probability distribution.
We can use categorical_crossentropy or sparse_categorical_crossentropy.
The main difference is the expected label format:
categorical_crossentropy expects a one-hot encoded vector for each label, such as [0,1,0] or [1,0,0]
sparse_categorical_crossentropy expects integer-encoded labels, such as 0, 1 or 2
Since we encoded each label as an integer with the LabelEncoder, we will use sparse_categorical_crossentropy.
Metrics are for monitoring the model; they do not influence training.
During training Keras will display both:
loss - the value being optimized
accuracy - % of predictions where the highest probability class equals the true class
Now we have created the model, we have compiled it and we have prepared the training data, so we are ready to train the model.
We train the model using the model.fit() method:
history = model.fit(
    X_train,               # your input features
    Y_train,               # output values
    epochs=50,             # how many times to go through the full dataset
    batch_size=16,         # how many samples per training step
    validation_split=0.2,  # optional: keep 20% for validation
    verbose=1              # show progress bar
)
Let us explain the parameters:
X_train: contains the input data; in our example it is the array containing [Gender, Grade, City].
Y_train: contains the student performance for each of the input records. This value is known in the training data. The loss is calculated from the difference between this true value and the predicted value, and the optimizer tries to minimize it.
batch_size: the number of records that the model processes before updating the weights. In our case it predicts 16 cases (the batch size) before it adjusts the weights. Usually the loss is calculated as the mean of the losses of the individual cases.
epochs: an epoch is one full pass over the whole dataset; epochs is the number of times to go over the whole dataset.
verbose: show the progress bar. 1 = show, 0 = do not show.
validation_split: the fraction of the data used for validation. In our case 20%, validation_split=0.2.
model.fit() returns the history of the training metrics for each epoch.
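The returned history object stores one value per epoch for each reported quantity; with the compile and fit settings above you can inspect it like this:
print(history.history.keys())            # contains 'loss', 'accuracy', 'val_loss', 'val_accuracy'
print(history.history['accuracy'][-1])   # training accuracy of the last epoch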
What is validation data?
It's a holdout set taken from your training data; the model doesn't use it to adjust weights, but it is still used to measure performance after each epoch.
So during the training:
the model learns (adjusts weights) using 80% of the data
after each epoch it tests itself on validation data (20%)
it reports both at the end of each epoch:
loss:...., accuracy:..., val_loss:...., val_accuracy:....
The validation set helps detect:
Overfitting, when training accuracy keeps improving but validation accuracy worsens
Underfitting, when both are low
Why we need validation
Because training accuracy alone can be misleading.
A model can learn to perfectly memorize the training data, but fail on new unseen data.
Validation data simulates unseen data - it tells you whether the model generalizes well.
Understanding the training results is important.
The accuracy and loss values at the end of training represent the accuracy of the model on cases from the training data.
The val_accuracy and val_loss values represent the accuracy of the model on unseen data, the validation set - in other words, the ability of the model to generalize.
Epoch 50/50
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.9062 - loss: 0.3974 - val_accuracy: 0.8750 - val_loss: 0.4139
You can visualize the result of the training using matplotlib.
First you need to install matplotlib in the Python terminal within PyCharm:
>pip install matplotlib
then import it at the beginning of the Python file
import matplotlib.pyplot as plt
Then, after the fit command that stores the training result in the history object,
history = model.fit(
    X_train,               # your input features
    Y_train,               # output values
    epochs=50,             # how many times to go through the full dataset
    batch_size=16,         # how many samples per training step
    validation_split=0.2,  # optional: keep 20% for validation
    verbose=1              # show progress bar
)
use matplotlib to visualize it:
plt.plot(history.history['accuracy'], label='train accuracy')
plt.plot(history.history['val_accuracy'], label='validation accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
In our example we see that the accuracy started at about 40% due to the random initial weights and, after a first dip around the 10th epoch, started to increase, ending the 50 training epochs with more than 90% accuracy on the training data.
The validation accuracy is slightly below 90%.
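In the same way you can plot the loss curves, reusing the same history object:
plt.plot(history.history['loss'], label='train loss')
plt.plot(history.history['val_loss'], label='validation loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()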
Now that we have a trained model, we need to test it with the test data that we set apart at the beginning, in X_test and Y_test.
To test the model we use model.evaluate() method.
loss, acc = model.evaluate(X_test, Y_test,verbose=0)
print("\nTest loss:", loss)
print("Test accuracy:", acc)
model.evaluate() measures the model’s performance on a dataset —
it computes the loss and all the metrics defined in model.compile().
When you call evaluate(), Keras performs a forward pass through our network.
It runs prediction on the entire dataset passed to it (here the test set), but does not perform backpropagation, so no training occurs.
The internal flow within the evaluate function:
Split the test data into batches (default batch size=32)
For each batch
Compute the model's output
Calculate the loss for that batch using the loss function we defined, in our example sparse_categorical_crossentropy
Calculate the metrics, e.g. accuracy
Average the results across batches,
giving one final scalar for loss and one for accuracy
Return them (by default as a list of scalars, e.g. [loss, accuracy]; as a dictionary if return_dict=True)
The result of testing our trained model:
4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.8300 - loss: 0.7141
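If you prefer named results instead of unpacking loss and accuracy by position, evaluate() also accepts return_dict=True (optional):
results = model.evaluate(X_test, Y_test, verbose=0, return_dict=True)
print(results)   # e.g. {'loss': 0.71, 'accuracy': 0.83}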
Saving the model for future use requires saving both the model itself and the label encoders.
There are a number of formats to save a trained model in, but we recommend the .keras format.
The .keras format stores:
model architecture
weights
optimizer configuration and state
training history and metadata
When saving the model, only the model itself is stored; NO label encoders are stored automatically.
In our example we also need to store the label encoders for gender, city and performance.
To save the label encoders we will use joblib.
joblib is a small Python library used for efficiently serializing (saving) and deserializing (loading) Python objects, such as:
scikit-learn encoders (LabelEncoder, OneHotEncoder etc)
preprocessors (StandardScaler)
trained models
any other python object ( dictionaries, numpy arrays)
We first need to install it.
From the Python terminal in PyCharm:
> pip install joblib
then import it
import joblib
The complete code for saving the model and the label encoders:
model.save("student_model.keras")
joblib.dump(le_city, "le_city.pkl")
joblib.dump(le_gender, "le_gender.pkl")
joblib.dump(le_perf, "le_perf.pkl")
The 4 new files will be created in the project folder:
.idea
.venv
data
le_city.pkl
le_gender.pkl
le_perf.pkl
student_model.keras
student_performance.py
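As a preview of what use_student_performance.py will do, the saved files can be loaded back like this (a minimal sketch, assuming the files above are in the project folder):
import joblib
import numpy as np
from tensorflow import keras

model = keras.models.load_model("student_model.keras")
le_gender = joblib.load("le_gender.pkl")
le_city = joblib.load("le_city.pkl")
le_perf = joblib.load("le_perf.pkl")

# encode a new student record and predict the performance class
x = np.array([[le_gender.transform(["Female"])[0], 9, le_city.transform(["Tirana"])[0]]])
pred = model.predict(x)
print(le_perf.inverse_transform([np.argmax(pred)])[0])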