MSE vs MAE in Linear Regression
In the case of MSE (Mean Squared Error)
- The error (predicted value − true value) is assumed to follow a normal distribution with mean 0 (Gaussian noise).
When we train linear regression with MSE, it pushes the total (signed) error towards zero. Due to this,
- The regression line models the (conditional) mean of the target (see the short derivation right after this list).
- The algorithm becomes sensitive to outliers.
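To see why MSE leads to the mean, consider the simplest possible case (an illustrative sketch, not the full linear model): a single constant prediction w for all targets y_1, …, y_N.

L = \frac{1}{N}\sum_{i=1}^{N}(w - y_i)^2, \qquad \frac{\partial L}{\partial w} = \frac{2}{N}\sum_{i=1}^{N}(w - y_i) = 0 \;\Longrightarrow\; w = \frac{1}{N}\sum_{i=1}^{N} y_i

Setting the derivative to zero gives w equal to the mean of the targets, so a single large outlier drags the fit directly with it.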
Let’s see the implementation of Linear Regression with MSE
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
from sklearn.datasets import make_regression

x_dim = 1
y_dim = 1
seed = 42
np.random.seed(seed)
torch.manual_seed(seed)
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Dataset
x_values, y_values = make_regression(n_samples=50, n_features=x_dim, noise=50, random_state=seed)

# Manually adding the outlier point
x_values[0] = 2
y_values[0] = 500

x_train = np.array(x_values, dtype=np.float32)
x_train = x_train.reshape(-1, x_dim)
y_train = np.array(y_values, dtype=np.float32)
y_train = y_train.reshape(-1, y_dim)

plt.plot(x_train, y_train, 'go', alpha=0.5)
plt.show()

Linear Regression Model
class LR(nn.Module):
    def __init__(self, x_dim, y_dim=1):
        super(LR, self).__init__()
        self.linear = nn.Linear(x_dim, y_dim)

    def forward(self, x):
        return self.linear(x)
Model Training
learning_rate = 0.0002
num_epochs = 20000

mse_model = LR(x_dim, y_dim)
mse_model = mse_model.to(device)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(mse_model.parameters(), lr=learning_rate)

total_loss = []
for _ in range(num_epochs):
    inputs = torch.from_numpy(x_train).to(device)
    labels = torch.from_numpy(y_train).to(device)

    optimizer.zero_grad()
    outputs = mse_model(inputs)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()
    total_loss.append(loss.item())

plt.plot(total_loss)
plt.show()

Model Evaluation + Error Term
with torch.no_grad():
    inputs = torch.from_numpy(x_train).to(device)
    predicted = mse_model(inputs).cpu().numpy()

error = (predicted - y_train).reshape(-1)
pos_count = np.sum(error >= 0)
neg_count = np.sum(error < 0)

print(f"Number of positive errors: {pos_count}")
print(f"Number of negative errors: {neg_count}")
print(f"Total error: {np.sum(error)}")
Output

Takeaway: In theory, the total error should be exactly 0. In practice we get −1.86, which is close to 0; the remaining gap simply means the learned weights have not fully converged to the optimum.
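Why should the total error be 0 in theory? For MSE, the gradient of the loss with respect to the bias term is proportional to the sum of the errors, so at the true optimum that sum must vanish. A minimal sketch to check this, reusing mse_model, x_train and y_train from above:

inputs = torch.from_numpy(x_train).to(device)
labels = torch.from_numpy(y_train).to(device)

mse_model.zero_grad()
loss = torch.nn.MSELoss()(mse_model(inputs), labels)
loss.backward()

# For MSELoss (mean reduction), d(loss)/d(bias) = 2/N * sum(predicted - true),
# so a near-zero total error is the same thing as a near-zero bias gradient.
print(mse_model.linear.bias.grad)

A small but non-zero gradient here matches the small but non-zero total error reported above.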
In the case of MAE (Mean Absolute Error)
- The error (predicted value − true value) is no longer assumed to follow a normal distribution.
- At the optimum, the number of data points with positive error equals the number of data points with negative error.
Let’s take a very simple example to understand the second point: fit a single constant prediction w to the targets y_1, …, y_N.

The loss formulation is

L = \frac{1}{N}\sum_{i=1}^{N} |e_i|

where

e_i = w - y_i \quad (\text{predicted value} - \text{true value})

When we take the derivative of L with respect to w, we get

\frac{\partial L}{\partial w} = \frac{1}{N}\sum_{i=1}^{N} \operatorname{sign}(e_i)

To find the best value of w that minimizes this loss function, we have to set this derivative to zero:

\frac{1}{N}\sum_{i=1}^{N} \operatorname{sign}(e_i) = 0

where

\operatorname{sign}(e_i) = +1 \text{ if } e_i > 0 \quad \text{and} \quad -1 \text{ if } e_i < 0

For the sum \sum_{i=1}^{N} \operatorname{sign}(e_i) to be 0, we must have an equal number of positive and negative errors, which is exactly what happens when w sits at the median of the targets.
A full mathematical proof is available here: https://gennadylaptev.medium.com/median-and-mae-3e85f92df2d7
When we train linear regression with MAE, it pushes the model toward an equal number of positive and negative errors. Due to this,
- The regression line models the (conditional) median of the target (see the short numerical check right after this list).
- The algorithm becomes robust to outliers.
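A minimal numerical check of the median claim (an illustrative sketch on a made-up sample, separate from the training code above): grid-search the constant prediction c that minimizes the mean absolute error and compare it with the sample median and mean.

import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # hypothetical sample with one outlier

# Grid-search the constant prediction c over the data range
candidates = np.linspace(y.min(), y.max(), 100001)
mae = np.abs(y[None, :] - candidates[:, None]).mean(axis=1)
mse = ((y[None, :] - candidates[:, None]) ** 2).mean(axis=1)

print(candidates[np.argmin(mae)])  # ~3.0  -> the median, unaffected by the outlier
print(np.median(y))                # 3.0
print(candidates[np.argmin(mse)])  # ~22.0 -> the mean, dragged toward the outlier
print(np.mean(y))                  # 22.0

The MAE minimizer lands on the median while the MSE minimizer lands on the mean, mirroring the derivations above.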
Model Training
learning_rate = 0.015
num_epochs = 20000

mae_model = LR(x_dim, y_dim)
mae_model = mae_model.to(device)  # keep the model on the same device as the inputs
criterion = torch.nn.L1Loss()
optimizer = torch.optim.SGD(mae_model.parameters(), lr=learning_rate)

total_loss = []
for _ in range(num_epochs):
    inputs = torch.from_numpy(x_train).to(device)
    labels = torch.from_numpy(y_train).to(device)

    optimizer.zero_grad()
    outputs = mae_model(inputs)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()
    total_loss.append(loss.item())

plt.plot(total_loss)
plt.show()

Model Evaluation + Error Term
with torch.no_grad():
    inputs = torch.from_numpy(x_train).to(device)
    predicted = mae_model(inputs).cpu().numpy()

error = (predicted - y_train).reshape(-1)
pos_count = np.sum(error >= 0)
neg_count = np.sum(error < 0)

print(f"Number of positive errors: {pos_count}")
print(f"Number of negative errors: {neg_count}")
print(f"Total error: {np.sum(error)}")
Output

Takeaway: We get an equal number of positive and negative errors, but the total error is now far from 0. This is expected: MAE balances the counts of positive and negative errors, not their sum.
Conclusion:
Linear Regression with MSE gives the mean predictor, whereas Linear Regression with MAE gives the median predictor. That is why MSE is sensitive to outliers while MAE is robust to them.
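To see the difference on the dataset used above, here is a short sketch (assuming mse_model, mae_model, x_train and y_train from the earlier code are still in scope) that draws both fitted lines over the same scatter plot; the MSE line should be pulled toward the outlier at (2, 500), while the MAE line should stay closer to the bulk of the points.

with torch.no_grad():
    inputs = torch.from_numpy(x_train).to(device)
    mse_pred = mse_model(inputs).cpu().numpy()
    mae_pred = mae_model(inputs).cpu().numpy()

order = np.argsort(x_train[:, 0])  # sort by x so the fitted lines draw cleanly

plt.plot(x_train, y_train, 'go', alpha=0.5, label='data')
plt.plot(x_train[order], mse_pred[order], 'r-', label='MSE fit (mean)')
plt.plot(x_train[order], mae_pred[order], 'b-', label='MAE fit (median)')
plt.legend()
plt.show()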